About 82% of the code on GitHub is cloned from other files


Researchers from the University of California and the Czech Technical University recently published an article exploring GitHub's repositories, code, and files. They found that about 82% of the code published and shared on GitHub consists of clones of previously created files.
They looked at about 4.5 million GitHub repos, which together hold about 482 million files. Only 85 million of those files were unique, which comes out to approximately 17.63% of all the files analyzed.
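The arithmetic behind those percentages is easy to check. Here is a small sketch that uses only the rounded figures quoted above (85 million unique files out of 482 million); the variable and function names are made up for illustration.

```python
# Rounded figures quoted in the post, in millions of files.
total_files_millions = 482
unique_files_millions = 85


def share_percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Share of files that are unique, and the remainder that are clones.
unique_pct = share_percent(unique_files_millions, total_files_millions)
clone_pct = 100.0 - unique_pct

print(f"unique: {unique_pct:.2f}%")  # unique: 17.63%
print(f"clones: {clone_pct:.2f}%")   # clones: 82.37%
```

So the "82%" in the headline is simply what is left once the roughly 17.63% of unique files is taken out.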

JavaScript was one of the big culprits

The research group looked at projects written in C++, Java, JavaScript, and Python. Of these languages, JavaScript was the worst offender: about 94% of its files were exact, 100% identical clones of other files.
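To give a feel for what "100% identical" means in practice, here is a minimal sketch of one common way to spot exact clones: hash each file's bytes and group files that share a hash. This is an illustration only, not the researchers' actual pipeline. The choice of SHA-256, the function names, and the sample file paths are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Hash raw file bytes; byte-identical files get the same digest."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Return groups of paths whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path, data in files.items():
        groups[content_hash(data)].append(path)
    # Keep only hashes that are shared by more than one file.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def unique_share(files: Dict[str, bytes]) -> float:
    """Percentage of files left after collapsing exact duplicates."""
    distinct = {content_hash(data) for data in files.values()}
    return 100.0 * len(distinct) / len(files)


# Hypothetical in-memory "repos": two projects ship the same library file.
sample = {
    "repo_a/lib/util.js": b"function add(a, b) { return a + b; }\n",
    "repo_b/vendor/util.js": b"function add(a, b) { return a + b; }\n",
    "repo_a/index.js": b"console.log('a');\n",
    "repo_b/index.js": b"console.log('b');\n",
}

duplicates = group_exact_duplicates(sample)
print(duplicates)            # [['repo_a/lib/util.js', 'repo_b/vendor/util.js']]
print(unique_share(sample))  # 75.0
```

Note that a hash like this only catches files that match byte for byte. A copy with a changed comment or different whitespace would count as unique here.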

Their results further showed that C++ came in second place, with about 73% of its files cloned, and Python came in at 71%.

NPM to blame

NPM is very popular among developers, with the majority using it as their #1 package manager. It is the tool of choice because it contains over 350,000 libraries. As a result, most JavaScript projects are very large, because they import so much reused code.

Using Git

So most of the duplication does not come from forking projects. Instead, a lot of the code gets copied and pasted, sometimes entire libraries.
Here is a map of the code duplicates on GitHub, called "DejaVu": DejaVu
MySQL dumps can also be found here: mySQL_Dumps
