Researchers from the University of California, Irvine and the Czech Technical University recently published a paper examining GitHub repositories and the files they contain. They found that 82% of the code published and shared on GitHub consists of clones of previously created files.
They looked at about 4.5 million GitHub repos, which together hold about 482 million files. Only 85 million of those files were unique, which comes out to approximately 17.63% of all the files analyzed.
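To make the arithmetic explicit, here is a small Python sketch that derives the unique and cloned shares from the two file counts quoted above. The constant and function names are made up for this example, and the counts are the rounded figures from this article, not exact values from the study.

```python
# Rounded figures quoted in the article (not exact counts from the study).
TOTAL_FILES = 482_000_000   # files across the ~4.5 million repos
UNIQUE_FILES = 85_000_000   # files that were not copies of another file


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


unique_share = percent(UNIQUE_FILES, TOTAL_FILES)
clone_share = 100.0 - unique_share

print(f"unique: {unique_share:.2f}%")  # unique: 17.63%
print(f"clones: {clone_share:.2f}%")   # clones: 82.37%
```

The cloned share is simply what is left after subtracting the unique share, which is where the roughly 82% figure comes from.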
JavaScript was one of the big culprits
The research group looked at projects written in C++, Java, JavaScript, and Python. Of these languages, JavaScript was the worst offender, with about 94% of its files being exact copies of other files.
C++ came in second at about 73% duplication, and Python followed at 71%.
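The study's own tooling is not reproduced here. As a rough illustration of what "exactly identical" means at the file level, here is a minimal Python sketch that groups the files under a directory by a SHA-256 hash of their contents. The function names (`file_digest`, `group_exact_duplicates`, `duplicate_share`) are hypothetical, and hashing raw bytes is an assumption about how one might spot exact copies, not a description of the researchers' method.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file's raw bytes in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each content hash to every file under root that has that content."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return dict(groups)


def duplicate_share(groups: dict[str, list[Path]]) -> float:
    """Percentage of files that are extra copies of content already seen."""
    total = sum(len(paths) for paths in groups.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(groups)) / total
```

Calling `duplicate_share(group_exact_duplicates(Path("some/checkout")))` on a local checkout gives the percentage of files that are byte-for-byte copies of another file in the same tree. Files that differ by even a single character count as unique in this sketch.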
npm to blame
npm is the most popular package manager among JavaScript developers, in large part because it offers over 350,000 libraries. That is also why so many JavaScript projects are so large: they import a great deal of reused code.
Using Git
Most of this duplication does not go through forks. Instead, a lot of the code is copied and pasted, sometimes as entire libraries.
Here is a map of the code duplicates on GitHub, called "DejaVu": DejaVu
MySQL dumps can also be found here: mySQL_Dumps