More Than Half of GitHub Is Duplicate Code, Researchers Find (theregister.co.uk)

Posted by msmash on Thursday November 23, 2017 @11:53AM from the redundant dept.

Richard Chirgwin, writing for The Register: Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

Dupes? by Tablizer · 2017-11-23 12:02 · Score: 3, Funny

You you don't don't say say.

--
Table-ized A.I.
1. Re:Dupes? by Zaiff+Urgulbunger · 2017-11-23 13:00 · Score: 2
  
  You you don't don't say say.
  You're forking kidding me!
Git submodules = hard by brian.stinar · 2017-11-23 12:09 · Score: 3, Interesting

Yeah, it can be rough to learn how to use Git submodules...
Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing to do when you library-ify someone's code actually should be.
Re: no surprise by hackwrench · 2017-11-23 12:11 · Score: 2

I don't understand how you can come to that conclusion. Forking under your own account is the most natural way of interacting with the code base.
More Than Half of GitHub Is Duplicate Code, Resear by Baron_Yam · 2017-11-23 12:17 · Score: 4, Funny

Richard Chirgwin, writing for The Register:

Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.
How could more than half be duplicate? by tie_guy_matt · 2017-11-23 12:43 · Score: 2

If half of the code is duplicate, does that mean it is just a duplicate of the other half? If so, then how would you know which is the duplicate and which is the original? Unless you count the duplicate code in with the original code, in which case only one quarter of the code is a duplicate of the other quarter. Or maybe in my post-Thanksgiving carb haze I am overthinking this?
1. Re: How could more than half be duplicate? by joelsherrill · 2017-11-23 13:06 · Score: 3, Informative
  
  Even then, the original code may not be on GitHub. Projects like GCC, RTEMS and FreeBSD have the original code somewhere other than GitHub. So all of the code there for these and other projects is not original.
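A rough way to see how "more than half" can be duplicate without deciding which copy is the original: count distinct file contents, and treat every copy beyond one per distinct content as redundant. The sketch below is only an illustration under that assumption, not the study's methodology; the function name `duplication_stats` and the sample files are made up, and the last line is plain arithmetic on the two figures quoted in the summary (428 million files, 85 million unique).

```python
import hashlib
from collections import Counter


def duplication_stats(contents):
    """Count total files, distinct contents, and the share of redundant copies.

    `contents` is an iterable of bytes objects, one per file (hypothetical input).
    """
    # Identical bytes give identical digests, so no file has to be
    # labelled "the original": we only count distinct contents.
    counts = Counter(hashlib.sha1(data).hexdigest() for data in contents)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "redundant_share": (total - distinct) / total,
    }


# Three identical files plus one different file: 4 files, 2 distinct contents.
sample_files = [b"int main(){}", b"int main(){}", b"int main(){}", b"print('hi')"]
stats = duplication_stats(sample_files)

# The same arithmetic applied to the figures quoted in the summary.
quoted_share = (428_000_000 - 85_000_000) / 428_000_000
```

With three copies of one file and a single copy of another, half of the files are redundant even though nobody has said which copy came first. Counted this way, the two quoted figures work out to roughly four files in five being copies of something already seen.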
why dont that make one common pool by FudRucker · 2017-11-23 12:45 · Score: 2

put all the code in there and link it to the associated github accounts, providing the code is 100% identical it should work, but they must consider forks and even one line of code in one file will make a lot of difference in the compiled software

--
Politics is Treachery, Religion is Brainwashing
1. Re:why dont that make one common pool by KiloByte · 2017-11-23 12:50 · Score: 2
  
  This could be a lot easier if you had content-addressable storage that refers to objects by their SHA1 hash.
  
  --
  The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
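The content-addressable idea above can be sketched in a few lines: each blob is stored under the SHA-1 of its bytes, so identical files from different repositories collapse into one stored copy while each repository keeps its own path-to-hash mapping. This is a minimal in-memory sketch; the `ContentStore` class and the repository names are hypothetical, and it hashes the raw bytes rather than following Git's actual object format.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: blobs are keyed by the SHA-1 of their bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing the same bytes twice is a no-op: the key already exists.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()

# Two hypothetical repositories; the fork shares one file byte-for-byte
# and differs by a single character in the other.
repos = {
    "alice/project": {"LICENSE": b"MIT License\n", "main.py": b"print('v1')\n"},
    "bob/project-fork": {"LICENSE": b"MIT License\n", "main.py": b"print('v2')\n"},
}

# Each repository only records path -> hash; the bytes live once in the store.
trees = {
    name: {path: store.put(data) for path, data in files.items()}
    for name, files in repos.items()
}
```

Here four files across two repositories end up as three stored blobs: the shared LICENSE is kept once, while the two `main.py` variants stay separate because a one-character change produces a different hash. That also covers the parent's concern about forks: only byte-identical files are pooled.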
Excluding forks? by Zaiff+Urgulbunger · 2017-11-23 13:05 · Score: 2

Do they mean (obv. I didn't read TFA) code is duplicated in non-forked code, or are they just observing that lots of projects will be forked by other users in order that they can play with it and post their pull requests to them?

'cos if it's the latter, then that's kind of obvious isn't it?
1. Re:Excluding forks? by Aighearach · 2017-11-23 14:19 · Score: 2
  
  They're saying, if you do research on software using github for your data, you have to take file duplication into account in your formulas.
  The problem, IMO, is that a lot of the rest is duplicated from somewhere else, but only one time on github, so the data is still polluted by duplication.
Avoiding dependency hell by Anonymous Coward · 2017-11-23 13:26 · Score: 2, Interesting

I wonder how much is just people trying to avoid dependency hell?
Because let's face it, when I just want "that one bit" of some gargantuan framework / solve-all / codeball-from-hell then I'd rather spend five minutes of disentangling and integrating than a lifetime playing in "follow the library".
Pull requests by manu0601 · 2017-11-23 13:36 · Score: 4, Informative

No surprise here, this is how this stupid thing works: in order to submit a one-line bugfix, one has to fork the repository, patch, commit, and open a pull request.
1. Re:Pull requests by manu0601 · 2017-11-24 00:02 · Score: 2
  
  You don't have to fork it on github unless you want to use github's internal mechanisms. You can submit patches using any of the other mechanisms too, like a PR to an external repo, or git send-email, and so on and so forth.
  I must be unlucky, but every time I did that, I was told to send a pull request instead.
Re: no surprise by MightyYar · 2017-11-23 14:44 · Score: 5, Informative

And the only way to push a change back to a repository you don't control! You fork, push your change to your fork, then create a pull request. This is by design - I have no idea why this is in any way a surprise.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Makes sense by barbariccow · 2017-11-23 16:32 · Score: 2

Makes sense... it's called a fork. Several of my projects are forked more times than they contain files..
Re:The Facebook of code by glenebob · 2017-11-23 19:20 · Score: 3, Funny

horrible JavaScript
I found duplication in your post.
Quite descriptive of software development nowadays by CustomSolvers2 · 2017-11-23 20:32 · Score: 2

People don't care about analysing code properly, learning from it or even adequately adapting it to whatever other situation. In fact, I think that a big proportion of programming-related people aren't even able to analyse/understand random pieces of slightly complex code. There is an (ignorant) tendency towards ridiculously-specific specialisations and a systematic promotion of copy-pasting, absolute-truth-repetition and arbitrary, group-based assessments; and this is precisely why you see so many problems in software everywhere: many people with lots to say in the industry not doing it properly, not knowing how to do it properly and not even able to recognise who does (not) do it properly. Personally, after releasing my biggest open-source code so far over the last few months, I will be notably reducing my activity on this front. It is very discouraging to see how such a lost system misuses and misinterprets my work.

DISCLAIMER: I am the sole author of all my public code (including associated resources like comments, documentation, etc.), in the sense that I have developed it completely from scratch. Additionally, note that I release all of it as public domain and that's why I am not precisely concerned about random people using it or referring me. I am exclusively interested in knowledgeable programmers analysing it to get a good idea about my skills.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.