More Than Half of GitHub Is Duplicate Code, Researchers Find (theregister.co.uk)

Dupes? by Tablizer · 2017-11-23 12:02 · Score: 3, Funny

You you don't don't say say.

Re:Dupes? by Shikaku · 2017-11-23 12:08 · Score: 1

Can we get one for Slashdot too?
Re: Dupes? by Anonymous Coward · 2017-11-23 12:39 · Score: 0

Shutup. You don't know how hard this job is.

-=Beau=-
Re:Dupes? by Zaiff+Urgulbunger · 2017-11-23 13:00 · Score: 2

You you don't don't say say.
You're forking kidding me!
Re: Dupes? by Anonymous Coward · 2017-11-23 15:50 · Score: 0

That's what SHE said!
Re:Dupes? by Anonymous Coward · 2017-11-23 18:10 · Score: 0

Sure but it will cut the number of posts in half.
Re:Dupes? by zifn4b · 2017-11-24 00:56 · Score: 1

You you don't don't say say.
Too bad there isn't a server de-dupe solution for Slashdot to optimize space usage

--
We'll make great pets
Re:Dupes? by Anonymous Coward · 2017-11-24 05:42 · Score: 0

I'm gonna go get the papers, get the papers.
Re:Dupes? by Anonymous Coward · 2017-11-24 05:56 · Score: 0

How is that funny? Like Rita radner funny. Did he amuse you.
Useless post I know, just showing my anon coward appreciation.
Re:Dupes? by Anonymous Coward · 2017-11-24 07:38 · Score: 0

You you don't don't say say

Deduplication needed. by Anonymous Coward · 2017-11-23 12:02 · Score: 0

It's good to have implemented a report of de-duplication of the files and directories in the cloud.

The Facebook of code by Anonymous Coward · 2017-11-23 12:07 · Score: 0

Other research has discovered that over half of Github comments are the thumbs-up emoji, and of the non-duplicate code 73% of it is horrible JavaScript.

Re: The Facebook of code by Anonymous Coward · 2017-11-23 13:59 · Score: 0

You're thinking of Stackoverload.
Re:The Facebook of code by glenebob · 2017-11-23 19:20 · Score: 3, Funny

horrible JavaScript
I found duplication in your post.
Re:The Facebook of code by Zero__Kelvin · 2017-11-24 01:17 · Score: 1

That's redundancy, not duplication.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Re:The Facebook of code by jellomizer · 2017-11-24 03:59 · Score: 1

So if I made a web application on GitHub, and I attached my downloaded version of jquery to my project, would that count?
These libraries (especially JavaScript) are in source code form, and being so general purpose, they probably take up just as much space as, if not more than, the app that uses them.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:The Facebook of code by ColdWetDog · 2017-11-24 04:40 · Score: 1

Duplicate redundancy is even better.

--
Faster! Faster! Faster would be better!

Git submodules = hard by brian.stinar · 2017-11-23 12:09 · Score: 3, Interesting

Yeah, it can be rough to learn how to use Git submodules...

Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing to do when you library-ify someone's code actually should be.

Re:Git submodules = hard by Anonymous Coward · 2017-11-23 13:07 · Score: 1

Forks. That's the major reason for all the duplicate code. Actually, that's rather how git is supposed to work. If anything, the surprise in the roughly 19% unique-files figure is that the number of unique files is so high. The other surprising part is just how bad we still are at handling code, given that we're still working the way described above. I can't count the number of times I've seen programs and realized I want to make a trivial change and how it's simply not possible without grabbing a bunch of code, making the change, and then recompiling. Worse, plenty of people on github rarely or never make compiled releases, which only further exacerbates the issue of people making forks. This isn't to condemn github per se but to note that as github has become something of a "use the source" archive, it's in a lot of ways worse than the old way of an executable drop. At least an executable drop could be trivially copied and work without any sort of expertise.
Now, I'll change my mind the day we've gotten to the point we've got standard development environments one can trivially run in a VM. But, no, the exact mix of OS, compiler, libraries, and build tools is such a dance that it's always something of a joke to me that so much stuff is GPLv2 and yet fails to, you know, comply with it by providing the means to actually build the code. Shame, too.
PS - Yea, you don't have to comply if they're the author of all the code. But, again, a lot of the code is forks and they DO have to legally comply. Or, well, they would if enforcement was a thing.
Re:Git submodules = hard by Anonymous Coward · 2017-11-23 13:22 · Score: 0

Take a closer look. Most of it is node_modules. At one point the recommended way to use npm was to check in node_modules.
Re:Git submodules = hard by serviscope_minor · 2017-11-23 20:11 · Score: 1

Yeah, it can be rough to learn how to use Git submodules...
Or, maybe they're using subtrees :)
Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing to do when you library-ify someone's code actually should be.
Also, in many cases, I've picked a single C++ header file from various places and dumped them in. Sub-anything is overkill for a single, never changing, non security critical file.

--
SJW n. One who posts facts.
Re:Git submodules = hard by Anonymous Coward · 2017-11-24 00:37 · Score: 0

That's the way it goes with inhouse projects as well. You clone project X trunk into branch Y for project Y. All the local changes made go into branch Y. Then when the testing is complete, all those changes are brought back into X. Sometimes, there might be some cleanup of X.
I've always liked the idea of converting as much source code of an application as possible into libraries, and to have these libraries arranged as a stack, so that they can be slotted in and out.
Re:Git submodules = hard by Anonymous Coward · 2017-11-24 02:35 · Score: 0

It is near impossible to make a reusable library until it has been reused a few times.
Perhaps the researchers should publish a list of the most loved (copied) modules.
Would it be a wide assortment, or a select few.
Re:Git submodules = hard by Anonymous Coward · 2017-11-24 10:09 · Score: 0

Let us know where it says anything about being able to build the code in the GPL
"The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable."
So loophole one is if you never distribute an executable, you're not required to provide the non-standard tools to make an executable nor the build scripts. Loophole two, you're not required to provide the standard tools to make an executable even if those main components are being updated all the time and for all practical purposes aren't actually available. Loophole three, as you note, the word "build" isn't explicitly used, nor does the text explicitly link sentences two and three except with a special exception which doesn't clarify that while compiler A was used to build your program, compiler B is what's actually the major component of the operating system on which the executable runs. And feel free to quibble over whether that makes Visual Studio stuff something that can't be compliant or not.

This comment is hereby released under the GPLv2. You may note that you're unable to build software using it. That does not invalidate the license.
You're the original author, so you're not bound to being compliant with the GPLv2 (which I noted). And effectively as the author, you can ignore non-compliance to your heart's content. So, your comment is moot.

And if you are having a problem with "standard development environments one can trivially run in a VM", no one else is. There's an online service that will generate the development environment directly from your goddamn docker files, and give you any number of said instances inside a web browser, or wherever else you want.
The point is that going to a given github repository, I don't even know what the "standard development environment" is precisely because there aren't a mere handful. So, it doesn't make any sense to speak in terms of "[my] goddamn docker files". But, whatever.

Get out of the business, your skills have ossified, and your intellect with them. Actually, without any insults, there does come a time when you're not willing or able to keep up with this shit, and if there is anything else that you would prefer to do rather than updating your skill set, you may want to take that leap. Code's not really better or worse these days, but it's different all the time. You, however, seem a bit antiquated.
Thanks for trying to not be insulting. I'll just jump in and mention you're a complete moron. Good luck compiling some 3DS program off github with "your goddamn docker files" to fix a trivial bug or just compile the fucking thing since the release repo is horribly out of date. Or that year old, third degree port of openssh which may or may not rely upon WSL (since MS seems to be supporting at least three versions of SSH now). Or "insert yet another random program that has a bug" that you would like to fix but don't have the desire to spend hours trying to figure out whatever their fucking development environment actually is and then figure out if you can even spit out something on your own distro that works without running the resulting thing in its own vm.
Get with the times. Use LISP. :)

Re: no surprise by hackwrench · 2017-11-23 12:11 · Score: 2

I don't understand how you can come to that conclusion. Forking under your own account is the most natural way of interacting with the code base.

More Than Half of GitHub Is Duplicate Code, Resear by Baron_Yam · 2017-11-23 12:17 · Score: 4, Funny

Richard Chirgwin, writing for The Register:

Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

jquery by Anonymous Coward · 2017-11-23 12:22 · Score: 0

I can't even imagine the number of times jquery is directly copied into a project

Re:jquery by Aighearach · 2017-11-23 14:10 · Score: 1

If a project has very few dependencies, they might be following best practices to include the exact libraries they are using with the distribution. It isn't a one-size-fits-all situation.

Re:More Than Half of GitHub Is Duplicate Code, Res by viperidaenz · 2017-11-23 12:25 · Score: 0, Redundant

Richard Chirgwin, writing for The Register :

Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

Downplay much by lucm · 2017-11-23 12:26 · Score: 1

70% is a lot more than half. In this case the difference between half and 70% is a casual 129,000,000 duplicated files.

Kudos for not going in mega-clickbait mode, but still, "nearly 3/4 or more than 2/3" would be a better title.
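
For anyone checking the numbers, here is a minimal sketch of the arithmetic, using only the two file counts quoted in the summary (428 million files, 85 million unique). The variable names are made up for illustration. Note that the 129,000,000 gap falls out of the file counts, which work out to roughly 80% duplicates rather than the 70% the summary quotes, so the two figures are presumably measuring slightly different things.

```python
# Figures quoted in the summary: 428 million files, 85 million unique.
TOTAL_FILES = 428_000_000
UNIQUE_FILES = 85_000_000

duplicates = TOTAL_FILES - UNIQUE_FILES        # files that are copies of another file
duplicate_share = duplicates / TOTAL_FILES     # fraction of all files that are copies
gap_over_half = duplicates - TOTAL_FILES // 2  # how far past "half" the duplicates go

print(f"duplicate files:        {duplicates:,}")
print(f"duplicate share:        {duplicate_share:.1%}")
print(f"duplicates beyond half: {gap_over_half:,}")
```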

--
lucm, indeed.

Re: Downplay much by Anonymous Coward · 2017-11-23 12:38 · Score: 0

A better title would be my penis in your butt. How you doin'!?!?
Re: Downplay much by Anonymous Coward · 2017-11-23 12:53 · Score: 0

Seconded! Rooting for you two.
Re:Downplay much by zifn4b · 2017-11-24 00:58 · Score: 1

70% is a lot more than half. In this case the difference between half and 70% is a casual 129,000,000 duplicated files.
Kudos for not going in mega-clickbait mode, but still, "nearly 3/4 or more than 2/3" would be a better title.
The files aren't duplicated with modern clustered file storage technology. They're only logically duplicated. That's why I don't see why this topic is of interest.

--
We'll make great pets
Re:Downplay much by Anonymous Coward · 2017-11-24 05:03 · Score: 0

sampling
Re: Downplay much by Anonymous Coward · 2017-11-24 06:31 · Score: 0

What? Are you a robot? Duplication of data on the site is the point.

GitHub has Become a Resume Builder by Anonymous Coward · 2017-11-23 12:28 · Score: 0

For the most part. There are many more crappy "me too" projects that were obviously knocked off from other well known projects for the sole purpose of cheating to make resumes look better. Really, this should surprise nobody given the level of cheating already going on in CS degree programs. If coding is what's needed to get a job in the future, you will always have people willing to fake it until they make it, at least until they finally get a real project and fail.

Re:GitHub has Become a Resume Builder by lucm · 2017-11-23 17:25 · Score: 1

There are many more crappy "me too"
^ trigger warning

--
lucm, indeed.

How could more than half be duplicate? by tie_guy_matt · 2017-11-23 12:43 · Score: 2

If half of the code is duplicate does that mean it is just a duplicate of the other half? If so then how would you know what the duplicate is and what the original is? Unless you count the duplicate code in with the original code in which case only one quarter of the code is a duplicate of the other quarter. Or maybe in my post thanksgiving carb haze I am over thinking this?

Re: How could more than half be duplicate? by joelsherrill · 2017-11-23 13:06 · Score: 3, Informative

Even then, the original code may not be on GitHub. Projects like GCC, RTEMS and FreeBSD have the original code somewhere other than GitHub. So all of the code there for these and other projects is not original.
Re:How could more than half be duplicate? by Zero__Kelvin · 2017-11-24 01:23 · Score: 1

" If so then how would you know what the duplicate is and what the original is?"
Somebody should invent time and some way of recording it when a file is checked in, along with who is doing the commit! Either you have never used git or spent no time thinking before you posted.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun

why dont that make one common pool by FudRucker · 2017-11-23 12:45 · Score: 2

put all the code in there and link it to the associated github accounts, providing the code is 100% identical it should work, but they must consider forks and even one line of code in one file will make a lot of difference in the compiled software

--
Politics is Treachery, Religion is Brainwashing

Re:why dont that make one common pool by KiloByte · 2017-11-23 12:50 · Score: 2

This could be a lot easier if you had content-addressable storage that refers to objects by their SHA1 hash.
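
As a rough illustration of what "refers to objects by their SHA1 hash" means in practice, here is a small sketch of how git derives the ID of a file's contents (a "blob"): it hashes a short header plus the bytes. The function name `git_blob_id` is made up for this example; the header format is the one git uses for blobs.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes of the file.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The ID depends only on the bytes, not on the file name, owner, or timestamp,
# so two projects that check in the same file end up with the same object ID.
copy_in_project_a = git_blob_id(b"hello world\n")
copy_in_project_b = git_blob_id(b"hello world\n")
edited_copy = git_blob_id(b"hello world!\n")

print(copy_in_project_a == copy_in_project_b)
print(copy_in_project_a == edited_copy)
```

A store keyed by these IDs only needs to keep one copy of each distinct file, which is the "common pool" idea from the parent post.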

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:why dont that make one common pool by Anonymous Coward · 2017-11-23 12:59 · Score: 0

You mean like the files in the .git subdirectory?
Re:why dont that make one common pool by angel'o'sphere · 2017-11-23 13:25 · Score: 1

You mean like a git repository?

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Re: why dont that make one common pool by Anonymous Coward · 2017-11-23 17:50 · Score: 0

Hashes are not unique identifiers of files larger than the hash size. There will be collisions.
Re: why dont that make one common pool by zaphirplane · 2017-11-23 20:06 · Score: 1

Or the woo sh cvs
Re: why dont that make one common pool by Zero__Kelvin · 2017-11-24 02:37 · Score: 1

Files also have metadata. You calculate the hash and compare the metadata including size and creation date. If they are identical so are the files for all practical purposes.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Re: why dont that make one common pool by KiloByte · 2017-11-24 12:41 · Score: 1

Why would you care about creation date? Git doesn't preserve it, and even in tools that do, timestamps will usually be mangled when someone copies the code around (even cp doesn't default to -p). Thus, those 70% of copied files from the article would have almost all dates different even when byte-to-byte identical.
A non-broken hash is enough. SHA1 is somewhat broken: known collisions exist, but those are not very interesting, and git has a modified version of SHA1 that detects the attack and uses an incompatible hash on such files.
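
A minimal sketch of file-level duplicate detection along these lines: group files purely by a hash of their bytes and ignore timestamps entirely. SHA-256 is used here as one example of a "non-broken hash"; that choice, the helper names, and the toy directory layout are all assumptions for illustration, not what the study or git actually does.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return groups of paths under root whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]

# Toy demo: three "projects", two of which vendor the same library file.
with tempfile.TemporaryDirectory() as root:
    for project, text in [("alice", "lib v1\n"), ("bob", "lib v1\n"), ("carol", "lib v2\n")]:
        os.makedirs(os.path.join(root, project))
        with open(os.path.join(root, project, "xyz.js"), "w") as f:
            f.write(text)
    # Mangle one timestamp, as copying code around usually does.
    os.utime(os.path.join(root, "bob", "xyz.js"), (0, 0))
    groups = find_duplicates(root)
    dup_names = [[os.path.relpath(p, root) for p in g] for g in groups]

print(dup_names)
```

The copies in `alice` and `bob` land in the same group even though their modification times differ, which is the point about dates being irrelevant.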

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re: why dont that make one common pool by Zero__Kelvin · 2017-11-24 14:23 · Score: 1

I agree, and certainly trust git to do it right. I was just attempting to add some extra assurance for the GP that, admittedly, is unnecessary, and as you say, upon further reflection it was a bad idea.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun

libraries by Anonymous Coward · 2017-11-23 12:54 · Score: 0

I need library XYZ. I include the source in my commit. Alice needs library XYZ. She includes it in her commit. Bob needs library XYZ. He includes it... and so on.

Even if our hello world apps are different, the libraries to build them are identical.

Sure, lots of people will rely on separate installation of dependencies, but a lot of people won't.

70% doesn't surprise me.

Re: libraries by zaphirplane · 2017-11-23 20:05 · Score: 1

That is not how it should be done in Java (Maven or Gradle) or Python (pip); in C/C++ you usually just expect the dependency to be handled by a human. I only know of Golang where people check in the vendored deps.
I assume you mean JS, though I'm not sure why, with npm existing.
Re: libraries by pjt33 · 2017-11-24 01:21 · Score: 1

in C/C++ you usually just expect the dependency to be handled by a human
Which is the biggest reason that I hate compiling other people's C. In the absence of a standard way to document dependencies, too many projects simply don't.

Excluding forks? by Zaiff+Urgulbunger · 2017-11-23 13:05 · Score: 2

Do they mean (obv. I didn't read TFA) code is duplicated in non-forked code, or are they just observing that lots of projects will be forked by other users in order that they can play with it and post their pull requests to them?

'cos if it's the latter, then that's kind of obvious isn't it?

Re:Excluding forks? by Aighearach · 2017-11-23 14:19 · Score: 2

They're saying, if you do research on software using github for your data, you have to take file duplication into account in your formulas.
The problem, IMO, is that a lot of the rest is duplicated from somewhere else, but only one time on github, so the data is still polluted by duplication.
Re:Excluding forks? by Anonymous Coward · 2017-11-23 21:13 · Score: 0

> They're saying, if you do research on software using github for your data, you have to take file duplication into account in your formulas.
Except that the critical point is that they'd be wrong if most duplication is due to forks.
Then it would be enough to take forks into account, which is MUCH less costly and github even provides you that information in an easily accessible way.
Re:Excluding forks? by Anonymous Coward · 2017-11-23 21:23 · Score: 0

Replying to myself:
This post is actually more useful, quoting a table on that topic: https://blog.acolyer.org/2017/11/20/dejavu-a-map-of-code-duplicates-on-github/
About 50% of projects on github are forks.
They don't download/analyze those.
So the duplicates are from projects that are not forks of some other github project. Though I guess they had no way of excluding forks of a non-github project.
However I find it a bit worrying that the median number of commits per project is 4-6. That means more than half the things they analyzed are at best code dumps, not actual projects that could be considered alive in any way.
Re:Excluding forks? by multi+io · 2017-11-24 03:10 · Score: 1

Do they mean (obv. I didn't read TFA) code is duplicated in non-forked code
Yes they do mean that. The summary should've mentioned this. From https://dl.acm.org/citation.cf...:

(abstract) [...] This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. [...]
Re:Excluding forks? by Anonymous Coward · 2017-11-24 04:41 · Score: 0

I did read the article.
A good chunk of dup code is JS code. Specifically node_modules. If you make an angular project and then do not set your .gitignore correctly and check in node_modules, you can check in a good ~150 MB of external code for a simple hello world angular project. They also found a good chunk of the duplicate files they had were 0-2 bytes in length. Usually a carriage return/line feed depending on originating platform. Once they threw that out of the numbers it was more like 10-20% duplication. Usually older versions of the libraries before they were on github.
Re:Excluding forks? by Aighearach · 2017-11-24 13:56 · Score: 1

No, you're just not understanding my words. I already addressed your point. So maybe try saying it again but leave out the "except that" that purports to talk about something I didn't address.
I'm saying, the results would still be problematic. You're saying, "Good enough for me!" except phrasing it as if it is external to you.

Avoiding dependency hell by Anonymous Coward · 2017-11-23 13:26 · Score: 2, Interesting

I wonder how much is just people trying to avoid dependency hell?

Because let's face it, when I just want "that one bit" of some gargantuan framework / solve-all / codeball-from-hell then I'd rather spend five minutes of disentangling and integrating than a lifetime playing in "follow the library".

Developers find Shared libraries are useful? Amazi by Anonymous Coward · 2017-11-23 13:27 · Score: 0

Wow.

Pull requests by manu0601 · 2017-11-23 13:36 · Score: 4, Informative

No surprise here, this is how this stupid thing works: in order to submit a one-line bugfix, one has to fork the repository, patch, commit, pull request.

Re:Pull requests by Tony+Isaac · 2017-11-23 17:40 · Score: 1

It's true that git stores snapshots.
However, if you make a one-line change, it's not going to store new copies of every file in the repository. It only stores a new and old copy of the one file that changed.
https://git-scm.com/book/en/v2...
So yes, there is some duplication, but not the entire repository for each change.
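
A toy model of the snapshot point, as a sketch only: each commit maps paths to content IDs, and a file's bytes are stored once no matter how many snapshots reference them. Real git also stores tree and commit objects and later packs objects together, none of which is modelled here; `commit`, `objects`, and the sample files are hypothetical names for the example.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Content ID in git's blob format: SHA-1 over a header plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

objects = {}  # blob id -> content, shared by every snapshot

def commit(tree):
    """Record a snapshot (path -> bytes); return (snapshot, number of new blobs)."""
    snapshot = {}
    new_blobs = 0
    for path, content in tree.items():
        oid = blob_id(content)
        if oid not in objects:  # unchanged files are already in the store
            objects[oid] = content
            new_blobs += 1
        snapshot[path] = oid
    return snapshot, new_blobs

v1 = {
    "README": b"demo\n",
    "main.c": b"int main(void) { return 1; }\n",
    "util.c": b"/* helpers */\n",
}
v2 = {**v1, "main.c": b"int main(void) { return 0; }\n"}  # the one-line fix

snap1, added1 = commit(v1)
snap2, added2 = commit(v2)
print(added1, added2, len(objects))
```

The second commit still describes all three files, but only the changed `main.c` adds a new object to the store.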
Re:Pull requests by Tenebrousedge · 2017-11-23 18:19 · Score: 0

No shit, Sherlock. If you thought anyone here needs to be told that then I hope that you were drunk instead of assuming that everyone else here is at your intellectual level.

--
Those who advocate genocide deserve every protection afforded by law, and none afforded by common human decency.
Re:Pull requests by Anonymous Coward · 2017-11-23 19:28 · Score: 0

Since you are so smart, please answer GGP's post too, not only GP's.
Re:Pull requests by serviscope_minor · 2017-11-23 20:17 · Score: 1

No surprise here, this is how this stupid thing works: in order to submit a one-line bugfix, one has to fork the repository, patch, commit, pull request.
You don't have to fork it on github unless you want to use github's internal mechanisms. You can submit patches using any of the other mechanisms too, like a PR to an external repo, or git send-email, and so on and so forth.
It is however rather convenient.

--
SJW n. One who posts facts.
Re:Pull requests by Anonymous Coward · 2017-11-23 22:07 · Score: 0

Everyone here knows that. The user interface doesn't reflect the internals of the git system though, so if you're a researcher scraping 'how many projects on github' or 'lines of code on github' and fail to take forks into account then you're going to be way off.
Re:Pull requests by manu0601 · 2017-11-24 00:02 · Score: 2

You don't have to fork it on github unless you want to use github's internal mechanisms. You can submit patches using any of the other mechanisms too, like a PR to an external repo, or git send-email, and so on and so forth.
I must be unlucky, but every time I did that, I was told to send a pull request.

Re:More Than Half of GitHub Is Duplicate Code, Res by Aighearach · 2017-11-23 14:08 · Score: 1

That's the hilarious part; duplicating code is also most of the purpose of github!!

Wetness detected in local river!

Or by no-body · 2017-11-23 14:25 · Score: 1

reused/recycled code. One would be stupid to invent/develop everything from the very beginning yet again...
- haven't looked at the study though, no time..

Re: no surprise by MightyYar · 2017-11-23 14:44 · Score: 5, Informative

And the only way to push a change back to a repository you don't control! You fork, push your change to your fork, then create a pull request. This is by design - I have no idea why this is in any way a surprise.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

how much was autotools related by Anonymous Coward · 2017-11-23 16:32 · Score: 0

I wonder if autotools was factored into this number as this gets pretty much copied to everything.

Makes sense by barbariccow · 2017-11-23 16:32 · Score: 2

Makes sense... it's called a fork. Several of my projects are forked more times than they contain files..

Re:Makes sense by CustomSolvers2 · 2017-11-24 02:10 · Score: 1

it's called a fork
Correct me if I am wrong, but isn't the whole point of a fork to perform some, ideally relevant enough, modifications on the original code? Assuming that the conclusion of 70% duplicate code was produced by a sensible and reliable methodology, this would mean that most of the forked files are identical to their original version! So, what is the point of forking or having a repository exactly replicating the contents of another one? You can clone/download all you wish and enjoy it on your own machine, but why have publicly accessible code which has been basically developed by other people? This doesn't seem to be compatible with what open source, GitHub, etc. is expected to be. You should be creating something on your own or modifying/improving/extending what others did, but why have a repository with code which has nothing to do with you? There is another also quite negative issue which might somehow explain all this: inefficiently developed algorithms where the same code chunks are repeated over and over; everyone relies on this kind of quicker, easier approach for temporary, intermediate, preliminary, etc. reasons, but why publicly share bad code? None of this makes sense to me.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
Re:Makes sense by Anonymous Coward · 2017-11-24 03:18 · Score: 0

I personally know the people that wrote the paper and helped review the presentation; forks and branches are not included in this. These are world-class researchers who did their due diligence.

int main(void) by Anonymous Coward · 2017-11-23 18:13 · Score: 0

Geeze, can't you guys stop copying each other?

In other news 428 million Americans poop and pee. by Anonymous Coward · 2017-11-23 18:45 · Score: 0

Great minds think alike so perhaps helping speed up the journey might help. Coffee seems to work for me in both situations. :)

Quite descriptive of software development nowadays by CustomSolvers2 · 2017-11-23 20:32 · Score: 2

People don't care about analysing code properly, learning from it or even adequately adapting it to whatever other situation. In fact, I think that a big proportion of programming-related people aren't even able to analyse/understand random pieces of slightly complex code. There is an (ignorant) tendency towards ridiculously-specific specialisations and a systematic promotion of copy-pasting, absolute-truth-repetition and arbitrary, group-based assessments; and this is precisely why you see so many problems in software everywhere: many people with lots to say in the industry not doing it properly, not knowing how to do it properly and not even able to recognise who does (not) do it properly. Personally, and after having been releasing my biggest open-source code so far during the last months, I will be notably reducing my activity on this front. It is very discouraging to see how such a lost system misuses and misinterprets my work.

DISCLAIMER: I am the sole author of all my public code (including associated resources like comments, documentation, etc.), in the sense that I have developed it completely from scratch. Additionally, note that I release all of it as public domain and that's why I am not precisely concerned about random people using it or referring me. I am exclusively interested in knowledgeable programmers analysing it to get a good idea about my skills.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Code isn't random by DrYak · 2017-11-23 21:49 · Score: 1

No matter what Perl looks to you (even if it is valid code written by your cat walking across the keyboard), not every random jumble of noise is valid code.

Yes, it is entirely possible that two files of size > sizeof(SHA1) (= 160 bits) will have the same hash.
But on the other hand, it's very likely that neither of them is valid code, but gibberish.

Once you intersect both requirements (must share a hash and must be legit code) suddenly the probability drops a lot (because "must be code" is a very stringent criterion that drastically reduces the search space of possible files to an infinitesimal fraction).
At that point you're in "Shakespeare-typing monkeys" territory. Yes, the probability is non-null. But at that point you're better off playing the lottery until the collapse of civilisation, you'd have better odds of winning.

As a matter of fact, "Shattered", the currently known computed collision of SHA-1, is a pair of random nonsensical blocks of gibberish. It can only be exploited in systems that can embed arbitrary blobs (attachments) and feature a Turing-complete language (PostScript) that can react to the blob - PDF files.
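
To put a rough number on "non-null", here is a back-of-the-envelope sketch in Python. It assumes accidental collisions only (digests behaving like uniform random 160-bit values, nobody constructing collisions on purpose), and the corpus size of one billion files is an arbitrary illustrative figure, not something taken from the study. The function name is made up for this example.

```python
from fractions import Fraction


def collision_probability_bound(num_files: int, hash_bits: int = 160) -> Fraction:
    """Union bound on the chance that any two of num_files inputs share a digest.

    There are n*(n-1)/2 pairs, and under the random-digest assumption each
    pair collides with probability 2**-hash_bits.
    """
    pairs = num_files * (num_files - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# One billion files: an arbitrary, illustrative corpus size.
bound = collision_probability_bound(10**9)
print(f"accidental SHA-1 collision bound: {float(bound):.2e}")
```

The bound comes out far below one in a billion billion, and that is before requiring that both colliding files be valid code.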

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re: no surprise by Anonymous Coward · 2017-11-23 23:08 · Score: 0

It is a surprise because they explicitly excluded forks from their dataset...

Ahhh by Anonymous Coward · 2017-11-23 23:17 · Score: 0

So we have 1000's of round wheels, instead of a few squared and octagon ones.

Business logic can be all over the place, but basic functions and patterns tend to be closely tied to the natural designs of computing and/or logic

The rule by Anonymous Coward · 2017-11-24 01:12 · Score: 0

That's funny. In nearly every legacy system or codebase I've inherited I've found that at least 75% of code is clearly useless at minimum. I call it the 75% rule. It's always possible to rewrite it as something 25% the size while at the same time making it faster, less resource-intensive, more portable, with fewer dependencies, more secure, more flexible, bug free, more defensive, more maintainable, more auditable, more readable, etc.

It gets worse than that though. There's a lot of duplication that isn't easily detectable by automation. For example, unrolling a loop won't show up on a duplicate line measurement. Often people put what would be variables as language constructs such as procedure_a, procedure_b where you should really have procedure(letter).
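
A small Python sketch of both points, under loud assumptions: duplicate_lines is a deliberately naive stand-in for a "duplicate line measurement" (it is not the method used in the paper), and the procedure bodies and OFFSETS table are invented purely for illustration.

```python
from collections import Counter

# An unrolled loop: three near-copies, but no two lines are textually equal.
UNROLLED = """\
total += prices[0] * qty[0]
total += prices[1] * qty[1]
total += prices[2] * qty[2]
"""


def duplicate_lines(source: str) -> int:
    """Count lines that are exact repeats of an earlier non-blank line."""
    counts = Counter(line.strip() for line in source.splitlines() if line.strip())
    return sum(n - 1 for n in counts.values())


# The variant is baked into the function name: one copy of the body per letter.
def procedure_a(x):
    return x + 1


def procedure_b(x):
    return x + 2


# Same behaviour with the variant passed in as data instead.
OFFSETS = {"a": 1, "b": 2}


def procedure(letter, x):
    return x + OFFSETS[letter]


print(duplicate_lines(UNROLLED))  # the naive counter sees no repeated lines
```

An exact-match counter reports nothing for the unrolled block even though it is the same statement three times over, which is the kind of duplication that slips past automation.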

This study has the potential to be flawed though because of the way git works (deduplication). A lot of things can throw it off, there are several ways to look at a problem.

Re:More Than Half of GitHub Is Duplicate Code, Res by Half-pint+HAL · 2017-11-24 02:22 · Score: 1

That's the hilarious part; duplicating code is also most of the purpose of github!!

Wetness detected in local river!

How about reading the point made in TFS?

The researchers did this study because Github is used as a source of data for identifying trends in computing. As they say, this duplication of code skews the results, and anyone wanting to draw serious conclusions from this data needs to account for this.

The important data isn't the headline, it's... well... the data. I'm hoping there will be less (virtual) printing of sensationalist "JavaScript is the best language in the world" headlines due to this prompting people to question the methodology.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'

Keeping the Stats Up by coofercat · 2017-11-24 02:24 · Score: 1

I'm doing my bit to keep the stats up though. There are no 'duplicates' of any of my code ;-)

Been there, done that by dallaylaen · 2017-11-24 02:26 · Score: 1

In an open-source project aimed at in-house usage - I don't want my "customer" to suffer denial of service just because a 3rd party neither of us controls (or the internet provider) went down.

I wonder what the proper procedure could be? Put it under /3rd-party? Add as a build-time dependency? Something else?..

--
WYSIWIG, but what you see might not be what you need

Re:Been there, done that by Aighearach · 2017-11-24 14:03 · Score: 1

I like the Ruby approach with Bundler. You can point to a remote repository, and then have it basically locked to a version and cached into the project. You get a lot of choices in how to manage it, and once it is set up it is easy for the developers working with it.
In C if you put in a directory and integrate it into make then you're doing good! Never ever ever believe it is "good enough" to just document how to build it. If it ships with the project, it has to build with the project.

Re: no surprise by MightyYar · 2017-11-24 03:04 · Score: 1

Where does it say that? I just re-read it and I'm pretty sure you made that up. There is a mention of copy and paste also contributing at the bottom, but that's it.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

Re: no surprise by Anonymous Coward · 2017-11-24 03:14 · Score: 1

I helped review the presentation and personally know the people who put together the paper. I can confirm they excluded forks.

So your PR is accessible, avoid tying to one drive by raymorris · 2017-11-24 03:15 · Score: 1

> You can clone/download all what you wish and enjoy it on your own machine, but why having publicly accessible codes which have been basically developed by other people

There are a couple major reasons to make your version of the project accessible on the internet. Maybe the most important is so that other people can see your pull requests. As an example, I used to do a lot of work on some software called Moodle, which is used by many schools. Moodle has a mature development process, so any changes to Moodle code are reviewed, commented on, and approved by at least two people other than the author. Typically three to five people comment on a pull request. It would be pretty hard for my peers to make suggestions about my proposed changes, or approve them, if they couldn't see them. Making the changes I propose available allows us all to work together - very much the spirit of open source.

Additionally, "enjoy on your own machine" brings up the question which of my machines? Primary desktop at work, where I type most of the code? The development server where I test it? My laptop I use when I work at home? Having the source available on the internet is useful for the same reasons it's useful to be able to access your Gmail from anywhere, not just from one "local computer".

At my current job, our *company* has forks that our *team* works on before submitting a pull request upstream. Which local computer would you save our copy on that our whole team could see it and work with it?

Also, a few dozen schools use changes and additional modules I wrote which never made it into the official distribution. They aren't currently cross-platform enough for the project to include them because the main project runs on MySQL, MariaDB, MS SQL, Postgres, Oracle, and some others. It's still useful for my stuff to be accessible for those who want to use it. They'll just have to use either MySQL/MariaDB or MS SQL, or make their own adjustments to my code if they use Oracle.

To me the main reason is the first reason, though - it allows other people to see and comment on my change, review it, before the change is integrated into the official package everyone uses.

MOOC effect by Anonymous Coward · 2017-11-24 03:20 · Score: 0

All the online (or most anyway) MOOC courses use github. A lot of assignments involve forking a repo and modifying one or two files.

Re:So your PR is accessible, avoid tying to one dr by CustomSolvers2 · 2017-11-24 03:33 · Score: 1

Maybe the most important is so that other people can see your pull requests.

But this makes lots of sense. This is precisely the whole point of forking: actively and publicly contributing to others' code. It doesn't matter if the PR is accepted or not, you have already modified the original version. What doesn't make sense is forking something which you don't touch at all; perhaps temporarily and under very specific conditions, but that shouldn't be the general rule.

Additionally, "enjoy on your own machine" brings up the question which of my machines?

I meant this in case you weren't interested in modifying the original code (what the forks/PRs are for), but just in using it.

At my current job, our *company* has forks that our *team* works on before submitting a pull request upstream. Which local computer would you save our copy on that our whole team could see it and work with it?

Exactly the same as the previous scenarios: the moment you perform whatever modification and save it in GitHub, the corresponding file stops being identical to the original one. It doesn't matter if you do a PR or not, that scenario shouldn't count as a duplicate anyway.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Re: no surprise by Anonymous Coward · 2017-11-24 03:33 · Score: 0

While the article about the paper may neglect to mention this fact, the paper itself (linked from the article) is quite explicit: "We skipped forked projects as forks contain a large amount of code from the original projects, retaining those would skew our findings."

Re:Quite descriptive of software development nowad by CustomSolvers2 · 2017-11-24 03:59 · Score: 1

With "I will be notably reducing my activity on this front", I logically meant my public-source activity. I already have lots of public code which can help anyone interested in (and capable of) understanding my programming skills and working attitude. I will most likely continue having a quite relevant programming-related online activity, but will not be wasting my time over-commenting and making code everyone-friendly only to be ignored or cluelessly misassessed by those who only know how to count stars/lines of code and to run ready-to-be-used programs. In any case, I do look forward to the current ridiculously-bad-for-everyone situation gradually changing and to modifying my behaviour accordingly. Under equivalent conditions, I will always prefer to share/show/give/contribute/help rather than keep everything to myself.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Moodle is 5,000-10,000 files. Kernel is 24,000 by raymorris · 2017-11-24 04:03 · Score: 1

> corresponding file stops being identical

Yep, the two or three or four files I change are no longer identical. The other 4,997 files in the project haven't changed, they are identical in both versions (forks). GitHub presents my version of the *project*. It doesn't only show the differences and force users to download from someone else's fork, then apply my changes. They can just download my version of the project. (GitHub can also show the differences, if that's what someone wants to see.)

That does NOT mean GitHub physically stores all those different copies on disk. It just presents my version of the project, including files that are the same as someone else's version.
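
A minimal sketch of why identical files need not be stored twice, in Python. The blob ID formula (SHA-1 over a "blob <size>\0" header plus the content) is how Git itself names file contents; everything else here is an assumption for illustration: the ObjectStore class is hypothetical, the five-file project is invented, and nothing in it describes how GitHub actually lays out fork storage on its servers.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object ID Git assigns to a file's contents (SHA-1 object format)."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Hypothetical content-addressed store shared by several project versions."""

    def __init__(self):
        self.blobs = {}  # blob id -> content, each distinct content kept once

    def add_tree(self, files: dict) -> dict:
        """Store a version of a project and return its path -> blob id listing."""
        tree = {}
        for path, data in files.items():
            oid = git_blob_id(data)
            self.blobs.setdefault(oid, data)
            tree[path] = oid
        return tree


# Invented toy project: five files upstream, one of them changed in the fork.
upstream = {f"src/file{i}.php": f"contents of file {i}".encode() for i in range(5)}
fork = dict(upstream)
fork["src/file0.php"] = b"my patched contents"

store = ObjectStore()
upstream_tree = store.add_tree(upstream)
fork_tree = store.add_tree(fork)

print(len(upstream_tree) + len(fork_tree), "files presented")
print(len(store.blobs), "blobs stored")
```

Each version still presents a complete file listing, while the four unchanged files map to the same blob IDs in both listings and are stored once.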

Re:Quite descriptive of software development nowad by CustomSolvers2 · 2017-11-24 04:21 · Score: 1

And as long as I am clarifying issues which should be evident to virtually anyone, also note that "my public-source activity" represents a net loss for me (potentially beneficial in the long term, logically). It is mere self-promotion where I don't earn a penny; in fact, I lose a lot via time/effort investment and having to worry about addressing the most clueless concerns of random idiots (the sensible, knowledgeable people truly interested in properly understanding, learning, contributing, etc., on technical aspects or, eventually, in properly using my activity to determine my suitability for whatever project seem to be a minority!). The idea of me only earning money via being hired as a (remotely-working) programmer to work on whatever development (although being quite picky with clients/projects) seems particularly difficult to understand for some people, despite my multiple repetitions in many places (+ the evidence that a remotely-working programmer is usually mostly interested in working as a programmer). This is another aspect which puzzles/bothers/tires me a lot about a big part of the online programming (or whatever you want to call it) community: how exactly are these people working/getting money if I have to systematically explain such evident issues to them?! Anyway, I guess that this is more than enough on the evident over-clarification front for today.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Re: no surprise by Anonymous Coward · 2017-11-24 04:28 · Score: 0

I just re-read it

You did NOT read the paper, which states in many places that forks were excluded. Here is one such passage: "We skipped forked projects as forks contain a large amount of code from the original projects, retaining those would skew our findings."

Re:Moodle is 5,000-10,000 files. Kernel is 24,000 by CustomSolvers2 · 2017-11-24 04:33 · Score: 1

Yep, the two or three or four files I change are no longer identical. The other 4,997 files

This is explained in other comments in the thread: GitHub doesn't seem to internally care about the non-modified forked files (it only shows everything to the user) and, in any case, the counting methodology necessarily has to care about this issue, otherwise the proportion of duplicate files would be extremely high (easily over 99%) and not descriptive of the real usage of the platform. For example, I have a couple of forks of the public .NET repositories, each of them might contain millions of files and I only modified 2 or 3 (reasonably relevant modifications though); if you count all these files in my dupe counter, I would have over 99.99% duplicates just because of this, which doesn't make sense.
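
The arithmetic behind those percentages, as a short Python sketch. The file counts are round placeholders: 5,000 comes from the Moodle figure in the parent comment, and one million is just an assumed stand-in for "millions of files", not a measured size of any repository.

```python
def duplicate_share(total_files: int, modified_files: int) -> float:
    """Fraction of a fork's files that are byte-identical to upstream."""
    return (total_files - modified_files) / total_files


# Moodle-sized fork from the parent comment: 3 files changed out of 5,000.
print(f"{duplicate_share(5_000, 3):.2%}")      # 99.94%

# Assumed placeholder for a very large repository: 3 changed out of 1,000,000.
print(f"{duplicate_share(1_000_000, 3):.4%}")  # 99.9997%
```

So counting unmodified fork files would push any active forker's duplicate rate towards 100%, whatever they actually contributed.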

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Identical != Duplicate by engineerErrant · 2017-11-24 05:24 · Score: 1

That's like calling identical twins "duplicate twins" and saying we should drop half of them in any study of population genetics.

If two code files are the same, that's not just noise - a person made that happen for some purpose. It makes no difference whether you find that "bad" or "sloppy" - it's a legitimate part of the in-use population.

Now, that doesn't mean some studies shouldn't still drop them - for example, if I'm studying the *writing* of code, I might want a sample of unique stretches of code that were directly written, not just copied or forked. It just means we shouldn't presume we're improving the work of "other researchers" by casting all these files as useless filler (and I'm guessing most people who are smart enough to research code have already thought of this and are either accounting for it in some way).

Most used programming languages by hvidstue · 2017-11-24 06:23 · Score: 1

Also this could affect the surveys of what programming languages are most used.
At worst the current surveys only show in which language programmers copy-paste the most code.
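
A toy Python sketch of that effect. The corpus below is entirely invented for illustration (it is not data from the paper or from any survey), and deduplicating on a SHA-1 of the exact file content is just one simple assumption about how a survey could filter copies.

```python
import hashlib
from collections import Counter

# Invented toy corpus of (language, file content) pairs.
CORPUS = [
    ("JavaScript", "module.exports = leftPad;"),
    ("JavaScript", "module.exports = leftPad;"),
    ("JavaScript", "module.exports = leftPad;"),
    ("JavaScript", "console.log('hi');"),
    ("Python", "print('hi')"),
    ("Python", "import sys"),
    ("C", "int main(void) { return 0; }"),
]


def language_counts(corpus, dedupe: bool) -> Counter:
    """Count files per language, optionally skipping exact content repeats."""
    seen = set()
    counts = Counter()
    for language, content in corpus:
        digest = hashlib.sha1(content.encode()).hexdigest()
        if dedupe and digest in seen:
            continue
        seen.add(digest)
        counts[language] += 1
    return counts


raw = language_counts(CORPUS, dedupe=False)
unique = language_counts(CORPUS, dedupe=True)
print("raw:   ", dict(raw))
print("unique:", dict(unique))
```

In this made-up corpus JavaScript leads the raw count only because one file was copied three times; once exact copies are dropped it ties with Python.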

also duped from stackoverflow by Anonymous Coward · 2017-11-24 07:47 · Score: 0

I wonder how much of that duplicate code is copied from stackoverflow?

95 characters by Anonymous Coward · 2017-11-24 08:40 · Score: 0

My own investigation shows that the majority of github code is made of only 95 distinct characters!
All the rest are dupes!!!

not surprising by bobmajdakjr · 2017-11-24 11:46 · Score: 1

a lot of bots and stupid people use the fork button to bookmark or make themselves look legit and most forks go nowhere. so yeah.

Re:More Than Half of GitHub Is Duplicate Code, Res by Aighearach · 2017-11-24 14:04 · Score: 1

Thanks for pointing that out, I had no idea that the word "wet" fails to describe the local river with the maximum known precision! Golly.

Re: no surprise by MightyYar · 2017-11-26 06:55 · Score: 1

Thanks for pointing that out - I followed the link to the abstract and then downloaded the paper, and you are correct. The Register article is misleading... lesson learned.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

More surprising 30% are different by Anonymous Coward · 2017-11-26 22:32 · Score: 0

By design, to make a one-line change you fork a project, change that line, commit (for me usually to a new branch) and send a pull request.

Now what surprises me is that 30% of files actually differ. Or is that just that a lot of forks didn't update to the latest version so what actually differs is the main project since it has moved on?

Re:More Than Half of GitHub Is Duplicate Code, Res by Half-pint+HAL · 2017-11-27 01:00 · Score: 1

Óbh-óbh.

Look, this is more like pointing out that you're measuring the total length of the world's rivers wrong when you measure both the Rio Negro and the Amazon from source to sea, because for a fair portion of that length, the Rio Negro is the Amazon. If hydrological researchers were making such a fundamental error, someone would have to point it out.

But code researchers were making a completely analogous error, and it needed quantified. And now it is.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'

Re:More Than Half of GitHub Is Duplicate Code, Res by Aighearach · 2017-11-27 16:27 · Score: 1

It is kind of like that, except in your example there is one mistake that goes away when you apply the fix, and in the story, it is still really fuzzy and the remaining code might even still be mostly copied.

So it is like if you didn't have maps of the rivers, and didn't know which ones overlap, and so the data is complete crap, and then you find a fragmented map and now you know where some parts of a few of the rivers are. It is progress towards a good goal, but the data is still crap so far.

Re:So your PR is accessible, avoid tying to one dr by barbariccow · 2017-12-01 06:55 · Score: 1

If you're asking WHY do folks fork and NOT modify, it's to "lock" a version, and to be able to build in an automated way. Granted, git supports this via checking out a specific commit, but for some reason a LOT of folks find it better to fork it, and then clone off that fork. The only advantage I can think of is it protects you from the original deleting the project altogether.

So imagine you're developing commercial software that uses LibraryA. You write it to how LibraryA looked when you pulled it and developed it. You want to automate your build so that you pull the dependencies from a URL. So you fork it, and then you just always pull from that fork. If at any point in time you're allocated resources to do updates, then you can merge into that branch.

So that's why someone would fork and not change it. Another part to which maybe you're referring is why "so much code would remain unchanged." Well, a pull request could be a single line, or it could touch 20 files out of a project of 5000. Most code is likely going to remain a copy in any fork scenario.

Re:So your PR is accessible, avoid tying to one dr by CustomSolvers2 · 2017-12-01 07:48 · Score: 1

You want to automate your build so that you pull the dependencies from a URL

Curious! Inefficient and uncontrollable, but the kind of thing which a big number of people might do. I would never have thought about doing something like that myself; so, very helpful information, thanks! In fact, it kind of explains a weird issue which I had been seeing for some months while streaming from the site of a major TV network in my country (I think that it isn't there anymore). There were always problems/delays while connecting for the first time and, after that, regular pauses and reductions in quality. I started noticing that when that happened the given application was connecting to GitHub (you know, in the lower part of the browser where you can see regular connections to ad providers and similar)! And I found it extremely crappy! Why not have your own (ideally local) copy! Or, at least, connect to a site precisely meant to perform these actions, which isn't the case with GitHub! I even visited that repository and it was a library meant to simplify the implementation of streaming services, but it was quite a small file!

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Re:So your PR is accessible, avoid tying to one dr by barbariccow · 2017-12-04 06:54 · Score: 1

Lots of projects use "generic build tools" and unfortunately, this may be the easiest and safest way to integrate AND get the project that's under-budgeted by months out the door.

Re:So your PR is accessible, avoid tying to one dr by CustomSolvers2 · 2017-12-04 07:17 · Score: 1

I personally don't have too much experience on this specific front, but I guess that, under very specific conditions, there is no harm in building from source code in GitHub. On the other hand, doing something like what I described in my previous comment and letting an in-production streaming application systematically communicate with GitHub to get a small file seems gross incompetence; not to mention the fact that streaming is closely related to the core business of that company. It is incompetence of the person who takes ready-to-be-used code without properly understanding/debugging/adapting it; also of those who originally developed that code, for not having set up a better/more efficient alternative (and/or clear warnings/instructions); it is even incompetence of the managers mishiring/mismotivating/mispaying and pushing beyond what is logical to meet ridiculous milestones; even the tolerance of the viewers, accepting problems and errors as normal, might be partially blamed. All this seems wrong for many reasons, easily improvable and very difficult to justify. At least, for me, for my expectations and the kind of work I do/conditions I accept.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Slashdot Mirror

115 comments