Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
The following "interesting statistics" come to mind:
You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax, so the statistics for "Natalie Portman" may include references to "portman."
the time from the frontpage article on /. to the death of your server?
How many lines consist of:
}
Find similarities with stuff like SCO.
How many lines contain expletives?
blog and junk
With all that code indexed, maybe we'll finally be able to figure out what the heck SCO's talking about.
But then again, probably not...
Online Starcraft RPG? At
Dietary fiber is like asynchronous IO-- Non-blocking!
What, you've created this wonderful piece of software and _now_ want to figure out what to do with it?
Am I missing something here?
No Comment.
. . . well program, sloccount. Of course, do some research and tweak the parameters to get a reasonably accurate result though.
Your hair look like poop, Bob! - Wanker.
1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names
word.
... of "foo" to "bar."
Orange whip? Orange whip? Three orange whips.
How about the % of them that would work on a lady in a bar? line 53256 "Hey pretty lady, are you an astronaut because your ass looks out of this world" ....oh....not those kinds of lines....*sigh* and I thought I was so close
-or so you'd think
"I'm currently looking for suggestions..."
How about a new server?
how many lines contain the word 'fcuk'
d'oh!
So, this is not a flame, but I'm curious about your choice of dbs.
I've used mysql for some small projects, but generally it does handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.
How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?
Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).
Stay tuned for our reaching 280 million lines, followed by 285, 290, 295, and 300. Expect a new Slashdot post soon, as we need to advertise!
Microsoft Sucks, F/OSS Rocks. I get mod points now right?
Most frequently searched items, number of searches per min. (or after /. per sec.)
I love random hex numbers! Just like this one, 09f911029d74e35bd84156c5635688c0.
Find out how many profane words are there in the source code comments.
Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.. Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
It's the quality of the search results that counts.
I was very impressed with Amazon, who for each book say which phrases and words were particularly unique to that book. (reminds me of that google game where you try to get any two words with only 1 hit).
So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.
I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).
word. Oh some adhesion stats would rock!
please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Write 'I must read at least the post before I comment' 275 million times and when you are finished you can use slashdot again.
blog and junk
Obviously what we geeks really want to know is how many "F" & "S" profanity words there are, amongst other useful and descriptive comments.
/. is good for you.
Number of non-comment, non-blank lines
Number/percentage of each C/C++ control structure (if, switch, variable assignment, etc.)
Average size of functions in lines and min/max.
Pete/Petri "damn, my chainsaw is clogged with 1's and 0's again." --clyde
Just hit refresh and the webserver won't get the HTTP_REFERRER (granted you'll have to manually delete the text file he serves you)
-everphilski-
1) randomly select 2000 lines of code
2) compile
3) execute
4) ???????
5) PROFIT!
As a programmer myself, though not that serious, the greatest number of lines of code I have written is 13,671 in a VB application processing costs for chemical analysis in a lab. This makes me wonder...How can one write over 200 million lines of code? How does one debug the beast? Believe me, even 1 million lines of code is a lot of code. How long does this thing take to compile? There are so many questions that just leave me to respect these programmers.
So that's what...3 or 4 programs worth?
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
I'd like to know whether the word "woman" appears anywhere, and if so, in what projects.
Eh.
"Piter, too, is dead."
All the code was just /.'ed into oblivion. Time to start from the beginning all over again. :(
No sig for you!!
Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).
Keep up the good work!
...that is, a static analysis of a bunch of Java SourceForge projects. It does unused code and duplicate code detection... sometimes it finds some interesting things.
PMD home page is here, book site is here.
The Army reading list
"I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
The obvious statistic would be: how many of these are copyrighted by CSO?
I'm curious, when people are looking for code, what do they do as a first resort? Maybe this should be a poll. Me, I'm a bit funny...
1) look in my library (books)
2) do a deja search
3) ask smarter people than me
4) do a web search (usually on specific sites)
/\/\icro/\/\uncher
I can only hope that this database has good metadata on which code fragments contain/don't contain various common species of exploits (buffer overflow, stack overflow, mal-formed input vulnerabilities, etc.). It would be nice to know which code fragments have all the needed input/size checking needed to be safe for exposure to the outside world and which are "for internal use only."
Two wrongs don't make a right, but three lefts do.
The most obvious statistic is "how many of these lines were stolen from SCO?"
Be a real patriot: Question authority. Think for yourself. Formulate your own conclusions.
How about the number of lines marked up with "TODO"?
-- yawn. --
Searching 225,816,744 lines of code...
.. how many times the same code appears with different function names (i.e. how plagued by NIH are you)?
.. how many times the same function_name() appears with different code?
.. how much of the code fails to compile?
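A rough sketch of the first question, assuming the parser has already produced (name, body) pairs for every function. The helper names `normalize` and `groupByBody` are made up for this example, and dropping all whitespace is a deliberately crude notion of "the same code"; renamed variables would need a real token-level comparison.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Drop all whitespace so formatting differences do not hide identical code.
std::string normalize(const std::string& body) {
    std::string out;
    for (unsigned char c : body) {
        if (!std::isspace(c)) out.push_back(static_cast<char>(c));
    }
    return out;
}

// Group function names by normalized body. Input pairs are (name, body);
// extracting them from real source is assumed to be done elsewhere.
std::map<std::string, std::vector<std::string>> groupByBody(
    const std::vector<std::pair<std::string, std::string>>& functions) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& fn : functions) {
        groups[normalize(fn.second)].push_back(fn.first);
    }
    return groups;
}
```

Any group with more than one name is a candidate for "same code, different function name". Swapping the roles (group bodies by name) would give the second statistic.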
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
What is the most common controlling variable name in a for loop?
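One rough way to get at this, as a sketch only: a regular expression that grabs the identifier initialised in a for header, plus a tally. The pattern is an assumption and only covers the common `for (i = 0; ...)` and `for (int i = 0; ...)` shapes; it will also match inside comments and strings, so a real answer should come from the site's parser rather than from pattern matching.

```cpp
#include <map>
#include <regex>
#include <string>

// Tally the variable initialised in each for-loop header found in `source`.
// Hypothetical helper: handles `for (i = ...` and `for (int i = ...` only.
std::map<std::string, int> countLoopVariables(const std::string& source) {
    // Optional type words, then the identifier that is followed by a single '='.
    static const std::regex header(R"(\bfor\s*\(\s*(?:\w+\s+)*(\w+)\s*=[^=])");
    std::map<std::string, int> counts;
    for (std::sregex_iterator it(source.begin(), source.end(), header), end;
         it != end; ++it) {
        ++counts[(*it)[1].str()];
    }
    return counts;
}

// Return the most frequent name, or an empty string if no loop was found.
// Ties go to the alphabetically first name because std::map is ordered.
std::string mostCommonLoopVariable(const std::string& source) {
    std::string best;
    int bestCount = 0;
    const std::map<std::string, int> counts = countLoopVariables(source);
    for (const auto& entry : counts) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}
```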
TANSTAAFL
The most important search term would be "functionality", ie. show me functions which do this or that.
Without the ability to find the needle, the big hay-stack you have collected will only give you huge bandwidth bills, and give us very little that cannot be found elsewhere.
It is hosed.
I tried searching. Here's what I got:
XML Parsing Error: junk after document element
Location: http://csourcesearch.net/performSearch.php?type=FunctionTypeReturned&search=(&ignoredRandomNumber=1133805159922.7798
Line Number 2, Column 1:
Warning: mysql_connect() [function.mysql-connect]: Can't connect to MySQL server on '127.0.0.1' (4) in /home/csourcesearch.net/include/php/GraphXML.php on line 309
^
http://www.thebricktestament.com/the_law/when_to_
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a function call with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
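To make the distinction concrete, here is a small sketch. `Table` and `commaResult` are hypothetical names invented for the example. The comma expression is written with parentheses so the code compiles under any standard; the unparenthesised `tab[i, j]` form discussed above was later deprecated in C++20 and repurposed for multi-argument subscripts in C++23.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2-D table that indexes with operator() because operator[]
// took exactly one argument when this was written.
class Table {
public:
    Table(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols, 0) {}

    // Two-argument access: tab(i, j) is an ordinary call with two arguments.
    int& operator()(std::size_t i, std::size_t j) { return cells_[i * cols_ + j]; }

    // One-argument access: returns the first cell of row i.
    int& operator[](std::size_t i) { return cells_[i * cols_]; }

private:
    std::size_t cols_;
    std::vector<int> cells_;
};

// The built-in comma operator: evaluate `i`, discard it, yield `j`.
// The cast to void only silences "unused value" warnings.
std::size_t commaResult(std::size_t i, std::size_t j) {
    return (static_cast<void>(i), j);
}
```

With these definitions `tab[commaResult(1, 2)]` touches row 2, not row 1 column 2, which is exactly the surprise the search above is meant to rule out in existing code.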
Source code search engines have been extremely helpful for me. I prefer www.koders.com, but there are quite a few other decent ones out there. What does this engine have to offer that the others don't? It seems like this one doesn't index code repositories but only indexes files local to the server. Neither does it allow you to click on words in the code and search for them. I also sorely miss bookmark-friendly URLs and free text queries. On the positive side, I note that your search engine is totally free from ads! Very nice! Although I wouldn't mind having to look at a few ads (which I might even click on) because running a search engine is expensive and a good source code search engine is a very useful service. I sincerely hope that we will see some upgrades of the site.
charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)
------- "From bored to fanboy in 3.8 asian girls" ----------
Use statistics to construct what an "average" program looks like, and see what it does. :)
Compare functions looking for library routines that need to be created.
Look for common code structures that are not in libraries to create more libraries.
More libraries.
I was thinking of the immortal words of Socrates, who said: "I drank what?" - Chris Knight (Val Kilmer)- Real Genius
if( something = something ) ...
I wonder if the entire code-base compiles... and if so, what comes out? Windows Vista, or some Linux/BSD merge?
See also Codase.com, another "Source Code Search Engine", which lets you search by method names, class names, variable names, free text, etc..
-Mark
"I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
That one's easy. Just tell us how many bugs are hidden in the code, and give us a code/bug ratio.
Don't know, koders.com supports a lot more languages and also lets you narrow your search to specific licenses. The few extra lines of code just don't seem to do it, especially because such measures highly depend on the chosen method.
I'm surprised that Perl's CPAN archive doesn't have structured searching at smaller granularity than module name or freeform metadata. Maybe once the archives let us find code by content, we'll get version control databases that store each line in a record, each block as references in a separate table, maybe even referential integrity of variables as foreign keys. I'd love my editor to pull code from DB storage, padding whitespace only in the presentation layer per my preferences.
I'd really love to see datamining techniques for factoring, optimizing and profiling code. Not to mention the enforcement efficiencies for source license "due diligence" comparisons beyond grep. It's bizarre that programs are still so united with a hierarchical directory filesystem that scopes are enforced per-file, while class scopes have only lexical (not purely structural or referential) implementation. Relational math is rigorous enough that its direct combination with a compiler ought to produce even more revolutions than it would with an editor.
--
make install -not war
Run the whole shebang through a Markov Chain analyzer, then have it generate some new code. Hell, ought to work as well as anything else put out these days...
Poor means hoping the toothache goes away.
...How many libraries of Congress would all this code occupy?
You could calculate the percentage overlap between the 275 million lines of code and SCO's source code. For additional interest, you could plot that percentage as a function of time. You should see it go up right before every major new SCO filing.
He said 275 million lines of searchable code... not the length of his search program. Maybe you should read the post instead...
You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
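A minimal sketch of such a count, assuming plain text matching is good enough for a first pass. The function list is illustrative rather than a complete audit list, and the matcher does not skip comments or string literals, so the numbers would be upper bounds.

```cpp
#include <map>
#include <regex>
#include <string>

// Count call sites of a few classically risky C library functions.
// The word boundary keeps fgets() and sscanf() from matching, and the
// required "(" keeps strcpy_s() from counting as strcpy().
std::map<std::string, int> countRiskyCalls(const std::string& source) {
    static const std::regex call(R"(\b(gets|strcpy|strcat|sprintf|scanf)\s*\()");
    std::map<std::string, int> counts;
    for (std::sregex_iterator it(source.begin(), source.end(), call), end;
         it != end; ++it) {
        ++counts[(*it)[1].str()];
    }
    return counts;
}
```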
-# of non-numerical constants /,#,; characters in code
-# of ( ),{ },\
-time spent debugging/compiling
-total hours spent in production
-gallons of coffee consumed
-hours of daylight seen
-# of relationships destroyed
He who knows best knows how little he knows. - Thomas Jefferson
I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
As an old Big Iron grognard asked me many years ago...
"What prints your paycheck?"
How much of this open-source code DB is reusable? Are most of the lines things that have limited applications, or are most of them more general? I mean, if you have 275 million lines, but 175 million lines are code designed to solve one specific problem and can't be easily cross-applied, then it isn't as useful as the statement implies.
That said, congrats on the milestone, and looking forward to hearing of more!
For example, "Lines of code" / "Lines of commenting" will always produce "Inf"
I think the developers of eds are mad at courier
Search: function names containing shit
type: void
name: courier_imap_is_a_piece_of_shit
line: 17
file: evolution-data-server-1.2.3/camel/providers/imap4/camel-imap4-summary.c
You forgot: "5) Ask CowboyNeal".
:)
This sig rocks the casbah.
I'd love to see how one of my programs (stats below) compares
to the, uh, national average.
1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
was the server /.ed or does he just need a new one?
As an american High School student, I'd like to officially apologize for my generation.
Why be an a**hole? The guy is wanting to offer a nice product, but can not afford the bandwidth hit of a /.. Now, you try to bypass his request (but you were wrong).
I mean, why not instead, do a coral link or a google cache link like he asks? After all, he is providing useful code.
(subject says it all ;))
News for Geeks in Austin, TX
How about searching for functions that don't check their parameters for overflow?
Now SCO will finally be able to find all the code that was stolen from them!
I've abandoned my search for truth; now I'm just looking for some useful delusions.
I'm dying to know... What percentage of the code is commentary?
And are there any haiku?
What I say does not represent the views of my employers, my friends, my cats, or myself.
Including those with incompatible licenses.
Related: having found similar code sections, follow trends in them over time. Find where two programs copied the same code, but one has failed to implement what might be a bug fix or improvement in another, by looking at changes to the code over time.
I'd like to see a comment search for "Nobody will see this".
Unfortunately, the site's running so slow, I guess nobody will.
tasks(723) drafts(105) languages(484) examples(29106)
How many GOTOs are in there?
I actually came across one recently. It was a real surprise considering the rest of the code was decently written. And it was pretty simple to remove.
... because every time I profile my code, it seems I end up spending a lot of time in my bubble sorts. In all that code, surely someone took the time to write a really fast bubble sort, right?
(sig) The last bug isn't fixed until the last user is dead. (/sig)
select count(*) from sourcecode where comments > 0
0 row(s) returned
plagiarism at its finest
mod -1 lame
Implement a caching mechanism for the more used search results. This cache would be invalidated when you add more code.
Show a few lines in the search results after and before the searched text.
Is the search case-sensitive or not? Maybe just adding an option like this on the search results can be helpful (eg.: include names on Windose platform are insensitive).
Implement statistics for:
- most/less used function name
- most used word in comments
- most used dirty word
Count the lines of "dense" code. I mean do not include empty lines and lines that only have comments or just an opening/closing brace.
... like this guy's site is right now, when YOU submitted the story about your site! If you're not prepared for slashdot traffic, don't submit the story.
This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.
If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.
With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.
Less time wasted on google searching down the examples I need to check.. someone is my new hero.
Find most common functions so they can be moved to the kernel ;))))
Though I don't develop much in C++ currently, and haven't had the time to do anything Linux wise in years, I would love to have an identified location for security-bug free algorithms, etc. that I could use if I need to do more C++ work in the future.
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
The Code copied from here should be GPL ed. Damn the viral nature
This index doesn't even contain Boost (http://www.boost.org/) and Loki libraries!
It can't be called 'comprehensive' after that...
number of lines that contain both "should" and "probably"
only one reference to penis in 275 million lines of code??? whew thank god they didn't index my code.
... of lines claimed by SCO.
Consider, a page is 45 lines, an average book is 350 pages @ about 2" thick (ergo, about 15-16k books), a stack is roughly 12' wide by 6 shelves, double sided (864 books) and a row is about six stacks long (72' / 5,184 books). So, in a compactus, about 432sq/ft, to the 2,100,000sq/ft of the Madison building alone. The total linear capacity is 540 miles. Using the above assumptions, that's about 205 million books, so if printed, this repository would take up roughly 1/13,000th of the space. Imagine if your house is 2500sq/ft, the equivalent displacement would be a five-inch square. Would you even notice?
Gawd, I'm bored.
would be a nice feature to have, both average and per project/module basis.
Arash Partow's Philosophy: Be a person who knows what they don't know, and not a person who doesn't know.
Counting the number of "TODO"s and "XXX"s in "production" open source code could be interesting.
There are two types of people in this world: those that categorize other people and those that don't.
Maybe you could see how many times certain C faux pas exist. Things like the use of gets(), fflush(stdin), void main(), etc.
Thanks for all the comments, there are some great statistic ideas here.
:)
It will take a while to generate all the stats, potentially months.
Once they are finished I'll post the results somewhere.
In the mean time I hope some people find the site useful.
I didn't do it to make money or anything, I just want to help out other coders
Can't the C/C++ folks write a decent indexing package? Or does Java really rule?
I'm gonna search for comments containing combinations of the words "Stolen" "From" and "SCO", and blackmail you all to keep the results quiet.
guess Firefox isn't standard compliant then (there goes my karma from the grandparent post :) )
-everphilski-
I would like to see some character frequency/pattern information so that we might be able to create baselines for predictive input tools while programming. This would be a good step forward for things such as programming by voice, or other alternative input tools such as dasher.
~D
This sig has been enciphered with a one-time pad. It could say almost anything.
just search for:
#if 0
XXX
FIXME
A useful tool could be a list of all open source programs that contain the string: for(i = 0; i *; i++). Then we will know which programs violate SCO patents. Brian.
Wow, that's amazing! You must have upwards of 3 or 4 functions in that thing.
What you want is Markov chains. For any given statement you can easily construct the set of statements that follow it, together with their number. We will observe that given "for" there are a lot of branches to the tree, with "(i =0;" being very common and "// Woo woo" being quite rare.
We can easily generate more code by picking statements at random, but in proportion to their observed frequency.
But we can do better.
Given pairs of statements, we can generate a table where given the pair of statements
for( i=0;
we know the probability of
i !=
Because we're dealing with pairs from a large set, the branching will be quite low, i.e. it will be much more constrained, and thus a lot more like the original code.
This is analogous to being able to say in English, given "Space, the"
we mostly get "final" as the most probable next symbol, and now that we have "the final", the next word may be "frontier", but given that "Space" is not being thought about any more, we might have "cut".
Text generated like this is quite like English, and if you include punctuation, can generate things that are as grammatically correct as many people's posting. It can be quite funny. I have used it to merge Terry Pratchett and Microsoft marketing blurb, and certainly the MS stuff improves...
Same applies to C++.
It will take a little experimentation to find out the best length for the chain, but assuming that the input code is syntactically correct, the output will often compile.
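A minimal sketch of the pair-based chain described above, assuming whitespace-separated tokens stand in for "statements" (a real version would use a proper C++ lexer). The names `Chain`, `tokenize`, `buildChain` and `generate` are invented for the example.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// State = the previous two tokens; value = every token seen after that pair.
// Storing repeats keeps the observed frequencies without separate counters.
using Chain = std::map<std::pair<std::string, std::string>, std::vector<std::string>>;

// Split on whitespace; a stand-in for real lexing.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string t; in >> t;) tokens.push_back(t);
    return tokens;
}

// Record, for each adjacent pair of tokens, the token that followed it.
Chain buildChain(const std::vector<std::string>& tokens) {
    Chain chain;
    for (std::size_t i = 0; i + 2 < tokens.size(); ++i) {
        chain[{tokens[i], tokens[i + 1]}].push_back(tokens[i + 2]);
    }
    return chain;
}

// Walk the chain from a starting pair, choosing each next token at random
// in proportion to how often it followed the current pair.
std::vector<std::string> generate(const Chain& chain, std::string a, std::string b,
                                  std::size_t maxTokens, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::string> out{a, b};
    while (out.size() < maxTokens) {
        const auto it = chain.find({a, b});
        if (it == chain.end()) break;  // dead end: this pair was never seen
        const std::vector<std::string>& next = it->second;
        std::uniform_int_distribution<std::size_t> pick(0, next.size() - 1);
        a = b;
        b = next[pick(rng)];
        out.push_back(b);
    }
    return out;
}
```

Lengthening the state from two tokens to three or four is the "best length for the chain" experiment: longer states copy the input more faithfully, shorter ones produce more novel but less plausible output.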
Quite what it will do is another question, but this sort of chain is rather like how people learn.
If you listen to babies, they start off making random noises, then they make sounds that have very roughly the same frequency of occurrence as their parents' language, then they burble in things that sound like English, but clearly are not. Their "singing" at 18 months is usually quite free of real words, but they've heard "twinkle twinkle little star" so often that the chain of "twinnle twinnle lipple sarr" is carved into their neural net.
Given that the spec here is for "useful" I propose this as an AI test.
Dominic Connor,Quant Headhunter
Might be sweet to hook your source code into a google desktop index and then deliver the google desktop search results through your web interface. Leverage the power of their search speed and reliability. --Kalen
directory of tech articles
Read this.
http://exorsus.net/slashdot.txt
given to slashdot to mirror content as necessary for the purpose of
providing its users access to the information on the site. Slashdot should
not attempt to bypass the referer block. Use of the google cache page for
the site is acceptable as long as the page(s) concerned have no more than 1
image.
This policy is employed for the sole purpose of avoiding a huge bandwidth
bill that I would have to pay out of my own pocket. Anyone who would like
this restriction to go away is more than welcome to send me bucketloads of
cash.
Fuck you buddy. Protect your own interests as you need to, but don't be a cocksmith about it.
That Apache web server is just wonderful!
I recently did a search on some of our codebase here at work to see how many times the above keywords remained in shipping code. I was a little surprised to see how many cases there were in our code. I think sometimes, maybe even most of the time we as programmers over use these words.
Pete
What's a sig? Pete Brubaker
...how many comments contain the letters JSB??
/* RIP jsb */
I can still remember my 1000 level C course where an example of "poor commenting" was presented where the only comment was
never bring a twinkie to a food fight.
jab must be one of the 133t coders who knows what this qualifier does
AFAIK, it means it isn't static, so it should be cleaned off the stack when the scope ends.
Number of gotos per project would probably be amusing.
You can force the conversion with
blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
Hence the need for a straightforward solution.
It would be interesting to see what projects share same/similar chunks of code. This could be used to move similar code into libraries where things are redundant.
What are the conditional compiles? Are they being used or abused?
#if GCC_1_8 or #if DEBUG vs #if 0
#FOREVER for(;;) vs #INC (x) x++
How much #define hell is there in codebases (ever see a VxWorks BSP?)
- hardware specific defines in C versus base classes and inheritance in C++
- #def that can be retouched at run time or used as constant without using const
- how many #ifdefs #undefs with other included code #define again buried.
- symbol replacement
Or how about casting! how much was done implicitly or explicitly? And could it have been avoided.
That "woosh" sound you hear is the wink emoticon zooming over your head, joke in tow.
I know PHP is a great web language and that it probably isn't the cause of the slowdown. Heck, even Yahoo! uses it these days.
I was attempting (unsuccessfully, it seems) to make fun of the purists who insist that robust web applications must run on something compiled in order to reach acceptable performance under high load.
like strcpy, vprintf, strcat, scanf would be interesting. It would be a basic buffer overflow fuzzer.
Microsoft aggravates my tourettes syndrome.
I know it's a legitimate keyword. There's also absolutely no reason to use it. It's the default in the only place where it is even valid.
It'd be like typing "unsigned long int" instead of "unsigned long".
I call BS on this. No reasonable code uses "auto" nearly as many times as typedef.
http://lkml.org/lkml/2005/8/20/95
I, and I'm sure I'm not the only one, tend to use switch statements specifically for the fall through, and a series of if-else-ifs when I don't need fall-through. I'd like to know how many times a switch statement contains fall through, versus how many times each case has a separate break. The ratio of case statements to break statements would be interesting as well.
Perhaps the Slashdotting of this site might be a good reason for the administrator to create an AJAX interface. That way the server would not have to process so many pages and focus on queries.
275 million lines of C .. so what is that roughly translated to?
probably about 27,500,000,000 lines of quick basic?
or roughly 10 lines of Perl.
The number of comments that include the string(s) "magic", "magically", "Then a miracle occurs...", "[You|We] keep using that word; I do not think it means what you think it means." or "This is not the code you are looking for, move along." - ?
http://www.advogato.org/article/85.html
which links to the open-source metrics:
http://orbiten.org/ofss/01.html
which is dead but is still on the archive:
http://orbiten.org/ofss/01.html
The link doesn't work!@!#@!@@!
Here is the first table:
Table 1: Top 10 authors ranked by contribution of code
Author                                         % of total
free software foundation, inc                  11.231
sun microsystems, inc                          1.848
the regents of the university of california    1.359
gordon matzigkeit                              1.216
paul houle                                     1.042
thomas g. lane                                 0.782
the massachusetts institute of technology      0.762
ulrich drepper                                 0.559
lyle johnson                                   0.528
peter miller                                   0.525
P2P Anonymous Distributed Web Search: http://www.yacy.net/
auto is a throwback to B days (the language immediately before C). B had no data types (no int, float, double, etc) but did have storage types: auto, static, and extrn.
auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).
1. foo() { auto bar; ... }
2. foo() { static bar; ... }
3. foo() { extrn bar; ... }
4. foo() { bar; ... }
All mean something different in B: the first three instances of bar are declarations, the fourth is an expression statement (and if I remember my B correctly, it is invalid as the first statement of foo(), because bar hasn't been declared one of auto, static, or extrn yet in this function).
In C, auto is completely redundant. Except, perhaps, in comments.
Ah, B. The days when programmers were programmers and data was data, and you could perform any operation you liked on any variable. Want to divide a pointer to a string by 3? Go ahead. Self-disciplined programmers don't need training wheels. Just a choice between auto, static and extrn.
I am anarch of all I survey.
"people who used functionX also used functionY".
It would bring to light what libraries are often used together.
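A rough sketch of how the "also used" number could be counted, assuming the database can already hand back the list of functions each project calls. The project lists and the names also_used and sample_projects are made up for illustration; nothing here is how the site actually works.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sample data: functions called by three made-up projects. */
static const char *const proj_a[] = { "socket", "connect", "pthread_create", NULL };
static const char *const proj_b[] = { "socket", "connect", NULL };
static const char *const proj_c[] = { "fopen", "pthread_create", NULL };
static const char *const *const sample_projects[] = { proj_a, proj_b, proj_c };

/* Returns 1 if the NULL-terminated list `calls` contains `name`. */
static int uses(const char *const *calls, const char *name)
{
    for (size_t i = 0; calls[i] != NULL; ++i)
        if (strcmp(calls[i], name) == 0)
            return 1;
    return 0;
}

/* Counts the projects whose call list contains both x and y. */
size_t also_used(const char *const *const *projects, size_t nprojects,
                 const char *x, const char *y)
{
    size_t count = 0;
    for (size_t p = 0; p < nprojects; ++p)
        if (uses(projects[p], x) && uses(projects[p], y))
            ++count;
    return count;
}
```

Dividing also_used(projects, n, x, y) by the number of projects that use x alone would give the "people who used x also used y" percentage.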
Stats using Bayesian filters, like the ones in spam filters, would be incredibly cool. Especially if the source code is actually parsed for what it means and not only treated as text.
how many times have the licenses changed?
also, maybe stuff like code overlap between projects?
Find a job you like and you will never work a day in your life.
1) Number of GOTO statements
2) Number of comments that match (nearly) exactly the code they explain, e.g.: string name; //name
3) Phone Numbers
Thanks for all the sanity checks! The very simple lex program I whipped up to extract reserved words was way too simple. First, a variable like "automobile" was causing a false positive for "auto". Second, I had only run on .c files, and not .h files, which explains the lack of typedefs. Finally, while I remembered to strip the comments, I'd forgotten to take care of quotations, thus getting false positives from things like char *foo = "By default run on auto pilot"; Fixing all this gives much saner results.
950 if
626 return
482 static
331 for
272 const
269 void
213 else
132 char
113 case
112 break
89 typedef
82 extern
44 int
43 sizeof
41 enum
39 struct
35 switch
31 default
23 while
11 unsigned
5 float
3 signed
2 short
1 long
1 double
1 do
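For anyone who wants to reproduce counts like these, here is a minimal hand-written scanner in plain C (not the poster's lex program, which isn't shown) that avoids the three pitfalls mentioned above: it only matches whole identifiers, and it skips comments and quoted literals. The name count_keyword is made up, and the sketch assumes ordinary source text without trigraphs or backslash line continuations.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True for characters that may appear inside a C identifier. */
static int is_ident_char(int c)
{
    return isalnum(c) || c == '_';
}

/*
 * Counts whole-word occurrences of `keyword` in the C source text `src`,
 * ignoring comments, string literals and character literals.
 */
size_t count_keyword(const char *src, const char *keyword)
{
    size_t count = 0;
    size_t klen = strlen(keyword);
    const char *p = src;

    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '*') {            /* block comment */
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                ++p;
            if (*p != '\0')
                p += 2;
        } else if (p[0] == '/' && p[1] == '/') {     /* line comment */
            while (*p != '\0' && *p != '\n')
                ++p;
        } else if (*p == '"' || *p == '\'') {        /* string or char literal */
            char quote = *p++;
            while (*p != '\0' && *p != quote) {
                if (*p == '\\' && p[1] != '\0')
                    ++p;                             /* skip escaped character */
                ++p;
            }
            if (*p != '\0')
                ++p;                                 /* closing quote */
        } else if (is_ident_char((unsigned char)*p)) {
            const char *start = p;                   /* scan a whole identifier */
            while (is_ident_char((unsigned char)*p))
                ++p;
            if ((size_t)(p - start) == klen && strncmp(start, keyword, klen) == 0)
                ++count;
        } else {
            ++p;
        }
    }
    return count;
}
```

Run it once per reserved word over every .c and .h file and sum the results to get a table like the one above.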
// we like profanity in open source software, please read the following words carefully:
// fuck, pussy, dick, sperm, motherfucker
//
// hope our source code will now be censored by all governments that suck.
# "if (cond) {" vs. "if (cond)\n{"
/pet peeve
I'm all for coding readability, but placing a function's open bracket on a new line is so fucking irritating and unnecessary.
http://www.codase.com/ is a new search engine that seems to have a better user interface and performance. It also has a smart query search system to deal with complex queries,
quoted from their website:
"For the first time, to find relevant code, developers can simply type into a search box about the same code as they do in their daily development work. The Codase smart query system processes the input and then builds an internal query to feed into the search engine. Through this free style format, complex combinations of multiple search terms can be easily entered. For example, to find any main method that contains variable t and function calls of thread.start() and println, this query can be used: main() { var t; thread.start(); println; }",
http://www.codase.com/search/smart?join=main()+%7B+var+t%3B+thread.start()%3B+println%3B+%7D&scope=join%2Fjoin&lang=*&project=
Great Work!!!
FYI - there is a whole community of researchers that are interested in studying such large software repositories
http://msr.uwaterloo.ca/
(International Workshop on Mining Software Repositories)
Maybe you can write something and submit it over there, or at least advertise your data set to that community.
This would become really useful if it could save people from ever having to write anything that has been programmed before. Say I want to factor a number. So I write a couple of test cases:
8 -> [2,2,2]
9 -> [3,3]
12 -> [2,2,3]
and then I can use your database to find all the functions that will successfully pass my tests. If it got good it could even combine functions until it got something working. This would be the holy grail of IDEs: write unit tests, click search, polish up, done. The tricky part here is how to efficiently search all those functions.
The first step should be to allow searching by argument types, side effects and global access, and return values. Then add the test cases.
Hopefully someday.
-- Devin Bayer
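To make the "search by test case" idea concrete, here is a small sketch in C. It assumes the candidates have already been narrowed down to one signature (the factor_fn typedef, which is invented for this example) and that the expected outputs are prime factorizations in ascending order, so 12 maps to 2, 2, 3. The two candidate functions are hypothetical stand-ins for whatever the database would return.

```c
#include <stddef.h>

#define MAX_FACTORS 32

/* Signature every candidate must have: writes factors of n, returns how many. */
typedef size_t (*factor_fn)(unsigned n, unsigned out[MAX_FACTORS]);

/* One test case: an input and the expected output list. */
struct factor_case {
    unsigned input;
    size_t   count;
    unsigned expected[MAX_FACTORS];
};

static const struct factor_case factor_cases[] = {
    {  8, 3, { 2, 2, 2 } },
    {  9, 2, { 3, 3 } },
    { 12, 3, { 2, 2, 3 } },
};

/* Hypothetical candidate 1: trial division, prime factors in ascending order. */
size_t trial_division(unsigned n, unsigned out[MAX_FACTORS])
{
    size_t k = 0;
    for (unsigned d = 2; n > 1 && k < MAX_FACTORS; ) {
        if (n % d == 0) { out[k++] = d; n /= d; }
        else            { ++d; }
    }
    return k;
}

/* Hypothetical candidate 2: wrong on purpose, just returns n itself. */
size_t identity(unsigned n, unsigned out[MAX_FACTORS])
{
    out[0] = n;
    return 1;
}

/* Returns 1 if `fn` produces the expected list for every test case. */
int passes_all(factor_fn fn)
{
    for (size_t i = 0; i < sizeof factor_cases / sizeof factor_cases[0]; ++i) {
        unsigned got[MAX_FACTORS];
        size_t n = fn(factor_cases[i].input, got);
        if (n != factor_cases[i].count)
            return 0;
        for (size_t j = 0; j < n; ++j)
            if (got[j] != factor_cases[i].expected[j])
                return 0;
    }
    return 1;
}
```

The hard part the comment points at is still open: running arbitrary indexed functions safely, and cutting millions of candidates down to the few whose signature even fits.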
Yeah. Unless he uses PHP as the database engine, it has nothing to do with it.
Probably Lucene's fault.
What 500,000 line library would be the most beneficial to the corpus? Beneficial means the ability to reduce the complexity/size of the resulting corpus. You can think of this as a compression problem.
for example, if:
int a[MAX];
sum = 0;
for (i=0; i<MAX; ++i)
sum += a[i];
occurred in enough of the source, we could reduce this to:
int a[MAX];
sum = r0 (a, MAX);
thus simplifying the code. As a follow-on project, write a cron job to submit patches to all the authors of the code in question to use said library, and upload the library to SourceForge.
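For illustration only, the library routine behind that call could be as small as this. The name r0 is taken from the example above; the signature and the int element type are assumptions.

```c
#include <stddef.h>

/* Hypothetical library routine standing in for r0 in the example:
 * returns the sum of the first n elements of a. */
int r0(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

The interesting statistic is then how many loops across the corpus collapse into a call like this, not the routine itself.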
While I admit that PHP isn't the source of the slowdown, I'd hardly consider it the ideal web language. Too many problems with scoping, function naming, etc., etc.
Or perhaps it's just my tastes. I personally prefer not to have to worry about using a variable which is in another code block.
How many lines did IBM illegally contribute to Linux ;-)
I don't want to be an asshole, but what about showing the license of the displayed code?
IMHO the current display makes it look like these are public domain snippets.
The C and C++ Code Counter, http://cccc.sourceforge.net/, has some interesting statistics that could be generated on a per-project basis. Perhaps you could incorporate the stats from that project.