Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 5, 2005 @05:27AM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

34 of 328 comments

Some statistics to get you started by Anonymous Coward · 2005-12-05 05:28 · Score: 5, Funny
I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code.

The following "interesting statistics" come to mind:
- Percentage of functions named "deepThroat" (0%)
- Number of comments mentioning a "girlfriend" (11) or "wife" (29) to "Natalie Portman" (41)
- How many variables named "penis" are of type "long" versus type "short" (unknowable!)
You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax, so the statistics for "Natalie Portman" may include references to "portman."
useful statistic by kunzy · 2005-12-05 05:30 · Score: 5, Funny

the time from the frontpage article on /. to the death of your server?
1. Re:useful statistic by Sembiance · 2005-12-05 05:33 · Score: 5, Funny
  
  Well, it's been about 2 minutes on slashdot... my site is already dead. So uhm... 2 minutes?
2. Re:useful statistic by Baricom · 2005-12-05 06:11 · Score: 4, Funny
  
  So uhm... 2 minutes?
  
  Sounds like you should have written it in C++ instead of a laggard language like PHP ;).
My vote is for... by Anonymous Coward · 2005-12-05 05:31 · Score: 5, Insightful

How many lines consist of:
}
1. Re:My vote is for... by epiphani · 2005-12-05 05:43 · Score: 4, Interesting
  
  Same type of thing, but indenting styles. K&R vs. BSD, etc. I'm curious how that breaks up.
  
  (Partial to BSD style myself..)
  
  --
  .
2. Re:My vote is for... by mebollocks · 2005-12-05 06:33 · Score: 5, Funny
  
  I dunno, maybe you could find the algorithm on the net somewhere? ...if only there was some kinda searchable code database of some sort...
3. Re:My vote is for... by baadger · 2005-12-05 07:57 · Score: 5, Interesting
  There's an idea right there, how about some stats showing popularity of various coding conventions?
  
  Variables: under_score vs. camelCase
  
  Tabs vs. spaces
  
  "if (cond) {" vs. "if (cond)\n{"
  
  How many coders bother enclosing single conditionally executed statements with {}
  
  How many coders bother producing comments directly before or after function definitions, describing function implementation?
  
  Lines of comments to lines of code ratios
  
  Number of functions to lines of code ratios for various projects?
  
  Number of projects making use of global variables?
  
  C, to C++, to C# (if your engine covers it) project ratio
  
  etc
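A few of the layout counts suggested above (tabs vs. spaces, brace placement, lines holding only a closing brace) could be gathered with a simple line scan. The following is a minimal sketch, not how the search engine does it; `StyleCounts` and `count_style` are hypothetical names, and comments and string literals are not skipped, so the numbers are only approximate.

```cpp
#include <sstream>
#include <string>

// Tallies for a few layout conventions (hypothetical names, illustration only).
struct StyleCounts {
    int tab_indented = 0;        // lines whose first character is a tab
    int space_indented = 0;      // lines whose first character is a space
    int brace_same_line = 0;     // lines such as "if (cond) {"
    int brace_own_line = 0;      // lines holding only "{"
    int closing_brace_only = 0;  // lines holding only "}"
};

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Naive line-based scan: comments and string literals are not skipped.
StyleCounts count_style(const std::string& source) {
    StyleCounts c;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '\t') ++c.tab_indented;
        if (!line.empty() && line[0] == ' ') ++c.space_indented;
        const std::string t = trim(line);
        if (t == "{") ++c.brace_own_line;
        else if (t == "}") ++c.closing_brace_only;
        else if (!t.empty() && t.back() == '{') ++c.brace_same_line;
    }
    return c;
}
```

Run per file, the totals could then be summed per project to compare conventions across code bases.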
Similarity checking by roguerez · 2005-12-05 05:31 · Score: 5, Funny

Find similarities with stuff like SCO.
Interesting stats by sparkes · 2005-12-05 05:32 · Score: 4, Interesting

How many lines contain expletives?

--
blog and junk
1. Re:Interesting stats by moosesocks · 2005-12-05 07:05 · Score: 4, Informative
  
  How many lines contain expletives?
  
  for your reading pleasure.... the linux kernel fuck count
  
  --
  -- If you try to fail and succeed, which have you done? - Uli's moose
Statistics: by duckpoopy · 2005-12-05 05:32 · Score: 5, Interesting

1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names

--
word.
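The comment-to-code ratio could be estimated by classifying each line. This is a rough sketch under simplifying assumptions: a line counts as a comment only if it starts with `//` or `/*` or falls inside a block comment, a line mixing code and comment counts as code, and comment markers inside string literals are not handled. `LineCounts` and `count_lines` are hypothetical names.

```cpp
#include <sstream>
#include <string>

struct LineCounts {
    int comment = 0;  // lines containing only comment text
    int code = 0;     // every other non-blank line (mixed lines count as code)
    int blank = 0;    // whitespace-only lines outside block comments
};

// Heuristic classifier; it does not parse C or C++.
LineCounts count_lines(const std::string& source) {
    LineCounts c;
    bool in_block = false;  // true while inside an unterminated /* ... */
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const auto first = line.find_first_not_of(" \t\r");
        if (in_block) {
            ++c.comment;
            if (line.find("*/") != std::string::npos) in_block = false;
        } else if (first == std::string::npos) {
            ++c.blank;
        } else if (line.compare(first, 2, "//") == 0) {
            ++c.comment;
        } else if (line.compare(first, 2, "/*") == 0) {
            ++c.comment;
            if (line.find("*/", first + 2) == std::string::npos) in_block = true;
        } else {
            ++c.code;
        }
    }
    return c;
}
```

The ratio is then `comment / code` for whatever unit (file, project) the counts were collected over.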
ratio by FreeBSDbigot · 2005-12-05 05:33 · Score: 5, Funny

... of "foo" to "bar."

--
Orange whip? Orange whip? Three orange whips.
1. Re:ratio by ahem · 2005-12-05 07:24 · Score: 4, Funny
  
  From google:
  
  Search -- foo -> Results 1 - 10 of about 26,600,000 for foo. (0.06 seconds)
  Search -- bar -> Results 1 - 10 of about 385,000,000 for bar [definition]. (0.16 seconds)
  Search -- foo bar -> Results 1 - 10 of about 7,900,000 for foo bar. (0.12 seconds)
  
  'bar' wins. This intuitively makes sense, as who would want to go to the 'foo' for a drink, or eat an 'energy foo'? Could you imagine a lawyer being 'dis-fooed'?
  
  --
  Not A Sig
Suggestion by lbmouse · 2005-12-05 05:33 · Score: 5, Funny

"I'm currently looking for suggestions..."

How about a new server?
Choice of db? by Anonymous Coward · 2005-12-05 05:35 · Score: 4, Interesting

So, this is not a flame, but I'm curious about your choice of dbs.
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.

How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?

Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).
1. Re:Choice of db? by Sembiance · 2005-12-05 08:18 · Score: 4, Informative
  
  I've used MySQL in the past for some projects at work, where the number of rows were several hundred million and ran with no problems so I knew it was capable of large row numbers.
  
  I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)
  
  So I had to hand off searching to Lucene, which worried me a great deal (being java) but as folks tell me 'Java is not slow'.
  They are right, Java is very fast at handling the searching and I've been very impressed.
  Most searches in the Java database only take one or two seconds.
  The MySQL query/join for additional info take another 4 or 5 seconds.
  
  Most searches take about 8 seconds to come up, even under no load.
  
  I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
Statistics TM (c) by chunews · 2005-12-05 05:38 · Score: 5, Interesting

It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL, GPL2, etc.
Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc. Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
Interesting Statistics by iso-cop · 2005-12-05 05:39 · Score: 5, Interesting

In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
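Of the metrics mentioned here, cyclomatic complexity is commonly approximated as one plus the number of decision points in a function. The sketch below counts branch keywords and the `&&`, `||` and `?` operators in a function body given as text. It is an approximation only: `approx_cyclomatic` is a hypothetical name, comments and string literals are not skipped, and `&&` used as a C++11 rvalue reference would be miscounted.

```cpp
#include <cctype>
#include <set>
#include <string>

// Approximate McCabe complexity of one function body:
// 1 + number of decision points found by a plain text scan.
int approx_cyclomatic(const std::string& body) {
    static const std::set<std::string> branch_keywords = {
        "if", "for", "while", "case", "catch"};
    int complexity = 1;
    std::size_t i = 0;
    while (i < body.size()) {
        const unsigned char ch = static_cast<unsigned char>(body[i]);
        if (std::isalpha(ch) || ch == '_') {
            // Read a whole identifier so "diff" is not mistaken for "if".
            std::size_t j = i;
            while (j < body.size() &&
                   (std::isalnum(static_cast<unsigned char>(body[j])) || body[j] == '_')) {
                ++j;
            }
            if (branch_keywords.count(body.substr(i, j - i))) ++complexity;
            i = j;
        } else if (body.compare(i, 2, "&&") == 0 || body.compare(i, 2, "||") == 0) {
            ++complexity;
            i += 2;
        } else {
            if (ch == '?') ++complexity;
            ++i;
        }
    }
    return complexity;
}
```

Pairing a number like this with per-module defect data is what would make the repository useful for defect prediction, as the comment suggests.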
Re:And then... by Sembiance · 2005-12-05 05:39 · Score: 5, Interesting

Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.
Amazon style statistics by tod_miller · 2005-12-05 05:39 · Score: 4, Interesting

I was very impressed with Amazon, who for each book say which phrases and words were particularly unique to that book. (reminds me of that google game where you try and get any two words with only 1 hit).

So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.

I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).

word. Oh some adhesion stats would rock!

please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org

--
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Hit Refresh by everphilski · 2005-12-05 05:44 · Score: 4, Informative

Just hit refresh and the webserver won't get the HTTP_REFERER (granted you'll have to manually delete the text file he serves you)

-everphilski-
Measurements I have made by derek_farn · 2005-12-05 05:47 · Score: 4, Insightful

Source code usage measurements contain many surprises (i.e., developers don't always write what people think they do). Some statistics I have collected, on a smaller code base, are available here. The source of the tools used to extract much of the data (at least for those tables and figures I produced) is available here (C only at the moment).
Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).
Keep up the good work!
Sounds kind of like the PMD scoreboard... by tcopeland · 2005-12-05 05:48 · Score: 4, Interesting

...that is, a static analysis of a bunch of Java SourceForge projects. It does unused code and duplicate code detection... sometimes it finds some interesting things.

PMD home page is here, book site is here.

--
The Army reading list
Re:What? Millions of code? by tgd · 2005-12-05 05:49 · Score: 4, Informative

It's a searchable database OF code from other products, containing 275 million lines you can search across.

It's not a searchable database written in 275 million lines of code.
Please check for this: comma in brackets in C++ by Animats · 2005-12-05 05:58 · Score: 5, Interesting

C++, for historical reasons dating back to C, has weird semantics for commas in brackets. The operator precedence for commas is different inside of "()" and "[]".
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a function call with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
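The search described above could be approximated by tracking the innermost open delimiter and reporting commas that sit directly inside `[...]`. This is a minimal sketch, assuming comments, string literals and character literals have already been stripped from the input; `find_subscript_commas` is a hypothetical name and not part of the search engine.

```cpp
#include <string>
#include <vector>

// Returns the offsets of commas whose innermost enclosing delimiter is '['.
// Commas wrapped in parentheses or braces inside the brackets are ignored,
// matching the "eliminate any where there are parentheses" rule.
std::vector<std::size_t> find_subscript_commas(const std::string& src) {
    std::vector<std::size_t> hits;
    std::vector<char> open;  // currently open delimiters, innermost last
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char ch = src[i];
        if (ch == '(' || ch == '[' || ch == '{') {
            open.push_back(ch);
        } else if (ch == ')' || ch == ']' || ch == '}') {
            if (!open.empty()) open.pop_back();  // naive: assumes balanced input
        } else if (ch == ',' && !open.empty() && open.back() == '[') {
            hits.push_back(i);
        }
    }
    return hits;
}
```

For example, `tab[i, j]` is reported, while `tab[(i, j)]`, `f(a, b)` and `t[g(1, 2)]` are not.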
1. Re:Please check for this: comma in brackets in C++ by chris+macura · 2005-12-05 06:29 · Score: 4, Informative
  
  Yes, they are. But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure, to get multi-dimensional arrays you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row major or column major) are lying around in RAM (depending on where they were allocated), rather than in one contiguous chunk like with C. In other words, you can't do something like this in C++:
  
      class SmartArray {
      public:
          SmartArray(int height, int width);
          int operator[](const int &x, const int &y) const;
          // ...
      };
      ...
      SmartArray a(5, 5);
      a[12, 13];
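A common workaround is to overload `operator()` instead, which may take any number of arguments (before C++23, `operator[]` takes exactly one). The sketch below reuses the `SmartArray` name from the comment above, but the members and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-dimensional array stored as one contiguous, row-major block.
class SmartArray {
public:
    SmartArray(std::size_t height, std::size_t width)
        : width_(width), height_(height), cells_(height * width, 0) {}

    // Writable element access: a(row, col) = value;
    int& operator()(std::size_t row, std::size_t col) {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }

    // Read-only element access for const objects.
    int operator()(std::size_t row, std::size_t col) const {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<int> cells_;  // height_ * width_ elements, zero-initialized
};
```

Usage is `SmartArray a(5, 5); a(2, 3) = 42;`, which keeps all elements in a single allocation at the cost of writing parentheses rather than square brackets.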
How about a potential buffer overflow index? by raddan · 2005-12-05 06:07 · Score: 4, Informative

You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
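A first cut at such an index could simply count calls to the functions named above. The sketch uses `std::regex` with a word boundary so that `fgets(` and `strcpy_s(` are not counted; `count_risky_calls` is a hypothetical name, and matches inside comments or string literals are not filtered out.

```cpp
#include <map>
#include <regex>
#include <string>

// Counts textual calls to gets(), strcpy() and strcat() in a source string.
std::map<std::string, int> count_risky_calls(const std::string& source) {
    // \b keeps "fgets" from matching; \s*\( requires a call-like use.
    static const std::regex call(R"(\b(gets|strcpy|strcat)\s*\()");
    std::map<std::string, int> counts;
    for (std::sregex_iterator it(source.begin(), source.end(), call), end;
         it != end; ++it) {
        ++counts[(*it)[1].str()];  // capture group 1 is the function name
    }
    return counts;
}
```

Dividing the totals by lines of code per project would give a crude "overflow risk per KLOC" figure to rank projects by.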
stats we'd like to see... by digitaldc · 2005-12-05 06:08 · Score: 4, Funny

-# of non-numerical constants
-# of ( ),{ },\ /,#,; characters in code
-time spent debugging/compiling
-total hours spent in production
-gallons of coffee consumed
-hours of daylight seen
-# of relationships destroyed

--
He who knows best knows how little he knows. - Thomas Jefferson
Code Styles by ionrock · 2005-12-05 06:09 · Score: 5, Interesting

I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
histogram of C reserved words by jab · 2005-12-05 06:14 · Score: 5, Interesting

I'd love to see how one of my programs (stats below) compares to the, uh, national average.

1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
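A tally like the one above could be produced by scanning for whole identifiers and counting those that are reserved words. This is a sketch only: `keyword_histogram` is a hypothetical name, the keyword set is the 32 reserved words of C89, and words inside comments and string literals are counted too.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// Counts occurrences of C89 reserved words in a source string.
std::map<std::string, int> keyword_histogram(const std::string& source) {
    static const std::set<std::string> keywords = {
        "auto", "break", "case", "char", "const", "continue", "default", "do",
        "double", "else", "enum", "extern", "float", "for", "goto", "if",
        "int", "long", "register", "return", "short", "signed", "sizeof",
        "static", "struct", "switch", "typedef", "union", "unsigned", "void",
        "volatile", "while"};
    std::map<std::string, int> counts;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char ch = static_cast<unsigned char>(source[i]);
        if (std::isalpha(ch) || ch == '_') {
            // Read the whole identifier so "format" is not counted as "for".
            std::size_t j = i;
            while (j < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[j])) || source[j] == '_')) {
                ++j;
            }
            const std::string word = source.substr(i, j - i);
            if (keywords.count(word)) ++counts[word];
            i = j;
        } else {
            ++i;
        }
    }
    return counts;
}
```

Sorting the resulting map by count gives the same kind of ranking for any program or for the whole archive.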
1. Re:histogram of C reserved words by plabtfall · 2005-12-05 07:25 · Score: 5, Funny
  
  Yeah, me too:
  
  2431 int
  1802 goto
or "// FIXME" by StandardDeviant · 2005-12-05 06:20 · Score: 4, Funny

(subject says it all ;))

--

News for Geeks in Austin, TX
Don't mess around, learn from NLP folks by Xofer+D · 2005-12-05 06:37 · Score: 4, Insightful

This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.

If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.
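As a starting point for the n-gram idea, the sketch below splits source text into crude tokens (identifier or number runs, plus single punctuation characters) and counts every run of `n` consecutive tokens. `tokenize` and `count_ngrams` are hypothetical names, and the tokenizer is far simpler than a real C++ lexer: multi-character operators such as `++` come out as separate characters, and comments and literals are not treated specially.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Crude tokenizer: runs of [A-Za-z0-9_] form one token, every other
// non-space character is a token on its own.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char ch = static_cast<unsigned char>(src[i]);
        if (std::isspace(ch)) { ++i; continue; }
        std::size_t j = i + 1;
        if (std::isalnum(ch) || ch == '_') {
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_')) {
                ++j;
            }
        }
        tokens.push_back(src.substr(i, j - i));
        i = j;
    }
    return tokens;
}

// Counts each sequence of n consecutive tokens, keyed by the tokens
// joined with single spaces.
std::map<std::string, int> count_ngrams(const std::vector<std::string>& tokens,
                                        std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || tokens.size() < n) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key = tokens[i];
        for (std::size_t k = 1; k < n; ++k) key += " " + tokens[i + k];
        ++counts[key];
    }
    return counts;
}
```

Replacing identifiers with a placeholder token before counting would group structurally identical fragments, which is closer to the "common subtrees" idea than raw strings are.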

With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.

--
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.