Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 5, 2005 @05:27AM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

Statistics: by duckpoopy · 2005-12-05 05:32 · Score: 5, Interesting

1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names

--
word.
Statistics TM (c) by chunews · 2005-12-05 05:38 · Score: 5, Interesting

It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL, GPL2, etc.
Also, I would really like to find "patient 0" for source code. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
Interesting Statistics by iso-cop · 2005-12-05 05:39 · Score: 5, Interesting

In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, and lines of code per module, as well as object-oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules, then you have a useful data repository for predicting defects in source code. Keeping different versions of changed modules around is also valuable here. If you can gather information on how long it took to produce each module and how long it took to correct its defects, even better. If you make it easy to reuse the C and C++ modules... even better.
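As a rough illustration of one of these metrics, cyclomatic complexity can be approximated by counting decision points in a function body. This is only a sketch in Python, not anything the search engine in the story is known to do: `estimate_cyclomatic` is a hypothetical helper, and it assumes the body has already been extracted with comments and string literals stripped.

```python
import re

# Tokens that each add one decision point (a McCabe-style approximation).
_DECISION = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

def estimate_cyclomatic(body: str) -> int:
    """Return 1 + the number of decision points in a C/C++ function body.

    Assumes comments and string literals were stripped beforehand,
    so keywords inside them are not miscounted.
    """
    return 1 + len(_DECISION.findall(body))

if __name__ == "__main__":
    sample = "if (a && b) { x(); } else { for (;;) { if (c) break; } }"
    print(estimate_cyclomatic(sample))  # prints 5
```

A real tool would work on a parse tree rather than on raw text, but a token count like this is often close enough for comparing modules against each other.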
Re:And then... by Sembiance · 2005-12-05 05:39 · Score: 5, Interesting

Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.
Please check for this: comma in brackets in C++ by Animats · 2005-12-05 05:58 · Score: 5, Interesting

C++, for historical reasons dating back to C, has weird semantics for commas in brackets. A comma means something different inside "()" than it does inside "[]".
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a subscript operation with one argument. The default "comma operator" evaluates and discards its first operand and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
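The search described above could be sketched roughly like this in Python. It is a hypothetical scanner, not the archive's actual query mechanism: `commas_in_subscripts` is an invented name, and the input is assumed to have comments, string literals, and character literals already stripped. It reports a comma only when the innermost open bracket around it is "[", so commas inside parentheses within the square brackets are skipped, as requested.

```python
def commas_in_subscripts(code: str) -> list:
    """Return offsets of commas whose innermost enclosing bracket is '['.

    Assumes comments and string/character literals were removed first.
    """
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []  # currently open brackets, innermost last
    hits = []
    for pos, ch in enumerate(code):
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            # Pop only on a matching opener; tolerate unbalanced input.
            if stack and stack[-1] == closers[ch]:
                stack.pop()
        elif ch == "," and stack and stack[-1] == "[":
            hits.append(pos)
    return hits

if __name__ == "__main__":
    print(commas_in_subscripts("x = tab[i,j];"))     # prints [9]
    print(commas_in_subscripts("x = tab[f(i,j)];"))  # prints []
```

Any hit would still need a human look, since a character-level scan cannot tell a built-in subscript from an overloaded "operator[]".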
Code Styles by ionrock · 2005-12-05 06:09 · Score: 5, Interesting

I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores, but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
histogram of C reserved words by jab · 2005-12-05 06:14 · Score: 5, Interesting

I'd love to see how one of my programs (stats below) compares to the, uh, national average.
1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
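A histogram like this one can be produced with a few lines of Python. This is a sketch under assumptions rather than the script the poster used: `keyword_histogram` is a hypothetical name, the keyword set is the 32 reserved words of C89, and comments and string literals are assumed to be stripped so that words inside them are not counted.

```python
import re
from collections import Counter

# The 32 reserved words of C89.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "int", "long", "register", "return", "short", "signed", "sizeof",
    "static", "struct", "switch", "typedef", "union", "unsigned", "void",
    "volatile", "while",
}

def keyword_histogram(code: str) -> Counter:
    """Count occurrences of each C reserved word in the given source text."""
    words = re.findall(r"[A-Za-z_]\w*", code)
    return Counter(w for w in words if w in C_KEYWORDS)

if __name__ == "__main__":
    src = "int main(void) { int i; for (i = 0; i < 3; i++) if (i) return i; return 0; }"
    for word, count in keyword_histogram(src).most_common():
        print(count, word)
```

Running the same function over the whole archive and dividing by the total would give the "national average" to compare against.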
Re:My vote is for... by baadger · 2005-12-05 07:57 · Score: 5, Interesting
There's an idea right there: how about some stats showing popularity of various coding conventions?
- Variables: under_score vs. camelCase
- Tabs vs. spaces
- "if (cond) {" vs. "if (cond)\n{"
- How many coders bother enclosing single conditionally executed statements with {}?
- How many coders bother producing comments directly before or after function definitions, describing function implementation?
- Lines of comments to lines of code ratios
- Number of functions to lines of code ratios for various projects?
- Number of projects making use of global variables?
- C, to C++, to C# (if your engine covers it) project ratio
etc
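The first item in the list above, under_score vs. camelCase, could be tallied along these lines. This is a minimal Python sketch with a hypothetical `naming_styles` helper; it assumes the identifiers have already been extracted by a parser, and its two patterns are a deliberately narrow guess at what counts as each style.

```python
import re
from collections import Counter

def naming_styles(identifiers):
    """Classify identifiers as under_score, camelCase, or other."""
    counts = Counter()
    for name in identifiers:
        if re.fullmatch(r"[a-z0-9]+(?:_[a-z0-9]+)+", name):
            counts["under_score"] += 1   # lowercase words joined by '_'
        elif re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", name):
            counts["camelCase"] += 1     # lowercase start, inner capitals
        else:
            counts["other"] += 1         # single words, ALL_CAPS, mixed
    return counts

if __name__ == "__main__":
    names = ["line_count", "lineCount", "i", "MAX_LEN", "parseHTTPHeader"]
    print(naming_styles(names))
```

Single-word names such as "i" fall into "other" because they fit both conventions equally well, so they say nothing about which style a project prefers.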