Slashdot Mirror


Searchable C/C++ DB surpasses 275 million lines

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

8 of 328 comments (clear)

  1. Statistics: by duckpoopy · · Score: 5, Interesting

    1. Lines per function
    2. Comment / command ratio
    3. Number of curse word variable names

    --
    word.
  2. Statistics TM (c) by chunews · · Score: 5, Interesting
    It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL GPL2, etc..

    Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?

    And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.. Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after their done with their other useless lawsuits. :)

  3. Interesting Statistics by iso-cop · · Score: 5, Interesting

    In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.

  4. Re:And then... by Sembiance · · Score: 5, Interesting

    Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.

  5. Please check for this: comma in brackets in C++ by Animats · · Score: 5, Interesting
    C++, for historical reasons dating back to C, has wierd semantics for commas in brackets. The operator precedence for commas is different inside of "()" and "[]".

    So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a function call with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.

    I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.

    This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?

  6. Code Styles by ionrock · · Score: 5, Interesting

    I would love to see if different code styles could be analyzed to see how many peopel use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.

  7. histogram of C reserved words by jab · · Score: 5, Interesting

    I'd love to see how one of my programs (stats below) compares
    to the, uh, national average.

       1222 if
        638 return
        482 static
        413 for
        399 int
        217 const
        201 else
        194 void
        128 char
        115 case
        112 break
         55 default
         43 sizeof
         37 do
         35 switch
         27 enum
         24 struct
         23 while
         15 float
         14 typedef
         10 auto
          7 unsigned
          6 extern
          1 long

  8. Re:My vote is for... by baadger · · Score: 5, Interesting
    Theres an idea right there, how about some stats showing popularity of various coding conventions?

    • Variables: under_score vs. camelCase
    • Tabs vs. spaces
    • "if (cond) {" vs. "if (cond)\n{"
    • How many coders bother enclosing single conditionally executed statements with {}
    • How many coders bother producing comments directly before or after function definitions, describing function implementation?
    • Lines of comments to lines of code ratios
    • Number of functions to lines of code ratios for various projects?
    • Number of projects making use of global variables?
    • C, to C++, to C# (if your engine covers it) project ratio

    etc