Slashdot Mirror


Searchable C/C++ DB surpasses 275 million lines

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

17 of 328 comments

  1. Some statistics to get you started by Anonymous Coward · · Score: 5, Funny
    I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code.


    The following "interesting statistics" come to mind:

    • Percentage of functions named "deepThroat" (0%)
    • Number of comments mentioning a "girlfriend" (11) or "wife" (29) to "Natalie Portman" (41)
    • How many variables named "penis" are of type "long" versus type "short" (unknowable!)


    You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax, so the statistics for "Natalie Portman" may include references to "portman."
  2. useful statistic by kunzy · · Score: 5, Funny

    the time from the frontpage article on /. to the death of your server?

    1. Re:useful statistic by Sembiance · · Score: 5, Funny

      Well, it's been about 2 minutes on slashdot... my site is already dead. So uhm... 2 minutes?

  3. My vote is for... by Anonymous Coward · · Score: 5, Insightful

    How many lines consist of:
    }
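
A minimal sketch of how this count could be produced, assuming the source text is already in memory as one string. The function name `countClosingBraceLines` is hypothetical and not part of the search engine described in the story.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts lines whose only non-whitespace content is a single '}'.
// Hypothetical helper for illustration only.
std::size_t countClosingBraceLines(const std::string& source) {
    std::istringstream in(source);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        // Positions of the first and last non-whitespace characters.
        const std::size_t first = line.find_first_not_of(" \t\r");
        const std::size_t last = line.find_last_not_of(" \t\r");
        // Exactly one non-whitespace character, and it is '}'.
        if (first != std::string::npos && first == last && line[first] == '}') {
            ++count;
        }
    }
    return count;
}
```

Lines such as `};` or `} else {` are deliberately not counted, since the comment asks only for lines that consist of the brace alone.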

    1. Re:My vote is for... by mebollocks · · Score: 5, Funny

      I dunno, maybe you could find the algorithm on the net somewhere? ...if only there was some kinda searchable code database of some sort...

    2. Re:My vote is for... by baadger · · Score: 5, Interesting
      There's an idea right there: how about some stats showing the popularity of various coding conventions?

      • Variables: under_score vs. camelCase
      • Tabs vs. spaces
      • "if (cond) {" vs. "if (cond)\n{"
      • How many coders bother enclosing single conditionally executed statements with {}
      • How many coders bother producing comments directly before or after function definitions, describing function implementation?
      • Lines of comments to lines of code ratios
      • Number of functions to lines of code ratios for various projects?
      • Number of projects making use of global variables?
      • C, to C++, to C# (if your engine covers it) project ratio

      etc
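
Two of the conventions listed above (under_score vs. camelCase, and tabs vs. spaces) could be tallied roughly as follows. This is a sketch under stated assumptions: the classification rules, `NamingStyle`, `classifyIdentifier`, `IndentCounts` and `countIndentation` are all invented for illustration, and identifiers are assumed to have been extracted already by a parser.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

enum class NamingStyle { UnderScore, CamelCase, Other };

// Assumed rules (not from the thread):
//  - under_score: has an interior '_' and no uppercase letters
//  - camelCase:   starts lowercase, has no '_', has at least one uppercase letter
NamingStyle classifyIdentifier(const std::string& name) {
    bool interiorUnderscore = false;
    bool hasUpper = false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c == '_' && i > 0 && i + 1 < name.size()) interiorUnderscore = true;
        if (std::isupper(c)) hasUpper = true;
    }
    if (interiorUnderscore && !hasUpper) return NamingStyle::UnderScore;
    if (!name.empty() && std::islower(static_cast<unsigned char>(name[0])) &&
        hasUpper && name.find('_') == std::string::npos) {
        return NamingStyle::CamelCase;
    }
    return NamingStyle::Other;
}

struct IndentCounts {
    std::size_t tabs = 0;    // lines whose first character is a tab
    std::size_t spaces = 0;  // lines whose first character is a space
};

// Tallies indented lines by their first character only.
IndentCounts countIndentation(const std::string& source) {
    IndentCounts counts;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '\t') ++counts.tabs;
        else if (line[0] == ' ') ++counts.spaces;
    }
    return counts;
}
```

Names such as `MAX_SIZE` or `count` fall into `Other` under these rules, so a real survey would need more categories.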
  4. Similarity checking by roguerez · · Score: 5, Funny

    Find similarities with stuff like SCO.

  5. Statistics: by duckpoopy · · Score: 5, Interesting

    1. Lines per function
    2. Comment / command ratio
    3. Number of curse word variable names

    --
    word.
  6. ratio by FreeBSDbigot · · Score: 5, Funny

    ... of "foo" to "bar."

    --
    Orange whip? Orange whip? Three orange whips.
  7. Suggestion by lbmouse · · Score: 5, Funny

    "I'm currently looking for suggestions..."

    How about a new server?

  8. Statistics TM (c) by chunews · · Score: 5, Interesting
    It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL, GPL2, etc.

    Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?

    And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc. Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)

  9. Interesting Statistics by iso-cop · · Score: 5, Interesting

    In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and the like, as well as object-oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of changed modules is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module, you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
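
As a rough illustration of the first metric mentioned above, cyclomatic complexity is often approximated as one plus the number of decision points in a function body. The sketch below is a swapped-in token-counting approximation, not a real control-flow analysis: it assumes the body has already been isolated, it does not skip comments or string literals, and the name `approximateCyclomaticComplexity` is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Estimates cyclomatic complexity as 1 + number of decision points.
// Decision points counted here (an assumption): if, for, while, case,
// catch, &&, || and the ternary '?'.
int approximateCyclomaticComplexity(const std::string& body) {
    int complexity = 1;
    std::size_t i = 0;
    while (i < body.size()) {
        const unsigned char c = static_cast<unsigned char>(body[i]);
        if (std::isalnum(c) || c == '_') {
            // Consume a whole word so "ifdef_value" is not mistaken for "if".
            const std::size_t start = i;
            while (i < body.size() &&
                   (std::isalnum(static_cast<unsigned char>(body[i])) || body[i] == '_')) {
                ++i;
            }
            const std::string word = body.substr(start, i - start);
            if (word == "if" || word == "for" || word == "while" ||
                word == "case" || word == "catch") {
                ++complexity;
            }
        } else if ((c == '&' || c == '|') && i + 1 < body.size() && body[i + 1] == body[i]) {
            ++complexity;  // short-circuit operator adds a branch
            i += 2;
        } else if (c == '?') {
            ++complexity;  // ternary conditional
            ++i;
        } else {
            ++i;
        }
    }
    return complexity;
}
```

Pairing a number like this with per-module defect counts is the kind of data set the comment describes.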

  10. Re:And then... by Sembiance · · Score: 5, Interesting

    Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.

  11. Please check for this: comma in brackets in C++ by Animats · · Score: 5, Interesting
    C++, for historical reasons dating back to C, has weird semantics for commas in brackets. The operator precedence for commas is different inside of "()" and "[]".

    So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a subscript with one argument. The default "comma operator" evaluates and discards its first operand and returns the second. It once had some uses in C macros.

    I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.

    This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
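
The search described above could be approximated with a small bracket-tracking scanner. This is only a sketch: `findCommasInSquareBrackets` is a hypothetical name, and the scanner works on raw text rather than a parse tree, so it would also flag lambda capture lists such as `[a, b]` and attribute lists such as `[[x, y]]`, and it does not understand raw string literals or digit separators.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the offsets of commas whose innermost enclosing bracket is '['.
// Commas nested in parentheses or braces inside the brackets are ignored,
// as are commas in comments and in string or character literals.
std::vector<std::size_t> findCommasInSquareBrackets(const std::string& src) {
    std::vector<std::size_t> hits;
    std::vector<char> open;  // stack of currently open '(', '[' and '{'
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        if (c == '/' && next == '/') {
            // Line comment: skip to the end of the line.
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (c == '/' && next == '*') {
            // Block comment: skip to the closing "*/".
            const std::size_t end = src.find("*/", i + 2);
            if (end == std::string::npos) break;
            i = end + 1;
        } else if (c == '"' || c == '\'') {
            // String or character literal: skip to the matching quote.
            for (++i; i < src.size() && src[i] != c; ++i) {
                if (src[i] == '\\') ++i;  // skip the escaped character
            }
        } else if (c == '(' || c == '[' || c == '{') {
            open.push_back(c);
        } else if (c == ')' || c == ']' || c == '}') {
            if (!open.empty()) open.pop_back();
        } else if (c == ',' && !open.empty() && open.back() == '[') {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Run over `x = tab[i, j];` it reports one offset, while `tab(i, j)` and `tab[f(i, j)]` report none, which matches the exclusion rule in the comment.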

  12. Code Styles by ionrock · · Score: 5, Interesting

    I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.

  13. histogram of C reserved words by jab · · Score: 5, Interesting

    I'd love to see how one of my programs (stats below) compares
    to the, uh, national average.

       1222 if
        638 return
        482 static
        413 for
        399 int
        217 const
        201 else
        194 void
        128 char
        115 case
        112 break
         55 default
         43 sizeof
         37 do
         35 switch
         27 enum
         24 struct
         23 while
         15 float
         14 typedef
         10 auto
          7 unsigned
          6 extern
          1 long

    1. Re:histogram of C reserved words by plabtfall · · Score: 5, Funny

      Yeah, me too:

          2431 int
          1802 goto