Slashdot Mirror


Searchable C/C++ DB surpasses 275 million lines

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

16 of 328 comments

  1. My vote is for... by Anonymous Coward · · Score: 5, Insightful

    How many lines consist of:
    }

    1. Re:My vote is for... by Triple+Click · · Score: 2, Insightful

      Depends whether you do this:

      if (cond) {
      }

      or this:

      if (cond)
      {

      }
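
      A minimal sketch of how such a count might be taken, assuming a line "consists of" a brace when nothing but whitespace surrounds it. The helper names trim and countLoneBraceLines are made up for illustration and are not part of the search engine.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Returns `line` with leading and trailing whitespace removed.
std::string trim(const std::string& line) {
    const char* ws = " \t\r\n";
    const std::size_t first = line.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = line.find_last_not_of(ws);
    return line.substr(first, last - first + 1);
}

// Counts lines that, ignoring indentation, consist solely of `brace`.
std::size_t countLoneBraceLines(const std::string& source, char brace) {
    assert(brace == '{' || brace == '}');  // only the two brace characters make sense here
    std::istringstream in(source);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (trim(line) == std::string(1, brace)) ++count;
    }
    return count;
}
```

      Run over the two layouts above, this finds one lone } line in each, but a lone { line only in the second, so the style shows up mainly once opening braces are counted too.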

  2. Re:Statistics: by gronofer · · Score: 2, Insightful

    4. The number of times the wheel has been reinvented.

  3. Re:Slashdot Block by lowrydr310 · · Score: 2, Insightful
    This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.

    If you don't want to pay a big bandwidth bill then don't run a webserver.

  4. The basics and more by PetriBORG · · Score: 2, Insightful
    Start with the basics, and then move on.
    1. Whitespace to code ratio
    2. Counts for each of the dirty 7
    3. Line counts that just contained () or {} or []
    4. A list of projects the code is from
    5. And then more interestingly, I'd like to run some sort of program on it to find similarities in code, to see how much one code base overlaps with another. It would be interesting to see if OSS actually does share code between projects or if it's all NIH (not invented here).
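
    Items 1 and 3 could be sketched roughly as follows. This assumes "whitespace to code ratio" means whitespace characters divided by all other characters, and that a bracket-only line is one holding nothing but ( ) [ ] { } and whitespace. SourceStats and collectStats are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical container for the two statistics; the names are illustrative.
struct SourceStats {
    std::size_t whitespaceChars = 0;   // spaces, tabs, newlines and similar
    std::size_t codeChars = 0;         // every other character
    std::size_t bracketOnlyLines = 0;  // lines holding only ()[]{} and whitespace

    // Whitespace-to-code ratio; only meaningful when some code is present.
    double whitespaceToCodeRatio() const {
        assert(codeChars > 0);
        return static_cast<double>(whitespaceChars) / static_cast<double>(codeChars);
    }
};

SourceStats collectStats(const std::string& source) {
    const std::string brackets = "()[]{}";
    SourceStats stats;
    for (unsigned char c : source) {
        if (std::isspace(c)) ++stats.whitespaceChars;
        else ++stats.codeChars;
    }
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        bool sawBracket = false;
        bool sawOther = false;
        for (unsigned char c : line) {
            if (std::isspace(c)) continue;
            if (brackets.find(static_cast<char>(c)) != std::string::npos) sawBracket = true;
            else sawOther = true;
        }
        if (sawBracket && !sawOther) ++stats.bracketOnlyLines;
    }
    return stats;
}
```

    Comments and string literals are counted as code here; a syntax-aware parser could separate them out.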
    --
    Pete/Petri "damn, my chainsaw is clogged with 1's and 0's again." --clyde
  5. Measurements I have made by derek_farn · · Score: 4, Insightful
    Source code usage measurements contain many surprises (i.e., developers don't always write what people think they do). Some statistics I have collected, on a smaller code base, are available here. The source of the tools used to extract much of the data (at least for those tables and figures I produced) is available here (C only at the moment).

    Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of Bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).

    Keep up the good work!

  6. Re:Slashdot Block by b4k3d+b34nz · · Score: 2, Insightful

    Why would anybody WANT to pay a big bandwidth bill? It's called being smart so that he doesn't get the shaft when he has to pay his utilities this month.

    --
    Grammar Lesson: you're is a contraction of "you are"; your means you possess something; yore means days gone by.
  7. Yet another source code search engine? by Anonymous Coward · · Score: 1, Insightful

    Source code search engines have been extremely helpful for me. I prefer www.koders.com, but there are quite a few other decent ones out there. What does this engine have to offer that the others don't? It seems like this one doesn't index code repositories but only indexes files local to the server. Neither does it allow you to click on words in the code and search for them. I also sorely miss bookmark-friendly URLs and free-text queries. On the positive side, I note that your search engine is totally free from ads! Very nice! Although I wouldn't mind having to look at a few ads (which I might even click on) because running a search engine is expensive and a good source code search engine is a very useful service. I sincerely hope that we will see some upgrades of the site.

  8. best_idea_ever by l33t-gu3lph1t3 · · Score: 3, Insightful

    charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)

    --
    ------- "From bored to fanboy in 3.8 asian girls" ----------
  9. Re:Wtf? by Digital+Vomit · · Score: 2, Insightful
    What better reason to create such a program than "why not"?

    A person who is a true programmer in his soul doesn't ask himself "why". Oftentimes the sheer joy of creating something from nothing is enough.

    --
    Modern copyright is theft of culture from everyone and it retards the progress of the useful arts and sciences.
  10. Re:Please check for this: comma in brackets in C++ by Vorondil28 · · Score: 3, Insightful

    I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++.

    I'm no C++ expert, but isn't int array[row][col] a multidimensional array?

    --
    This sig rocks the casbah.
  11. Re:Slashdot Block by gstoddart · · Score: 2, Insightful
    This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.

    If you don't want to pay a big bandwidth bill then don't run a webserver.

    That's a little harsh don't you think?

    It's one thing to run a site and have reasonable expectations of having "enough" bandwidth for your projected traffic, and it's another thing to pay for a slashdotting on an ongoing basis.

    This person has decided they don't really want to be linked from Slashdot.

    It's hardly an all-or-nothing thing ... for my personal web-site, the several gigs of traffic I'm allowed per month are more than adequate. But I'm sure as hell not going to pay extra to have enough on the off chance that everyone in the world suddenly wants to see my site.
    --
    Lost at C:>. Found at C.
  12. don't complain about getting /.ed... by Anonymous Coward · · Score: 1, Insightful

    ... like this guy's site is right now, when YOU submitted the story about your site! If you're not prepared for slashdot traffic, don't submit the story.

  13. Don't mess around, learn from NLP folks by Xofer+D · · Score: 4, Insightful

    This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.

    If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.

    With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
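
    As a rough illustration of the n-gram idea, here is a sketch that counts runs of n consecutive tokens. The tokenizer is a deliberately crude assumption (identifiers and numbers are one token each, every other non-space character is its own token, so ++ becomes two tokens); a real run would reuse the site's C/C++ parser. The names tokenize and countNgrams are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Crude tokenizer, assumed for illustration only: words and numbers are one
// token each, every other non-space character becomes its own token.
std::vector<std::string> tokenize(const std::string& source) {
    auto isWordChar = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '_';
    };
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        if (std::isspace(static_cast<unsigned char>(source[i]))) {
            ++i;
        } else if (isWordChar(source[i])) {
            std::size_t j = i;
            while (j < source.size() && isWordChar(source[j])) ++j;
            tokens.push_back(source.substr(i, j - i));
            i = j;
        } else {
            tokens.push_back(std::string(1, source[i]));
            ++i;
        }
    }
    return tokens;
}

// Counts every run of n consecutive tokens, keyed by the tokens joined with spaces.
std::map<std::string, std::size_t> countNgrams(const std::vector<std::string>& tokens,
                                               std::size_t n) {
    assert(n > 0);
    std::map<std::string, std::size_t> counts;
    for (std::size_t start = 0; start + n <= tokens.size(); ++start) {
        std::string key = tokens[start];
        for (std::size_t k = 1; k < n; ++k) key += " " + tokens[start + k];
        ++counts[key];
    }
    return counts;
}
```

    For example, countNgrams(tokenize("a = 0; b = 0;"), 3) records the trigram "= 0 ;" twice, which is the kind of repeated fragment a larger run would surface.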

    --
    The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.
  14. Re:Please check for this: comma in brackets in C++ by Old+Wolf · · Score: 3, Insightful

    You can do exactly that -- just write a(12,13) instead of a[12,13].
    This is a great counterexample to the GP. Changing the meaning
    of the comma within square brackets would gain NOTHING and would
    mean every existing compiler is now wrong.

    The existing C array type is bad enough as it is, why make it
    even more unwieldy by introducing a new variant? C++ is already
    on the right track: discourage C arrays, and encourage container
    classes that have things like bounds checking and automatic
    memory allocation.
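
    A minimal sketch of the a(12,13) approach, assuming a row-major container of doubles. The Matrix class is hypothetical and written only to illustrate the point; it is not from any library. It uses operator() for two-index access, with bounds checked by assert and storage handled by std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 2-D container; the class name and layout are illustrative.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Two-index access through operator(), bounds-checked in debug builds.
    double& operator()(std::size_t row, std::size_t col) {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }
    double operator()(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // row-major storage, allocated and freed automatically
};
```

    With this, Matrix a(20, 20); a(12, 13) = 4.5; reads and writes a single element, whereas on a plain array the expression a[12,13] is parsed with the comma operator and means a[13].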

  15. Re:Please check for this: comma in brackets in C++ by chris+macura · · Score: 2, Insightful

    That's the whole point of the complaint. Inconsistency between [] and ().