Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
How many lines consist of:
}
Being able to search so much source is also very useful. I was involved in a discussion a while back about how often Bessel functions are used in programs (I claimed rarely). The handful of uses returned from your database helped back up my argument (dare I say, prove it).
Keep up the good work!
Charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)
------- "From bored to fanboy in 3.8 asian girls" ----------
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++.
I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
This sig rocks the casbah.
This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.
If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.
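To make the n-gram idea concrete, here is a rough sketch of counting token n-grams in a snippet of source. The tokenizer is a deliberately crude stand-in (it ignores string literals, comments, and multi-character operators), and the names `tokenize` and `count_ngrams` are made up for illustration; a real run over the database would presumably use the tokens from its actual parser.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy tokenizer: a run of letters, digits, or '_' is one token;
// every other non-whitespace character is a token of its own.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isalnum(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}

// Count every run of n consecutive tokens, joined with single spaces.
std::map<std::string, int> count_ngrams(const std::vector<std::string>& tokens,
                                        std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || tokens.size() < n) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key = tokens[i];
        for (std::size_t j = 1; j < n; ++j) key += " " + tokens[i + j];
        ++counts[key];
    }
    return counts;
}
```

Calling `count_ngrams(tokenize(code), 3)` gives a map from each token trigram to how often it occurs; sorting that map by count would surface the most common fragments.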
With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English; do something interesting.
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.
You can do exactly that -- just write a(12,13) instead of a[12,13].
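For anyone who hasn't seen the trick, a minimal sketch of a two-dimensional array indexed through operator() might look like this. The class name `Matrix`, the row-major layout, and the choice to throw on a bad index are all my own assumptions for illustration, not anything from a particular library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 2-D array wrapper: elements live in one contiguous
// row-major buffer and are reached with a(r, c).
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // operator() may take any number of parameters, so two indices are fine.
    double& operator()(std::size_t r, std::size_t c) {
        return data_[index(r, c)];
    }
    const double& operator()(std::size_t r, std::size_t c) const {
        return data_[index(r, c)];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    // Map (r, c) to a flat offset, rejecting out-of-range indices.
    std::size_t index(std::size_t r, std::size_t c) const {
        if (r >= rows_ || c >= cols_) {
            throw std::out_of_range("Matrix index out of range");
        }
        return r * cols_ + c;
    }

    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

With that in place, `Matrix a(20, 20); a(12, 13) = 4.5;` reads and writes a single element, and the storage stays in one block instead of an array of arrays.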
This is a great counterexample to the GP. Changing the meaning of the comma within square brackets would gain NOTHING and would mean every existing compiler is now wrong.
The existing C array type is bad enough as it is, why make it even more unwieldy by introducing a new variant? C++ is already on the right track: discourage C arrays, and encourage container classes that have things like bounds checking and automatic memory allocation.
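As a small illustration of what the container classes buy you, here is a sketch using std::vector: push_back grows the storage automatically, and at() checks the index and throws std::out_of_range instead of silently reading past the end. The helper names `first_squares` and `access_is_rejected` are invented for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: build the first n squares. The vector allocates
// and grows its own storage, so no manual new[] / delete[] is needed.
std::vector<int> first_squares(int n) {
    std::vector<int> out;
    for (int i = 0; i < n; ++i) {
        out.push_back(i * i);
    }
    return out;
}

// Hypothetical helper: returns true if v.at(i) refuses the index.
// at() performs a bounds check, unlike operator[] on a vector or a C array.
bool access_is_rejected(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a vector of five elements, `access_is_rejected(v, 5)` reports the bad index, whereas the same access on a plain `int[5]` is undefined behavior.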