Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
How many lines consist of:
}
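A statistic like this is cheap to compute. The following is only a rough sketch: the function name `count_lone_brace_lines` is made up for illustration, and it assumes "consists of" means the line holds a single closing brace and nothing else apart from whitespace.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns how many lines of `in` contain only a
// closing brace, ignoring surrounding spaces, tabs and carriage returns.
std::size_t count_lone_brace_lines(std::istream& in) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        const auto first = line.find_first_not_of(" \t\r");
        const auto last = line.find_last_not_of(" \t\r");
        // Exactly one non-blank character on the line, and it is '}'.
        if (first != std::string::npos && first == last && line[first] == '}') {
            ++count;
        }
    }
    return count;
}
```

Feeding it a std::ifstream per source file and summing the results would give the total; lines such as "};" are deliberately not counted under the assumption above.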
4. The number of times the wheel has been reinvented.
If you don't want to pay a big bandwidth bill then don't run a webserver.
Pete/Petri "damn, my chainsaw is clogged with 1's and 0's again." --clyde
Being able to search so much source is also very useful. I was involved in a discussion a while back about how often Bessel functions are used in programs (I claimed rarely). The handful of uses returned from your database helped back up my argument (dare I say, prove it).
Keep up the good work!
Why would anybody WANT to pay a big bandwidth bill? It's called being smart so that he doesn't get the shaft when he has to pay his utilities this month.
Grammar Lesson: you're is a contraction of "you are"; your means you possess something; yore means days gone by.
Charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)
------- "From bored to fanboy in 3.8 asian girls" ----------
A person who is a true programmer in his soul doesn't ask himself "why". Oftentimes the sheer joy of creating something from nothing is enough.
Modern copyright is theft of culture from everyone and it retards the progress of the useful arts and sciences.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++.
I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
This sig rocks the casbah.
That's a little harsh, don't you think?
It's one thing to run a site and have reasonable expectations of having "enough" bandwidth for your projected traffic, and it's another thing to pay for a slashdotting on an ongoing basis.
This person has decided they don't really want to be linked from Slashdot.
It's hardly an all-or-nothing thing.
Lost at C:>. Found at C.
This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.
If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.
With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
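To make the n-gram suggestion concrete, here is a minimal sketch of counting token n-grams. The tokenizer is a deliberately crude assumption (it knows nothing about strings, comments or multi-character operators), and the names `tokenize` and `count_ngrams` are invented for this example; a real run would reuse the tokens the site's parser already produces.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Crude tokenizer (an assumption for illustration): identifiers and numbers
// become one token each; every other non-space character is its own token.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isalnum(c) || c == '_') {
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_')) {
                ++j;
            }
            tokens.push_back(src.substr(i, j - i));
            i = j;
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}

// Counts every run of n consecutive tokens, keyed by the tokens joined
// with single spaces.
std::map<std::string, std::size_t> count_ngrams(const std::vector<std::string>& tokens,
                                                std::size_t n) {
    std::map<std::string, std::size_t> counts;
    if (n == 0 || tokens.size() < n) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key = tokens[i];
        for (std::size_t k = 1; k < n; ++k) {
            key += ' ';
            key += tokens[i + k];
        }
        ++counts[key];
    }
    return counts;
}
```

Calling count_ngrams(tokenize(source), 3) on each file and merging the maps would give trigram frequencies across the corpus. Replacing identifiers with a placeholder token before counting would probably expose more of the shared grammar, but that is a design choice beyond this sketch.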
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.
You can do exactly that -- just write a(12,13) instead of a[12,13].
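For anyone who hasn't seen the idiom, a two-argument operator() on a small wrapper class might look roughly like this. The class name `Grid`, the row-major layout and the bounds check are all assumptions made for the example, not anything from the committee discussion.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 2-D array wrapper; name and layout are illustrative only.
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Two-argument call operator stands in for a[r, c].
    double& operator()(std::size_t r, std::size_t c) { return data_[index(r, c)]; }
    const double& operator()(std::size_t r, std::size_t c) const { return data_[index(r, c)]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    // Maps (row, column) to a flat row-major offset, with bounds checking.
    std::size_t index(std::size_t r, std::size_t c) const {
        if (r >= rows_ || c >= cols_) throw std::out_of_range("Grid index out of range");
        return r * cols_ + c;
    }

    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // contiguous storage, zero-initialized
};
```

With that in place, Grid a(20, 20); a(12, 13) = 4.5; reads and writes a single element, and an out-of-range index throws instead of silently walking off the end.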
This is a great counterexample to the GP. Changing the meaning of the comma within square brackets would gain NOTHING and would mean every existing compiler is now wrong.
The existing C array type is bad enough as it is, why make it even more unwieldy by introducing a new variant? C++ is already on the right track: discourage C arrays, and encourage container classes that have things like bounds checking and automatic memory allocation.
That's the whole point of the complaint: the inconsistency between [] and ().