Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 5, 2005 @05:27AM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

6 of 328 comments

Hit Refresh by everphilski · 2005-12-05 05:44 · Score: 4, Informative

Just hit refresh and the webserver won't get the HTTP_REFERER (granted, you'll have to manually delete the text file he serves you).

-everphilski-
Re:What? Millions of code? by tgd · 2005-12-05 05:49 · Score: 4, Informative

It's a searchable database OF code from other products, containing 275 million lines you can search across.

It's not a searchable database written in 275 million lines of code.
How about a potential buffer overflow index? by raddan · 2005-12-05 06:07 · Score: 4, Informative

You can start by seeing how often people use gets(), strcpy(), strcat(), etc. Look for all the fun little common mistakes that people make.
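A rough sketch of how such a count could be taken, assuming a purely lexical scan of the source text. The function name `count_risky_calls` and the three-function list are illustrative assumptions, not how the search site works; a syntax-aware parser would do better, since this version also counts matches inside comments and string literals.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: counts textual call sites of a few overflow-prone
// C library functions. It is lexical only, so occurrences inside comments
// or string literals are counted as well.
std::map<std::string, std::size_t> count_risky_calls(const std::string& source) {
    // \b stops "fgets(" or "my_strcpy(" from matching; "strncpy(" does not
    // match either, because the alternatives are spelled out exactly.
    static const std::regex risky(R"(\b(gets|strcpy|strcat)\s*\()");

    std::map<std::string, std::size_t> counts;
    for (std::sregex_iterator it(source.begin(), source.end(), risky), end;
         it != end; ++it) {
        ++counts[(*it)[1].str()];  // capture group 1 is the function name
    }
    return counts;
}
```

Feeding it the contents of each indexed file and summing the maps would give a crude per-project "risky call" tally.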
Re:Please check for this: comma in brackets in C++ by chris+macura · 2005-12-05 06:29 · Score: 4, Informative

Yes, they are. But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure to get multi-dimensional arrays, you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk like with C. In other words, you can't do something like this in C++:

    class SmartArray {
    public:
        SmartArray(int height, int width);
        int operator[](const int &x, const int &y) const;
        // ...
    };
    ...
    SmartArray a(5, 5);
    a[12, 13];
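One common way around this, sketched here as an assumption rather than anything the poster proposed, is to overload operator() with two indices over a single flat buffer, so the storage stays contiguous. The class name SmartArray is borrowed from the comment; the member names, the row-major layout and the bounds check are hypothetical choices.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 2-D array backed by one contiguous, row-major buffer.
class SmartArray {
public:
    SmartArray(std::size_t height, std::size_t width)
        : height_(height), width_(width), data_(height * width, 0) {}

    // Two-index access through operator(), which may take any number
    // of parameters, unlike the single-parameter operator[].
    int& operator()(std::size_t row, std::size_t col) {
        return data_[index(row, col)];
    }
    int operator()(std::size_t row, std::size_t col) const {
        return data_[index(row, col)];
    }

    std::size_t height() const { return height_; }
    std::size_t width() const { return width_; }

private:
    // Maps (row, col) to an offset in the flat buffer; throws if out of range.
    std::size_t index(std::size_t row, std::size_t col) const {
        if (row >= height_ || col >= width_) {
            throw std::out_of_range("SmartArray index out of range");
        }
        return row * width_ + col;
    }

    std::size_t height_;
    std::size_t width_;
    std::vector<int> data_;
};
```

With this, `SmartArray a(5, 5); a(2, 3) = 42;` reads and writes a single element, while `a(12, 13)` throws std::out_of_range instead of silently reading past the buffer.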
Re:Interesting stats by moosesocks · 2005-12-05 07:05 · Score: 4, Informative

How many lines contain expletives?

For your reading pleasure: the Linux kernel fuck count.

--
-- If you try to fail and succeed, which have you done? - Uli's moose
Re:Choice of db? by Sembiance · 2005-12-05 08:18 · Score: 4, Informative

I've used MySQL in the past for some projects at work where the number of rows was several hundred million, and it ran with no problems, so I knew it was capable of handling large row counts.

I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this.)

So I had to hand off searching to Lucene, which worried me a great deal (being Java), but as folks tell me, 'Java is not slow'.
They are right: Java is very fast at handling the searching, and I've been very impressed.
Most searches in the Lucene (Java) index take only one or two seconds.
The MySQL query/join for additional info takes another 4 or 5 seconds.

Most searches take about 8 seconds to come up, even under no load.

I simply don't have enough RAM to keep the necessary MySQL indexes in memory and use index-only queries.