Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 5, 2005 @05:27AM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

8 of 328 comments

Re:Statistics: by Anonymous Coward · 2005-12-05 05:40 · Score: 3, Informative

From the stats page if you cannot get to it...

Overall Stats
Number of Packages: 10,931
Total Number of Files: 1,151,819
Total Lines of Code (No comments, no blank lines): 283,119,081
Total of All Lines: 420,355,464
Total Number of Functions: 7,782,468
Total Number of Functions Called: 69,500,700
Total Number of Macros: 9,947,564
Total Number of Classes: 209,361
Total Number of Comments: 38,125,107
Total Number of Structures: 554,178
Total Number of Unions: 19,687
Total Number of Includes: 5,904,187

Hit Refresh by everphilski · 2005-12-05 05:44 · Score: 4, Informative

Just hit refresh and the webserver won't get the HTTP_REFERER (granted, you'll have to manually delete the text file he serves you).

-everphilski-

Re:What? Millions of code? by tgd · 2005-12-05 05:49 · Score: 4, Informative

It's a searchable database OF code from other products, containing 275 million lines you can search across.

It's not a searchable database written in 275 million lines of code.

How about a potential buffer overflow index? by raddan · 2005-12-05 06:07 · Score: 4, Informative

You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.

Re:Please check for this: comma in brackets in C++ by chris+macura · 2005-12-05 06:29 · Score: 4, Informative

Yes, they are. But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure to get multi-dimensional arrays, you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk as in C. In other words, you can't do something like this in C++:

    class SmartArray {
    public:
        SmartArray(int height, int width);
        int operator[](const int &x, const int &y) const;
        // ...
    };
    ...
    SmartArray a(5, 5);
    a[12, 13];

Re:Interesting stats by moosesocks · 2005-12-05 07:05 · Score: 4, Informative

How many lines contain expletives?

For your reading pleasure... the Linux kernel fuck count.

--
-- If you try to fail and succeed, which have you done? - Uli's moose

Re:Choice of db? by Sembiance · 2005-12-05 08:18 · Score: 4, Informative

I've used MySQL in the past for some projects at work where the number of rows was several hundred million, and it ran with no problems, so I knew it was capable of handling large row counts.

I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this.)

So I had to hand off searching to Lucene, which worried me a great deal (being Java), but as folks tell me, 'Java is not slow'.
They are right: Java is very fast at handling the searching, and I've been very impressed.
Most searches in the Java database take only one or two seconds.
The MySQL query/join for additional info takes another 4 or 5 seconds.

Most searches take about 8 seconds to come up, even under no load.

I simply don't have enough RAM to keep the necessary MySQL indexes in memory and use index-only queries.

Proposed workaround doesn't work by Animats · 2005-12-05 08:31 · Score: 3, Informative

Yes, that compiles and runs, but it doesn't do what you think it does. Put in some debug prints to see what's actually happening, which is this:
- "5,5" is evaluated using the built-in definition of ",", returning "5". The built-in comma operator, which needs no conversion, is preferred over the sequence that first converts to "location" and then applies the overloaded comma operator. So the built-in comma operator is used. See the discussion in the C++ ARM, section 13.2, "Argument matching", which says "consider an exact match better than any conversion".
- "5" is converted to type "location" by the constructor for "location", resulting in a "location" object with "dimension=1" and "coordinates[0]=5".
- This "location" object is passed to "operator[]", which then accesses "coordinates[1]", an uninitialized value, which it then uses as a subscript, returning a reference to an arbitrary memory location. So, instead of returning "&blah.matrix[5][5]", it returns "&blah.matrix[???][5]". The example program seems to run in VC++ only because that part of memory happens to be 0 at startup, so this returns "&blah.matrix[0][5]". In other circumstances, it might cause a crash.
- "10" is stored into the wrong location of "blah", or outside it, due to the bad reference generated above. This is where the buffer overflow occurs.
You can force the conversion with
    blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading "operator,(int, int)" outside of an object won't work either.
Hence the need for a straightforward solution.