Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
The following "interesting statistics" come to mind:
You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax, so the statistics for "Natalie Portman" may include references to "portman."
The time from the front-page article on /. to the death of your server?
How many lines consist of:
}
Find similarities with stuff like SCO.
How many lines contain expletives?
blog and junk
With all that code indexed, maybe we'll finally be able to figure out what the heck SCO's talking about.
But then again, probably not...
Online Starcraft RPG? At
Dietary fiber is like asynchronous IO-- Non-blocking!
1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names
word.
... of "foo" to "bar."
Orange whip? Orange whip? Three orange whips.
"I'm currently looking for suggestions..."
How about a new server?
So, this is not a flame, but I'm curious about your choice of dbs.
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.
How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?
Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).
Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
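As a rough illustration of the cyclomatic complexity metric mentioned above, here is a minimal sketch that estimates it for one function body as 1 plus the number of decision points. The function name `approx_cyclomatic` and the set of tokens counted are my own assumptions, and the scan is purely textual: it assumes comments and string literals have already been stripped, which a real tool would have to handle properly.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Rough estimate: 1 + number of decision points in a function body.
// Assumption: the input contains no comments or string literals.
// approx_cyclomatic is a hypothetical helper, not part of any library.
int approx_cyclomatic(const std::string& body) {
    static const char* const keywords[] = {"if", "for", "while", "case", "catch"};
    int complexity = 1;
    std::size_t i = 0;
    while (i < body.size()) {
        const unsigned char c = static_cast<unsigned char>(body[i]);
        if (std::isalpha(c) || c == '_') {
            // Consume a whole identifier so "iffy" is not mistaken for "if".
            const std::size_t start = i;
            while (i < body.size() &&
                   (std::isalnum(static_cast<unsigned char>(body[i])) || body[i] == '_')) {
                ++i;
            }
            const std::string word = body.substr(start, i - start);
            for (const char* kw : keywords) {
                if (word == kw) { ++complexity; break; }
            }
        } else if ((c == '&' || c == '|') && i + 1 < body.size() && body[i + 1] == body[i]) {
            ++complexity;  // short-circuit operators && and ||
            i += 2;
        } else {
            if (c == '?') ++complexity;  // conditional operator
            ++i;
        }
    }
    return complexity;
}
```

Calling `approx_cyclomatic` on each parsed function body and storing the result next to the defect counts would give the kind of per-module table described above.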
Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.
I was very impressed with Amazon, who for each book say which phrases and words were particularly unique to that book. (Reminds me of that Google game where you try to get any two words with only 1 hit.)
So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.
I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).
word. Oh some adhesion stats would rock!
please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
My apologies then. As a regular Slashdotter it is forbidden for me to RTFA.
Microsoft Sucks, F/OSS Rocks. I get mod points now right?
Pete/Petri "damn, my chainsaw is clogged with 1's and 0's again." --clyde
Just hit refresh and the webserver won't get the HTTP_REFERER (granted you'll have to manually delete the text file he serves you)
-everphilski-
1) randomly select 2000 lines of code
2) compile
3) execute
4) ???????
5) PROFIT!
I'd like to know whether the word "woman" appears anywhere, and if so, in what projects.
Eh.
"Piter, too, is dead."
Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of Bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).
Keep up the good work!
...that is, a static analysis of a bunch of Java SourceForge projects. It does unused code and duplicate code detection... sometimes it finds some interesting things.
PMD home page is here, book site is here.
The Army reading list
It's a searchable database OF code from other products, containing 275 million lines you can search across.
It's not a searchable database written in 275 million lines of code.
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a function call with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
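To make the difference concrete, here is a small sketch. The `Grid` class and `comma_demo` function are made-up names for illustration. The comma expression is written with explicit parentheses so the snippet compiles the same way under any standard; under the rules discussed above, the unparenthesized `tab[i, j]` is parsed identically, though newer standards (C++20 onward) changed how an unparenthesized comma inside brackets is treated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2-D grid stored in a flat vector (illustrative only).
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols) : cols_(cols), data_(rows * cols, 0) {}
    // Two-argument call syntax: row and column.
    int& operator()(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    // One-argument subscript: flat index.
    int& operator[](std::size_t i) { return data_[i]; }
private:
    std::size_t cols_;
    std::vector<int> data_;
};

int comma_demo() {
    Grid tab(3, 4);
    std::size_t i = 1, j = 2;
    tab(i, j) = 7;    // row 1, column 2: flat index 6
    tab[(i, j)] = 9;  // comma operator: i is evaluated and discarded, so this is tab[2]
    return tab(0, 2); // flat index 2, which the line above set to 9
}
```

The write through `operator[]` lands on flat index 2, not on row 1, column 2, which is exactly the surprise the proposed search is meant to rule out in real code.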
charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)
------- "From bored to fanboy in 3.8 asian girls" ----------
See also Codase.com, another "Source Code Search Engine", which lets you search by method names, class names, variable names, free text, etc.
-Mark
Don't know, koders.com supports a lot more languages and also lets you narrow your search to specific licenses. The few extra lines of code just don't seem to do it, especially because such measures highly depend on the chosen method.
A person who is a true programmer in his soul doesn't ask himself "why". Oftentimes the sheer joy of creating something from nothing is enough.
Modern copyright is theft of culture from everyone and it retards the progress of the useful arts and sciences.
You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
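A minimal sketch of what such a count could look like, assuming nothing about how the site actually indexes calls. `count_risky_calls` is a hypothetical helper, and it does plain text matching, so hits inside comments or string literals are counted too.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Counts textual calls to the unsafe functions named above.
// Assumption: plain regex matching on raw source text; comments and
// string literals are not excluded.
std::map<std::string, std::size_t> count_risky_calls(const std::string& source) {
    // \b keeps fgets() from matching as gets(); \s*\( requires a call.
    static const std::regex call(R"(\b(gets|strcpy|strcat)\s*\()");
    std::map<std::string, std::size_t> counts;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(source.begin(), source.end(), call); it != end; ++it) {
        ++counts[(*it)[1].str()];
    }
    return counts;
}
```

Since the engine is already syntax aware, it could presumably answer this from its function-call index instead of a regex; the sketch only shows the shape of the statistic.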
-# of non-numerical constants /,#,; characters in code
-# of ( ),{ },\
-time spent debugging/compiling
-total hours spent in production
-gallons of coffee consumed
-hours of daylight seen
-# of relationships destroyed
He who knows best knows how little he knows. - Thomas Jefferson
I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
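For the simple camelCase versus under_scores split, a sketch like the following would do as a first pass. The names `NameStyle` and `classify_identifier` and the exact rules are assumptions of mine: "camel" here means a lowercase first letter, at least one uppercase letter, and no underscores; "snake" means an underscore between characters and no uppercase letters; everything else (including ALL_CAPS macros and PascalCase) falls into "other".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class NameStyle { Camel, Snake, Other };

// Classifies a single identifier by naming style (hypothetical rules, see above).
NameStyle classify_identifier(const std::string& name) {
    if (name.empty()) return NameStyle::Other;
    bool has_upper = false;
    bool has_underscore = false;
    bool has_inner_underscore = false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (std::isupper(c)) has_upper = true;
        if (c == '_') {
            has_underscore = true;
            // Leading or trailing underscores do not make a name snake_case.
            if (i > 0 && i + 1 < name.size()) has_inner_underscore = true;
        }
    }
    const bool starts_lower = std::islower(static_cast<unsigned char>(name[0])) != 0;
    if (has_inner_underscore && !has_upper) return NameStyle::Snake;
    if (starts_lower && has_upper && !has_underscore) return NameStyle::Camel;
    return NameStyle::Other;
}
```

Tallying the result per project for every function and class name would give the style breakdown; the more complicated trends would need something smarter than this.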
For example, "Lines of code" / "Lines of commenting" will always produce "Inf"
I'd love to see how one of my programs (stats below) compares
to the, uh, national average.
1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
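For anyone wanting to produce the same kind of table for their own code, here is a minimal sketch. `count_keywords` is a made-up name, the keyword set is just the words that appear in the table above, and the scan assumes comments and string literals have been stripped beforehand (otherwise an "if" in a comment gets counted too).

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Counts how often each keyword from the table above appears in the source.
// Assumption: comments and string literals were removed before the call.
std::map<std::string, std::size_t> count_keywords(const std::string& source) {
    static const std::set<std::string> keywords = {
        "if", "return", "static", "for", "int", "const", "else", "void",
        "char", "case", "break", "default", "sizeof", "do", "switch", "enum",
        "struct", "while", "float", "typedef", "auto", "unsigned", "extern", "long"};
    std::map<std::string, std::size_t> counts;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isalpha(c) || c == '_') {
            // Read a whole identifier so "interval" is not counted as "int".
            const std::size_t start = i;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_')) {
                ++i;
            }
            const std::string word = source.substr(start, i - start);
            if (keywords.count(word) != 0) ++counts[word];
        } else {
            ++i;
        }
    }
    return counts;
}
```

Sorting the resulting map by count in descending order gives the layout used in the table.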
(subject says it all ;))
News for Geeks in Austin, TX
This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.
If that's too hard, try finding all n-grams instead, at least under some length. That's a lot more useful than just individual tokens or strings.
With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
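The n-gram idea is the easy part to sketch. Assuming the source has already been split into tokens (the site's parser would do that; the tokenizer is not shown here), counting every run of n consecutive tokens looks roughly like this. `count_ngrams` is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Counts every run of n consecutive tokens.
// Assumption: tokens come from an existing lexer; none is implemented here.
std::map<std::vector<std::string>, std::size_t>
count_ngrams(const std::vector<std::string>& tokens, std::size_t n) {
    std::map<std::vector<std::string>, std::size_t> counts;
    if (n == 0 || tokens.size() < n) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(i);
        const auto last = first + static_cast<std::ptrdiff_t>(n);
        ++counts[std::vector<std::string>(first, last)];
    }
    return counts;
}
```

For the token stream `for ( int i ; for ( int j` with n = 3, the trigram `for ( int` is counted twice. At 275 million lines the map would not fit in memory as written, so a real run would need to shard or stream the counts.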
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.
Counting the number of "TODO"s and "XXX"s in "production" open source code could be interesting.
There are two types of people in this world: those that categorize other people and those that don't.
No, no, no.
You do not use lines 1..N on the same lady until it works. It's not like breaking encryption -- you don't get to try all the possible keys.
I have friends who have done this, and they swear it's a percentage game. Choose one line you like, and try it on women 1..N until it does work, or you get tired of getting told to sod off. Apparently, with the right combination of variables, any line can be verified to work under some circumstances.
Truthfully, I don't know how anyone can set out with the knowledge they're going to get told to drop dead 70-100 times/night, but I guess if you can live with that kind of failure rate on an ongoing basis, you'll eventually get the success rate you wanted.
Now go forth young geek, and attempt to multiply.
Lost at C:>. Found at C.
I'm just a single coder
-1, Redundant
This is Slashdot, of course we're all single.
I recently did a search on some of our codebase here at work to see how many times the above keywords remained in shipping code. I was a little surprised to see how many cases there were in our code. I think sometimes, maybe even most of the time, we as programmers overuse these words.
Pete
What's a sig? Pete Brubaker
You can force the conversion with
blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
Hence the need for a straightforward solution.
I just did it for fun, and hopefully some people might get some use out of it.
:)
:)
This engine understands the code at a C/C++ syntax level, unlike koders.com, so you can better search for what you're after (comments, functions, macros, classes, etc).
Also this engine DOES allow you to click on words in the code, but only includes and function or macro calls.
There are several things that are not that great about my site: it's a little slow, it doesn't support free text searching or variable searching, and you can't copy search URLs for pasting (it uses XMLHttp and form POSTs).
But it's just me doing this thing, and I have limited time and most importantly limited money/hardware.
My wish is for Google to do their own but index a LOT more code and have it be fast and friendly.
They certainly have the resources to do it, and it would be a great tool for coders to use. Maybe this will help fill a gap in the meantime.
That "woosh" sound you hear is the wink emoticon zooming over your head, joke in tow.
I know PHP is a great web language and that it probably isn't the cause of the slowdown. Heck, even Yahoo! uses it these days.
I was attempting (unsuccessfully, it seems) to make fun of the purists who insist that robust web applications must run on something compiled in order to reach acceptable performance under high load.
auto is a throwback to B days (the language immediately before C). B had no data types (no int, float, double, etc) but did have storage types: auto, static, and extrn.
auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).
1. foo() { auto bar; ... }
2. foo() { static bar; ... }
3. foo() { extrn bar; ... }
4. foo() { bar; ... }
All mean something different in B: the first three instances of bar are declarations, the fourth is an expression statement (and if I remember my B correctly, it is invalid as the first statement of foo(), because bar hasn't been declared one of auto, static, or extrn yet in this function).
In C, auto is completely redundant. Except, perhaps, in comments.
Ah, B. The days when programmers were programmers and data was data, and you could perform any operation you liked on any variable. Want to divide a pointer to a string by 3? Go ahead. Self-disciplined programmers don't need training wheels. Just a choice between auto, static and extrn.
I am anarch of all I survey.