Large, Free, and Interesting SQL-ready Datasets?

Posted by Cliff on Tuesday July 6, 2004 @07:35AM from the drop-in-data dept.

Jon H asks: "I'd like to teach myself various platforms and technologies that involve accessing databases. The problem is that my ideas for learning projects are usually boring toy projects that require lots of tedious data entry to create a useful database, such as a personal library database. This doesn't particularly interest me. It would be much easier if I had a big, interesting dataset that I could load into an SQL database without too much trouble. Then I could spend my time on the PHP, or WebObjects, or JBoss, or whatever. I'd like something more real than the usual toy demo databases: something weighty, 20 megabytes and up, big enough for poor software design to cause performance issues that might not be seen in smaller databases. Ideally, it would be in a form that could easily be loaded into an SQL database, perhaps even including a schema. Any links would be appreciated. Do such beasts exist?"

Use a random number generator by jgardn · 2004-07-06 07:37 · Score: 2, Interesting

I am sure that you will get all the data you ever need from that.

As for real-world data, you'll have to collect it yourself. Start with incoming Ethernet packets, and file them away into multiple tables. That'll give you a good dataset. You may also want to try your hand at incoming email, HTTP requests, etc.

--
The radical sect of Islam would either see you dead or "reverted" to Islam.
Automate the generation by Acidic_Diarrhea · 2004-07-06 07:39 · Score: 2, Interesting

Why not just write a simple script to populate your database? You might not want pure nonsense in there, so you could use a dictionary file or something similar to grab random words. You could also use a fortune file to fill your DB. As I see it, you just want a large dataset to see how your application will perform. That can easily be automated, and manual data entry is clearly not needed, since I see no reason you would need 'real' data.
If this isn't down-to-earth enough, write a simple application to fill your database with products from eBay or Amazon: just change the product ID and grab the resulting HTML, parsing out some identifying features and placing them in your DB.
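
A minimal sketch of the dictionary-file idea, assuming Python with SQLite. The `items` table, its columns, and the `/usr/share/dict/words` path are illustrative assumptions, not anything from the post; a tiny built-in word list is used if the file is missing.

```python
import random
import sqlite3

def load_words(path):
    """Read one word per line; fall back to a tiny list if the file is absent."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            return [line.strip() for line in fh if line.strip()]
    except OSError:
        return ["alpha", "bravo", "charlie", "delta", "echo"]

def populate(conn, words, rows, seed=0):
    """Fill a hypothetical `items` table with rows built from random words."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
    )
    batch = (
        (rng.choice(words), " ".join(rng.choice(words) for _ in range(8)))
        for _ in range(rows)
    )
    conn.executemany(
        "INSERT INTO items (name, description) VALUES (?, ?)", batch
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    populate(conn, load_words("/usr/share/dict/words"), rows=100_000)
    print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Raising `rows` until the database no longer fits comfortably in memory is one way to make slow queries show up.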

--
I hate liberals. If you are a liberal, do not reply.
NIST databases by bmwm3nut · 2004-07-06 07:41 · Score: 5, Interesting

NIST has a bunch of data. I remember downloading handwritten characters a while ago to make handwriting recognition software. They have data for just about everything; the chemistry data is probably some of the best to put in a relational database. Check out: http://www.nist.gov/srd/index.htm
Census Data by HotNeedleOfInquiry · 2004-07-06 07:46 · Score: 4, Interesting

Here's all the data you'll ever need, free of charge from the gov. Some appears to be freely available and some is restricted. Have fun.

ftp://ftp2.census.gov/census_2000/datasets/

--
"Eve of Destruction", it's not just for old hippies anymore...
baseball stats by Gilk180 · 2004-07-06 07:47 · Score: 2, Interesting

Sports stats are always good.

Frankly, 20 MB is not going to give you performance issues. To realistically test the performance of your engine and your queries/schemas, you need at least enough data to fill main memory and cause disks to be used. Much more would be much better.
1. Re:baseball stats by Gabey · 2004-07-06 08:15 · Score: 3, Interesting
  
  For a more developer-oriented version of this database, check out http://www.baseball-databank.org/
Plenty of data is right under your nose. by mcgroarty · 2004-07-06 07:49 · Score: 2, Interesting

Any collection of mailing lists or newsgroups is a good candidate for inclusion. You've got a wealth of data in the headers, as well as a nice free-form body and a spaghetti maze of parent and child linkage.
Your web logs or even your system logs are good candidates as well, as are the package description and dependency databases for any given Linux distribution, and the bug reports for same. One cool project might be to load the Debian, Red Hat, etc. dependency databases, merge them together, and report the differences. That's a good-sized project with the potential to benefit the free software community.
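
As a rough sketch of the mailing-list idea, the headers and the parent/child linkage can be pulled out with Python's standard `email` module and stored in SQLite. The `messages` table layout and the sample messages are assumptions made up for illustration; a real archive would need mbox handling, multipart bodies, and `References` headers as well.

```python
import sqlite3
from email import message_from_string

# Invented sample messages; a real archive would be read from mbox files.
SAMPLE = [
    "Message-ID: <1@example.org>\nFrom: a@example.org\n"
    "Subject: Datasets?\n\nAny ideas?\n",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "From: b@example.org\nSubject: Re: Datasets?\n\nTry census data.\n",
]

def store_messages(conn, raw_messages):
    """Parse raw messages and store headers, body, and the parent link."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, parent_id TEXT, sender TEXT, "
        "subject TEXT, body TEXT)"
    )
    for raw in raw_messages:
        msg = message_from_string(raw)
        # Multipart bodies are skipped here to keep the sketch short.
        body = "" if msg.is_multipart() else msg.get_payload()
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?)",
            (msg["Message-ID"], msg["In-Reply-To"], msg["From"],
             msg["Subject"], body),
        )
    conn.commit()

def replies_to(conn, message_id):
    """Return the IDs of messages whose In-Reply-To names `message_id`."""
    rows = conn.execute(
        "SELECT message_id FROM messages WHERE parent_id = ? "
        "ORDER BY message_id",
        (message_id,),
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_messages(conn, SAMPLE)
    print(replies_to(conn, "<1@example.org>"))
```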
You owe the oracle * FROM wallet WHERE denomination > 20;
Machine Learning Databases by Tozog · 2004-07-06 08:02 · Score: 2, Interesting

You could probably write a script to use the data from the machine learning database collection from UCI.

Some are large, some are interesting, some are simple, but plenty of data.

http://www.ics.uci.edu/~mlearn/MLSummary.html
NOAA by QuantumRiff · 2004-07-06 08:15 · Score: 3, Interesting

Go to the NOAA web site and download all the weather data from your area going back many, many years. It's fascinating to take the datasets and plot the ranges in temperature, humidity, etc.

--

What are we going to do tonight Brain?
Firewall output by Gadzinka · 2004-07-06 08:19 · Score: 2, Interesting

Catch all "unusual" packets on your firewall and log them. Lots of data and interesting things to do in order to find patterns in this apparent chaos.

I use iptables for this, but I'm sure you can do this with all the rest. You could even (as an exercise) try to log it directly to a database. I just occasionally scan logs left by syslog-ng.
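
A small sketch of loading such logs into a database after the fact, assuming the lines carry the `KEY=value` fields that the iptables LOG target writes (`SRC=`, `DST=`, `PROTO=`, `DPT=`, and so on). The exact line prefix depends on the syslog setup, and the `packets` table and sample lines below are invented for illustration.

```python
import re
import sqlite3

# Matches KEY=value tokens; the value may be empty, as in "OUT=".
FIELD = re.compile(r"\b([A-Z]+)=(\S*)")

# Invented sample lines using documentation-only IP ranges.
SAMPLE = [
    "Jul  6 08:19:01 fw kernel: IN=eth0 OUT= SRC=192.0.2.10 "
    "DST=198.51.100.1 PROTO=TCP SPT=4321 DPT=22",
    "Jul  6 08:19:05 fw kernel: IN=eth0 OUT= SRC=192.0.2.11 "
    "DST=198.51.100.1 PROTO=TCP SPT=4400 DPT=22",
    "Jul  6 08:19:09 fw kernel: IN=eth0 OUT= SRC=192.0.2.10 "
    "DST=198.51.100.1 PROTO=TCP SPT=4555 DPT=80",
]

def parse_line(line):
    """Return the KEY=value fields of one log line as a dict."""
    return dict(FIELD.findall(line))

def store(conn, lines):
    """Insert one row per log line that has a source address."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packets ("
        "src TEXT, dst TEXT, proto TEXT, dport INTEGER)"
    )
    for line in lines:
        f = parse_line(line)
        if "SRC" not in f:
            continue  # not a packet log line
        dport = int(f["DPT"]) if f.get("DPT", "").isdigit() else None
        conn.execute(
            "INSERT INTO packets VALUES (?, ?, ?, ?)",
            (f["SRC"], f.get("DST"), f.get("PROTO"), dport),
        )
    conn.commit()

def top_ports(conn, limit=5):
    """Most frequently hit destination ports, busiest first."""
    return conn.execute(
        "SELECT dport, COUNT(*) AS hits FROM packets "
        "WHERE dport IS NOT NULL GROUP BY dport "
        "ORDER BY hits DESC, dport LIMIT ?",
        (limit,),
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(conn, SAMPLE)
    print(top_ports(conn))
```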

Robert

--
Bastard Operator From 193.219.28.162
Datasets by Pathwalker · 2004-07-06 08:21 · Score: 4, Interesting

The USGS has a huge database of Streamflow data online.
You can pull tables for rivers near you, and see how often they flood.

With a bit of work, you can pull all sorts of things out of the current TIGER dataset. For example, there are about 4.8 million unique street/zipcode combinations in the US.
See how many streets near where you live are unique (two streets just down the road from me, Kentvale and Uthers, appear to be unique).
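
Once street names and zipcodes have been extracted into a table, the uniqueness check is a single grouped query. The sketch below assumes a hypothetical `streets(name, zipcode)` table and invented sample rows; the TIGER files do not come in this shape, so an extraction step is needed first.

```python
import sqlite3

def load_sample(conn):
    """Create the hypothetical table and insert a few invented rows."""
    conn.execute("CREATE TABLE streets (name TEXT, zipcode TEXT)")
    conn.executemany(
        "INSERT INTO streets VALUES (?, ?)",
        [
            ("Main St", "00001"),
            ("Main St", "00002"),
            ("Elm St", "00001"),
            ("Elm St", "00003"),
            ("Quirky Ln", "00002"),
        ],
    )
    conn.commit()

def unique_street_names(conn):
    """Street names that occur in exactly one zipcode."""
    rows = conn.execute(
        "SELECT name FROM streets "
        "GROUP BY name HAVING COUNT(DISTINCT zipcode) = 1 "
        "ORDER BY name"
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_sample(conn)
    print(unique_street_names(conn))
```

With millions of rows, this is also a query where an index on `name` makes a visible difference.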

There's lots of interesting data out there. Keep poking around in .gov sites, and you'll find all sorts of stuff.
DAFIF data by orn · 2004-07-06 08:38 · Score: 3, Interesting

DAFIF data contains all sorts of aviation-related airspace and airport information. Here's a link:

https://164.214.2.62/products/digitalaero/index.cfm

Make some neat tools for that and a zillion simmers (and lots of poor pilots like me) will love you forever. Check out X-Plane while you're at it. Or even better, the open-source FlightGear could probably use your help!

NIH Human Genome by Matt+Perry · 2004-07-06 08:39 · Score: 2, Interesting

You could download the human genome database from the NCBI. All the files are here.

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
Bioinformatics by babbage · 2004-07-06 08:39 · Score: 3, Interesting

Have a look at the wonderful world of bioinformatics, where (hopefully) you should be able to find an array of academic institutions publishing their data for peer review.
To pick just the one place I'm vaguely familiar with, try Boston University's BMERC lab, which publishes both raw genomes and MySQL databases. BMERC's main genome.sql.gz file is 119,294,059 bytes (113 MB, compressed), which should be well into the "large dataset" category you're asking for :-)
There are surely many other schools publishing similar data if you poke around a bit.
Of course, at that point, you start to be bound by the problem domain. Sure, you have lots of data and that's all well & good, but what does it mean? What sorts of analysis can you usefully do on it? Without a biology background, maybe not much, but it's an interesting field and you should be able to give yourself enough of a crash course to make something useful out of it...
Have fun!

--
DO NOT LEAVE IT IS NOT REAL
Re:dmoz by vigilology · 2004-07-06 10:09 · Score: 2, Interesting

I second that. Use Catalog to pump the dmoz files into MySQL. This should give you a nice big database, well over 1GB.
Note: I think I had to use Catalog 1.01 because 1.02 didn't work.