Slashdot Mirror


Large, Free, and Interesting SQL-ready Datasets?

Jon H asks: "I'd like to teach myself various platforms or technologies that involve accessing databases. The problem is, my ideas for projects to learn on are usually boring toy projects, involving lots of tedious data entry in order to create a useful database. Things like personal library databases. This doesn't particularly interest me. It would be much easier if I had a big, interesting dataset which I could load into an SQL database without too much trouble. Then I could spend my time on the PHP, or WebObjects, or JBoss, or whatever. I'd like something more real than the usual toy demo databases. Something weighty, 20 megabytes and up, big enough for poor software design to cause performance issues which might not be seen in smaller databases. Ideally, it'd be in a form that could easily be loaded into an SQL database, perhaps even including a schema. Any links would be appreciated. Do such beasts exist?"
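
To make the "big enough for poor design to hurt" point concrete, here is a minimal sketch of the load-then-query workflow using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the `item` table, its columns, and the generated CSV text stand in for whatever real dataset and schema you end up downloading. It loads the rows, then asks SQLite for its query plan before and after adding an index, which is one way to see a full table scan turn into an index lookup.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a downloaded delimited file; a real dataset
# would be read from disk with open(path, newline="") instead.
SAMPLE_CSV = "id,name,year\n" + "".join(
    f"{i},title-{i},{1950 + i % 60}\n" for i in range(1, 5001)
)


def load_rows(conn, text):
    """Create a table (illustrative schema) and bulk-insert the CSV rows."""
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text))
    conn.executemany(
        "INSERT INTO item (id, name, year) VALUES (:id, :name, :year)", reader
    )
    conn.commit()


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
load_rows(conn, SAMPLE_CSV)
row_count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]

lookup = "SELECT id FROM item WHERE name = ?"
plan_before = query_plan(conn, lookup, ("title-42",))
conn.execute("CREATE INDEX item_name ON item (name)")
plan_after = query_plan(conn, lookup, ("title-42",))

print(row_count)
print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so treat the printed strings as a diagnostic rather than a stable format. With a dataset in the tens of megabytes, the difference between the two plans is what you would start to notice in response times.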

10 of 73 comments

  1. USDA by L. VeGas · · Score: 5, Informative

    One I've used for laughs is the USDA Nutrient Database. It gives you, well, nutrient information on just about any food you can think of. It's normalized, and just complicated enough to have fun with.

    You're going to have to google it yourself, though.

  2. NIST databases by bmwm3nut · · Score: 5, Interesting

    NIST has a bunch of data. I remember a while ago downloading handwritten characters to make handwriting recognition software. They have data for just about everything; the chemistry data is probably some of the best to put in a relational database. Check out: http://www.nist.gov/srd/index.htm

  3. Census Data by HotNeedleOfInquiry · · Score: 4, Interesting

    Here's all the data you'll ever need, free of charge from the gov. Some appears to be freely available and some is restricted. Have fun.

    ftp://ftp2.census.gov/census_2000/datasets/

    --
    "Eve of Destruction", it's not just for old hippies anymore...
  4. IMDb by br0ck · · Score: 5, Informative

    Use IMDbPY to populate a database with all data from the downloadable files from the Internet Movie Database.

  5. dmoz by heydude · · Score: 3, Informative

    Some folks have used the dmoz data. It is in RDF, so it should be flexible enough to get into most databases using most languages and an RDF library.

  6. Re:baseball stats by Gabey · · Score: 3, Interesting

    For a more developer-oriented version of this database, check out http://www.baseball-databank.org/

  7. NOAA by QuantumRiff · · Score: 3, Interesting

    Go to the NOAA Web Site and download all the weather data from your area going back many, many years. It's fascinating to take the datasets and plot the ranges in temperature, humidity, etc.

    --

    What are we going to do tonight Brain?
  8. Datasets by Pathwalker · · Score: 4, Interesting

    The USGS has a huge database of Streamflow data online.
    You can pull tables for rivers near you, and see how often they flood.

    With a bit of work, you can pull all sorts of things out of the current TIGER dataset - for example, there are about 4.8 million unique street/zipcode combinations in the US.
    See how many streets near where you live are unique (two streets just down the road from me - Kentvale and Uthers - appear to be unique).

    There's lots of interesting data out there, keep poking around in .gov sites, and you'll find all sorts of stuff.
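
As a rough sketch of the street/zipcode counting described above, the queries below run against a tiny in-memory SQLite table. The `street` table, its two columns, and every sample row are invented for illustration; the real TIGER files have their own layout, which you would need to map onto something like this when loading.

```python
import sqlite3

# Made-up (street, zip) pairs for illustration only; real rows would come
# from whatever extract of the TIGER files you load.
SAMPLE = [
    ("Main St", "00001"),
    ("Main St", "00001"),  # same street recorded twice in one zip code
    ("Main St", "00002"),
    ("Oak Ave", "00001"),
    ("Elm Ct", "00003"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street (name TEXT, zip TEXT)")
conn.executemany("INSERT INTO street VALUES (?, ?)", SAMPLE)

# Number of distinct street/zipcode combinations.
combos = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT name, zip FROM street)"
).fetchone()[0]

# Street names that occur in exactly one zip code.
unique_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM street GROUP BY name "
        "HAVING COUNT(DISTINCT zip) = 1 ORDER BY name"
    )
]

print(combos, unique_names)
```

On a table with millions of rows, the GROUP BY is the kind of query where an index on `name` (or on `name, zip`) starts to matter.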

  9. DAFIF data by orn · · Score: 3, Interesting

    DAFIF data contains all sorts of aviation related airspace and airport information. Here's a link:

    https://164.214.2.62/products/digitalaero/index.cfm

    Make some neat tools for that and a zillion simmers (and lots of poor pilots like me) will love you forever. Check out X-Plane while you're at it. Or even better, the open-source FlightGear could probably use your help!

  10. Bioinformatics by babbage · · Score: 3, Interesting

    Have a look at the wonderful world of bioinformatics, where (hopefully) you should be able to find an array of academic institutions publishing their data for peer review.

    To pick just the one place I'm vaguely familiar with, try Boston University's BMERC lab, which publishes both raw genomes and MySQL databases. BMERC's main genome.sql.gz file is 119,294,059 bytes (113 MB, compressed), which should be well into the "large dataset" category you're asking for :-)

    There are surely many other schools publishing similar data if you poke around a bit.

    Of course, at that point, you start to be bound by the problem domain. Sure, you have lots of data and that's all well & good, but what does it mean? What sorts of analysis can you usefully do on it? Without a biology background, maybe not much, but it's an interesting field and you should be able to give yourself enough of a crash course to make something useful out of it...

    Have fun!