Large, Free, and Interesting SQL-ready Datasets?

Use a random number generator by jgardn · 2004-07-06 07:37 · Score: 2, Interesting

I am sure that you will get all the data you ever need from that.

As far as real world data, you'll have to collect it yourself. Start with incoming ethernet packets, and file them away into multiple tables. That'll give you a good dataset. You may also want to try your hand at incoming email, HTTP requests, etc...

--
The radical sect of Islam would either see you dead or "reverted" to Islam.

Automate the generation by Acidic_Diarrhea · 2004-07-06 07:39 · Score: 2, Interesting

Why not just write a simple script to populate your database? You might not want pure nonsense in there so you could use a dictionary file or something like that to grab random words. You could also use a fortune file to fill your DB. The way I see your need is that you just want a large dataset to see how your application will perform - this can easily be automated and manual data entry is clearly not needed since I see no reason you would need to have 'real' data.
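A minimal sketch of such a populate script, assuming Python and its bundled sqlite3 module. The `products` table, its columns, and the short inline word list (a stand-in for a real dictionary or fortune file) are all made up for illustration:

```python
import random
import sqlite3

# Stand-in for a real dictionary file; swap in the lines of any word list.
WORDS = ["anchor", "basket", "copper", "dynamo", "ember", "falcon", "garnet", "harbor"]


def populate(conn, n_rows, seed=0):
    """Insert n_rows of random-word rows into a hypothetical products table."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, name TEXT, description TEXT, price_cents INTEGER)"
    )
    rows = (
        (
            " ".join(rng.choices(WORDS, k=2)),    # short product name
            " ".join(rng.choices(WORDS, k=10)),   # longer free-text description
            rng.randint(100, 100_000),            # price in cents
        )
        for _ in range(n_rows)
    )
    conn.executemany(
        "INSERT INTO products (name, description, price_cents) VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
populate(conn, 10_000)
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Raising `n_rows` and pointing `sqlite3.connect` at a file instead of `:memory:` is one way to grow the dataset to whatever size the test needs.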

If this isn't down to earth enough, write a simple application to fill your database with products from Ebay or Amazon - just change the product id and grab the resulting html, parsing out some identifying features and placing them in your DB.

--
I hate liberals. If you are a liberal, do not reply.

Re:Automate the generation by StandSure · 2004-07-06 07:44 · Score: 1

Why not get a cue cat and scan your books? You can run an application against Barnes and Noble or Amazon and parse out all kinds of stuff. I wrote one that even grabbed the front cover image to stuff in my database. Good Luck!
Re:Automate the generation by Anonymous Coward · 2004-07-06 09:55 · Score: 0

You wrote Book Collector? It's a pretty cool app.
Re:Automate the generation by isj · 2004-07-06 10:18 · Score: 1

The ideal would be if you could get your hands on some of the data from telcos - they store grotesque amounts of information: CDRs, invoices, intermediate processing, ... They love data. However, the chances of getting access to it are practically nil due to privacy concerns.

But it should be reasonably easy to generate using a script. Even the process of creating such a script will teach you a lot. Let's take the CDRs: about 50 columns per row (username, account, usage data, CLID, ANI, ...), 3.5 million rows per day, for 1 month: roughly 50GB. That should be enough to get you started.
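As a rough check on those figures, here is the arithmetic spelled out in Python, assuming a 30-day month and decimal gigabytes (neither is stated in the post):

```python
rows_per_day = 3_500_000      # CDRs per day, from the estimate above
days = 30                     # assumption: a 30-day month
columns = 50                  # columns per CDR row
total_bytes = 50 * 10**9      # assumption: 50 GB taken as decimal gigabytes

total_rows = rows_per_day * days            # rows in the whole month
bytes_per_row = total_bytes / total_rows    # average row size implied by 50 GB
bytes_per_column = bytes_per_row / columns  # average field size

print(f"{total_rows:,} rows, {bytes_per_row:.0f} bytes/row, {bytes_per_column:.1f} bytes/column")
```

Under those assumptions the estimate implies about 105 million rows at a little under 500 bytes each, or roughly 10 bytes per column, which seems plausible for short text and numeric fields.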

USDA by L.+VeGas · 2004-07-06 07:41 · Score: 5, Informative

One I've used for laughs is the USDA Nutrient Database. It gives you, well, nutrient information on just about any food you can think of. It's normalized, and just complicated enough to have fun with.

You're going to have to google it yourself, though.

--
Best Windows Freeware

Re:USDA by Anonymous Coward · 2004-07-06 07:44 · Score: 2, Informative

You're going to have to google it yourself, though.

...or click here.
Re:USDA by Anonymous Coward · 2004-07-06 13:36 · Score: 0

Cool! Thanks!

NIST databases by bmwm3nut · 2004-07-06 07:41 · Score: 5, Interesting

nist has a bunch of data. i remember a while ago downloading handwritten characters to make handwriting recognition software. they have data for just about everything, the chemistry data is probably some of the best to put in a relational database. check out: http://www.nist.gov/srd/index.htm

Real data hard to come by. by saden1 · 2004-07-06 07:42 · Score: 1

I'm currently generating my own test data and you'll probably have to do the same. Data is a precious commodity, and the more real it is, the less likely it is to be made public. I would be fired if I made my current test data public.

--

-----
One is born into aristocracy, but mediocrity can only be achieved through hard work.

John, post your email if you want a reply.. by Anonymous Coward · 2004-07-06 07:42 · Score: 0

.. I'm putting together some datasets you can have but I'm not willing to post my email (to avoid spam) or website (to avoid the slashdot effect).

Re:John, post your email if you want a reply.. by Doug+Merritt · 2004-07-07 10:25 · Score: 1

I'm interested, for more or less the same reason as this John guy, so do please let me know at the email address in the header above. Thanks!

--
Professional Wild-Eyed Visionary

One word... by slappy · 2004-07-06 07:44 · Score: 2, Insightful

...northwind

Re:One word... by slashjames · 2004-07-06 09:27 · Score: 1

Northwind is only 3.6 MB (in MS SQL Server 2000) or 1.7MB (in MS Access XP). He's wanting a database that's over 20MB.
Re:One word... by crisco · 2004-07-06 10:34 · Score: 1

Hasn't Northwind been superseded by Duwamish or something like that?

--
Bleh!

Census Data by HotNeedleOfInquiry · 2004-07-06 07:46 · Score: 4, Interesting

Here's all the data you'll ever need, free of charge from the gov. Some appears to be freely available and some is restricted. Have fun.

ftp://ftp2.census.gov/census_2000/datasets/

--
"Eve of Destruction", it's not just for old hippies anymore...

IMDb by br0ck · 2004-07-06 07:46 · Score: 5, Informative

Use IMDbPY to populate a database with all data from the downloadable files from the Internet Movie Database.

baseball stats by Gilk180 · 2004-07-06 07:47 · Score: 2, Interesting

Sports stats are always good.

Frankly, 20 MB is not going to give you performance issues. To realistically test the performance of your engine and your queries/schemas, you need at least enough data to fill main memory and cause disks to be used. Much more would be much better.

Re:baseball stats by Gabey · 2004-07-06 08:15 · Score: 3, Interesting

For a more developer oriented version of this database, check out http://www.baseball-databank.org/
Re:baseball stats by Anonymous Coward · 2004-07-06 10:07 · Score: 0

Actually, a more appropriate measure of "size" for such a test database would be the number of tables, number of foreign key relationships, number of indexes, number of stored procedures, and number of rows. Yes, the size in MB is useful as well, but only if combined with the above quantities.

More significantly, I think it is poor advice to suggest that a developer needs to "fill main memory and cause disks to be used" in order to learn how to write a well-performing database application. In any real-life situation, memory and disk issues are first addressed by purchasing more memory and faster disks. Only if this is not sufficient, or exorbitantly expensive, does one attempt to improve the code. Hardware is much less costly than developer time.

In addition, if the final application will run on a server machine (most database apps do), then the fact that the developer "filled main memory" or "caused disks to be used" on his/her developer machine has no real significance.

That said, if you are looking for a dataset that will help you write high-performing database code, then you really need to determine what type of database application is your target.

A dynamic website that looks up content and user preferences in a MySQL database is an entirely different beast from, say, an order management application, or a project management application. The first would indeed require a large (measured in MB) sample dataset, while the others would require their "size" to be large in terms of number of interrelated columns, triggers, etc.

For most database applications, the key performance issue is increasing "concurrency". If, however, your application is of the single-user type, then again your performance concerns will be significantly different.

Plenty of data is right under your nose. by mcgroarty · 2004-07-06 07:49 · Score: 2, Interesting

Any collection of mailing lists or newsgroups are good candidates for inclusion. You've got a wealth of data in the headers, as well as a nice free-form body and a spaghetti maze of parent and child linkage.
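A small sketch of that parent and child linkage, assuming Python with sqlite3 and a SQLite build that supports recursive common table expressions. The `messages` table and the sample Message-IDs are hypothetical; real archives would take them from the Message-ID and In-Reply-To headers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "message_id TEXT PRIMARY KEY, in_reply_to TEXT, subject TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("<1@example.org>", None, "Datasets?"),
        ("<2@example.org>", "<1@example.org>", "Re: Datasets?"),
        ("<3@example.org>", "<2@example.org>", "Re: Datasets?"),
        ("<4@example.org>", "<1@example.org>", "Re: Datasets?"),
        ("<5@example.org>", None, "Unrelated"),
    ],
)

# Walk the reply tree under one root message, tracking how deep each reply sits.
THREAD_SQL = """
WITH RECURSIVE thread(message_id, depth) AS (
    SELECT message_id, 0 FROM messages WHERE message_id = ?
    UNION ALL
    SELECT m.message_id, t.depth + 1
    FROM messages AS m JOIN thread AS t ON m.in_reply_to = t.message_id
)
SELECT message_id, depth FROM thread ORDER BY depth, message_id
"""
thread = conn.execute(THREAD_SQL, ("<1@example.org>",)).fetchall()
```

Here `thread` holds the root at depth 0, its two direct replies at depth 1, and the nested reply at depth 2; the unrelated message is left out.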

Your web logs or even your system logs are good candidates as well, as are the package description and dependency databases for any given Linux distribution, and the bug reports for same. One cool project might be to load the Debian, RedHat, etc dependency databases and merge them together and report the differences. That's a good-sized project with the potential to benefit the free software community.

You owe the oracle * FROM wallet WHERE denomination > 20;

SPAM by pgaffney · 2004-07-06 07:50 · Score: 1

Do you have a spam folder? Save all your spam emails with headers intact. Also export your email address list into a text table. Write perl scripts to parse emails and check for whatever variables you are interested in; what type / size attachment, is the sender in your addr book, does the sender addr appear to be valid or not. You could even try to combine this with your web browser history. The key is that while you still need to build the DB yourself, there's a lot of interesting information you can easily extract with simple scripts and minimal actual work.
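The post suggests Perl; the same idea is sketched here in Python with the standard `email` package instead. The sample message, the `ADDRESS_BOOK` set, and the `summarize` helper are invented for illustration:

```python
from email import message_from_string
from email.utils import parseaddr

# A made-up spam message with one text part and one attachment.
RAW = """\
From: "Cheap Meds" <deals@example.com>
To: me@example.org
Subject: Limited offer
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Buy now.
--XYZ
Content-Type: application/zip
Content-Disposition: attachment; filename="offer.zip"
Content-Transfer-Encoding: base64

UEsDBAoAAAAAAA==
--XYZ--
"""

ADDRESS_BOOK = {"friend@example.org"}  # hypothetical exported address list


def summarize(raw, address_book):
    """Pull out the fields the post mentions: sender, address-book hit, attachments."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    attachments = [
        (
            part.get_filename(),
            part.get_content_type(),
            len(part.get_payload(decode=True) or b""),  # decoded size in bytes
        )
        for part in msg.walk()
        if part.get_content_disposition() == "attachment"
    ]
    return {
        "sender": sender,
        "subject": msg.get("Subject", ""),
        "known_sender": sender in address_book,
        "attachments": attachments,
    }


summary = summarize(RAW, ADDRESS_BOOK)
```

Each `summary` dict maps naturally onto one row in a messages table plus zero or more rows in an attachments table.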

Ta!

-petertgaffney

dmoz by heydude · 2004-07-06 07:51 · Score: 3, Informative

Some folks have used the dmoz data. It is in RDF, so should be fairly flexible enough to get into most databases using most languages and an RDF library.

Re:dmoz by vigilology · 2004-07-06 10:09 · Score: 2, Interesting

I second that. Use Catalog to pump the dmoz files into MySQL. This should give you a nice big database, well over 1GB.
Note: I think I had to use Catalog 1.01 because 1.02 didn't work.

Internet Movie Database & Yahoo Stock Data by fear025 · 2004-07-06 07:59 · Score: 1

If you want Movie information, you can grab the data files from the Internet Movie Database.
http://www.imdb.com/interfaces

My father has a large VHS collection of movies whose info he keeps updated in a spreadsheet. It was a lot of fun to link his database to the IMDB one, and start searching for movies that included my favorite actors (and were already in the house).

Another source of data that I use from the internet is daily stock data. If you go to this page Yahoo S&P 500 Info and click Historical Prices you will get a page that lists the historical data in a browser-friendly format. At the bottom of this page is an option to download the data in a comma separated value (CSV) file.

Re:Internet Movie Database & Yahoo Stock Data by Deagol · 2004-07-06 08:32 · Score: 1

If you want Movie information, you can grab the data files from the Internet Movie Database.
Damn, that's cool!
I kinda held a grudge for the IMDB folks for a while, since the data used to be a collection of text files distributed on USENET (eventually binary DB files, along with a simple query tool, distributed via ftp). I contributed a fair bit of stuff back in those days.
I thought that once they went "commercial" on the web, years ago, that it meant the end of all that community-created data. I'm pleased to see that's not the case (though I haven't read their license for the data -- it may be draconian for all I know).
Similarly, one could play around with the CDDB/freedb data sets. I'll kill for the raw data behind the allmusic.com site. :)

--
Method of processing duck feet

Here's a possible source by jbarr · 2004-07-06 08:02 · Score: 1

I hear the RIAA has some pretty interesting databases. Obtaining them might be challenging, though...

--
My mom always said, "Jim, you're 1 in a million." Given the current population, there are 7000 of me. God help us all!

Machine Learning Databases by Tozog · 2004-07-06 08:02 · Score: 2, Interesting

You could probably write a script to use the data from the machine learning database collection from UCI.

Some are large, some are interesting, some are simple, but plenty of data.

http://www.ics.uci.edu/~mlearn/MLSummary.html

Re:Machine Learning Databases by "Zow" · 2004-07-07 02:21 · Score: 1
A similar set that I didn't see anyone mention was the KDD (Knowledge Discovery in Databases) Cup datasets. Every year for the ACM SIGKDD conference they make a dataset available and see who in the KDD community can do the best job of mining it (for instance, do the best classification on the test data). There isn't a central repository as different organizations host the challenge every year, but a quick Google search for "KDD Cup" will give you the list. The last four years had data sets such as:
- "clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000"
- "data from genomics and drug design"
- "data mining in molecular biology domains"
- "network mining and the analysis of usage logs"
-"Zow"

If you're gonna get snarky... by Deagol · 2004-07-06 08:03 · Score: 2

At least provide a link to the current dataset: Release 16-1. (Well... the link itself is generic and not tied to a specific version.)

:)

--
Method of processing duck feet

Re:If you're gonna get snarky... by AhBeeDoi · 2004-07-06 12:37 · Score: 1

Egads, man, the mdb file is 68MB uncompressed. Surely, it's enough to satisfy the database starved ones.

Spec Int by mnmn · 2004-07-06 08:04 · Score: 1

Try www.spec.org, you'll find big HTML tables or CSV files for the specs of various computers and CPUs. This is also pretty useful information, say if you wanna compare a Pentium 1's floating point performance with a SPARCstation's, or if you're looking on eBay to buy a cheap 64-bit CPU machine that can do a decent job as a firewall.

Newer datasets of the specint can help you decide between a Pentium4 CPU and Athlon machines, and their chipsets. So pretty cool and useful information there.

Also heard of telemetry information on the heights of various spots of ground above sea level against GPS coordinates.... with that data you can build a 3D world that's realistic. Couldn't find the dataset though.

--
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky

RSS/RDF Feeds by HaloZero · 2004-07-06 08:07 · Score: 1

...are ideal. Come in a standard XML format, which is easily mutable using perl. While I was learning how to clunk about in databases, I had a simple script which made a call to several sites with RDF/RSS, grab a copy of their feed, and check to see if an entry exists in the DB. You can even learn to account for repeat results, and update existing entries. Watch out, though; making too many calls to a website's feed can make the webmastership unhappy with you. My script ran nightly, at 2 AM EST, and grabbed the news from about a dozen sites (slashdot included).
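The original script was Perl; a comparable sketch in Python is shown here, using only the standard library. It assumes plain, un-namespaced RSS 2.0 (RDF-style RSS 1.0 feeds use XML namespaces and would need adjusted tag names), and the `entries` table and sample feed are made up. Fetching is left out so the example makes no network calls:

```python
import sqlite3
import xml.etree.ElementTree as ET

# A made-up feed standing in for whatever a site actually serves.
FEED = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>First story</title><link>http://example.org/1</link></item>
  <item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""


def store_feed(conn, xml_text):
    """Insert new items and overwrite existing ones, keyed on the item link."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (link TEXT PRIMARY KEY, title TEXT)"
    )
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        conn.execute(
            "INSERT OR REPLACE INTO entries (link, title) VALUES (?, ?)",
            (item.findtext("link"), item.findtext("title")),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_feed(conn, FEED)
# A later run sees the same links again, one with a changed title.
store_feed(conn, FEED.replace("First story", "First story (updated)"))
titles = dict(conn.execute("SELECT link, title FROM entries"))
```

Because the link is the primary key, the second run updates the existing row rather than creating a duplicate, which covers both the "does an entry exist" check and the update case.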

If you want to get REAL creative, you could hack up a little applescript that reads entries from the DB (through perl), and reads you the news. That was neat to do with my PowerBook, as I could get an overview of the day before on my drive to work.

YMMV.

--
Informatus Technologicus

Audioscrobbler by WiKKeSH · 2004-07-06 08:08 · Score: 1

The audioscrobbler database is available under a creative commons license.
http://www.audioscrobbler.com

Audioscrobbler is a computer system that builds up a detailed profile of your musical taste. After installing an Audioscrobbler Plugin, your computer sends the name of every song you play to the Audioscrobbler Server. With this information, the Audioscrobbler server builds you a 'Musical Profile'. Statistics from your Musical Profile are shown on your Audioscrobbler User Page, available for everyone to view.

--
Downmix - The Artscene News Source!

Stock Performance by skreuzer · 2004-07-06 08:12 · Score: 1

Last semester, my final was to develop a database, and write a front end client to interact with the database.

I didn't feel like entering in all the data by hand so I compiled a database of various stocks, and how they have performed over a few months.
All of my data was obtained through finance.yahoo.com, and they allow you to download historical data for numerous stocks and they provide it to you as a comma separated file.
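A sketch of loading such a CSV file into a table, assuming Python with the standard csv and sqlite3 modules. The column headers, the `XMPL` ticker, and the prices below are invented sample data, not the actual layout of the downloaded files:

```python
import csv
import io
import sqlite3

# Invented sample rows; a real download would be read from a file instead.
SAMPLE = """Date,Open,High,Low,Close,Volume
2004-07-01,10.00,10.50,9.80,10.20,120000
2004-07-02,10.20,10.40,10.00,10.10,90000
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices ("
    "symbol TEXT, day TEXT, open REAL, high REAL, low REAL, close REAL, "
    "volume INTEGER, PRIMARY KEY (symbol, day))"
)


def load_prices(conn, symbol, csv_text):
    """Parse one symbol's CSV text and insert a row per trading day."""
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            (symbol, r["Date"], float(r["Open"]), float(r["High"]),
             float(r["Low"]), float(r["Close"]), int(r["Volume"]))
            for r in reader
        ),
    )
    conn.commit()


load_prices(conn, "XMPL", SAMPLE)
avg_close = conn.execute(
    "SELECT AVG(close) FROM prices WHERE symbol = ?", ("XMPL",)
).fetchone()[0]
```

Repeating `load_prices` for each downloaded symbol builds up a table that is easy to query for averages, ranges, or day-over-day changes.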

NOAA by QuantumRiff · 2004-07-06 08:15 · Score: 3, Interesting

Go to the NOAA Web Site and download all the weather data from your area going back many, many years.. it's fascinating to take the datasets and plot the ranges in temperature, humidity, etc..

--

What are we going to do tonight Brain?

Firewall output by Gadzinka · 2004-07-06 08:19 · Score: 2, Interesting

Catch all ``unusual'' packets on your firewall and log them. Lots of data and interesting things to do in order to find patterns in this apparent chaos.

I use iptables for this, but I'm sure you can do this with all the rest. You could even (as an exercise) try to log it directly to a database. I just occasionally scan logs left by syslog-ng.
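One way to approach that exercise, sketched in Python. It assumes the usual KEY=value layout that the iptables LOG target writes through the kernel log (SRC=, DST=, PROTO=, DPT= and so on); the sample lines, addresses, and the `packets` table are made up:

```python
import re
import sqlite3

# Made-up syslog lines in the KEY=value style of the iptables LOG target.
LOG_LINES = [
    "Jul  6 08:19:01 gw kernel: IN=eth0 OUT= MAC=00:11:22:33:44:55:66:77:88:99:aa:bb:08:00 "
    "SRC=192.0.2.10 DST=198.51.100.5 LEN=60 TTL=52 PROTO=TCP SPT=40000 DPT=22 SYN",
    "Jul  6 08:19:02 gw kernel: IN=eth0 OUT= SRC=192.0.2.11 DST=198.51.100.5 "
    "LEN=60 TTL=50 PROTO=TCP SPT=40001 DPT=22 SYN",
    "Jul  6 08:19:03 gw kernel: IN=eth0 OUT= SRC=192.0.2.12 DST=198.51.100.5 "
    "LEN=48 TTL=110 PROTO=TCP SPT=40002 DPT=445 SYN",
    "Jul  6 08:19:04 gw sshd[99]: unrelated syslog line",
]

FIELD_RE = re.compile(r"\b([A-Z]+)=(\S*)")


def parse_line(line):
    """Return (src, dst, proto, dport) for a firewall line, or None otherwise."""
    fields = dict(FIELD_RE.findall(line))
    if "SRC" not in fields:
        return None  # not a packet log line
    dport = int(fields["DPT"]) if "DPT" in fields else None  # ICMP has no port
    return fields["SRC"], fields.get("DST"), fields.get("PROTO"), dport


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packets (src TEXT, dst TEXT, proto TEXT, dport INTEGER)")
parsed = [p for p in map(parse_line, LOG_LINES) if p is not None]
conn.executemany("INSERT INTO packets VALUES (?, ?, ?, ?)", parsed)

# Which destination ports get probed most often?
by_port = conn.execute(
    "SELECT dport, COUNT(*) FROM packets GROUP BY dport ORDER BY COUNT(*) DESC, dport"
).fetchall()
```

From there, grouping by source address or by hour is a short step toward spotting scans and other patterns.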

Robert

--
Bastard Operator From 193.219.28.162

here's some genomic data by Glog · 2004-07-06 08:20 · Score: 1

You can always use genomic data - there's plenty of it to go around for everyone. Following is a link to some downloads for mySQL: http://www.ensembl.org/Download/

Datasets by Pathwalker · 2004-07-06 08:21 · Score: 4, Interesting

The USGS has a huge database of Streamflow data online.
You can pull tables for rivers near you, and see how often they flood.

With a bit of work, you can pull all sorts of things out of the current tiger dataset - for example, there are about 4.8 million unique street/zipcode combinations in the US.
See how many streets near where you live are unique ( two streets just down the road from me - Kentvale and Uthers - appear to be unique).
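Once the street/zipcode pairs are in a table, the uniqueness check is a single GROUP BY. A sketch in Python with sqlite3, where the `streets` table is hypothetical and the zip codes are placeholders, not real locations; extracting the pairs from the TIGER files is not shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (name TEXT, zip TEXT, UNIQUE (name, zip))")
conn.executemany(
    "INSERT INTO streets VALUES (?, ?)",
    [
        ("Main", "10001"), ("Main", "60601"),          # common names repeat
        ("Washington", "10001"), ("Washington", "94105"),
        ("Kentvale", "00001"), ("Uthers", "00001"),    # placeholder zip codes
    ],
)

# Street names that occur in exactly one street/zipcode combination.
unique_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM streets GROUP BY name HAVING COUNT(*) = 1 ORDER BY name"
    )
]
```

Swapping `HAVING COUNT(*) = 1` for `ORDER BY COUNT(*) DESC` gives the opposite view: the most widely repeated street names.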

There's lots of interesting data out there, keep poking around in .gov sites, and you'll find all sorts of stuff.

Re:Datasets by /dev/trash · 2004-07-06 12:06 · Score: 1

Unique how? I know in my state, all street names in the COUNTY have to be unique for 911 purposes.
Re:Datasets by Pathwalker · 2004-07-07 00:10 · Score: 1
Unique in the entire US.

Lots of street names show up hundreds, or thousands of times, all over the US.
- Main shows up just under 10,000 times
- Washington shows up just under 6000 times
However there are quite a few which only show up once ( Analog for example), or all of the streets with that name are concentrated in a small area (take hells gate as an example - they're all near each other in Texas).
Re:Datasets by /dev/trash · 2004-07-07 04:20 · Score: 1

Ohhhh oops.

Some of the fun ones have already been posted. by Deagol · 2004-07-06 08:24 · Score: 2, Informative

The US Census has tons of great info, as does the USDA Nutrition Database. Mortality stats gathered by the CDC are fun in a morbid kind of way. :)

There are some great collections of historical climate data out there for free. Here's a source for the Western US (a similar compilation for the entire US would be great). Some earthquake data can be found here.

Heck, just enter "raw data" into google, along with your topic of choice, and have fun.

--
Method of processing duck feet

How about the open directory project rdf dump? by DeadSea · 2004-07-06 08:33 · Score: 1

How about importing the Open Directory Project? It has an RDF (xml) dump of its current data. It is a couple hundred megs compressed and a couple gigs uncompressed.

There are numerous utilities to put it into a database for you.

SQL Server? by webmaestro · 2004-07-06 08:35 · Score: 1

If you're using MS SQL Server you can get the BigPubs2000 database. If you've ever used SQL Server then you probably know what the Pubs database is, and this is a really big version of it. It's about 200 MB. You can google to find a download site for it.

DAFIF data by orn · 2004-07-06 08:38 · Score: 3, Interesting

DAFIF data contains all sorts of aviation related airspace and airport information. Here's a link:

https://164.214.2.62/products/digitalaero/index.cfm

Make some neat tools for that and a zillion simmers (and lots of poor pilots like me) will love you forever. Check out X-Plane while you're at it. Or even better, the open-source Flight Gear could probably use your help!

--
1. 2.

NIH Human Genome by Matt+Perry · 2004-07-06 08:39 · Score: 2, Interesting

You could download the human genome database from the NCBI. All the files are here.

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.

Re:NIH Human Genome by M1FCJ · 2004-07-06 09:36 · Score: 1

Coool! I'm downloading the XML format of the human genome now, then I can write a couple of java apps to browse through the genome. Really coool!

Bioinformatics by babbage · 2004-07-06 08:39 · Score: 3, Interesting

Have a look at the wonderful world of bioinformatics, where (hopefully) you should be able to find an array of academic institutions publishing their data for peer review.

To pick just the one place I'm vaguely familiar with, try Boston University's BMERC lab, which publishes both raw genomes and MySQL databases. BMERC's main genome.sql.gz file is 119,294,059 bytes (113 mb, compressed), which should be well into the "large dataset" category you're asking for :-)

There are surely many other schools publishing similar data if you poke around a bit.

Of course, at that point, you start to be bound by the problem domain. Sure, you have lots of data and that's all well & good, but what does it mean? What sorts of analysis can you usefully do on it? Without a biology background, maybe not much, but it's an interesting field and you should be able to give yourself enough of a crash course to make something useful out of it...

Have fun!

--
DO NOT LEAVE IT IS NOT REAL

IP - DNS - Geographic Location dataset . . . by millisa · 2004-07-06 08:44 · Score: 1

A problem I enjoy pulling out to play with now and then is doing a usable data set that can do reverse IP lookup to geographic location . . . there are services that offer it, but there isn't any reason you couldn't write stuff to do this yourself.

Big data sets with a lot of the information being available through publicly queryable sources.

There are lots of approaches to mapping it all out (start with class A, then to B, and onward or you can start with specific ranges . . .). Between whois info, dns, and trace routes alone you can have some fun doing the mappings . . .and in the end, you have a useful and marketable dataset that you created . . . IP -> Geography is valuable. Once you've got the data though, there's even more . . .how do you manage data sets that large? You can't just throw every IP into a single table . . .not unless you are the index king . . .
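One common alternative to a row per address is a row per range, with the endpoints stored as integers. A sketch in Python using the standard ipaddress and sqlite3 modules; the `ip_ranges` table, the place names, and the assumption that ranges never overlap are all illustrative:

```python
import ipaddress
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ip_ranges ("
    "start_ip INTEGER PRIMARY KEY, end_ip INTEGER NOT NULL, location TEXT NOT NULL)"
)


def add_range(conn, cidr, location):
    """Store one network as an integer [start, end] range."""
    net = ipaddress.ip_network(cidr)
    conn.execute(
        "INSERT INTO ip_ranges VALUES (?, ?, ?)",
        (int(net.network_address), int(net.broadcast_address), location),
    )


def locate(conn, ip):
    """Find the range containing ip, assuming stored ranges do not overlap."""
    n = int(ipaddress.ip_address(ip))
    # The primary-key index finds the closest range starting at or below n.
    row = conn.execute(
        "SELECT end_ip, location FROM ip_ranges "
        "WHERE start_ip <= ? ORDER BY start_ip DESC LIMIT 1",
        (n,),
    ).fetchone()
    if row is not None and n <= row[0]:
        return row[1]
    return None  # address falls in a gap between known ranges


add_range(conn, "192.0.2.0/24", "Testville")      # documentation-only networks
add_range(conn, "198.51.100.0/24", "Exampleton")  # with invented place names
inside = locate(conn, "192.0.2.77")
outside = locate(conn, "203.0.113.9")
```

With this layout the table grows with the number of allocations rather than the number of addresses, and each lookup is a single index probe.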

I don't know of a ready made data set of this nature, but putting the data together into the dataset and loading it is gonna teach you as much about the languages you are using to load it as it will about the data and managing it via sql services. . .

"interesting data set" . . .I am so amused.

Northwind Database. by Randolpho · 2004-07-06 08:44 · Score: 2, Informative

30 posts and nobody mentions the Northwind Database that comes with MS Access? You can download it and set it up as an ODBC source, which you can use with pretty much anything. Well, provided you're using Windows. ;)

Northwind Access 2000 Download page

--
"Times have not become more violent. They have just become more televised."
-Marilyn Manson

Wikipedia by Captain+Nitpick · 2004-07-06 08:48 · Score: 1

A dump of the Wikipedia database is available. It's certainly big, and the content is interesting, although the structure isn't.

--
But then again, I could be wrong.

Do what I did-hack porn databases by Anonymous Coward · 2004-07-06 08:54 · Score: 0

Seriously. you can search for some online order entry terms, then if you know the default web pages you can get the location of their database. then DL it. No joke. Learned that a guy 4 blocks from me orders some of the most disgusting stuff. shudder. About 1 in 12 of the ones I tested had this thing wide open. I e-mailed a few of them warning them....far as I know they did not make any changes. Go figure.

easy by Jonny+290 · 2004-07-06 08:57 · Score: 2, Funny

step 1: install windows xp on PC.
step 2: turn on application crash logging.
step 3: give to mother
step 4: ??? (aka wait 3 months)
step 5: collect 3gb dataset.

--
Hey Taco! Looks like you're using the "infinite monkeys and typewriters" scheme to generate Ask Slashdots again...

Here you go. by SiMac · 2004-07-06 10:16 · Score: 1

You can download the entire Wikipedia database. It weighs in at 15GB if you include all revisions, or 600MB with just the newest copy of everything. Have fun.

astronomical surveys by sdedeo · 2004-07-06 10:32 · Score: 1

There are plenty of astronomical surveys that are in the public domain. For example, the Sloan Digital Sky Survey (google "SDSS SQL") has its public release in SQL format. I am not sure how they bundle it for download and offline access -- the full set is in the TB range, so you would have to make some serious cuts.

Look around for an astronomical survey that interests you. Learn a little of the science so that you can think of interesting things to play with (or just read some of the associated papers to repeat their results.) Keep an eye on the science press to hear about new (and old) surveys, and then look up their homepages to see if they have released the data to the public. Oftentimes, data is released with a paper, so look on xxx.lanl.gov as well.

If you come up with interesting educational uses and a nice interface, consider putting it up on the web and dropping a line to the coordinator. Most surveys don't have dedicated public education staff. (On the other hand, most surveys are very busy, so don't expect a rapid response.)

Important: some of the sets are huge. Be conservative with the bandwidth. Don't download gigs and gigs over and over again, and if you have a choice, pick "off peak" times to grab a set. Also, the surveys usually have their hands full already, so don't expect to get any help with untangling the data -- if a set is not making sense or is poorly documented, your only real option is to move on to another one.

--
Protect your liberties. Donate to the ACLU

Zen-Cart by Tobias+Luetke · 2004-07-06 12:02 · Score: 1

My recommendation is to install Zen-Cart on the system.

Zen is a full featured e-commerce solution and is opensource and written in php. When you install it you can choose to have it populate a demo database which has quite a bit of stuff in it. (100s of products).

I don't think it's the pinnacle of database engineering but if all you are looking for is mock data there you go.

FCC License databases by dfranks · 2004-07-06 12:17 · Score: 1

I use the FCC License databases for most of my small to medium database testing. They can be downloaded at FCC Universal Licensing System, and are in BCP format (for SQL Server, easy to import into anything else). Layout files and schema create scripts are also available for download.

There are a few related tables available for download, but I mostly use it for Name/Address test data.

Dean

freedb.org by Darth+Yoshi · 2004-07-06 12:59 · Score: 1

The freedb.org database is available for download.

--
// TODO: fix sig

Re:freedb.org by Anonymous Coward · 2004-07-07 10:28 · Score: 0

I'd like to throw out a second vote for freedb. There's something like 1.7 million rows that result from the postgre or mysql import scripts.

GIS Data Depot by Dylbert · 2004-07-06 13:09 · Score: 1

I do a little work with environmental data, and in my travels I found this site. It has a huge amount of GIS data collected from various sources for the purposes of mapping various parameters, such as political boundaries, vertical relief, drainage networks, rail/road/piping networks, etc. The only downside is that the data is aging, and the free stuff only goes down to 1:1000000, but it's a neat repository if you need a rough map drawn.

Have fun.

--
I swear, if I see another Slashdot comment with "It will be interesting to see"...

IRC by norculf · 2004-07-06 13:26 · Score: 1

Build a log bot, run it in a few large channels on Freenode.

Thanks! A veritable cornucopia by Anonymous Coward · 2004-07-06 14:01 · Score: 0

I'm sure I'll find plenty of stuff to work with.
As a public service, here's a summary of the links I found most interesting:

Food: http://www.nal.usda.gov/fnic/foodcomp/Data/SR14/dnload/sr14dnld.html

National Inst. of Standards & Tech: http://www.nist.gov/srd/index.htm

Census data: ftp://ftp2.census.gov/census_2000/datasets/

Open Directory: http://search.cpan.org/~ldachary/Catalog-1.02/ and http://rdf.dmoz.org/ and http://amix.dk/codecrib/dmozparser.php

Machine Learning: http://www.ics.uci.edu/~mlearn/MLSummary.html

Music Preferences: http://www.audioscrobbler.com/

Weather: http://www.noaa.gov/ and http://www.wrcc.dri.edu/climsum.html

Genomics: http://www.ensembl.org/Download/

Stream flow: http://waterdata.usgs.gov/nwis/rt

Earthquakes: http://www.ngdc.noaa.gov/seg/hazard/earthqk.shtml

Aviation: https://164.214.2.62/products/digitalaero/index.cfm

Human Genome: ftp://ftp.ncbi.nih.gov/snp/

Boston U bioinformatics/genome: ftp://mcclintock.bu.edu/BMERC/mysql/

Wikipedia: http://download.wikimedia.org/

FCC Licensing: http://wireless.fcc.gov/cgi-bin/wtb-datadump.pl

Freedb: http://www.freedb.org/

GIS: http://data.geocomm.com/

one more by bmac · 2004-07-06 15:23 · Score: 1

Check out the Nat'l Imaging and Mapping
Agency (NIMA)'s database of placenames:

http://earth-info.nga.mil/gns/html/cntry_files.html

This is the db (200M compressed, 700M un) of
foreign placenames. I actually used the US
placenames file that NIMA has (don't have the
URL, but it should be free for US residents)
for zips, counties and cities of the US. Each
"placename" has the latitude and longitude as
well.

Note that db's like these need to be scrubbed
and massaged before they can be properly
read into a relational db. That's where the
true expertise comes in; that's also where
perl shines as a language. Once you get the
data clean and in, everything should be easy
after that.

Note that I had a paying gig that worked with
this NIMA data as the basis for a worldwide
db for locating the nearest service center to
a user's location. Learning to be able to
compute distance from two lat/long coord
pairs is an interesting and real-world
exercise.
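A sketch of that distance calculation in Python, using the haversine formula on a spherical Earth. This is one standard approach, not necessarily what the paying gig used; the mean radius of 6371 km and the `nearest_center` helper with its sample centers are assumptions for illustration:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the sphere model is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_center(user, centers):
    """Return the name of the center closest to the user's (lat, lon)."""
    return min(centers, key=lambda name: haversine_km(*user, *centers[name]))


# Hypothetical service centers on the equator, 90 degrees of longitude apart.
CENTERS = {"center_a": (0.0, 0.0), "center_b": (0.0, 90.0)}
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
closest = nearest_center((0.0, 10.0), CENTERS)
```

The two sample centers are a quarter of the equator apart, so `quarter_equator` comes out at about 10,000 km, which makes a handy sanity check.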

Good luck,
bmac

Re:one more by Anonymous Coward · 2004-07-06 20:56 · Score: 0

It is not necessary to hit the "Enter" key at the end of each line. The text box will wrap the text automatically for you. People who want to read short lines will resize their browsers accordingly, in order to read other posts here that are not scrunched up vertically like yours. I stopped reading your post halfway through because the short lines were so distracting. Please stop doing this. Thank you.

I hear . . . by acceleriter · 2004-07-06 15:58 · Score: 2, Funny

. . . the U.S. Department of Justice has a foreign lobbyist database that should be big enough to test with.

--

CEE5210S The signal SIGHUP was received.

Or a non-random name generator by titaniam · 2004-07-06 16:06 · Score: 2, Informative

Want a whole bunch of (most of the) registered domain names in the world? You'll need to fill out some forms and wait maybe a week (except edu), but it's worth it. Click for biz, edu, int, info, org, com, net. These files are whoppers for the most part. The com file is so big that Perl would not read it under Red Hat 6. I use them for my surf engine, iconsurf.com.

how about using a benchmark ? by perlchild · 2004-07-06 18:46 · Score: 1

Isn't the TPC-C or some similar sql-based (like crashme) database benchmark closer to what you need?

DBMonster by Anonymous Coward · 2004-07-06 20:10 · Score: 0

Ever tried DBMonster? It's a Java app - you tell it your schema, how many rows you want, maybe something to seed the randomness, then off it goes and fills your database with dummy data.

You can go as far as integrating with DBUnit for ant/maven build tasks too...

I think it's a SourceForge project, but Google is your friend.

Please learn how to make links. by Anonymous Coward · 2004-07-06 20:49 · Score: 0

Please learn how to make links.

<a href="http://www.nist.gov/srd/index.htm">NIST databases</a>

yields: NIST databases

Wikipedia by andyr · 2004-07-07 00:15 · Score: 1

Wikipedia has weekly MySQL database dumps of their content.

~~~~

--
Andy Rabagliati

get the do not call list by WhiteDragon · 2004-07-07 05:16 · Score: 1

I think that the do not call list would be a great data set with a bunch of records and non-random data.

--
Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?

Amazon.com is offering webservices XML data by Anonymous Coward · 2004-07-07 17:55 · Score: 0

http://www.amazon.com/gp/browse.html/002-9937486-2304013?node=3427431

I had been parsing Amazon webpages until I found this link. Just pull all the data on 10000 books or something. This is better than boring data that MIGHT have a commercial purpose.

Slashdot Mirror

73 comments