Library of Congress Offers Update On Huge Twitter Archive Project
Nerval's Lobster writes "Back in April 2010, the Library of Congress agreed to archive four years' worth of public Tweets. Even by the standards of the nation's most famous research library, the goal was an ambitious one. The librarians needed to build a sustainable system for receiving and preserving an enormous number of Tweets, then organize that dataset by date. At the time, Twitter also agreed to provide future public Tweets to the Library under the same terms, meaning any system would need the ability to scale up to epic size. The resulting archive is around 300 TB in size. But there's still a huge challenge: the Library needs to make that huge dataset accessible to researchers in a way they can actually use. Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute, which limits researchers' ability to do work in a timely way."
Why does the federal government need to archive the useless information Twitter calls tweets .. yet another huge waste of my money (being a taxpayer and all)
Is the limitation hardware or software? Where is the bottleneck?
Just give me a csv.
provide a limited version of the database with only some information from the tweets, so there's less data to search through? (of course, keep the full data in case a search depends on it)
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics notwithstanding.
Just buy batches of 300 of those 1TB flash drives in the article below and pass them out to the researchers as needed?
Garbage in, garbage out
You had better turn on indexing.
RedShift is perfect for this:
http://aws.amazon.com/redshift/
Archives trash.
Really, why not record and archive random traffic sounds? Some day when everyone is flitting about in whisper quiet air cars they'll marvel at the cacophony of the present age. Gadzooks!
A feeling of having made the same mistake before: Deja Foobar
Why not index the collection using a Sphinx cluster (http://sphinxsearch.com)? Using the proper indexes and transformations this should net them a pretty good system. They wouldn't even need to keep the raw information online as the necessary references to the raw data could be made.
Let Google index them.
Classical books, works of art, grand inventions that changed the world...and we chose to archive people pissing about on a Friday night. Good Job, America. You've shown the world where your priorities lie.
So, just how many 'Libraries of Congress' are there in 300TB? ;-)
Does this mean that as the archives swell, the metric does also?
Where does this madness end?
Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
Twitter is the scourge of the internet
1 Library of Congress ~ 10TB of data
Therefore, the database will be around 30 LoCs in size.
But, if we consider this database as part of the Library of Congress, we get a fixpoint problem..
"The Library's mission is to support the Congress in fulfilling its constitutional duties and to further the progress of knowledge and creativity for the benefit of the American people." (from its website.) No, I don't see how archiving Twits and tweets furthers this mission *at all.*
... oh, wait, sorry. That's in the FBI's mission statement.
It's not much of a step from there to archiving all the phone conversations of all Americans
300TB worth of tweets, which are basically very small text files? A single tweet that uses all available characters should only be 140 bytes. I just refuse to believe that there are 2+ trillion tweets out there to make up 280+TB, considering 1 billion tweets would be 140GB. (Unless I'm failing massively at math here, which is quite possible.)
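The arithmetic above holds if you count raw text only. A minimal Python sketch, assuming 140 one-byte characters per tweet and decimal units (1 TB = 10^12 bytes); the function names are made up for illustration:

```python
# Back-of-the-envelope check, counting tweet text only (no metadata).
BYTES_PER_TWEET = 140  # assumption: 140 characters, one byte each


def tweets_needed(archive_bytes, bytes_per_tweet=BYTES_PER_TWEET):
    """How many full-length tweets it takes to fill archive_bytes."""
    return archive_bytes // bytes_per_tweet


def size_in_gb(n_tweets, bytes_per_tweet=BYTES_PER_TWEET):
    """Total size of n_tweets full-length tweets, in decimal gigabytes."""
    return n_tweets * bytes_per_tweet / 10**9


print(tweets_needed(280 * 10**12))  # tweets needed to fill 280 TB
print(size_in_gb(10**9))            # size of one billion tweets, in GB
```

So the math is fine; what it leaves out is any per-tweet metadata, which would change the bytes-per-tweet figure.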
Look, I don't know about you, but we process hundreds of TB of data when we process genomes, using this fancy stuff called "databases", "hash indexing", and fancy software that may be hard for you to find like Perl, C, and various scripting languages.
It's fairly simple coding. Just build an index hash from keywords (which are all preceded by #), add another index by words (ignoring all the bit.ly and other web links), add a third index by @ reference (aka user names, which are really just a 20 character part of an SMS message), and go to town.
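The three indexes described above could look roughly like this in Python (Perl would work just as well). This is only a sketch under assumptions: tweets arrive as (id, text) pairs, links start with http:// or https://, and every name here is invented for illustration. It says nothing about what the Library actually runs.

```python
import re
from collections import defaultdict

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
WORD = re.compile(r"[a-z0-9']+")


def build_indexes(tweets):
    """tweets: iterable of (tweet_id, text) pairs.

    Returns three dicts (by_tag, by_word, by_user), each mapping a
    lowercase key to the set of tweet ids that contain it.
    """
    by_tag = defaultdict(set)
    by_word = defaultdict(set)
    by_user = defaultdict(set)
    for tweet_id, text in tweets:
        # Drop links first so they never pollute any of the indexes.
        no_links = URL.sub(" ", text.lower())
        for tag in HASHTAG.findall(no_links):
            by_tag[tag].add(tweet_id)
        for user in MENTION.findall(no_links):
            by_user[user].add(tweet_id)
        # Plain words are whatever is left after removing tags and mentions.
        plain = MENTION.sub(" ", HASHTAG.sub(" ", no_links))
        for word in WORD.findall(plain):
            by_word[word].add(tweet_id)
    return by_tag, by_word, by_user


sample = [
    (1, "Snow day! #snow http://bit.ly/abc"),
    (2, "@alice did you see the #snow?"),
    (3, "Touchdown! @alice @bob"),
]
by_tag, by_word, by_user = build_indexes(sample)
print(sorted(by_tag["snow"]))    # tweets tagged #snow
print(sorted(by_user["alice"]))  # tweets mentioning @alice
print(sorted(by_word["snow"]))   # tweets using "snow" as a plain word
```

In-memory dicts obviously won't hold hundreds of TB; at that scale the same three mappings would have to live on disk or be spread across machines, which is where the real work is.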
We do it every day.
Now, you've got a few extra complexities, we tend to use GACT and similar short codes, but we also have to add skips, nulls, misreads, ambiguities, so it's usually 8 symbols and you're looking at an extended ASCII power.
Still, you're getting obsessed with the size (which is nothing compared to a genome, and we have drives much much bigger than that).
Just do it and stop thinking it's "hard". It isn't. Buy a decent Perl book for Biochemistry or Genetics and get cracking. We wrote most of the code you'll need to build new libraries from.
-- Tigger warning: This post may contain tiggers! --
What confuses me:
Percentage of Americans with Accounts:
Twitter: 13%
Facebook: 70%
So there is FAR less diversity, and extremely poor quality data. Why did they not archive public Facebook posts instead?
I see it as, Facebook hosts people who write articles, stories, poems, songs, music, pictures, etc. THAT is the point of the Library of Congress: Documenting and Preserving Culture. Not trying to datamine the history behind "WAT R U DOIN FRI GRRL?".
They should move away from using an excel spreadsheet to store and search for keywords and hire someone from walmart's big data team...
All your meme are belong to us!
You are being MICROattacked, from various angles, in a SOFT manner.
wait.. did you really just say, "Because academia is starved for data.".. HAAHAHAHA
really? I mean, "academia" is now deriving its useful scientific and mathematical data from twitter because it is 'starved of it' elsewhere?
Let me refresh your memory: Academia has nothing at all to do with data mining the personal (albeit public) information of all of humanity. What will these 'researchers' be accomplishing by watching the stream of shit that is twitter? Maybe finding the latest sexy pics posted? How about when someone tweets to humanity they are taking a shit - VERY useful info there.
Twitter is nothing more than an attention whore, egotistical palace. Actually in foresight it matches the united states government and library of congress to a T. Maybe it isn't such a bad match after all..
It's easy to process all that worthless data: throw it away and document something useful. This sounds more like a goldmine for advertisers than anything else. Who knows how much money has already been wasted on this.
https://petitions.whitehouse.gov/petition/stop-library-congress-wasting-money-archiving-twitter-posts/x6h3VYvr
Researchers are hampered by all the CPU cycles going to FBI and CIA searches. (Makes me think of Person of Interest)
The average lifespan of a wet, nasty fart is only on the order of dozens of seconds. That doesn't mean anyone should be trying to bottle the gaseous shit in case someone wants to smell something so foul in the future! Is this seriously what these moron assholes are squandering our tax dollars on? Fucking TWEETS!?!
Christ what a bunch of retarded twats!
So how big, in Libraries of Congresses, is the archive that they're adding from Twitter, to said Library of Congress?
This tagline was transcoded to result in at least one smirk. If you experience failure to smirk, please consult your Gen
The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
In the old days of USENET, conversation threads used to run for weeks, sometimes months, actually.
Not minutes.
of course, back then, we actually knew who everyone was, and could ping and finger them.
A substantial number of posts are literal duplicates by known spambots.
You could store those separately as well as the Retweets (RTs).
Then, think about what typically gets posted.
Most might be something like 520,000 variations on "Touchdown!" or "That's gotta hurt!" during sporting events, or "It's snowing!"
A lot of the rest are probably repeats of what someone just said on Comedy Network or during a TV program. They will all be at about the same time in a region and be substantially the same thing, with 50,000 misspelt variations.
Add the ACs and it's a lot smaller than you think. Most of the rest of that are still duplicates of something somebody else wrote, but without attribution.
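Storing duplicates and retweets once, as suggested above, might be sketched like this in Python. Assumptions: tweets arrive as (id, text) pairs, a retweet is marked by a leading "RT @user:", and the helper names are invented. It only catches copies that are identical after normalization, so the misspelt variations would still need some kind of fuzzy matching.

```python
import hashlib
import re

# Leading retweet marker such as "RT @bob: " (assumed convention).
RETWEET = re.compile(r"^rt\s+@\w+:?\s*", re.IGNORECASE)


def normalize(text):
    """Strip a leading retweet marker, lowercase, collapse whitespace."""
    text = RETWEET.sub("", text.strip())
    return " ".join(text.lower().split())


def dedupe(tweets):
    """tweets: iterable of (tweet_id, text) pairs.

    Returns (bodies, refs): bodies maps a digest to the normalized text,
    stored once; refs maps every tweet id to the digest of its body.
    """
    bodies = {}
    refs = {}
    for tweet_id, text in tweets:
        norm = normalize(text)
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        bodies.setdefault(digest, norm)
        refs[tweet_id] = digest
    return bodies, refs


sample = [
    (1, "Touchdown!"),
    (2, "touchdown!  "),
    (3, "RT @bob: Touchdown!"),
    (4, "It's snowing!"),
]
bodies, refs = dedupe(sample)
print(len(sample), "tweets,", len(bodies), "distinct bodies")
```

Normalizing like this throws away case and the retweet marker, so an archive would still want the original text (or at least the differences) kept somewhere alongside the shared body.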
300TB is about right. Twitter says they have 400 million tweets per day. Figure about 500 bytes per message with text, and metadata (source, destination, timestamp, flags). 400,000,000 msgs/day * 365*4 days * 500 bytes = 292,000,000,000,000 bytes.
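The same estimate as a few lines of Python, for anyone who wants to plug in other numbers. The 500 bytes per message is the guess from this post, and the rate is assumed constant over all four years; the function name is made up.

```python
TWEETS_PER_DAY = 400_000_000  # figure quoted above
DAYS = 365 * 4                # four years, ignoring the leap day
BYTES_PER_MESSAGE = 500       # guess: text plus metadata


def archive_bytes(per_day, days, bytes_per_message):
    """Total archive size in bytes at a constant daily rate."""
    return per_day * days * bytes_per_message


total = archive_bytes(TWEETS_PER_DAY, DAYS, BYTES_PER_MESSAGE)
print(total)           # bytes
print(total / 10**12)  # decimal terabytes
```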
Twitter offers a feed of 1 in 10,000 public tweets, so you can see how banal it is. I had a program monitoring that for a while, extracting links and evaluating them for spam. It's about as bad as you'd expect.
only 0.6 Libraries of Congress.
The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
In the old days of USENET, conversation threads used to run for weeks, sometimes months, actually.
Not minutes.
of course, back then, we actually knew who everyone was, and could ping and finger them.
You missed the "used to be". You know, before usenet (which was hardly "the old days", when are you thinking of? 20 years ago? 30?).
Imagine if we had access to "twitter" like posts from WW2, or during the 60s. It's a huge cultural treasure that reflects the ideas of people back then, and while having less value now will certainly be heavily researched far into the future.
Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute.
Why? Why does it take so long?
They talk about the hardware and software not being up to scratch, but many other companies seem to be able to process huge amounts of data quickly. Google, for one, seems to do it.
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
It takes the Library of Congress 24 hours to search 160 chars but Google less than 1 sec to search a world full of data. It's time to cut the government (public sector) and outsource any way we can.