Library of Congress Offers Update On Huge Twitter Archive Project

Posted by samzenpus on Monday January 7, 2013 @10:38AM from the 140-little-problems dept.

Nerval's Lobster writes "Back in April 2010, the Library of Congress agreed to archive four years' worth of public Tweets. Even by the standards of the nation's most famous research library, the goal was an ambitious one. The librarians needed to build a sustainable system for receiving and preserving an enormous number of Tweets, then organize that dataset by date. At the time, Twitter also agreed to provide future public Tweets to the Library under the same terms, meaning any system would need the ability to scale up to epic size. The resulting archive is around 300 TB in size. But there's still a huge challenge: the Library needs to make that huge dataset accessible to researchers in a way they can actually use. Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute, which limits researchers' ability to do work in a timely way."

88 comments

Why? by Anonymous Coward · 2013-01-07 10:40 · Score: 5, Insightful

Why does the federal government need to archive the useless information Twitter calls tweets? Yet another huge waste of my money (being a taxpayer and all).
1. Re:Why? by Anonymous Coward · 2013-01-07 10:51 · Score: 1
  
  Because we desperately need to know that little Susie just ate some pizza and finished taking a shit 5 minutes ago.
2. Re:Why? by griffjon · 2013-01-07 10:54 · Score: 4, Insightful
  
  To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."
  
  --
  Returned Peace Corps IT Volunteer
3. Re:Why? by Anonymous Coward · 2013-01-07 11:10 · Score: 5, Insightful
  
  To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."
  The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
4. Re:Why? by icebike · 2013-01-07 11:14 · Score: 1
  
  mod parent up.
  
  --
  Sig Battery depleted. Reverting to safe mode.
5. Re:Why? by citylivin · 2013-01-07 11:28 · Score: 1
  
  "This is no way to run a culture."
  Teen angst and celebrity gossip are considered culture, but popular movies and music are not. American society at its finest!
  
  --
  As a potential lottery winner, I totally support tax cuts for the wealthy
6. Re:Why? by fsterman · 2013-01-07 11:28 · Score: 4, Interesting
  
  Because academia is starved for data. Companies hoarding information limits what we can do with it. The Library of Congress is acting as an aggregate buyer for thousands of individual researchers; it is a huge cost savings.
  
  --
  Is there anything better than clicking through Microsoft ads on Slashdot?
7. Re:Why? by skids · 2013-01-07 11:46 · Score: 1
  
  My first reaction was "no, please, don't encourage the twits."
  
  --
  Someone had to do it.
8. Re:Why? by Hatta · 2013-01-07 11:49 · Score: 4, Insightful
  
  Because Twitter is a great model for the spread of ideas. If you study the spread of ideas, you can begin to understand it and use that understanding to affect it. That has enormous value.
  
  --
  Give me Classic Slashdot or give me death!
9. Re:Why? by Dahamma · 2013-01-07 12:01 · Score: 1
  
  300TB of storage can be built for less than $100k these days. Far from a "huge" waste of money. Though given the value of most Twitter posts, it's still probably a waste of $99,500.
10. Re:Why? by DerekLyons · 2013-01-07 13:27 · Score: 2
  
  Because tweets aren't useless - they're as much a part of society's communications as post cards, phone calls, etc... etc... There's a lot of information there about the day-to-day interests and communications patterns of a lot of ordinary people.
  For a historian or a sociologist, that archive is going to be a gold mine.
11. Re:Why? by nospam007 · 2013-01-07 17:15 · Score: 1
  
  "Because Twitter is a great model for the spread of ideas."
  Indeed. We'll have a treasure trove of racist/bigot/whatever messages from 20-30 years ago, when they were young and dumb, for every candidate we are going to vote for.
12. Re:Why? by Anonymous Coward · 2013-01-07 17:55 · Score: 0
  
  You beat me to it, AC! Yes, why the hell is my $ to be wasted on saving this BS? Maybe they're seeing the 'good' tweets?
13. Re:Why? by Anonymous Coward · 2013-01-08 10:31 · Score: 0
  
  300TB for what? There's a 140-character limit on the text of each post; if it's just text (mostly useless text), does it need that much?
  Twitter is just an extension of ancient pager tech. from the 90s (where do you think that 140 char. limit comes from?)
  And the $100K is what actual business will pay for it, this is govt. it will be $100B at least.
14. Re:Why? by Dahamma · 2013-01-08 17:06 · Score: 1
  
  Like I said, I question the value, but the cost just isn't that much. People seem to think 300TB is a big number to store or manage, but it's really not any more.
  And honestly, I relish the day when all of the teens and twentysomethings get older and start running companies or running for political office and rather than guess at their *real* past lives we can just search for all of the idiotic/offensive/racist comments they made over the years...
I wonder... by Anonymous Coward · 2013-01-07 10:40 · Score: 1

Is the limitation hardware or software? Where is the bottleneck?
Just give me a csv.
1. Re:I wonder... by ackthpt · 2013-01-07 10:52 · Score: 1
  
  Is the limitation hardware or software? Where is the bottleneck?
  Just give me a csv.
  Probably a simple hashing routine would cut down on the size 1 = LOL, 10 = ROFL, 11 = ROFLMAO, ...
  
  --
  
  A feeling of having made the same mistake before: Deja Foobar
2. Re:I wonder... by Anonymous Coward · 2013-01-07 11:55 · Score: 0
  
  500m twits at 140 characters + datestamp + username is a metric shit ton of data.. probably pushing 60-80 gigabytes of raw data every day. take one hell of a database server to support 4 years worth of that data in something that can be searched relatively quickly (simple search queries that take seconds, not hours). be an interesting test case for benchmarking different databases, file systems, operating systems, hardware.......
3. Re:I wonder... by arielCo · 2013-01-07 11:58 · Score: 1
  
  Or just use some level of solid compression and the problem solves itself.
  
  --
  This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
narrow it down? by KingAlanI · 2013-01-07 10:41 · Score: 1

provide a limited version of the database with only some information from the tweets, so there's less data to search through? (of course, keep the full data in case a search depends on it)

--
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics nonwithstanding.
1. Re:narrow it down? by greenlead · 2013-01-07 10:45 · Score: 1
  
  I agree. I think they should limit the initial database to certain time spans surrounding events of national interest and "tweets" that seem to be related. They can learn database structure and procedures from there and perhaps later add in the full archive. The most important part of anything like this is metadata. For example, a tweet that says "dudes! this concert rocks!!!" is useless unless you happen to know that the user is at a Trans-Siberian Orchestra concert. And then, if you are able to attach all of the posts related to that concert together, it could be potentially useful.
2. Re:narrow it down? by uncanny · 2013-01-07 11:04 · Score: 1
  
  with only some information from the tweets
  That's a really good idea. Hell, that would probably make their whole project a couple of megabytes!
3. Re:narrow it down? by KingAlanI · 2013-01-07 11:10 · Score: 1
  
  I meant providing all the tweets in a simpler form, as opposed to excluding some of the tweets entirely, but I suppose it would make sense to at least test on a small subset of tweets first.
  
  --
  I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics nonwithstanding.
4. Re:narrow it down? by Instine · 2013-01-07 11:39 · Score: 1
  
  provide hourly chunks of raw data as torrents.
  
  Done!
  
  --
  Because you can - or because you should?
5. Re:narrow it down? by TapeCutter · 2013-01-07 11:46 · Score: 1
  
  I don't think they should pick and choose what to keep, the value is in the fact that they are everyday conversations and observations (much like Samuel peeps diary). However I can't think of a reason why an academic would want every tweet for four years, they could get the same insights from a much smaller random sample.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Why not? by Anonymous Coward · 2013-01-07 10:45 · Score: 1

Just buy batches of 300 of those 1TB flash drives in the article below and pass them out to the researchers as needed?
Let me be the first to say by Anonymous Coward · 2013-01-07 10:47 · Score: 0

Garbage in, garbage out
1. Re:Let me be the first to say by Anonymous Coward · 2013-01-07 12:13 · Score: 0
  
  RT @AnonymousCoward Garbage in, garbage out #lol
My goodness 24 hours? by Bramlet+Abercrombie · 2013-01-07 10:47 · Score: 2

You had better turn on indexing.
1. Re:My goodness 24 hours? by Trepidity · 2013-01-07 11:06 · Score: 1
  
  Indexing on data sets of that size is itself a pretty big challenge. You don't want an index that takes years to build, and it doesn't do much good if it's so huge that it is itself super-slow to access.
  There is some research [pdf] on making compressed full-text indexes, but much of it is still research-level.
  
  --
  10 PRINT CHR$(205.5+RND(1)); : GOTO 10
2. Re:My goodness 24 hours? by Anonymous Coward · 2013-01-07 11:13 · Score: 1
  
  Deduplication ought to help too.
Amazon RedShift is Perfect for this. by Anonymous Coward · 2013-01-07 10:49 · Score: 0

RedShift is perfect for this:
http://aws.amazon.com/redshift/
He who archives my tweets by ackthpt · 2013-01-07 10:50 · Score: 0

Archives trash.
Really, why not record and archive random traffic sounds? Some day when everyone is flitting about in whisper quiet air cars they'll marvel at the cacophony of the present age. Gadzooks!

--

A feeling of having made the same mistake before: Deja Foobar
1. Re:He who archives my tweets by Reilaos · 2013-01-07 11:00 · Score: 4, Interesting
  
  Some of the most important historical knowledge comes from things that people at the time wouldn't consider important. Things like grocery lists can help determine the diets and agricultural abilities of a culture at the time.
  For an example I just made up: In the future, the presence or lack of traffic reports could, alongside legal/budget records, help a historian verify the spread/development of roadways.
  Twitter could be a huge source of topics and a wealth of information for historians in the future.
  They may conclude that we were all idiots. This too, counts as useful information.
2. Re:He who archives my tweets by TapeCutter · 2013-01-07 12:02 · Score: 1
  
  Yep, the ancient rubbish pit is often the most informative part of an archeological dig, however this is more along the lines of Samuel Peeps' diary. Four years worth of tweets is a bit over the top, IMHO a few random days and a few significant days would be all you really need. I have something similar at home, it's a large coffee table book that has one page of newspaper clippings for every month of the 20th century.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
3. Re:He who archives my tweets by Anonymous Coward · 2013-01-07 13:13 · Score: 0
  
  Perhaps they'll learn people on Slashdot can't spell PEPYS. Or is everyone doing it to make a link between peeps and tweets? No? Thought so. Now get off my lawn, I've got some cheese to bury.
4. Re:He who archives my tweets by Anonymous Coward · 2013-01-07 13:13 · Score: 0
  
  They may conclude that we were all idiots.
  The word you're looking for is "realize".
Sphinx? by Anonymous Coward · 2013-01-07 10:51 · Score: 0

Why not index the collection using a Sphinx cluster (http://sphinxsearch.com)? Using the proper indexes and transformations this should net them a pretty good system. They wouldn't even need to keep the raw information online as the necessary references to the raw data could be made.
Post them on the web by Anonymous Coward · 2013-01-07 10:59 · Score: 0

Let Google index them.
1. Re:Post them on the web by fsterman · 2013-01-07 11:39 · Score: 1
  
  Why is parent modded to 0? Storing them on any number of cloud services would be a lot cheaper than building their own system. Amazon and Google already host public datasets for researchers over 300tb. Hell, they could just agree to pay Twitter a service fee for data and keep offline tape backups. While we are at it, why not maintain a Torrent of each year?
  
  --
  Is there anything better than clicking through Microsoft ads on Slashdot?
2. Re:Post them on the web by Desler · 2013-01-07 12:04 · Score: 1
  
  They aren't. AC posts always start at 0. Welcome to Slashdot.
3. Re:Post them on the web by Anonymous Coward · 2013-01-07 12:05 · Score: 0
  
  Why is parent modded to 0? Storing them on any number of cloud services would be a lot cheaper than building their own system. Amazon and Google already host public datasets for researchers over 300tb. Hell, they could just agree to pay Twitter a service fee for data and keep offline tape backups. While we are at it, why not maintain a Torrent of each year?
  The problem here is making them searchable, not storing them. Tweets are a lot smaller than webpages, so 300TB contains an insane number of things to index. I followed your link and those data sets are hosted for download, but not searching.
4. Re:Post them on the web by fsterman · 2013-01-08 08:03 · Score: 1
  
  Oh, right. I would like to point out my old-skoolish 6-digit uid!
  
  --
  Is there anything better than clicking through Microsoft ads on Slashdot?
Down the Terlit by Jetra · 2013-01-07 11:07 · Score: 0

Classical books, works of art, grand inventions that changed the world...and we chose to archive people pissing about on a Friday night. Good Job, America. You've shown the world where your priorities lie.
1. Re:Down the Terlit by viperidaenz · 2013-01-07 11:16 · Score: 1
  
  It's illegal to make a copy of any of those other things though.
2. Re:Down the Terlit by Jetra · 2013-01-07 11:25 · Score: 1
  
  Isn't Copyright just the worst?
3. Re:Down the Terlit by viperidaenz · 2013-01-07 12:45 · Score: 1
  
  It has its place.
4. Re:Down the Terlit by TapeCutter · 2013-01-07 13:06 · Score: 2
  
  Thing is YOU don't get to define what future generations think of you and your civilization, if you want to help them to form an accurate view rather than just the image you want to portray then you need to leave some juicy rubbish dumps undisturbed, this is one such dump. I'd question the justification for the size of this particular dump but you make it sound like they are throwing out Mark Twain to make room for twitter. You know they have the resources to do both things at the same time, and that this project probably cost less than a single hand written Twain manuscript, right?
  
  At the end of the day I see stuff like this as a GoodThing(TM). I'd much rather live in a society that over-values its everyday trivia than one that under-values its past, or worse still goes out of its way to destroy it (Taliban). The Victorian English were the ones who started the drive for museums and preserving/understanding the past. Before that nobody really bothered about social heritage, it was all about family heritage. Egypt is a great example, they started looking after their own heritage after the British showed so much interest in digging up their monuments and taking them home with them. A large portion of the old city of Cairo is made from the lime that used to cover the pyramids (their sides were originally flat and white), all that's left is a little cap of lime on top of the largest one. Once the pyramid making fad and families had well and truly died, the people of Cairo simply didn't care about a huge monument built by some long dead family, to them it was no more than a convenient source of building material that a long dead Pharaoh had left lying around.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
5. Re:Down the Terlit by Jetra · 2013-01-07 13:33 · Score: 1
  
  And what, exactly, is going to show our future kin if we archive all of Twitter? I'll tell you one thing, swallowing it is going to be very, very hard.
Stuck in a loop here.... by rts008 · 2013-01-07 11:08 · Score: 3, Funny

So, just how many 'Libraries of Congress' are there in 300TB?
Does this mean that as the archives swell, the metric does also?
Where does this madness end? ;-)

--
Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
1. Re:Stuck in a loop here.... by Anonymous Coward · 2013-01-07 11:21 · Score: 0
  
  Yo dawg?
2. Re:Stuck in a loop here.... by Anonymous Coward · 2013-01-07 16:00 · Score: 0
  
  The madness ends when the entire Internet has been archived and any information can be used against them. Then you have Google's dream: Soviet Russia reborn all over the world.
what a colossal fucking waste by Anonymous Coward · 2013-01-07 11:12 · Score: 0

Twitter is the scourge of the internet
Unit conversion by kav2k · 2013-01-07 11:19 · Score: 1

1 Library of Congress ~ 10TB of data
Therefore, the database will be around 30 LoCs in size.
But, if we consider this database as part of the Library of Congress, we get a fixpoint problem..
1. Re:Unit conversion by voidphoenix · 2013-01-07 20:43 · Score: 2
  
  No, the 10TB estimate is incorrect. The LoC estimates that the digitized size of its print collection was around 200 TB as of 2000 CE.
  
  --
  Excuse me, wtf r u doin?
The Library's Mission Statement is ... by jabberwock · 2013-01-07 11:20 · Score: 1

"The Library's mission is to support the Congress in fulfilling its constitutional duties and to further the progress of knowledge and creativity for the benefit of the American people." (from its website.) No, I don't see how archiving Twits and tweets furthers this mission *at all.*

It's not much of a step from there to archiving all the phone conversations of all Americans ... oh, wait, sorry. That's in the FBI's mission statement.
1. Re:The Library's Mission Statement is ... by Anonymous Coward · 2013-01-07 11:32 · Score: 0
  
  Telephone calls aren't public. Public tweets are meant to be visible to... you know... the public.
2. Re:The Library's Mission Statement is ... by XxtraLarGe · 2013-01-07 11:34 · Score: 1
  
  No, I don't see how archiving Twits and tweets furthers this mission *at all.*
  
  And in what way does that matter? After all, Congress, the President and the Supreme Court don't follow the Constitution, why should we expect any other bureaucracy to do what they're supposed to?
  
  --
  Taking guns away from the 99% gives the 1% 100% of the power.
3. Re:The Library's Mission Statement is ... by jabberwock · 2013-01-07 11:37 · Score: 1
  
  We're getting perilously close to "Don't say anything out loud that you don't want to be public."
4. Re:The Library's Mission Statement is ... by Anonymous Coward · 2013-01-07 11:48 · Score: 0
  
  ...wow. They actually do follow the constitution. But please, enlighten us with your Intro to Law class....
5. Re:The Library's Mission Statement is ... by gl4ss · 2013-01-07 19:48 · Score: 1
  
  ...wow. They actually do follow the constitution. But please, enlighten us with your Intro to Law class....
  I'm pretty sure the constitution doesn't say that you can forget it if you rent a piece of land and station people on it.
  and how can everyone get to bear arms if bears are a protected species..
  
  --
  world was created 5 seconds before this post as it is.
6. Re:The Library's Mission Statement is ... by Anonymous Coward · 2013-01-08 03:51 · Score: 0
  
  no, we aren't. this is the long-existing "don't say anything in public you don't want to be public."
  if you somehow felt like that wasn't something that was already true, well... we aren't responsible for keeping you comfortable in your delusions.
seriously? by Anonymous Coward · 2013-01-07 11:28 · Score: 2, Insightful

300TB worth of tweets, which are basically very small text files? A single tweet that uses all available characters should only be 140 bytes. I just refuse to believe that there are 2+ trillion tweets out there to make up 280+TB, considering 1 billion tweets would be 140GB. (Unless I'm failing massively at math here, which is quite possible.)
1. Re:seriously? by gl4ss · 2013-01-07 20:17 · Score: 1
  
  but if you dump the data associated to the tweet as well...
  
  --
  world was created 5 seconds before this post as it is.
2. Re:seriously? by Anonymous Coward · 2013-01-07 22:38 · Score: 0
  
  You can attach images to tweets
3. Re:seriously? by __aablib8664 · 2013-01-08 06:03 · Score: 1
  
  They use UTF-8 - assuming only ASCII characters are used, it would take 140 bytes, but UTF-8 also has characters that are more than 1 byte. So the size could be 3 times larger depending on what's typed
  
  https://dev.twitter.com/docs/counting-characters
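
To make the characters-versus-bytes point in this thread concrete, here is a small Python sketch. The sample strings are made up for illustration, and `len()` counts Unicode code points, which is only assumed here to approximate how Twitter counts characters. It shows that 140 characters occupy 140 bytes in UTF-8 only when every character is ASCII.

```python
def utf8_size(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical 140-character tweets built from a single repeated character.
samples = {
    "ascii": "a" * 140,              # 1 byte per character
    "accented": "\u00e9" * 140,      # e-acute, 2 bytes per character
    "cjk": "\u65e5" * 140,           # a CJK ideograph, 3 bytes per character
    "emoji": "\U0001F600" * 140,     # an emoji, 4 bytes per character
}

for name, text in samples.items():
    chars, size = utf8_size(text)
    print(f"{name}: {chars} characters, {size} bytes")
```

So the worst case for tweet text alone is 560 bytes, while a tweet of plain English stays close to one byte per character.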
Obviously no programmers on right now by WillAffleckUW · 2013-01-07 11:31 · Score: 2

Look, I don't know about you, but we process hundreds of TB of data when we process genomes, using this fancy stuff called "databases", "hash indexing", and fancy software that may be hard for you to find like Perl, C, and various scripting languages.
It's fairly simple coding. Just build an index hash from keywords (which are all preceded by #), add another index by words (ignoring all the bit.ly and other web links), add a third index by @ reference (aka user names, which are really just a 20 character part of an SMS message), and go to town.
We do it every day.
Now, you've got a few extra complexities, we tend to use GACT and similar short codes, but we also have to add skips, nulls, misreads, ambiguities, so it's usually 8 symbols and you're looking at an extended ASCII power.
Still, you're getting obsessed with the size (which is nothing compared to a genome, and we have drives much much bigger than that).
Just do it and stop thinking it's "hard". It isn't. Buy a decent Perl book for Biochemistry or Genetics and get cracking. We wrote most of the code you'll need to build new libraries from.

--
-- Tigger warning: This post may contain tiggers! --
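
A minimal sketch of the three-index scheme described in the post above (hashtags, plain words with links ignored, and @ references), written in Python rather than the Perl the post recommends. The function names, regexes, and sample tweets are illustrative assumptions, not anything from the Library's system, and in-memory dicts stand in for what would have to be an on-disk, sharded index at 300TB.

```python
import re
from collections import defaultdict

# Hypothetical tokenizing patterns; real tweet parsing has more edge cases.
LINK = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
WORD = re.compile(r"[a-z0-9']+")


def build_indexes(tweets):
    """Build three inverted indexes mapping a key to a set of tweet ids."""
    by_tag = defaultdict(set)
    by_word = defaultdict(set)
    by_mention = defaultdict(set)
    for tweet_id, text in tweets.items():
        # Strip links first so URL fragments are not indexed as tags or words.
        no_links = LINK.sub(" ", text.lower())
        for tag in HASHTAG.findall(no_links):
            by_tag[tag].add(tweet_id)
        for user in MENTION.findall(no_links):
            by_mention[user].add(tweet_id)
        # Remove hashtags and mentions before indexing the remaining words.
        plain = MENTION.sub(" ", HASHTAG.sub(" ", no_links))
        for word in WORD.findall(plain):
            by_word[word].add(tweet_id)
    return by_tag, by_word, by_mention


def search(index, *keys):
    """Return the ids of tweets that match every key."""
    sets = [index.get(key.lower(), set()) for key in keys]
    return set.intersection(*sets) if sets else set()


tweets = {
    1: "Dudes! This concert rocks!!! #TSO",
    2: "@alice see you at the concert http://bit.ly/xyz #tso",
    3: "It's snowing! #weather",
}
by_tag, by_word, by_mention = build_indexes(tweets)
print(search(by_tag, "tso"))                # tweets 1 and 2
print(search(by_word, "concert", "rocks"))  # tweet 1 only
print(search(by_mention, "alice"))          # tweet 2 only
```

A lookup here is a dictionary access plus a set intersection, which is why an index turns a full scan into something fast. Building and storing those sets for billions of tweets is the part this sketch leaves out.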
Data Relevance by __aablib8664 · 2013-01-07 11:44 · Score: 2

What confuses me:

Percentage of Americans with Accounts:
Twitter: 13%
Facebook: 70%

So there is FAR less diversity, and extremely poor quality data. Why did they not archive public Facebook posts instead?

I see it as, Facebook hosts people who write articles, stories, poems, songs, music, pictures, etc. THAT is the point of the Library of Congress: Documenting and Preserving Culture. Not trying to datamine the history behind "WAT R U DOIN FRI GRRL?"
Excel? by Anonymous Coward · 2013-01-07 11:47 · Score: 0

They should move away from using an excel spreadsheet to store and search for keywords and hire someone from walmart's big data team...
Kwic! Someone set us up the index! by SpaceLifeForm · 2013-01-07 12:03 · Score: 1

All your meme are belong to us!

--
You are being MICROattacked, from various angles, in a SOFT manner.
what.. the.. fuck.. by Anonymous Coward · 2013-01-07 12:14 · Score: 0

wait.. did you really just say, "Because academia is starved for data.".. HAAHAHAHA
really? I mean, "academia" is now deriving its useful scientific and mathematical data from twitter because it is 'starved of it' elsewhere?
Let me refresh your memory: Academia has nothing at all to do with data mining the personal (albeit public) information of all of humanity. What will these 'researchers' be accomplishing by watching the stream of shit that is twitter? Maybe finding the latest sexy pics posted? How about when someone tweets to humanity they are taking a shit - VERY useful info there.
Twitter is nothing more than an attention whore, egotistical palace. Actually, in foresight, it matches the United States government and Library of Congress to a T. Maybe it isn't such a bad match after all..
Petition anyone? by Anonymous Coward · 2013-01-07 12:14 · Score: 0

It's easy to process all that worthless data: throw it away and document something useful. This sounds more like a goldmine for advertisers than anything else. Who knows how much money has already been wasted on this.

https://petitions.whitehouse.gov/petition/stop-library-congress-wasting-money-archiving-twitter-posts/x6h3VYvr
Bottleneck.... by careysb · 2013-01-07 12:37 · Score: 1

Researchers are hampered by all the CPU cycles going to FBI and CIA searches. (Makes me think of Person of Interest)
Oh yeah? by Anonymous Coward · 2013-01-07 13:12 · Score: 0

The average lifespan of a wet, nasty fart is only on the order of dozens of seconds. That doesn't mean anyone should be trying to bottle the gaseous shit in case someone wants to smell something so foul in the future! Is this seriously what these moron assholes are squandering our tax dollars on? Fucking TWEETS!?!
Christ what a bunch of retarded twats!
Oblig... by webmistressrachel · 2013-01-07 13:50 · Score: 1

So how big, in Libraries of Congresses, is the archive that they're adding from Twitter, to said Library of Congress?

--
This tagline was transcoded to result in at least one smirk. If you experience failure to smirk, please consult your Gen
1. Re:Oblig... by voidphoenix · 2013-01-07 20:37 · Score: 1
  
  That's around 1.5 Libraries of Congress.
  
  --
  Excuse me, wtf r u doin?
Re:Why? or lifespan of conversations by WillAffleckUW · 2013-01-07 14:13 · Score: 1

The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
In the old days of USENET, conversation threads used to run for weeks, sometimes months, actually.
Not minutes.
of course, back then, we actually knew who everyone was, and could ping and finger them.

--
-- Tigger warning: This post may contain tiggers! --
How many of those are RTs? by WillAffleckUW · 2013-01-07 14:19 · Score: 1

A substantial number of posts are literal duplicates by known spambots.
You could store those separately as well as the Retweets (RTs).
Then, think about what typically gets posted.
Most might be something like 520,000 variations on "Touchdown!" or "That's gotta hurt!" during sporting events, or "It's snowing!"
A lot of the rest are probably repeats of what someone just said on Comedy Network or during a TV program. They will all be at about the same time in a region and be substantially the same thing, with 50,000 misspelt variations.
Add the ACs and it's a lot smaller than you think. Most of the rest of that are still duplicates of something somebody else wrote, but without attribution.

--
-- Tigger warning: This post may contain tiggers! --
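
A rough sketch of the deduplication idea in the post above: store each distinct tweet text once, keyed by a content hash, and keep only a small reference per tweet. The record layout and sample data are assumptions for illustration. It catches exact duplicates only, so misspelled variations of the same message would still need some form of fuzzy matching.

```python
import hashlib


def dedupe(tweets):
    """Split (user, text) tweets into a unique-text store and per-tweet refs."""
    texts = {}  # content hash -> text, stored once
    refs = []   # (user, content hash) for every tweet, in original order
    for user, text in tweets:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        texts.setdefault(key, text)
        refs.append((user, key))
    return texts, refs


# Hypothetical sample: three identical posts and one distinct post.
tweets = [
    ("fan1", "Touchdown!"),
    ("fan2", "Touchdown!"),
    ("fan3", "It's snowing!"),
    ("bot9", "Touchdown!"),
]
texts, refs = dedupe(tweets)
print(len(refs), "tweets,", len(texts), "unique texts")

# Every original tweet can still be reconstructed from its reference.
restored = [(user, texts[key]) for user, key in refs]
```

Note that a 64-character hex digest is longer than most tweets, so a real system would presumably map each distinct text to a compact integer id instead; the hash is used here only to keep the example short.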
300TB is about right by Animats · 2013-01-07 18:58 · Score: 1

300TB is about right. Twitter says they have 400 million tweets per day. Figure about 500 bytes per message with text, and metadata (source, destination, timestamp, flags). 400,000,000 msgs/day * 365*4 days * 500 bytes = 292,000,000,000,000 bytes.
Twitter offers a feed of 1 in 10,000 public tweets, so you can see how banal it is. I had a program monitoring that for a while, extracting links and evaluating them for spam. It's about as bad as you'd expect.
1. Re:300TB is about right by Anonymous Coward · 2013-01-07 20:47 · Score: 0
  
  No, it is not about right. It isn't even remotely close to right.
  The article itself claims 170B tweets - not the 584B that you calculate here. (They didn't always have that many per day.)
  500 bytes seems a bit of a stretch. Say maybe 4 bytes each for pointers to source and destination, with any flags tucked into spare bits... 4 bytes at most for time... That's 12 bytes of overhead. The average tweet length is under 70 characters, so the total average size should be far less than 100 bytes. Not 500.
  Then, there is an expectation of even basic compression, and at least taking advantage of retweets and complete duplicates...
  So it is only "about right" if you mean "within a few orders of magnitude". Sure, I could always write them out in the sand and store pictures of each one... But the resulting data size does not say much.
2. Re:300TB is about right by __aablib8664 · 2013-01-08 06:10 · Score: 1
  
  Is the average tweet under 140 characters? Many I've seen aren't. As for 500, it could easily reach that all totaled. Being UTF-8 encoded, it's possible for characters to be more than 1 byte each, meaning a message could easily be 4 times that size = 560 bytes. And still plus overhead, it certainly is possible.
  https://dev.twitter.com/docs/counting-characters
  
  plus, what kind of conspiracy theory is it that the library of congress would want to willfully deceive us about the size of a twitter archive....... : )
3. Re:300TB is about right by Anonymous Coward · 2013-01-08 06:39 · Score: 0
  
  No, you have not seen those. The maximum tweet is 140 characters. The average size is under 70, and the median and mode are even smaller. Tweets are small.
  No, it really could not easily reach sizes that large. UTF-8 encoding won't make a notable difference to the statistics.
  Even with awful bloated sizes, the number of tweets is under 1/3 the number needed to add up. And you would still expect efficient representation of duplicates. And even compression. So it still isn't within an order of magnitude.
  What conspiracy theory? People give overblown descriptions of their projects all the time. That doesn't mean we have to be so weak as to just swallow what they shovel at us.
4. Re:300TB is about right by __aablib8664 · 2013-01-09 03:04 · Score: 1
  
  you're right, i meant to say the average being under 70, not 140. but i still often see tweets far closer to 140.
  
  and how do you say it won't make a difference to the statistics? 140 characters, multiplied by a -potential- 2, maybe 3 bytes per character, that absolutely increases the size! they aren't magical letters.
  when the data is available, feel free to run stats on tweet size, i'm sure it will be boring. the conspiracy theory was a joke. point being, who cares how big it is. i'm pissed that they spent ANY time or money on documenting this crap : )
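
The disagreement in this thread comes down to arithmetic, so here is a short Python sketch of both sides using only figures quoted above: 400 million tweets per day at 500 bytes, versus 170 billion tweets at under 100 bytes. Decimal terabytes (10^12 bytes) are an assumption made to match the original post's calculation.

```python
TB = 10**12  # decimal terabyte, matching the arithmetic in the original post


def archive_bytes(tweet_count, bytes_per_tweet):
    """Total archive size in bytes for a given count and per-tweet size."""
    return tweet_count * bytes_per_tweet


# Original post: 400 million tweets/day for four years at 500 bytes each.
animats_tweets = 400_000_000 * 365 * 4
animats_size = archive_bytes(animats_tweets, 500)

# First reply: 170 billion tweets at (generously) 100 bytes each.
article_tweets = 170_000_000_000
reply_size = archive_bytes(article_tweets, 100)

# Bytes per tweet implied if 170 billion tweets really occupy 300 TB.
implied = 300 * TB / article_tweets

print(f"Original post: {animats_tweets:,} tweets, {animats_size / TB:.0f} TB")
print(f"First reply:   {article_tweets:,} tweets, {reply_size / TB:.0f} TB")
print(f"Implied size:  {implied:.0f} bytes per tweet")
```

This gives 292 TB for the original post, 17 TB for the reply's assumptions, and roughly 1,765 bytes per tweet if both the 170 billion count and the 300 TB total are taken at face value, which would only add up if each stored record carries far more than the tweet text.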
So in hindsight by Mister+Liberty · 2013-01-08 00:46 · Score: 1

only 0.6 Libraries of Congress.
Re:Why? or lifespan of conversations by Anonymous Coward · 2013-01-08 01:34 · Score: 0

The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
In the old days of USENET, conversation threads used to run for weeks, sometimes months, actually.
Not minutes.
of course, back then, we actually knew who everyone was, and could ping and finger them.
You missed the "used to be". You know, before usenet (which was hardly "the old days", when are you thinking of? 20 years ago? 30?).
This is an investment into future research by Anonymous Coward · 2013-01-08 01:57 · Score: 0

Imagine if we had access to "twitter" like posts from WW2, or during the 60s. It's a huge cultural treasure that reflects the ideas of people back then, and while having less value now will certainly be heavily researched far into the future.
Why? by Inda · 2013-01-08 02:02 · Score: 1

Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute.

Why? Why does it take so long?

They talk about the hardware and software not being up to scratch, but many other companies seem to be able to process huge amounts of data quickly. Google, for one, seems to do it.

--
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
24 Hours !!! by Anonymous Coward · 2013-01-08 03:31 · Score: 0

It takes the Library of Congress 24 hours to search 160 chars but Google less than 1 sec to search a world full of data. It's time to cut the government (public sector) and outsource any way we can.