Slashdot Mirror


Library of Congress Offers Update On Huge Twitter Archive Project

Nerval's Lobster writes "Back in April 2010, the Library of Congress agreed to archive four years' worth of public Tweets. Even by the standards of the nation's most famous research library, the goal was an ambitious one. The librarians needed to build a sustainable system for receiving and preserving an enormous number of Tweets, then organize that dataset by date. At the time, Twitter also agreed to provide future public Tweets to the Library under the same terms, meaning any system would need the ability to scale up to epic size. The resulting archive is around 300 TB in size. But there's still a huge challenge: the Library needs to make that huge dataset accessible to researchers in a way they can actually use. Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute, which limits researchers' ability to do work in a timely way."

14 of 88 comments

  1. Why? by Anonymous Coward · · Score: 5, Insightful

    Why does the federal government need to archive the useless information Twitter calls tweets... yet another huge waste of my money (being a taxpayer and all)

    1. Re:Why? by griffjon · · Score: 4, Insightful

      To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."

      --
      Returned Peace Corps IT Volunteer
    2. Re:Why? by Anonymous Coward · · Score: 5, Insightful

      To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."

      The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.

    3. Re:Why? by fsterman · · Score: 4, Interesting

      Because academia is starved for data. Companies hoarding information limits what we can do with it. The Library of Congress is acting as an aggregate buyer for thousands of individual researchers; it is a huge cost savings.

      --
      Is there anything better than clicking through Microsoft ads on Slashdot?
    4. Re:Why? by Hatta · · Score: 4, Insightful

      Because Twitter is a great model for the spread of ideas. If you study the spread of ideas, you can begin to understand it and use that understanding to affect it. That has enormous value.

      --
      Give me Classic Slashdot or give me death!
    5. Re:Why? by DerekLyons · · Score: 2

      Because tweets aren't useless - they're as much a part of society's communications as post cards, phone calls, etc... etc... There's a lot of information there about the day-to-day interests and communication patterns of a lot of ordinary people.

      For a historian or a sociologist, that archive is going to be a gold mine.

  2. My goodness 24 hours? by Bramlet Abercrombie · · Score: 2

    You had better turn on indexing.

  3. Re:He who archives my tweets by Reilaos · · Score: 4, Interesting

    Some of the most important historical knowledge comes from things that people at the time wouldn't consider important. Things like grocery lists can help determine the diets and agricultural abilities of a culture at the time.

    For an example I just made up: In the future, the presence or lack of traffic reports could, alongside legal/budget records, help a historian verify the spread/development of roadways.

    Twitter could be a huge source of topics and a wealth of information for historians in the future.

    They may conclude that we were all idiots. This too, counts as useful information.

  4. Stuck in a loop here.... by rts008 · · Score: 3, Funny

    So, just how many 'Libraries of Congress' are there in 300TB?
    Does this mean that as the archives swell, the metric does also?
    Where does this madness end? ;-)

    --
    Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
  5. seriously? by Anonymous Coward · · Score: 2, Insightful

    300TB worth of tweets, which are basically very small text files? A single tweet that uses all available characters should only be 140 bytes. I just refuse to believe that there are 2+ trillion tweets out there to make up 280+TB, considering 1 billion tweets would be 140GB. (Unless I'm failing massively at math here, which is quite possible.)
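
For what it's worth, the arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one byte per character, decimal units (1 TB = 10^12 bytes), and that the archive stores nothing but the tweet text. The constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the tweet-size arithmetic.
# Assumptions (not from the Library): 1 byte per character,
# decimal units, and no per-tweet overhead beyond the text itself.
BYTES_PER_TWEET = 140
GB = 10**9
TB = 10**12


def size_in_bytes(num_tweets, bytes_per_tweet=BYTES_PER_TWEET):
    """Total size of num_tweets maximum-length tweets."""
    return num_tweets * bytes_per_tweet


def tweets_needed(total_bytes, bytes_per_tweet=BYTES_PER_TWEET):
    """How many maximum-length tweets it takes to fill total_bytes."""
    return total_bytes // bytes_per_tweet


print(size_in_bytes(10**9) / GB)  # 1 billion tweets, in GB
print(tweets_needed(280 * TB))    # tweets needed to fill 280 TB
```

Under those assumptions the numbers in the comment hold up: a billion full-length tweets come to 140 GB, and filling 280 TB would take about 2 trillion of them.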

  6. Obviously no programmers on right now by WillAffleckUW · · Score: 2

    Look, I don't know about you, but we process hundreds of TB of data when we process genomes, using this fancy stuff called "databases", "hash indexing", and fancy software that may be hard for you to find like Perl, C, and various scripting languages.

    It's fairly simple coding. Just build an index hash from keywords (which are all preceded by #), add another index by words (ignoring all the bit.ly and other web links), add a third index by @ reference (aka user names, which are really just a 20 character part of an SMS message), and go to town.
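
A minimal sketch of the three indexes described above, in Python rather than Perl, might look like the following. It is an in-memory toy under stated assumptions, not how the Library's system works: the function name `build_indexes`, the regular expressions, and the `(tweet_id, text)` input format are all hypothetical, and a 300 TB archive would need an on-disk or distributed index instead of plain dictionaries.

```python
import re
from collections import defaultdict

# Hypothetical patterns; real tweet tokenization has more edge cases.
URL_RE = re.compile(r"(?:https?://|bit\.ly/)\S+")  # links to ignore
HASHTAG_RE = re.compile(r"#(\w+)")                 # keywords preceded by #
MENTION_RE = re.compile(r"@(\w+)")                 # @ user references
WORD_RE = re.compile(r"[a-z0-9']+")                # plain words


def build_indexes(tweets):
    """Build hashtag, word, and mention indexes.

    tweets: iterable of (tweet_id, text) pairs.
    Returns three dicts, each mapping a lowercased key to a set of tweet ids.
    """
    by_tag = defaultdict(set)
    by_word = defaultdict(set)
    by_user = defaultdict(set)
    for tweet_id, text in tweets:
        # Drop links first so their fragments never reach the word index.
        lowered = URL_RE.sub(" ", text).lower()
        for tag in HASHTAG_RE.findall(lowered):
            by_tag[tag].add(tweet_id)
        for user in MENTION_RE.findall(lowered):
            by_user[user].add(tweet_id)
        # Strip hashtags and mentions before indexing the remaining words.
        plain = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", lowered))
        for word in WORD_RE.findall(plain):
            by_word[word].add(tweet_id)
    return by_tag, by_word, by_user


sample = [
    (1, "Archiving #history with @librarycongress http://bit.ly/abc"),
    (2, "More #history tweets"),
    (3, "@librarycongress how big is it?"),
]
by_tag, by_word, by_user = build_indexes(sample)
print(sorted(by_tag["history"]))
print(sorted(by_user["librarycongress"]))
```

With an index like this, a lookup by hashtag or user becomes a dictionary access instead of a scan over every tweet, which is the point being made about query time.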

    We do it every day.

    Now, you've got a few extra complexities, we tend to use GACT and similar short codes, but we also have to add skips, nulls, misreads, ambiguities, so it's usually 8 symbols and you're looking at an extended ASCII power.

    Still, you're getting obsessed with the size (which is nothing compared to a genome, and we have drives much much bigger than that).

    Just do it and stop thinking it's "hard". It isn't. Buy a decent Perl book for Biochemistry or Genetics and get cracking. We wrote most of the code you'll need to build new libraries from.

    --
    -- Tigger warning: This post may contain tiggers! --
  7. Data Relevance by __aablib8664 · · Score: 2

    What confuses me:

    Percentage of Americans with Accounts:
    Twitter: 13%
    Facebook: 70%

    So there is FAR less diversity and extremely poor-quality data. Why did they not archive public Facebook posts instead?

    The way I see it, Facebook hosts people who write articles, stories, poems, songs, music, pictures, etc. THAT is the point of the Library of Congress: Documenting and Preserving Culture. Not trying to datamine the history behind "WAT R U DOIN FRI GRRL?".

  8. Re:Down the Terlit by TapeCutter · · Score: 2

    Thing is, YOU don't get to define what future generations think of you and your civilization. If you want to help them form an accurate view rather than just the image you want to portray, then you need to leave some juicy rubbish dumps undisturbed; this is one such dump. I'd question the justification for the size of this particular dump, but you make it sound like they are throwing out Mark Twain to make room for Twitter. You know they have the resources to do both things at the same time, and that this project probably cost less than a single handwritten Twain manuscript, right?

    At the end of the day I see stuff like this as a GoodThing(TM). I'd much rather live in a society that over-values its everyday trivia than one that under-values its past, or worse still goes out of its way to destroy it (Taliban). The Victorian English were the ones who started the drive for museums and preserving/understanding the past. Before that nobody really bothered about social heritage; it was all about family heritage. Egypt is a great example: they started looking after their own heritage after the British showed so much interest in digging up their monuments and taking them home with them. A large portion of the old city of Cairo is made from the limestone that used to cover the pyramids (their sides were originally flat and white); all that's left is a little cap of limestone on top of the largest one. Once the pyramid-making fad and families had well and truly died, the people of Cairo simply didn't care about a huge monument built by some long-dead family. To them it was no more than a convenient source of building material that a long-dead Pharaoh had left lying around.

    --
    And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  9. Re:Unit conversion by voidphoenix · · Score: 2

    No, the 10TB estimate is incorrect. The LoC estimates that the digitized size of its print collection was around 200 TB as of 2000 CE.