The Library of Congress Will Stop Archiving Every Public Tweet On January 1st (gizmodo.com)
An anonymous reader quotes a report from Gizmodo: In 2010, the Library of Congress started archiving every single public tweet that was published on Twitter. It even retroactively acquired all tweets dating back to 2006. But the Library of Congress will stop archiving every tweet on December 31, 2017. The Library of Congress issued a white paper this month saying that it was proud of its comprehensive collection of tweets from the first 12 years of Twitter, but that it's completely unnecessary for it to continue. Instead, the organization will only collect tweets that it deems historically significant. For instance, President Trump's tweets are almost certainly still going to be saved for future generations. One reason that the Library is stopping the comprehensive archive? The social media company's controversial change to allow 280 character tweets. The Library's halt on collection of all tweets puts Twitter more in line with the way that other digital collections are archived, including websites. The Library of Congress only archives websites on a selective basis, unlike the nonprofit, non-governmental organization the Internet Archive, which has a much broader goal of archiving everything online with its Wayback Machine. The Library of Congress also noted that many tweets include photos and video and that it has only been collecting text, making some of its collection worthless.
Assuming they are only archiving text, I wonder how much storage that requires. Of course it would compress VERY well.
On a good day about 1 bit per character.
Shannon's 1951 paper "Prediction and Entropy of Printed English" tries to measure the amount of information per character in English.
He found that English carries roughly 1 bit per character (his estimates ranged from about 0.6 to 1.3 bits), so well-compressed text can be expected to take up about that much room.
The paper is a pretty interesting read if you have read his 1948 paper that defines entropy (also a good read).
He came up with some interesting experimental methods to measure entropy in English.
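For anyone who wants to poke at this, here is a rough Python sketch, not Shannon's method. It computes the entropy of single-character frequencies, which ignores context and so comes out well above 1 bit per character on ordinary English, and compares it with what zlib actually achieves. The function names and the sample string are made up for illustration.

```python
import math
import zlib
from collections import Counter


def unigram_entropy_bits(text: str) -> float:
    """Entropy in bits per character, from single-character frequencies only.

    This ignores context (which letters tend to follow which), so it is
    only an upper-ish estimate, not Shannon's prediction-based figure.
    """
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def zlib_bits_per_char(text: str) -> float:
    """Compressed size in bits divided by the number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return 8 * len(compressed) / len(text)


if __name__ == "__main__":
    # Hypothetical sample; highly repetitive, so zlib does unrealistically well.
    sample = "the quick brown fox jumps over the lazy dog " * 50
    print(f"unigram entropy: {unigram_entropy_bits(sample):.2f} bits/char")
    print(f"zlib level 9:    {zlib_bits_per_char(sample):.2f} bits/char")
```

Try it on a large chunk of real text rather than the repeated sample. A general-purpose compressor like zlib will normally land somewhere between the single-character estimate and Shannon's figure, since it only exploits fairly short-range repetition.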
I'll throw in one more data point. I developed a predictive text entry database for my previous employer, similar to the old T9 (better, obviously, since I was involved ;-) ), and for English (and similar languages) it took about 4 bits per dictionary word trained, which is less than 1 bit/char since the average word length is a bit over 5. That is a worst case, since a dictionary has no repeated words, and repeats are what compress a lot. On the other hand, the length of each word is not included in those 4 bits, so you save there (the way to think about it is that the user provides the word length, the linguistic db the rest).
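The arithmetic behind that "less than 1 bit/char" figure, as a tiny Python sketch. The function name is made up, and the 5.0 is just a round stand-in for "a bit over 5" letters per word, so the result slightly overstates the true figure.

```python
def bits_per_char(bits_per_word: float, avg_word_length: float) -> float:
    """Spread a per-word cost evenly over the letters of an average word."""
    if avg_word_length <= 0:
        raise ValueError("avg_word_length must be positive")
    return bits_per_word / avg_word_length


if __name__ == "__main__":
    # 4 bits per word over about 5 letters per word (assumed round number).
    print(bits_per_char(4, 5.0))
```

With an average word length above 5 the result only gets smaller, so 4 bits per word stays under 1 bit per character either way.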
But the idea is that English is pretty compressible...