The Library of Congress Will Stop Archiving Every Public Tweet On January 1st (gizmodo.com)
An anonymous reader quotes a report from Gizmodo: In 2010, the Library of Congress started archiving every single public tweet that was published on Twitter. It even retroactively acquired all tweets dating back to 2006. But the Library of Congress will stop archiving every tweet on December 31, 2017. The Library of Congress issued a white paper this month saying that it was proud of its comprehensive collection of tweets from the first 12 years of Twitter, but that it's completely unnecessary for it to continue. Instead, the organization will only collect tweets that it deems historically significant. For instance, President Trump's tweets are almost certainly still going to be saved for future generations. One reason that the Library is stopping the comprehensive archive? The social media company's controversial change to allow 280 character tweets. The Library's halt on collection of all tweets puts Twitter more in line with the way that other digital collections are archived, including websites. The Library of Congress only archives websites on a selective basis, unlike the nonprofit, non-governmental organization the Internet Archive, which has a much broader goal of archiving everything online with its Wayback Machine. The Library of Congress also noted that many tweets include photos and video and that it has only been collecting text, making some of its collection worthless.
Assuming they are only archiving text, I wonder how much storage that requires. Of course it would compress VERY well.
On a good day about 1 bit per character.
"His name was James Damore."
I'm actually more surprised that this data collection has gone on at the Library of Congress since 2010 than by the news that it's ending.
Now if the story had started with "The NSA...", I would've been quite shocked at its termination.
Happiness in intelligent people is the rarest thing I know.
Ernest Hemingway
Archaeologist 1: "Hey, I just discovered a message broadcast by the leader of the once great empire, United States!"
Archaeologist 2: "Marvelous! What's it say?"
Archaeologist 1: "Let's see... 'Rosie O. looks like a horse farted out a prune. Disgusting loser, so sad!'"
Archaeologist 2: "On second thought, let's pretend we never found it."
Table-ized A.I.
Twitter is little more than a digital version of some a-hole writing something on the wall of a public restroom.
Along the line of a-holes on Twitter...
Wasn't it established in a federal court that Trump's tweets amount to official statements and can be cited as effective policy statements? Further, I recall they must be preserved under the Presidential Records Act.
I don't have the citation handy and don't remember what venue it was, but perhaps someone else here can post it.
Some things never change.
1 byte. You mean 1 byte.
No. He means one bit. One byte (8 bits) is completely uncompressed. But English text will compress down by nearly 90%, which leaves about 1 bit per character.
The best compression ratios are for large texts using a consistent writing style and vocabulary, so tweets would yield less than 90% compression, but would likely still be better than 85%.
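For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. It converts a claimed space saving into bits per character and measures what the standard-library compressors actually achieve on a given text. The function names and the sample string are made up for illustration, and general-purpose compressors like these usually do not reach the roughly 1 bit per character discussed here.

```python
import bz2
import lzma
import zlib


def savings_to_bits_per_char(savings: float, raw_bits: float = 8.0) -> float:
    """Bits per character left after removing `savings` (0..1) of the raw size.

    Assumes one byte (8 bits) per uncompressed character, as in the comment above.
    """
    return raw_bits * (1.0 - savings)


def bits_per_char(text: str, compress=zlib.compress) -> float:
    """Compressed size in bits divided by the number of characters in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    raw = text.encode("utf-8")
    return 8.0 * len(compress(raw)) / len(text)


if __name__ == "__main__":
    # 90% savings leaves about 0.8 bits/char; 85% leaves about 1.2 bits/char.
    for savings in (0.90, 0.85):
        print(savings, savings_to_bits_per_char(savings))

    # Hypothetical tweet-sized sample. On input this short, container
    # overhead dominates, which is one reason short messages compress
    # worse than long, stylistically consistent texts.
    sample = "Assuming they are only archiving text, I wonder how much storage that requires."
    for name, fn in (("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)):
        print(name, bits_per_char(sample, fn))
```

To get a meaningful number, feed `bits_per_char` a large corpus rather than a single tweet; compressing tweets one at a time would give much worse ratios than compressing them in bulk.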
"Twitter is little more than a digital version of some a-hole writing something on the wall of a public restroom."
Nonetheless historians are studying the graffiti on the walls of Pompeii and Herculaneum.
https://www.smithsonianmag.com...
I say it will average 1 Library of Congress to store a Library of Congress worth of data.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Mostly a collection of advertisements and banal BS.
In hindsight, it is often the banalities that are the most interesting. Archaeologists often learn more from looking at ancient garbage dumps than from excavating palaces.
Shannon's paper "Prediction and Entropy of Printed English" tries to measure the amount of information per character in English.
He estimated that English carries roughly 1 bit per character (his bounds were about 0.6 to 1.3 bits when long contexts are taken into account), and so well-compressed text can be expected to take up about that much room.
The paper is a pretty interesting read if you have read his 1948 paper that defines entropy (also a good read).
He came up with some interesting experimental methods to measure entropy in English.
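As a rough illustration of the quantity being measured, here is a small Python sketch that estimates entropy per character from character counts alone. This is not Shannon's experimental method (he had people guess the next letter); it is just the plain frequency-based estimate, and the function names are made up. Single-character frequencies give a figure well above 1 bit because they ignore context; conditioning on preceding characters brings the estimate down.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Entropy in bits/char from single-character frequencies: -sum(p * log2(p))."""
    n = len(text)
    counts = Counter(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(text: str, order: int = 1) -> float:
    """Estimate H(next char | previous `order` chars) from n-gram counts."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(text)):
        context = text[i - order:i]
        context_counts[context] += 1
        joint_counts[(context, text[i])] += 1
    n = sum(joint_counts.values())
    # Weight each (context, char) pair by its joint probability and take the
    # log of the conditional probability of the char given the context.
    return -sum(
        count / n * math.log2(count / context_counts[context])
        for (context, _char), count in joint_counts.items()
    )
```

One caveat: with a small sample and a high `order`, most contexts are seen only once, so the estimate is biased toward zero. It takes a lot of text before the higher-order numbers mean anything.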
Those are not lies, they are "colorful and whimsical interpretations of events and alternative realities".
I'll throw in one more data point. I developed a predictive text entry database for my previous employer, similar to the old T9 (better, obviously, since I was involved ;-) ), and for English (and similar languages) it took about 4 bits per dictionary word trained, which is less than 1 bit/char since the average word length is a bit over 5. That is a worst case, since a dictionary has no repeated words that would compress a lot. However, the word's length is not included in those 4 bits, so you save there (the way to think about it is that the user supplies the word length and the linguistic db supplies the rest).
But the idea is that English is pretty compressible...
Violence is the last refuge of the incompetent. Polar Scope Align for iOS