24-Year-Old Asks Facebook For His Data, Gets 1,200 PDFs
chicksdaddy writes "Be careful of what you ask for. That's a lesson Max Schrems of Vienna, Austria learned the hard way when he sent a formal request to Facebook for a copy of every piece of personal information that the social network had collected on him, as required under European law. After a wait, the 24-year-old law student got what he was seeking: a CD with all his data stored on it — 1,222 files in all. The collection of PDFs was roughly the length of Leo Tolstoy's War and Peace, but told a more mundane story: a record of Schrems' years-long relationship with the world's largest social network, including reams of data he had deleted. Now Schrems is pushing Facebook to disclose even more of what it knows."
It should be illegal for these companies to keep user-generated content once the user deletes it.
Sure, a flood of data looks mundane, but combing it with the right filters probably tells lots of interesting stuff, like the DNA of relationships and interests. I can only hope mine is utterly meaningless. I've tried very hard to ensure that eventuality.
A feeling of having made the same mistake before: Deja Foobar
This article's summary is rather baited. I fail to see how this guy "learned the hard way". It's not like they rolled up with a truck and dumped reams of paper in the middle of his living room. He received a CD with files in an easily searchable format. I'm sure he knew going into it that he wasn't going to read through it all in a night, and it probably doesn't contain any surprises. If anything, Facebook "learned the hard way", now that they have to divulge the massive amount of data that they store, upon request, which means they must employ people to do this. Are the costs incurred outweighed by any profit produced by hoarding this particular information?
I've worked for a number of tech companies that don't actually delete anything; they simply mark the record "deleted" in the database. It's a pretty common practice that didn't really ever get talked about until it came to light that Facebook did it. Let's face it, once something is out there, it never ever really goes away, whether it be on Facebook or somewhere else.
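For anyone who hasn't seen the pattern, here is a minimal sketch of "mark the record deleted" using Python's built-in sqlite3 module. The table name `posts`, the `deleted` column, and the helper functions are all made up for illustration; nothing here is claimed to match what Facebook or any particular company actually runs.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical "posts" table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " deleted INTEGER NOT NULL DEFAULT 0)"
    )
    return conn

def soft_delete(conn, post_id):
    # The row stays in the table; only the flag changes.
    conn.execute("UPDATE posts SET deleted = 1 WHERE id = ?", (post_id,))

def visible_posts(conn):
    # What the user-facing site would show.
    rows = conn.execute("SELECT body FROM posts WHERE deleted = 0 ORDER BY id")
    return [r[0] for r in rows]

def stored_posts(conn):
    # What is still sitting in the database.
    rows = conn.execute("SELECT body FROM posts ORDER BY id")
    return [r[0] for r in rows]

conn = make_db()
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("hello",), ("regret this",)],
)
soft_delete(conn, 2)
print(visible_posts(conn))
print(stored_posts(conn))
```

The "deleted" post disappears from the first query but not from the second, which is the whole point of the complaint.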
And if the "attention" he gets convinces some people to stop using facebook or not to start using it in the first place, then he has done something worthwhile.
You are welcome on my lawn.
Yes, they're getting better, but there are inherent problems in the methodologies of stylistic analysis that make any claims of being able to identify authors based on style alone open to extreme skepticism. To put it another way, the only people claiming they can ID you based on how you write are marketing droids or snake-oil salesmen.
I did some work in a highly related field, stylochronometry. That's the measurement of change over time in a single author's style. The classic problem set for this kind of work is the Platonic corpus: people try to write algorithms to order Plato's writings chronologically. Philosophers want this information so they can trace the development of Plato's thought over time, so they give the problem to computational linguists, who try to measure things like the frequency of certain kinds of sentences or phrases or particles (hard to define words that show the relationship between sentences or, even more vaguely, give phrases "flavor") in various texts and then compare those frequencies to generate trends. There's generally an assumption that at least some of these variables will have a linear increase or decrease over time. More problematic, though, is that Plato may have gone back and edited parts of texts or entire texts, and there's some evidence (from outside these methodologies) that indicates this is the case. These problems have caused some (very rightly) to call into question the validity of stylochronometry, and the fact of the matter is that each study that's been done comes up with a different sequence in which the texts were written. It's a lot of effort being thrown at a problem in vain.
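To make the kind of measurement concrete, here is a toy sketch: count how often one marker word shows up per 1,000 words in each text, then fit a least-squares line through those rates. The sample texts, the marker word "indeed", and the function names are invented for illustration, and the order of the texts is simply assumed; real studies use many variables and have to infer the order, which is exactly where the trouble starts.

```python
import re

def rate_per_1000(text, marker):
    # Occurrences of `marker` per 1,000 words of `text`.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1000.0 * words.count(marker) / len(words)

def slope(xs, ys):
    # Ordinary least-squares slope of ys against xs.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical texts, listed in an ASSUMED chronological order.
texts = [
    "the cat sat down",
    "indeed the cat sat",
    "indeed indeed the cat",
]
positions = list(range(len(texts)))
rates = [rate_per_1000(t, "indeed") for t in texts]
print(rates)
print(slope(positions, rates))
```

A nonzero slope on three toy sentences proves nothing, of course. If the author went back and revised an early text, its rate moves and the fitted trend moves with it.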
The same problems plague the study of authorship of anonymous internet posts through stylistic analysis. On Slashdot, you can't edit, but you can on blog posts, and you can have multiple authors collaborating without attribution. There's also plagiarism to complicate the number of authors: you don't know if person X's post is entirely his own or if parts were snagged from elsewhere, which would throw an algorithm off track. Most importantly, the basic assumption of stylochronometry, that style changes with time, causes a problem for algorithms that seek to find correlation among posts that were written at different times. Worse, people change their style from day to day or hour to hour (maybe I'm babbling now because I've had a lot of rum; maybe I'm usually more concise) and from context to context (maybe I write one way when responding to some articles, but I cite more sources on others, or I troll in other environments like ZeroHedge, or I use lots of abbreviations when discussing my furry anime fetishes -- rhetoric depends on context).
Things on the internet won't be traced back to you unless you're a bot that always writes in the same style. And, you'll never discover the order in which Plato wrote his dialogues.
I think I agree with you. I never understood why people complain about what sites do when all of what they do is in the terms.
From what I can tell, pretty much everything there is to know about how your data is used by Facebook is on:
http://www.facebook.com/legal/terms
http://www.facebook.com/full_data_use_policy
http://developers.facebook.com/policy/
http://www.facebook.com/ad_guidelines.php
All that comes in at about 15,000 words. Sure, this will probably take you more than a few minutes to read and understand, unless you are Lt. Cmdr. Data. But if it is so important to you, then why not spend the time?
I have a feeling that people are either too lazy for their own good, or just like to see injustice where there is none because they like the feeling of righteous indignation.
Sorry, I don't usually rant; please, anyone, do not take this post as impugning you personally; and I am probably missing many good counter-points.
Because a user shouldn't have to read 15,000-word legal documents to understand what could be written in layman's terms, in point form, spanning just a few pages.
In addition to full legal documentation, there should be a brief summary in point form for the average user to get a basic understanding of what's what. If he then wishes, he can gain more information from the legalese docs, or otherwise agree.
I agree, it should be your choice. However, I'm one who really, really likes the idea of keeping an edit history for posts if one so chooses.
And I can understand why Facebook doesn't actually delete the data, but just flags it as hidden/deleted -- it's a real bear to update and nullify all the object id references to a post in such a mammoth system. There are links all over the place from people whose "feed" pages may reference your post. There are forwards and reposts of your post which create a commented link to your post -- does your right to delete your post mean you have the right to delete the posts of people who've commented on it?
Given that some of the content links could be in archived databases instead of mainline storage or cache, updating them could be virtually impossible.
Canada is facing the same issue with its Long Gun Registry being shut down by Harper's Conservative government -- the data is cross-linked throughout government and law enforcement systems, with over a decade of archived databases referencing the LGR databases. Truly deleting the data requires restoring the archived external databases, updating their contents to remove the references, exporting the database for an updated backup, and archiving it for storage.
Now there's the cascade effect -- any references to the archive disks now have to be updated to reference the new archive database content instead of the original.
They're currently expecting it to take over FIVE YEARS to purge that one database, and it's pitifully small compared to Facebook or Google.
Never mind the potential legal issues of external and archive systems that are mandated to be write-once by government legislation, and which have to be retained for 7-10 years in many cases.
Realistically, a versioning system or flagging content as deleted instead of purging it is the only option available for large systems that maintain historical data of any significant size.
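A rough sketch of the difference, using a tiny in-memory store where posts hold the ids of other posts they link to. The `Post` class, the `refs` field, and every function name are hypothetical, chosen only to illustrate the argument; a real system would have these references spread across many tables, caches, and archives.

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    post_id: int
    body: str
    refs: list = field(default_factory=list)  # ids of posts this one links to
    deleted: bool = False

def build_store():
    posts = [
        Post(1, "original"),
        Post(2, "repost", refs=[1]),
        Post(3, "comment", refs=[1]),
        Post(4, "unrelated"),
    ]
    return {p.post_id: p for p in posts}

def flag_deleted(store, post_id):
    # Flagging touches exactly one record.
    store[post_id].deleted = True

def hard_delete(store, post_id):
    # Purging means scanning everything for references and rewriting them.
    del store[post_id]
    touched = []
    for post in store.values():
        if post_id in post.refs:
            post.refs = [r for r in post.refs if r != post_id]
            touched.append(post.post_id)
    return touched

def resolve(store, post_id):
    # What a feed page would render for a linked post.
    post = store.get(post_id)
    if post is None:
        return None  # dangling reference
    return "[deleted]" if post.deleted else post.body

store = build_store()
flag_deleted(store, 1)
print(resolve(store, 1))

store = build_store()
print(hard_delete(store, 1))
```

With the flag, the repost and the comment keep a valid link and just render a placeholder. With the purge, every referring record has to be found and rewritten, and any copy that lives in an offline archive is missed unless it is restored first.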
I do not fail; I succeed at finding out what does not work.