Facebook Kills Dataset of Crawled Public Profiles
holy_calamity writes "Internet entrepreneur Pete Warden wrote a crawler that collated the public profiles of 210 million Facebook users and was set to release an anonymised version to researchers. The pages crawled can be read by any web user, and the robots.txt did not forbid crawling. However, Facebook claimed he had violated its terms of service and threatened legal action. Fearing costs, Warden has now destroyed his dataset. For a snapshot of the insights that data could have allowed, see Warden's post on how the friend networks of the 120 million US users in his data segregated into seven clusters." Of course, if he had it, that means anyone who wants it has made their own version.
Fearing costs, Warden has now destroyed his dataset.
Couldn't Warden have sent requests to the EFF to provide lawyers so he could fight an evil corporation to use freely publicly available information?
Then Facebook could ask the EFF to protect its users' privacy and keep their information from being sold to marketers and corporations (sorry, when you're introduced as "Internet entrepreneur" that means there's profit to be had).
My work here is dung.
...you'd be flaming them for invading your "privacy".
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
I see very little problem with an automated scan that respects robots.txt.
By not blocking automated access to the profiles, Facebook is squarely at fault.
...all the researchers who do everything in the open and with proper anonymization.
Since this is publicly available information, and all he did was send a program to go grab it (much akin to asking your web browser to download it), does this mean Facebook has essentially threatened him for no more than reading too much of Facebook too quickly? Sounds absurd to me.
Don't see Facebook going after Google, even though the data that they possess is ostensibly the same as Warden's. The primary difference that I see is that Warden was offering analysis and results for free, not trying to monetize it. Maybe that's what made them mad.
All data that exists, and that someone can sell somehow, is for sale somewhere, somehow. That's the law of money, which is rather strong. So forget the right-to-privacy law: it hasn't worked for a long time now and there is no way to enforce it. Just like the law prohibiting drugs, it just doesn't work. I don't know the solution, or if it's good or bad, but that's the situation, like it or not. Wikileaks, for example, is a result of this.
Build your own energy sources from scratch. http://otherpower.com/
Besides the obvious (wasting time, too much info being shared with future employers), their privacy and data policies have gotten worse and worse. Once you sign up with them, they own everything you do. Or at least so they believe. From his writing, this researcher was quite open and tried to be as forthcoming as possible. If they had concerns over anonymity, I suspect he would have been happy to discuss the exact data-scrubbing procedure to make sure it's on the level. But instead, these turds reach for the lawyers.
So it's fine for search engines to cache this data. It's fine for marketing firms to use it to pester even more people. But the moment the researchers get in on it - oh noes, gotta stop that shit from happening.
With any spare time, I'd sit down, recreate the damn dataset and post it to every torrent site in the world. Let's Streisand these jerks!
(not that it was actually destroyed), but why destroy the dataset? Just post to slashdot, wait for someone to send you a link to chilling effects or eff, then follow up with chilling effects or eff, then release the dataset.
-- I was raised on the command line, bitch
I'll let others debate the 'privacy' issues; (personally I think there's nothing wrong with scraping profile information that people have explicitly made 'public')
Anyways, just check what he did with it; very interesting: (FTA)
http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
There must be many, many legit uses this data could be put to... shame it's being killed by NIH syndrome.
They did something similar to FB Purity, a Greasemonkey script that allows users to filter out apps and other stuff they don't want to see in their feed. Facebook argued that they were misusing their "FB" trademark... eventually they let them continue under the name "fluff busting purity", probably due to the PR backlash that shutting them down would bring.
They've also shut down the Facebook portion of the Web 2.0 Suicide Machine, which runs scripts that allow a user to delete their social profiles as thoroughly as sites will allow. In that case, they argued that the Suicide Machine was violating their "Statement of Rights and Responsibilities"... which isn't even a law! Nonetheless, the Suicide Machine didn't have the financial ability to fight even frivolous claims like that, so they folded that section.
Facebook apparently believes that its users will continue using the site regardless of the ridiculous access policies that its legal department creates and defends. I hope they're wrong.
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
Legal action? On what grounds, and for what damages? What did this guy have to fear? Jail time? Court imposed fines? He doesn't need a lawyer to defend him in this.
I'm sorry- it is..
robots.txt allows you to "refuse a specific named bot" or "refuse everyone" or "allow everything" or "allow these directories" or "only allow these directories"
(want a fascinating read? try robots.txt at your favorite government site- whitehouse.gov used to be fascinating stuff)
there is no way in robots.txt to permit crawling based on intent of information use like a CC license does
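To make the point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`BadBot`, `ResearchBot`) and the paths are all invented for illustration and are not Facebook's actual file. It shows that the rules key on who is asking and which path they want, never on what the crawler intends to do with the data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, made up for this example (not any real site's file).
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A specific named bot is refused everywhere.
print(parser.can_fetch("BadBot", "/profiles/alice"))       # False
# Any other bot may read public paths, whatever it plans to do with them.
print(parser.can_fetch("ResearchBot", "/profiles/alice"))  # True
# Directory-level refusal applies to everyone covered by the "*" entry.
print(parser.can_fetch("ResearchBot", "/private/notes"))   # False
```

Note that nothing in `can_fetch` takes a "purpose" argument: a search indexer and a marketing scraper that use the same user-agent string get the same answer.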
I can, with photographs, have a Creative Commons license that sez "use it for anything", "use it with credit to me", "free for non-commercial", etc.
I would WANT google to see my site, I would want bing to see my site- for the purposes of indexing in a search engine.
I can't say in robots.txt
"come in and index for search engines and relevance- but you may not use the data to collect information on our membership for marketing to or marketing their info to others"
If I build a website all about, say, coffee, I want the information available to the general public, but from/on my site....
every day http://en.wikipedia.org/wiki/Special:Random
personal information - they own it!
He'll have a few recordable DVDs lying around somewhere to use when FB eventually dies or he thinks enough time has passed to anonymously float the data out on a torrent.
Somebody else will do it again, this time anonymously and with an evil robot that hides its tracks. It only takes Perl, LWP, MySQL, Tor and a little time and imagination to do so.
Fuck you, Zuckerberg.
and I really think it is worth making.
Copyright protections are important. The snippet of text that Google uses to let people know my site is relevant is easily fair use.
I don't have a problem with it. I welcome it, as it's beneficial for both me and Google for it to be there.
The ENTIRE TEXT of my site, copied and recopied into a web page that a third party runs only to generate AdSense revenue, is not.
And if robots.txt had a 'license' mode, I'd have a much stronger case for protection if I chose to pursue a blatant copying and re-publication of my site.
robots.txt labels that I wish there were include
'allow function:indexing'
'disallow function:total and complete reproduction'
'disallow function: total and complete reproduction for XXX days'
(so I can allow the Wayback Machine and equivalents)
'disallow function: aggregate data collection'
'disallow function: user data collection'
'disallow function: email collection'
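A rough sketch of how a crawler might read labels like these, if they existed. To be clear, this syntax is purely the wish list above: no robots.txt standard or real crawler supports it, and the function names (`parse_function_rules`, `permits`) and the deny-by-default choice are assumptions made up for this example.

```python
import re

# Hypothetical "function:" syntax from the wish list; not part of any real standard.
LINE_RE = re.compile(r"^(allow|disallow)\s+function:\s*(.+)$", re.IGNORECASE)

WISH_LIST = """\
allow function:indexing
disallow function: aggregate data collection
disallow function: user data collection
disallow function: email collection
"""


def parse_function_rules(text):
    """Map each declared use (e.g. "indexing") to True (allowed) or False."""
    rules = {}
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            verb, purpose = match.groups()
            rules[purpose.strip().lower()] = verb.lower() == "allow"
    return rules


def permits(rules, purpose, default=False):
    """Look up a declared use; undeclared uses fall back to `default` (deny here)."""
    return rules.get(purpose.strip().lower(), default)


rules = parse_function_rules(WISH_LIST)
print(permits(rules, "indexing"))            # True
print(permits(rules, "email collection"))    # False
print(permits(rules, "resale to marketers")) # False (never declared, so denied)
```

Like the real robots.txt, this would only ever be advisory: a crawler has to choose to declare its purpose honestly and to obey the answer.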
looking at amazon, http://www.amazon.com/robots.txt
they somewhat do this by putting the information they don't want out in the wild in its own directories
then disallowing those directories. Actually, now that I look at it, it's a neat way to go...
but I'd still prefer a robots.txt option that covered different 'intended use of crawled data' permissions
Quote: Facebook claimed he had violated its terms of service
As I understand it the information was openly available and therefore does not require you to use Facebook friend requests to get it. I fail to see how Facebook can impose a TOS on someone who accesses the site but does not use the service.
Is it assumed I agree to the TOS of Yahoo.com by visiting the frontpage? Is it assumed I agree to the TOS of any website by just visiting, even though they may not have explicitly stated I have agreed to it? If I can make people agree to a TOS without their knowledge then I am going to file a lawsuit against Facebook claiming they owe me $1,000,000 because it is in the TOS right here on my desk about them using my data.
Twilight was written by a Mormon author. That's why it shows up in your Mormon section. Apparently writing a script to scrape Facebook profiles is easy research, but looking up an entry in Wikipedia is not.
http://en.wikipedia.org/wiki/Stephenie_Meyer
The most boring of the clusters, the area around Seattle is disappointingly average.
Ignoring the legality of it for a moment, what sort of questions can we ask and answer with the Facebook data? Look how he has managed to divide the US into groups based on who is friends with whom. That's a very interesting way of dividing up a country! Stayathomia. Haha.
I, for one, wish the entire Facebook profile database was made public (with personally identifiable information removed). The benefit to researchers would be immeasurable.
This is one case where I am glad I RTFA. The dataset is destroyed, but there is still a neato little web application to play with. It's fun to poke around with... I find myself wanting more.
:)
And of course facebook wanted to shut him down... this is probably data they are collecting themselves and are selling / want to sell
You (and, sadly, many others looking to make a quick buck) seem to think that "proper" anonymization means removal of Personally-Identifiable Information (PII) from the data.
Removal of PII is neither sufficient nor, in certain cases, necessary for real anonymization. I'll leave the explanatory lecture for my next security class, but a very good rule of thumb for estimating whether an anonymization technique is adequate is whether applying that technique to all documents classified at the Secret level would yield documents suitable for declassification and public release.
If the anonymization technique you're considering would leave behind information which would require the document to remain classified at the Secret level, then it is not "proper" anonymization.
This is actually more important and relevant than you might think as post-9/11 more and more security-related Agencies need to find reliable, automatable methods of publishing (only to other Agencies, of course) the non-classified portions of their classified datasets.
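A small illustration of why stripping names is not enough. This is not the classification rule of thumb described above but a different, well-known check (k-anonymity), swapped in because it fits in a few lines. The records and field names are invented toy data.

```python
from collections import Counter

# Toy records, invented for illustration. Names are already stripped,
# which is what "removing PII" usually amounts to.
records = [
    {"zip": "98101", "birth_year": 1980, "gender": "F"},
    {"zip": "98101", "birth_year": 1980, "gender": "F"},
    {"zip": "98101", "birth_year": 1975, "gender": "M"},
    {"zip": "73301", "birth_year": 1990, "gender": "F"},
]

# Assumed "quasi-identifiers": harmless alone, identifying in combination.
QUASI_IDENTIFIERS = ["zip", "birth_year", "gender"]


def group_sizes(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def smallest_group(rows, fields):
    """The k in k-anonymity: size of the smallest indistinguishable group."""
    return min(group_sizes(rows, fields).values())


def singled_out(rows, fields):
    """Rows whose field combination is unique, so they point at one person."""
    sizes = group_sizes(rows, fields)
    return [row for row in rows if sizes[tuple(row[f] for f in fields)] == 1]


print(smallest_group(records, QUASI_IDENTIFIERS))    # 1
print(len(singled_out(records, QUASI_IDENTIFIERS)))  # 2
```

A smallest group of 1 means at least one "anonymous" row can be matched to a single person by anyone who already knows that person's zip code, birth year and gender.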
I'm not sure copyright law even applies here. No more than it applies to, say, Google or Yahoo. He scraped DATA from a publicly accessible website as permitted by the robots.txt file. How is this really any different than what Google or Yahoo does? Perhaps the distribution? Though that's hardly significant in this case as the data is already out there. He just organized the presentation. Sounds to me like Facebook is just pushing buttons to try and avoid another privacy controversy. /IANAL //Don't use Facebook; I'm all too aware of which companies are scraping and misusing what they sniff.
Finding something on the web does not give you the legal authority to publish and redistribute it.
The US doesn't have "database copyright". The US has Feist vs. Rural Telephone, which says that "facts" can't be copyrighted. It's legal to scan in a phone book and load the address info into a database. You just can't reproduce the page layout; that's covered by copyright. That decision created the third-party phone book industry and began the era of widespread data mining.
The EULA issue is harder. If you're going to mine Facebook, you probably shouldn't have a Facebook account.
I'm surprised, though, that Facebook doesn't have systems which prevent programs from accessing pages in bulk.
An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
Putting a line somewhere on your website doesn't mean it applies to everyone who visits your website.
*Reading this comment entitles the writer of this comment to compensation of no less than 100,000 USD per reading
I'll assume the check is in the mail, by your logic.
Has anyone else noticed this new banner at the top of Slashdot?
Become a fan of Slashdot on Facebook
It's funny that, for as much railing against Facebook as is done on Slashdot, Slashdot is advertising for people to become fans of it on Facebook.
An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.
But has the guy even signed up? We're not talking the Geneva Convention, here. Could facebook really impose its facebook Constitution on a non member? Sure I understand they'd want to. But wanting and having are two different things, he said, noting the absence of his army of Natalie Portman fembots.
Do you suggest that this work falls in the realm of unauthorized access? Do you think Facebook has specifically authorized Google? There are Facebook pages in Google's cache. Yahoo! has them too, as do Bing, Dogpile, redz . . . Has Facebook really authorized all of these? These sites are certainly not providing their services without an eye to making money off of them.
But I could be wrong. Every search engine provider could have a deal with every web page that its system crawls . . .
I am not a crackpot.
You are correct. Simply reading it does not mean that.
If you plan on caching and reusing the data, however, it does mean that you should check for applicable terms and copyrights.
If I see a nice picture gallery on a website, I’m welcome to click through and admire the pictures. But if I want to save them and publish them elsewhere, I’d better check the bottom of the page and/or the TOS page for any copyright notices. It’s no different.
I don’t think it falls under unauthorized access... I think it’s unauthorized use of the information.
Yeah, it’s a much trickier question since a lot of spiders have implicit authorization to use the information. Googlebot will obviously spider it and index it for Google, and this is such a well-established fact — as is the way to prevent it from doing so by robots.txt — that not actively preventing Googlebot from accessing the page is probably pretty good justification for claiming that you’re permitting Google’s use of that information.
I’d accept that excuse from any major search engine for not having explicit, personal permission from Facebook allowing it to spider Facebook pages. I wouldn’t accept any excuse like that from some unknown spider that was designed solely to index Facebook profiles.
There is no copyright on information. If he had reproduced the entire site you might have an argument. But this is raw data: words and numbers.
If you have a list of sex offenders in your area on your website, and I go to the web site and cut and paste the list, that is not copyrighted material (or any list really).
If you have a poem on your website, that can be copyrighted.
Firstly, that's a straw man. Companies use generalised data all the time for marketing purposes. And actually, I'd say you're wrong - typically the response to "privacy" rights over public material is that people have no right to privacy - especially if it's on Facebook!
Secondly, these aren't mutually exclusive. Perhaps some people might have objected to this guy doing what he's doing, but that doesn't mean that it's right to claim he's bound by some TOS.
But hey, since arguing against straw men is an easy way to get karma, allow me to say actually, you're wrong, copyright infringement isn't stealing, and Linux is better than Windows.
First of all, even if there is not a copyright on pure information, there can still be a license on its use. You were given the information under the implicit license that you are a web browser and permitted to do what web browsers do: display the information for someone to read, download, etc. If you vastly expand on that functionality or do something altogether different with the information, you are no longer within the implicit license that was given to you when the server gave you the page. Unless perhaps you gave it a User Agent string that is indicative of the fact that you’re nothing it’s ever seen before.
Second, a lot of the profile information would be considered creative and protected under copyright. Your religion? Not just a drop-down; you are free to write-in whatever you want. This answer could be as generic or creative as you choose. A few of my more creative friends wrote “Following Jesus”, “Jesus is the way, the truth and the life”, and “Jesus died for me AND you!!” (yes, I have that sort of friends). That’s not just information, it’s a short essay response. Your favourite movies? TV programs? same. The profile picture for certain is copyrightable. Name? now you might think this if anything is cut-and-dried, but a few of my friends apparently think that “Name” is an outlet for their creativity as well...
You have a valid point about #1; that is a matter of law, as I am not sure what constitutes a valid license on the use of information which is in no way limited in its availability. The problem is that even if it violates the ToS, you don't have to accept the ToS to see that data.
The second part is just wrong. Answering a question in a creative, amusing or entertaining manner is not a creative work. It's an answer to a question.
Answering a question in a creative, amusing or entertaining manner is not a creative work. It's an answer to a question.
Yes, it is. The fact is not creative, but the presentation is, and if you simply copy the presentation verbatim, you have violated the creative work.
“Christian” may be a simple fact and not copyrightable, just like phone numbers and addresses are simple fact and you cannot copyright the phone book. However in most phone books some listings are larger and use graphics, colours, and/or borders to emphasize them; this layout is creative and can be copyrighted. Similarly the phrase that someone writes to indicate that they are Christian can be copyrighted.
You can’t just cut the binding off a phone book and run it through a duplex copier. You have to scan the pages, eliminate the creative part (the way that the information was presented) and create your own presentation of those facts. Similarly to copy the facts from a social networking profile, to avoid violating the author’s creative work you’d have to sanitize all of the creative portions of the profile. For instance, if you determine from their answer to “Religion” that they are a Protestant, that is a fact and you are free to reproduce it. However the answer itself may very well be a creative work.
You are absolutely correct.
But that is not what he did. He took the raw data from FB profiles (public ones, not private ones) and then used the raw data (not the presentation) to data-mine interesting information.
There is NOTHING copyrightable about that raw data. Take a look at the links above for his article on the 'zones' in the US. It's actually quite fascinating from a sociological standpoint.
My point is simple. FB had no right to threaten copyright on the data. If he had reproduced the pages en masse, sure, that would be a violation. But the data is NOT.
My point was that what you are calling “raw data” was in fact the copyrightable presentation of raw data.
If you scanned the pages of the phone book, digitally cropped out the listing for each number (including colours, fonts, graphics, and borders for any listings containing those), re-alphabetized and re-printed those exact duplicates of the listings at 200% for low-sighted individuals (supposing the actual arrangement on the page would be completely different, since you didn’t necessarily make the paper and margins 200% larger as well)... you’d still be violating the creative work of the original publisher.
And if you took the words verbatim, sliced out the answers to the questions on their profile, and stored those in a new database... you’d be violating the creative work of the people who wrote those answers.
What is or is not a creative work may be debatable on any small, single snippet. However if you store, verbatim, millions of people’s answers to the questions on their profiles, are you violating someone’s creative work? Without question you are because even if most of them cannot be considered creative, some of them are creative.
The generic, Times New Roman 10 pt listing with no fancy colours, borders, or font styles? Not creative. The guy who listed “Christian” under his religious views? Not creative.
The listing with an eye-catching graphic, border, colours, and larger font size? Creative. You could extract the information from it, discarding all of the creative features, and you’d be fine, but you can’t make an identical copy of the listing. The guy who wrote a short paragraph under his religious views? The same: you can distill out a factual answer from his creative response, but duplicating the response verbatim would violate his authorship of a creative work.
I fail to see how he did anything wrong. If FB doesn't like it then they can change how their site works.
See subject, and then see Clone do so -> http://slashdot.org/comments.pl?sid=1591778&cid=31703134
Utterly hilarious - Clone opened up his piehole & now he can't back up his pure b.s.! He avoids answering questions at ANY cost, lol, when he knows he's f'd up here... hilarious!
they are scared of you knowing the truth.
But the EFF can't fight every battle -- they go after the groundbreaking ones, the ones that will have the highest benefit/cost ratio. It's not clear that this is such a battle.
An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.
No, but it's not a ban either.
Common sense dictates that if data is publicly accessible and not accompanied by a specific usage limitation, you can mine the data and use it for scientific purposes as fair use. This guy did not charge for his results, nor for the compiled data, so it was textbook fair use.
Remember, he did not use the collected data directly but only the relationships inferred from it. That information is the product of the crawler's compilation, not the data itself, and only the data itself can be copyrighted. It's just like the fact that you cannot copyright the mood a certain piece of music or movie puts you in, only the music or movie itself. The mood is the product of an interpretation of the music or movie, and while it may be an intended result, it is still not a part of the music or movie itself.
If only... I could copyright sappy lovesongs... Profit!!
"For every complex problem, there is a solution that is simple, neat, and wrong." -- H.L. Mencken (1880-1956) --