Facebook Kills Dataset of Crawled Public Profiles
holy_calamity writes "Internet entrepreneur Pete Warden wrote a crawler that collated the public profiles of 210 million Facebook users and was set to release an anonymised version to researchers. The pages crawled can be read by any web user, and the robots.txt did not forbid crawling. However, Facebook claimed he had violated its terms of service and threatened legal action. Fearing costs, Warden has now destroyed his dataset. For a snapshot of the insights that data could have allowed, see Warden's post on how the friend networks of the 120 million US users in his data segregated into seven clusters." Of course, if he had it, then anyone who wants it has made their own version too.
Fearing costs, Warden has now destroyed his dataset.
Couldn't Warden have asked the EFF to provide lawyers so he could fight an evil corporation over the use of freely, publicly available information?
Then Facebook could ask the EFF to protect their users' privacy and keep their information from being sold to marketers and corporations (sorry, when you're introduced as "Internet entrepreneur" that means there's profit to be had).
My work here is dung.
...you'd be flaming them for invading your "privacy".
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
I see very little problem with an automated scan that respects robots.txt.
By not blocking automated access to the profiles, Facebook is squarely at fault.
...all the researchers who do everything in the open and with proper anonymization.
Since this is publicly available information, and all he did was send a program to go grab it (much akin to asking your web browser to download it), does this mean Facebook has essentially threatened him for no more than reading too much of Facebook too quickly? Sounds absurd to me.
Don't see Facebook going after Google, even though the data that they possess is ostensibly the same as Warden's. The primary difference that I see is that Warden was offering analysis and results for free, not trying to monetize it. Maybe that's what made them mad.
I'll let others debate the 'privacy' issues; (personally I think there's nothing wrong with scraping profile information that people have explicitly made 'public')
Anyways, just check what he did with it; very interesting: (FTA)
http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
There must be many, many legit uses this data could be put to... shame it's being killed by NIH syndrome
They did something similar to FB Purity, a Greasemonkey script that allows users to filter out apps and other stuff they don't want to see in their feed. Facebook argued that they were misusing their "FB" trademark... eventually they let them continue under the name "fluff busting purity", probably due to the PR backlash that shutting them down would bring.
They've also shut down the Facebook portion of the Web 2.0 Suicide Machine, which runs scripts that allow a user to delete their social profiles as thoroughly as sites will allow. In that case, they argued that the Suicide Machine was violating their "Statement of Rights and Responsibilities"... which isn't even a law! Nonetheless, the Suicide Machine didn't have the financial ability to fight even frivolous claims like that, so they folded that section.
Facebook apparently believes that its users will continue using the site regardless of the ridiculous access policies that its legal department creates and defends. I hope they're wrong.
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
I'm sorry- it is..
robots.txt allows you to "refuse a specific named bot" or "refuse everyone" or "allow everything" or "allow these directories" or "only allow these directories"
(want a fascinating read? try robots.txt at your favorite government site- whitehouse.gov used to be fascinating stuff)
there is no way in robots.txt to permit crawling based on intent of information use like a CC license does
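To make that concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the bot names, and the example.com URLs are all made up for illustration. It shows the two things the format can express (who is asking, and which path they want), and nothing about what the crawler plans to do with the data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse one named bot entirely,
# and keep everyone else out of a single directory.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The only inputs to the decision are the user agent string and the URL path.
named_bot_allowed = parser.can_fetch("BadBot", "https://example.com/profile/1")
other_public = parser.can_fetch("ResearchCrawler", "https://example.com/profile/1")
other_private = parser.can_fetch("ResearchCrawler", "https://example.com/private/x")

print(named_bot_allowed, other_public, other_private)
```

A search indexer and a profile harvester that send the same path with an unlisted user agent get the same answer, which is the gap being described.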
With photographs, I can have a Creative Commons license that says "use it for anything", "use it with credit to me", "free for non-commercial use", etc.
I would WANT google to see my site, I would want bing to see my site- for the purposes of indexing in a search engine.
I can't say in robots.txt
"come in and index for search engines and relevance- but you may not use the data to collect information on our membership for marketing to or marketing their info to others"
If I build a website all about coffee, I want the information available to the general public, but from/on my site....
every day http://en.wikipedia.org/wiki/Special:Random
Somebody else will do it again, this time anonymously and with an evil robot that hides its tracks. It only takes Perl, LWP, MySQL, Tor and a little time and imagination to do so.
Fuck you, Zuckerberg.
and I really think it is worth making.
Copyright protections are important. The snippet of text that Google uses to let people know my site is relevant is easily fair use.
I don't have a problem with it. I welcome it, as it's beneficial for both myself and Google for it to be there.
The ENTIRE TEXT of my site, copied and recopied into a web page that exists only to generate AdSense revenue for a third party, is not.
And if robots.txt had a 'license' mode, I'd have a much stronger case for protection if I chose to pursue a blatant copying and re-publication of my site.
robots.txt labels that I wish existed include:
'allow function:indexing'
'disallow function:total and complete reproduction'
'disallow function: total and complete reproduction for XXX days'
(so I can allow the Wayback Machine and equivalents)
'disallow function: aggregate data collection'
'disallow function: user data collection'
'disallow function: email collection'
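As a rough sketch of how a cooperating crawler might read labels like these: the "function:" syntax is only the wish list above and is not part of any robots.txt standard, and the names parse_intent_rules and permitted are hypothetical helpers invented for this example.

```python
def parse_intent_rules(lines):
    """Turn hypothetical 'allow function:X' / 'disallow function:X' lines
    into a dict mapping each purpose to True (allowed) or False (refused)."""
    policy = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        verdict, _, rest = line.partition(" ")
        key, sep, purpose = rest.partition(":")
        verdict = verdict.lower()
        if verdict not in ("allow", "disallow"):
            continue  # not one of the wished-for labels
        if not sep or key.strip().lower() != "function":
            continue  # ordinary path rules are ignored by this sketch
        policy[purpose.strip().lower()] = (verdict == "allow")
    return policy


def permitted(policy, purpose, default=False):
    """Look up a purpose; anything the site did not mention is refused by default."""
    return policy.get(purpose.strip().lower(), default)


rules = [
    "allow function:indexing",
    "disallow function: aggregate data collection",
    "disallow function: email collection",
]
policy = parse_intent_rules(rules)
print(permitted(policy, "indexing"))          # search engines welcome
print(permitted(policy, "email collection"))  # harvesters are not
```

Like robots.txt itself, this would only bind crawlers that choose to honor it. What it would add is a written statement of what the site owner did and did not agree to.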
looking at amazon, http://www.amazon.com/robots.txt
they somewhat do this by putting the information they don't want out in the wild in its own directories
then disallowing those directories- actually, now that I look at it- it's a neat way to go..
but I'd still prefer a robots.txt option that covered different 'intended use of crawled data' permissions
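The directory approach can be sketched with Python's standard urllib.robotparser. The directory names, URLs, and the crawlable helper are invented for illustration and are not taken from Amazon's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: material the site wants kept out of crawls
# lives in its own directories, and those directories are disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /member-data/
Disallow: /exports/
"""


def crawlable(urls, robots_text, agent="*"):
    """Return only the URLs that the given robots.txt text lets `agent` fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


urls = [
    "https://example.com/coffee/roasting.html",
    "https://example.com/member-data/alice.html",
    "https://example.com/exports/all.csv",
]
allowed = crawlable(urls, ROBOTS_TXT, agent="ExampleBot")
print(allowed)
```

This separates content by location rather than by intended use: anything left outside the disallowed directories is open to every well-behaved crawler, whatever it does with the result.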
Someone ought to mod this up. Facebook's only value is in the information you provide to Facebook about who you are, where you live and who your connections are. As a result, they will defend that little nugget as if their life depended on it - because it does.
Those who can, do. Those who can't, sue.
If your position in entering the above motion was that "I'm right, so I should win" and offered nothing else - such as expert witnesses of your own, you are going to war unarmed. Of course you are going to lose.
The adversarial system is based on the idea that you have to defend your position. Ranting that "I'm right" doesn't count for much - presenting facts, witnesses, expert testimony, etc. is what counts. And doing so in the proper format for the court.
You are mostly correct that a lawyer would know these things and how they are done in court. Therefore, yes, almost always a lawyer is required, if for no other reason than to get through the proper procedural format of the court process. You want to do it yourself? You'd better spend some time learning how it is done, what is required to win and how to get there. Without that education, it is like taking someone who doesn't know computer programming and having them debug a program in assembly language.
Don't have the time to learn all this stuff? Well, that is why we have lawyers.
An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.