Slashdot Mirror


Facebook Kills Dataset of Crawled Public Profiles

holy_calamity writes "Internet entrepreneur Pete Warden wrote a crawler that collated the public profiles of 210 million Facebook users and was set to release an anonymised version to researchers. The pages crawled can be read by any web user, and the robots.txt did not forbid crawling. However, Facebook claimed he had violated its terms of service and threatened legal action. Fearing costs, Warden has now destroyed his dataset. For a snapshot of the insights that data could have allowed, see Warden's post on how the friend networks of the 120 million US users in his data segregated into seven clusters." Of course, if he had it, anyone who wants it has probably made their own version by now.

34 of 158 comments

  1. For an Interesting Exercise in Head Asplosion by eldavojohn · · Score: 4, Interesting

    Fearing costs, Warden has now destroyed his dataset.

    Couldn't Warden have sent requests to the EFF to provide lawyers so he could fight an evil corporation to use freely publicly available information?

    Then Facebook could ask the EFF to protect their users' privacy and information from being sold to marketers and corporations (sorry, when you're introduced as "Internet entrepreneur" that means there's profit to be had).

    --
    My work here is dung.
    1. Re:For an Interesting Exercise in Head Asplosion by paeanblack · · Score: 4, Insightful

      Couldn't Warden have sent requests to the EFF to provide lawyers so he could fight an evil corporation to use freely publicly available information?

      Finding something on the web does not give you the legal authority to publish and redistribute it. Sure, he could have stuck the whole thing on a torrent somewhere, but if he actually wants to do real work and real research with these data, he's got to play by the rules of the real world...the one with the big blue ceiling and a concept called the rule of law.

      If you don't like that reality, keep it in mind next time you vote.

    2. Re:For an Interesting Exercise in Head Asplosion by Tobor the Eighth Man · · Score: 3, Informative

      Not really a meaningful distinction, as contract law is very much an aspect of the law. We can bicker about whether terms of service are enforceable and to what extent, but the reality is that this guy has better things to do than wage a complex and almost certainly protracted legal battle against a corporation.

    3. Re:For an Interesting Exercise in Head Asplosion by Registered Coward v2 · · Score: 2, Insightful

      Couldn't Warden have sent requests to the EFF to provide lawyers so he could fight an evil corporation to use freely publicly available information?

      Finding something on the web does not give you the legal authority to publish and redistribute it. Sure, he could have stuck the whole thing on a torrent somewhere, but if he actually wants to do real work and real research with these data, he's got to play by the rules of the real world...the one with the big blue ceiling and a concept called the rule of law.

      If you don't like that reality, keep it in mind next time you vote.

      I'm not sure what he did was not legal; but the article is pretty clear he doesn't have the resources to fight it in a court and so decided to destroy it. Maybe someone with more money and time may someday decide to fight it and the legality of scraping information will be clarified by a court.

      To me, the real question is how do TOS square with robot files? Given the generally accepted and followed practice of their use; does not forbidding crawling implicitly allow the data to be collected and used as the scraper sees fit?

      If you view the data as facts; then they are not copyrightable and so aggregating them would be permissible; assuming the TOS is not binding if a scraper follows the robots.txt instructions. If that is the case, I'd guess a lot more robots.txt files would prohibit scraping.

      At any rate, I'd say the real world rules are not real clear here, other than the one that says "avoid picking a legal fight with someone who has a ton more money and lawyers than you."

      Personally, I'd be surprised if someone else doesn't already have the same data; but rather than publicize it they are simply using it however they see fit.

      --
      I'm a consultant - I convert gibberish into cash-flow.
    4. Re:For an Interesting Exercise in Head Asplosion by geekoid · · Score: 4, Insightful

      Yes, but you can collect data and publish it as such. Scientific data, not data in the computer sense.
      He should have kept his mouth shut, compiled the data, and then just submitted it to a number of journals. At that point Facebook needs to go after the journals. Facebook would have a tough time winning, and even if they did win, going after the journals would be bad PR. So no real win there. Their best bet would be to actually help him after the fact and look at the data to ensure that an "individual's privacy has not been violated".

      The data on social networking sites is amazing and could teach us a lot about human nature.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    5. Re:For an Interesting Exercise in Head Asplosion by K. S. Kyosuke · · Score: 2, Funny

      Except Facebook is claiming he violated its terms of service (a contract), not the law.

      To me, this claim seems to be as legitimate as a public library claiming that I read too many books and threatening to sue me.

      --
      Ezekiel 23:20
    6. Re:For an Interesting Exercise in Head Asplosion by dubbreak · · Score: 4, Insightful

      Not really a meaningful distinction, as contract law is very much an aspect of the law.

      If he was using an account I could see there being an enforceable contract (e.g. if you accept these terms of service we will give you an account). If he was just crawling publicly viewable facebook pages, then what is the consideration? I'd argue there is none and therefore no contract exists. You aren't forced to login to view many pages and it's not like they even have a click through "I agree" TOS on each publicly viewable page. He broke no laws and there is no enforceable contract.

      If facebook doesn't want people crawling publicly viewable pages then it should make them private (logging in required) or at least have a robots.txt that prohibits crawling of those pages.

      --
      "If you are going through hell, keep going." - Winston Churchill
    7. Re:For an Interesting Exercise in Head Asplosion by Rantastic · · Score: 2, Informative

      Finding something on the web does not give you the legal authority to publish and redistribute it.

      Nonsense.

      Allow me to call your attention to Fair use, a doctrine in United States copyright law that allows limited use of copyrighted material without requiring permission from the rights holders, such as for commentary, criticism, news reporting, research, teaching or scholarship.

      Of course, none of that is actually relevant as Facebook is not making a copyright claim. They are claiming he violated their terms of use. I just scanned it and the only seemingly relevant text I can find is

      If you collect information from users, you will: obtain their consent, make it clear you (and not Facebook) are the one collecting their information, and post a privacy policy explaining what information you collect and how you will use it.

      --
      Ask Slashdot: Where bad ideas meet poor googling skills.
    8. Re:For an Interesting Exercise in Head Asplosion by The Moof · · Score: 2, Interesting

      but if he actually wants to do real work and real research with these data, he's got to play by the rules of the real world...

      The summary says the crawler simply indexed public information. Why is this relevant? Well, recently, I noticed that Facebook Apps, all of which I have disabled and blocked via my privacy settings, have started accessing my information again. Naturally, I assumed something got reset and started hunting for the settings again. Until I found this new block of text in all of their privacy settings:

      When you visit a Facebook-enhanced application or website, it may access any information you have made visible to Everyone as well as your publicly available information. This includes your Name, Profile Picture, Gender, Current City, Networks, Friend List, and Pages. The application will request your permission to access any additional information it needs.

      So they claim they can't stop people from acquiring and using my 'publicly available' information, because it's open to the public. Then, they turn around and go after this guy for indexing and using the same 'publicly available' information.

      It all sounds a little two-faced to me.

    9. Re:For an Interesting Exercise in Head Asplosion by crashumbc · · Score: 2, Informative

      unless something has changed, you have to "login" to see anything in Facebook. Even if a page is "public" you can't view it without logging in with your own account.

      A crawler may or may not bypass that...

    10. Re:For an Interesting Exercise in Head Asplosion by clone53421 · · Score: 2, Informative

      They are claiming he violated their terms of use. I just scanned it and the only seemingly relevant text I can find is

      Here.

      --
      Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
  2. If Facebook had done this... by John Hasler · · Score: 4, Insightful

    ...you'd be flaming them for invading your "privacy".

    --
    Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    1. Re:If Facebook had done this... by 2obvious4u · · Score: 5, Interesting

      Isn't this the golden egg of Facebook? I thought this is what they were selling. That data is fascinating; it is completely anonymous, yet at the same time very insightful for marketing purposes. I think Facebook is just upset because they plan on selling the same data that Pete was.

    2. Re:If Facebook had done this... by Altus · · Score: 5, Insightful

      why do you think they threatened him? they want to sell this data themselves.

      --

      "In America, first you get the sugar, then you get the power, then you get the women..." -H. Simpson

    3. Re:If Facebook had done this... by NeutronCowboy · · Score: 4, Interesting

      Most likely. Facebook's gold mine isn't even so much the user information itself - it's the networks that they can build out of the relationship data. As of right now, they haven't figured out a way to make money from it, but they certainly aren't going to let someone take the most valuable aspect of their system - the network information - and put it out in the open.

      Personally, I hope someone does the same work, but uploads the raw data anonymously to a torrent somewhere.

      --
      Those who can, do. Those who can't, sue.
  3. Facebook *did* do this by Chirs · · Score: 5, Insightful

    I see very little problem with an automated scan that respects robots.txt.

    By not blocking automated access to the profiles, facebook is squarely at fault.

  4. Yes, by all means, let's stamp out... by jeffb (2.718) · · Score: 3, Insightful

    ...all the researchers who do everything in the open and with proper anonymization.

    1. Re:Yes, by all means, let's stamp out... by Anonymous Coward · · Score: 2, Interesting

      Even with names removed, data like this can often be traced back to the person. Your name isn't the only unique thing that appears in your facebook profile.

      As an example, how many others share your permutation of friends and fan pages?
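The point about combinations of friends and fan pages can be made concrete with a small sketch. The records below are made up for illustration (the ids, friend lists and page names are all hypothetical, and this is not how Warden's dataset was structured); the function only measures what share of name-stripped profiles still have a combination nobody else shares.

```python
from collections import Counter

# Hypothetical, made-up records: names are already removed, only an opaque
# id, a friend list and a fan-page list remain.
PROFILES = [
    {"id": "u1", "friends": {"u2", "u3"}, "pages": {"coffee", "perl"}},
    {"id": "u2", "friends": {"u1"}, "pages": {"coffee"}},
    {"id": "u3", "friends": {"u1"}, "pages": {"coffee"}},
    {"id": "u4", "friends": set(), "pages": {"perl", "tor"}},
]


def fingerprint(profile):
    """Combine the friend set and page set into one hashable value."""
    return (frozenset(profile["friends"]), frozenset(profile["pages"]))


def unique_share(profiles):
    """Fraction of profiles whose friends+pages combination is unique."""
    if not profiles:
        return 0.0
    counts = Counter(fingerprint(p) for p in profiles)
    unique = sum(1 for p in profiles if counts[fingerprint(p)] == 1)
    return unique / len(profiles)


if __name__ == "__main__":
    print(unique_share(PROFILES))
```

In this toy data only u2 and u3 are indistinguishable from each other; anyone who already knows a target's friends and pages could pick out the other two records despite the missing names.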

    2. Re:Yes, by all means, let's stamp out... by thePowerOfGrayskull · · Score: 2, Informative

      Removing names isn't necessarily enough. The recent Netflix case shows that. I think it's interesting that nobody catches the broader implications of that discussion, namely that whether they're "anonymizing" data for purposes of providing it for research, or selling it for marketing... the ability to reverse engineer patterns to undo it remains a risk.

  5. Publicly available by mdsharpe · · Score: 5, Interesting

    Since this is publicly available information, and all he did was send a program to go grab it (much akin to asking your web browser to download it), does this mean Facebook has essentially threatened him for no more than reading too much of Facebook too quickly? Sounds absurd to me.

    1. Re:Publicly available by CoffeeDog · · Score: 2, Insightful

      Just because something is publicly available doesn't mean just anyone is free to reproduce and distribute it. In Facebook's TOS their users agree to give Facebook rights to distribute the data they provide to them. By your logic it should be legal to photocopy and distribute any book that is available from the public library or record and distribute MP3s of any song that was broadcast on a radio station.

  6. chilling effect by Anonymous Coward · · Score: 5, Insightful

    Don't see Facebook going after Google, even though the data that they possess is ostensibly the same as Warden's. The primary difference that I see is that Warden was offering analysis and results for free, not trying to monetize it. Maybe that's what made them mad.

  7. Very interesting by Bearhouse · · Score: 2, Informative

    I'll let others debate the 'privacy' issues; (personally I think there's nothing wrong with scraping profile information that people have explicitly made 'public')
    Anyways, just check what he did with it; very interesting: (FTA)
    http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
    There must be many, many legit uses this data could be put too...shame it's being killed by NIH syndrome

    1. Re:Very interesting by Bearhouse · · Score: 2, Funny

      ahem, put 'to', of course...

  8. Facebook does stuff like this a lot by TheSpoom · · Score: 5, Interesting

    They did something similar to FB Purity, a Greasemonkey script that allows users to filter out apps and other stuff they don't want to see in their feed. Facebook argued that they were misusing their "FB" trademark... eventually they let them continue under the name "fluff busting purity", probably due to the PR backlash that shutting them down would bring.

    They've also shut down the Facebook portion of the Web 2.0 Suicide Machine, which runs scripts that allow a user to delete their social profiles as thoroughly as sites will allow. In that case, they argued that the Suicide Machine was violating their "Statement of Rights and Responsibilities"... which isn't even a law! Nonetheless, the Suicide Machine didn't have the financial ability to fight even frivolous claims like that, so they folded that section.

    Facebook apparently believes that its users will continue using the site regardless of the ridiculous access policies that their legal department create and defend. I hope they're wrong.

    --
    It's better to vote for what you want and not get it than to vote for what you don't want and get it.
    - E. Debs
    1. Re:Facebook does stuff like this a lot by Anonymous Coward · · Score: 5, Insightful

      They're not wrong though. People on FB constantly get outraged at new policies, interfaces and features, but I don't know of anyone who has actually left the site. I am just as bad myself; all I've done is remove everything from my profile and just use it as a hub to stay in contact with people all around me, I haven't gone as far as stopping using the site, and I don't think I will. Nor will many people.

    2. Re:Facebook does stuff like this a lot by flabordec · · Score: 2, Insightful

      Facebook apparently believes that its users will continue using the site regardless of the ridiculous access policies that their legal department create and defend. I hope they're wrong.

      I'm afraid the average Facebook user is a teen who is more worried about getting a higher score in whatever Flash game she is currently playing than about FB's access policies for computers.

      --
      "I see undead people" Warcraft III - Necromancer
  9. Robots.txt is insufficient. by way2trivial · · Score: 4, Interesting

    I'm sorry- it is..

    robots.txt allows you to "refuse a specific named bot" or "refuse everyone" or "allow everything" or "allow these directories" or "only allow these directories"
    (want a fascinating read? try robots.txt at your favorite government site- whitehouse.gov used to be fascinating stuff)
    there is no way in robots.txt to permit crawling based on intent of information use like a CC license does

    I can- with photographs, have a creative commons license that sez "use it for anything" "use it with credit to me" "free for non-commercial" etc.
    I would WANT google to see my site, I would want bing to see my site- for the purposes of indexing in a search engine.
    I can't say in robots.txt
    "come in and index for search engines and relevance- but you may not use the data to collect information on our membership for marketing to or marketing their info to others"

    If I build a website all about-- coffee- I want the information available to the general public, but from/on my site....
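A minimal sketch of that limitation, using Python's standard `urllib.robotparser`. The robots.txt text, the bot names and the example.com URLs are all made up for illustration. The rules can only key on a user-agent string and a path, so two crawlers with very different intentions get the same answer unless one of them is named explicitly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary coffee site: refuse one named
# bot entirely, and keep everyone out of one directory.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


PARSER = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    page = "http://example.com/coffee/beans.html"
    # A search indexer and a marketing harvester (both made-up names) are
    # treated identically, because intent is not part of the format.
    for agent in ("SearchIndexer", "MarketingHarvester", "BadBot"):
        print(agent, PARSER.can_fetch(agent, page))
```

Note that `can_fetch` is something the crawler calls on itself; nothing in the file is enforced by the server.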

    --
    every day http://en.wikipedia.org/wiki/Special:Random
    1. Re:Robots.txt is insufficient. by truthsearch · · Score: 2, Informative

      So you block all of your content from being indexed by Google? Because Google's also using your content for marketing.

      Also, robots.txt doesn't refuse anything to anyone. It's just a suggestion that any system can ignore. If you don't want systems "seeing" your content, then you must remove your content from the internet or put it behind a wall. A crawler is just another client like a web browser. The internet is intentionally built without discrimination.

  10. Don't worry... by turbotroll · · Score: 3, Interesting

    Somebody else will do it again, this time anonymously and with an evil robot that hides its tracks. It only takes perl, LWP, MySQL, tor and a little time and imagination to do so.

    Fuck you, Zuckerberg.

  11. You are missing my point by way2trivial · · Score: 3, Interesting

    and I really think it is worth making.

    Copyright protections are important, the snippet of text that google uses to let people know my site is relevant is easily fair use
    I don't have a problem with it- I welcome it as it's beneficial for both myself and google for it to be there.

    the ENTIRE TEXT of my site- copied and recopied to put into a web page that exists only to generate ad-sense revenue by a third party is not.
    and if robots.txt had a 'license' mode, I'd have a much stronger case of protections if I chose to pursue a blatant copying and re-publication of my site.

    robots.txt labels that I wish there were include
    'allow function:indexing'
    'disallow function:total and complete reproduction'
    'disallow function: total and complete reproduction for XXX days'
    (so I can allow wayback machine and equivalents)
    'disallow function: aggregate data collection'
    'disallow function: user data collection'
    'disallow function: email collection'

    looking at amazon, http://www.amazon.com/robots.txt
    they somewhat do this by putting the information they don't want out in the wild in its own directories
    then disallowing those directories- actually, now that I look at it- it's a neat way to go..
    but I'd still prefer a robots.txt option that covered different 'intended use of data to be crawled' permissions
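A rough sketch of the directory approach next to one of the wished-for labels, again with Python's standard `urllib.robotparser`. The directory names and URLs are hypothetical and are not taken from Amazon's actual file. The extra `Disallow function:` line is one of the invented labels from the list above, not real robots.txt syntax; as far as I can tell the standard-library parser simply skips lines whose key it does not recognise, so adding it changes nothing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: sensitive data lives in its own directories, and
# those directories are disallowed for every crawler.
STANDARD_RULES = """\
User-agent: *
Disallow: /members/
Disallow: /friend-lists/
"""

# Same file plus one wished-for label. This is NOT valid robots.txt syntax.
WISHED_RULES = STANDARD_RULES + "Disallow function: email collection\n"

URLS = [
    "http://example.com/coffee/roasting.html",
    "http://example.com/members/42",
    "http://example.com/friend-lists/42",
]


def verdicts(robots_text, agent="AnyCrawler"):
    """Map each sample URL to whether the given agent may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {url: parser.can_fetch(agent, url) for url in URLS}


if __name__ == "__main__":
    print(verdicts(STANDARD_RULES))
    print(verdicts(WISHED_RULES) == verdicts(STANDARD_RULES))
```

So the directory split does keep well-behaved crawlers away from the sensitive paths, but an intent-based permission would need a new convention that crawlers agree to honour.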

    --
    every day http://en.wikipedia.org/wiki/Special:Random
  12. Re:You see, Facebook doesn't only control your... by NeutronCowboy · · Score: 2, Insightful

    Someone ought to mod this up. Facebook's only value is in the information you provide to Facebook about who you are, where you live and who your connections are. As a result, they will defend that little nugget as if their life depended on it - because it does.

    --
    Those who can, do. Those who can't, sue.
  13. Re:On what grounds? by cdrguru · · Score: 2, Informative

    If your position in entering the above motion was that "I'm right, so I should win" and offered nothing else - such as expert witnesses of your own, you are going to war unarmed. Of course you are going to lose.

    The adversarial system is based on the idea that you have to defend your position. Ranting that "I'm right" doesn't count for much - presenting facts, witnesses, expert testimony, etc. is what counts. And doing so in the proper format for the court.

    You are mostly correct that a lawyer would know these things and how they are done in court. Therefore, yes, almost always a lawyer is required, if for no other reason than to get through the proper procedural format of the court process. You want to do it yourself? You better spend some time learning how it is done, what is required to win and how to get there. Without that education, it is like taking someone that doesn't know computer programming and having them debug a program in an Assembler language.

    Don't have the time to learn all this stuff? Well, that is why we have lawyers.

  14. Statement of Rights and Responsibilities, sec. 3-2 by clone53421 · · Score: 2, Interesting

    You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.

    An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.

    --
    Alexander Peter Kristopeit bought his basement from his mommy for one dollar.