Slashdot Mirror


LinkedIn Says It's Illegal To Scrape Its Website Without Permission (arstechnica.com)

A small company called hiQ is locked in a high-stakes battle over web scraping with LinkedIn. It's a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the web. From a report: HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting. LinkedIn, which was acquired by Microsoft last year, sent hiQ a cease-and-desist letter warning that this scraping violated the Computer Fraud and Abuse Act, the controversial 1986 law that makes computer hacking a crime. HiQ sued, asking courts to rule that its activities did not, in fact, violate the CFAA. James Grimmelmann, a professor at Cornell Law School, told Ars that the stakes here go well beyond the fate of one little-known company. "Lots of businesses are built on connecting data from a lot of sources," Grimmelmann said. He argued that scraping is a key way that companies bootstrap themselves into "having the scale to do something interesting with that data." [...] But the law may be on the side of LinkedIn -- especially in Northern California, where the case is being heard. In a 2016 ruling, the 9th Circuit Court of Appeals, which has jurisdiction over California, found that a startup called Power Ventures had violated the CFAA when it continued accessing Facebook's servers despite a cease-and-desist letter from Facebook.

28 of 167 comments

  1. then dont' make it public by Anonymous Coward · · Score: 5, Insightful

    don't make it public if you don't want it read

    1. Re:then dont' make it public by Anonymous Coward · · Score: 5, Interesting

      don't make it public if you don't want it read

      They want it read. By people. (And search engines.) They don't want it read by companies that take the information and then sell it as their business model.

      If we support hiQ, saying that scraping publicly-accessible content from another site and then using that for profit is permissible, then doesn't that mean it's also applicable to other sites? Slashdot's content is public: can I scrape everything, host it on my site, insert ads, and make money?

      Sorry hiQ, as much as software and internet legislation is behind the times and technically inappropriate, there are some things in law which follow common sense - and one of them is you can't take someone else's stuff and sell it for yourself. If you want to use their content then you need to follow the (common) practice of establishing some sort of licensing agreement.

      But anyways, what about their user agreement?

      You agree that you will not: [...] Develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology or manual work) to scrape the Services or otherwise copy profiles and other data from the Services;

      Is that not enough for at least an injunction and civil suit?

    2. Re:then dont' make it public by BronsCon · · Score: 3, Insightful

      They don't want it read by companies that take the information and then sell it as their business model.

      What do search engines do, then?

      --
      APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
    3. Re:then dont' make it public by sexconker · · Score: 4, Insightful

      No, only one side has legitimacy.

      If you complain about people using information you post PUBLICLY, you are an idiot.
      This doesn't even rise to copyright infringement.

    4. Re:then dont' make it public by smooth+wombat · · Score: 3, Insightful

      "we want to let search engines use it without license, but want to require a license for anyone else" attitude.

      No, that is not correct. Search engines point to a page and may give a very brief line or so from the article, but one still has to click on the link to go to the real page and read everything.

      hiQ goes to the LinkedIn site and, rather than pointing to the pages in question, takes the data, packages it, and then sells it to someone else, having left LinkedIn to do all the heavy lifting.

      The two are not close.

      --
      We will bankrupt ourselves in the vain search for absolute security. -- Dwight D. Eisenhower
    5. Re:then dont' make it public by BronsCon · · Score: 4, Interesting

      The two are not close.

      They really are, though. LinkedIn has copyright on all of their content, in whole and in part, not just as a whole. That's how copyright works, otherwise I could change a single word in a book and republish it as an original work under its own copyright. It is also important to keep in mind that (most) search engines -- and Google specifically -- don't just grab the page title, META description (or first couple lines of content) and a word/phrase count, they grab the entire content of the page, and they do so in order to display the exact part of the content that contains your search term(s) -- as I mentioned earlier -- rather than a likely irrelevant summary or intro.

      To do this, search engines must necessarily use the entire page and not just key pieces of data. That is, Google et al. get away with using more of LinkedIn's pages without license than hiQ is using. Therein lies the problem.

      --
      APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
    6. Re:then dont' make it public by buss_error · · Score: 2

      No, that is not correct.

      I'm of two minds about LinkedIn.
      In the first place, I'm required to have an account by my current employer.
      In the second place, LinkedIn in my opinion does a ton of scraping themselves (asking to access your mailbox contacts, for instance). But at least LinkedIn ASKS to access it. Still, it feels creepy to me. The "psycho" girlfriend kind of creepy.

      On the third hand, LinkedIn told them to stop. So they should stop.

      --
      Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
    7. Re:then dont' make it public by BronsCon · · Score: 2

      Actually, anything you're able to view from a public space is fair game under current laws, with the exception of court orders stating otherwise. If hiQ's servers can view the content from the public internet (that is, if LinkedIn's servers serve it to them without them hacking around some technical measure), it's fair game unless LinkedIn gets an injunction against hiQ. That is, what you're claiming is really for the courts to decide.

      Or, you know, LinkedIn could just claim copyright on their data and issue a series of DMCA notices.

      IANAL but I've consulted several regarding a very similar issue in the past.

      --
      APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
    8. Re:then dont' make it public by alzoron · · Score: 3, Insightful

      This is not a copyright issue. This is a CFAA issue. It has long been established that you cannot copyright facts. The CFAA deals with unauthorized access to computer systems. LinkedIn told these companies to stop doing it and they kept doing it. That's a pretty clear case of unauthorized access.

    9. Re:then dont' make it public by AK+Marc · · Score: 2

      Then why didn't they file a copyright complaint? Instead, they are claiming "hacking" for viewing public information. (not copyright for using it, but "hacking" for viewing). Copyright is irrelevant, and not the complaint.

    10. Re:then dont' make it public by Wootery · · Score: 2

      That something is 'in public' doesn't mean you're free to copy it.

      Walk around a city and you might see countless TVs. That doesn't mean you're allowed to record them and sell the videos - that's still copyright infringement.

    11. Re:then dont' make it public by BronsCon · · Score: 2

      Look again, they're given a lot of explicit restrictions and a handful of explicit permissions. In Google's case those are limited to:
      Allow: /psettings/guest-controls*
      Allow: /psettings/guest-email-unsubscribe*
      Allow: /psettings/sms-unsubscribe*
      Allow: /psettings/guest-controls/retargeting-opt-out*
      Allow: /settings/loid-email-unsubscribe-router*
      Allow: /settings/loid-email-unsubscribe*
      Allow: /help/

      For reference, the first 6 are pages where one can unsubscribe from various forms of marketing and the last is LinkedIn's support section. Anything else Google indexes (and they have indexed a LOT of LinkedIn's content) is without explicit permission, possibly even contrary to the 45 explicit restrictions they've been given. For example, I found this in Google's index, and /profile/ is listed as a Disallow rule.

      Most of the search engines listed in that robots.txt have the same set of rules as Google. The only obvious exception is deepcrawl, which also has the following Allow rules:
      # Profinder only for deepcrawl
      Allow: /profinder*
      Allow: /profinder/*
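
      To make the Allow/Disallow reading above concrete, here is a minimal sketch using Python's standard urllib.robotparser. It only uses the two non-wildcard rules quoted in this comment (Allow: /help/ and a Disallow on /profile/); the "Googlebot" user-agent string, the example.com host, and the helper name allowed() are assumptions for illustration, not copied from LinkedIn's actual robots.txt. As far as I know the stdlib parser does plain prefix matching and does not expand the trailing * wildcards, which is why the wildcard rules are left out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt: only the two non-wildcard rules mentioned in the comment.
RULES = """\
User-agent: Googlebot
Allow: /help/
Disallow: /profile/
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)  # parse rules from memory; no network access needed


def allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under RULES."""
    # The host is a placeholder; only the path part is matched against the rules.
    return parser.can_fetch(agent, "https://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/help/faq", "/profile/view"):
        print(path, allowed(path))
```

      Note that robots.txt is advisory: a crawler that ignores it is not technically blocked, which is the point being argued in this thread.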

      --
      APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
  2. I've done several scraping projects by GerryGilmore · · Score: 3, Interesting

    Using some add-on Python packages it is ridiculously easy to scrape any web page, even those that use ASP (it's a PITA to get set up the first time, but...). The ONLY thing that stops it - aside from legal action, apparently - is to have a login mechanism in front. Without authenticating, it's no-go.
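
    As a rough illustration of how little code this takes, here is a sketch that pulls every link out of a page using only Python's standard html.parser. It is not the add-on packages the comment refers to (those are unnamed), and the sample HTML, the /in/... paths, and the names LinkCollector and extract_links are made up for the example. It parses a string, so nothing is fetched from any site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page content standing in for a fetched profile listing.
SAMPLE = (
    '<html><body>'
    '<a href="/in/alice">Alice</a> <a href="/in/bob">Bob</a>'
    '</body></html>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```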

    1. Re:I've done several scraping projects by iggymanz · · Score: 4, Interesting

      hahaha, you imagine login is a cure?

      no, scripts can log in. with sites having millions of users you can make as many logins as you need, it's a whack-a-mole the site can't win

    2. Re:I've done several scraping projects by im_thatoneguy · · Score: 3, Informative

      You can have terms of service though on a login to make it easily illegal.

      "By logging in you agree to not republish data that you view."

    3. Re:I've done several scraping projects by Gr8Apes · · Score: 4, Informative

      That's not illegal, that's merely a violation of the user agreement.

      --
      The cesspool just got a check and balance.
    4. Re:I've done several scraping projects by Zero__Kelvin · · Score: 2, Insightful

      Which gives them standing in court. It *might* not be a crime but it creates a contract that doesn't exist without it. This is far from the first time a company has tried the old "The Internet doesn't work the same way for us as it does for the rest of the world. Callsies, no take-backs!" defense.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    5. Re:I've done several scraping projects by im_thatoneguy · · Score: 2

      Not criminal but breach of contract is grounds for a civil cause of action.

    6. Re: I've done several scraping projects by Zero__Kelvin · · Score: 2

      You couldn't be more wrong.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    7. Re:I've done several scraping projects by Zero__Kelvin · · Score: 2

      That's not correct. You directing your son to click the button is no different than you directing him to commit a crime. The culpability and responsibility rests with you. You will be held to the contract in the former case and charged with a crime in the latter. Great parenting though!

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  3. Happens in other industries too by ErichTheRed · · Score: 5, Interesting

    Airline websites have this same problem -- the online "cheap ticket" engines regularly scrape the publicly available data by essentially running the "book a trip" workflow millions of times to try to pull the entire set of fares for different city pairs. It's a cat-and-mouse game because the information has to be available for normal humans to book trips; no one is going to solve a CAPTCHA to look up fares. Basically these engines are looking for any irregularities like mis-filed fares or fares that happen to be a particularly good deal. (Airlines have to publish their fares in advance and make them available to online sources that are available to travel agents. This is why you'll occasionally see stuff like a transatlantic business class ticket for $50 or similar...)

    I'm not sure if LinkedIn can actually bar someone from scraping their public data. If that was the case, no one could run wget on a website and pull down all the static content.

  4. This is bonkers! by Zobeid · · Score: 4, Interesting

    Here's why it seems bonkers to me. . . When you access a website, you are merely sending that site a request for information. That's all. Assuming it responds with the requested information, one must presume that's because the operator (and, by proxy, the owner) of the website set it up for that purpose. So what we have here is effectively. . .

    LinkedIn: Don't request information from us!

    hiQ: Please send the following information.

    LinkedIn: OK, here you go.

    LinkedIn: Dammit, you requested information after we told you not to! WE'RE GONNA SUE!!

    1. Re:This is bonkers! by bluefoxlucid · · Score: 5, Interesting

      Actually, LinkedIn has a point.

      LinkedIn supplies service to the public at-large, in the same way that a MicroCenter supplies retail service to the public at-large. All members of the public are allowed to enter a MicroCenter. You walk up to the doors and they open automatically.

      You can be trespassed for no reason by a retail center or other physical location open to the public at-large. The doors still open to you, but you're not allowed in. It's the same with a Web site: it's difficult in-practice to establish a verifiable packet identity on the Internet. IP addresses change, and you can do goofy shit like put the data scrapes in AJAX requests to distribute their source.

      In other words: you're by default authorized to access LinkedIn's public assets. You're not allowed to access stuff requiring a logged-in session until you've gotten log-in credentials, because there are actual systems in place to stop you from doing that, implying that you're not supposed to force access there. Basically, civilized understanding of the expectations of your host on the face.

      If LinkedIn tells you to stop, you've now had your authorization revoked. You can't claim a restraining order is invalid because someone's outside and you can also be anywhere outside, and you also can't claim that LinkedIn can't de-authorize you unless they specifically identify and block you. Blocking an individual entity from a Web site is hard and has collateral damage.

      So the CFAA is actually a valid vehicle here, since "abuse" is essentially defined as "accessing a system to which you are not authorized." The reasonable person test holds up a lot of behavior, largely because it's unreasonable for a person to determine if a certain behavior or function on a Web site might not be something they're allowed to touch, or whatnot, given the reasonable behavior of people at-large. A lot of stuff happens that won't pass CFAA as fraud or abuse, even though it's inconvenient and unintended. By the same token, when somebody has told you to stop accessing their systems in a certain way and you do it anyway, a reasonable person might assume you were, you know, told not to, and not allowed to do that, and that you know damned well you're not allowed to do that.

      That's not to say threats, lawyers, and other anti-social behavior are good business. Poor diplomacy here. Effective in the legal field, but not your best option.

    2. Re:This is bonkers! by tattood · · Score: 3, Insightful

      Then blacklist IP's at the firewall(s) for endpoints that are scraping your site.

      IP addresses are fairly easy to change. You can use something like Tor, so your public IP always changes.

      --
      WTB [sig], PST!!!
    3. Re:This is bonkers! by bluefoxlucid · · Score: 2

      Let's try this again.

      it's difficult in-practice to establish a verifiable packet identity on the Internet. IP addresses change, and you can do goofy shit like put the data scrapes in AJAX requests to distribute their source.

      Blocking an individual entity from a Web site is hard and has collateral damage.

      Wikipedia has tried this, with collateral damage and limited success. I've seen people get sent to jail for harassment and legally barred from accessing certain sites and systems under restraining order, and then continue to access them with no reasonable way to prove their identity (i.e. could be someone else pretending to be said person).

      These days, it's different. Those IP addresses are probably automatically-assigned or internal to cloud infrastructure. IAAS may share addresses across clients. The IPs may appear from a range of hundreds of subnets coming from auto-scaling AWS infrastructure, constantly provisioning and releasing addresses.

      In other words: "Block it at the firewall" can easily mean "Block everything coming from AWS, Azure, DigitalOcean, and all other data centers all over the world." Difficult (nigh-impossible) and prone to huge amounts of collateral damage.
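
      A small sketch of why range blocks cause collateral damage, using Python's standard ipaddress module. The blocked range is a reserved documentation prefix standing in for a cloud provider's subnet, and BLOCKED and is_blocked are hypothetical names; a real firewall would be configured elsewhere, this only shows the matching logic.

```python
import ipaddress

# Hypothetical blocklist: one /24 standing in for a cloud provider subnet.
BLOCKED = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(addr):
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)


if __name__ == "__main__":
    # The scraper and an unrelated tenant in the same range are both blocked.
    print(is_blocked("203.0.113.7"), is_blocked("203.0.113.200"))
    # An address outside the range gets through, e.g. after the scraper moves.
    print(is_blocked("198.51.100.7"))
    # Number of addresses swept up by this single rule.
    print(BLOCKED[0].num_addresses)
```

      One rule aimed at one scraper covers every address in the range, and the scraper only needs a new address outside it to carry on.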

      Then: the courts have already told you this is a matter of you having a public Web site, and you can deal with them "accessing" it yourself because you apparently have no right to tell people they're not allowed in due to their use of your published information. Now you have people jumping from address to address, and you're forced to play firewall whack-a-mole.

  5. Exactly! by Anonymous Coward · · Score: 2, Informative

    I refuse to use any social media site including LinkedIn. A lot of companies - such as Goodwill - recruit exclusively from LinkedIn. Fuck'em.

    I don't work for any company that uses social media for recruiting.

  6. Re:CFAA does not apply by russotto · · Score: 2

    It has not been tested in court that the CFAA covers violating terms of use.

    Yes, it has, but only in the Central District of California as far as I know. The interpretation that the CFAA covers violating TOS was found to be overbroad in U.S. v. Drew, 259 F.R.D. 449 (C. D. Cal. 2009).

  7. Wrong! by www.sorehands.com · · Score: 3, Informative

    The CFAA applies immediately or when the defendant (or defendant to be) exceeds the permitted access. Permission could also be revoked through a cease and desist letter. See Facebook, Inc. v. Power Ventures, Inc., No. 13-17102 (9th Cir. July 12, 2016) https://cdn.ca9.uscourts.gov/d...

    You are permitted to grant different people different terms or access. Look at https://qz.com/981029/a-federa...