LinkedIn Says It's Illegal To Scrape Its Website Without Permission (arstechnica.com)
A small company called hiQ is locked in a high-stakes battle over web scraping with LinkedIn. It's a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the web. From a report: HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting. LinkedIn, which was acquired by Microsoft last year, sent hiQ a cease-and-desist letter warning that this scraping violated the Computer Fraud and Abuse Act, the controversial 1986 law that makes computer hacking a crime. HiQ sued, asking courts to rule that its activities did not, in fact, violate the CFAA. James Grimmelmann, a professor at Cornell Law School, told Ars that the stakes here go well beyond the fate of one little-known company. "Lots of businesses are built on connecting data from a lot of sources," Grimmelmann said. He argued that scraping is a key way that companies bootstrap themselves into "having the scale to do something interesting with that data." [...] But the law may be on the side of LinkedIn -- especially in Northern California, where the case is being heard. In a 2016 ruling, the 9th Circuit Court of Appeals, which has jurisdiction over California, found that a startup called Power Ventures had violated the CFAA when it continued accessing Facebook's servers despite a cease-and-desist letter from Facebook.
don't make it public if you don't want it read
I guess they don't care about being found on a search engine.
Because if it's not illegal to scrape their websites, black hat hackers will have a field day.
#DeleteFacebook
Using some add-on Python packages, it is ridiculously easy to scrape any web page, even those that use ASP (it's a PITA to get set up the first time, but...). The ONLY thing that stops it - aside from legal action, apparently - is to have a login mechanism in front. Without authenticating, it's a no-go.
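To give a sense of how little code this takes, here is a minimal sketch using only Python's standard library. The poster doesn't name the add-on packages, so this is not their setup; the sample markup, the /profile/ paths, and the names LinkCollector and extract_links are all made up for illustration. It parses HTML that has already been downloaded and pulls out the link targets.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup standing in for a page that was fetched without logging in.
SAMPLE_PAGE = """
<html><body>
  <a href="/profile/alice">Alice</a>
  <a href="/profile/bob">Bob</a>
  <a name="no-href">anchor only</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
```

Anything behind a login never reaches this step unless the scraper authenticates first, which is the point about a login mechanism being the only real barrier.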
Airline websites have this same problem -- the online "cheap ticket" engines regularly scrape the publicly available data by essentially running the "book a trip" workflow millions of times to try to pull the entire set of fares for different city pairs. It's a cat-and-mouse game because the information has to be available for normal humans to book trips; no one is going to solve a CAPTCHA to look up fares. Basically these engines are looking for any irregularities like mis-filed fares or fares that happen to be a particularly good deal. (Airlines have to publish their fares in advance and make them available to online sources that are available to travel agents. This is why you'll occasionally see stuff like a transatlantic business class ticket for $50 or similar...)
I'm not sure if LinkedIn can actually bar someone from scraping their public data. If that were the case, no one could run wget on a website and pull down all the static content.
Here's why it seems bonkers to me. . . When you access a website, you are merely sending that site a request for information. That's all. Assuming it responds with the requested information, one must presume that's because the operator (and, by proxy, the owner) of the website set it up for that purpose. So what we have here is effectively. . .
LinkedIn: Don't request information from us!
hiQ: Please send the following information.
LinkedIn: OK, here you go.
LinkedIn: Dammit, you requested information after we told you not to! WE'RE GONNA SUE!!
If you read the act, it is clear that it applies to financial and government systems. It has not been tested in court that the CFAA covers violating terms of use. You really only need to go to some basic contract law to deal with people accessing your site in a way you do not like, and copyright law for distributing your material without permission.
I refuse to use any social media site including LinkedIn. A lot of companies - such as Goodwill - recruit exclusively from LinkedIn. Fuck 'em.
I don't work for any company that uses social media for recruiting.
They DO want it read, by the end-users who consume it and don't resell it. They DON'T want it read by aggregators who profit from it (and especially in a way that drives users to restrict their use of the service).
There is no solid technical solution for distinguishing between these two classes of user. So businesses are using the law to draw that distinction instead.
I personally think the "don't scrape" approach is totally backwards. They should take a "don't redistribute" approach instead. But logic commonly loses to wealth in the real world.
The Law of the Land is HTTP.
403 my requests. Until then, your front-facing data is fair game.
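For what "403 my requests" could look like on the server side, here is a rough sketch with Python's built-in http.server. The blocked-agent list, the status_for helper, and the port are assumptions for illustration, not anything LinkedIn is known to run.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical list of client name fragments the site owner wants to refuse.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot",)


def status_for(user_agent):
    """Return 403 for a blocked User-Agent string, 200 for everything else."""
    agent = (user_agent or "").lower()
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    return 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for(self.headers.get("User-Agent"))
        body = b"Forbidden\n" if status == 403 else b"Public page\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling serve() starts it locally. The catch is that the User-Agent header is whatever the client chooses to send, so a filter like this only stops scrapers that identify themselves honestly.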
What about Wi-Fi scanning? Just looking for SSIDs is on by default on many OSes.
I refer to the robots.txt file used to tell search engines what's out of bounds. http://www.searchtools.com/rob...
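A well-behaved crawler can check those rules before fetching anything. Below is a small sketch using Python's standard urllib.robotparser; the robots.txt content, the bot names, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: everyone is kept out of /private/, and one named bot
# is kept out of the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)

# can_fetch(useragent, url) answers "do the rules allow this bot to get this URL?"
named_bot_allowed = rules.can_fetch("examplebot", "https://example.com/public/page")
other_bot_allowed = rules.can_fetch("otherbot", "https://example.com/public/page")
other_bot_private = rules.can_fetch("otherbot", "https://example.com/private/notes")
```

Keep in mind that robots.txt is advisory: the server never sees whether a client consulted it, so it only works on crawlers that choose to honor it.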
I do a LOT of data scraping of the government websites here in my town and state. They don't make their data available for bulk download so the only option is to scrape it to turn it into something useful for data analysis. Many of the sites, however, are seriously underpowered for the software they run (mostly ASP and ASP.NET sites) and, even though I throw in generous delays between each request, the various entities still take notice because it takes anywhere from 1 to 5 seconds for a response to come back. That means that something on their end is consuming 1 to 5 seconds of at least 1 CPU core for each request. If the court sides with LinkedIn, it sets a very bad precedent that government, especially local city and state governments who prefer to hide any of their illicit/corrupt activities, will most certainly use against those who want to hold them accountable. Several of the entities have claimed that I or one of my colleagues have "hacked them" over the years but the lawyers always go to bat for me and those government entities shut up real fast and let me continue to scrape their data unhindered. As a taxpaying citizen, holding government accountable is what my tax dollars are for, so I'm totally okay with anyone who scrapes government websites.
I realize private entities are involved here, but precedent, once established, has a funny way of being abused by other entities, especially government. Frankly, I'd like to not see the inside of a prison cell for doing an important and necessary public service. The real solution is that LinkedIn should have separate infrastructure that can be scraped/interacted with all day long and not adversely affect their main infrastructure. They should be implementing public WebSockets so that scraping tools don't even need to guess and they can perform direct pushes of information. Maybe have bulk downloads of data available as well that LinkedIn perceives as "okay" for other entities to have. These are solvable problems with technology already at our disposal.
My view: If you are on the Internet, then you provide your data at your own expense. I run my own websites this way and it is working out just fine for me. If LinkedIn can't survive, then its business model is wrong and the business should start innovating or die, not lash out and sue everything and everyone it doesn't like.
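The "generous delays between each request" mentioned above can be sketched roughly like this. The function name fetch_politely, the 5-second default, and the injectable fetch and sleep parameters are my own choices for illustration, not the poster's actual code.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content;
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results
```

With a real downloader, fetch could be something like lambda url: urllib.request.urlopen(url).read(). A fixed delay keeps a single-threaded scraper from piling requests onto an underpowered server, though it does nothing about how expensive each individual request is on their end.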
I think that's a very unrealistic and uncommon presumption. 99% of the websites that I access, I have not agreed to any terms of service and I don't even know if they offer any special terms.
By default, websites don't have any documented terms of service. The terms are just: send a request and I'll probably send you back a reply. (It's not like, by default, an Apache install comes with a generic ToS boilerplate legalese thing.)
And most websites use this default (essentially: no terms). You're normally only limited by whatever the law happens to be, combined with various technical limitations. Usually there's no deviation, explained in some contract, which overlays the default situation.
Of course, there are exceptions. A website having terms isn't so uncommon that we're all shocked by the thought of it. But whenever that happens, there is always a mechanism whereby the website refuses service, until you have agreed to the terms.
And if the website doesn't refuse service without agreement to the "terms," then the "terms" weren't really terms.
Imagine I said this: "I'll sell you my old car for $10k. Those are the terms: pay me $10k and I'll sign over the title on my car to you."
You refuse my terms (because the car totally isn't worth it). Then you say, "Hey, can I have your car for free?" and I reply, "Sure, it doesn't run anyway. Just get that thing out of here. No? Ok, I'll pay you $50 to tow it out of here!"
Did you subvert or evade my terms? No, I say you discovered my true terms. Or maybe you persuaded me of your new terms, you damn Jedi.
It's always worth asking again, and everything is negotiable.
Now if LinkedIn had instead posted "ecto gammat", all the nerds would be in their corner.
#DeleteChrome
LinkedIn's whole business model is "scraping" information from people. It's not like they pay people to enter that information.
When CDDB tried this sort of B.S. it led to FreeDB. Maybe LinkedIn being assholes will lead to something similar.
If companies just chained employees to their desks, there'd be no need to worry about them quitting.
Can we talk about what HiQ is doing with the data for a sec? "HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting" I mean WTF?
Curious how GDPR will or could play here. If the Europeans have it right, a profile isn't LinkedIn's. It belongs collectively to the people who put their info there.
I'd suggest a flag that the user can set, which says "make x of my profile indexable."
The Computer Fraud and Abuse Act is part of the Federal Criminal Code, and no private entity can use it to bring a suit. A prosecuting attorney for the government could make a criminal charge, but LinkedIn would have to persuade him/them to take that act. This is much ado about nothing.
The CFAA applies immediately or when the defendant (or defendant-to-be) exceeds the permitted access. This could also be through a cease-and-desist letter. See Facebook, Inc. v. Power Ventures, Inc., No. 13-17102 (9th Cir. July 12, 2016) https://cdn.ca9.uscourts.gov/d...
You are permitted to grant different people different terms or access. Look at https://qz.com/981029/a-federa...
Fight Spammers!
If you don't want your content to be scraped, don't make it publicly available. If I can view it without a logon you have no claim to protect the data.
LinkedIn published content under copyright. Another entity took that copyrighted material and republished it without the consent of the copyright holder. It seems like a pretty straightforward case.
What am I missing here?
The only questions should be the size of the award LinkedIn should receive and whether anyone associated with the other entity should be criminally prosecuted.