Websites Complaining About Screen-Scraping
wilko11 writes "There have been two cases recently where websites have requested the removal of modules from CPAN. These modules could be used to access the websites (EuroTV and Streetmap) from a Perl program. The question being asked on the mailing lists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"
If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.
I can understand how site owners could have a problem with a commercial software product like ExpertGPS wasting their bandwidth while skipping ads. ExpertGPS costs $59.95, but downloads maps from Microsoft's TerraServer without going through its web interface and viewing its advertising. Microsoft hasn't blocked access from these programs yet, but what if they do? All the paying users of ExpertGPS would be out of this functionality.
The solution that has worked best for me...is to avoid public discussion. -- CmdrTaco
I am constantly greeted with messages to the tone of:
How is this any different from what they are attempting to do here?
I hate to disappoint, but I don't think that this is a new precedent. What is a new precedent is the notion that they can request the removal of, or make unavailable, software that is otherwise available.
The precedent here is not the software usage to access a website, but the notion that this can be extended to:
This was never realized, I believe mostly because of overpaid "web designers".
But the Semantic Web would require many funny user agents for all kinds of things.
Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.
I wonder what Tim Berners-Lee thinks about this...
Employee of Inrupt, Project Release Manager and Community Manager for Solid
One of the biggest sites that I've not seen anyone mention is eBay. The following is in their EULA:
Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.
You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.
You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.
Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.
Why they do this is obvious: they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of EULA, which wouldn't make the software itself illegal, but would make using it so (or at least until someone challenges it).
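For anyone wondering what honoring "robot exclusion headers" looks like from the scraper's side, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the bot name and the URLs are all made up for illustration; none of it is taken from eBay's actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not any real site's file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/help", "http://example.com/listings/123"):
        print(url, may_fetch(ROBOTS_TXT, "example-bot", url))
```

A well-behaved tool would run this kind of check before every request. Nothing technical forces it to, which is presumably why the EULA spells it out.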
Actually, this is quickly coming to be regarded as a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here.
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
If I buy a copy of The Hobbit, rip out every 5th page and then read it, have I created a derivative work and broken a law?
If I don't distribute it, can't I do whatever I want with the content?
If I were to then repost this on the web, yes...I could see where that would be a problem, but not what I do for myself.
the spirit of copyright laws are restricting COPYING
The problem here is that a U.S. court decision interpreted a copy in RAM as a "copy" for purposes of copyright law. Thus, when the kernel receives a packet, it COPIES the packet from the network card to the browser's memory, and then the browser COPIES and ADAPTS the HTML into a document tree, COPIES and ADAPTS the document tree into an offscreen bitmap, and COPIES the offscreen bitmap into your video card's RAM.
And if you're arguing fair use, as I said, you better have the money to pay an attorney to back it up.
Will I retire or break 10K?
I do feel pissed off every time we catch someone stealing our content and using it in their own tools. Copyright notices and T&C's are all well and good but they do NOTHING to stop someone from trawling your site.
As an owner and publisher I *can* say how my content is to be used because that's the licence I grant, it's MY choice. If I wanted it to be freely copied and used in any way then I would release it into the public domain...and it will be a cold day in hell when that happens.
The information (in our case TV listings) is costly to collect. I guess the spongers don't realise that or they just don't give a fuck.
I've found the solution is to a) implement technology to try to prevent it, and b) complain directly to their ISPs.
Both of the above solutions work but are themselves costly in terms of the technology and the time taken. These are two things we'd rather not spend our time and money on, and they distract us from creating great software.
At the end of the day if everyone trawled web sites for content then there would be no web sites supplying the content. The people trawling often request thousands or tens of thousands of pages in a very short space of time. The costs in terms of bandwidth and slow service to legitimate customers soon add up.
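The "technology to try to prevent it" can be as simple as counting requests per client. Below is a rough sketch of a sliding-window limiter. It is not how DigiGuide or anyone else actually does it; the class name, the limits and the idea of keying on a client id such as an IP address are all assumptions for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's recently accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: refuse and do not record the hit
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0, 3.0, 60.0):
        print(t, limiter.allow("203.0.113.7", t))
```

A person clicking through listings will rarely hit a limit like this, while a trawler asking for thousands of pages in a few minutes will. The timestamp is passed in rather than read from the clock only so the sketch is easy to test.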
Our downloadable software TV guide (DigiGuide) did in the past have unencrypted data files. We didn't honestly expect someone to take our content and build a (possibly competing) product around our data but they did. The data is now encrypted and should someone crack the encryption then we just change it and their hard work is wasted.
I feel sorry for web sites like TVGuide.com because they probably think they have some very loyal users that spend a lot of time on their site and read a lot of pages...instead they just have people sucking their content and paying them nothing for it. Ignorance is probably bliss for them.
Same goes for the deep-link fanatics. Create a 0px wide frame (basically invisible) that encompasses the entire browser window content area and then load pages in there, checking the HTTP_REFERER on the server side and, on the client side, using JavaScript to ensure the documents are loaded inside the proper frame (which could have a static name or one that is dynamically allocated to each session, even). Make it run over SSL so no one can "steal" those URLs "in transit".
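The server-side half of that (the HTTP_REFERER check) might look something like the sketch below. The function name, the plain dict of headers and the example host are assumptions for illustration, not any particular server's API.

```python
from urllib.parse import urlsplit


def referer_is_local(headers: dict, site_host: str) -> bool:
    """Return True if the Referer header names a page on site_host.

    Assumes headers is a plain dict keyed by the canonical header name.
    """
    referer = headers.get("Referer", "")
    host = urlsplit(referer).hostname  # None when the header is missing or has no host
    return host is not None and host.lower() == site_host.lower()


if __name__ == "__main__":
    print(referer_is_local({"Referer": "https://www.example.com/frame"}, "www.example.com"))
    print(referer_is_local({"Referer": "https://elsewhere.example.net/"}, "www.example.com"))
    print(referer_is_local({}, "www.example.com"))
```

Keep in mind the Referer header is supplied by the client, so a determined scraper can send whatever value it likes. This only stops casual deep links.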
Is it really just easier to sue everyone than to pay a grungy guy in a t-shirt like me to set up your server to do this?
Ahh, I get it, it's the return you make on the "investment" in your lawyer.
CAn'T CompreHend SARcaSm?
Actually there is a much simpler way to defeat "please enter the word on the image" web sites, and one that actually raises a real issue. Those image tricks discriminate horribly against the blind, the old and those with eye problems in general, as well as, in some cases, dyslexics.
Well, if screen-scraping is illegal (and in some forms, it certainly is), then somebody should sue the people who sell programs that harvest e-mail addresses from web sites.
Under the current state of US law, unauthorized access to a computer system is a federal crime. (I can't speak to EU laws, but I suspect parallels exist.) If Company X says, "You must use Internet Explorer 5.5 to access this site," then you must use IE 5.5. Of course, it would be just plain stupid to do so, but it's their computer system, and they get to decide who is authorized.
To judge from most of the comments here, the fact that it is incredibly stupid to impose such restrictions has obscured what is actually a legally unambiguous situation. Just because it's dumb doesn't mean it's not legal.
That an http server is nominally "public" doesn't mean diddly here. Any number of http servers provide for member- or employee-only access. The brick and mortar parallel would be those signs that say things like, "No shirt, no shoes, no service."
It is surprising that so few people have touched on the reason why companies might object to the distribution of Perl modules designed to harvest data from their sites: bandwidth costs and site performance. It doesn't take too many cron jobs banging on a site every minute -- and being ignored by their users most of the time -- to degrade site performance for "live" users and run up steep bandwidth bills.
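From the harvester's side, most of that load is avoidable. A minimal sketch, assuming a long-running process and a hypothetical fetch function passed in by the caller: keep the last response for a while and only go back to the site once it has gone stale.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function so each URL is re-fetched at most once per ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],  # hypothetical downloader supplied by the caller
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}  # url -> (fetch time, body)

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no request reaches the site
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

This cache lives in memory, so a script started fresh by cron every minute would need to persist it to disk to get the same effect. The point is only that a once-a-minute job does not have to mean a once-a-minute request.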
Now, there is certainly no legal basis for Company X to demand that CPAN remove the modules, though it is hardly out of line to ask nicely. But there are firm legal grounds to prohibit anyone from actually using those modules.
Legal action is probably the wrong way to handle this, though. Having written fairly complicated web scrapers before, I know how easy it would be to make a site virtually impossible to harvest. Rather than make a big stink about the Perl programmers who contribute to CPAN, Company X would be well-advised to hire a good Perl programmer to thwart automated harvesters.
Proud member of the Weirdo-American community.