Slashdot Mirror


Websites Complaining About Screen-Scraping

wilko11 writes "There have been two cases recently where websites have requested the removal of modules from CPAN. These modules could be used to access the websites (EuroTV and Streetmap) from a Perl program. The question being asked on the mailing lists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"

18 of 432 comments

  1. Sure they can! by stile · · Score: 5, Interesting

    If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.

    1. Re:Sure they can! by SoCalChris · · Score: 4, Interesting

      You have good points, but try explaining that to a very non-technical executive who is afraid that everyone is out to steal their content. I've seen many companies that will do their entire website in Flash just so the content can't be "stolen".

      Personally, I refuse to install the Flash plugin, so if I come to one of these pages looking to do business, oh well. I'll just go somewhere else. The higher up people in companies that make all Flash sites don't seem to realize that Flash is annoying to a lot of people.

    2. Re:Sure they can! by CaseyB · · Score: 4, Interesting
      If human eyes can read it, someone can write software to parse it.

      Uh huh.

      Good luck, buddy.

  2. Comment removed by account_deleted · · Score: 4, Interesting

    Comment removed based on user account deletion

  3. TerraServer by Corrupt+System · · Score: 3, Interesting

    I can understand how site owners could have a problem with a commercial software product like ExpertGPS wasting their bandwidth while skipping ads. ExpertGPS costs $59.95, but downloads maps from Microsoft's TerraServer without going through its web interface and viewing its advertising. Microsoft hasn't blocked access from these programs yet, but what if they do? All the paying users of ExpertGPS would be out of this functionality.

    --
    The solution that has worked best for me...is to avoid public discussion. -- CmdrTaco
    1. Re:TerraServer by topografix · · Score: 2, Interesting

      TerraServer explicitly allows access to their USGS map database from programs like ExpertGPS. They even have a webpage with step-by-step instructions on how to do it.

      ExpertGPS could just as easily grab its maps from sites like TopoZone and deprive them of ad revenue. Other programs have actually done that, and caused the nice guys at TopoZone a lot of hassle and lost revenue. The guys at Geocaching.com spend lots of time dealing with database scrapers who mine the site continually, chewing up bandwidth.

      The moral of the story - play nicely. If a website like TerraServer is generous enough to offer you a way to scrape their data, say thank you. If a website asks that you refrain from using automated scripts, either work out a licensing agreement with them, or start your own website and learn how it feels to be on the other end of the scraper.

  4. Don't they already??? by tacocat · · Score: 5, Interesting

    I am constantly greeted with messages to the tone of:

    You must have Windows Internet Explorer 4 or higher installed on your system to view this website

    How is this any different from what they are attempting to do here?

    I hate to disappoint, but I don't think that this is a new precedent. What is new is the notion that they can request that otherwise available software be removed or made unavailable.

    The precedent here is not the software usage to access a website, but the notion that this can be extended to:

    Dear Mozilla.org,

    It has come to our attention that people are using your software to access our website. We don't like this and are sending our legal team over to discuss the removal of your software application from the internet.

    Similarly, we are contacting Netscape, AOL, Opera, Konqueror, et al and removing them as well.

    Have a nice day!

  5. The future of the web by KjetilK · · Score: 4, Interesting
    The web was never intended to be a browser-only environment. From the start, it was intended to be a medium that would be useful for a wide variety of user agents, crawling for info and presenting compiled and digested information to the user.

    This was never realized, I believe mostly because of overpaid "web designers".

    But the Semantic Web would require many funny user agents for all kinds of things.

    Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.

    I wonder what Tim Berners-Lee thinks about this...

    --
    Employee of Inrupt, Project Release Manager and Community Manager for Solid
  6. Content is important by binaryDigit · · Score: 4, Interesting

    One of the biggest sites that I've not seen anyone mention is eBay. The following is in their EULA:

    Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.

    You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.

    You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.

    Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.


    Now why they do this is obvious: they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of EULA, which wouldn't make the software itself illegal, but would make using it so (or at least until someone challenges it).
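
As an illustration of what honoring "robot exclusion headers" can look like on the scraper's side, here is a minimal sketch. It assumes the exclusion rules are published as a robots.txt file (the EULA does not say which mechanism eBay uses); the rules, the `ExampleBot` agent name, the URLs, and the `may_fetch` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download this
# from the site it intends to visit instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/listings/1", "https://example.com/about"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

A check like this is purely voluntary on the client's part, which is presumably why the EULA backs it with a contractual promise rather than relying on the technology alone.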

  7. Captchas by Valdrax · · Score: 4, Interesting

    Actually, this is a field that is quickly being considered a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here.

    --
    If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
  8. Comment removed by account_deleted · · Score: 3, Interesting

    Comment removed based on user account deletion

  9. Re:Derivative work by Sabalon · · Score: 3, Interesting

    If I buy a copy of The Hobbit, rip out every 5th page and then read it, have I created a derivative work and broken a law?

    If I don't distribute it, can't I do whatever I want with the content?

    If I was to then repost this on the web, yes...I could see where that would be a problem, but not what I do for myself.

  10. COPYING in copyright law includes volatile RAM by yerricde · · Score: 2, Interesting

    the spirit of copyright laws are restricting COPYING

    The problem here is that a U.S. court decision interpreted a copy in RAM as a "copy" for purposes of copyright law. Thus, when the kernel receives a packet, it COPIES the packet from the network card to the browser's memory, and then the browser COPIES and ADAPTS the HTML into a document tree, COPIES and ADAPTS the document tree into an offscreen bitmap, and COPIES the offscreen bitmap into your video card's RAM.

    And if you're arguing fair use, as I said, you better have the money to pay an attorney to back it up.

    --
    Will I retire or break 10K?
  11. Speaking as someone that owns and publishes a TV guide... by rusty+spoon · · Score: 2, Interesting

    I do feel pissed off every time we catch someone stealing our content and using it in their own tools. Copyright notices and T&C's are all well and good but they do NOTHING to stop someone from trawling your site.

    As an owner and publisher I *can* say how my content is to be used because that's the licence I grant, it's MY choice. If I wanted it to be freely copied and used in any way then I would release it into the public domain...and it will be a cold day in hell when that happens.

    The information (in our case TV listings) is costly to collect. I guess the spongers don't realise that or they just don't give a fuck.

    I've found the solution is to a) implement technology to try to prevent it, and b) complain directly to their ISPs.

    Both of the above solutions work but are themselves costly in terms of the technology and the time taken. These are two things we'd rather not spend our time and money on, and they distract us from creating great software.

    At the end of the day if everyone trawled web sites for content then there would be no web sites supplying the content. The people trawling often request thousands or tens of thousands of pages in a very short space of time. The costs in terms of bandwidth and slow service to legitimate customers soon add up.

    Our downloadable software TV guide (DigiGuide) did in the past have unencrypted data files. We didn't honestly expect someone to take our content and build a (possibly competing) product around our data but they did. The data is now encrypted and should someone crack the encryption then we just change it and their hard work is wasted.

    I feel sorry for web sites like TVGuide.com because they probably think they have some very loyal users that spend a lot of time on their site and read a lot of pages...instead they just have people sucking their content and paying them nothing for it. Ignorance is probably bliss for them.

  12. Re:Why not? by Tokerat · · Score: 2, Interesting


    Same goes for the deep-link fanatics. Create a 0px wide frame (basically invisible) that encompasses the entire browser window content area and then load pages in there, checking the HTTP_REFERER on the server side and, on the client side, using JavaScript to ensure the documents are loaded inside the proper frame (which could have a static name or one that is dynamically allocated to each session, even). Make it run over SSL so no one can "steal" those URLs "in transit".
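
The server-side half of that idea, the referrer check, could be sketched roughly as follows. This is only an illustration under assumptions: the host name `www.example.com` and the `referer_allowed` helper are hypothetical, and the request headers are assumed to arrive as a plain dict keyed by `Referer`.

```python
from urllib.parse import urlparse

# Hypothetical host name of the site doing the check.
ALLOWED_HOST = "www.example.com"


def referer_allowed(headers: dict, allowed_host: str = ALLOWED_HOST) -> bool:
    """Return True only if the Referer header points at a page on allowed_host."""
    referer = headers.get("Referer", "")
    # urlparse("").hostname is None, so a missing header is rejected.
    return urlparse(referer).hostname == allowed_host


if __name__ == "__main__":
    print(referer_allowed({"Referer": "https://www.example.com/index.html"}))
    print(referer_allowed({"Referer": "https://other.example.net/page"}))
    print(referer_allowed({}))
```

The Referer header is supplied by the client, so a script can send whatever value it likes; a check like this discourages casual deep links rather than stopping a determined scraper.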

    Is it really just easier to sue everyone than to pay a grungy guy in a t-shirt like me to set up your server to do this?

    Ahh, I get it, it's the return you make on the "investment" in your lawyer.

    --
    CAn'T CompreHend SARcaSm?
  13. Re:What falls out the back end of a bull? by Alan+Cox · · Score: 2, Interesting

    Actually there is a much simpler way to defeat "please enter the word on the image" web sites, and one that actually raises a real issue. Those image tricks discriminate horribly against the blind, the old and those with eye problems in general, as well as, in some cases, dyslexics.

  14. I say turn it around... by ZoneGray · · Score: 2, Interesting

    Well, if screen-scraping is illegal (and in some forms, it certainly is), then somebody should sue the people who sell programs that harvest e-mail addresses from web sites.

  15. Stupid, but true by Angst+Badger · · Score: 2, Interesting

    Under the current state of US law, unauthorized access to a computer system is a federal crime. (I can't speak to EU laws, but I suspect parallels exist.) If Company X says, "You must use Internet Explorer 5.5 to access this site," then you must use IE 5.5. Of course, it would be just plain stupid to do so, but it's their computer system, and they get to decide who is authorized.

    To judge from most of the comments here, the fact that it is incredibly stupid to impose such restrictions has obscured what is actually a legally unambiguous situation. Just because it's dumb doesn't mean it's not legal.

    That an http server is nominally "public" doesn't mean diddly here. Any number of http servers provide for member- or employee-only access. The brick and mortar parallel would be those signs that say things like, "No shirt, no shoes, no service."

    It is surprising that so few people have touched on the reason why companies might object to the distribution of Perl modules designed to harvest data from their sites: bandwidth costs and site performance. It doesn't take too many cron jobs banging on a site every minute -- and being ignored by their users most of the time -- to degrade site performance for "live" users and run up steep bandwidth bills.

    Now, there is certainly no legal basis for Company X to demand that CPAN remove the modules, though it is hardly out of line to ask nicely. But there are firm legal grounds to prohibit anyone from actually using those modules.

    Legal action is probably the wrong way to handle this, though. Having written fairly complicated web scrapers before, I know how easy it would be to make a site virtually impossible to harvest. Rather than make a big stink about the Perl programmers who contribute to CPAN, Company X would be well-advised to hire a good Perl programmer to thwart automated harvesters.
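
One common technical measure of that kind is per-client rate limiting. The sketch below is an assumption about what such a measure might look like, not anything described in the thread: the `SlidingWindowLimiter` class, the client identifiers, and the limits are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client identifier (e.g. an IP address) to its recent request times.
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        """Record a request at time `now` and return whether it should be served."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0, 60.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

A cron job hitting the site every minute would stay under a limit like this, so the thresholds would need tuning against the traffic pattern a site actually wants to discourage.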

    --
    Proud member of the Weirdo-American community.