Slashdot Mirror


Websites Complaining About Screen-Scraping

wilko11 writes "There have been two cases recently where websites have requested the removal of modules from CPAN. These modules could be used to access the websites (EuroTV and Streetmap) from a Perl program. The question being asked on the mailing lists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"

8 of 432 comments

  1. Sure they can! by stile · · Score: 5, Interesting

    If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.

    1. Re:Sure they can! by SoCalChris · · Score: 4, Interesting

      You have good points, but try explaining that to a very non-technical executive who is afraid that everyone is out to steal their content. I've seen many companies that will do their entire website in Flash just so the content can't be "stolen".

      Personally, I refuse to install the Flash plugin, so if I come to one of these pages looking to do business, oh well. I'll just go somewhere else. The higher up people in companies that make all Flash sites don't seem to realize that Flash is annoying to a lot of people.

    2. Re:Sure they can! by CaseyB · · Score: 4, Interesting

      If human eyes can read it, someone can write software to parse it.

      Uh huh.

      Good luck, buddy.

  2. Comment removed by account_deleted · · Score: 4, Interesting

    Comment removed based on user account deletion

  3. Don't they already??? by tacocat · · Score: 5, Interesting

    I am constantly greeted with messages to the effect of:

    You must have Windows Internet Explorer 4 or higher installed on your system to view this website

    How is this any different from what they are attempting to do here?

    I hate to disappoint, but I don't think that this is a new precedent. What is a new precedent is the notion that they can request the removal of, or otherwise make unavailable, software that is otherwise available.

    The precedent here is not the software usage to access a website, but the notion that this can be extended to:

    Dear Mozilla.org,

    It has come to our attention that people are using your software to access our website. We don't like this and are sending our legal team over to discuss the removal of your software application from the internet.

    Similarly, we are contacting Netscape, AOL, Opera, Konqueror, et al and removing them as well.

    Have a nice day!

  4. The future of the web by KjetilK · · Score: 4, Interesting

    The web was never intended to be a browser-only environment. From the start, it was intended to be a medium that would be useful for a wide variety of user agents, crawling for info and presenting compiled and digested information to the user.

    This was never realized, I believe mostly because of overpaid "web designers".

    But the Semantic Web would require many funny user agents for all kinds of things.

    Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.

    I wonder what Tim Berners-Lee thinks about this...

    --
    Employee of Inrupt, Project Release Manager and Community Manager for Solid

  5. Content is important by binaryDigit · · Score: 4, Interesting

    One of the biggest sites that I've not seen anyone mention is eBay. The following is in their EULA:

    Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.

    You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.

    You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.

    Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.


    Why they do this is obvious: they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of EULA, which wouldn't make the software itself illegal but would make using it so (at least until someone challenges it).
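
    For anyone writing a scraper, here is a rough sketch of how a program could honor "robot exclusion headers", assuming that phrase refers to something like the robots.txt convention. The rules in SAMPLE_ROBOTS_TXT, the "MyScraper/0.1" agent name, and the example.com URLs are all made up for illustration; they are not eBay's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
Allow: /help/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/help/faq", "/listings/123"):
        url = "http://example.com" + path
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "MyScraper/0.1", url) else "disallowed"
        print(path, verdict)
```

    A real scraper would fetch the site's own robots.txt (RobotFileParser has set_url() and read() for that) before requesting pages. Passing this check says nothing about whether a site's terms of use permit the access.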

  6. Captchas by Valdrax · · Score: 4, Interesting

    Actually, this is a field that is quickly being considered a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here.

    --
    If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").