Websites Complaining About Screen-Scraping

In short, no. by numbski · 2003-02-07 08:20 · Score: 5, Insightful

If you don't want your content being redisplayed on another site, place appropriate copyright and seek protections therein.

Don't stifle the technology. Treat the cause, not the symptom.

--

Karma: Chameleon (mostly due to the fact that you come and go).

Sure they can! by stile · 2003-02-07 08:21 · Score: 5, Interesting

If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.

Re:Sure they can! by interiot · 2003-02-07 08:25 · Score: 4, Insightful

No they won't. The main goal of HTML wasn't so everything would be open and "stealable", the goal was to have content that could be viewed on a variety of platforms. You can't get that with flash or huge images, and in fact, for some of the more interesting devices (eg. cell phones, PDAs), it's explicitly required that the machine be able to understand the content to some extent so that it can transform it to something that better suits the particular device.
Re:Sure they can! by superdan2k · 2003-02-07 08:28 · Score: 4, Insightful

Yeah, and then they'll lose traffic and die because no one will bother wasting the time on their site.

What a lot of companies fail to realize is that the Social Contract (philosophy, not law) applies as much to the relationship between client and customer as it does between Joe and Jim Average. Play by the rules and be part of society, or doom yourself...that's basically it. No man is an island. No company is an island...well, maybe Microsoft, but that's it.

Re:Sure they can! by TheJesusCandle · 2003-02-07 08:29 · Score: 4, Insightful

That's what I tell my clients who try to "encrypt" things in this silly manner. I've written packages that defeat those silly "enter the word contained in the image" tests, I've written packages that defeat silly anti-automation scripts.

It's really not hard.

Sure, there's always the 2% that can get around any barrier you put up. Stopping the 98% is usually good enough to justify the extra effort of developing these measures.

You shouldn't complain too much about what your customers want, they're paying you for your time, right? Give 'em what they want.
Re:Sure they can! by SoCalChris · 2003-02-07 08:30 · Score: 4, Interesting

You have good points, but try explaining that to a very non-technical executive who is afraid that everyone is out to steal their content. I've seen many companies that will do their entire website in Flash just so the content can't be "stolen".

Personally, I refuse to install the Flash plugin, so if I come to one of these pages looking to do business, oh well. I'll just go somewhere else. The higher up people in companies that make all Flash sites don't seem to realize that Flash is annoying to a lot of people.
Re:Sure they can! by CaseyB · 2003-02-07 09:03 · Score: 4, Interesting

If human eyes can read it, someone can write software to parse it.
Uh huh.
Good luck, buddy.

Re-read the article... by numbski · 2003-02-07 08:24 · Score: 5, Insightful

So far as apps are concerned, again no.

There's no law stating that we have to look at ads. Although I see the problem paying the bills, a flaw in a business model is not the problem of the application coder (namely: me, you, and most people reading this site).

--

Karma: Chameleon (mostly due to the fact that you come and go).

Don't they already??? by tacocat · 2003-02-07 08:29 · Score: 5, Interesting

I am constantly greeted with messages to the tone of:

You must have Windows Internet Explorer 4 or higher installed on your system to view this website

How is this any different from what they are attempting to do here?

I hate to disappoint, but I don't think that this is a new precedent. What is a new precedent is the notion that they can request the removal, or to make unavailable, software that is otherwise available.

The precedent here is not the software usage to access a website, but the notion that this can be extended to:

Dear Mozilla.org,

It has come to our attention that people are using your software to access our website. We don't like this and are sending our legal team over to discuss the removal of your software application from the internet.

Similarly, we are contacting Netscape, AOL, Opera, Konqueror, et al and removing them as well.

Have a nice day!

If you don't want window shoppers... by Eese · 2003-02-07 08:30 · Score: 5, Insightful

... don't put merchandise in the windows.

Just like you can listen to unencrypted radio broadcasts through the airwaves as much as you want, or stand next to a group of people talking and listen in, you can view web pages that are served openly over the Internet.

If you are going to be presenting something for people to observe, they can observe it however they like. Legislate all you want, but this is a fundamental component of logical (as opposed to legal) privacy.

Why not? by JazzyJ · 2003-02-07 08:30 · Score: 5, Insightful

There are a multitude of methods for providing different content based on what the client browser returns on certain environment variables. While I think it's silly to demand that modules be removed from CPAN, it's entirely up to the people running the server to determine who they want to serve content to....and who they don't.

If they can't figure out how to do it serverside (or with clientside scripting) then that's their problem.

That's the bitch about open standards....EVERYONE can use them.... :)
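A minimal sketch of the kind of server-side filtering described above, assuming the server decides purely on the User-Agent request header. The function name, the blocked prefixes, and the page bodies are all made up for illustration; they are not taken from any of the sites discussed.

```python
# Hypothetical sketch: choose a response based on the User-Agent request header.
# The prefixes and page bodies below are assumptions made for illustration.

BLOCKED_AGENT_PREFIXES = ("libwww-perl", "wget")  # assumed examples of non-browser clients


def choose_response(headers):
    """Return (status, body) for a request, given its headers as a dict."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    agent = normalised.get("user-agent", "").lower()

    if not agent:
        return 403, "No User-Agent supplied."
    if agent.startswith(BLOCKED_AGENT_PREFIXES):
        return 403, "Automated clients are not served here."
    return 200, "<html>...full page with ads...</html>"


if __name__ == "__main__":
    print(choose_response({"User-Agent": "Wget/1.8"}))
    print(choose_response({"User-Agent": "Mozilla/5.0"}))
```

The header is self-reported, so a client can send whatever string it likes; a check like this only filters clients that identify themselves honestly.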

Learn from Google by shiflett · 2003-02-07 08:33 · Score: 4, Insightful

They should do as many of us do and learn a lesson from Google.

It is a violation of Google's terms of use for you to "screen scrape" search results. You can implement their API using a free key and achieve similar results, however.

Not only are these companies approaching the "problem" from the wrong angle in terms of common sense, they are also taking the most difficult approach. It is practically impossible to seek to outlaw software that fetches Web content, because Web browsers and wget (for example) are the same thing, HTTP clients. The HTTP protocol is an open standard that anyone can implement. If you don't want a valid HTTP client accessing your server, don't make your server an HTTP server.

Stated another way, don't try to take an open standard and restrict everyone else's use of it to suit your own needs. You don't see me (an avid soccer player) trying to get the NBA to change the rules of their game to require use of the feet for ball control. If I want to play basketball, I have to play by the rules, else I am not really playing basketball.
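To illustrate the point that a browser and a fetch tool are the same kind of HTTP client, here is a rough sketch that builds the text of a GET request by hand. The host name and User-Agent strings are placeholders, and real clients send more headers than this; the sketch only shows that the request itself has the same shape whoever sends it.

```python
def build_get_request(host, path, user_agent):
    """Return the raw text of a simple HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "Connection: close",
        "",  # blank line that ends the header section
        "",
    ]
    return "\r\n".join(lines)


def differing_lines(request_a, request_b):
    """List the (a, b) line pairs on which two requests differ."""
    pairs = zip(request_a.splitlines(), request_b.splitlines())
    return [(a, b) for a, b in pairs if a != b]


if __name__ == "__main__":
    # Placeholder agent strings, not exact values sent by any real client.
    browser = build_get_request("www.example.com", "/", "Mozilla/5.0")
    fetcher = build_get_request("www.example.com", "/", "Wget/1.8")
    print(differing_lines(browser, fetcher))
```

In this simplified form the only line that differs between the two requests is the User-Agent header.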

HTTP GET is an authorization by bwt · 2003-02-07 08:34 · Score: 5, Insightful

This is just another example of gross technical incompetence by executives and lawyers.

A company that attaches an HTTP server receives an HTTP GET request complete with some information in its headers. They have a reasonable case to request that that information be accurate. They have unilateral technical ability to firewall IP's or whole subnets. Otherwise, once they receive a GET request, when the machine that they have configured responds by sending a file, they have granted explicit permission to process that file consistent with the info in the GET request.

The owner of the server is completely in control at a technical level. If they don't like what you are doing, they can firewall you. Absent a contractual agreement not to, you have the permission to send ***REQUESTS*** for anything you would like to request. They can say no. If you lie in your request, then they have a case to say your use is unauthorized, but short of that, there should be no need to have the judicial system rewrite the technology.
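A small sketch of the "firewall IPs or whole subnets" option mentioned above, written as an application-level check rather than an actual firewall rule. The addresses come from ranges reserved for documentation, and the block list and function name are assumptions for illustration only.

```python
import ipaddress

# Hypothetical block list: one whole subnet and one single address.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(client_ip):
    """Return True if the client address falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.7", "203.0.113.1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

The server operator can refuse requests this way without anyone else having to change their software.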

Dangerous Precedent by EnglishTim · 2003-02-07 08:38 · Score: 4, Insightful

I find it sad that so many people seem to think it is just fine to mine their site for data. Sure, there's not all that much that they can do about it, except remove the data or make it harder for regular users of the site to use it.

For example, The EuroTV site seems to work on the concept that they provide the information for free for users of their site, but you can pay them to get it on your site. They're using their site as an advert for their services, while at the same time offering a useful service to the community. By making freely available a system to allow anybody to use their data in their own websites without paying them for it, you're completely ridding them of their reason for having the site up at all.

Yes, you can argue that they shouldn't put the information out there if they don't want people to use it, but then you're giving them a good reason not to put the information out there at all, which makes all of us poorer.

As for whether they can dictate that CPAN remove the modules, certainly it's fair enough of them to request that the module be removed, but it is a shame they leapt to threats of lawsuits quite so quickly.

The future of the web by KjetilK · 2003-02-07 08:42 · Score: 4, Interesting

The web was never intended to be a browser-only environment. From the start, it was intended to be a medium that would be useful for a wide variety of user agents, crawling for info and presenting compiled and digested information to the user.

This was not ever realized, I believe mostly because of overpaid "web designers".

But the Semantic Web would require many funny user agents for all kinds of things.

Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.

I wonder what Tim Berners-Lee thinks about this...

--
Employee of Inrupt, Project Release Manager and Community Manager for Solid

Content is important by binaryDigit · 2003-02-07 08:43 · Score: 4, Interesting

One of the biggest sites that I've not seen anyone mention is eBay. Following is in their eula:

Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.

You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.

You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.

Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.

Now why they do this is obvious, they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of eula, which wouldn't make the software itself illegal, but would make using it so (or at least until someone challenges it).

Re:Content is important by anaradad · 2003-02-07 08:49 · Score: 5, Insightful

The eBay EULA only applies if you actually register for their service. If you have never signed up for eBay, you have never signed off on their EULA.

paging Jack Valenti by sydlexic · 2003-02-07 08:44 · Score: 5, Funny

didn't you read the terms of service agreement you were handed at birth (us citizens only) that states any bypassing of ads during receipt of content is theft?

I'm just waiting for ashcroft's goons to knock on my door, find the tivo and haul my ass off to jail.

Re:paging Jack Valenti by merlyn · 2003-02-07 08:57 · Score: 4, Funny

"Click Here To Accept Your Life's Conditions: [Agree] [Disagree]"
{grin}
--
- Randal L. Schwartz, Just another Perl hacker for Stonehenge
Re:paging Jack Valenti by trbogie · 2003-02-07 09:05 · Score: 5, Funny

I thought they were trying to modify that to say that "Having left the womb, you have, by default, accepted the agreements to all life's conditions."
Re:paging Jack Valenti by jon+doh! · 2003-02-07 10:25 · Score: 4, Funny

a correction of the correction

"Click Here To Accept Your Life's Conditions: [Agree] [Disagree]"

(it's greyed out, like the microsoft patch i applied that said "you need to reboot your computer for the changes to take effect" and had two buttons, one to reboot now, one to reboot later. the reboot later was greyed out...)

--
Free Webmail

Back in the day... by TheTick · 2003-02-07 08:44 · Score: 5, Insightful

Remember when the web -- no, remember when the net was about sharing information? I miss that time. If somebody wrote a cool front end to your service, it was COOL and more power to them. If it made your service (site, whatever) more accessible, that meant more people were looking at your stuff, and that was COOL.

Now we have entities that threaten legal action for accessing the stuff they've made publicly available. There may actually be a case when the software scrapes and repackages the content (or, more importantly, redistributes it), but I hope the stuff about decoding the URL for easy use is bogus. I have my doubts that a court will see it my way, but still I hope for reason. Nevertheless, the whole idea makes me sad and nostalgic.

Another thought: is my mozilla vulnerable to this sort of action because it blocks ads -- essentially repackaging the server output for display to me? Now I'm really depressed.

--
bachiatari na torisetsu o yome!

What falls out the back end of a bull? by Wonko42 · 2003-02-07 08:44 · Score: 4, Funny

"I've written packages that defeat those silly "enter the word contained in the image" tests..."

Ahem. Bullshit.

What's the problem here? by hmccabe · 2003-02-07 08:45 · Score: 5, Insightful

I think this is something we're going to start seeing a lot of in coming years. Right now, the Internet in general is going through growing pains, and the pressure is starting to show in these "free services" type sites ( i.e. Mapquest )

I don't know about these sites in particular, but many of the big sites around today were built with the failed dot-com business model of delivering free content and selling advertising that ran on the page (or popped up behind it). This, of course, is dependent on people viewing the site in a browser. If people get the information without using a browser, therefore never seeing the ads, the advertisers won't want to spend any money on the site.

Another problem is, most companies don't want to take the risks associated with innovation, so instead they seek legal action to maintain the good thing they have going. While this is a quick fix, and in the company's best interests, we need companies to present a new business model to the public and see how it gets adopted. I would pay an annual subscription fee for things like Mapquest.com, tvguide.com and maybe even /. I believe others would as well.

Porn sites, eBay auctions, games such as Everquest and services such as Apple's dot-mac are online services that subscribers happily pay for because more than anything, they are quality products (well, some of the porn is). If the company's revenue is coming from its users, they would be a lot less concerned about how the information is being distributed.

This isn't such a radical change, as they could add a premium subscription service, and slowly transition the focus of their business towards it. Wouldn't it be cool if I could write my own mapping application (or download a pre-made one from the site) and have it connect to xml.mapquest.com, give my username and password, and retrieve the data I requested?

Turing test? by siskbc · 2003-02-07 08:50 · Score: 4, Insightful

So far, I was under the impression no one had won the Turing contest yet. You are beating their trivial problems, but they're finally waking up and shifting the "online human test" to things that people haven't figured out how to code. I'd link to the article if I could remember where I saw it...

Hell, the simplest would be an easy reading comprehension or logic test with a short-answer blank - the computer would never get it, and all humans would.

My guess is that soon, people who REALLY want you out will keep you out.

--

-Looking for a job as a materials chemist or multivariat

Re:Maybe they can't but... by pla · 2003-02-07 08:58 · Score: 4, Insightful

but they can dictate whether you get the content or not

Yes, they can. They have the option of not putting it on a public webserver in the first place. Beyond that, they have no control over who sees it and how. They can use various technological measures to try to control access, but short of forcing some form of user authentication via a secure proprietary client, the ad-blockers and scrapers *WILL* win.

If they are getting no ad impressions, then they are getting no money.

This statement seems a common way of viewing these issues (Ad blocking, scraping, whatever). However, realize that they don't have a "right" to make money just because they offer otherwise-free content online. They offer that content in the *HOPE* of making money, but that comes with no guarantees. And yes, I go to the kitchen during commercials, or change the station, or fast-forward.

I see the problem as involving how offensive these sites make the ads. I find Flash and Shockwave ads so offensive (and, I find that they often crash my browser - the huge offensive Flash ad currently on the Onion, for example, crashes my browser every time) that I simply browse with them disabled. Pop(up/under) ads bother me enough that I have the "dom.disable_open_during_load" preference set to completely block them. In comparison, the small, unintrusive text ad in the upper left of K5's front page doesn't bother me at all, and I've even *clicked* on it a few times.

Companies (not just advertisers, but those who serve such ads) need to realize that more annoying ads do make an impression - a strongly negative one. If I want their products, *I'll* seek *them* out. If they detract from my web browsing experience, I will specifically make a point of seeking out their *competitors* if I need something they offer.

In case any marketing folks read this, I'll mention the last ad I *DID* watch - The one with the hamster and rabbit from Blockbuster. Why? Because I found the ad sufficiently amusing to watch, on its own merits. Important point there. It didn't annoy me, and it had value all by itself. *THAT* makes a positive impression on a potential customer. I don't even know what the hamster and rabbit talked about, but it doesn't matter, I remember that "Blockbuster amused me for 30 seconds". Making me waste a few minutes to figure out how to filter out your crap does *not* make a good impression. I will remember "X10 pissed me off for 30 seconds, let's visit Logitech's cam offerings instead".

Derivative work by yerricde · 2003-02-07 08:59 · Score: 5, Informative

There's no law stating that we have to look at ads.

What about 17 USC 106, which states that barring fair use, etc., the copyright owner has the right to prevent others from creating derivative works of a web page?

--
Will I retire or break 10K?

Re:Derivative work by Natalie's+Hot+Grits · 2003-02-07 09:36 · Score: 5, Informative

Yes, barring fair use, which explicitly allows you to do this unless you re-distribute the work. Which you aren't.

Short answer is that you can modify any work under fair use for your OWN PERSONAL USE and not for someone else. If your web browser cuts out ads, then that is legal, and no US Code that is currently in existence disallows these modifications.

Aside from this point, there is still the legal ramification that there is no US law which states it is illegal to build, distribute, or use tools that can modify copyrighted works (unless the work is encrypted and covered under the DMCA).

If an ISP started doing this at his firewall, and then re-distributing the web site to your computer after you request it, then this might be illegal. They might be able to argue that one party is getting the work, modifying it, and redistributing it, which is certainly not covered under the Fair Use Doctrine.

OTOH, if the ISP has a fair use reason to do this (such as reformatting the text to work on a text only terminal), then this may also be legal.

What it all boils down to is that the spirit of copyright law is restricting COPYING and REDISTRIBUTING, not how a person uses those works. This was true until 1998 when the DMCA was enacted, and even now is still true for all copyrighted works that are not covered under the DMCA's encryption clauses. To this day, I have yet to find a website that is encrypted for purposes of the DMCA protection. Until this changes, they won't have any legal legs to stand on.

--
Two infinite things: your stupidity and mine. But I'm not sure about the latter. If my sig offends you, I'm sorry.
Re:Derivative work by bwt · 2003-02-07 11:01 · Score: 4, Insightful

The author does not create the "web page", that is the job of the user agent. The author offers up raw HTML source code and YOU render it. Your argument proves too much -- it proves that all rendering of HTML in a browser is copyright infringement because it creates a derivative work of his source code. Indeed, it DOES create a derivative work, just one that is **authorized**.

The author creates various files such as HTML text files, pictures, pdf's etc. By using HTML, he has authorized the user agent to render consistent with the HTML standard and his HTML code. Thus, he has explicitly authorized certain limited types of derivative works to be made from his source code by using HTML. The HTML standard does not require images to be rendered, and since it was the author's choice to use HTML, no violation of copyright law occurs when HTML is rendered in a manner consistent with the HTML spec.

Had he wanted to mandate the exact representation, he could have used an image format or a PDF. It's his choice, but he must live with it and all that follows from it.

Of course, there is nothing wrong with not rendering the HTML at all and just looking at it as source code. Nor is there any cause of action under copyright law if you extract unprotectable facts and ideas from either the source code or the rendered version.
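As a rough illustration of a user agent that renders HTML without drawing images, here is a sketch of a text-only "renderer" built on Python's standard html.parser module. The class and function names are made up for this example, and it is nowhere near a complete renderer; it only keeps the text and drops images, scripts, and styles.

```python
from html.parser import HTMLParser


class TextOnlyRenderer(HTMLParser):
    """Collect visible text and ignore images, scripts, and styles."""

    SKIPPED = {"script", "style"}  # elements whose contents are not shown

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        # <img> and all other tags are simply not drawn; only text is kept.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def render_text(html):
    """Return the text content of an HTML document as one line."""
    renderer = TextOnlyRenderer()
    renderer.feed(html)
    renderer.close()
    return renderer.text()


if __name__ == "__main__":
    page = '<h1>Listings</h1><img src="ad.gif" alt="ad"><p>News at <b>9pm</b></p>'
    print(render_text(page))
```

The same markup yields a different presentation depending on the agent, which is the choice the author accepted by publishing HTML rather than an image or a PDF.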

Don't like it? Don't put stuff on the web! by Maul · 2003-02-07 09:23 · Score: 4, Insightful

If you put something on the web, you have to assume that people are going to access that information in any way that they possibly can.

I suppose the big complaint is that people might not be viewing the "ads" on pages if they use certain HTTP clients.

I have a suggestion for the sites that are complaining. If you don't like it, don't put stuff on the web. Write your own custom client-server solution if you don't want people accessing it with certain browsers or other software.

If you are depending on ad banners for your revenues, you and advertisers are taking a "risk" that people might not see the ads, or that they might not buy advertised products. Tough luck if you lose out on your bet. Hopefully you have a solid way of making money related to whatever service you are providing to make up for it.

Whining about lost ad revenue and such is the same as whining about losing money in Las Vegas. You should have assessed the risks before playing the game.

--

"You spoony bard!" -Tellah

Captchas by Valdrax · 2003-02-07 09:32 · Score: 4, Interesting

Actually, this is a field that is quickly being considered a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here.

--
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").

Legal? Probably. Rude? Maybe... by Rob+Parkhill · 2003-02-07 09:40 · Score: 4, Insightful

EuroTV has a robots.txt file that asks to leave the various /scripts directories alone. If this Perl module is just ignoring that robots.txt file, then that is just rude, although I don't see how it is illegal.

Streetmap doesn't even have a robots.txt file, so I don't see why they are whining about it.
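A short sketch of how a scraper could honor robots.txt using Python's standard urllib.robotparser. The robots.txt text below is an assumed stand-in (the actual EuroTV file is not quoted in this thread), and the agent name and URLs are placeholders.

```python
from urllib import robotparser

# Assumed example contents; the real file may list different paths.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scripts/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "ExampleScraper"  # placeholder agent name
    print(allowed(EXAMPLE_ROBOTS_TXT, agent, "http://www.example.com/scripts/listing.cgi"))
    print(allowed(EXAMPLE_ROBOTS_TXT, agent, "http://www.example.com/index.html"))
    # With no rules at all, nothing is excluded.
    print(allowed("", agent, "http://www.example.com/scripts/listing.cgi"))
```

Checking this before each request costs a scraper very little, and compliance is voluntary: nothing in the protocol enforces it.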

Although I can see why these websites could get upset. The TV-listing screen scrapers are especially bad at hammering a site relentlessly for a sustained period of time to obtain all of the programming information for a certain broadcast area. The scraper has to hit the site repeatedly to obtain all of the information, since it isn't all displayed on a single page. If any one of these scrapers gets to be really popular, it could kill the site.

Of course, the solution to that is to make all of the listing available as one big chunk to avoid repeated requests. But then the site goes out of business in a few weeks due to lack of advertising revenue.

I, for one, wish I could buy a subscription to zap2it.com that would give me fast, easy access to the channel listings in, say, XMLTV format. Is $25/year a reasonable fee, considering that I would only hit the site once a day at the most, and grab a single file?
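The "hit the site once a day at the most" idea can be sketched as a small cache that refuses to refetch until a day has passed. Everything here is hypothetical: the class name is invented, and the fetch callable stands in for whatever would actually download the listings file.

```python
import time

ONE_DAY = 24 * 60 * 60  # seconds


class DailyCache:
    """Fetch a resource at most once per day and reuse the saved copy otherwise."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch        # callable that returns the listings data
        self._clock = clock        # injectable so the behaviour can be tested
        self._data = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= ONE_DAY:
            # Only here does the remote site see a request.
            self._data = self._fetch()
            self._fetched_at = now
        return self._data


if __name__ == "__main__":
    cache = DailyCache(lambda: "<tv>...listings...</tv>")  # stand-in for a real download
    cache.get()  # fetches
    cache.get()  # served from the saved copy, no second request
```

A client built this way puts one request per day on the server no matter how often the user looks at the listings.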

--
"Tomorrow's forecast: a few sprinkles of genius with a chance of doom!" - Stewie Griffin

screen scraping software is completely legal by frovingslosh · 2003-02-07 10:00 · Score: 4, Insightful

Some /. readers seem to be missing this, but this is not a debate on if it's right to take someone's content and post it elsewhere. (To me it's clearly not without their permission, but that's not the issue here at all so let's not even pretend that it is by debating it.) The issue is "is it legal / proper / legitimate to write software that is capable of looking at the output of a website, by any means - including examining the HTML returned or by capturing the computer screen itself and analyzing that?" Of course it is. Such software in no way pirates a website owner's content, it just gives me additional tools for keeping current with the content of those pages. There are plenty of legitimate uses (the Streetmap reference was perfectly on target for this, just to give one). That someone might abuse such a tool and pirate content is hardly the issue, if it were every C compiler would also be at fault. People need to stand up against cranks like btek's Kate Sutton who think they can bully everyone else in the world. Simon Batistoni should have never even tried to be reasonable with her, and he should make his tool available again and sue her and her company for the slander she has done to him in the main perl5 bug queue.

Even if he had provided a tool to make a copy of a map, which he did not, there is nothing at all wrong with making and supplying others with that tool. It's how the tool is used that is the issue, and a tool that has legitimate useful uses can never be allowed to be the target of such a complaint or suit.

--
I'm an American. I love this country and the freedoms that we used to have.

Slashdot Mirror
