Websites Complaining About Screen-Scraping
wilko11 writes "There have been two cases recently where websites have requested the removal of modules from CPAN. These modules could be used to access the websites (EuroTV and Streetmap) from a Perl program. The question being asked on the mailing lists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"
If you don't want your content redisplayed on another site, put an appropriate copyright notice on it and seek protection under copyright law.
Don't stifle the technology. Treat the cause, not the symptom.
Karma: Chameleon (mostly due to the fact that you come and go).
If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.
So far as apps are concerned, again no.
There's no law stating that we have to look at ads. Although I see the problem paying the bills, a flaw in a business model is not the problem of the application coder (namely: me, you, and most people reading this site).
Do you deal with word in the image tests without requiring the user to read the word? How?
--Pat
Maybe they can't dictate what you use to access their content, but they can dictate whether you get the content or not. Seriously: if they are getting no ad impressions, then they are getting no money. Poof, you're not getting the service any more.
I don't know what the answer is, but seriously, abusing the intent of their services (which *IS* to generate ad revenue, after all) shall do little but get them to change or remove those services.
Oh, and kudos for them for not just up and suing CPAN (which they have little grounds for, but we all know that proof is worth less than a cat fart in the U.S. legal system).
Put another way: particularly on a subscription site, the site owners may specify whatever stupid terms and conditions that their subscribers are willing to submit to. That does not mean, though, that the client software is obligated to know whether or not the software itself meets the TOS (nor can I be made to believe that this is possible).
Dewey, what part of this looks like authorities should be involved?
I can understand how site owners could have a problem with a commercial software product like ExpertGPS wasting their bandwidth while skipping ads. ExpertGPS costs $59.95, but downloads maps from Microsoft's TerraServer without going through its web interface and viewing its advertising. Microsoft hasn't blocked access from these programs yet, but what if they do? All the paying users of ExpertGPS would be out of this functionality.
The solution that has worked best for me...is to avoid public discussion. -- CmdrTaco
I am constantly greeted with messages to the tone of:
How is this any different from what they are attempting to do here?
I hate to disappoint, but I don't think that this is a new precedent. What is a new precedent is the notion that they can request the removal of, or otherwise make unavailable, software that is otherwise available.
The precedent here is not the software usage to access a website, but the notion that this can be extended to:
Is it just like MSN wants you to use MS IE only?
http://slashdot.org/article.pl?sid=03/02/06/1645229&mode=nested&tid=109
... don't put merchandise in the windows.
Just like you can listen to unencrypted radio broadcasts through the airwaves as much as you want, or stand next to a group of people talking and listen in, you can view web pages that are served openly over the Internet.
If you are going to be presenting something for people to observe, they can observe it however they like. Legislate all you want, but this is a fundamental component of logical (as opposed to legal) privacy.
There are a multitude of methods for providing different content based on what the client browser returns in certain environment variables. While I think it's silly to demand that modules be removed from CPAN, it's entirely up to the people running the server to determine who they want to serve content to... and who they don't.
:)
If they can't figure out how to do it serverside (or with clientside scripting) then that's their problem.
That's the bitch about open standards....EVERYONE can use them....
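The server-side approach mentioned above can be sketched roughly as follows. This is only an illustration: the function name `choose_status` and the list of blocked substrings are assumptions made up for the example, not anything EuroTV or Streetmap are known to run.

```python
# Hypothetical server-side policy: pick an HTTP status code based on the
# User-Agent header the client sent. The substrings below are illustrative.
BLOCKED_AGENT_SUBSTRINGS = ("libwww-perl", "wget")


def choose_status(headers):
    """Return 200 to serve the page or 403 to refuse, given request headers."""
    agent = headers.get("User-Agent", "").lower()
    if not agent:
        return 403  # no User-Agent at all: refuse
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return 403  # a client the operator has chosen not to serve
    return 200


print(choose_status({"User-Agent": "libwww-perl/5.65"}))  # refused
print(choose_status({"User-Agent": "Mozilla/5.0"}))       # served
```

Note that the client chooses what to put in that header, so a check like this only filters clients that identify themselves honestly.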
How do EuroTV get their television schedules?
...
Subscribe to some kind of schedule syndication service? Pay for some student to type them in? Or retrieve information from the broadcasters websites?
Wouldn't it be ironic if they scraped the (e.g.) BBC's Web Site to retrieve schedule information for Beeb 1, 2,
By the way, the easiest way to defeat WWW::EuroTV is to simply change your formatting every few days. The author will go crazy trying to keep up. :)
I know a lot of sites don't like the use of wget to 'access web resources'... so what makes this screen scraping technology any different?
Macs as a fetish property
From the perspective of a webmaster of a large commercial site that gets regularly scraped by users using libwwwperl and other various perl packages for its content...
We don't care how you look at our site, but we do, to the best of our ability, monitor you closely. We only care if you republish our stuff. You may look at it any way you wish, but just don't make it available to others. It's in our notice on every page.
Screen scrape away...
Microsoft Sends Broken Stylesheets to Opera
Not exactly forcing you through law, but definitely through little "accidents" like the above.
Norris/Palin 2012
Fact: We deserve leaders who can kick your ass and field dress your carcass.
They should do as many of us do and learn a lesson from Google.
It is a violation of Google's terms of use for you to "screen scrape" search results. You can implement their API using a free key and achieve similar results, however.
Not only are these companies approaching the "problem" from the wrong angle in terms of common sense, they are also taking the most difficult approach. It is practically impossible to seek to outlaw software that fetches Web content, because Web browsers and wget (for example) are the same thing, HTTP clients. The HTTP protocol is an open standard that anyone can implement. If you don't want a valid HTTP client accessing your server, don't make your server an HTTP server.
Stated another way, don't try to take an open standard and restrict everyone else's use of it to suit your own needs. You don't see me (an avid soccer player) trying to get the NBA to change the rules of their game to require use of the feet for ball control. If I want to play basketball, I have to play by the rules, else I am not really playing basketball.
This is just another example of gross technical incompetence by executives and lawyers.
A company that attaches an HTTP server to the Internet receives HTTP GET requests, each complete with some information in its headers. They have a reasonable case to request that that information be accurate. They have the unilateral technical ability to firewall IPs or whole subnets. Otherwise, once they receive a GET request, when the machine that they have configured responds by sending a file, they have granted explicit permission to process that file consistent with the info in the GET request.
The owner of the server is completely in control at a technical level. If they don't like what you are doing, they can firewall you. Absent a contractual agreement not to, you have the permission to send ***REQUESTS*** for anything you would like to request. They can say no. If you lie in your request, then they have a case to say your use is unauthorized, but short of that, there should be no need to have the judicial system rewrite the technology.
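As a rough sketch of the "firewall IPs or whole subnets" option, here is the membership check such a filter boils down to. The networks listed are reserved documentation ranges chosen for illustration, and `is_blocked` is a hypothetical helper name; a real deployment would more likely do this in the firewall or web server configuration.

```python
import ipaddress

# Illustrative block list: one whole subnet and one single address.
# Both come from ranges reserved for documentation, not real offenders.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_blocked(client_ip):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(is_blocked("192.0.2.55"))     # inside the blocked /24
print(is_blocked("203.0.113.9"))    # not on the list
```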
So, were these modules really removed from CPAN or did the CPAN admin withstand the pressure?
The companies spend time and money making websites that are designed to help further their corporate goals, cross-promote products and services, and possibly act as a vehicle for third-party advertising. If somebody is making a product which is designed specifically to circumvent the reasons why you're providing the website, then of course you should ask them to stop.
Here's a news flash: TV Guide will eventually stop giving you free screen-scraped guide data. The map sites will stop giving you screen-scraped maps. And so on, and so forth.
If you want to do this on your own, nobody will stop you, but if you make it simple for thousands of people to use a company's resources while providing no benefit to that organization, you should expect that they'll ask you to stop.
And if they don't let me use what I want, I will just take my business elsewhere.
Simple as that.
I find it sad that so many people seem to think it is just fine to mine their site for data. Sure, there's not all that much that they can do about it, except remove the data or make it harder for regular users of the site to use it.
For example, The EuroTV site seems to work on the concept that they provide the information for free for users of their site, but you can pay them to get it on your site. They're using their site as an advert for their services, while at the same time offering a useful service to the community. By making freely available a system to allow anybody to use their data in their own websites without paying them for it, you're completely ridding them of their reason for having the site up at all.
Yes, you can argue that they shouldn't put the information out there if they don't want people to use it, but then you're giving them a good reason not to put the information out there at all, which makes all of us poorer.
As for whether they can dictate that CPAN remove the modules, certainly it's fair enough of them to request that the module be removed, but it is a shame they leapt to threats of lawsuits quite so quickly.
If content is obtained in a manner that is not in violation of copyright, the next question is that of fair use. It didn't sound from the article as though either module author intended or enabled anything explicitly unfair in using the data. If the website owners in question were objecting strictly to the method with which their web data was being accessed, their argument holds no water.
This is somewhat similar to the "what constitutes a license" argument regarding database licenses, the contention being a warm body vs. a connection. In the case of these Perl modules, just because there's not a warm body explicitly directing the access of the data should not automatically qualify that access as a breach of copyright.
It would be worth the effort to question both of the website owners as to what exactly did they consider the breach of copyright to be? My guess is that neither of them will be willing or able to express their concerns with enough technical detail or legal specificity to present a valid explanation.
Where is the boundary between acceptable viewing and unacceptable viewing of content they are making publicly available?
What if I have my display resolution set differently to the web designer's?
What if I use Netscape instead of IE?
What if I use a black & white screen?
What if I surf the site with image loading turned off?
What if I wear dark glasses with holes cut through so I can only see the content and not the ads?
What if I use a text-only browser?
What if I use a screen reader?
What if I use a hypothetical browser that summarizes paragraphs to the first few lines?
What if I use a browser that "collapses" paragraphs on the first few lines and lets me click on an arrow to reveal the entire paragraph?
There's a continuum between displaying the content exactly as they envisioned it, and reducing or distilling it to some other form. Other than sitting at the same computer used by the web designer, anyone who visits a web site is somewhere along that continuum.
Unless they can define a specific point beyond which viewing is objectionable to the publisher - and I don't think they can - then I don't see how this case could get anywhere. The person making the software can keep backing up slightly until the plaintiff's position is absurd.
Furthermore, the tool in question is not a commercial product, the developer is not trafficking in the publisher's data, and, as a practical matter, it's easy enough to rewrite it so that the web server can't tell the difference anyway.
This is just another case of someone with a lousy business model trying to fix their problems with a lawyer instead of a good solid application of common sense (CueCat anyone?).
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
Basically: "Here is the code. Here is what it does. If it's illegal for you to do that where you live, then don't break the law."
Why should a developer be denied the right to publish code that "could be used to do something that may be illegal under certain circumstances". Hey, I know--I'll build a security system to protect against 19th century threats, and then sic my lawyers on anybody who invents a technology that might circumvent my security.
I have a pair of bolt-cutters in my garage. Ace hardware was happy to sell it to me. I don't think the hardware store owner should bow to pressure from the U-Stor-It down the street who might say: "Hey, people use those things to break into our storage facilities."
OTOH, if this actually gets to court and holds up, then I will create a website and copyright some work on that site. Perhaps a scan of an artistic display of one of my fingers. The license will say "You may not view any material on this website."
Then I will tell anyone who produces tools that allow this sort of copyright violation (web browsers) to take place that they must stop!
Hmmmm... Who should I start with?? MWAAAHAHAHAHA
"Reality is that which, when you stop believing in it, doesn't go away." - Philip K. Dick
This was never realized, I believe mostly because of overpaid "web designers".
But the Semantic Web would require many funny user agents for all kinds of things.
Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.
I wonder what Tim Berners-Lee thinks about this...
Employee of Inrupt, Project Release Manager and Community Manager for Solid
One of the biggest sites that I've not seen anyone mention is eBay. The following is in their EULA:
Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.
You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.
You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.
Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.
Why they do this is obvious: they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of EULA, which wouldn't make the software itself illegal, but would make using it so (or at least until someone challenges it).
didn't you read the terms of service agreement you were handed at birth (us citizens only) that states any bypassing of ads during receipt of content is theft?
I'm just waiting for ashcroft's goons to knock on my door, find the tivo and haul my ass off to jail.
Everyone's assuming the appropriate rules here are from copyright law, which allow you to protect the expression of an idea but not the idea itself. That's probably right. It's not the way some big organizations want to play.
In the United States, most major sports leagues (NFL, NBA, NHL, MLB, etc.) believe that they own the rights to real time scores, and can permit or restrict any desired use. I ran into this at a previous job: we could "broadcast" football, basketball, and hockey scores at the end of every "period," and baseball scores at the end of every half inning, but we couldn't send updated broadcasts for every new score. That information needed (so said the leagues) to be licensed, and most of it had been exclusively licensed for the medium (Internet) we were interested in.
Do they have a legal leg to stand on? No. (IANAL.) Are they leaning on a great, big, huge stick with nails driven through it? Apparently.
Stupid job ads, weird spam, occasional insight at
Remember when the web -- no, remember when the net was about sharing information? I miss that time. If somebody wrote a cool front end to your service, it was COOL and more power to them. If it made your service (site, whatever) more accessible, that meant more people were looking at your stuff, and that was COOL.
Now we have entities that threaten legal action for accessing the stuff they've made publicly available. There may actually be a case when the software scrapes and repackages the content (or, more importantly, redistributes it), but I hope the stuff about decoding the URL for easy use is bogus. I have my doubts that a court will see it my way, but still I hope for reason. Nevertheless, the whole idea makes me sad and nostalgic.
Another thought: is my mozilla vulnerable to this sort of action because it blocks ads -- essentially repackaging the server output for display to me? Now I'm really depressed.
--
bachiatari na torisetsu o yome!
Ahem. Bullshit.
Frob.
//TODO: Think of witty sig statement
I think this is something we're going to start seeing a lot of in coming years. Right now, the Internet in general is going through growing pains, and the pressure is starting to show in these "free services" type sites (e.g. Mapquest).
I don't know about these sites in particular, but many of the big sites around today were built with the failed dot-com business model of delivering free content and selling advertising that ran on the page (or popped up behind it). This, of course, is dependent on people viewing the site in a browser. If people get the information without using a browser, and therefore never see the ads, the advertisers won't want to spend any money on the site.
Another problem is that most companies don't want to take the risks associated with innovation, so instead they seek legal action to maintain the good thing they have going. While this is a quick fix, and in the company's best interests, we need companies to present a new business model to the public and see how it gets adopted. I would pay an annual subscription fee for things like Mapquest.com, tvguide.com and maybe even /. I believe others would as well.
Porn sites, eBay auctions, games such as Everquest and services such as Apple's dot-mac are online services that subscribers happily pay for because, more than anything, they are quality products (well, some of the porn is). If the company's revenue is coming from its users, they would be a lot less concerned about how the information is being distributed.
This isn't such a radical change, as they could add a premium subscription service and slowly transition the focus of their business towards it. Wouldn't it be cool if I could write my own mapping application (or download a pre-made one from the site) and have it connect to xml.mapquest.com, give my username and password, and retrieve the data I requested?
Maybe all these sites will front-end their sites with "retype anti-script graphics". Kinda like what slashdot.org does to email your password.
$0.02
Hell, the simplest would be an easy reading comprehension or logic test with a short-answer blank - the computer would never get it, and all humans would.
My guess is that soon, people who REALLY want you out will keep you out.
-Looking for a job as a materials chemist or multivariat
* From: Jan Dubois
* Subject: Re: [Fwd: IMPORTANT: Request removal of WWW::EuroTV]
* Date: Thu, 06 Feb 2003 13:05:09 -0800
On Thu, 6 Feb 2003 21:44:20 +0100, "Bas A. Schulte"
wrote:
>They're just too ignorant that they think they can publish the data for
>everyone to see can only be seen through their own website.
[...]
>Anyway, I'd love to hear anyone on this with some legal knowledge. I
>don't believe at all that this will hold up in a court of law.
I think this discussion is missing the point. It should not be: "What can
we legally get away with?", but "Do we have the courtesy to respect the
wishes of publishers of information?", even if their wishes might not be
legally enforceable.
Since this is about Perl advocacy, I would like to quote a bit of Perl
culture: "It [Perl] would prefer that you stayed out of its living room
because you weren't invited, not because it has a shotgun."
I think the same rules should apply for screenscrapers too: If website
owners don't want their pages to be scraped, then people shouldn't do it
and get their information elsewhere. It is like honoring a robots.txt
file. It is probably not enforceable, but it is the right thing to do.
Cheers,
-Jan
PS: I'm not saying that "they" weren't the first ones to break the rules
of politeness by threatening a lawsuit, instead of just asking for the
modules' removal. But that doesn't mean that one has to respond in kind.
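For scraper authors who want to follow the robots.txt courtesy described in the mail above, a minimal sketch using Python's standard `urllib.robotparser` might look like this. The robots.txt content, the `ExampleScraper/0.1` agent string, and the URLs are all made up for the example; a real scraper would fetch the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; a real one comes from the site itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The publisher has asked robots to stay out of /listings/, so skip it.
print(may_fetch(ROBOTS_TXT, "ExampleScraper/0.1", "http://example.com/listings/today"))
print(may_fetch(ROBOTS_TXT, "ExampleScraper/0.1", "http://example.com/about"))
```

Nothing enforces the result; the check only matters if the scraper chooses to honor it.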
I just don't trust anything that bleeds for five days and doesn't die.
Hmm - my belief is this:
If someone displays information for public consumption: It releases the right to control what the public does with that information.
Information being website content - text.
Sure, it is copyrighted. But if I want to screen scrape it, make a collage out of it, whatever, I have that right. If I'm going to publish derivative works, then I have to seek permission to publish said works.
But saying that you have to use a particular application/technology to access my public webserver is ludicrous. If you want to do that, make your data proprietary and require proprietary software to access said information.
That is well within your rights, but don't complain about how the public digests your information: do something about it.
Jesus - Forget all the politics and "My rights are getting trampled on". Write your own version of the module and be done with it. You can fuck the lawyers, and ignore whatever crap is going on.
But if the rules of a protocol are the only rules I need to follow in cyberspace then consider the following valid telnet sessions:
login: abe
password: lincoln
Wrong password, try again:
login: abe
password: Lincoln
Wrong password good bye.
DISCONNECTED
login: abe
password: linc0ln
connected!
$ sudo rm -r /
password: linc0ln
DISCONNECTED
I would argue that insecure systems deserve to be broken into. The person here shouldn't have had such an easy-to-guess password. He also shouldn't have used an unencrypted protocol like telnet that anyone can listen to. If you want to have a telnet service running that is accessible by the public, then it should be your responsibility to have a hard-to-guess password and keep people outside your firewall from being able to connect.
Eat at Joe's.
that counts.
Because none of your drivel counts for shit.
Suck it bitches!
You know, I think some of you are missing the point in all the technology. I work for a community newspaper publishing company, and we have copyright info at the bottom of every page. I found a guy on Google who demonstrates screen scraping techniques using our main news page. That's fine. 99% of the time, it's not a big deal... it's going to happen. What we don't like is when somebody comes along, takes our content, and presents it in questionable environments, like a page that happens to have porn banners on it. Ever hear of "guilt by association"? Frankly, I think it's more likely to happen if screen scraping becomes more commonplace. Honestly... I haven't noticed a drop in traffic when someone does this.
Here is some more info
If this is how those companies treat their customers, fuck 'em.
While there's absolutely nothing illegal that you're doing - they don't deserve your patronage or the following that your utilities will create for their sites. Take your eyes elsewhere. There are fine alternatives.
Try Mapquest UK and your choice of alternative TV listings.
This begs the question: why screen scrape in the first place? It's not very reliable in the sense that, barring special circumstances, there is no guarantee that the data that is returned in a response will be in the format the scraper expects.
You're basically trying to parse data out of a string that you can at best only *assume* is going to be in a predetermined format. All the target has to do, in a lot of cases, is change a tag, comment, or what-have-you here or there (assuming that the response is a string of HTML) and it can throw the whole thing out of whack.
Now, if the response is just straight data, a return from a web service, or some other special case, then the data from it could probably be more trustworthy. But then again, if you're making requests to a web service, it's not really a "screen scrape", is it? And, I would also assume that if the target went to the trouble to expose a web service, they certainly expect outside parties to use it. Authorization issues, etc. would then become their burden.
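To make the fragility concrete, here is a small sketch of a scraper that assumes programme titles live in `<td class="title">` cells. The page snippets, the class names, and the `scrape_titles` helper are invented for illustration and do not reflect any real site's markup.

```python
from html.parser import HTMLParser


class TitleCellScraper(HTMLParser):
    """Collect the text inside <td class="title"> cells (an assumed layout)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


def scrape_titles(html):
    """Return the list of titles found under the assumed markup."""
    scraper = TitleCellScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.titles


# Same data, but the site renamed one CSS class between the two versions.
OLD_PAGE = '<table><tr><td class="title">News at Ten</td></tr></table>'
NEW_PAGE = '<table><tr><td class="programme">News at Ten</td></tr></table>'

print(scrape_titles(OLD_PAGE))  # finds the title
print(scrape_titles(NEW_PAGE))  # finds nothing, and raises no error either
```

The second call fails silently, which is arguably worse than crashing: the scraper keeps running and simply returns no data.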
In Soviet Russia, Chuck Norris will still kick your ass.
This is just another example of people turning to the law instead of using their brains.
Any admin worth his salt has to deal with undesirable traffic without crying for help. Whether it's spam, badly-written bots, DOS attacks, or just offtopic trolls in community/chat sites.
Don't like traffic from Nigeria, block it. Don't like bad bots, trap them. Don't like "First Posts", invent a clever meta-moderation system to deal with it.
Blaming CPAN for annoying bots is like blaming the NRA for gun violence. Oh, wait-a-minute, I DO blame the NRA for gun violence.
Blaming CPAN for annoying bots is like blaming Microsoft for every w32.Worm. Oh, wait-a-minute, that one is their fault too.
Blaming CPAN for annoying bots is like blaming CD players for Britney Spears. Yeah, that's it.
It's been in common practice for years, but people are just now bothering to complain about it??
"This site best viewed with Microsoft Internet Explorer"
There's no law stating that we have to look at ads.
What about 17 USC 106, which states that barring fair use, etc., the copyright owner has the right to prevent others from creating derivative works of a web page?
Will I retire or break 10K?
... they can.
Signatures are for stupids.
I've done this with my own scripts for eBay (to improve their search engine) and for Yahoo Groups (which hobbles a perfectly fine NNTP model with advertising, lack of threading, and slow HTTP retrieval). You code in your interests, and just key a simple shell command to have your pertinent info retrieved for you, instead of all the tedious pointing and clicking. Yahoo Groups is the absolute worst, but it is free and people are seduced to use it. Yahoo Groups is to Usenet as AOL is to a real ISP.
I don't believe the discussion is about whether screen scraping is feasible, or whether it can be stopped with a bit of intelligence. It is instead a discussion of whether one company has a right to grab content from a website and redistribute it on its own. Yes, it's possible to stop people from doing the aforementioned grab (of course, as this war escalates you're going to have to start shutting real people out of your content), but should people have the legal right to do the grab? Now, what do you think of that question?
I hate liberals. If you are a liberal, do not reply.
ashcroft is a thug regardless of his party affiliation. take off your partisan blinders and understand that patriotism != submission.
Any web site that uses a visual method of authentication as the exclusive method of authentication will be inaccessible to people with vision problems and thus not be compliant with Section 508 of the U.S. Rehabilitation Act, and the entity that operates the web site will lose the U.S. government as a potential customer.
Complete and utter bullshit. OCR with results of 97%? Sure, if the text is consistent, all in the same direction, with basic fonts, and non-contrasting backgrounds.
Basically, everything that the "Enter the word/Image" protection does not use. There are a hundred different ways to alter the text so that nothing but human reasoning can read (decode) it. The beauty of these systems is that the transformations are computed upon request, which means you have no way of knowing what to expect. You might get backwards letters, or letters that are rotated, or words that are upside-down, with each letter in a different, crazy font (i.e., NOT Times Roman or Courier).
Sites like PayPal, Yahoo Mail, Ticketmaster and the like are using this system because so far there is no way around it. A computerized system that requires human authentication like this is an absolutely beautiful challenge to the hacking community. I honestly doubt you have a working solution.
If you did, you would be very, VERY rich, and would be too busy cavorting with naked Playmates on your desert island to write this kind of crap on Slashdot.
They may not like it, but there really is little they can do about it.
Trying to stop content scraping is a losing battle.
They can try to restrict it to real browsers. But what is a real browser? After all, Mozilla is open source. It executes JavaScript, or anything else they might care to attempt to detect. In the worst case, Mozilla, being open source, could be hacked to go to their site (yes, an inefficient Perl module of course), scrape the content, executing JavaScript, etc., and then from Mozilla's menu, pick "Document Structure" and recover the information from there. All automatically and in the background.
They could start using Flash. But if it is text in flash, then the flash file can still be parsed. Its format is documented.
They could start generating a JPEG of the information. That can still be OCR'ed. Efforts to defeat the OCR would just make it harder for the human eyes to recognize. Do you want to look through TV listings in strange fonts, with lines through them, inconsistent or unattractive colors?
The price of freedom is eternal litigation.
Or more related to the point, here are some real-world scenarios:
1. Spammer tries to relay through a machine by looking for well-known CGI. For example, I frequently see requests for /cgi-bin/formail.pl, with the Referer: header set to the name of my domain.
2. Spammer tries to relay through either an HTTP server or HTTP proxy which supports the "CONNECT" method.
Has the owner of the machine explicitly granted spammer permission to (mis-)use his machine, just because a well-known script is present, or because CONNECT is enabled on the wrong side of the internet connection?
I would respectfully disagree.
Hey, Windows users, there is no such thing as "forward" slash, there is only slash and backslash.
Since when do you need a licence to view content? Does a library tell you to sign an NDA or other contract before you can look at those works or publications? If you don't want people to view your information, don't post it on the web. The internet is for information and entertainment, and not many people pay for it in comparison to the rest of us.
If you put something on the web, you have to assume that people are going to access that information in any way that they possibly can.
I suppose the big complaint is that people might not be viewing the "ads" on pages if they use certain HTTP clients.
I have a suggestion for the sites that are complaining. If you don't like it, don't put stuff on the web. Write your own custom client-server solution if you don't want people accessing it with certain browsers or other software.
If you are depending on ad banners for your revenues, you and advertisers are taking a "risk" that people might not see the ads, or that they might not buy advertised products. Tough luck if you lose out on your bet. Hopefully you have a solid way of making money related to whatever service you are providing to make up for it.
Whining about lost ad revenue and such is the same as whining about losing money in Las Vegas. You should have assessed the risks before playing the game.
"You spoony bard!" -Tellah
I have difficulty buying that re-formatting a UI is ``creating a derivative work''.
If you're not independently wealthy, you'll also "have difficulty buying" the services of an attorney to defend you in a court of law.
The definition of "derivative work" in US copyright law can be found in 17 USC 101 plus case law with which I am not very familiar because I'm not a copyright lawyer.
Will I retire or break 10K?
Go Here for discussion last summer over at Perlmonks.
UNIX/Linux Consulting
If a website operator is having their copyrighted content lifted by another site and presented as its own, then that operator can sue using traditional copyright law. If they are having their website slammed because some clueless developer is scraping too often, they can block the IP. But trying to restrict access to the API is heavy-handed and futile.
They are giving you a service for basically free that has enormous costs for them. They expect that page views are just that: people actually viewing the complete page in the way it was presented.
Now, I also believe that companies should provide SOAP interfaces to their sites so that people can properly integrate the information available. However, they should also charge for this service.
Maybe if they don't have any other way to get the information, screen-scraping is acceptable. But it's much better to have a SOAP interface you can use. Oh, and if they do have a SOAP interface, and you screen-scrape to get the same information without paying for it, you are stealing from them.
-Brent
If you think that there wasn't anything as simple as a firearm before firearms were invented, take a look at a crossbow.
"Crossbows don't kill people; people kill people." "Crossbows don't kill people; arrows kill people." "Crossbows don't kill people; blood loss kills people." The clichés are intended to concentrate attention on different parts of the cause.
Will I retire or break 10K?
Dear Bill,
I ask Microsoft to please recall all versions of Windows. They might be used to illegally spread content without my approval.
Thanks,
Your copyright overseer,
Valenti
Actually, this is a field that is quickly being considered a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here.
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
I can think of a few creative uses of mod_rewrite to stop people from hot linking your images. Here is a tutorial on how to set it up using something as trivial as .htaccess.
Hire me...
If the company is unwilling to negotiate over you stripping data from their web pages, perhaps there's another way to get what you want? Why not try asking them to implement web services via XML-RPC or SOAP that would provide the data you desire?
Gabriel Ricard
this news item makes me feel like there is a need for a generic screen scraper plugged into mozilla that would know how to get to a piece of data without having to navigate in mouseclick hell.
it is most probably a difficult task to make a tool that would be easy to use and powerful (how do you describe the way to parse the NFL webpage and get the score of your favorite team?) but sticking it to the people who create artificial limitations on the way their data can be accessed feels like a reward worth the effort.
Dev elpizw tipota, dev phoboumai tipota eimai lephteros http://euclidian.org
When are people going to get it into their heads that public accessibility != public domain? This is, essentially, the argument that both authors and some supporters make: that if it is publicly available then it is within the public domain. It isn't. Books in a library are not in the public domain simply because any schmuck off the street can stroll in and look at them. TV shows and sound recordings broadcast over radio waves are not in the public domain because anyone can pull the signal out of the air. Movies are not public domain because anyone willing to pony up the cash to see one can see it. Correspondingly, webpages are not in the public domain just because any nitwit with a computer and a connection to the 'net can load a webpage.
Damn straight this is about our rights online. It's an educational example that with rights come responsibilities. Those that abuse those responsibilities lose those rights.
I've been using MythTV for the past few months, which uses XMLTV to scrape certain sites for TV program guides. I've felt kind of concerned about using that software. I wouldn't mind paying someone for my TV program guide -- I just don't want the provider to know what I'm watching (one tradeoff you have with Tivo, among others).
In addition, if you have a good site that has a vested interest in providing well-formatted data for you to download, you don't have to worry every day that the website might change its layout or whatever. I much prefer to use something that has a defined protocol, rather than something that is always in a state of flux.
Yes, but if you spoof the user agent, you get around that easily, as is stated in the comments on this page.
If a receiver's only clues as to a client's nature reside in an open protocol, the sender's nature can be faked. Mozilla can look like IE (though I wouldn't know why it'd WANT to), IRC bots can look like mIRC, etc.
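To make the point concrete, here is a minimal Python sketch of how little it takes for a script to claim to be a browser. The URL and the User-Agent string are made up for illustration, and nothing is actually sent over the network; the code only builds the request and shows what the server would be told.

```python
import urllib.request

# Hypothetical URL and User-Agent string, used only for illustration.
LISTINGS_URL = "http://listings.example/today"
BROWSER_UA = "Mozilla/5.0 (X11; Linux i686) Gecko/20030312"


def build_request(url, user_agent):
    """Return a Request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request(LISTINGS_URL, BROWSER_UA)
# No connection is opened here; we only inspect the header the server would see.
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

The server has no way to tell this header apart from one a real browser sent, which is why User-Agent checks alone do not identify the client.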
All I want is a kind word, a warm bed and unlimited power.
If it's purely internal then they should use a VPN and/or intranet and keep their stuff OFF the web.
The web is about as private as yelling at the top of your lungs at a karaoke competition. Anybody who thinks they can tell you to listen with one ear or the other is dumb.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
That's what I tell my clients who try to "encrypt" things in this silly manner. I've written packages that defeat those silly "enter the word contained in the image" tests, and I've written packages that defeat silly anti-automation scripts.
It's really not hard.
Can something that recognizes text in an image be written? Sure. It's just a form of OCR. Can you write one that's able to look at any generic webpage, a mix of text and images, and do what is being asked of a human? I don't believe you can, and it seems a pretty high expectation of any software for the current state of AI. A targeted program for one website I might believe, but such tests for a human are certainly valid protection against web crawling 'bots.
Which is not to say I in any way agree that screen scraping software in any way is a violation of a website owner's rights. It's not.
I'm an American. I love this country and the freedoms that we used to have.
EuroTV has a robots.txt file that asks to leave the various /scripts directories alone. If this Perl module is just ignoring that robots.txt file, then that is just rude, although I don't see how it is illegal.
Streetmap doesn't even have a robots.txt file, so I don't see why they are whining about it.
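For anyone writing a scraper who wants to be polite about it, honoring robots.txt takes only a few lines. The sketch below uses Python's standard urllib.robotparser; the robots.txt text, the host name, and the user agent name are all hypothetical, loosely modeled on the "leave /scripts alone" rule described above rather than copied from the real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the real file is not reproduced here.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scripts/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical scraper name and URLs.
print(allowed(SAMPLE_ROBOTS_TXT, "MyScraper/0.1", "http://tv.example/scripts/listing.cgi"))
print(allowed(SAMPLE_ROBOTS_TXT, "MyScraper/0.1", "http://tv.example/index.html"))
```

With these sample rules the first check comes back False and the second True. In a real scraper you would fetch the site's own robots.txt once and run this check before every request.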
Although I can see why these websites could get upset. The TV-listing screen scrapers are especially bad at hammering a site relentlessly for a sustained period of time to obtain all of the programming information for a certain broadcast area. The scraper has to hit the site repeatedly to obtain all of the information, since it isn't all displayed on a single page. If any one of these scrapers gets to be really popular, it could kill the site.
Of course, the solution to that is to make all of the listing available as one big chunk to avoid repeated requests. But then the site goes out of business in a few weeks due to lack of advertising revenue.
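On the scraper side, the hammering can at least be softened by spacing requests out. Here is a rough Python sketch of a throttle that enforces a minimum gap between requests. The class name, the page names, and the interval are all made up for illustration, and the actual fetch is left out.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Hypothetical usage: three listing pages, at least 0.05 s apart.
throttle = Throttle(min_interval=0.05)
for page in ["page1", "page2", "page3"]:
    throttle.wait()
    print("would fetch", page)   # a real scraper would issue its HTTP request here
```

A real scraper would want an interval of several seconds or more, and ideally would cache what it already has so the same pages are not requested twice in a day.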
I, for one, wish I could buy a subscription to zap2it.com that would give me fast, easy access to the channel listings in, say, XMLTV format. Is $25/year a reasonable fee, considering that I would only hit the site once a day at the most, and grab a single file?
"Tomorrow's forecast: a few sprinkles of genius with a chance of doom!" - Stewie Griffin
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Let's not screw up our legal system with provisions to protect bogus business models. If streetmap.co.uk cannot figure out how to make money putting up information openly on the Internet, then either they should make room for someone who can, or maybe there just isn't a market there.
the spirit of copyright laws are restricting COPYING
The problem here is that a U.S. court decision interpreted a copy in RAM as a "copy" for purposes of copyright law. Thus, when the kernel receives a packet, it COPIES the packet from the network card to the browser's memory, and then the browser COPIES and ADAPTS the HTML into a document tree, COPIES and ADAPTS the document tree into an offscreen bitmap, and COPIES the offscreen bitmap into your video card's RAM.
And if you're arguing fair use, as I said, you better have the money to pay an attorney to back it up.
Will I retire or break 10K?
Gosh, I don't know, but don't I see Google redisplaying site content of billions of pages day in and day out?
Sounds to me like the area's too grey to ascertain right and wrong (I may be, and probably am, ignorant).
However, these sites definitely have every right to do whatever they wish in order to prevent such use, such as IP blocking, taking some creative evasive measures, OR... securing content they don't feel Joe Public should consume.
What would happen if say, General Motors suddenly decided that each and every time a GM vehicle shows up in media that it was an abuse of their intellectual property??
Ptttth!
Anything which allows a Perl program to access their website more than once should be banned. Guess we'd better get rid of telnet. And FTP. And web browsers. Heck, let's get rid of ping just to make sure. Better get rid of modems and the programmers while we're at it. Screw it, just ban computers all together. Then there's no way those evil hackers can fuck with their website!
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
My monitor has an automatic defrost feature.
---
DRM is like antifreeze, to the MPAA/RIAA it's sweet, to the consumers it's poison.
They seem to try (and of course fail) to detect robots: browsing their site with OmniWeb and Safari on a Mac, you'll see a banner at the top of the pages which says:
"This banner is used to kill robots if you see it on a web page please advise technical@beweb.com".
Changing the User Agent in Safari displays a "normal" banner.
I wonder what the genius who set this up had in mind?
Anyway, I reported to technical@beweb.com, and I'll see what they mean.
Do they know LWP::UserAgent? Guess not!
Me no sig.
I do feel pissed off every time we catch someone stealing our content and using it in their own tools. Copyright notices and T&C's are all well and good but they do NOTHING to stop someone from trawling your site.
As an owner and publisher I *can* say how my content is to be used because that's the licence I grant, it's MY choice. If I wanted it to be freely copied and used in any way then I would release it into the public domain...and it will be a cold day in hell when that happens.
The information (in our case TV listings) is costly to collect. I guess the spongers don't realise that or they just don't give a fuck.
I've found the solution is to a) implement technology to try to prevent it, and b) complain directly to their ISPs.
Both of the above solutions work but are themselves costly in terms of the technology and the time taken. These are two things we'd rather not spend our time and money on, and they distract us from creating great software.
At the end of the day if everyone trawled web sites for content then there would be no web sites supplying the content. The people trawling often request thousands or tens of thousands of pages in a very short space of time. The costs in terms of bandwidth and slow service to legitimate customers soon add up.
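For what it's worth, spotting a client that requests thousands of pages in a short space of time does not need heavy machinery. Below is a rough Python sketch of a per-IP sliding-window counter. The class name, the limits, and the addresses are all hypothetical, and this is not a description of what any particular site actually runs.

```python
from collections import defaultdict, deque


class RequestMonitor:
    """Flag clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if this client is over the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop requests that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Hypothetical limit: more than 3 requests in 10 seconds looks like a robot.
monitor = RequestMonitor(max_requests=3, window_seconds=10)
for second in range(5):
    over = monitor.record("192.0.2.1", second)
    print(second, "over limit" if over else "ok")
```

What to do with a flagged address (block it, slow it down, or write to its ISP) is a policy question; the code only tells you who is hitting the site hardest.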
Our downloadable software TV guide (DigiGuide) did in the past have unencrypted data files. We didn't honestly expect someone to take our content and build a (possibly competing) product around our data but they did. The data is now encrypted and should someone crack the encryption then we just change it and their hard work is wasted.
I feel sorry for web sites like TVGuide.com because they probably think they have some very loyal users that spend a lot of time on their site and read a lot of pages...instead they just have people sucking their content and paying them nothing for it. Ignorance is probably bliss for them.
Well, there are ways to terminate the agreement, but they ain't pretty...
Even if he had provided a tool to make a copy of a map, which he did not, there is nothing at all wrong with making and supplying others with that tool. It's how the tool is used that is the issue, and a tool that has legitimate useful uses can never be allowed to be the target of such a complaint or suit.
I'm an American. I love this country and the freedoms that we used to have.
Have you done your part by emailing the two companies and sharing your thoughts, hoping to steer them to reason?
If not, do so.
If you did, then switch to your anonymous email account and send them some nice hate mail as well...
__________
Don't belong. Never join. Think for yourself. Peace!
Yeah, and you're only allowed to look at my house with one eye.
... in the words of Mr. Garrison:
No!! No, No, No! I'm Mr. Hat and you're, you're a little turd! You hear me?!? You go to hell! You go to hell and you die!
Who do these people think they are... threatening programmers for writing code to read info that is published openly and allowed to be read from their servers...
Dipshits don't know what the web is for, but they use it anyway... next they'll threaten to sue makers of scissors or highlighters for providing tools to extract TV listings from the paper.
And we all know that if there isn't a specific law which pertains to an exact version of a piece of software, then that software will bring about the downfall of humanity, right?
It doesn't matter if the *issue* is already dealt with in lots of other laws, we need to create a new digital law to deal with this CyberCrime, because...errr... think of the children!
Can we go home now?
if someone is scraping the data and then redistributing it for their own profit - then that would be sketchy to me - but otherwise, I don't see how the way you look at it matters.
I have scrapers for a few things, and I doubt the company even notices (since they aren't mass distributed scrapers like a CPAN module is) - but I would think scraping wrong if I then took that data and used it for my own profit while getting it free from the other source. (thinking mainly of stock data - strip it for free from one location, but then charge users to see that data on your site)
There are some odd things afoot now, in the Villa Straylight.
Nice troll.
I find it very funny that you cut off the rest of the phrase you quoted -- "COPYING and DISTRIBUTING". Last time I checked, a computer's memory is not -- by nature -- a distribution medium to a mass audience.
The
"Information contained on this server is copyrighted and may not be distributed, modified, reused, re-posted, or otherwise used outside the scope of a WWW client without the express written permission of B. On The Net, owner of the EuroTV site."
If someone wants to write a program to push this data through another application or website they need to take the time to establish a way to build the database on their own. No one can copyright the data itself but they can keep you from using their data as the source.
A classic instance is the "deep linking" cases, where somebody doesn't want to let you see their deep pages except by coming through their front page. Rather than taking this to court, as several content providers have done, and beat up on users one at a time, it's much simpler to check the HTTP-REFERER to find out what page the request came from, and send an appropriate response page to any request that doesn't come from one of their other pages. (Whether that's a 404 or a redirect to the front page or a login screen or whatever depends on the circumstances.)
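As a rough sketch of that Referer check, in Python: the host name, the function name, and the choice of a redirect to the front page are all assumptions for illustration. Note that the actual HTTP header is spelled "Referer", and that it is supplied by the client, so it can be forged just like the User-Agent.

```python
from urllib.parse import urlparse

SITE_HOST = "www.example.com"   # hypothetical site name


def decide(path, referer, site_host=SITE_HOST):
    """Return (status, location) for a request, based on its Referer header.

    The front page is always served. Deep pages are served only when the
    Referer points at one of our own pages; otherwise the visitor is
    redirected to the front page.
    """
    if path == "/":
        return (200, None)
    host = urlparse(referer).hostname if referer else None
    if host == site_host:
        return (200, None)
    return (302, "/")


print(decide("/maps/detail", "http://www.example.com/index.html"))
print(decide("/maps/detail", "http://other.example/links.html"))
print(decide("/maps/detail", None))
```

Here the first request is served, and the other two get sent to the front page. Swapping the redirect for a 404 or a login screen is a one-line change.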
Screen scrapers are an interesting case for a couple of reasons. One of them is that blind people often use them to feed text-to-speech browsers, so banning them is Extremely Politically Incorrect, as well as rude and stupid. Another is that anybody with a Print-Screen program on their PC can screen-scrape - you're only affecting whether they get ugly bitmaps or friendlier HTML objects. So you not only have to ban custom-tailored CPAN objects, you have to get Microsoft and Linus to break the screen-grabbers in their operating systems.
The related question "ok, so how *do* I detect and block http requests I don't like?" is left as an exercise to the blocker (and to the people who build workarounds to the blocks, and the people who also block those workarounds, etc...) The classic answers are things like cookies (widely supported "need the cookie to see the page" features seem to be available), ugly URLs that are either time-decaying or dependent on the requester's IP address, etc., or just checking the browser to see which lies it's telling about what kind of browser it is. There's also the robots.txt convention for politely requesting robots to stay away, and Spider traps to hand entertaining things to impolite robots or overly curious humans.
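One way the "ugly URLs that are either time-decaying or dependent on the requester's IP address" idea could be done is with a keyed hash over the path, the client address, and an expiry time. The Python sketch below is only one possible scheme; the secret key, the parameter names, and the five-minute lifetime are all made up for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"   # hypothetical server-side secret


def make_token(path, client_ip, expires_at, key=SECRET_KEY):
    """Return a hex token binding the path to one client IP and an expiry time."""
    message = f"{path}|{client_ip}|{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def signed_url(path, client_ip, now, lifetime=300, key=SECRET_KEY):
    """Build a URL that only works for client_ip until now + lifetime seconds."""
    expires_at = int(now) + lifetime
    token = make_token(path, client_ip, expires_at, key)
    return f"{path}?expires={expires_at}&token={token}"


def is_valid(path, client_ip, expires_at, token, now, key=SECRET_KEY):
    """Check that the token matches this path and IP and has not expired."""
    if now > expires_at:
        return False
    expected = make_token(path, client_ip, expires_at, key)
    return hmac.compare_digest(expected, token)


# Hypothetical request: a link handed out at time 1000 to 192.0.2.7.
print(signed_url("/listings", "192.0.2.7", now=1000))
```

A link generated this way stops working after the lifetime passes or when it is requested from a different address. It does nothing against a robot that simply loads the front page first and follows the fresh links, which is the arms race described above.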
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
What is the line in the sand separating a Perl script and Mozilla, in this situation?
Both collect data from the web, process it, and display it in a form understandable by the user. It just happens that one is more popular than the other.
If I was to rewrite that module to use AppleScript under OS X to go to their website, fill in the form, and save the image to my hard drive in a desired location, could they say I was violating their terms of service?
I'm using a web browser to access their service; it so happens that my preferred interface to that web browser is through AppleScript, instead of through the mouse and keyboard. Does that make it unacceptable to use their site?
At what point is the information provided to you fair game? You can manually take any information that is made public and view it under your own terms. You can save the page, copy/paste into Notepad, import into Excel and even sort it.
o es mithhm@hotmail.com/ open/open.cgi?x=jack smithhm@hotmail.comn /open/open.cgi?x=jims mithhm@hotmail.com
What about public URLs? Can you visit a public URL as often as you want as long as you don't DoS the server?
I got a spam that used a tracking URL for an image source. Can I write a script that passes that URL bad tracking information?
http://216.219.227.69/cgi-bin/open/open.cgi?x=j
http://216.219.227.69/cgi-bin
http://216.219.227.69/cgi-bi
Im a gamer, not a grammer major. This post is full of spelling and grammer mistakes.
The aggregator of television guide content is in an interesting business... and problem. Yes, it is costly to get guide data, and they need to recoup their losses. Screen scraping free guide websites bypasses the ads, which is, of course, how those websites pay for their work.
But there are so many potential uses for free guide content in an easily transformable format. Besides VCRs or PVRs that can find programs for you, what about websites dedicated to finding your favorite shows, or a simple PDA app that alerts you when it finds out that your favorite movie is due to air? So where is the data? Who would provide it?
Well, with a standard format (or this ) for such data, I believe the producers of that content (or, rather, the distributors... the networks themselves) could provide that data. It's obviously in their best interests for that data to be accurate, and as freely and widely available as possible. They want people to find their shows. Combine it with a simple automated lookup-table translation of network names to your local cable station numbers, and you're set. Of course, something like this could put guide aggregator businesses out-of-business, and I really doubt it will ever happen... But it probably should!
I use a Betabrite LED sign as one of my web-browsers:
http://www.remote-control.net/software/ledsign/
I always thought of html or xml as data that is provided publicly (internet) or privately (intranet) and the reader application uses what it can. (Internet Explorer vs Lynx vs BetabriteHeadlines)
Linus wrote the kernel, not the OS... and definitely not the GUI.
I'll be here all week ladies and gents. Please, try the fish.
Well, if screen-scraping is illegal (and in some forms, it certainly is), then somebody should sue the people who sell programs that harvest e-mail addresses from web sites.
This is the page that he is referring to. Admittedly, the last two were rather small, but they did have cows, and it was pretty plain to me after looking at just the first two pictures that the theme was cows.
Do not read this sig.
Unless I'm sadly mistaken, it's CODE that screen scrapes Web pages that's used by Google et al to populate their databases. Without these search engines, a lot of these sites wouldn't be getting any eyeballs at all.
What about that blind guy who is suing the airlines for not making their site blind-person friendly? Will he be suing these companies too since they require him to read an image? lol.
Dish TV however takes their customer interaction guidelines primarily from Ernestine the Telephone Operator ("Cackle, we don't have to care!").
Hence their answer has been uniform. They don't bother to answer.
So I've been using XMLTV to download the listings, and a homegrown XSL transformation to change the listings to a nice grid for viewing in a browser. Works just fine, but I'm sure that the people running the sites with the listings are getting cranky at people doing this or similar things with XMLTV.
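The transformation step is small either way. My own version is XSL, but the same idea can be sketched in Python with the standard ElementTree parser. The sample document, channel ids, and titles below are made up; it only assumes the usual XMLTV layout of programme elements with start and channel attributes and a title child.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in the XMLTV layout; channel ids and titles are hypothetical.
SAMPLE_XMLTV = """<tv>
  <programme start="20030401200000 +0000" channel="movies.example">
    <title>Late Film</title>
  </programme>
  <programme start="20030401190000 +0000" channel="news.example">
    <title>Evening News</title>
  </programme>
  <programme start="20030401180000 +0000" channel="movies.example">
    <title>Matinee</title>
  </programme>
</tv>"""


def listings_by_channel(xml_text):
    """Group programmes by channel as sorted (hhmm, title) pairs."""
    grid = defaultdict(list)
    root = ET.fromstring(xml_text)
    for prog in root.findall("programme"):
        title = prog.findtext("title", default="(untitled)")
        start = prog.get("start", "")
        # XMLTV times look like YYYYMMDDhhmmss; characters 8-11 are hhmm.
        grid[prog.get("channel")].append((start[8:12], title))
    for shows in grid.values():
        shows.sort()
    return dict(grid)


for channel, shows in sorted(listings_by_channel(SAMPLE_XMLTV).items()):
    print(channel, shows)
```

From that dictionary, printing an HTML table with one row per channel is a few more lines. The point is that once the data is in a defined format, the client side is trivial.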
Worse yet, I believe that if they provided the listings in a slightly different format and compressed, the downloads would be much less onerous.
I suspect they're trying to figure out how to make downloadable listings available and then charge heavily for them - the paper version is now a couple bucks a month ($3.95?) so they're probably counting on being able to charge at least $30/month for the electronic version.
And they'll undoubtedly justify those high costs by pointing to the load XMLTV places on their systems.
And when they do, they'll stop anyone from using XMLTV or the like. I suspect that EuroTV is just doing the same thing.
Technologists should realize that:
1) More robots use the internet than people anyways
2) They cannot dictate how their site is used, just that it is used in a certain way. For instance, if the protocol is HTTP, and as long as a person or a robot uses HTTP they really don't have that much to say.
They don't have a broke-ass-splintered pirates leg to stand on by saying how to get the information as long as what I do with it after I get it is fair use.
If they want to be profitable, sell fucking memberships and quit bitching. The Internet ain't free (or am I the only person who fucking learned anything since 2000)
For instance, let's say I hit a website... do I care if the other end of the connection is a person responding to a terminal request for content and they drag and drop all of my stuff into a network hole? Or does it matter if it is an Apache or an IIS server? It don't matter.... as long as we both follow an agreed upon protocol, they can eat my ass if they think I am going to grab something with a browser when I can have my agent do it.
I say to those Nazi ass sniffers that they can slurpo my dongo before they tell me how to use a computer.
Oh, and just so this comment can get modded higher, Bill Gates and Microsoft can swallow along with Dell and Toyota.
If they really cared about keeping this stuff from the public - surely they would have used SSL (HTTPS), even with an old expired key (as is so common nowadays)? I don't see a single reason to not blame them for this. You might feel another way, but technically you'd be wrong however you turn it.
You are really thick aren't you.
He wrote a package that happened to incorporate some OCR libraries that someone else wrote. He didn't claim to write the OCR libraries. Also, without his package, the OCR libraries wouldn't be applied to defeating this security.
Why do I waste my time with trolls? Why?!
Then I assume such agreements do not apply to c-section kids, do they? Oh the ineffable joy of medical techno... oops... does this make a c-section a DMCA circumvention device? =8-Z
My other Beowulf cluster is... er...
If they don't want people to use the information the way they do, why the hell are they publishing it on servers connected to a network not controlled by them...
I mean seriously, are they now telling us what packets and requests we are allowed to send over the internet?
By hosting an internet server they are accepting that people can connect to it and send the data they like. If they don't like it, they should try to outsmart people with clever protection software, or host it on their own private LANs.
MOD PARENT UP
Which has been ruled legal in court, if I recall correctly. While it is legal to copyright the "arrangement" of the information, grabbing somebody's yellow pages and transcribing the whole thing with some re-typesetting makes that list of phone numbers yours, even to redistribute.
See here for info: http://www.writing-world.com/rights/fair.html
Search for the phrase "phone book" on the page. Copying "creative Works," like fiction or poetry, is frowned upon, while copying facts is not. The copyright is "in the expression" to quote the site, not in any underlying facts. If facts were copyrightable, we'd all be completely screwed. As it stands right now, we're only 97% screwed.
Actually, I believe that an art patron is perfectly entitled to vandalize anything that they buy.
There does exist limited protection of "moral rights" in United States copyright law, in 17 USC 106A, which would prevent such defacements.
Will I retire or break 10K?
I read it wrong. I thought that by "restricts COPYING and DISTRIBUTING" you meant "restricts COPYING or DISTRIBUTING", or "restricts COPYING and restricts DISTRIBUTING".
I also thought that copyright law restricted copying works other than computer programs into RAM except subject to limited fair use exemptions. But now, after trying to determine whether HTML counts as a computer program, and then reading and re-reading 17 USC 101, I realize that under a broad interpretation of 101, any work fixed digitally could be termed a "computer program" and subject to the additional limitations of 17 USC 117. Can anybody cite case law pertaining to this?
Will I retire or break 10K?
It all comes down to money and the models people have used to force advertisements onto people while they are entertained or educated.
The cold, hard truth is that the digital future obviates the traditional content control mechanisms used to force consumers to watch ads for content. The exact same lines are playing out on the web, on TV, in music, movies, magazines -- everywhere information can be digitized and presented in ways not tied to physical mediums.
The (now old) business models that the digital methods circumvent will eventually be redefined. Short term laws will support them, because the industries have enough money and clout to cause the laws to happen. Long term, though, people will no longer stand for the absurd, one-sided contract with society that is our current IP system.
This is a vague comment, quickly written -- but I see here the exact same theme played out over and over in recent years. Free communication (amortized) + 'digitizable' items of value => lack of control by provider for profits. This is yet another example.
do you or do you not claim
I claim nothing. I have never even set foot inside a law school. I just wondered if anybody more familiar with the case law could elaborate, particularly about how much originality it takes to make a derivative work (as opposed to de minimis alterations).
are you or are you not always this pedantic with informal english?
No. But when it comes to the fine points that win or lose a lawsuit, pedantry pays.
Will I retire or break 10K?
before the lawyers and marketers got involved. We've had problems with: linking and deep linking, DRM, censorship, content legal in one locale and illegal but still accessible in others, domain name speculation, parody sites, libel (maybe even slander), etc. Of course many of these problems are caused by us litigious Americans, what with our ancient (technophobe) judges and all, but Germany, France, China, Saudi Arabia, etc. are also guilty. BTW, my employer wrote a screen scraper, and periodically the source IPs would be blocked, even though we had a contract with the target website(s), so we would renumber the scrapers every time the block happened. My guess is the target's tech people never spoke to the business development people.
Like this: Show a picture of a tree. The user fills in the blank. T-R-E-E. Any dipshit would get that right. Hell, even give them the T. I don't think a computer would get it in three tries - after that, do a 1 hour IP lockout. That should also prevent "guessing."
If you had a bunch of such problems, it would make it pretty tough. Would some of them be solvable some of the time? Maybe. But staying ahead of computers in the Turing test has ALWAYS been very easy.
But I know what you mean about daytime talk shows. ;)
-Looking for a job as a materials chemist or multivariat
Somebody came up with a proggie that puts up a program launcher window that covers the Opera ads precisely.
Some people in the freeware newsgroup complained that this violated the user's agreement with Opera allowing use of the ad-supported version of Opera.
I argued that since Opera's EULA makes absolutely NO mention of any requirement to view the ads (or even not to alter them), there was NO deal between the end user and Opera concerning the ads in any way. And even if there was, it would be ridiculous to demand that people actually view the ads.
There have been some ad-supported programs, IIRC, that actually demanded that people click on the ads before the program would function. I suspect most of these programs died a quick death in the marketplace.
The proggie was just a more sophisticated way of putting tape over that section of the monitor...
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
to use an online resource, I do think there is a moral duty to pay in some way the provider of the information. Maybe if the provider actually created an easy API to retrieve information they could then ask coders to reciprocate the gesture. For example hyperlink them, or embed supplied advertisements. Unless a compromise can be reached then we get a confrontational attitude where both sides waste resources, the data source trying to protect itself and the other side continually trying to break the protection. CDDB is a nice example, it's embedded into loads of programs, but they all (or are supposed to) ask you to supply CDDB your email address before it'll become functional.
Palladium technology (or possibly TCPA) could fix this, without any new laws or arguing over copyright terms.
With Palladium, the server site could verify that the client (that's you) is running an approved (by the site) web browser and not a screen scraper. In order to access the site you would have to run a Palladium OS and run one of the web browsers the site owner accepted.
Your freedom would be complete. You could choose not to view the site, or you could choose to view it under conditions which are mutually acceptable between you and the site owner. That's the same basic bargain being offered in every voluntary transaction in the world.
Someone should remind these two companies that systems that display their sites as WAP content on mobile phones are actually scraping systems!
Jeez, how stupid can people get!
Don't Tread on OpenSource
One of my scrapers wraps IE. Another is a script that runs inside an html page, and scrapes using IE inter-zone bugs with IFrames. I'd love to see the language that precludes either one of those techniques from being used...
help me i've cloned myself and can't remember which one I am
Good one. The example did say IE 4 "or higher".
Congratulations, your custom software is "higher".
Or, "installed on your system" doesn't mean you need to use it. Use another browser.
Especially if you can't install IE on your OS. In a way that works, anyway.
http://cpan.azc.uam.mx/modules/by-category/15_World_Wide_Web_HTML_HTTP_CGI/WWW/JOHANVDB/
May whatever you hold holy bless Google. =)
1. Restrict http access use to certain programs ...
2. See number 1
3.
4. PROFIT!!!
by offering your work on the web, you are posting it to the public domain and have fuck-all right to say what happens to it afterward.
At least that's the way it will be when I rule the web!
someday...
Under the current state of US law, unauthorized access to a computer system is a federal crime. (I can't speak to EU laws, but I suspect parallels exist.) If Company X says, "You must use Internet Explorer 5.5 to access this site," then you must use IE 5.5. Of course, it would be just plain stupid to do so, but it's their computer system, and they get to decide who is authorized.
To judge from most of the comments here, the fact that it is incredibly stupid to impose such restrictions has obscured what is actually a legally unambiguous situation. Just because it's dumb doesn't mean it's not legal.
That an http server is nominally "public" doesn't mean diddly here. Any number of http servers provide for member- or employee-only access. The brick and mortar parallel would be those signs that say things like, "No shirt, no shoes, no service."
It is surprising that so few people have touched on the reason why companies might object to the distribution of Perl modules designed to harvest data from their sites: bandwidth costs and site performance. It doesn't take too many cron jobs banging on a site every minute -- and being ignored by their users most of the time -- to degrade site performance for "live" users and run up steep bandwidth bills.
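To put rough numbers on that, here is a back-of-the-envelope calculation in Python. Every figure in it is an assumption picked for illustration (500 installed copies of a scraper, one poll per minute, a 40 KB page); none of them come from EuroTV or Streetmap.

```python
def daily_load(clients, polls_per_hour, page_kb):
    """Return (requests per day, gigabytes per day) for automated polling."""
    requests = clients * polls_per_hour * 24
    gigabytes = requests * page_kb / 1_000_000  # KB -> GB, decimal units
    return requests, gigabytes


# Hypothetical figures: 500 cron jobs, once a minute, 40 KB per page.
requests, gb = daily_load(clients=500, polls_per_hour=60, page_kb=40)
print(f"{requests} requests/day, about {gb:.1f} GB/day")
# prints "720000 requests/day, about 28.8 GB/day"
```

Change the guesses as you like; the point is only that the load scales with the polling interval, not with how often anyone actually reads the result.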
Now, there is certainly no legal basis for Company X to demand that CPAN remove the modules, though it is hardly out of line to ask nicely. But there are firm legal grounds to prohibit anyone from actually using those modules.
Legal action is probably the wrong way to handle this, though. Having written fairly complicated web scrapers before, I know how easy it would be to make a site virtually impossible to harvest. Rather than make a big stink about the Perl programmers who contribute to CPAN, Company X would be well-advised to hire a good Perl programmer to thwart automated harvesters.
Proud member of the Weirdo-American community.
When the water breaks, it is akin to breaking a shrinkwrap, so consent is implied.
Shame on Google.
It's not the ads that I dislike, it's the way they're made. Flashy, blinking ads are annoying; they drove me to turn Flash off. OTOH I don't mind text ads and won't mind clicking on them if I'm interested.
I've seen some sites start using text ads, lwn.net for example. And Google has been doing it for a long time now, though it gets more annoying with more ads per page. Well, I guess they're successful *grin*
Yeah, but it matters if there is creative expression involved. No one is saying a webpage is in the public domain, but rather that no one can copyright a fact. TV shows are not in the public domain, but no one can stop me from reporting that Joe Blow said "such and such" last night at 8:00 PST on channel 5. Thus the responsibilities to which you refer do not exist and are contrary to the public good with respect to the topic at hand. So remember, when people give up their rights because they are made to feel it's the responsible thing to do, they are being irresponsible.
In the UK most people killed by firearms are criminals killing each other, and there are just a few incidents like those every year. It is similarly so in most other places with gun restrictions.
In the US any retard can kill somebody else, they could not if they did not have a gun. It is similarly so in other places with lax gun restrictions or no restrictions at all.
Obviously the cause of so much violence is the widespread availability of guns.
You confuse cause and symptom. That tragic mistake committed by US society is costing people in your country their lives every day.
IANAL but write like a drunk one.
... is written where?
IANAL but write like a drunk one.
I was mistaken as to the interpretation of USA law.
Well, in Seattle they have decided it is OK to photograph up the skirt of a woman on the street. But I guess looking up a web site's skirt is just too disgusting and lecherous...
Personally, if I want to re-use content from an ad-supported site, I'd present the ad to the eyeballs. The company is being paid to show the ad along with the content, and I won't steal the content.
However, "Click the Monkey to Win!" isn't presenting any advertising information, and if the screen has no Web browser...the advertiser isn't going to get much business. But then, they don't get any business from my not clicking on the stupid image anyway. They'll have to present me with info which is interesting rather than stupid.
This one quote by the Perl developer says more than all the comments on Slashdot about how the real world works:
Today, they threatened me with a law-suit for writing this module. I would like to have the WWW::EuroTV module removed as soon as possible from CPAN and any of its mirrors.
In the finding, it says things like "By giving users access to Kelly's full size images on its own web site, Arriba harms all of Kelly's markets. Users will no longer have to go to Kelly's web site to see the full-sized images, thereby deterring people from visiting his web site."
Is scraping and reformatting the data you get equivalent to framing somebody else's content on your web site? Certainly the modification happens at a different point, and you are writing code to scrape and reformat, rather than HTML to make the browser load content from two different locations to simulate a single web page.
I'd love to be proved wrong on this....
Isn't this a strange fact? They threaten people for writing an API for EuroTV, but they themselves offer a free API using JavaScript to display TV listings on your site! You can even use your own design, and I don't see any image you have to display on your site to give them credit. Okay, one catch though: you can click on a link that will bring you to the detail page of the show.
e mple_TV _Tonight.htm
This is what they say about the JavaScript API:
To have the TV tonight programs in their native language ! on your site, just copy the following line hereunder. Don't forget to change the parameters to adapt the country. Feel free to adapt the header and footer of your page before and after the include to adapt the look and feel to the one of your site.
You can read more here:
http://www.eurotv.com/scripts/jsinclude/Ex
Then, of course, someone else (whoever gets the IP from the shared IP pool) is locked out of that site for an hour.
But with the tree example - what if they try the following 3 answers, also valid responses, but not right?
- bush
- elm
- forest
Each response is a matter of the viewer's perspective.