Vertical Search Engines and Copyright
An anonymous reader writes "I am a big fan of Oodle, the online classifieds aggregator. I was disheartened when Craigslist announced that they would block Oodle from their site in late 2005 (old link), as I find their service very handy. I came across this page at the site of an aggregator of freelance job openings that summarizes the arguments around the legality of meta search engines and mashup-like sites and I found myself wondering if Oodle could have avoided the ban. There is an interesting argument there that seems to undermine copyright claims of user-generated content compilations. Are mashups legal? How does this affect sites like Digg or YouTube?"
I just can't get the hang of typing in a downwards direction.
does not a Slashdot post make.
In content aggregation lies all of my excitement about the future of the web (if people are allowed to continue being innovative and aren't prevented by heel-dragging by legal departments).
I don't even care if the aggregation happens server-side or browser-side. I want to be able to view a book product page on Amazon and click a "place local library hold" button. I want to be able to view my LiveJournal Friends page and have superimposed queue and "recently watched" displays for those folks who are also my Netflix friends. Or current weather reports for those friends' locations. Fun stuff. I want to be able to stumble across an old news story and have a "there were 117 comments when this story was posted to Slashdot five months ago" notification.
There is so much potential here for crossover - and it's all data that already exists! Crosslinking through simple knowledge of "which person on one service is which person on another service" - and "which product on one service is which product on another service" - would open so many doors. I hope legal departments don't keep preemptively closing them. To me, this is what would excite me if it were true about "Web 2.0" - beyond just simple pretty, AJAX-enabled user interfaces. Although those are cool, too.
"Are mashups legal?"
What the hell is a "mashup"? When a butter and a potato truck collide?
Now, maybe I'm just not keen on the latest batch of synergistical leet speak, but aren't Digg and YouTube user-contribution-driven aggregators? Isn't the key feature of a mashup that it uses functionality from different web services to create a new set of functionality? Say, like tying CNN's RSS feed to Google Maps to Flickr to get an interactive, graphical, geographical news-browsing interface.
Or am I just out of touch?
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
If the mashup is put together in such a way that it takes enough of the original data to significantly decrease the original site's traffic, then yes, it is overstepping its bounds. But a lot of the time, two original sites really should be combined, with the mashup containing a small snippet of each topic; this way it points traffic toward the original sites and makes things easier for people. Whether that is the case here or not, I have no idea; I don't really visit the mashup enough to tell.
Mashups are fine so long as they drive MORE traffic to the site than if they didn't exist. Things like the classified ads thing actually lower the number of people who come and look at the site.
The line is not their existence or non-existence. The line is whether they provide more or less service to the people who run the original site. If it creates traffic, good. If it lowers traffic, bad.
If I'm understanding correctly, craigslist has terms of service, and Oodle was systematically violating them. That's their right, whether there's a formal copyright violation or not.
I'd never heard of Oodle, but craigslist is notoriously easygoing and their terms (you can run searches but not mirror the whole damn thing) seem reasonable, so I think the way Oodle could have avoided the ban is by not pissing Craig off.
If it's a site that is funded strictly from ads, then they have a lot to lose by others ripping their content. But at the same time mashups are a wonderful way of getting a lot of similar info together so it's a convenience to the end user.
IMHO aren't the two somewhat related?? (I know not really, but kInDa)
I don't know if mashups are legal in the strictest sense, but I do have an idea how I would want it to work. Academic publications are impossible to produce without citing the work of others. That's how research works. Information that did not originate with the author is attributed to its respective source(s). No muss, no fuss, usually, and there are accepted conventions for how this is done. Right now I don't think the web has any such accepted conventions, but it should. Practically speaking, it would be impossible to close down all aggregation sites anyway, so the best course of action, imho, would be to develop standards for citing information that comes from other sources. While these still can't be enforced 100%, peer pressure should at least give people the idea that citing sources is a good thing.
To the making of books there is no end, so let's get started
Content is not born in a vacuum.
Most content is derived in some way from other content.
There is some magical line between "inspired work" and "derivative infringement", but the line is merely a legal one and not necessarily a moral or helpful one.
Take "The Matrix" for example. Perhaps The Matrix is derived from "Dark City", "Metropolis", biblical stories, anime, kung fu movies, and more.
But if the makers of The Matrix were to admit derivation, then they would likely have to pay royalties.
The effect of current copyright law is that your work can derive in two ways:
1. Derive directly. Pay royalties, give credit.
2. Derive in a subtle way. Make it different enough where it is said to be an "original work". Give no credit.
One could think of derivations in engineering terms.
Imagine a clean room reverse engineering of music.
1. Person A listens to music. Writes down timing, types of scores, and chords used, but no actual notes.
2. Person B writes "new music" using "music specification".
Someone please finish this post. changing to AC. Time to take a nap.
I am making something similar to create notifications for posts on craigslist right now. It is written in Ruby, and it basically enters the sections you specify on craigslist, then downloads and stores the last 100 postings into an SQLite3 database.
Then, as a human might do if he were obsessive, it checks the section indexes for updates, say every 10 minutes, and incrementally stores new posts.
The data in sqlite is then indexed by the ferret search engine library, so that it can perform searches on the post content and uses gtk2's libnotify to pop up a notification bubble if it has found anything you previously said you were interested in.
I have not gotten banned in any way from craigslist, and I don't expect to be, since beyond the initial download of the sections, it behaves no differently than an obsessive human who might be looking at 10 pages every X minutes. With this, I would necessarily be one of the first people to notice anything on the site that I'm interested in.
I will probably release this on my site for everyone. I'm aware it's against the terms of service to completely mirror the entire site, but does this count as mirroring? Can it be deemed similar to grepping your Firefox cache, or personal mirroring and indexing?
I know I'm sure as hell going to use it, that's why I made it, but it is an awkward feeling that if I give it away for free and people liked it, that I could get into some kind of trouble.
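For anyone curious what the store-and-check-for-new-posts loop described above might look like, here is a minimal sketch. It is in Python rather than the poster's Ruby, it swaps plain substring matching in for the ferret index, and it leaves out the fetching, the 10-minute timer, and the libnotify popup entirely. The function names and the `(post_id, title, body)` tuple shape are assumptions made up for illustration, not anything from the actual tool.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the SQLite database that remembers seen postings."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "post_id TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def store_new_posts(conn, posts):
    """Store postings not seen before and return only the newly stored ones.

    `posts` is an iterable of (post_id, title, body) tuples; this shape is
    a hypothetical stand-in for whatever the real scraper produces.
    """
    new_posts = []
    for post_id, title, body in posts:
        seen = conn.execute(
            "SELECT 1 FROM posts WHERE post_id = ?", (post_id,)
        ).fetchone()
        if seen is None:
            conn.execute(
                "INSERT INTO posts (post_id, title, body) VALUES (?, ?, ?)",
                (post_id, title, body),
            )
            new_posts.append((post_id, title, body))
    conn.commit()
    return new_posts


def matching_posts(posts, keywords):
    """Return postings whose title or body contains any keyword.

    Case-insensitive substring search, used here in place of a real
    full-text index.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for post_id, title, body in posts:
        text = (title + " " + body).lower()
        if any(k in text for k in wanted):
            hits.append((post_id, title, body))
    return hits
```

Each polling pass would call `store_new_posts` with whatever was just downloaded and hand only the returned new postings to `matching_posts`, so a posting triggers at most one notification.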
Craigslist blocked a site, it's their right and there's nothing wrong with that.
I have experience at two companies that did site aggregation. The first was a company that did travel deals by searching other sites, and the next was a job site that did the same. Searching and presenting a summary with a link to the real live content is legal. Taking the content and repurposing it, even with credit, is illegal. So as an example, with a travel site, searching all the airlines, Expedia and so on, and displaying links with prices is valid. However, showing the flights and prices without links and then booking in the background, never displaying the original site, is illegal. We had a number of companies that tried to sue us; we would send over legal opinions and case history on the topic, and the suits would disappear. However, we did have a few sites that blacklisted our IPs, tried to break our scraper, and posted nasty things about us on other sites.
I wonder what the rules are surrounding my site..
Then again, news = current events and current events are not copyrightable..
MABASPLOOM!
I have found Yahoo Pipes to be an indispensable companion to Craigslist RSS feeds. I can plop in feeds from, say, Fresno and SF Bay, search with positive, negative, and grouped searches, and restuff the results back into a new RSS feed.
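The same kind of merge-and-filter step can be roughed out locally. The sketch below is not how Yahoo Pipes works internally; it just assumes plain RSS 2.0 documents with un-namespaced `item`, `title`, `link`, and `description` elements (real feeds may use namespaces and would need adjusting), and the function names are made up. It covers positive and negative terms only, not grouped searches, and returns a list instead of writing a new feed.

```python
import xml.etree.ElementTree as ET


def filter_items(rss_xml, include=(), exclude=()):
    """Return (title, link) pairs for items that contain every `include`
    term and none of the `exclude` terms (case-insensitive)."""
    root = ET.fromstring(rss_xml)
    kept = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        desc = item.findtext("description", default="")
        text = (title + " " + desc).lower()
        has_all = all(term.lower() in text for term in include)
        has_banned = any(term.lower() in text for term in exclude)
        if has_all and not has_banned:
            kept.append((title, link))
    return kept


def merge_feeds(feeds, include=(), exclude=()):
    """Apply the same filter to several feed documents and concatenate
    the results in feed order."""
    results = []
    for rss_xml in feeds:
        results.extend(filter_items(rss_xml, include, exclude))
    return results
```

Calling `merge_feeds([fresno_xml, sfbay_xml], include=["scooter"], exclude=["parts"])` would keep scooter listings from both feeds while dropping anything that mentions parts; the two variable names are placeholders for feed text you have already downloaded.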
You can't copyright fact. Once you've put something out there for other human beings to experience, it has become a physical fact, and there is no longer any control over it, because, as I said, facts are not copyrightable.
You could spend your whole life gathering data about something, publishing a book with the data, but you can't copyright the data, because it's fact. Sure, you could copyright the verbatim instance of these facts, but that only protects you against others claiming your book as theirs. Anyone could use the same data in their books, even the same layout, sorting, etc.
Same thing with lyrics. Once they've gone public, they become fact, not property. Same thing with MIDI files. People listen to the fact (waves of air), jot down notes in a sequencer, save an approximation, and they can do so legally, because the initial work has become a factual presentation, and again, facts are not copyrightable. It's easy to understand, if you want it to be.
Anyway the comments so far seem to be blurring together several important but very different notions.
All of these are interesting, difficult questions, but let me say a few words about each in turn.
--
For #1, I am reasonably confident that the courts would find that crawling so resource-intensive that it effectively amounted to a DoS attack was banned, but it's unclear where this line would be drawn. For instance, is crawling that just causes a noticeable slowdown for other users enough to place one in this category? Does the size of the website suffering the slowdown matter, or how frequently it happens? It would be unfortunate if archivists were forced to let any owner of a public page opt out because they don't know whose pages are being hosted over 56k modems. A good resolution to this problem will most likely have to await significant agreement on a de facto set of rules for playing nice that Congress could then baptize into law. At the moment, so long as a reasonable person wouldn't call you a DoS attack, you're probably safe in regard to the pure server load issue (though IANAL).
A more interesting question here is whether someone crawling your site is bound to follow your terms of service. Those silly little "you are not a member of law enforcement or the RIAA" access requirements have not held up in court, suggesting that a totally open website like craigslist can't demand you accept its terms of use just to crawl it (and your bot surely didn't sign a contract).
---
#2 gets a bit more tricky because now we are talking about copyright law. Obviously if you merely duplicate all their work and host it on your site you will have to pay up when they sue. However, it seems clear that in US courts a transformative use, like creating a search engine, that only displays small snippets of the original work is in the clear. True meta-search engines that repackage the search results of a few search engines seem to be on more shaky ground. So while IANAL it seems to me that if Oodle had been indexing craigslist as one site among many they would have been able to (eventually) win a lawsuit.
--
#3 is where the real action is. While you can probably legally get away with being a dick to some websites, in practice a de facto standard of good manners for web crawling and indexing is very important. Not only do we need a generally accepted sense of what is fair before we can pass the right laws, but as a practical matter, if you don't comply with the de facto standards, you will suffer. Once there is an accepted standard of behavior, like robots.txt, companies that flagrantly disregard it will find the hosts they are trying to crawl entering into an arms race with their crawling software. Oodle may have been in the legal right, but even so, the practical difficulties of battling craigslist's web server team may have made it an unattractive prospect. On the other side of the table, if you don't play nice with the bots and let them index fairly, you may find your site delisted from Google and similar search engines.
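To make the robots.txt point concrete, here is a small sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser`. The helper name, the bot names, and the example rules are all invented for illustration; a real crawler would download the site's actual robots.txt rather than take it as a string.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits `user_agent`
    to fetch `url`, according to the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everyone stays out of /private/, and one named
# bot is shut out of the whole site.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""
```

With these made-up rules, an ordinary bot may fetch listing pages but not anything under /private/, while the bot named examplebot is refused everywhere. Nothing enforces this on the crawler's side, which is exactly why ignoring it tends to end in IP blocks instead.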
Unfortunately it is totally unclear what the right standard in this area should be. Most people agree that the search should normally send people back to the original page but when is it okay to cache? It wouldn't be cool to copy all the posts from craigslist and republish them on your own site (almost certainly illegal) but what about copying all the data displayed in
used to do the same thing, but they would stick Google Ads in between the actual scraped content so you were more inclined to accidentally click a Google Ad than the classified that you really wanted to see.
ABCFREE.COM seems to have lost their Google Ad account because of this, and then I guess it was not worth scraping Craigslist anymore, because the site has had a "down for maintenance" page up for quite some time now.
How many other sites had a business plan like this based on scraping Craigslist and sticking up Google Ads?
Oodle may be a good site, but it appears that many other sites decided to do the same thing around the same time span.
If I was running Craigslist I would wonder why the hell all these other sites were sucking our bandwidth and content and I would cut them off too.
I like microcars
I hope that didn't take you all afternoon.
The point is not whether you can scrape another company's content and republish it, or whether doing so hurts their feelings or even their business.
The point is, can you scrape a web page such as craigslist, where neither Craig Newmark nor any of his employees was the author or copyright holder of the content you are scraping, and do whatever you want with that? Can you scrape a web page in which the content is not even a "work" that is copyrightable, because it is a list of facts, and do whatever you want with that?
Take these posts for example. According to Slashdot, we own their content, not Slashdot. So, if I take all the comments and republish them in a book, who owns the copyright of the book? Can I stop the author from using my trolls? What if the author only gets permission from Taco and not from me?
The first thought that came to my mind was artists from clearly different genres of music collaborating on a song, such as Eminem and Elton John, or Nelly and Tim McGraw, or a producer of a mixtape sampling various older songs from different genres to make some sort of dance mix, or even an entirely new song.
Those would be cool to see more of.
You don't have to sign *anything* when you visit a website.
The terms of use are basically if the site responds to your HTTP GET. It can always deny the request (ok, not sure how net neutrality would stop servers from "discriminating" between normal end users and mashup sites).
Is it just me, or are job boards the worst offenders of "please don't use our content"? This has happened recently in Australasia with the takedown of jobby.co.nz and the legal threats of seek.com.au to myspider.com.au (blogged about at http://www.engageonline.co.nz/blog/?p=84).
What really miffs me, is how the job boards can say they "own" the content, when actually, it's been posted by other people on these sites and is really their content.