Copyright Tool Scans Web For Violations

Posted by Zonk on Tuesday December 19, 2006 @04:21AM from the he-knows-when-you've-been-bad-or-good dept.

The Wall Street Journal is reporting on a tech start-up that proposes to offer the ultimate in assurance for content owners. Attributor Corporation is going to offer clients the ability to scan the web for their own intellectual property. The article touches on previous use of techniques like DRM and in-house staff searches, and the limited usefulness of both. They specifically cite the pending legal actions against companies like YouTube, and wonder about what their attitude will be towards initiatives like this. From the article: "Attributor analyzes the content of clients, who could range from individuals to big media companies, using a technique known as 'digital fingerprinting,' which determines unique and identifying characteristics of content. It uses these digital fingerprints to search its index of the Web for the content. The company claims to be able to spot a customer's content based on the appearance of as little as a few sentences of text or a few seconds of audio or video. It will provide customers with alerts and a dashboard of identified uses of their content on the Web and the context in which it is used. The content owners can then try to negotiate revenue from whoever is using it or request that it be taken down. In some cases, they may decide the content is being used fairly or to acceptable promotional ends. Attributor plans to help automate the interaction between content owners and those using their content on the Web, though it declines to specify how."
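The article does not say how Attributor's "digital fingerprinting" works, so the following is only an illustrative sketch of one well-known way to match text by "a few sentences": break each document into overlapping runs of words (shingles) and measure how many of the source's shingles show up on a page. The function names and the shingle length `k=5` are assumptions made for this example, not anything the company has described.

```python
import re


def shingles(text, k=5):
    """Return the set of overlapping k-word sequences in text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def overlap(source, page, k=5):
    """Fraction of the source's shingles that also appear in the page."""
    source_set, page_set = shingles(source, k), shingles(page, k)
    if not source_set:
        return 0.0  # source shorter than k words: nothing to compare
    return len(source_set & page_set) / len(source_set)


original = "It uses these digital fingerprints to search its index of the Web for the content."
quoting_page = "Some blog intro. " + original + " And an outro."
unrelated_page = "The quick brown fox jumps over the lazy dog near the river bank."

print(overlap(original, quoting_page))
print(overlap(original, unrelated_page))
```

A matcher like this flags a verbatim quotation just as readily as a wholesale copy, which is why a score alone says nothing about whether a use is fair.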

35 of 185 comments

Wager by Baricom · 2006-12-19 04:33 · Score: 3, Insightful

Anybody care to place a friendly wager that they're not going to honor robots.txt?
1. Re:Wager by Crudely_Indecent · 2006-12-19 05:35 · Score: 2, Informative
  
  Another company "Cyveillance" already does this for major corporations and the government. I've used htaccess rules to disallow all from their assigned netblocks after they racked up almost 20,000 hits to my personal site in one day. As you mentioned, they didn't follow robots.txt and attempted to index parts of my site that are password protected as well as content names that did not exist (music and videos and such), all the while identifying their bot as a variant of IE.
  
  Here's how to block two subnets using htaccess and mod_rewrite on apache:
  RewriteEngine On
  RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR]
  RewriteCond %{REMOTE_ADDR} "^63\.146\.13\.(6[4-9]|[7-8][0-9]|9[0-5])$"
  RewriteRule ^(.*)$ - [F]
  Line 1 activates the rewrite engine
  Line 2 sets the condition to include remote addresses 63.148.99.224-255 and includes [OR] to allow further processing
  Line 3 sets the condition to include remote addresses 63.146.13.64-95
  Line 4 sets the rule that any url be forbidden
  
  So, save your bandwidth by denying access to your content from unauthorized viewers (bots)
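As a rough cross-check of the ranges described above, the same two blocks can be written in CIDR notation and tested with Python's standard `ipaddress` module. This is only a sketch: the /27 prefixes are derived from the address ranges given in the line-by-line description, and the function name `is_blocked` is made up for illustration.

```python
import ipaddress

# The two ranges from the description, as CIDR blocks (each /27 covers 32 addresses).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("63.148.99.224/27"),  # 63.148.99.224-255
    ipaddress.ip_network("63.146.13.64/27"),   # 63.146.13.64-95
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)


for candidate in ("63.148.99.224", "63.146.13.95", "63.146.13.96"):
    print(candidate, is_blocked(candidate))
```

Checking boundary addresses this way is a cheap sanity test before trusting a hand-written regular expression with the same job.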
  
  --
  
  "Lame" - Galaxar
2. Re:Wager by BrynM · 2006-12-19 16:50 · Score: 2, Informative
  
  There's an easier way. You can hand mod_access netblocks and more. This method will avoid eating cycles with mod_rewrite. If you can put it in your conf instead of .htaccess, you'll save even more time/processing. Just put it in for your doc root. From my httpd.conf:
  <Directory "/var/www/htdocs/">
  # BRYN'S DENIALS
  # allresearch.com
  deny from 209.73.228.160/28
  # branddimensions.com user-agent: BDFetch
  deny from 204.92.59.0/24
  # cyveillance.com
  deny from 63.148.99.224/27
  deny from 65.118.41.192/27
  # www.markwatch.com user-agent: markwatch
  deny from 204.62.224.0/22
  deny from 204.62.228.0/23
  deny from 206.190.160.0/19
  # nameprotect.com user-agent: NPBot
  deny from 12.40.85.0/24
  deny from 12.148.196.128/25
  deny from 12.148.209.192/26
  deny from 12.175.0.32/28
  # rocketinfo.com
  deny from 209.167.132.224/28
  # END BRYN'S DENIALS
  </Directory>
  Now I gotta look up IPs for these clowns... damn copyright ambulance chasers... arin.net here I come!
  
  --
  US Democracy:The best person for the job (among These pre-selected choices...)
Can't they just use google or torrent sites? by LiquidCoooled · 2006-12-19 04:33 · Score: 3, Informative

Can't they just use google or torrent sites?
If users can find items they want, presumably the copyright holders could use the same methods...

--
liqbase :: faster than paper
1. Re:Can't they just use google or torrent sites? by owlnation · 2006-12-19 05:07 · Score: 2, Funny
  
  And the opposite situation shows why this tool is a waste of time.
  
  Imagine a tool that could reliably return accurate search results for images and video. Does this exist yet? No. As one who searches the web daily for pics and video for my own sordid uses, let me assure you that it most certainly does not yet exist.
  
  And what an horrific waste to have such a tool - if it works - for policing content for copyright violations. Bearing in mind also that such "violations" are no such thing in some countries, regardless of the imperial arrogance of media companies.
  
  As always, and tell your family and friends, only buy music directly from the artist or secondhand. It's the only way to win.
buh by lucky130 · 2006-12-19 04:36 · Score: 5, Insightful

"as little as a few sentences of text or a few seconds of audio or video"

Like quotations in a paper, or video snippets in an educational presentation?
1. Re:buh by NeutronCowboy · 2006-12-19 05:38 · Score: 4, Insightful
  
  You're assuming anyone is going to manually verify any of the results. From my experience with people using monitoring software (especially non-techies who are simply consumers of the technology, but who provided the money for it), the vast majority of them are simply going to call their lawyers when they see the dashboard light up. I see vast letter writing campaigns come from this, with little actual infringing being prosecuted.
  
  This is a scary product. Not so much because of the technology behind it, but because of how it is going to be implemented and (ab)used.
  
  --
  Those who can, do. Those who can't, sue.
Fighting an avalanche with a snow shovel by TheWoozle · 2006-12-19 04:42 · Score: 4, Insightful

Doesn't this merely serve to point out the absurdity of "Intellectual Property"?

--
Insisting on "correct" English is like saying that there is only one, definitive recipe for chili.
Raise. by Tackhead · 2006-12-19 04:44 · Score: 3, Funny

> Anybody care to place a friendly wager that they're not going to honor robots.txt?
127.0.0.1: $ cat robots.txt
# robots.txt for 127.0.0.1
# This file is copyright 2006 by me.
User-agent: AttributorCorporationDMCABot
Disallow: *
And if they do honor robots.txt, I'll be able to sue the fuckers for infringing on my copyright, because they must have read it in order to honor it.
1. Re:Raise. by rhartness · 2006-12-19 05:00 · Score: 2, Insightful
  
  You know, I've actually had a thought along those lines in trying to explain to untechnologically savvy individuals why Digital Rights laws are screwed up and that handling digital content on the web is a grey area. Consider the following.
  
  Most web sites have a copyright statement on them somewhere (even this one!). Technically speaking, if I go to that web site, my browser copies the page along with all its media content and caches it. Since many of those sites do not have terms of service posted allowing the viewing of the content through regular web browsing, my computer is therefore violating copyright laws, right?
  
  Every single web user out there is breaking the law!
2. Re:Raise. by Mayhem178 · 2006-12-19 05:12 · Score: 5, Funny
  
  127.0.0.1: $ cat robots.txt
  # robots.txt for 127.0.0.1
  # This file is copyright 2006 by me.
  User-agent: AttributorCorporationDMCABot
  Disallow: *
  
  Hahaha! You screwed up! I have your IP address now! I will send 127.0.0.1 to every company that uses the sniffer and tell them the person at that IP is an evil, evil person who exploits innocent people for their own profit and power!
  
  --
  "You will pay for your lack of vision..." - Emperor Palpatine to Ray Charles
3. Re:Raise. by FooAtWFU · 2006-12-19 05:32 · Score: 3, Interesting
  
  You joke, of course, but there are tools out there to detect when a bot is abusing your site and not following robots.txt. The usual technique is to hide a few links in your page, and also have these links blocked by robots.txt. When a user visits the link, they're banned from viewing the site. (Sometimes, a CAPTCHA-like utility for unblocking yourself is presented along with the 403 page, in the event that a particularly curious user manages to find the link and activate it manually.)
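The hidden-link trap can be sketched in a few lines. This is a toy in-memory version, not any particular tool: the class name `TrapLinkGuard` and the path `/trap/do-not-follow` are invented for the example, and a real setup would also list the trap path under `Disallow:` in robots.txt and offer some way to get unbanned.

```python
class TrapLinkGuard:
    """Ban any client that requests a path robots.txt told it to avoid."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)  # hidden links, disallowed in robots.txt
        self.banned = set()                # client IPs that followed a trap link

    def check(self, client_ip, path):
        """Return the HTTP status to serve: 403 for banned clients, else 200."""
        if path in self.trap_paths:
            self.banned.add(client_ip)
        return 403 if client_ip in self.banned else 200


guard = TrapLinkGuard(["/trap/do-not-follow"])
print(guard.check("192.0.2.10", "/index.html"))           # polite visitor
print(guard.check("192.0.2.66", "/trap/do-not-follow"))   # bot ignoring robots.txt
print(guard.check("192.0.2.66", "/index.html"))           # same bot, now banned
```

Banning by single IP address, as here, is the cautious choice; it does nothing against a crawler that spreads its requests over many addresses.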
  
  --
  The World Wide Web is dying. Soon, we shall have only the Internet.
4. Re:Raise. by Kamiza+Ikioi · 2006-12-19 10:56 · Score: 2, Insightful
  
  True, but there's a way around that as well. Any robot service worth its weight in fiber has more than one IP, and can have multiple subnets. Best way is to dump robots.txt links to a separate subnet, have it check later in the day. If the IP gets banned, it can check by trying to access the main page, see if it starts getting errors. It can then mark "booby-trap" sites on a list, and route around either the specific triggers or actually honor the robots.txt.
  
  You have to have more links than they have IPs to stop a full scan. Of course, if even one link bans, they can just pay a guy to sit on a few major ISP provider accounts and manually check your robot links. Then they don't care if you ban, because you'll have to ban entire regions of the world as they bounce around with multiple dynamic IPs. If you have this as automated subnet banning, you'd actually help them out by allowing them to set your bans across major ISPs... especially if you have any content they deem questionable, you just gave them a way to shut you down remotely.
  
  Of course, the rule here is to never ever automate subnet bans on a public access site... but then you still can't stop them either way.
  
  --
  I8-D
Yeah by Hijacked+Public · 2006-12-19 04:45 · Score: 3, Interesting

FTFA:
If it works, it's a fantastic invention

Its purpose aside, yes, it would be a fantastic thing to be able to scan the entire web and reliably identify the context and content of any specific media file type. Video, audio, image, etc. Particularly if it could identify purposely obfuscated content.
I'm in what is almost certainly a tiny minority of Slashdotters in that I actually create copyrightable material rather than only consume it. I'm again in the minority in that I think copyrights are a good thing and again in the minority in that I can separate out the purpose of copyrights and the evil actions of the legal arms of **AA companies.
Regardless, while scanning the internet for improperly used material sounds great on paper, this will probably end up being as effective as finding water with a divining rod. The current tactic of locking down things at the hardware and OS levels will get more support from the media companies, not that they seem all that good at choosing tactics when the internet is involved.

--
"Sacrifice for the good of The State" - The State
1. Re:Yeah by jedidiah · 2006-12-19 04:59 · Score: 3, Insightful
  
  There's a wide gulf between copyright being a good idea in concept and being sensibly implemented in its current form.
  
  Not everyone that creates content thinks that draconian enforcement attempts are a good idea, or even in the best interests of those that create content.
  
  If your work can't survive in the marketplace, which includes the prospect of everyone on the planet getting to use it for free, then perhaps you should get some sort of more conventional day job.
  
  The difference between a game that sells 50K and one that sells 5 Million has nothing to do with DRM.
  
  --
  A Pirate and a Puritan look the same on a balance sheet.
2. Re:Yeah by AdamKG · 2006-12-19 05:17 · Score: 4, Interesting
  
  and again in the minority in that I can separate out the purpose of copyrights and the evil actions of the legal arms of **AA companies.
  Let's make one thing clear: the RIAA/MPAA lawsuits are not, in any way, shape, or form, an abuse, negative side of, misapplication or malicious use of Copyrights. They fulfill the role of Copyrights in the first place; they are the logical end result of a system that says citizens are allowed to distribute ideas (or expressions of ideas), then stop any further distribution of them.
  
  The **AA lawsuits are ridiculous, yes. But the ridiculous part is not the litigation itself, it's the laws under which the lawsuits are brought.
  
  --
  groupthink: It's good for self-esteem.
3. Re:Yeah by kanweg · 2006-12-19 05:26 · Score: 3, Interesting
  
  I'm a patent attorney and no stranger to IP. Having said that, any IP law is, or at least should be, a balance between, on the one hand, freedom to operate (both for IP users and for IP creators) and, on the other hand, a means of compensation for IP creators. For patents on software, that balance is not there, but at least patents last for 20 years max. For copyright, that balance is not there either. And I'm curious to hear whether you think it is a good thing that whatever you create is still under copyright more than 40 years after you die.
  
  Bert
Software is in beta by Weaselmancer · 2006-12-19 04:46 · Score: 2, Funny

Attributor plans to help automate the interaction between content owners and those using their content on the Web, though it declines to specify how.

And apparently being written by underpants gnomes.

--
Weaselmancer
rediculous.
Some interesting questions... by PingSpike · 2006-12-19 04:46 · Score: 4, Insightful

Great, now all the torrent sites will require captcha verification too! ;P

Actually, can they even scan torrents without downloading the entire file? And what's to stop everyone from just blocking them from accessing their websites? Are they going to go in covertly, pretending to be actual users? I can see every legit website blocking their access as well; why pay for bandwidth to supply that?

Sure, youtube can be more efficiently attacked...but youtube has been dancing in front of the cannons since its inception, we all knew it was going to get shot eventually.
search by hash? by straponego · 2006-12-19 04:47 · Score: 3, Interesting

Does Google allow searching by md5sum or equivalent? I'm sure they have the capability. While not as impressive as what this company claims, it'd also be more reliable for unaltered media files.
But it looks like the real "innovation" these guys are pushing toward is fully automated filing of lawsuits. I think that was in Accelerando, which is fantastic, and which you can download for free.
1. Re:search by hash? by Johann+Lau · 2006-12-19 05:45 · Score: 4, Informative
  
  "Unaltered media files" are the exception, not the rule. Changing even a bit of metadata (stripping exif from an image, changing an mp3 tag) would change the checksum, not to mention things like putting files into an archive, resizing images, or (re)compressing music.
  
  But yeah, it might make sense for Google to become "aware" of unique content and variations of it.. but I doubt they'd ever use that openly for (aiding in) hunting down copyright infringement, simply for PR reasons.
2. Re:search by hash? by stivi · 2006-12-19 06:26 · Score: 2, Interesting
  
  Hm, what about computing a checksum of the actual media contents? For example, compute the checksum only for sound data in MP3 or image data in image files, and ignore all other data/metadata. Usually media files are containers for smaller objects or data streams... Resampled or modified contents would not be detected though.
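A minimal sketch of that idea for MP3 files, assuming the only metadata present is an ID3v2 tag at the start and/or an ID3v1 tag at the end: skip both and hash what is left. The helper names are made up for this example, and the sketch ignores the optional ID3v2.4 footer and other tag formats such as APE.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    # ID3v2: 10-byte header "ID3" + version(2) + flags(1) + size(4 syncsafe bytes).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        data = data[10 + size:]
    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data


def audio_checksum(data: bytes) -> str:
    """MD5 of the audio payload only, ignoring ID3 metadata."""
    return hashlib.md5(strip_id3(data)).hexdigest()


def fake_id3v2(body: bytes) -> bytes:
    """Build a tiny ID3v2.3-style tag for demonstration (body under 128 bytes)."""
    return b"ID3\x03\x00\x00" + bytes([0, 0, 0, len(body)]) + body


payload = b"\xff\xfb" + b"audio-frames" * 10           # stand-in for MPEG frames
copy_a = fake_id3v2(b"title=A") + payload
copy_b = fake_id3v2(b"title=Bee") + payload + b"TAG" + b"\x00" * 125

print(hashlib.md5(copy_a).hexdigest() == hashlib.md5(copy_b).hexdigest())
print(audio_checksum(copy_a) == audio_checksum(copy_b))
```

As the comment says, this only survives retagging; any re-encode or resample changes the payload bytes and therefore the checksum.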
  
  --
  First they ignore you, then they laugh at you, then they fight you, then you win.
Re:Dupe by AKAImBatman · 2006-12-19 04:49 · Score: 2, Interesting

Pretty sure this is a dupe, or so closely related to an earlier story as to not matter.

It's not a dupe. (Unless you count anything that appears on Digg first to be a dupe.) However, it's also not the first story of its kind. About a gazillion companies have formed with the exact same business plan (save for the "hotness" at the time being digital music) and about a gazillion of those companies have failed to develop software that catches anything but the most obvious infractions.

Every so often, some RIAA/MPAA fair-haired boy manages to get funding for yet another attempt. He then fails miserably and the cycle repeats. You'd think the investors would learn. Unfortunately, they keep getting dazzled by the latest, buzzword-compliant technologies.

--
Javascript + Nintendo DSi = DSiCade
Re:Dupe by Maximum+Prophet · 2006-12-19 04:53 · Score: 3, Interesting
Since copyright lasts a long time and doesn't depend on being defended like trademark, there will be some allowances "for promotional reasons" like this:
1. Leak copyrighted material in an easy-to-copy format to places where it will be copied
2. Watch viral marketing campaign take over
3. Profit
4. Wait 'til revenue falls
5. Find infringers using new scan tools
6. Sue them
7. Profit more!!!
--
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Re:i don't like robots.txt anyway. by FooAtWFU · 2006-12-19 04:57 · Score: 5, Informative

You're absolutely right that "if you don't want it on the public Web, don't put it there in the first place" -- but there are still times when you have a legitimate reason that you don't want a page indexed, downloaded, or otherwise visited by a robot. Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website, and you'd prefer not to have every spider in the world hitting them up every few minutes.
Let's take a fun legitimate site like, oh... Wikipedia:
# Folks get annoyed when VfD discussions end up the number 1 google hit for
# their name. See bugzilla bug #4776
# en:
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia%3AArticles_for_deletion/
Disallow: /wiki/Wikipedia:Votes_for_deletion/
Disallow: /wiki/Wikipedia%3AVotes_for_deletion/
Disallow: /wiki/Wikipedia:Pages_for_deletion/
Disallow: /wiki/Wikipedia%3APages_for_deletion/
Disallow: /wiki/Wikipedia:Miscellany_for_deletion/
Disallow: /wiki/Wikipedia%3AMiscellany_for_deletion/
Disallow: /wiki/Wikipedia:Miscellaneous_deletion/
Disallow: /wiki/Wikipedia%3AMiscellaneous_deletion/
Disallow: /wiki/Wikipedia:Copyright_problems
Disallow: /wiki/Wikipedia%3ACopyright_problems
(They also disallow certain specially generated pages like Special:Random, and any of the pages which actually let you edit the site).
Let's see, what are some other sites? Ooh. Take a look at Slashdot's robots.txt! (disallows a variety of fun pages.) Microsoft's? How about whitehouse.gov? Google?
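For what it's worth, here is roughly what a well-behaved robot does with rules like these, sketched with Python's standard `urllib.robotparser`. The `User-agent: *` line and the bot name `ExampleBot` are assumptions added for the example (the excerpt above shows only the `Disallow` lines), and the rules are parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Two of the rules quoted above, under an assumed catch-all User-agent line.
RULES = """\
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia:Copyright_problems
"""


def allowed(path, user_agent="ExampleBot"):
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed("/wiki/Main_Page"))
print(allowed("/wiki/Wikipedia:Articles_for_deletion/Some_article"))
```

Nothing enforces any of this, of course: the check runs on the robot's side, so it only helps against crawlers that choose to run it.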

--
The World Wide Web is dying. Soon, we shall have only the Internet.
It's just a tool by 91degrees · 2006-12-19 04:58 · Score: 2, Insightful

As long as it respects basic internet rules of conduct (including respecting robots.txt), then this is ethically neutral.

It all depends on how it's used. Many companies would prefer to avoid copyright-infringing material, and will take it down if its existence is pointed out to them. Many companies will simply be asking others to remove material which clearly and flagrantly breaches their copyright. This is perfectly reasonable behaviour.
what's their probability of false alarm? by Anonymous Coward · 2006-12-19 05:05 · Score: 2, Insightful

This may be much less helpful than its promoters claim.

First of all, what's their probability of a false alarm? Even if false alarms are fairly infrequent, the vast amount of content on the Web means they could easily have a flood of false alarms, in addition to whatever actual copies are found. The user of the system is then going to have to have human beings sift through that flood to identify A) what's really a copy, B) whether that copy is infringing or not, and C) if so, whether it's worth taking action against the infringer.
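To put rough numbers on that, here is a back-of-the-envelope sketch. Every figure in it is hypothetical (a billion pages scanned, a thousand real copies, a one-in-ten-thousand false alarm rate, perfect detection); none of it comes from the article or from Attributor.

```python
def expected_alerts(pages_scanned, true_copies, false_alarm_rate, detection_rate=1.0):
    """Return (false_alarms, true_hits, precision) for a scan.

    precision is the fraction of all alerts that are real copies.
    """
    false_alarms = (pages_scanned - true_copies) * false_alarm_rate
    true_hits = true_copies * detection_rate
    total = true_hits + false_alarms
    precision = true_hits / total if total else 0.0
    return false_alarms, true_hits, precision


# Hypothetical scenario: all numbers are made up for illustration.
false_alarms, true_hits, precision = expected_alerts(
    pages_scanned=1_000_000_000,
    true_copies=1_000,
    false_alarm_rate=1e-4,
)
print(f"false alarms: {false_alarms:,.0f}")
print(f"true hits:    {true_hits:,.0f}")
print(f"precision:    {precision:.2%}")
```

Under those assumptions the dashboard shows roughly a hundred false alarms for every real copy, so fewer than 1 in 100 alerts would point at an actual copy, before anyone has even asked whether the copy is infringing.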

The above may be more trouble/expense than it's worth in many cases.

Not that the RIAA always bothers to verify actual infringement has taken place before suing, but some organizations may be a little more ethical, or at least a little less trigger-happy.
If you value your "property" so much... by Anonymous Coward · 2006-12-19 05:06 · Score: 2, Insightful

...then do not put it on the Internet.
In fact, burn it to a DVD, lock it up in a safe, and never talk about it. That way nobody else will ever have access to your "intellectual property".
Re:Negotiate Monitization? by FireFury03 · 2006-12-19 05:09 · Score: 2, Funny

If the industry had their way, rap music would have never happened

I don't understand... your post seems to imply this is a Bad Thing?

--
http://blog.nexusuk.org
A real use on /. by EmbeddedJanitor · 2006-12-19 05:53 · Score: 2, Funny

The editors could run this tool just on /. to check for dupes!

--
Engineering is the art of compromise.
Re:i don't like robots.txt anyway. by mandelbr0t · 2006-12-19 06:32 · Score: 5, Informative
Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website
And dynamic content is, of course, the answer. If I'm going to put up copyrighted content in the future, I'd use one of a dozen schemes that regenerate the download link on a per-session basis. Obviously they're not going to honour robots.txt, but why are your links readable by such a basic spider? You need to:
1. Disallow anonymous downloads. You need to be logged onto the site to download anything, torrent or otherwise
2. Use a CAPTCHA to prevent spiders from signing up for said accounts
3. Use the session id to generate unique download links on a per-session basis
4. Change the key on your BitTorrent tracker every 12-24 hours. This will require that a downloader get the latest torrent from the original website (which requires login), reducing the impact of a leaked torrent
5. Compress and possibly encrypt the content so that it's less obvious what it is
Anyone who follows the above steps (and most sites already do most or all of this) won't be found by the spider. Period.
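Step 3 can be sketched with an HMAC over the session id and file id, so that a link copied out of one session is useless in another. This is an illustration under assumptions, not a description of how any particular site does it: the key handling, the `/download/...` URL layout, and all function names are invented for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real site would persist and rotate this.
SERVER_KEY = secrets.token_bytes(32)


def make_download_token(session_id: str, file_id: str) -> str:
    """Derive a token that is only valid for this session and this file."""
    message = f"{session_id}:{file_id}".encode()
    return hmac.new(SERVER_KEY, message, hashlib.sha256).hexdigest()


def make_download_link(session_id: str, file_id: str) -> str:
    """Build the per-session link shown to a logged-in user."""
    return f"/download/{file_id}?token={make_download_token(session_id, file_id)}"


def is_valid(session_id: str, file_id: str, token: str) -> bool:
    """Check a presented token against the current session, in constant time."""
    expected = make_download_token(session_id, file_id)
    return hmac.compare_digest(expected, token)


token = make_download_token("session-alice", "file-42")
print(make_download_link("session-alice", "file-42"))
print(is_valid("session-alice", "file-42", token))   # same session
print(is_valid("session-spider", "file-42", token))  # link reused elsewhere
```

Rotating `SERVER_KEY` on a schedule would invalidate every outstanding link at once, which is the same idea as changing the tracker key in step 4.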

The only thing I can think of that this product would be useful for is to find people who have blatantly copied my website, but I'm sure you could find those people equally easily with Google.

mandelbr0t
--
"Please describe the scientific nature of the 'whammy'" - Agent Scully
His IP is my IP too by Anonymous Coward · 2006-12-19 07:02 · Score: 2, Funny

and whenever I go out, the FBI begins to shout Title 17 U.S.C...
What concerns me: by botlrokit · 2006-12-19 07:28 · Score: 2, Interesting

I'm bothered by this type of scenario:

"Dear [webmaster]:

It has come to our attention that your website, [sh*touttaluck.com], does not meet compliance in terms of a variety of copyright laws of the United States and other countries. Infractions indicated by our software include, but are not limited to:

Images created with an unregistered copy of Adobe Photoshop
Flash files created with an unregistered copy of Macromedia Studio MX 2004
PDFs created with an unregistered copy of Adobe Acrobat Professional
Content and structure created with an unregistered copy of Macromedia Studio MX 2004
Content and structure created with an unregistered copy of Microsoft Office Frontpage 2003
Images created with an unregistered copy of . . . "

...starting to see where I'm going with this? I understand they're likely talking about copyrighted content such as prior art images or mp3 files, or maybe even damaging company secrets that are leaked by a whistleblower and then redistributed with the intent of airing dirty laundry, but I'm thinking about the structure of a page itself. A person, group, or company that commissions a webpage from a web design studio would now have to ensure that the studio itself is in compliance, or that the products it uses to create the pages are legal. That's where I get all nervous.
1. Re:What concerns me: by PPH · 2006-12-19 09:00 · Score: 2, Funny
  
  ...html created with an unregistered copy of vi.
  
  --
  Have gnu, will travel.
I've experienced it from both sides. by bcrowell · 2006-12-19 09:20 · Score: 2, Informative

I've experienced this from both sides.

I have a bunch of my books on the web, and every once in a while I do a search on some text from my own books to see who else is mirroring them. The books happen to be copylefted (dual-licensed GFDL/CC-BY-SA), but I'd like to know who's mirroring them, and check whether they're violating the license. A lot of people just seem to be hoarding the PDF files on their university servers, maybe because they're afraid my web site will disappear; that's flattering. One guy was selling them on CDs on eBay, violating my license (claimed they were PD, didn't propagate the license). Another guy translated them to html, with lots of errors, changed the license to a more restrictive one, and put his own ads up; he fixed the licensing violation when I complained, and in a way it was a good thing, because it motivated me to make my own html versions (which are now bringing me a significant amount of money from adsense every month). One kind of annoying thing about mirroring is that the people who are mirroring never bother to update their mirrors, but in general I just figure there's no such thing as bad publicity :-)

From the other side, I once received an e-mail from a museum in the UK that was complaining that I was using a 17th century oil painting of Isaac Newton. I guess they own the original, and they may also have been the ones who did the scan that I found in a google image search, but under U.S. law (Bridgeman Art Library, Ltd. v. Corel Corp.), a realistic reproduction of a PD two-dimensional art work is not copyrightable. What really surprised me was that they came across it at all, because at that time I think my book was only in PDF format, and hadn't been indexed by google because the file size was too big.

The whole thing doesn't seem negative to me in general. It makes just as much sense as people doing a vanity search in Google before they apply for a job, or authors watching their amazon.com sales rankings obsessively. I guess the most obvious potential for abuse would be if they send a nastygram to your webhost, and your webhost is a low-end one that figures it's not worth their time to keep your account, so they just shut off your account.

--
Find free books.