Web Scanning Technology for Copyright Violations
eldavojohn writes "I've heard a lot of talk about software being used to detect pirated media anywhere on the web, but haven't seen a lot of details. PhysOrg has a good article on one of the tools out there. Automatic Copyright Infringement Detection (ACID) boasts a patented technology dubbed 'meaning-based computing' that is reportedly capable of finding relationships among 1,000 different types of files. The important thing is that this is not tagging-based searching. 'Autonomy's search technology uses automatic hyperlinking and link clustering that the company claims isn't built into keyword search engines. According to the company, this technology allows computers to perform searches with greater context, so it finds a wider range of related documents or research citations than is possible from keyword searches.' For more details on how this magic works, check out Autonomy's patent and the many patents by its subdivision, Virage."
And how well does this work if people encrypt their files and send the keys separately?
This technology sounds like it's stuck behind the buzzword "meaning-based media," which seems to just be an abstract notion of finding and sorting media without profiling, hashing, fingerprinting, tagging, watermarking, sourcing, or naming (in other words, by going on bullshit notions and intuition: "Oh, it looks copyrighted.").
More importantly, it looks like it can't do anything unless the target is somewhere on the Web and is reasonably active. The darknets and private trackers are still safe.
~ C.
Sure, they have a patent, and if they actually implement what's in the patent it's meaningful to look at... but more often than not, the patent is much broader than the actual application, or the patent isn't even being used.
If I looked at patents to determine what a business was capable of, I would be driving a car that gets hundreds of miles to the gallon!
Sometimes the best solution is to stop wasting time looking for an easy solution.
I find it ironic how stuff like this ends up being among the more practical applications for AI. I mean, science fiction is usually about robots taking over. Instead, we end up with an internet full of bots trying to sell Viagra, bots trying to block Viagra, bots trying to break captchas, bots trying to detect copyright infringement, p2p systems to ensure privacy, and so on.
I don't think this sort of searching for pirated content is going to be terribly effective, though. I mean, it might be able to catch the blatant stuff like youtube, but ultimately, they're never going to kill p2p, especially once private trackers become more common.
Wouldn't it be nice if we could consider factors of weight, proportion and intent with a more case-by-case methodology? All this nail-everything-down attitude is tiresome. I miss the 80s and early 90s live and let live freedom. People need to grow a sense of humor and realize what the internet is; THE global corral of info - get on...
All those buzzwords. Apparently somebody has a system that can characterize and match images and video. That's reasonable enough, it's been done before, and the question is how good the new one is. The article gives zero help in that direction.
From the same source: "Nanogenerator provides continuous power by harvesting energy from the environment". It's a variation on the piezoelectric generator concept, like a piezo fire starter.
Sure, but did they Trademark the Patent that looks for Copyright by IP Freely? (apologies to Bart Simpson).
Not to complain about the article too much, but is there anyone out there who didn't find it completely contradictory and useless?
As far as I can tell, the article starts off by saying that they have a wonderful system to inspect and compare the video content of a clip against a HUGE database (eg. tens of thousands of hours of copyrighted movies, TV series, music). And, that they know how to read _any_ media format (eg. an AVI using some particular codec embedded into a Word document which is zipped....) The suggestion is that the software could "read" a Youtube video clip, and recognize that it contains a few minutes of a Jay Leno monologue. Needless to say, they don't explain how they might possibly do this - because, as far as I can tell, they can't. Not even close.
If you look at the patents, they're pretty much all about text or metadata searching. For example, they seem to have found an innovative way to find keywords to categorize a document....by scanning for words in the document! Or of categorizing a video file...by looking at metadata (eg. comments) embedded in the file. The only amazing thing about these algorithms is that some dimbulb in the patent office decided to give them a 20 year monopoly on something people have been doing for decades.
Did their software detect the patent that it is infringing upon? Bastards!
Back in the early days of cars, most folks thought the red flag act was entirely justified.
Sorry, but we've hit a new age of abundance. With the overwhelming percentage of internet users using LimeWire, BitTorrent etc, attempts to sustain a manufactured scarcity in the face of this abundance will similarly fade away into obsolescence.
The copyright enforcement versus piracy arms race will make for interesting history courses in future decades. I can see the courses now - "The Rise And Fall Of Intellectual Property".
I'm looking forward to blowing my grandkids' minds when I tell them about the era when information wasn't free.
-- In the beginning was the WORD, and the WORD was UNSIGNED, and the main(){} was without form and void...
Hopefully, that means no one will be foolish enough to pay to use it.
When I moderate, I only use "-1, Overrated". That way, I never get meta-moderated!
Copyright is kinda like a cryptosystem, where the copyrighted piece of content is the key. Only one person can hold copyright to a given key, and the key makes the keyspace around it derivative work space, where new creations need the permission from the copyright holders of each derivative space that the new key belongs to.
A copy key gives the copyright holder the privilege to deny anyone else the use of that key and the derivative space around it. Derivative works include works that include a copyrighted work in them. In mathematical terms, a derivative work is a work that can be said to have the copyrighted work as one of its elements. Hence there exists a minimum set of copyrightable content that gives full coverage over the derivative space, i.e. every work not in the minimum set is either in the public domain, not copyrightable, or a derivative work of a minimum set element.
You can copyright digital information. The set of digital information is enumerable, so the minimum set for digital information is also enumerable. Since compressed works are copyrightable, you can copyright the combination of a decompressor and compressed information. This makes it feasible to mount a minimum set attack against a mode of creation.
Write a decompressor that creates copyrightable works that belong to the minimum set of the mode of creation under attack. The uncompressed works must be unique for each number (compressed information) that is smaller than the size of the minimum set. Now, create a file with all numbers from zero to minimum set size - 1. Or even better, create a file with just the minimum set size and have another decompressor that creates the file with all numbers up to it. On completion of these two decompressors and the file with the minimum set size, every work of the mode of creation after that time is a derivative work of your minimum set.
Since you need to extract a certain work from the minimum set to prove copyright infringement, you also require a compressor for the mode of creation. The compressor works like a digitizer: you quantize the information in the infringing work so that it maps to the closest pattern in the minimum set. Then use that number to fetch the corresponding work from the copyrighted minimum set.
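For anyone who wants to poke at the idea, here is a toy sketch of the decompressor/compressor pair described above. It assumes a made-up "mode of creation" where every work is a three-symbol string over a two-letter alphabet; the names ALPHABET, LENGTH, decompress and compress are mine, purely for illustration, and nothing here says anything about how copyright law would treat such a scheme.

```python
# Toy "minimum set": every possible work of a tiny, hypothetical mode of creation.
ALPHABET = "ab"                          # assumed symbol set (illustration only)
LENGTH = 3                               # assumed fixed length of every work
MIN_SET_SIZE = len(ALPHABET) ** LENGTH   # number of distinct works

def decompress(n):
    """Map a number below MIN_SET_SIZE to a unique work (base-k digits)."""
    if not 0 <= n < MIN_SET_SIZE:
        raise ValueError("number outside the minimum set")
    symbols = []
    for _ in range(LENGTH):
        n, digit = divmod(n, len(ALPHABET))
        symbols.append(ALPHABET[digit])
    return "".join(reversed(symbols))

def compress(work):
    """Quantize a work to the closest pattern in the set and return its number."""
    # Truncate or pad so the work has exactly LENGTH symbols.
    padded = work[:LENGTH].ljust(LENGTH, ALPHABET[0])
    n = 0
    for ch in padded:
        # Snap each symbol to the nearest alphabet entry by character code.
        nearest = min(ALPHABET, key=lambda s: abs(ord(s) - ord(ch)))
        n = n * len(ALPHABET) + ALPHABET.index(nearest)
    return n

if __name__ == "__main__":
    for number in range(MIN_SET_SIZE):
        print(number, decompress(number))
    # A work outside the alphabet is quantized to its closest set member.
    print(decompress(compress("abc")))
```

The point of the sketch is only that the mapping is a bijection on the set: each number yields a distinct work, and compress() recovers the number of whichever set member is closest to the input.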
You may use this information to create your own minimum sets under the condition that you release each minimum set under the GPL.
Seems every week some company comes up with a way to detect copyright violations or terrorists or naughty pictures or some other buzzworthy topic that will get them paid suitcases full of money.
Until I see some sort of evidence that they can do it, I rank the claims along with those who claim that they can tell what people are thinking by where they scratch.
"Trademarks are the heraldry of the new feudalism."
None of their previous ventures into web spidering have worked very well. It's likely that all that will be needed to create a false negative in this case is a little name obfuscation, and there will be an unacceptable rate of false positives...
This kind of detection is difficult if not impossible. As others have posted, what if the copy is encrypted? Or what if it is altered to make it difficult to find even using complex image-processing algorithms? These algorithms may fail to detect it as a copy even if it has something like a 10% shift in hue or saturation. The same can happen with video: will this system detect it if I copy a video and change the color tones from full color to sepia?
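To make the sepia question concrete, here is a minimal sketch, assuming a toy 8x8 RGB image and a simple "average hash" over luminance. None of this comes from Autonomy or its patents; it only contrasts an exact byte hash, which any color change breaks, with a brightness-based fingerprint, which may survive a tone shift. The function names and the test image are made up for illustration.

```python
import hashlib

def luma(pixel):
    """Approximate perceived brightness of an (r, g, b) pixel."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def average_hash(image):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    values = [luma(p) for row in image for p in row]
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]

def sepia(image):
    """Apply a commonly used sepia color matrix, clamping to 255."""
    def tone(p):
        r, g, b = p
        return (
            min(255, int(0.393 * r + 0.769 * g + 0.189 * b)),
            min(255, int(0.349 * r + 0.686 * g + 0.168 * b)),
            min(255, int(0.272 * r + 0.534 * g + 0.131 * b)),
        )
    return [[tone(p) for p in row] for row in image]

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def exact_digest(image):
    """Cryptographic hash of the raw pixel bytes (changes on any edit)."""
    raw = bytes(c for row in image for p in row for c in p)
    return hashlib.sha256(raw).hexdigest()

def demo_image():
    """Hypothetical test image: dark left half, bright right half."""
    return [
        [(20 + y, 30 + y, 40 + y) if x < 4 else (200 + y, 180 + y, 160 + y)
         for x in range(8)]
        for y in range(8)
    ]

if __name__ == "__main__":
    original = demo_image()
    toned = sepia(original)
    print("exact hashes equal:", exact_digest(original) == exact_digest(toned))
    print("fingerprint bits changed:",
          hamming(average_hash(original), average_hash(toned)))
```

On a high-contrast image like this one the fingerprint is unaffected by the sepia pass, but that says nothing about crops, re-encodes, heavy edits, or encrypted files, where a scheme this simple would fail.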
- Yes, but does it run Lunix?
So how does it determine the direction in which the copying took place?
-
1,2: This is the standard TF-IDF method. TF means 'term frequency': you give each word a weight equal to its frequency in the document. IDF means 'inverse document frequency': if a word is rare across the collection, you give it more weight. Typically this is done with the logarithm, btw.
-
4,5,6: This is extremely general. But it sounds like any of a myriad of methods to generate 'higher-order-features'. For example, by using a nonlinear kernel function.
-
7&9: Sounds like a way to measure the importance of a feature. Many such methods are already in use, for example, mutual information (MI).
-
8: In other words, a 'stoplist'. Nice way to make it sound really complicated and useful, though.
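For reference, the TF-IDF weighting and stoplist mentioned above fit in a few lines. This is a generic textbook sketch, not the patent's method; the STOPLIST contents and the function names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical stoplist: words too common to be useful as features.
STOPLIST = {"the", "a", "of", "is", "and", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stoplisted words."""
    return [w for w in text.lower().split() if w not in STOPLIST]

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    TF  = count of the word in the document / document length
    IDF = log(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))      # count each word once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    for doc, w in zip(corpus, tf_idf(corpus)):
        print(doc, w)
```

Words that appear in only one document ("dog", "ran") get the highest weights, and "the" never shows up because the stoplist removes it before weighting.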
Skimming the rest of the patent, I don't see much substance. But I admit I didn't go through all of it. Perhaps someone else will have more patience.
...just like it doesn't catch you burning a CD and giving it to your friend physically. Or the Scouts singing "Happy Birthday."
However it may well do what it is designed to do, finding copyright infringement on the web. Autonomy are a serious company working on pattern recognition, not some fly-by-night cowboys. This copyright-finding thing would just be a side application of their core technology.
Not another bot sucking down my bandwidth at my expense! :-(
Publishers using this tool will presume that any copies it finds are infringing. But what happens when a work "created" and copyrighted in 2006 turns out to be "infringed" by something created in 2000? If the publisher's "original" copyrighted work turns out to be not so original after all, then things could get sticky. I wonder how many cases of plagiarism will be uncovered in which the publisher/copyright holder becomes the defendant.
Two wrongs don't make a right, but three lefts do.
Whenever I see the words "intelligence", "meaning", or "understanding" used to describe software, that's how I know it's a bunch of baloney.
It's called Google.
-mcgrew
It claims to follow hyperlinks. Does it obey robots.txt on the destination site? I sense possible legal disputes.
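Whether this particular crawler honors robots.txt is not stated anywhere in the article. For comparison, this is roughly what a well-behaved bot does before fetching a link, sketched with Python's standard urllib.robotparser; the agent name "ExampleCrawler", the rules, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if the given robots.txt text permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network
    return parser.can_fetch(agent, url)

# Placeholder rules a site owner might publish.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/clip.avi"):
        print(url, allowed(ROBOTS, "ExampleCrawler", url))
```

A real crawler would download the file from the destination host first; parsing from a string here just keeps the example offline.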
Err, so "meaning-based media" means it maintains "ACID semantics..."?
I guess you don't consider Google, IBM, Apple or Microsoft to be serious companies then.
Interesting tidbit though - this Autonomy patent is a US one, they wouldn't get a patent on this in their own home country of the UK, where software patents are (currently) not allowed.
from TFA:
"it can detect whether a portion of a copyrighted video or audio tract has been overlaid or stored as part of a new and original media file. "
This means if your feature-length documentary has a bit of perfectly legal ad footage in it, it will be flagged. Hopefully media distribution companies like YouTube won't rely on this technology exclusively, or there will be a lot of false positives.
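The false-positive worry is easy to reproduce with a toy partial-match detector. The sketch below assumes each clip has already been reduced to a list of per-frame fingerprints (plain integers here) and measures how much of a reference clip appears inside a longer file; the function names and numbers are invented for the example and are not taken from the product.

```python
def shingles(frames, k=3):
    """All runs of k consecutive frame fingerprints, as a set of tuples."""
    return {tuple(frames[i:i + k]) for i in range(len(frames) - k + 1)}

def containment(reference, candidate, k=3):
    """Fraction of the reference's k-frame runs that occur in the candidate."""
    ref = shingles(reference, k)
    if not ref:
        return 0.0
    return len(ref & shingles(candidate, k)) / len(ref)

if __name__ == "__main__":
    ad_clip = [7, 8, 9, 10, 11]                    # the copyrighted footage
    documentary = [1, 2, 3] + ad_clip + [4, 5, 6]  # embeds it, perhaps legally
    unrelated = [1, 2, 3, 4, 5, 6]
    print(containment(ad_clip, documentary))
    print(containment(ad_clip, unrelated))
```

The detector reports full containment for the documentary whether or not the footage was licensed, which is why a match on its own cannot establish infringement.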
My truck is like a series of tubes.
I used to work for Autonomy. They were a bunch of shits. Here's an article
they didn't like very much:
Life in the Autonomy sweatshop
Or:
Stress Is More Fun
Following a successful interview at Autonomy Headquarters in Cambridge
on March 24th, I was offered employment and agreed to start work on
May 22nd. Despite this being a huge upheaval involving a large outlay
of money (since no relocation fee was offered), I decided to make the
move from Woking to the Cambridge area.
At first, everything went well, and I was impressed by the company:
free lunches on Friday, TV during the 2006 England World Cup matches,
and even occasional social events (the Cambridge beer festival and
go-karting). But soon, the facade began to shatter and Autonomy
revealed itself as a company focussed on money and power. Visitors to
the company are insulated from this cut-throat attitude by the fun of
seeing red-bellied piranhas in the reception, and having board rooms
named after James Bond villains. The induction process consisted of
"Sign this contract, give us your bank details, here's your desk, here
are your co-workers, here is your staff handbook (about 12 pages), and
the sandwich van is up the road." About ten minutes in total: no
mention of quality, IT policy etc.
My first job was to devour the documentation concerning the company's
Virage product (a video archive and logging system) to write DLL
plug-ins to facilitate a company's video and audio analysis. This
involved reading through massive Software Development Kit documents,
and trying some of the sample plug-ins (bluescreen detection, for
instance). Everything was fine apart from the so-called IT Support: it
took days to get my email sorted out, and weeks to get my Windows XP
workstation activated.
Sat next to me was Pieter, a Dutch (?) developer, whose voice was so
quiet I couldn't discern it over the noise of the fans and general din
in the office. Whenever I asked him for help, I passed him the
keyboard and asked him to type in the relevant stuff. He seemed to be
doing work on voice recognition, on the SoftCell system.
My manager was Abigail Betley, of whom more later.
My first job was to write a plug-in that could detect and identify
logos, or on-screen idents shown during a TV programme's broadcast. I
quickly identified many technical papers, and a simple method that
would isolate a non-animated logo from the rest of the screen. I tried
out this method and it worked fine. At this point, Abigail (Abby) went
on holiday for about 5 days or so, leaving me to do some more work
(including coming up with Software Requirements, which to this day I
never received any feedback about). The only sore point was Abby's
colleague, Unai Ayo, a Spanish national, who phoned me up the day
before Abby was due back asking if "I had anything" and "whether logos
were correctly identified". Now, this was about a week and a half
AFTER I joined. Those of you who have done any research into Logo
Detection know that it is technically very complicated; indeed, one
scientific paper related how a Neural Net had been used (with about
89% accuracy) - and here I was 10 days after joining the company and
this major technical problem was expected to be sorted out.
Alarm bells started ringing here.
Anyway, Abby came back and I was temporarily told to park the Logo
Detector (so that it could be handed over to the Neurodynamics team
downstairs, who had experience in image identification - car number
plates etc.). I was starting to get concerned as it looked like "my"
My web domain.