Checksumming Webpages Patented
Just when you thought nothing else stupid could be patented, Wahfuz noted a story running about a company called Pumatech that has apparently patented storing a checksum of a webpage to determine whether it has been updated. I guess from now on everyone who wants to detect changes in web pages will need to store full copies of the pages in question, because I'm sure nobody thought of anything so complex as piping it through md5 and saving the output.
Ever heard the story of the CD-WOM? It was a device consisting of two blocks of ordinary wood and a cable connecting it to the user's PC. CD media was placed between the two blocks and data was written to the CD. The process was foolproof (I challenge you to prove to me that no data was written to write-only media!)
That's about how useful storing a checksum of a webpage would be without *doing* anything with the data. Sure, the checksum exists, but if you don't bother to do anything with it, the data is as worthless as a CD-WOM. Obviously, someone creating MD5 hashes of all their webpages would also build some sort of system around it to make use of those hashes!
- A.P.
--
Forget Napster. Why not really break the law?
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
I'm sure nobody thought of anything so complex as piping it through md5 and saving the output.
Yeah- this is one of those "Why didn't I think of that?" things- but I have yet to hear of a web cache or proxy that uses md5sums instead of last-modified headers- are there any out there? And if so, wouldn't that count as the all-important prior art?
Just because something seems simple once somebody else thought of it doesn't mean it wasn't a good idea in the first place.
If you read the press release, the patent isn't on storing checksums of HTML pages, but is for storing checksums of sections of a page between pre-identified HTML nodes.
Now, perhaps there is prior art for this, but it's a damn good idea and I sort of doubt it, because I've been around the block a few times and haven't seen ANY caching mechanisms that can determine if a page has changed based on a checksum calculated from just a portion of the page (presumably so things like today's date on a page don't affect the state of the cache).
That seems pretty damn innovative to me. I'm no big fan of software patents, but as software patents go, this is a lot more justifiable than most.
So flame away, but there is a lot of posturing going on here about prior art, and none of it seems to come close.
And, unfortunately, probably perfectly valid in the US where something as stupid as software patents can be "valid".
I quote:
a checksum generator, coupled to receive the fresh copy of the document from the periodic fetcher, for generating a fresh checksum of a portion of the fresh copy of the document and comparing the fresh checksum to the original checksum, the checksum generator signaling a detected change to the remote client when the fresh checksum does not match the original checksum,
Note the phrase "of a portion of the fresh copy of the document". Contrary to the inflammatory headlines, this patent does NOT cover blindly checksumming webpages, but rather strategically checksumming the critical part of a page, so the fluff doesn't affect the cache status.
I used to work at Pumatech. (Actually, I worked in the wireless web-browsing end of things, as an engineer.)
Anyways, we were checking our emails one day (this was about 6 months ago) and there's some big "congratulations" email - we got another patent!
A large portion of the company is based out of synchronization software. (Synchronize your PIM, Laptop, whatever) We'd just received a patent on a revolutionary new technique - time based syncing! Sync data, based on their TIME STAMPS!
We had a good laugh.
--
#include <malloc.h>
free(your.mind);
Actually, a number of technologies relevant to nuclear weapons were patented prior to and during the Manhattan Project. For some reason, Mr. Stalin failed to adhere to such Intellectual Property law as might have existed at that time. Now that I think about it, I can't imagine a notion more antithetical to the Communist Manifesto than intellectual "property".
Learn to spell: nickel, missile, lose, solely, amendment, speech, kernel, probably, ridiculous, deity, hierarchy, versus
If they're using a simple checksum, then someone should figure out how to fool it--add like a comment field to a webpage with the correct characters to make the checksum the same.
If they're using md5sums, well, I guess this won't work.
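To make the point above concrete, here is a minimal sketch. It assumes the "simple checksum" is nothing more than the sum of all bytes modulo 256; nothing in the thread says which checksum Pumatech actually uses. The helper name `forge_padding` is made up for illustration, and no attempt is made to keep the filler bytes HTML-safe.

```python
import hashlib

def additive_checksum(data: bytes) -> int:
    """Toy checksum (an assumption): sum of all byte values modulo 256."""
    return sum(data) % 256

def forge_padding(original: bytes, modified: bytes) -> bytes:
    """Build a comment-like trailer that, appended to `modified`,
    gives it the same additive checksum as `original`."""
    opener, closer = b"<!--", b"-->"
    target = additive_checksum(original)
    current = additive_checksum(modified + opener + closer)
    deficit = (target - current) % 256
    # Three printable bytes (32..126) can reach any residue modulo 256.
    total = deficit if deficit >= 96 else deficit + 256
    a = min(126, total - 64)
    b = min(126, total - a - 32)
    c = total - a - b
    return opener + bytes([a, b, c]) + closer

original = b"<html>old story</html>"
modified = b"<html>completely new story</html>"
forged = modified + forge_padding(original, modified)

print(additive_checksum(forged) == additive_checksum(original))
print(hashlib.md5(forged).digest() == hashlib.md5(original).digest())
```

The first comparison holds because the padding is chosen to cancel the difference in byte sums. The same trick does not carry over to MD5, which is the point of the second comparison.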
Just because something seems simple once somebody else thought of it doesn't mean it wasn't a good idea in the first place.
And just because they (allegedly) were the first to think of it, doesn't mean it's patentable.
Patents are supposed to be given only for things that aren't "obvious to anyone skilled in the art". In practice, this isn't assessed well by the patent office, but that's another can of worms.
I have thousands of MD5 sums stored from web pages and various files linked to web pages, along w/ many of the original files. I've been sucking such info off the net and using MD5 sums to verify that these files are unique for a couple of years at least. Never even considered the lame ass idea of patenting such a thing. Damn, maybe I should patent all my shell scripts. :)
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Likewise. Company I used to work for did something very similar, using a CRC calculated using the text of a web page to determine web page "identity". I would be surprised if the Lycos (or Altavista, or Webcrawler, or Hotbot...) spiders didn't do something very similar.
Which brings up an interesting question - if, by 1997, there were enough companies implementing this sort of "technology" already, then can't it be argued that the Pumatech patent is obviously invalid because at the time they applied for it, it was already in use by multiple companies... which seems to me to indicate that their "innovative" technology is "obvious to a practitioner skilled in the arts".
"Great men are not always wise: neither do the aged understand judgement." Job 32:9
You have neglected one significant cost. These *** patents make it much more difficult for a small company. A small company won't have cross-license agreements, won't have a large legal staff, won't get a "good-buddy" licensing price, and is generally operating on a shoe-string budget anyway.
So this is one of the factors that causes many new companies to fold. Think of it as a social control mechanism... and it is, whether intentional or not. Because of this, I tend to think of these "spurious" patents as a large evil. Not the biggest one, but not a small one either.
Caution: Now approaching the (technological) singularity.
I think we've pushed this "anyone can grow up to be president" thing too far.
Yes indeed. Text is so highly differentiated that if you know about doing something to the whole thing, doing something to a part of it is patentworthy. ????
You have an extremely low standard for what should be patentable. Considering the cost of defending against a patent, if trivialities are patentable, soon only the rich will be able to legally initiate any action. Is this a social good? Is it in compliance with the constitutional provisions enabling the patent law? (I don't remember the precise phrasing, sorry. It isn't "To promote the general welfare...", but that was the idea behind it.)
E.g.: There may be no prior art in the archives of the patent law covering eating using a metallic or otherwise rigid, or somewhat stiff, divided instrument to convey the nutritive material from a holding container to the grinding apparatus. Should this be patentable?
Caution: Now approaching the (technological) singularity.
I think we've pushed this "anyone can grow up to be president" thing too far.
Can they testify that they have been doing this since prior to Feb 18, 1999?
now we need to go OSS in diesel cars
check_www is a series of scripts and filters that I created under the GPL last year to automatically advise me when web pages change, popping up alert boxes and pre-loaded browsers as appropriate. It includes filters to remove unwanted constantly changing information and search for terms. It is available on http://olliver.family.gen.nz/check_www.tgz Ironically, I was alerted to this article by it. Vik :v)
Now I know you can go after the police for malicious prosecution, and I know people have sued to recover court costs before. Could something like that be used to go after companies that file obvious patents that have been in use for a long time?
Say you're an independent coder, and you create a way to check if a file is current using checksums, and you use it on your personal web site, never thinking about it. Years later a company patents exactly what you're doing.
A normal reaction might be to yell and scream about how you were already doing it and how the patent is worthless. What about if you instead copied their product, using their supposedly patented technology. Seeing that, they'd come after you for patent violations. You could then show you were using the algorithm for much longer than them. Then, after you won the case, you could sue them to recover the costs associated with defending the case.
I dunno, maybe some variation on this might work. It sure would be nice to be able to turn the screws on the screwers.
Disclaimer: I am not a lawyer licensed in your jurisdiction or in any other jurisdiction. I'm not a lawyer at all, and I'm probably not even in your country. If I were in your jurisdiction and were a lawyer I'd probably not want to give out free legal advice anyhow... but who knows what I'd do, cuz I'd probably be pretty depressed at being a lawyer.
rsync does a block by block checksum of a file, then searches another file for matching blocks, thus making it a generalisation of this idea to /any/ file. It's been around for a /long/ time - the mailing list archives go back to 1991.
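For anyone who hasn't looked at how that works, here is a much-simplified sketch of the block-matching idea. It is not rsync's actual algorithm: real rsync pairs a cheap rolling checksum with a strong hash, while this version just runs MD5 over every window, which is slow but short. The function names and the tiny block size are assumptions for illustration.

```python
import hashlib

BLOCK = 4  # tiny block size, chosen only to keep the example readable

def block_signatures(old: bytes, block: int = BLOCK) -> dict:
    """Map the MD5 of each full fixed-size block of `old` to its offset."""
    sigs = {}
    for offset in range(0, len(old) - block + 1, block):
        digest = hashlib.md5(old[offset:offset + block]).hexdigest()
        sigs.setdefault(digest, offset)
    return sigs

def delta(new: bytes, sigs: dict, block: int = BLOCK) -> list:
    """Describe `new` as copies of old blocks plus literal bytes."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        window = new[i:i + block]
        key = hashlib.md5(window).hexdigest() if len(window) == block else None
        if key in sigs:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal.clear()
            ops.append(("copy", sigs[key]))
            i += block
        else:
            literal.append(new[i])
            i += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops

def patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + block] if kind == "copy" else arg
    return bytes(out)

old = b"AAAABBBBCCCC"
new = b"AAAAxxBBBBCCCC"
ops = delta(new, block_signatures(old))
print(ops)
```

Only the two inserted bytes travel as literal data; everything else is expressed as references to blocks the receiver already has.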
rproxy applies the rsync protocol to http caching. I first heard about it at CALU in July 1999, and checked out some cvs code that worked at that time.
The general idea has been floating around for ages, though - look on the rproxy site for links to other people's ideas about this kind of thing.
This /is/ yet another case of a really dumb patent.
himi
--
My very own DeCSS mirror.
% telnet slashdot.org 80
Trying 64.28.67.150...
Connected to slashdot.org.
Escape character is '^]'.
HEAD / HTTP/1.0
HTTP/1.1 200 OK
Date: Tue, 24 Apr 2001 05:22:53 GMT
Server: Apache/1.3.12 (Unix) mod_perl/1.24
Connection: close
Content-Type: text/html
Connection closed by foreign host.
--
Terrorists can attack freedom, but only Congress can destroy it.
Nee Arrowpoint, the web balancers Slashdot itself uses.
It stores an MD5 checksum of a webpage to determine if the page it retrieved is complete. This is part of its timing mechanism to determine load. Pretty sure they did this prior to Feb. 99.
Yeah- this is one of those "Why didn't I think of that?" things
No, it isn't.
but I have yet to hear of a web cache or proxy that uses md5sums instead of last-modified headers- are there any out there?
No, because that's a completely different question.
Just FYI, this has been going on for _ages_. There was a 'web page change detector' available back in my 14.4kbps modem days (early 1995 - I can't remember what it was called, tho - been too damn long) that used this very technique... you fed a URL into a CGI, and it would poll the page every so often and email you if it had changed. And guess what? It used a checksum of the page to determine if it had changed (since storing all those pages would just take way too much storage space.)
This is _NOT_ new, and it's _NOT_ non-obvious.
Ask web crawler designers. When I was working on a web crawler, I wondered what would happen when pages got updated and how I would go about getting the latest version, so I had the crawler store each page with the date it was fetched and a checksum of the page. If a page hasn't been fetched in 10 days and is crawled again, it is fetched, the checksum is compared, and if it differs the page is parsed for potential new links/keywords... This is so obvious, I am sure that Google and the major search engines probably do this.
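Roughly what that crawler logic looks like, as a sketch only. The 10-day threshold comes from the comment above; the `visit` function, the in-memory `store` dict and the injected `fetch` callable are assumptions made so the example runs without a network.

```python
import hashlib
from datetime import datetime, timedelta

RECRAWL_AFTER = timedelta(days=10)

def visit(url, store, fetch, now):
    """Return True if the page should be (re)parsed for links/keywords.

    `store` maps url -> (fetched_at, checksum); `fetch` returns page bytes.
    """
    record = store.get(url)
    if record is not None and now - record[0] < RECRAWL_AFTER:
        return False  # fetched recently enough, skip it
    checksum = hashlib.md5(fetch(url)).hexdigest()
    changed = record is None or record[1] != checksum
    store[url] = (now, checksum)  # remember when and what we saw
    return changed

pages = {"http://example.com/": b"version 1"}
store = {}
t0 = datetime(2001, 4, 24)
print(visit("http://example.com/", store, pages.__getitem__, t0))
```

Storing only the date and the digest keeps the per-page cost to a few dozen bytes, which is the whole attraction over keeping full copies.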
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
http://www.geek-girl.com/ids/1995/0306.html
lots of postings here from 1995 about tripwire and its predecessors. . .
maybe the USPTO should post their patent requests to slashdot and let us find the prior art before they issue patents.
How about a site like http://find-prior-art.com that pays out money to the first people to find prior art for patent requests?
I can think of at least two excellent reasons off the top of my head.
First, it's a considerable expense and hassle. Patent attorneys are not optional - the claims have to be properly worded for the USPTO to accept them *and* to prevent some business from stealing your idea by rewording an ineffectual claim ever so slightly. If you're a business and want to create market entry barriers to your competition, $10-20k might be a good investment. If you're a working stiff, that's a lot harder to justify. If you're still in college, forget it!
Second, by seeking patents for "obvious" things we're implicitly accepting the validity of all other obvious patents. A sadly too common analogy is elections in corrupt regimes - you can organize a voter boycott because the election is corrupt, you can run your own candidate, but you can't do both.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
He's an idiot; don't expect him to actually think about things like that.
He thinks that if you disagree about a patent you are a communist. What kind of a moron thinks like that?
War is necrophilia.
I don't know why anyone needs this. Expiration dates and conditional loading of expired pages are already defined in HTTP/1.1 (RFC 2068), so instead of creating a hash, a server honouring requests such as 'If-Modified-Since' would do the job perfectly. There is also an entity tag already defined in the RFC. Deriving it from a hash is one possible way to create such a tag; encoding the document location and the date of the last change is another.
But in general a server using the last modification date of the file as the 'Last-Modified:' header would do the job well. Otherwise an entity tag would do the job. The hash would only make sense if the document could be retrieved under different URLs. Even then, sensible creation of an entity tag would do the job.
Then there is the Content-MD5 field for an integrity check (from RFC 2068):
The Content-MD5 entity-header field, as defined in RFC 1864 [23], is an MD5 digest of the entity-body for the purpose of providing an end-to-end message integrity check (MIC) of the entity-body. (Note: a MIC is good for detecting accidental modification of the entity-body in transit, but is not proof against malicious attacks.)
This is in the RFC dated January 1997. There are also guidelines on how proxies or clients should use these tags to check for expired documents. It's all there.
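For reference, the Content-MD5 value is just the base64 encoding of the 128-bit MD5 digest of the body, per RFC 1864. A minimal sketch; the function names `content_md5` and `body_intact` are made up here and are not part of any HTTP library.

```python
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Content-MD5 header value: base64 of the raw MD5 digest of the body."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

def body_intact(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 header that came with it."""
    return content_md5(body) == header_value.strip()

page = b"<html><body>Hello</body></html>"
header = content_md5(page)
print(header)
print(body_intact(page, header), body_intact(page + b"!", header))
```

As the quoted RFC text says, this catches accidental damage in transit and nothing more. Anyone who can alter the body can usually alter the header too.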
"By the way if anyone here is in advertising or marketing... kill yourself." -- Bill Hicks
I mean, how ridiculous can it get? You look up something you deem a good idea, then modify it slightly and patent it? Note that the method in the RFC doesn't refer to patents and thus is probably not patented. The authors thought it obvious to mark the document with tags to deduce the date of last modification, a unique id (for documents retrieved under this URL) and a checksum for an integrity check. Now some morons come along, see it already done, do it on parts and get a patent.
I would like to patent transporting morons. In parts.
"By the way if anyone here is in advertising or marketing... kill yourself." -- Bill Hicks
Besides, the US let out the REAL secret at Alamogordo, Hiroshima, and Nagasaki. Namely, that it was possible to build a working atomic bomb. Once the Russkis had that, the rest was engineering. They already knew the theory.
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Why would you want to checksum a file to see if it's changed? As a web server, the time stamp is adequate to determine if it's changed, and as a web browser or web proxy, HEAD is adequate to check the time stamp.
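A sketch of that timestamp check, with one catch worth showing: the HEAD response from slashdot.org quoted earlier in this thread carries no Last-Modified header at all, so the client has nothing to compare. The function name `changed_since` and the plain dict of headers are assumptions; real header names are case-insensitive, which this sketch ignores.

```python
from email.utils import parsedate_to_datetime

def changed_since(headers: dict, last_seen: str):
    """Compare a HEAD response's Last-Modified against a remembered date.

    Returns True or False when the header is present, and None when the
    server sent no Last-Modified header (timestamps cannot decide then).
    """
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value) > parsedate_to_datetime(last_seen)

# Headers as in the slashdot.org HEAD response quoted in this thread.
slashdot = {
    "Date": "Tue, 24 Apr 2001 05:22:53 GMT",
    "Server": "Apache/1.3.12 (Unix) mod_perl/1.24",
    "Content-Type": "text/html",
}
print(changed_since(slashdot, "Mon, 23 Apr 2001 00:00:00 GMT"))
```

When the answer is None, which is common for dynamically generated pages, the client is back to fetching the page and comparing content, for example by checksum.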
While we're at it. I'm going to rush to the patent office and see if I can "patent" 64bit date time stamps, so I have a lead in on the next big crisis!
-Michael
Did the patent office even try a Google search before stamping its approval on this patent?
Obviously not: http://www.google.com/search?q=web+checksum
Hit #2 is prior art: "BIBLINK.Checksum - an MD5 message digest for Web pages" . Note that: "This article last updated/links checked on 23-Sept-1998"
Not that I figure prior art will be hard to come by for this, but I did this in Squeak/Smalltalk for a CS project my sophomore year in college, 1998. And they've been using this project for several years of this class.
25% Funny, 25% Insightful, 25% Informative, 25% Troll
Taking a look at the patent content, it's not as simple as running the page through a checksum generator. This wouldn't work with some dynamically-generated pages, for example, because their dates of creation will change every time.
The process in the patent allows you to select a portion of the web page, and then the server only tracks changes in that portion. It also generates a checksum for each portion of content between HTML tags, and it is smart enough not to tell you that the content changed if certain sections got reordered, but the content's the same. It will also show you exactly which portions changed, since it has a separate checksum for each section.
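Here is one way the per-section idea described above could be sketched. This is a guess at the general technique, not the patent's actual method: splitting on anything that looks like a tag with a regular expression is a crude stand-in for real HTML parsing, and all function names are made up for illustration.

```python
import hashlib
import re
from collections import Counter

def sections(html: str) -> list:
    """Text chunks found between HTML tags, stripped, empties dropped."""
    return [s.strip() for s in re.split(r"<[^>]+>", html) if s.strip()]

def checksum(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def section_checksums(html: str) -> Counter:
    """Multiset of per-section checksums; the order of sections is ignored."""
    return Counter(checksum(s) for s in sections(html))

def content_changed(old_html: str, new_html: str) -> bool:
    return section_checksums(old_html) != section_checksums(new_html)

def added_sections(old_html: str, new_html: str) -> list:
    """Sections of the new page whose checksum was not seen in the old page."""
    seen = section_checksums(old_html)
    return [s for s in sections(new_html) if checksum(s) not in seen]

old = "<p>alpha</p><p>beta</p>"
print(content_changed(old, "<p>beta</p><p>alpha</p>"))
print(added_sections(old, "<p>alpha</p><p>gamma</p>"))
```

Comparing multisets rather than lists is what makes a pure reordering invisible, and keeping one checksum per section is what lets the tool point at the part that changed.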
It's not fusion power, but it's an ok idea, and I don't think anyone has used it before. So, let them have the patent.
----------
Never underestimate the bandwidth of a 747 filled with CD-ROMs.
This is the same company that developed and sold the synchronization software that supposedly worked with the Palm HotSync app to allow synchronization to other schedulers. Their conduit software worked once you took the days required to figure out how to install it correctly.
It figures that they'd come up with yet another harebrained scheme....
-drin
The posting begins, "Just when you thought nothing else stupid could be patented" . . . um, hello? Why the heck would ANY of us think that? Did I miss the story about the patent office coming to its senses?
---
"This message is composed of 100% recycled electrons."
Isn't this just doing stuff similar to what strong validators à la Entity Tags in HTTP requests and responses use for determining whether a page has been changed (i.e. is in the cache) or not?
The only difference I can see is that they generate an ETag-like entity for text highlighted by the user as well as the entire webpage. Doesn't seem worthy of a patent though.
--
Claim 1 of the patent reads:
1. A change-detection web server comprising:
a network connection for transmitting and receiving packets from a remote client and a remote document server;
a responder, coupled to the network connection, for communicating with the remote client, the responder registering a document for change detection by receiving from the remote client a uniform-resource-locator (URL) identifying the document, the responder fetching the document from the remote document server and generating an original checksum for a checked portion of the document, the checked portion being less than the entire document;
archival storage means, coupled to the responder, for receiving the URL and the original checksum from the responder when the document is registered by the remote client, the archival storage means for storing a plurality of records each containing a URL and a checksum for a registered document;
a periodic fetcher, coupled to the archival storage means and the network connection, for periodically re-fetching the document from the remote document server by transmitting the URL from the archival storage means to the network connection, the periodic fetcher receiving a fresh copy of the document from the remote document server,
a checksum generator, coupled to receive the fresh copy of the document from the periodic fetcher, for generating a fresh checksum of a portion of the fresh copy of the document and comparing the fresh checksum to the original checksum, the checksum generator signaling a detected change to the remote client when the fresh checksum does not match the original checksum,
whereby a change in the document is detected by comparing a checksum for the checked portion of the document, wherein changes in portions of the document outside the checked portion are not signaled to the remote client.
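To show how little machinery the claim describes, here is a rough sketch of its moving parts: register a URL, checksum a checked portion, re-fetch, compare, signal. It is an illustration under stated assumptions, not Pumatech's implementation: the checked portion is picked by a pair of marker strings, the fetcher is an injected callable, "archival storage" is a dict, and every name is hypothetical.

```python
import hashlib

def portion(document: str, start: str, end: str) -> str:
    """Text between the first `start` marker and the next `end` marker."""
    i = document.find(start)
    if i < 0:
        return ""
    i += len(start)
    j = document.find(end, i)
    return document[i:j] if j >= 0 else ""

class ChangeDetector:
    def __init__(self, fetch):
        self.fetch = fetch   # callable: url -> document text
        self.records = {}    # url -> (start, end, checksum)

    def _checksum(self, url, start, end):
        checked = portion(self.fetch(url), start, end)
        return hashlib.md5(checked.encode("utf-8")).hexdigest()

    def register(self, url, start, end):
        """Store the URL and the original checksum of the checked portion."""
        self.records[url] = (start, end, self._checksum(url, start, end))

    def poll(self):
        """Re-fetch every registered URL; return those whose portion changed."""
        changed = []
        for url, (start, end, old) in list(self.records.items()):
            fresh = self._checksum(url, start, end)
            if fresh != old:
                changed.append(url)
                self.records[url] = (start, end, fresh)
        return changed

pages = {"u": "<p>Today: Mon</p><div id='news'>A</div>"}
detector = ChangeDetector(pages.__getitem__)
detector.register("u", "<div id='news'>", "</div>")
pages["u"] = "<p>Today: Tue</p><div id='news'>A</div>"
print(detector.poll())
```

A change outside the checked portion, like the date line here, produces no signal, which is the "whereby" clause in a nutshell. Running `poll` on a timer would stand in for the periodic fetcher.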
So, the usual flame-before-reading crowd isn't entirely unjustified. (That's not to endorse flaming before reading, much less thinking, but hey, even a blind pig finds the occasional acorn.)
Oh, btw, the priority date is January 14, 1997. Leave it to the guys who do the press release to give the wrong impression of when the thing was invented. Not that doing a checksum and not recording non-changes wasn't just as obvious in 1997 as 1999.
Anyways, it's a silly patent. Checksums are a pretty fundamental thing to do! I don't even think my last company tried to patent it because it was so blatantly obvious!
Ahem ... no, they have patented a system for creating, storing, and using the checksum. An entire system, not just the storage of a checksum. Once again, alarmist headlines from /. I think we'd all appreciate it if these stories had accurate headlines.
--- Math illiteracy affects 8 out of every 5 people.
Lawyers can be like any other consultant. A lot of their advice can be such that it requires the constant presence of a lawyer to keep you out of legal trouble. I don't trust 'em any farther than I can throw 'em.
nope, that would involve...ummm...technical competence
A new and improved diffAgent server has been released which includes additional mediators. "A diffAgent watches information sources available via the web and e-mails you when it detects changes. In particular, it can:
- Watch your FedEx package for you and e-mail you when it sees the words "Package has been Delivered!" (make a package watcher agent)
- Monitor a list of query results at a search service like Altavista to see when new pages on your topic appear (make a web topic watcher agent)
- Keep track of news articles on a topic and mail you when it finds new ones (make a news topic watcher agent)
- Mail you when your name appears in a list of papers at an electronic archive (make a web page watcher agent)
- Tell you when the word "snow" appears on the Pittsburgh weather page (make a web page watcher agent)"
8/15/96
diffAgent had two modes. In the first mode, it stored a CRC checksum of the page, periodically compared checksums, and notified you of changes.
In the second mode, it stored the whole page, ran diff --context=3 over it to detect changed lines, and then grep'd for user-specified words of interest.
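A rough Python equivalent of that second mode, as a sketch. diffAgent itself ran `diff --context=3` and grep; here `difflib.ndiff` stands in for diff and a case-insensitive substring test stands in for grep, and only added lines are considered. The function names are made up.

```python
import difflib

def changed_lines(old: str, new: str) -> list:
    """Lines present in `new` but not in `old`, according to difflib."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def alerts(old: str, new: str, words: list) -> list:
    """Changed lines that mention any of the user's words of interest."""
    return [
        line for line in changed_lines(old, new)
        if any(word.lower() in line.lower() for word in words)
    ]

old_page = "Pittsburgh weather\nsunny\n"
new_page = "Pittsburgh weather\nsnow expected\n"
print(alerts(old_page, new_page, ["snow"]))
```

Unlike the checksum mode, this needs the full previous copy of the page, which is the storage trade-off several comments here bring up.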
I believe The NetMind web page was already up at that time, but they may not have had all of the features important to the patent. IMO, the NetMind technology is not worth a patent, but it is a bit beyond the diffAgent, and not entirely trivial to implement even if it is trivial to think of.
If you patent it, they will come.
I'm going to go back in my box and will think within the limits of my box: MS Sucks Linux Good I read too much Slashdot.
I've ALWAYS used checksums to do that kind of stuff. Unfortunately in scripts that aren't distributed publicly, but cripes, any damn fool could come up with that idea!
Another trick I've used is in scripts that generate static .html pages from a database: take the data used in the page (not the page itself), and make an md5 of the concatenation. Since most md5 routines can take data in chunks, you can generate it as you're getting the data. Then save the md5sum in a comment at the top. Then in the future you can compare the md5sum stored in the page with the md5sum of the data. If there is a "last modified" date on the page or something, this will only update it when the data changes.
I also use this trick for an automatic DNS updating script that creates zone files from a master data file. Can't just update the zone files every time because then the serial numbers would be updated constantly.
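A sketch of the static-page version of that trick. The comment format, the newline fed in after each row and the names `data_digest` and `render_if_changed` are all assumptions for illustration; the poster's own scripts are not shown in the thread.

```python
import hashlib
import re

STAMP = re.compile(r"^<!-- data-md5: ([0-9a-f]{32}) -->")

def data_digest(rows) -> str:
    """MD5 over the data, fed in chunk by chunk as it is read."""
    md5 = hashlib.md5()
    for row in rows:
        md5.update(row.encode("utf-8"))
        md5.update(b"\n")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return md5.hexdigest()

def render_if_changed(rows, existing_page, render):
    """Return the existing page untouched unless the underlying data changed."""
    digest = data_digest(rows)
    match = STAMP.match(existing_page or "")
    if match and match.group(1) == digest:
        return existing_page
    return "<!-- data-md5: " + digest + " -->\n" + render(rows)

def render(rows):
    return "<ul>" + "".join("<li>" + r + "</li>" for r in rows) + "</ul>"

first = render_if_changed(["a", "b"], None, render)
again = render_if_changed(["a", "b"], first, render)
print(again is first)
```

Hashing the data rather than the rendered page means a "last modified" line or a DNS serial number inside the output never triggers a pointless regeneration.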
So if anybody patents this silly idea (maybe they already have?), I've been using it for like eight years!! I'm publicly announcing it here on /.!!
Blah.
Besides I don't use NetMind anymore, I use SpyOnIt.
I note that Linux Focus already uses md5 to allow mirrors to check for updates to the pages. See that here.
Did the patent office even try a Google search before stamping its approval on this patent?
My Greasemonkey scripts for Digg &
Now to get on topic, does the patent office do any background checking on anything dealing with a computer program? Or do they just assume that since this was the first they read about this function, that it is obviously the first time it was implemented?
The HTTP protocol itself has had Jeff Mogul's cache optimization protocol in it since at least 1996.
It is yet another bogus patent. Time to use the proposal I made of issuing a civil action for perjury against people making fraudulent patent claims. I suspect that approach would cut down on the number of bogus applications.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
The danger of patents like these is not, IMHO, that someone is going to ask you to pay a license fee for your two line Perl program that uses checksumming but that when you really invent something original and worthwhile, patent protection will have been rendered meaningless by people simply ignoring it.
I believe people who work hard and ethically have a right to their billion dollars.
Hello? Heelllooo?!
No one makes a billion dollars by working 100,000 times harder than someone making 10K.
They make a billion dollars by having a horde of people who are earning 10K work for them. Check out Nike.
Phil Knight doesn't work any harder than the Vietnamese girls who make the shoes. Those girls are not *lazy*.
He makes his money by siphoning off the value from their labor, since they work in a corrupt government where unions and occupational safety codes are written by dictators who have no interest in protecting these "lazy" poor people.
There is no relationship, for example, between executive compensation and productivity.
What really lets people make huge amounts of money is not hard work (the Mexicans who wash the dishes in the restaurant where you dine are working very hard) and it's not intelligence (the college profs who taught you are probably pulling in 60K on average. The grad students are making 15-20K) but it's being able to position yourself into a role where you either manage people, or money, or both. Or maybe get a fat government monopoly on something (i.e. patents) that others use and skim off of their income. That, or just let your money "work" for you.
In either case the key to making big bucks is to park your behind right in the middle of some productivity intersection, and start taking tolls.
And if anyone objects, there will always be Ayn Rand-worshipping ideologues such as yourself to keep up the PR war, believing that this is somehow the ethical way to do business.
When in doubt, have a man come through a door with a gun in his hand.
Does an individual deserve to own a patent on checksumming? Surely not. But is there an argument to be made for collective ownership of the patent? I believe there is.
You see, when a patent is granted to an individual, the benefits aren't accrued solely by the individual. The entire society benefits, because that country now possesses a citizen who owns the patent and can wield it against other countries' citizens. The GNP is in whole raised because of efforts like these.
You can imagine how much richer the US economy would have been if we'd managed to patent the transistor before Japan got its own electronics markets running. You can imagine how much safer the world would be from nuclear warfare if the US had successfully patented atomic weapons before the Russians got their own projects going. Though the lifespan of a patent is only about 18 years, that would have been enough time to get some diplomatic solutions in place and prevent the escalated arms races of the Cold War.
What does this have to do with checksumming? Not much, I'm afraid. That's a stupid patent and we all know it. But let's not cut off our nose to spite our face when so much good can be done by a proper patent system.
Ha! I just patented 1-Click check sums... The rest of you will have to use the inferior "2-click" check sum...
RC