As a sysadmin who runs a Web search engine, I'm worried about the fact that the broad definition of "addressing, routing, and signaling" information in the new law would include not only my web logs, but also the search terms in the query string attached to each entry in those logs. It's time to PKZIP them all up with password protection.
What happens if the FBI comes knocking for my logs, with a court order signed by a judge who was required to sign without any showing of probable cause (yes, that's in the law), and they want all my logs on the speculation that certain search terms may be of interest to a particular criminal investigation?
And what would happen if they confiscated my computer, and then came back and asked for the PKZIP password to unzip the logs, and I said, "Gosh, I seem to have forgotten it?"
Of course, this new law also allows them to put Carnivore on my upstream provider, and I wouldn't even know about it. So I guess maybe it's silly to zip up my logs after all.
The problem is not merely that file extensions launch programs, and the association between extension and application is difficult to change.
The larger problem is this: new application software for Windows is typically file-extension oriented, and it's Microsoft that defines the important extensions. For example, I was evaluating a Windows full-text desktop document indexer recently, written by a small Windows development house. It was fast (written in assembly), and it could even do PDF and ZIP files.
But then I discovered that the years of files I had saved under legacy systems, starting with DOS, were completely invisible to this package. They were ASCII files, and I used my own file-naming conventions for the extension, so they weren't easily convertible to *.txt files. I had just been punished by this application for not going along with the Redmond game plan.
And here's another nightmare:
Consider, if you will, what happens when you ask Explorer to save a web page to disk. It uses a huge filename, and saves the images in separate directories. There's basically no way to get the thing back from the disk without using Explorer. That's why I take the trouble to Lynx-strip everything I want to archive, and put it into ASCII with a short filename.
Have you ever considered what it would be like to convert to Linux if all the filenames on your Windows system were around 80 bytes or so? Both Windows and Linux will accept filenames up to 254 bytes, but no one except a masochist would ever use a command-line system on filenames that long.
It's a conspiracy, I tell you. You gotta use a mouse, you gotta be using it in Explorer, and you gotta be interested in approved Microsoft files only, or you can forget it.
The liberals in Congress think they're sounding like civil
libertarians with their new, modified stand on Internet
surveillance. They say that the authorities should be allowed
warrantless taps to find out where you surfed, but not what you did
once you got there. The FBI has a right to know that you went to
Amazon, for example, but without a warrant they don't have a right
to know what books you bought. The legal distinction here is from
the old days: a "pen register" would record the number you dialed,
but not the conversation itself, and therefore qualified for a
looser legal standard.
But pundits don't realize that 99 percent of your Web activity can
be reconstructed from the Web's equivalent of "pen register"
information. The search terms you enter into search engines are
attached to the address itself. Do you believe that the FBI will
want this portion of the URL excluded simply because they don't
have probable cause? If and when the NSA is authorized to monitor
the backbone, do you expect that they will chop off the URL at the
question mark, so that this information is kept out of their
keyword-analysis supercomputers? Not likely.
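Chopping a URL at the question mark is mechanically trivial, which is why the real issue is whether anyone bothers to do it. A minimal Python sketch, assuming a made-up search URL (search.example.com and the q parameter are illustrations, not taken from any real log):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with everything from the question mark onward removed."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; drop the query string and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical example URL; the search terms ride along after the "?".
example = "http://search.example.com/search?q=cia+on+campus"
print(strip_query(example))
```

What survives is the "where you surfed" part. What gets dropped is the part that says what you were looking for.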
My reading of the provisions of the new Anti-Terrorism Act of 2001
suggests that a single, one-time certification by a federal
law-enforcement official that such information is needed in a
criminal investigation, without any showing of probable cause, is
enough to require a court to issue an order allowing a pen-register
tap on any Internet service provider presented with the order,
throughout the entire U.S. The definition of this "pen-register or
trap and trace device" information has been expanded for the
Internet. It now includes "other dialing, routing, addressing, and
signaling information reasonably likely to identify the source of a
wire or electronic communication (but not including the contents of
such communication)."
For example, some federal official could conceivably serve Google,
or any other search engine, with a court order demanding log
information for all those who searched for particular persons or
particular combinations of search terms. The "query strings"
consisting of the users' search terms are, in all standard HTTP
server logs, included along with the user's domain or IP number.
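To make that concrete, here is a rough Python sketch of pulling "who searched for what" out of one log entry. It assumes the common log format; the sample line, the /cgi-bin/search path, and the name parameter are all invented for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: common log format, with the request line in double quotes.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*"'
)


def who_searched_what(line):
    """Return (client host, decoded query parameters) or None if unparseable."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    query = urlsplit(match.group("target")).query
    return match.group("host"), parse_qs(query)


# Invented sample entry; 192.0.2.7 is a documentation-only address.
sample = ('192.0.2.7 - - [12/Oct/2001:10:15:32 -0500] '
          '"GET /cgi-bin/search?name=Smith%2C+John HTTP/1.0" 200 5120')
print(who_searched_what(sample))
```

One regular expression and one library call are enough to pair the address with the decoded search terms.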
One hopes that search engines would be inclined to challenge such
an order. But we may never know, because if they decide to
cooperate with the new law, their public relations office won't be
announcing this. The bottom line is that the phrase, "but not
including the contents of such communication," might be useful for
excluding the body of e-mail messages, but is mostly irrelevant for
Web surfing. This poor wording in the new law may mean that search
engines can no longer claim privacy at any level.
If someone wanted to redesign the entire Web for the express
purpose of surveillance, they couldn't do a better job than what we
already have. The profile that could be compiled if one had a list
of all the Web sites you visited, or all the search terms you've
used on Google, would be very revealing. The latter scenario is
more worrisome, because the former scenario, short of a
comprehensive backbone tap, would imply an order served locally at
your own ISP. You'd almost have to be pre-targeted by the
authorities. But a tap on a general search engine would amount to a
global sweep for information. Google currently gets about 110
million searches every day, most of which are from outside the U.S.
It would be tempting for the feds to monitor this traffic.
Am I the only one on earth who thinks that Judge Thomas Penfield Jackson knew what he was doing when he talked to the press?
Fact one: Jackson split off his statement of facts from his statement of law. The statement of facts was very competent and comprehensive.
Fact two: He didn't start talking to the press until after the statement of facts had been filed.
Fact three: This case is huge; it's the biggest anti-trust case in 100 years. There is an enormous amount of money behind Microsoft. Jackson knew that there was zero possibility that the case would be a slam-dunk on appeal. The higher courts have to answer to that much money to some extent; it's impossible to ignore it in a capitalist culture, particularly when soft money elects (read: appoints) Congress and the President.
BE IT RESOLVED, that Jackson had to throw a bone to Microsoft. He gives something for Microsoft to chew on, and he gives the higher courts a way to appear that they are carefully considering the legitimate interests of Capital.
So Jackson threw a fake drug-store doggie bone to Microsoft by talking to the press. They chewed happily on it for a year, virtually ignoring the hefty record of facts that had been compiled at the trial. The issue of Jackson's prejudice was highlighted by Microsoft in their appeal, setting the stage for the Circuit to unanimously affirm almost all of the important facts, while slapping down Judge Jackson.
All the media bite down on the same fake doggie bone. Microsoft thinks they won, because the appellate decision was superficially ambivalent and the media don't bother reading the record. Meanwhile, the important facts are upheld unanimously, and it seems unlikely at this point that the Supremes will even hear the case.
Here we are, 15 months after Judge Jackson's structural remedy. The remedy is thrown out, but the case was so big that something had to be thrown out in any event. Jackson's victory is that he got the least significant decision he made thrown out, by carefully orchestrating the appearance of prejudice according to a precise time table.
Here we are, 15 months later, and Microsoft has run out of delaying tactics. A new judge will be appointed who is required to do something major, because the facts have been affirmed unanimously and are not within his purview to challenge. He has merely to appear fair-minded, and his decision with respect to remedies will stick.
After reading the decision, I feel that it has implications for the legality of Google's practice of offering a cache copy.
The equivalent of the freelancer here would be the webmaster who owns a copyright on his website. Google creates a copy of files (and now thumbnail images as well) that is fragmented from the original context of the site.
The big difference is that Google never contracted with the website to begin with. All Google has is the implied permission of a failure by the webmaster to "opt out" with the robots.txt or META no-archive option. It seems to me that Google should pursue safer ground, and change their cache policy to an "opt in." They could easily do this by requiring a special Google-specific permission file on the site before flagging any of the files as cachable.
The essential points for website owners are these: 1) the cache copy shows a fragment of the site out of context, 2) the site owner loses control over distribution, and 3) the failure to opt out is not the same as signing a contract with Google.
Many website owners like the referrals they get from Google, but don't like the loss of control represented by Google's cache. In other words, they aren't in a position to exclude Google entirely with their robots.txt. As for the META no-archive for each individual file, this is clumsy, and may in fact flag the site for adverse scrutiny from Google.
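The all-or-nothing nature of the robots.txt opt-out can be seen with Python's standard robots.txt parser. This is only a sketch: the rules and the www.example.org URLs below are hypothetical, and the parser shows what a well-behaved crawler is supposed to honor, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Googlebot out of /cgi-bin/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets this agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("Googlebot", "http://www.example.org/cgi-bin/search?name=x"))
print(allowed("Googlebot", "http://www.example.org/index.html"))
```

Note that the file can only say "fetch" or "don't fetch." It has no way to say "index this, but don't keep a copy," which is why the cache question falls back on the per-file META tag.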
Anyone who claims that everything is on the Internet is not much interested in history. Events prior to 1995 are poorly covered. Events prior to about 1985, when newspapers began putting their contents online (usually available on a fee basis through Dialog or Nexis), are even more poorly covered. Even today, very few newspapers, magazines, and journals have issues prior to 1980 digitized.
Here's an example: Let's say that you are interested in the issue of the CIA's influence on and involvement with academia. Try an "AND" search for "CIA" and "campus," or "CIA on campus."
You will get lots of hits about the Culinary Institute of America, and student life there, and very little about the Central Intelligence Agency and academia. Why?
It's because the CIA on campus issue began in the late 1960s, and peaked twice since then -- once in 1977 and again in 1987. All three of these predate the web.
A site was just started less than two months ago to remedy this situation: http://www.cia-on-campus.org/
Almost all of the 27 documents currently posted there had to be OCRed from dead-tree records. Some of these articles were difficult to track down, and some were so yellowed from age that OCR didn't work.
Yet no one can claim that the issue of the CIA on campus is no longer important in 2001.
This idea is worth some thought. The basic problem is that the richness of the pages we produce in response to a name search is the very thing that is making it worthwhile to have our names represented on Google. A Google-referred user immediately appreciates what our site has to offer -- data visualization of interlinks between names, with clustering, cluster-click selection, etc.
If this richness is available to the Google user who arrives at our search results page via Google, then the same richness is available to the original crawler that put up the page.
But I appreciate the suggestion, and it may well be that some balance could be achieved that would bar Google from the "richness" but keep it open, available, and apparent for everyone else. We already do something like this -- the program that does the visualization is blocked to Google, so that the links Google gets are from a program that doesn't have to generate GIFs with client-side image maps, nor Java applets with cluster-clicking.
I run a site that's a cumulative name index of 700 books
and thousands of clippings. The indexing started in 1983.
For any name, you can get all the other names that share
pages with that name throughout the entire database. In
other words, each name search produces a page that contains
anywhere from several to several hundred additional names
-- all pre-linked directly to their own searches, which do
the same thing. You get the idea.
It's a bot's worst nightmare. But if you are Google, with
lots of crawlers to sic on the task, it quickly can become
my nightmare instead of Google's. Indeed, Google doesn't
seem to care much.
Last October I noticed that Google was inclined to stumble
into our cgi-bin on rare occasions, and actually do a
decent job of delivering referrals to the name data that it
got from us. I lifted the robots.txt exclusion to see what
would happen. No other bots have even delivered referrals
as consistently as Google, so I can only assume that Google
is the only bot that's even serious about going after the
dynamic web.
Either that, or their algorithms do a much better job on
our names, which are all listed as surname-first throughout
our site. If you search for a name in the news as Firstname
Lastname without quotes, Google will put our Lastname,
Firstname high on the list due to two facts: Our name is
part of the anchor description and they give link data more
points, and secondly, the two words are close to each other
and this adds to the score (even though they are backwards).
Google has come by once a month ever since I lifted
the robots.txt. Each time they spend about 10 days solid,
24/7, with from three to five crawlers, chasing all the
name searches. The rate from all the crawlers together for
those 10 days varies from about two name searches per
second to several per minute.
It's very erratic during that time; the crawlers don't talk
to each other, and there's no detectable pattern that
they're following. They don't manage to get through the
entire database of 115,000 names by any means. There is an
incredible amount of waste and duplication.
I had to install a load-sensitive thermostat so that when
our server hits a certain load threshold and it's Google
calling, it starts delivering "Server too busy" responses
instead of the search that was requested. That seems to
work pretty well, but they get all those "Server too busy"
messages stored in their cache copy for that name.
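For anyone who wants to try the same trick, here is a minimal Python sketch of such a thermostat. It is not the code running on our server: the threshold value, the function names, and the "googlebot" substring check are assumptions made for illustration. It also answers with a 503 status, the standard HTTP signal for a temporary overload, on the assumption that a crawler is less likely to file a 503 away as page content. Whether any given crawler respects that is not something this sketch can promise.

```python
import os

LOAD_THRESHOLD = 4.0  # assumed value; tune it for your own hardware


def should_throttle(user_agent, load_1min, threshold=LOAD_THRESHOLD):
    """True when the caller looks like Googlebot and the server is overloaded."""
    return "googlebot" in user_agent.lower() and load_1min >= threshold


def respond(user_agent, run_search, load_1min=None):
    """Return (status, body): the search result, or a busy notice for the bot."""
    if load_1min is None:
        load_1min = os.getloadavg()[0]  # one-minute load average (Unix only)
    if should_throttle(user_agent, load_1min):
        return 503, "Server too busy"
    return 200, run_search()


# Simulated overload: the crawler gets the busy notice, the search never runs.
print(respond("Googlebot/2.1", lambda: "search results", load_1min=9.0))
```

Ordinary visitors are never throttled here, whatever the load. Only the crawler is asked to back off.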
To put it bluntly, their bots are dumber than toast, and
if you don't watch them, they can turn your server into
toast.
Last November I wrote to Larry Page and offered to send him
the damn database on CD-ROMs, in discrete HTML files using
any specification he cared to define, so that his crawlers
wouldn't have to load down our servers once per month.
Mr. Page never responded. The letter was e-mailed, faxed,
and snail-mailed. Someone from google.com did a Larry Page
search shortly after I faxed it, so I'm pretty sure they
read the thing. I offered these CD-ROMs for free, and I
didn't ask for any changes in PageRank or any other
considerations. It would simply mean that I can get my
names onto Google efficiently and comprehensively, without
enduring that 10-day orgy once a month.
My point is that there is no real effort at Google to make
any sort of accommodation on a case-by-case basis with the
so-called "deep web." Until that happens, sites such as
mine have difficulty in allowing Google's crawlers to run
amuck once per month. We have other customers to consider.
Since Larry Page was touting his PageRank in early 1998, I'm wondering now about the application date on the PageRank patent application. The papers he published while in the computer science department at Stanford regarding PageRank fall under the "one year rule" for the PTO. From the time of the first such publication (or presentation at a conference), he has one year only to file an application. Otherwise, the patent application is DOA at the PTO.
Yes, there are two versions of the new Google toolbar. Yes, even if you download the advanced version with PageRank, you can easily turn it off.
Yes, it is quite likely that the 23 million search requests that Google handles every day, any of which result in a Google cookie with a unique ID in it generated by Google (assuming you don't already have such a cookie), are not personally identifiable at this time.
But added to these 23 million requests per day, are now the PageRank surfing history lists. These use the same Google cookies. If you don't have one already, one gets set immediately the first time you visit any page after your toolbar is installed with "advanced features" activated.
Most people don't know anything about cookies. Google is well on its way to building the best database in the world on search terms and surfing patterns.
What happens if someone buys Google and changes their current privacy policy?
And consider this ugly little fact:
The PageRank toy on the toolbar is a trick. It's only significant to less than one digit, ranking almost all non-porn sites between 5 and 9 on a scale of 1 to 10. The real PageRank is significant to at least 4 digits, according to a paper by Brin and Page delivered at an April, 1998 conference. You are potentially giving up a lot of privacy for this bogus PageRank toy in their toolbar.
And finally, put this in your pipe and smoke it -- all Google cookies expire on January 17, 2038.
Now I ask you, how does a 37-year cookie help Google improve their customer service? Why not a two-week expiration date? Why not a non-persistent cookie that lasts for the current browser session only?
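The alternatives are not hard to build. A minimal Python sketch of the two kinder options, using the standard http.cookies module; the cookie name ID and the value abc123 are invented, and nothing here reflects how Google's servers actually set their cookie:

```python
from http.cookies import SimpleCookie

TWO_WEEKS = 14 * 24 * 60 * 60  # lifetime in seconds


def make_cookie(name, value, max_age_seconds=None):
    """Build the value of a Set-Cookie header.

    With no max_age_seconds the cookie carries no lifetime at all, so the
    browser treats it as a session cookie and drops it when it closes.
    """
    cookie = SimpleCookie()
    cookie[name] = value
    if max_age_seconds is not None:
        cookie[name]["max-age"] = max_age_seconds
    return cookie[name].OutputString()


print(make_cookie("ID", "abc123"))             # session-only cookie
print(make_cookie("ID", "abc123", TWO_WEEKS))  # expires after two weeks
```

Either one would still let a site remember preferences during a visit. Neither one follows you around for decades.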
How long will it be before Google's data gets connected to personally-identifiable information?
Compuserve, owned by AOL, has been doing this for over a year. The situation with Compuserve is a bit different, since the initial dial-up connection is done via a proprietary protocol. Only after connection is established, can you minimize the Compuserve shell that's laid over Explorer and get on with your web surfing, whether with Netscape, or a Windows version of Lynx, or another version of Explorer.
In the 15 months I've been using Compuserve, I've noticed "portal creep," as CS takes advantage of the fact that they have a captive audience immediately after dialup. Now there are typically four layers of ads and assorted crap. You have to a) wait until they all download (you don't dare abort too early, or the connection won't persist), and then b) get mouse-finger RSI from hitting the X boxes in the upper right corner.
Meanwhile, you stay very well-informed about Britney Spears' latest hairdo!
AOL apparently looked enviously at the Compuserve situation after they acquired it, and said to themselves, "Hey, we can do the same thing by redesigning the browser!"
Using Explorer 5.0, I asked for my cookies at the
bottom of the page at http://www.pir.org/nocookie.html
(JavaScript has to be enabled for that test). The site
sent me a test spam that picked up my Slashdot cookie
that I asked for, using the cross-domain cookie reading
bug in Explorer. It sent back the cookie from my disk
in a second email. It has my Slashdot user ID in it AND
my Slashdot password, in plain text. All you need to do
is to make two passes with a URL decoder to get the
hex codes converted.
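To show how little protection double URL-encoding offers, here is a small Python sketch. The cookie contents are made up (this is not the real Slashdot cookie layout), and the sample is encoded twice only so there is something to decode:

```python
from urllib.parse import quote, unquote


def decode_passes(value, passes=2):
    """Run the URL decoder over the value the given number of times."""
    for _ in range(passes):
        value = unquote(value)
    return value


# Invented cookie contents, URL-encoded twice to mimic the scheme described.
plain = "user=12345&pass=hunter2"
encoded_twice = quote(quote(plain, safe=""), safe="")

print(encoded_twice)                 # looks scrambled at a glance
print(decode_passes(encoded_twice))  # two passes bring back the plain text
```

URL-encoding is a transport convention, not encryption. Anyone holding the cookie can reverse it with no key at all.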
So all the Slashdot cookies on MS Explorer have been
available for cross-domain reading for a long time,
and anyone who uses a favorite password on Slashdot
has been asking for trouble all that time.
Slashdot is legally liable for failure to exercise
due diligence with personally-identifiable information
provided to Slashdot by its users. If Taco will take
the trouble to contact Andover's lawyer, we can expect
Slashdot to email everyone in their database, advising
them to change their password.
There was some discussion of this a few months ago, but
Slashdot wasn't excited about the issue. I complained
to the CEO of the company that owns Slashdot, but didn't
hear back from him. (Any lawyers out there are welcome
to a copy of my fax to Andover CEO Bruce Twickler,
describing Slashdot's password vulnerability. It is
dated June 10, 2000 and describes precisely the demo
mentioned above in the first paragraph. Send a self-
addressed stamped envelope to PIR, PO Box 680635,
San Antonio TX 78268-0635.)
According to discussions on slashcode.com about three
months ago, some plans were afoot to improve the
situation, but no one seemed to be in any hurry.
The whole point of using a one-way hash to store the
password on the server is so that if your server gets
hacked, you don't have to notify all your users with
an email that says, essentially, "Dear user: We're
incompetent, and your password has been compromised.
Please change it."
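A minimal Python sketch of the one-way idea follows. It uses salted PBKDF2 from the standard hashlib module as a stand-in for whatever hash a site would actually choose. The iteration count and function names are assumptions for illustration, not anything from Slashdot's code.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; higher is slower to brute-force


def hash_password(password, salt=None):
    """Return (salt, digest). Only these are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the login attempt and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Someone who steals the stored salt and digest still has to guess the password. The hash cannot simply be read backwards the way a URL-encoded cookie can.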
This would be doubly annoying if the PDF files are raster images, rather than text. If it went into PDF as text, you can get Adobe's plug-in for the visually impaired, and export to a TXT or HTML file. You'll have to wrap, edit, and do some formatting with an editor, but at least the file is now readable and portable.
The PDF raster images are a nightmare to convert. You have to print them and then scan them, and then start editing. And the files are so bloated that you can't easily move them around on the Net. It seems to me that the publishers are really being ugly if the PDF files are raster images, but only being cheap and lazy if the 100 pages went in as text and can be taken out as text. In the latter case, I think they'd end up losing any advantage they might think they have, due to illegal sharing of the material between interested parties, privately, over the Net.
When MIT's Media Lab was founded in 1985 by Nicholas
Negroponte, the Lab emphasized computers and multimedia.
Ten years later it began its silly season with "Things that
Think" (chips in shoes or clothing that communicate with
the wearer, for example). But just then the Internet
materialized out of nowhere and caught the Lab with its
micropants down. Judging from its website, by now the MIT
Media Lab has made up for lost time by promoting projects
that expand e-commerce.
More interesting than anything the Lab has ever produced
is the fact that it's funded by big business. The Lab's annual
budget in 1995 was $25 million, mostly from 95 corporate
sponsors, half of which are overseas. While the Lab claims
that sponsors cannot dictate the research, it's also true
that grad students have to sign a nondisclosure agreement
before receiving aid, and sponsors often fund research that
is proprietary. Given this history, it's not surprising
that since the Internet arrived, the Lab has been chasing
the dot-com rainbow. But one has to ask: What about the
public sector? Where's the vision? Does anyone at the Media
Lab care?
This OpenMind project smells more like a rat than a mouse.
A computer knows only one thing, and it's the only thing
it is likely to ever know without insanely massive databases,
along with bloated fuzzy-logic programs that go by the name
of "artificial intelligence," but are really thinly-disguised
variants of brute force.
A computer knows this: one is not equal to zero.
Slashdot should try to stay clear of trendy hype backed by
big bucks. That includes Wired magazine, which received
start-up money from Nicholas Negroponte.
Try surfing a few porn sites, and then look at your cookies from hitbox.com. You will discover that hitbox.com saves the URLs and/or titles of some of the pages you surfed in plain text in your cookie.
So you can end up with plain text such as "Wild_Bondage" in your cookies.
I asked the general counsel and chief privacy officer of hitbox.com's parent company to at least start encrypting this info in the cookie, on the grounds that cross-domain cookie reading is possible for anyone (86 percent of the online population) who uses Explorer. That was a month ago. They checked out the demo I recommended, according to the logs, but never answered my e-mail. The demo is at http://www.pir.org/nocookie.html (toward the bottom of the page).
I was blown away by AltaVista when I discovered it in late 1995. Then the signal-to-noise ratio became problematic over the years. One year ago I was blown away by Google. There's hope! This new algorithm of weighting based on the extent to which other sites link to a page seems to work!
Then I realized that it works for our site (a nonprofit site offering public-sector information) only because we've been on the Net for over five years, and by now we've built up a fair number of links to us on other sites.
But I worry about someone coming online for the first time. They won't get listed in Google simply on the basis of their newness. They could offer a REAL cure for cancer, and no one would find them if they used Google.
So while Google has restored my faith in bots, I still feel that some attention ought to be paid to obscure public-sector information by the Google folks, apart from their automated rankings. Maybe a special wetware screening process for nonprofit sites.
I've seen too many sites with good public-sector information on them, and counters on the bottom that actually work, and then I make the computation and discover that they are getting only two or three hits per day. I think the Internet ought to be a bit easier than that for nonprofits with useful information.
There are some other dimensions to this case that no one has mentioned, and they concern journalistic ethics.
In the first place, NYT reporter James Risen is essentially taking this position when he claims to John Young that the families of those named may be at risk:
"You say that we have no clothes on. But unnamed, independent Iran experts we consulted say that we do indeed have clothes on. Therefore, if you assume that we are naked, you are responsible for endangering the lives of others."
A complaint I lodged with Mr. Risen in the past suggests to me that he is fully capable of spinning when he says that he consulted independent Iran experts who told him that the families of those named would be at risk. Indications are that these so-called "independent" sources could well be connected to the U.S. intelligence community -- the same people who use classification and secrecy to cover up incompetence and/or avoid accountability.
Therefore, I conclude that these families are not at risk, and Mr. Risen is protecting his sources, or his career, for entirely separate reasons. I asked him to put his experts on the record, but he didn't respond so I have no idea who they are.
I consulted an Iran expert who says that Iran hardly needs to read the New York Times to figure out what happened in 1953, and that the report would probably strike them as boring.
A newspaper has no business playing redaction games for what may be ulterior motives. I'd prefer that they skip the report entirely rather than establish a redaction precedent for journalism professionals. We'll eventually get the information another way -- from Iran, if we have to.
Passwords in cookies should be a one-way hash, such as the crypt() function performs. If you pick up the cookie for site configuration info, this is sufficiently unimportant so that no further verification of the user's ID is required. That's okay, and that's convenient.
But instead of picking up the cookie for the user's ID for posting under his nickname, you should require a login clear-text password. That way copying the cookie won't facilitate a stolen ID. You get the best of both worlds -- convenience and sufficient security. Right now, you get zero security because Slashdot is lazy.
The problem is with cookies. Destroy the reputation of cookies and you put a major dent in client-side persistence, as well as in overcommercialization on the Net. By the way, Slashdot has chocolate-stained fingers on this issue too:
How to post to Slashdot under your boss's Slashdot name, and at the same time find out if he surfs porn sites.
1) Your boss must use Explorer 4.0 or later on a Win32 platform, and use an HTML-enabled email client, and have JavaScript enabled. This is the default for perhaps 70 percent of the corporate environment. Better yet, if your boss uses Outlook Express, all he has to do is preview your innocent-looking email message. He may not notice anything happening. Or, you may get exposed and fired. It's your problem.
2) Go to http://www.pir.org/nocookie.html and read up on this exploit. Near the bottom of the page, enter YOUR email address for this exploit. But first read up on some more details on the next page, under "How to steal your boss's New York Times password and find out if he surfs porn sites." Mentally substitute Slashdot where it reads NYT on this page, except that Slashdot has NOT bothered to improve their password system, while the NYT at least started doing this one week after the Explorer vulnerability was discovered. (Slashdot uses a double URL-encoding scheme. But this demonstration appears to decode several times if necessary, merely to get the QUERY_STRING back to something approximating the cookie that it started with. Therefore, your Slashdot password ends up in plain text in this demonstration. Slashdot cannot claim to be using encryption of any sort.)
3) After you enter YOUR email (because YOU will receive your boss's cookie report), click on the domains you want. Slashdot uses at least two, www.slashdot.org and slashdot.org, so be sure to click both to get both. Add a few porn sites; hitbox.com is always a good bet.
4) You get the spam from the demonstration, which you must receive on an email client that is NOT automatically enabled for HTML. Then you paste the JavaScript code into another email for your boss and send it. As soon as your boss previews this email (there is NO attachment), you will get emailed a report on the cookies you clicked. Your boss probably won't notice anything happened, as he ignores yet another astute missive from you.
5) If your boss changes his Slashdot password after discovering that someone posted in his name, you have to do this again to get his new password. And again and again, until a) Slashdot starts encrypting their cookies, or b) you get discovered and fired. It's an even bet; it's been almost a year since someone complained about Slashdot's lack of cookie encryption, and they still haven't addressed the issue.
6) Microsoft is the baddest boy. They've been aware of the security problems of making email clients automatically enabled for receiving HTML since spring, 1996. Yet so far there has been no indication that they plan to make this feature switchable, and switched OFF by default.
The materialist doctrine concerning the changing of circumstances and upbringing forgets that circumstances are changed by men and that it is essential to educate the educator himself. This doctrine must, therefore, divide society into two parts, one of which is superior to society. The coincidence of the changing of circumstances and of human activity or self-changing can be conceived and rationally understood only as _revolutionary practice_. -- 3rd Theses on Feuerbach
As a sysadmin who runs a Web search engine, I'm worried about the fact that the broad definition of "addressing, routing, and signaling" information in the new law would include not only my web logs, but also the search terms in the query string attached to each entry in those logs. It's time to PKZIP them all up with password protection.
What happens if the FBI comes knocking for my logs, with a court order signed by a judge who was required to sign without any showing of probable cause (yes, that's in the law), and they want all my logs on the speculation that certain search terms may be of interest to a particular criminal investigation?
And what would happen if they confiscated my computer, and then came back and asked for the PKZIP password to unzip the logs, and I said, "Gosh, I seem to have forgotten it?"
Of course, this new law also allows them to put Carnivore on my upstream provider, and I wouldn't even know about it. So I guess maybe it's silly to zip up my logs after all.
The problem is not merely that file extensions launch programs, and the association between extension and application is difficult to change.
The larger problem is this: new application software for Windows is typically file-extension oriented, and it's Microsoft that defines the important extensions. For example, I was evaluating a Windows full-text desktop document indexer recently, written by a small Windows development house. It was fast (written in assembly), and it could even do PDF and ZIP files.
But then I discovered that the years of files I had saved under legacy systems, starting with DOS, were completely invisible to this package. They were ASCII files, and I used my own file-naming conventions for the extension, so they weren't easily convertible to *.txt files. I had just been punished by this application for not going along with the Redmond game plan.
And here's another nightmare:
Consider, if you will, what happens when you ask Explorer to save a web page to disk. It uses a huge filename, and saves the images in separate directories. There's basically no way to get the thing back from the disk without using Explorer. That's why I take the trouble to Lynx-strip everything I want to archive, and put it into ASCII with a short filename.
Have you ever considered what it would be like to convert to Linux if all the filenames on your Windows system were around 80 bytes or so? Both Windows and Linux will accept filenames of roughly 255 characters, but no one except a masochist would ever use a command-line system on filenames that long.
It's a conspiracy, I tell you. You gotta use a mouse, you gotta be using it in Explorer, and you gotta be interested in approved Microsoft files only, or you can forget it.
The liberals in Congress think they're sounding like civil
libertarians with their new, modified stand on Internet
surveillance. They say that the authorities should be allowed
warrantless taps to find out where you surfed, but not what you did
once you got there. The FBI has a right to know that you went to
Amazon, for example, but without a warrant they don't have a right
to know what books you bought. The legal distinction here is from
the old days: a "pen register" would record the number you dialed,
but not the conversation itself, and therefore qualified for a
looser legal standard.
But pundits don't realize that 99 percent of your Web activity can
be reconstructed from the Web's equivalent of "pen register"
information. The search terms you enter into search engines are
attached to the address itself. Do you believe that the FBI will
want this portion of the URL excluded simply because they don't
have probable cause? If and when the NSA is authorized to monitor
the backbone, do you expect that they will chop off the URL at the
question mark, so that this information is kept out of their
keyword-analysis supercomputers? Not likely.
My reading of the provisions of the new Anti-Terrorism Act of 2001
suggests that a single, one-time certification by a federal
law-enforcement official that such information is needed in a
criminal investigation, without any showing of probable cause, is
enough to require a court to issue an order allowing a pen-register
tap on any Internet service provider presented with the order,
throughout the entire U.S. The definition of this "pen-register or
trap and trace device" information has been expanded for the
Internet. It now includes "other dialing, routing, addressing, and
signaling information reasonably likely to identify the source of a
wire or electronic communication (but not including the contents of
such communication)."
For example, some federal official could conceivably serve Google,
or any other search engine, with a court order demanding log
information for all those who searched for particular persons or
particular combinations of search terms. The "query strings"
consisting of the users' search terms are, in all standard HTTP
server logs, included along with the user's domain or IP number.
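To make the point concrete, here is a minimal sketch of how
little work it takes to pull search terms out of a log entry,
and what "chopping off the URL at the question mark" would look
like. The log line, the /cgi-bin/search path, and the parameter
name "q" are all made up for illustration; real logs and real
search engines vary.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical entry in Common Log Format. The address, path, and
# the parameter name "q" are assumptions, not taken from any real log.
LOG_LINE = ('192.0.2.17 - - [12/Oct/2001:14:03:22 -0500] '
            '"GET /cgi-bin/search?q=john+doe+cia HTTP/1.0" 200 5120')

def split_log_line(line, param='q'):
    """Return (client address, path, list of search strings)."""
    client = line.split(' ', 1)[0]
    request = line.split('"')[1]        # GET /path?query HTTP/1.0
    url = request.split(' ')[1]
    parts = urlsplit(url)
    terms = parse_qs(parts.query).get(param, [])
    return client, parts.path, terms

def chop_at_question_mark(url):
    """Keep only the part of the URL before the query string."""
    return url.split('?', 1)[0]

client, path, terms = split_log_line(LOG_LINE)
```

The address and the search terms come out of the same line, which
is why a "pen register" standard applied to these logs exposes
far more than a dialed number ever did.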
One hopes that search engines would be inclined to challenge such
an order. But we may never know, because if they decide to
cooperate with the new law, their public relations office won't be
announcing this. The bottom line is that the phrase, "but not
including the contents of such communication," might be useful for
excluding the body of e-mail messages, but is mostly irrelevant for
Web surfing. This poor wording in the new law may mean that search
engines can no longer claim privacy at any level.
If someone wanted to redesign the entire Web for the express
purpose of surveillance, they couldn't do a better job than what we
already have. The profile that could be compiled if one had a list
of all the Web sites you visited, or all the search terms you've
used on Google, would be very revealing. The latter scenario is
more worrisome, because the former scenario, short of a
comprehensive backbone tap, would imply an order served locally at
your own ISP. You'd almost have to be pre-targeted by the
authorities. But a tap on a general search engine would amount to a
global sweep for information. Google currently gets about 110
million searches every day, most of which are from outside the U.S.
It would be tempting for the feds to monitor this traffic.
Why not alter our genes to prevent misanthropic scientists from taking over the world?
Am I the only one on earth who thinks that Judge Thomas Penfield Jackson knew what he was doing when he talked to the press?
Fact one: Jackson split off his statement of facts from his statement of law. The statement of facts was very competent and comprehensive.
Fact two: He didn't start talking to the press until after the statement of facts had been filed.
Fact three: This case is huge; it's the biggest anti-trust case in 100 years. There is an enormous amount of money behind Microsoft. Jackson knew that there was zero possibility that the case would be a slam-dunk on appeal. The higher courts have to answer to that much money to some extent; it's impossible to ignore it in a capitalist culture, particularly when soft money elects (read: appoints) Congress and the President.
BE IT RESOLVED, that Jackson had to throw a bone to Microsoft. He gives something for Microsoft to chew on, and he gives the higher courts a way to appear that they are carefully considering the legitimate interests of Capital.
So Jackson threw a fake drug-store doggie bone to Microsoft by talking to the press. They chewed happily on it for a year, virtually ignoring the hefty record of facts that had been compiled at the trial. The issue of Jackson's prejudice was highlighted by Microsoft in their appeal, setting the stage for the Circuit to unanimously affirm almost all of the important facts, while slapping down Judge Jackson.
All the media bite down on the same fake doggie bone. Microsoft thinks it won, because the appellate decision was superficially ambivalent and the media don't bother reading the record. Meanwhile, the important facts are upheld unanimously, and it seems unlikely at this point that the Supremes will even hear the case.
Here we are, 15 months after Judge Jackson's structural remedy. The remedy is thrown out, but the case was so big that something had to be thrown out in any event. Jackson's victory is that he got the least significant decision he made thrown out, by carefully orchestrating the appearance of prejudice according to a precise time table.
Here we are, 15 months later, and Microsoft has run out of delaying tactics. A new judge will be appointed who is required to do something major, because the facts have been affirmed unanimously and are not within his purview to challenge. He has merely to appear fair-minded, and his decision with respect to remedies will stick.
Stick that in your ear, Microsoft.
After reading the decision, I feel that it has implications for the legality of Google's practice of offering a cache copy.
The equivalent of the freelancer here would be the webmaster who owns a copyright on his website. Google creates a copy of files (and now thumbnail images as well) that is fragmented from the original context of the site.
The big difference is that Google never contracted with the website to begin with. All Google has is the implied permission of a failure by the webmaster to "opt out" with the robots.txt or META no-archive option. It seems to me that Google should pursue safer ground, and change their cache policy to an "opt in." They could easily do this by requiring a special Google-specific permission file on the site before flagging any of the files as cachable.
The essential points for website owners are these: 1) the cache copy shows a fragment of the site out of context, 2) the site owner loses control over distribution, and 3) the failure to opt out is not the same as signing a contract with Google.
Many website owners like the referrals they get from Google, but don't like the loss of control represented by Google's cache. In other words, they aren't in a position to exclude Google entirely with their robots.txt. As for the META no-archive for each individual file, this is clumsy, and may in fact flag the site for adverse scrutiny from Google.
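For readers who haven't dealt with the opt-out mechanism, here is a small sketch of how a robots.txt exclusion gets interpreted, using the parser in Python's standard library. The robots.txt content and the example.org URLs are hypothetical. Note that this only shows the crawl exclusion; the per-page META no-archive tag is a separate mechanism and isn't modeled here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /cgi-bin/ while
# leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""

def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

blocked = allowed('Googlebot', 'http://www.example.org/cgi-bin/names')
open_page = allowed('Googlebot', 'http://www.example.org/index.html')
```

The exclusion is all-or-nothing for the paths it names: a crawler that is kept out doesn't cache the page, but it doesn't send referrals to it either.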
Anyone who claims that everything is on the Internet is not much interested in history. Events prior to 1995 are poorly covered. Events prior to about 1985, when newspapers began putting their contents online (usually available on a fee basis through Dialog or Nexis), are even more poorly covered. Even today, very few newspapers, magazines, and journals have issues prior to 1980 digitized.
Here's an example: Let's say that you are interested in the issue of the CIA's influence on and involvement with academia. Try an "AND" search for "CIA" and "campus," or "CIA on campus."
You will get lots of hits about the Culinary Institute of America, and student life there,
and very little about the Central Intelligence Agency and academia. Why?
It's because the CIA on campus issue began in the late 1960s, and peaked twice since then -- once in 1977 and again in 1987. All three of these predate the web.
A site was just started less than two months ago to remedy this situation: http://www.cia-on-campus.org/
Almost all of the 27 documents currently posted there had to be OCRed from dead-tree records. Some of these articles were difficult to track down, and some were so yellowed from age that OCR didn't work.
Yet no one can claim that the issue of the CIA on campus is no longer important in 2001.
This idea is worth some thought. The basic problem is that the richness of the pages we produce in response to a name search is the very thing that is making it worthwhile to have our names represented on Google. A Google-referred user immediately appreciates what our site has to offer -- data visualization of interlinks between names, with clustering, cluster-click selection, etc.
If this richness is available to the Google user who arrives at our search results page via Google, then the same richness is available to the original crawler that put up the page.
But I appreciate the suggestion, and it may well be that some balance could be achieved that would bar Google from the "richness" but keep it open, available, and apparent for everyone else. We already do something like this -- the program that does the visualization is blocked to Google, so that the links Google gets are from a program that doesn't have to generate GIFs with client-side image maps, nor Java applets with cluster-clicking.
I run a site that's a cumulative name index of 700 books
and thousands of clippings. The indexing started in 1983.
For any name, you can get all the other names that share
pages with that name throughout the entire database. In
other words, each name search produces a page that contains
anywhere from several to several hundred additional names
-- all pre-linked directly to their own searches, which do
the same thing. You get the idea.
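If you don't get the idea, here is a toy sketch of that kind of
index. It is not our actual code; the page identifiers and names
are invented, and a real index of this size would live on disk
rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical sample data: page identifier -> names on that page.
PAGES = {
    'book1:p12': ['Doe, John', 'Roe, Jane'],
    'book1:p40': ['Roe, Jane', 'Smith, Alan'],
    'clip7':     ['Doe, John', 'Smith, Alan', 'Poe, Ann'],
}

def build_cooccurrence(pages):
    """Map each name to the set of other names sharing a page with it."""
    shared = defaultdict(set)
    for names in pages.values():
        for name in names:
            shared[name].update(n for n in names if n != name)
    return shared

def name_search(shared, name):
    """Return the linked names for one search, in sorted order."""
    return sorted(shared.get(name, ()))

index = build_cooccurrence(PAGES)
hits = name_search(index, 'Doe, John')
```

Every name in the result is itself a valid search, so a crawler
that follows each link keeps finding more links to follow.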
It's a bot's worst nightmare. But if you are Google, with
lots of crawlers to sic on the task, it quickly can become
my nightmare instead of Google's. Indeed, Google doesn't
seem to care much.
Last October I noticed that Google was inclined to stumble
into our cgi-bin on rare occasions, and actually do a
decent job of delivering referrals to the name data that it
got from us. I lifted the robots.txt exclusion to see what
would happen. No other bot has ever delivered referrals
as consistently as Google, so I can only assume that Google
is the only bot that's even serious about going after the
dynamic web.
Either that, or their algorithms do a much better job on
our names, which are all listed as surname-first throughout
our site. If you search for a name in the news as Firstname
Lastname without quotes, Google will put our Lastname,
Firstname high on the list due to two facts: Our name is
part of the anchor description and they give link data more
points, and secondly, the two words are close to each other
and this adds to the score (even though they are backwards).
Google has come by once a month ever since I lifted
the robots.txt. Each time they spend about 10 days solid,
24/7, with from three to five crawlers, chasing all the
name searches. The rate from all the crawlers together for
those 10 days varies from about two name searches per
second to several per minute.
It's very erratic during that time; the crawlers don't talk
to each other, and there's no detectable pattern that
they're following. They don't manage to get through the
entire database of 115,000 names by any means. There is an
incredible amount of waste and duplication.
I had to install a load-sensitive thermostat so that when
our server hits a certain load threshold and it's Google
calling, it starts delivering "Server too busy" responses
instead of the search that was requested. That seems to
work pretty well, but they get all those "Server too busy"
messages stored in their cache copy for that name.
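The thermostat idea can be sketched roughly like this. It is an
illustration, not the script we run: the threshold value, the
user-agent test, and the function names are assumptions. It
returns status 503, which is what HTTP defines for temporary
overload; whether a given crawler treats that differently from
an ordinary page is not something I can vouch for.

```python
import os

LOAD_THRESHOLD = 4.0   # hypothetical one-minute load average limit

def should_shed(user_agent, load, threshold=LOAD_THRESHOLD):
    """True when the server is loaded and the caller looks like Google."""
    return 'googlebot' in user_agent.lower() and load >= threshold

def respond(user_agent, run_search, load=None):
    """Return (status, body), refusing the crawler under heavy load."""
    if load is None:
        load = os.getloadavg()[0]   # one-minute average, Unix only
    if should_shed(user_agent, load):
        return 503, 'Server too busy'
    return 200, run_search()
```

Ordinary visitors always get their search; only the crawler is
turned away, and only while the load stays above the threshold.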
To put it bluntly, their bots are dumber than toast, and
if you don't watch them, they can turn your server into
toast.
Last November I wrote to Larry Page and offered to send him
the damn database on CD-ROMs, in discrete HTML files using
any specification he cared to define, so that his crawlers
wouldn't have to load down our servers once per month.
Mr. Page never responded. The letter was e-mailed, faxed,
and snail-mailed. Someone from google.com did a Larry Page
search shortly after I faxed it, so I'm pretty sure they
read the thing. I offered these CD-ROMs for free, and I
didn't ask for any changes in PageRank or any other
considerations. It would simply mean that I can get my
names onto Google efficiently and comprehensively, without
enduring that 10-day orgy once a month.
My point is that there is no real effort at Google to make
any sort of accommodation on a case-by-case basis with the
so-called "deep web." Until that happens, sites such as
mine have difficulty in allowing Google's crawlers to run
amuck once per month. We have other customers to consider.
Since Larry Page was touting his PageRank in early 1998, I'm wondering now about the application date on the PageRank patent application. The papers he published while in the computer science department at Stanford regarding PageRank fall under the "one year rule" for the PTO. From the time of the first such publication (or presentation at a conference), he has one year only to file an application. Otherwise, the patent application is DOA at the PTO.
Yes, there are two versions of the new Google toolbar. Yes, even if you download the advanced version with PageRank, you can easily turn it off.
Yes, it is quite likely that the 23 million search requests that Google handles every day, any of which result in a Google cookie with a unique ID in it generated by Google (assuming you don't already have such a cookie), are not personally identifiable at this time.
But added to these 23 million requests per day, are now the PageRank surfing history lists. These use the same Google cookies. If you don't have one already, one gets set immediately the first time you visit any page after your toolbar is installed with "advanced features" activated.
Most people don't know anything about cookies. Google is well on its way to building the best database in the world on search terms and surfing patterns.
What happens if someone buys Google and changes their current privacy policy?
And consider this ugly little fact:
The PageRank toy on the toolbar is a trick. It's only significant to less than one digit, ranking almost all non-porn sites between 5 and 9 on a scale of 1 to 10. The real PageRank is significant to at least 4 digits, according to a paper by Brin and Page delivered at an April, 1998 conference. You are potentially giving up a lot of privacy for this bogus PageRank toy in their toolbar.
And finally, put this in your pipe and smoke it -- all Google cookies expire on January 17, 2038.
Now I ask you, how does a 37-year cookie help Google improve their customer service? Why not a two-week expiration date? Why not a non-persistent cookie that lasts for the current browser session only?
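For comparison, here is a sketch of the alternatives, built with Python's standard cookie module. The cookie name and value are made up. The last function shows the latest moment a signed 32-bit Unix clock can represent, which falls within two days of the expiration date above; my guess, and it is only a guess, is that the date was simply set near the maximum.

```python
from datetime import datetime, timezone
from http.cookies import SimpleCookie

TWO_WEEKS = 14 * 24 * 60 * 60   # seconds

def make_cookie(name, value, max_age=None):
    """Build a Set-Cookie value; with no max_age it is session-only."""
    cookie = SimpleCookie()
    cookie[name] = value
    if max_age is not None:
        cookie[name]['max-age'] = max_age
    return cookie[name].OutputString()

def last_32bit_second():
    """Latest moment a signed 32-bit Unix clock can represent."""
    return datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)

# Hypothetical cookie name and value, for illustration only.
session_cookie = make_cookie('ID', 'abc123')
short_cookie = make_cookie('ID', 'abc123', max_age=TWO_WEEKS)
```

A cookie with no expiration attribute disappears when the browser closes, and a two-week cookie is one extra attribute. Neither is harder to implement than a 37-year one.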
How long will it be before Google's data gets connected to personally-identifiable information?
Wake up, people.
Compuserve, owned by AOL, has been doing this for over a year. The situation with Compuserve is a bit different, since the initial dial-up connection is done via a proprietary protocol. Only after connection is established, can you minimize the Compuserve shell that's laid over Explorer and get on with your web surfing, whether with Netscape, or a Windows version of Lynx, or another version of Explorer.
In the 15 months I've been using Compuserve, I've noticed "portal creep," as CS takes advantage of the fact that they have a captive audience immediately after dialup. Now there are typically four layers of ads and assorted crap that you have to a) wait for while they download (you don't dare abort too early, or the connection won't persist), and b) close one by one, getting mouse-finger RSI from hitting the X boxes in the upper right corner.
Meanwhile, you stay very well-informed about Britney Spears' latest hairdo!
AOL apparently looked enviously at the Compuserve situation after they acquired it, and said to themselves, "Hey, we can do the same thing by redesigning the browser!"
Using Explorer 5.0, I asked for my cookies at the
bottom of the page at http://www.pir.org/nocookie.html
(JavaScript has to be enabled for that test). The site
sent me a test spam that picked up my Slashdot cookie
that I asked for, using the cross-domain cookie reading
bug in Explorer. It sent back the cookie from my disk
in a second email. It has my Slashdot user ID in it AND
my Slashdot password, in plain text. All you need to do
is to make two passes with a URL decoder to get the
hex codes converted.
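The two passes look like this in practice. The cookie value
below is invented and is not Slashdot's actual cookie layout;
the point is only that double URL-encoding is reversible by
anyone and is not encryption.

```python
from urllib.parse import quote, unquote

def decode_passes(text, passes=2):
    """Run the URL decoder over the text the given number of times."""
    for _ in range(passes):
        text = unquote(text)
    return text

# Hypothetical cookie contents, encoded twice for the demonstration.
plain = 'user=1234::password=hunter2'
encoded = quote(quote(plain, safe=''), safe='')

recovered = decode_passes(encoded)
```

After one pass the hex codes are still visible; after the second
the user ID and password are back in plain text.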
So all the Slashdot cookies on MS Explorer have been
available for cross-domain reading for a long time,
and anyone who uses a favorite password on Slashdot
has been asking for trouble all that time.
Slashdot is legally liable for failure to exercise
due diligence with personally-identifiable information
provided to Slashdot by its users. If Taco will take
the trouble to contact Andover's lawyer, we can expect
Slashdot to email everyone in their database, advising
them to change their password.
There was some discussion of this a few months ago, but
Slashdot wasn't excited about the issue. I complained
to the CEO of the company that owns Slashdot, but didn't
hear back from him. (Any lawyers out there are welcome
to a copy of my fax to Andover CEO Bruce Twickler,
describing Slashdot's password vulnerability. It is
dated June 10, 2000 and describes precisely the demo
mentioned above in the first paragraph. Send a self-
addressed stamped envelope to PIR, PO Box 680635,
San Antonio TX 78268-0635.)
According to discussions on slashcode.com about three
months ago, some plans were afoot to improve the
situation, but no one seemed to be in any hurry.
The whole point of using a one-way hash to store the
password on the server is so that if your server gets
hacked, you don't have to notify all your users with
an email that says, essentially, "Dear user: We're
incompetent, and your password has been compromised.
Please change it."
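For anyone unclear on what a one-way hash buys you, here is a
minimal sketch. It uses PBKDF2 from Python's standard library as
a stand-in; that choice, the salt size, and the round count are
my assumptions, not anything Slashdot uses or has announced.

```python
import hashlib
import hmac
import os

ROUNDS = 100_000   # illustrative work factor, an assumption

def hash_password(password, salt=None):
    """Return (salt, digest); only these are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'),
                                 salt, ROUNDS)
    return salt, digest

def check_password(password, salt, digest):
    """Hash the login attempt the same way and compare the results."""
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'),
                                    salt, ROUNDS)
    return hmac.compare_digest(candidate, digest)
```

The server can verify a login without ever keeping the password
itself, so a stolen copy of the stored values does not hand over
anyone's favorite password directly.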
This would be doubly annoying if the PDF files are raster images, rather than text. If it went into PDF as text, you can get Adobe's plug-in for the visually impaired, and export to a TXT or HTML file. You'll have to wrap, edit, and do some formatting with an editor, but at least the file is now readable and portable.
The PDF raster images are a nightmare to convert. You have to print them and then scan them, and then start editing. And the files are so bloated that you can't easily move them around on the Net. It seems to me that the publishers are really being ugly if the PDF files are raster images, but only being cheap and lazy if the 100 pages went in as text and can be taken out as text. In the latter case, I think they'd end up losing any advantage they might think they have, due to illegal sharing of the material between interested parties, privately, over the Net.
When MIT's Media Lab was founded in 1985 by Nicholas
Negroponte, the Lab emphasized computers and multimedia.
Ten years later it began its silly season with "Things that
Think" (chips in shoes or clothing that communicate with
the wearer, for example). But just then the Internet
materialized out of nowhere and caught the Lab with its
micropants down. Judging from its website, by now the MIT
Media Lab has made up for lost time by promoting projects
that expand e-commerce.
More interesting than anything the Lab has ever produced
is the fact that it's funded by big business. The Lab's annual
budget in 1995 was $25 million, mostly from 95 corporate
sponsors, half of which are overseas. While the Lab claims
that sponsors cannot dictate the research, it's also true
that grad students have to sign a nondisclosure agreement
before receiving aid, and sponsors often fund research that
is proprietary. Given this history, it's not surprising
that since the Internet arrived, the Lab has been chasing
the dot-com rainbow. But one has to ask: What about the
public sector? Where's the vision? Does anyone at the Media
Lab care?
This OpenMind project smells more like a rat than a mouse.
A computer knows only one thing, and it's the only thing
it is likely to ever know without insanely massive databases,
along with bloated fuzzy-logic programs that go by the name
of "artificial intelligence," but are really thinly-disguised
variants of brute force.
A computer knows this: one is not equal to zero.
Slashdot should try to stay clear of trendy hype backed by
big bucks. That includes Wired magazine, which received
start-up money from Nicholas Negroponte.
Try surfing a few porn sites, and then look at your cookies from hitbox.com. You will discover that hitbox.com saves the URLs and/or titles of some of the pages you surfed in plain text in your cookie.
So you can end up with plain text such as "Wild_Bondage" in your cookies.
I asked the general counsel and chief privacy officer of hitbox.com's parent company to at least start encrypting this info in the cookie, on the grounds that cross-domain cookie reading is possible for anyone (86 percent of the online population) who uses Explorer. That was a month ago. They checked out the demo I recommended, according to the logs, but never answered my e-mail. The demo is at http://www.pir.org/nocookie.html (toward the bottom of the page).
I was blown away by AltaVista when I discovered it in late 1995. Then the signal-to-noise ratio became problematic over the years. One year ago I was blown away by Google. There's hope! This new algorithm of weighting based on the extent to which other sites link to a page seems to work!
Then I realized that it works for our site (a nonprofit site offering public-sector information) only because we've been on the Net for over five years, and by now we've built up a fair number of links to us on other sites.
But I worry about someone coming online for the first time. They won't get listed in Google simply on the basis of their newness. They could offer a REAL cure for cancer, and no one would find them if they used Google.
So while Google has restored my faith in bots, I still feel that some attention ought to be paid to obscure public-sector information by the Google folks, apart from their automated rankings. Maybe a special wetware screening process for nonprofit sites.
I've seen too many sites with good public-sector information on them, and counters on the bottom that actually work, and then I make the computation and discover that they are getting only two or three hits per day. I think the Internet ought to be a bit easier than that for nonprofits with useful information.
There are some other dimensions to this case that no one has
mentioned, and they concern journalistic ethics.
In the first place, NYT reporter James Risen is essentially
taking this position when he claims to John Young that the
families of those named may be at risk:
"You say that we have no clothes on. But unnamed, independent
Iran experts we consulted say that we do indeed have clothes on.
Therefore, if you assume that we are naked, you are responsible
for endangering the lives of others."
A complaint I lodged with Mr. Risen in the past suggests to me
that he is fully capable of spinning when he says that he
consulted independent Iran experts who told him that the
families of those named would be at risk. Indications are that
these so-called "independent" sources could well be connected to
the U.S. intelligence community -- the same people who use
classification and secrecy to cover up incompetence and/or avoid
accountability.
Therefore, I conclude that these families are not at risk, and
Mr. Risen is protecting his sources, or his career, for entirely
separate reasons. I asked him to put his experts on the record,
but he didn't respond so I have no idea who they are.
I consulted an Iran expert who says that Iran hardly needs to
read the New York Times to figure out what happened in 1953, and
that the report would probably strike them as boring.
A newspaper has no business playing redaction games for what may
be ulterior motives. I'd prefer that they skip the report
entirely rather than establish a redaction precedent for
journalism professionals. We'll eventually get the information
another way -- from Iran, if we have to.
Passwords in cookies should be a one-way hash, such as the crypt() function performs. If you pick up the cookie for site configuration info, this is sufficiently unimportant so that no further verification of the user's ID is required. That's okay, and that's convenient.
But instead of picking up the cookie for the user's ID for posting under his nickname, you should require a login clear-text password. That way copying the cookie won't facilitate a stolen ID. You get the best of both worlds -- convenience and sufficient security. Right now, you get zero security because Slashdot is lazy.
The problem is with cookies. Destroy the reputation of
cookies and you put a major dent in client-side persistence,
as well as in overcommercialization on the Net. By the way,
Slashdot has chocolate-stained fingers on this issue too:
How to post to Slashdot under your boss's Slashdot name,
and at the same time find out if he surfs porn sites.
1) Your boss must use Explorer 4.0 or later on a Win32
platform, and use an HTML-enabled email client, and
have JavaScript enabled. This is the default for perhaps
70 percent of the corporate environment. Better yet,
if your boss uses Outlook Express, all he has to do
is preview your innocent-looking email message. He may
not notice anything happening. Or, you may get exposed
and fired. It's your problem.
2) Go to http://www.pir.org/nocookie.html and read up on
this exploit. Near the bottom of the page, enter YOUR
email address for this exploit. But first read up on
some more details on the next page, under "How to
steal your boss's New York Times password and find
out if he surfs porn sites." Mentally substitute
Slashdot where it reads NYT on this page, except that
Slashdot has NOT bothered to improve their password
system, while the NYT at least started doing this
one week after the Explorer vulnerability was
discovered. (Slashdot uses a double URL-encoding
scheme. But this demonstration appears to decode
several times if necessary, merely to get the QUERY_STRING
back to something approximating the cookie that it
started with. Therefore, your Slashdot password ends
up in plain text in this demonstration. Slashdot cannot
claim to be using encryption of any sort.)
3) After you enter YOUR email (because YOU will receive
your boss's cookie report), click on the domains you
want. Slashdot uses at least two, www.slashdot.org
and slashdot.org, so be sure to click both to get both.
Add a few porn sites; hitbox.com is always a good bet.
4) You get the spam from the demonstration, which you
must receive on an email client that is NOT automatically
enabled for HTML. Then you paste the JavaScript code into
another email for your boss and send it. As soon as your
boss previews this email (there is NO attachment), you
will get emailed a report on the cookies you clicked.
Your boss probably won't notice anything happened, as
he ignores yet another astute missive from you.
5) If your boss changes his Slashdot password after
discovering that someone posted in his name, you have
to do this again to get his new password. And again
and again, until a) Slashdot starts encrypting their
cookies, or b) you get discovered and fired. It's
an even bet; it's been almost a year since someone
complained about Slashdot's lack of cookie encryption,
and they still haven't addressed the issue.
6) Microsoft is the baddest boy. They've been aware of
the security problems of making email clients automatically
enabled for receiving HTML since spring, 1996. Yet so
far there has been no indication that they plan to make
this feature switchable, and switched OFF by default.
The materialist doctrine concerning the changing of
circumstances and upbringing forgets that circumstances
are changed by men and that it is essential to educate the
educator himself. This doctrine must, therefore, divide
society into two parts, one of which is superior to society.
The coincidence of the changing of circumstances and of human
activity or self-changing can be conceived and rationally
understood only as _revolutionary practice_.
-- 3rd Theses on Feuerbach