For the same reason that some companies get valued below their cash assets - here's an example of one negotiating session:
Negotiator: OK, Bob, you put down the sock puppet. That was a big step. Don't you feel a little better now?
Bob: Yes, but I have to...stickiness...
Negotiator: Focus, Bob, you know that pay-per-click deal won't work; we already discussed that, right? So what I need you to do now is put down the checkbook. You can still have it near you, but just put it on the floor.
Bob: Um...B2B...Unghhh...No!
Negotiator: Look, Bob, there's no way out of this. The acquisition fell through, the stock is delisted, your IP consists of a domain name and a cookie recipe. The best thing you can do now is put the checkbook down. It'll be so much easier if you put the checkbook down. Now, Bob.
Bob: I'm so tired...so tired...I used to be...Wired.
Negotiator: I know, Bob, we're all tired. It's natural, we all make mistakes.
Bob: I have an idea. A reverse split. I always have big ideas!
Negotiator: That's right, Lenny, I mean Bob. You always have big ideas.
(from "Of Pointing Devices and Men", one of Steinbeck's lesser-known works)
Let's see:
Gen3r@c v|agar@
Gener@c v|agar@
Generic v|agar@
Generic viagar@
Generic viagr@
Generic viagra
That's an edit distance of 5, pretty large, but still findable with a little approximate matching, especially if it's weighted, to recognize the similarity between @ and a, or i and |.
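Here is a minimal sketch of what a weighted match could look like. Everything specific in it is an assumption made up for illustration: the `LOOKALIKES` table, the cost scale (tenths of an edit, with look-alike swaps costing 1 and everything else 10), and the function names. It is not taken from any real filter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of look-alike pairs; the entries are illustrative only. */
static const char *LOOKALIKES[] = { "@a", "|i", "3e", "0o", "1l" };

/* Substitution cost in tenths of an edit:
   0 for equal characters, 1 for look-alikes, 10 otherwise. */
static int sub_cost( char x, char y )
{
    if( x == y ) return 0;
    for( size_t k = 0; k < sizeof(LOOKALIKES)/sizeof(*LOOKALIKES); k++ )
    {
        const char *p = LOOKALIKES[k];
        if( (x == p[0] && y == p[1]) || (x == p[1] && y == p[0]) )
            return 1;
    }
    return 10;
}

/* Weighted Levenshtein distance in tenths of an edit; an insertion or
   deletion costs 10. Returns -1 if memory cannot be allocated. */
int weighted_distance( const char *a, const char *b )
{
    assert( a && b );
    size_t la = strlen( a ), lb = strlen( b );
    int *prev = malloc( (lb + 1) * sizeof(*prev) );
    int *cur  = malloc( (lb + 1) * sizeof(*cur) );
    if( !prev || !cur ) { free( prev ); free( cur ); return -1; }

    for( size_t j = 0; j <= lb; j++ )
        prev[j] = (int)j * 10;              /* build b[0..j) from nothing */
    for( size_t i = 1; i <= la; i++ )
    {
        cur[0] = (int)i * 10;               /* delete all of a[0..i) */
        for( size_t j = 1; j <= lb; j++ )
        {
            int best = prev[j-1] + sub_cost( a[i-1], b[j-1] );
            if( prev[j] + 10 < best ) best = prev[j] + 10;    /* delete a[i-1] */
            if( cur[j-1] + 10 < best ) best = cur[j-1] + 10;  /* insert b[j-1] */
            cur[j] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;
    }
    int result = prev[lb];
    free( prev ); free( cur );
    return result;
}
```

With these made-up weights, `weighted_distance("Gen3r@c v|agar@", "Generic viagra")` comes out at 23, i.e. 2.3 edits rather than 5: the 3/e, |/i and @/a swaps are nearly free, while the @ standing in for i and the extra a still cost a full edit each.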
Most spam contains repeated phrases 40+ characters long. The mistake is to use word-counting techniques which ignore phraseology.
For instance, here are some phrases from spam, circa one year ago:
Please fill out the form below for more information
To unsubscribe
To remove your
in the Marshall Islands
Please allow 48-72 hours for removal
to this email with REMOVE in the
the Northern Ratak
the information
thousands of dollars
that you will
this list, please
this advertisement
this email in error
this message, you may email our
this transaction
of thousands of
of EnenKio and
of Eneen-Kio Atoll
of His Majesty
our mailing list
out 5,000 e-mails each for a
opportunity to make
Or, as they say at NASA, the dilithium crystals canna take much more, Captain!
According to one talk I went to, Google uses approximate hashing to find duplicates (easy), and near-duplicates (hard). They may not be using the best methods, but even if they were, I suspect it would be difficult to find all the duplicate pages.
Maybe if they looked for duplicate contexts on each search it would cover a lot of the problem.
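I don't know which method Google actually uses. As an illustration of the general idea of approximate hashing, here is a rough sketch of simhash, one well-known technique, swapped in purely as an example. The word-level tokenizing, the FNV-1a hash and the function names are my assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 64-bit hash of one word of length len. */
static uint64_t fnv1a( const char *s, size_t len )
{
    uint64_t h = 14695981039346656037ULL;
    for( size_t i = 0; i < len; i++ )
    {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* 64-bit simhash over whitespace-separated words: every word votes
   +1 or -1 on each bit, and the fingerprint keeps the majority. */
uint64_t simhash( const char *text )
{
    assert( text );
    int votes[64] = { 0 };
    const char *p = text;
    while( *p )
    {
        while( *p && isspace( (unsigned char)*p ) ) p++;
        const char *start = p;
        while( *p && !isspace( (unsigned char)*p ) ) p++;
        if( p == start ) break;             /* only trailing whitespace left */
        uint64_t h = fnv1a( start, (size_t)(p - start) );
        for( int b = 0; b < 64; b++ )
            votes[b] += ((h >> b) & 1) ? 1 : -1;
    }
    uint64_t fp = 0;
    for( int b = 0; b < 64; b++ )
        if( votes[b] > 0 ) fp |= (uint64_t)1 << b;
    return fp;
}

/* Number of differing bits between two fingerprints. */
int hamming( uint64_t a, uint64_t b )
{
    uint64_t x = a ^ b;
    int n = 0;
    while( x ) { x &= x - 1; n++; }
    return n;
}
```

Two pages with the same words give a distance of 0, and pages that differ in only a few words tend to land within a few bits of each other. The cutoff for "near-duplicate" is something you would have to tune, and a scrambled copy can still slip past it.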
I love the smell of irony in the morning. Before breakfast, I actually sat down and wrote this letter to the editor:
Dear Editor,
I cannot, in all honesty, deny others in third-world nations the chance to compete for jobs with US workers. Outsourcing is a vital part of our tech industry future.
However, I note no enthusiasm on the part of Silicon Valley companies to outsource their most expensive, and often least productive, workers: their presidents, CEOs and senior management.
In fact it is these same people who are calling, not for more competitive replacements for their own jobs, but for more and cheaper engineers for their tech mills. Why is that, I wonder?
Regards,
No, it was one of those guys standing around in the Home Depot parking lot.
And I can't find it either.
I'm still plugging away. Check out my article in Dr. Dobb's in December '03. It's mostly about the core indexing technique, and I haven't gotten around to putting a regex module on top of it. Right now it'll find any substring in the source file, and it's fairly straightforward to expand that to regexps - I just need a module that expands a regexp into a bunch of strings, more or less.
There are a number of applications for the technology; I'm just too broke to work on them much.
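To make the "expand a regexp into a bunch of strings" step concrete, here is a rough sketch that handles only literals and simple [abc] character classes (no ranges, repetition or escapes). Each expanded string would then be looked up as a plain substring. The names `expand_pattern`, `emit_fn` and `count_emit` are hypothetical, and this is not the actual module.

```c
#include <assert.h>
#include <string.h>

#define MAX_OUT 256

/* Hypothetical callback type: receives each expanded string in turn. */
typedef void (*emit_fn)( const char *s, void *ctx );

/* Expand the rest of the pattern, with out[0..len) already filled in.
   Returns 0 on success, -1 on a malformed or over-long pattern. */
static int expand( const char *pat, char *out, size_t len, emit_fn emit, void *ctx )
{
    if( *pat == '\0' )
    {
        out[len] = '\0';
        emit( out, ctx );
        return 0;
    }
    if( len + 1 >= MAX_OUT ) return -1;     /* keep room for the terminator */
    if( *pat == '[' )
    {
        const char *close = strchr( pat + 1, ']' );
        if( !close || close == pat + 1 ) return -1;   /* unclosed or empty class */
        for( const char *c = pat + 1; c < close; c++ )
        {
            out[len] = *c;                  /* try each member of the class */
            if( expand( close + 1, out, len + 1, emit, ctx ) != 0 )
                return -1;
        }
        return 0;
    }
    out[len] = *pat;                        /* ordinary literal character */
    return expand( pat + 1, out, len + 1, emit, ctx );
}

/* Call emit once for every string the pattern can match. */
int expand_pattern( const char *pat, emit_fn emit, void *ctx )
{
    assert( pat && emit );
    char out[MAX_OUT];
    return expand( pat, out, 0, emit, ctx );
}

/* Example callback: count the expansions; ctx points at an int. */
void count_emit( const char *s, void *ctx )
{
    (void)s;
    ++*(int *)ctx;
}
```

A pattern like `v[i|]agr[a@]` expands to four strings. The number of strings grows multiplicatively with each class, so a real module would need some limit on the expansion, or a smarter walk of the index.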
Now that you mention it, Google supposedly hired the search architect from CiteSeer, who was discouraged because Google kept getting better results. As I heard it, CiteSeer is great at finding the links (citations) between papers, but Google has eclipsed them in the link analysis and horsepower departments.
They also have romantic, swooning sex by page 70.
Are you listening, Don Knuth?
FYI, they keep everything from their crawl in an archive; it's not really a cache. They do duplicate detection on pages, and something like 30% of them are dupes. The link farms are most likely not duplicates; it's easy enough to scramble each copy to avoid detection.
Warfare at the Speed of Thought?
They used only standard quantum mechanics to design the weapon.
If a computer became sentient and developed the ability to read these images, would lawyers argue for its right to exist?
We'll get a Spielberg movie out of this yet.
The MSN deal paid something like $14 million up front, and they got MSN to look the other way while LS sold slots in their directory feed. It was slimy, but profitable, and for LS, this was all they needed. No other search engine made a similar deal with LS - even those that used the LS directory got it for almost nothing up-front, and only a share of the clickthrough revenue.
Unfortunately even a monopoly like MS had to wake up someday, and realize that the revenue can be theirs alone, for the cost of a few developers and a billing system, and the directory wasn't much value anyway, so the whole deal went in the dumper.
For LS, there wasn't much alternative, as their management never had much technical leadership ability, and they mainly focused on what they were good at, which was talking companies into having unprofitable joint ventures with them (e.g. British Telecom).
LS tried to get some technical acumen by laying off their developers and buying other companies, but this strategy never led to any good technology, to put it mildly.
Actually, I just RTFM'd comparator, and it's a fairly weak signature-based algorithm which examines (only) 3-line, overlapping chunks. An extra space every three lines would defeat the default version, although there is a switch to ignore whitespace. Still, an extra semicolon, or a few tweaked comments or variable names would fool it.
A simple blocksort of the code would find the longest common substrings, without the reliance on 3-line units that comparator has. By "simple", I mean:
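#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort hands the comparator pointers to the array elements (char **) */
static int cmp_suffix( const void *a, const void *b )
{
    return strcmp( *(char * const *)a, *(char * const *)b );
}

/* text must be NUL-terminated; textsize excludes the terminator */
void blocksort( char * text, int textsize)
{
    if( textsize < 1 ) return;
    char **blocks = malloc( textsize * sizeof(*blocks) );
    if( !blocks ) return;
    for( int i = 0; i < textsize; i++ )
        blocks[i] = text + i;
    qsort( blocks, textsize, sizeof(*blocks), cmp_suffix );
    int maxpos = 0;
    int maxlcp = 0;
    for( int i = 0; i < textsize-1; i++ )
    {
        /* length of the prefix shared by two adjacent sorted suffixes */
        int lcp = 0;
        while( blocks[i][lcp] && blocks[i][lcp] == blocks[i+1][lcp] )
            lcp++;
        if( lcp > maxlcp )
        {
            maxlcp = lcp;
            maxpos = i;
        }
    }
    printf( "A maximal repeated substring is at position %d and length %d\n", (int)(blocks[maxpos]-text), maxlcp );
    free( blocks );
}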
This structure (a suffix array) can also be used to find if a repeated substring occurs in both of the source trees, or just one, and to list all repeats beyond a certain size, and so on.
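For the "occurs in both source trees" case, one way to sketch it is to join the two texts with a separator byte, sort the suffixes of the joined text, and only compare neighbours that come from different sides. The function name `longest_shared`, the `'\x01'` separator and the assumption that neither input contains that byte are mine, for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: elements are pointers to NUL-terminated suffixes. */
static int cmp_suffix( const void *a, const void *b )
{
    return strcmp( *(const char * const *)a, *(const char * const *)b );
}

/* Length of the common prefix of two NUL-terminated strings. */
static size_t common_prefix( const char *a, const char *b )
{
    size_t n = 0;
    while( a[n] && a[n] == b[n] ) n++;
    return n;
}

/* Longest substring present in both a and b. Stores its offset within a
   in *pos_a and returns its length (0 if nothing is shared, or if memory
   cannot be allocated). Assumes neither input contains the byte '\x01'. */
size_t longest_shared( const char *a, const char *b, size_t *pos_a )
{
    assert( a && b && pos_a );
    size_t la = strlen( a ), lb = strlen( b ), n = la + 1 + lb;
    char *text = malloc( n + 1 );
    const char **sa = malloc( n * sizeof(*sa) );
    size_t best = 0;
    *pos_a = 0;
    if( text && sa )
    {
        memcpy( text, a, la );
        text[la] = '\x01';                  /* separator between the two texts */
        memcpy( text + la + 1, b, lb + 1 ); /* copies b's terminator too */
        for( size_t i = 0; i < n; i++ )
            sa[i] = text + i;
        qsort( sa, n, sizeof(*sa), cmp_suffix );
        for( size_t i = 0; i + 1 < n; i++ )
        {
            size_t p = (size_t)(sa[i] - text), q = (size_t)(sa[i+1] - text);
            if( p == la || q == la ) continue;     /* the separator's own suffix */
            if( (p < la) == (q < la) ) continue;   /* both from the same text */
            size_t lcp = common_prefix( sa[i], sa[i+1] );
            if( lcp > best )
            {
                best = lcp;
                *pos_a = p < la ? p : q;
            }
        }
    }
    free( text );
    free( sa );
    return best;
}
```

For example, `longest_shared("xxabcdeyy", "zzabcdezz", &pos)` returns 5 with `pos` set to 2. Sorting with `qsort` and `strcmp` is fine for a demo but degrades badly on long repeats; a proper suffix array construction would be needed for whole source trees.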
A handful of the stuff will turn a gallon of water into gel almost instantly, and it has a shimmering, translucent appearance.
Good for cleaning up toilet overflows, too.
That's a bummer. I was talking a few months ago to a company about doing approximate matching on media titles. This is kind of like exact matching, but with a few errors allowed, e.g.
"Lord of the Rings"
"The Lord of the Rings"
"Tolkein"
"J.R. Tolkein"
"JRR Tolken"
etc.
The deal fell through, but it's good to hear there's a need for this type of thing.
Inverted Index
Page Rank
Suffix Array
They've even put some individual researchers' names in for sponsored links:
Udi Manber
Gene Myers
It's interesting, as they seem to have some things but not others. The suffix array stuff is useful for full-text indexing, which they may be interested in, but they don't flag searches on more recent topics in the field.
Google Temptation Island!
Well, yes, but, according to Moore's law, we now have 36% more storage to store the dupe.
All the email, etc. goes into a file partition which combines indexing and primary storage.
Data can be searched for any string, in a few milliseconds, without the delays of scanning with grep or other search tools. Applications like statistical filters can get, e.g., counts of a given string very quickly, or match incoming email against stored messages to identify the correct classification.
Actually, this doesn't exist yet, but I was working on some indexing algorithms a while ago and realized that it's feasible.
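To show why counts of a given string can be fast, here is a small in-memory sketch using a sorted suffix array and two binary searches. It is only an illustration: the `text_index` type and the `index_*` function names are hypothetical, and an on-disk partition that combines indexing and storage would look quite different.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index type: the text plus pointers to its sorted suffixes. */
typedef struct {
    const char  *text;
    const char **suffixes;
    size_t       count;
} text_index;

static int cmp_suffix( const void *a, const void *b )
{
    return strcmp( *(const char * const *)a, *(const char * const *)b );
}

/* Build the index; returns 0 on success, -1 on allocation failure.
   The text must stay alive while the index is in use. */
int index_build( text_index *ix, const char *text )
{
    assert( ix && text );
    ix->text = text;
    ix->count = strlen( text );
    ix->suffixes = malloc( (ix->count ? ix->count : 1) * sizeof(*ix->suffixes) );
    if( !ix->suffixes ) return -1;
    for( size_t i = 0; i < ix->count; i++ )
        ix->suffixes[i] = text + i;
    qsort( ix->suffixes, ix->count, sizeof(*ix->suffixes), cmp_suffix );
    return 0;
}

/* First position in the sorted suffixes whose len-byte prefix is
   >= key (upper == 0) or > key (upper != 0). */
static size_t bound( const text_index *ix, const char *key, size_t len, int upper )
{
    size_t lo = 0, hi = ix->count;
    while( lo < hi )
    {
        size_t mid = lo + (hi - lo) / 2;
        int c = strncmp( ix->suffixes[mid], key, len );
        if( c < 0 || (upper && c == 0) ) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Number of (possibly overlapping) occurrences of key in the text. */
size_t index_count( const text_index *ix, const char *key )
{
    size_t len = strlen( key );
    if( len == 0 ) return 0;
    return bound( ix, key, len, 1 ) - bound( ix, key, len, 0 );
}

void index_free( text_index *ix )
{
    free( ix->suffixes );
    ix->suffixes = NULL;
    ix->count = 0;
}
```

Once the index is built, each count takes a number of string comparisons proportional to the logarithm of the text size, regardless of how many times the string occurs. Building the index with `qsort` is the slow part in this sketch.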
An outdoor lava lamp.
I have some ideas, like taking a regular lamp and putting it inside a glass lantern/box, but there are probably some problems with getting the temperature right.
Wait...Costa Rica?