K-Man · Slashdot Mirror

Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · 2004-01-13 16:09 · Score: 3, Interesting

Let's see:

Gen3r@c v|agar@
Gener@c v|agar@
Generic v|agar@
Generic viagar@
Generic viagr@
Generic viagra

That's an edit distance of 5, which is pretty large, but still findable with a little approximate matching, especially if the matching is weighted to recognize the similarity between @ and a, or i and |.
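To make the weighting concrete, here is a minimal C sketch of a weighted edit distance. The cost values (2 for a look-alike substitution, 10 for any other edit), the function names, and the 3/e pair are assumptions for illustration; only the @/a and |/i pairs come from the comment above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical look-alike pairs; "3e" is an added assumption. */
static int looks_alike( char x, char y )
{
    static const char *pairs[] = { "@a", "|i", "3e" };
    for( size_t k = 0; k < sizeof(pairs) / sizeof(pairs[0]); k++ )
        if( (x == pairs[k][0] && y == pairs[k][1]) ||
            (x == pairs[k][1] && y == pairs[k][0]) )
            return 1;
    return 0;
}

/* Assumed costs: 0 for a match, 2 for a look-alike, 10 otherwise. */
static int sub_cost( char x, char y )
{
    if( x == y )
        return 0;
    return looks_alike( x, y ) ? 2 : 10;
}

static int min3( int a, int b, int c )
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Weighted edit distance between a and b; returns -1 if allocation fails. */
int weighted_distance( const char *a, const char *b )
{
    size_t la = strlen( a ), lb = strlen( b );
    int *prev = malloc( (lb + 1) * sizeof(*prev) );
    int *curr = malloc( (lb + 1) * sizeof(*curr) );
    if( prev == NULL || curr == NULL )
    {
        free( prev );
        free( curr );
        return -1;
    }

    /* Row 0: turning an empty prefix of a into b[0..j) takes j insertions. */
    for( size_t j = 0; j <= lb; j++ )
        prev[j] = (int)j * 10;

    for( size_t i = 1; i <= la; i++ )
    {
        curr[0] = (int)i * 10;
        for( size_t j = 1; j <= lb; j++ )
            curr[j] = min3( prev[j] + 10,      /* delete a[i-1] */
                            curr[j-1] + 10,    /* insert b[j-1] */
                            prev[j-1] + sub_cost( a[i-1], b[j-1] ) );
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int result = prev[lb];
    free( prev );
    free( curr );
    return result;
}
```

With these assumed costs, weighted_distance("Gen3r@c v|agar@", "Generic viagra") comes to 26, against 50 if all five edits were charged the full 10, so a threshold tight enough to reject unrelated words can still catch the disguised one.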

Most spam contains repeated phrases 40+ characters long. The mistake is to use word-counting techniques which ignore phraseology.

For instance, here are some phrases from spam, circa one year ago:

Please fill out the form below for more information
To unsubscribe
To remove your
in the Marshall Islands
Please allow 48-72 hours for removal
to this email with REMOVE in the
the Northern Ratak
the information
thousands of dollars
that you will
this list, please
this advertisement
this email in error
this message, you may email our
this transaction
of thousands of
of EnenKio and
of Eneen-Kio Atoll
of His Majesty
our mailing list
out 5,000 e-mails each for a
opportunity to make

Re:Digital watch design on NASA Scientists Get Custom 24h39m-per-day Watches · 2004-01-13 09:03 · Score: 1

Or, as they say at NASA, the dilithium crystals canna take much more, Captain!

Re:What about Existing Data? on IBM vs. Content Chaos · 2004-01-12 12:09 · Score: 1

Better yet, a mental image. Can't wait for the IBM brochure.

Re:One Net to Rule Them All on IBM vs. Content Chaos · 2004-01-12 09:45 · Score: 1

According to one talk I went to, Google uses approximate hashing to find duplicates (easy), and near-duplicates (hard). They may not be using the best methods, but even if they were, I suspect it would be difficult to find all the duplicate pages.

Maybe if they looked for duplicate contexts on each search it would cover a lot of the problem.

Re:Managers taking hostages? on The Walking Dead of Silicon Valley · 2004-01-08 09:53 · Score: 1

For the same reason that some companies get valued below their cash assets - here's an example of one negotiating session:

Negotiator: OK, Bob, you put down the sock puppet. That was a big step. Don't you feel a little better now?
Bob: Yes, but I have to...stickiness...
Negotiator: Focus, Bob, you know that pay-per-click deal won't work; we already discussed that right? So what I need you to do now is put down the checkbook. You can still have it near you, but just put it on the floor.
Bob: Um...B2B...Unghhh...No!
Negotiator: Look, Bob, there's no way out of this. The acquisition fell through, the stock is delisted, your IP consists of a domain name and a cookie recipe. The best thing you can do now is put the checkbook down. It'll be so much easier if you put the checkbook down. Now, Bob.
Bob: I'm so tired...so tired...I used to be ...Wired.
Negotiator: I know Bob, we're all tired. It's natural, we all make mistakes.
Bob: I have an idea. A reverse split. I always have big ideas!
Negotiator: That's right Lenny, I mean Bob. You always have big ideas.

(from "Of Pointing Devices and Men", one of Steinbeck's lesser-known works)

Pure Irony on Tech Firms Defend Moving Jobs Overseas · 2004-01-08 08:55 · Score: 1

I love the smell of irony in the morning. Before breakfast, I actually sat down and wrote this letter to the editor:

Dear Editor,

I cannot, in all honesty, deny others in third-world nations the chance to compete for jobs with US workers. Outsourcing is a vital part of our tech industry future.

However, I note no enthusiasm on the part of Silicon Valley companies to outsource their most expensive, and often least productive, workers: their presidents, CEOs and senior management.

In fact it is these same people who are calling, not for more competitive replacements for their own jobs, but for more and cheaper engineers for their tech mills. Why is that, I wonder?

Regards,

Re:Holy cow on Tech Firms Defend Moving Jobs Overseas · 2004-01-08 08:48 · Score: 1

No, it was one of those guys standing around in the Home Depot parking lot.

That was me on Better Search Results Than Google? · 2004-01-05 12:43 · Score: 1

And I can't find it either.

I'm still plugging away. Check out my article in Dr. Dobb's in December '03. It's mostly about the core indexing technique, and I haven't gotten around to putting a regex module on top of it. Right now it'll find any substring in the source file, and it's fairly straightforward to expand that to regexps - I just need a module that expands a regexp into a bunch of strings, more or less.

There are a number of applications for the technology; I'm just too broke to work on them much.

Re:FYI - try CiteSeer instead of Google on Great Computer Science Papers? · 2003-11-16 07:15 · Score: 1

Now that you mention it, Google supposedly hired the search architect from CiteSeer, who was discouraged because Google kept getting better results. As I heard it, CiteSeer is great at finding the links (citations) between papers, but Google has eclipsed them in the link analysis and horsepower departments.

Re:Don't read the originals on Great Computer Science Papers? · 2003-11-16 07:06 · Score: 3, Funny

Harlequin romance novels express the same ideas in much easier to read language.

They also have romantic, swooning sex by page 70.

Are you listening, Don Knuth?

Re:But does anyone use them? on Microsoft Looks At Other Search Engines · 2003-11-04 16:20 · Score: 1

FYI, they keep everything from their crawl in an archive; it's not really a cache. They do duplicate detection on pages, and something like 30% of them are dupes. The link farms are most likely not duplicates; it's easy enough to scramble each copy to avoid detection.

Shouldn't it be... on Warfare at the Speed of Light · 2003-10-20 11:41 · Score: 1

Warfare at the Speed of Thought?

Re:Gravity-free? on Warfare at the Speed of Light · 2003-10-20 10:40 · Score: 1

They used only standard quantum mechanics to design the weapon.

So, The Philosophical Question Is on Baffling the Spam Bots · 2003-10-19 19:44 · Score: 0

If a computer became sentient and developed the ability to read these images, would lawyers argue for its right to exist?

Hold on Clippy! on AI Sues for Its Life in Mock Trial · 2003-10-19 16:37 · Score: 2, Funny

We'll get a Spielberg movie out of this yet.

It was MSN - easy money on Slashback: Lamo, Trilogy, Searching · 2003-10-08 16:29 · Score: 1

The MSN deal paid something like $14 million up front, and they got MSN to look the other way while LS sold slots in their directory feed. It was slimy, but profitable, and for LS, this was all they needed. No other search engine made a similar deal with LS - even those that used the LS directory got it for almost nothing up-front, and only a share of the clickthrough revenue.

Unfortunately even a monopoly like MS had to wake up someday, and realize that the revenue can be theirs alone, for the cost of a few developers and a billing system, and the directory wasn't much value anyway, so the whole deal went in the dumper.

For LS, there wasn't much alternative, as their management never had much technical leadership ability, and they mainly focused on what they were good at, which was talking companies into having unprofitable joint ventures with them (eg British Telecom).

LS tried to get some technical acumen by laying off their developers and buying other companies, but this strategy never led to any good technology, to put it mildly.

Re:This is how to do it on SGI Compares Linux & System V Source Code · 2003-10-06 09:32 · Score: 1

Actually, I just RTFM'd comparator, and it's a fairly weak signature-based algorithm which examines (only) 3-line, overlapping chunks. An extra space every three lines would defeat the default version, although there is a switch to ignore whitespace. Still, an extra semicolon, or a few tweaked comments or variable names would fool it.

A simple blocksort of the code would find the longest common substrings, without the reliance on 3-line units that comparator has. By "simple", I mean:

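#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort passes pointers to the array elements, so strcmp needs a wrapper. */
static int suffix_cmp( const void *a, const void *b )
{
    return strcmp( *(char * const *)a, *(char * const *)b );
}

/* Length of the common prefix of two NUL-terminated strings. */
static int lcp_len( const char *a, const char *b )
{
    int n = 0;
    while( a[n] && a[n] == b[n] )
        n++;
    return n;
}

/* text must be NUL-terminated, with textsize == strlen(text). */
void blocksort( char *text, int textsize )
{
    char **blocks = textsize > 0 ? malloc( textsize * sizeof(*blocks) ) : NULL;
    if( blocks == NULL )
        return;

    /* One pointer per suffix of the text. */
    for( int i = 0; i < textsize; i++ )
        blocks[i] = text + i;

    qsort( blocks, textsize, sizeof(*blocks), suffix_cmp );

    int maxpos = 0;
    int maxlcp = 0;

    /* The longest repeat is a common prefix of two adjacent sorted suffixes. */
    for( int i = 0; i < textsize-1; i++ )
    {
        int lcp = lcp_len( blocks[i], blocks[i+1] );
        if( lcp > maxlcp )
        {
            maxlcp = lcp;
            maxpos = i;
        }
    }

    printf( "A maximal repeated substring is at position %d and length %d\n", (int)(blocks[maxpos]-text), maxlcp );
    free( blocks );
}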

This structure (a suffix array) can also be used to find if a repeated substring occurs in both of the source trees, or just one, and to list all repeats beyond a certain size, and so on.
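As a rough sketch of the "occurs in both source trees" case, one option is to join the two texts with a separator byte, sort the suffixes of the joined text, and only look at adjacent suffixes that start in different halves. The function name longest_shared, the '\x01' separator, and the assumption that neither input contains that byte are all illustrative choices, not anything taken from comparator.

```c
#include <stdlib.h>
#include <string.h>

static int suffix_cmp( const void *a, const void *b )
{
    return strcmp( *(const char * const *)a, *(const char * const *)b );
}

/*
 * Finds the longest substring present in both a and b.
 * Stores its length in *len and returns its offset within a,
 * or -1 if allocation fails. Assumes neither input contains '\x01'.
 */
long longest_shared( const char *a, const char *b, size_t *len )
{
    size_t la = strlen( a ), lb = strlen( b ), n = la + 1 + lb;
    char *joined = malloc( n + 1 );
    const char **suffixes = malloc( n * sizeof(*suffixes) );
    long best_pos = 0;

    *len = 0;
    if( joined == NULL || suffixes == NULL )
    {
        free( joined );
        free( suffixes );
        return -1;
    }

    /* joined = a + separator + b, NUL-terminated. */
    memcpy( joined, a, la );
    joined[la] = '\x01';
    memcpy( joined + la + 1, b, lb + 1 );

    for( size_t i = 0; i < n; i++ )
        suffixes[i] = joined + i;
    qsort( suffixes, n, sizeof(*suffixes), suffix_cmp );

    for( size_t i = 0; i + 1 < n; i++ )
    {
        const char *s = suffixes[i], *t = suffixes[i+1];
        int s_in_a = s < joined + la;
        int t_in_a = t < joined + la;

        /* Skip the separator's own suffix and pairs from the same source. */
        if( s == joined + la || t == joined + la || s_in_a == t_in_a )
            continue;

        /* Common prefix length; it cannot run across the separator. */
        size_t k = 0;
        while( s[k] && s[k] == t[k] && s[k] != '\x01' )
            k++;

        if( k > *len )
        {
            *len = k;
            best_pos = (long)( (s_in_a ? s : t) - joined );
        }
    }

    free( suffixes );
    free( joined );
    return best_pos;
}
```

Like the blocksort above, this sorts whole suffixes with qsort, so it is fine for a demonstration but slow on large, repetitive inputs. Listing every shared repeat beyond a given size would be the same loop with a length threshold in place of the running maximum.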

This stuff is great on Hydrophilic Powder Used To Save Library Books · 2003-10-01 13:24 · Score: 2, Informative

I got a 5-lb bag from watersorb.com, and it has to be one of the most amusing substances I've seen in a while.

A handful of the stuff will turn a gallon of water into gel almost instantly, and it has a shimmering, translucent appearance.

Good for cleaning up toilet overflows, too.

Re:They need to do better than their own site on Amazon to Take on Google? · 2003-09-26 15:58 · Score: 1

That's a bummer. I was talking a few months ago to a company about doing approximate matching on media titles. This is kind of like exact matching, but with a few errors allowed, eg

"Lord of the Rings"
"The Lord of the Rings"

"Tolkein"
"J.R. Tolkein"
"JRR Tolken"

etc.

The deal fell through, but it's good to hear there's a need for this type of thing.

Other fun Google recruiting methods on Google Code Jam 2003 Announced · 2003-09-17 05:39 · Score: 2, Interesting

I've noticed they also bought some of their own keywords pertaining to certain CS topics, eg:

Inverted Index
Page Rank
Suffix Array

They've even put some individual researchers' names in for sponsored links:
Udi Manber
Gene Myers

It's interesting, as they seem to have some things but not others. The suffix array stuff is useful for full-text indexing, which they may be interested in, but they don't flag searches on more recent topics in the field.

And next... on Google Code Jam 2003 Announced · 2003-09-17 05:12 · Score: 2, Funny

Google Temptation Island!

Re:dupe on Turing Award Winner On The Future of Storage · 2003-09-17 04:56 · Score: 1

Well, yes, but, according to Moore's law, we now have 36% more storage to store the dupe.

A Searchable File System on How Do You Organize Your Data? · 2003-09-02 16:18 · Score: 1

All the email, etc. goes into a file partition which combines indexing and primary storage.

Data can be searched for any string, in a few milliseconds, without the delays of scanning with grep or other search tools. Applications like statistical filters can get, eg, counts of a given string very quickly, or match incoming email against stored messages to identify the correct classification.

Actually, this doesn't exist yet, but I was working on some indexing algorithms a while ago and realized that it's feasible.
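A rough sketch of the string-count lookup, assuming a plain sorted suffix array held in memory rather than whatever combined index and storage layout such a partition would really use. The names build_index, bound and count_occurrences are hypothetical. Building the array this way is slow; only the lookup is meant to suggest the speed, since it is two binary searches over the sorted suffixes.

```c
#include <stdlib.h>
#include <string.h>

static int suffix_cmp( const void *a, const void *b )
{
    return strcmp( *(const char * const *)a, *(const char * const *)b );
}

/* Returns a sorted array of pointers to every suffix of text (caller frees). */
const char **build_index( const char *text, size_t n )
{
    const char **sa = malloc( (n ? n : 1) * sizeof(*sa) );
    if( sa == NULL )
        return NULL;
    for( size_t i = 0; i < n; i++ )
        sa[i] = text + i;
    qsort( sa, n, sizeof(*sa), suffix_cmp );
    return sa;
}

/*
 * Index of the first suffix whose first m characters compare
 * >= pat (upper == 0) or > pat (upper == 1).
 */
static size_t bound( const char **sa, size_t n, const char *pat, size_t m, int upper )
{
    size_t lo = 0, hi = n;
    while( lo < hi )
    {
        size_t mid = lo + (hi - lo) / 2;
        int c = strncmp( sa[mid], pat, m );
        if( c < 0 || (upper && c == 0) )
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Number of positions in the indexed text where pat occurs. */
size_t count_occurrences( const char **sa, size_t n, const char *pat )
{
    size_t m = strlen( pat );
    return bound( sa, n, pat, m, 1 ) - bound( sa, n, pat, m, 0 );
}
```

All suffixes that begin with the pattern sit next to each other in sorted order, so the count is just the width of that range; a statistical filter could ask for the count of any phrase without rescanning the mail.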

My Challenge to the World on Build Your Own Lava Lamp · 2003-08-29 18:21 · Score: 1

An outdoor lava lamp.

I have some ideas, like taking a regular lamp and putting it inside a glass lantern/box, but there are probably some problems with getting the temperature right.

Ah, yes, San Jose on Profile of An Internet Bookie · 2003-08-18 04:06 · Score: 4, Funny

The newly remodeled airport is surrounded by chain hotels, freshly paved roads and shiny corporate plazas. After that it goes rapidly downhill.

Yes, that's San Jose in a nutshell!

Wait...Costa Rica?

User: K-Man

Comments · 495