Agreed. Emailchemy was the only program I found that was able to convert my 2GB Outlook Express archive to other formats. (I happened to move to mail.app on the Mac, but Emailchemy supports many other output formats, including pure mbox.) By comparison, Thunderbird choked on importing this archive.
I actually used a beta version of Emailchemy that retained the hierarchical structure of my approximately 3000 folders. This was the biggest sticking point with most tools I looked at.
I worked with Bob Hall at Bell Labs and AT&T Labs. He's had a long interest in helping people manage their personal email, as evidenced by his other research papers and patents. He's one of the good guys.
Though I don't have any first-hand knowledge, I think there's a small chance this patent might actually end up getting used against spammers. AT&T runs an ISP, so it is in a very good position to know if one of its users is sending out messages that use hashing techniques for avoiding duplicate detection. They can simply forbid this, of course, but having the patent is another weapon, particularly against large abusers.
In contrast to what some other posters indicated, hash-based duplicate detection is widely used by ISPs, and spammers do widely use anti-hashing techniques. I recently did some consulting work designing anti-anti-hashing techniques, but have already seen spammers use anti-anti-anti-hashing. And so it goes.
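To make the arms race concrete, here is a minimal Python sketch. It is not the method in the patent or anything a particular ISP runs; the function names, the sample messages, and the choice of word shingles with Jaccard similarity are all assumptions made for illustration. It shows an exact hash being defeated by one appended junk token, while a fuzzier comparison still sees the two messages as near-duplicates.

```python
import hashlib
import re

def exact_fingerprint(message: str) -> str:
    """Hash the raw message text; any one-character change alters the digest."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

def shingles(message: str, k: int = 3) -> set:
    """Lowercase the text, keep alphabetic words, and return the set of k-word shingles."""
    words = re.findall(r"[a-z]+", message.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical messages: the variant appends a random token, a simple anti-hashing trick.
original = "Buy cheap watches now at our online store today"
variant = "Buy cheap watches now at our online store today qzxvk"

# The exact hashes differ, so a naive duplicate detector is fooled.
same_hash = exact_fingerprint(original) == exact_fingerprint(variant)

# The shingle sets still overlap heavily, so a fuzzy detector is not.
similarity = jaccard(shingles(original), shingles(variant))
```

A detector built this way would flag pairs whose similarity exceeds some threshold; the next round of the game is spammers perturbing enough words to get under it.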
Saw CBR in Chicago, last year I think. "Robots" is a bit generous to describe the technology. Think more like the classic one-man band with electronics, and that gives it all a silly/fun character rather than a BDSM enslaved-by-robots feel. Don't know enough about punk to comment on the quality of the music. Very, very loud, so I couldn't make out the lyrics at all.
A large telecom company I once worked for donated some software developed by their research lab to a university, had it formally appraised for value, and took a tax deduction. I gather the legal paperwork was substantial, but so was the deduction. Anybody know of any companies that have contributed to open source in this fashion?
No one who wants to protect their intellectual property - including their right to give it away - should use the IBM patent server. Do you really want to let the world's most aggressive pursuer and licenser of patents see your query...?
Dave
say, what's this about needing a sig to avoid losing your last line?
The idea would be that pages that get through and shouldn't (in someone's opinion) are used as positive training examples for a neural net. Pages that don't get through and should are used as negative examples. One trains the neural net to distinguish between positive and negative examples, using words, phrases, etc. as input features. The big advantage of these techniques is that they can more or less gracefully combine hundreds of different clues, something that is difficult for a human to do by hand. They still make mistakes, of course - the question is whether the number of mistakes is reasonable, in your opinion.
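As a rough illustration of that training loop, here is a small Python sketch using a perceptron, the simplest single-layer neural net, with bag-of-words features. The training pages, labels, and function names are made up for the example; a real filter would use far more data and richer features such as phrases.

```python
from collections import defaultdict

def features(text):
    """Bag-of-words features: the set of lowercase tokens in the text."""
    return set(text.lower().split())

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label); +1 = should be blocked, -1 = should get through."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            score = bias + sum(weights[w] for w in feats)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # On a mistake, nudge every active feature toward the correct label.
                for w in feats:
                    weights[w] += label
                bias += label
    return weights, bias

def predict(model, text):
    """Return +1 (block) or -1 (allow) for a new page."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in features(text))
    return 1 if score > 0 else -1

# Hypothetical training pages, labeled by "someone's opinion".
training = [
    ("win free casino chips now", 1),
    ("free casino bonus click here", 1),
    ("library opening hours and events", -1),
    ("school science fair results", -1),
]
model = train_perceptron(training)
```

Each word ends up with a weight, and the decision is the sum over all words present, which is the "combine hundreds of clues" property in miniature.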
There's a huge literature on using neural nets and other machine learning techniques to train systems for distinguishing between all sorts of text content. See the sections on text categorization in _Machine Learning_ by Mitchell or _Foundations of Statistical Natural Language Processing_ by Manning & Schuetze. There's also a survey paper at http://faure.iei.pi.cnr.it/~fabrizio/, and an upcoming workshop described at http://www.daviddlewis.com/events/otc2001/
Dave
At least half the problems with NDAs and IP agreements that I see in my consulting work result from the company using a lawyer who doesn't specialize in IP. The result can be bizarre: draconian claims in one part of the agreement and gaping holes in others, a document they think is an NDA but really makes me their employee, etc. Unfortunately, I sometimes end up paying my IP attorney to rewrite their documents for them, just to protect myself!
Dave
Absolutely! There are a number of techniques already known in the information retrieval research community (relevance feedback in particular) that aren't being exploited in current web search engines, and would make a big difference. What's holding them back is usually some combination of efficiency problems and lack of a good interface metaphor for allowing naive users to effectively use the technique. As others have pointed out, most people don't even use phrases.
I think it's a non-issue whether the criteria the engines use are publicized or not. There are enough index spammers out there that any weaknesses in the criteria get discovered, exploited, and patched fairly quickly.
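For anyone who hasn't seen relevance feedback, here is a minimal Python sketch of the classic Rocchio formulation, which is one standard way to do it and not necessarily what any engine would deploy. The query, the documents, and the alpha/beta/gamma weights are illustrative assumptions. The user marks a few results as relevant or not, and the query is reweighted toward the terms in the relevant ones.

```python
from collections import Counter

def term_vector(text):
    """Raw term counts for a piece of text."""
    return Counter(text.lower().split())

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query: original terms, plus the average relevant
    document vector, minus the average non-relevant document vector."""
    new_query = Counter()
    for term, count in term_vector(query).items():
        new_query[term] += alpha * count
    for doc in relevant_docs:
        for term, count in term_vector(doc).items():
            new_query[term] += beta * count / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, count in term_vector(doc).items():
            new_query[term] -= gamma * count / len(nonrelevant_docs)
    # Terms pushed to zero or below are dropped from the expanded query.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical feedback: the user wanted the animal, not the car.
expanded = rocchio(
    "jaguar",
    relevant_docs=["jaguar big cat habitat", "jaguar cat rainforest"],
    nonrelevant_docs=["jaguar car dealer"],
)
```

The math is the easy part. The hard part is the one mentioned above: getting ordinary users to mark results at all, and doing the re-query cheaply at web scale.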
Agreed. I (not a lawyer) am pretty sure both you and your company's attorneys are legally obliged to mention any relevant prior art they're aware of. So go become aware. --Dave
Any opinions on which organization is most worthy of support with my limited donations: EFF or PubPat or someone else?
Dave
Well, Joe Haldeman in _Worlds_ wrote about the *really* last US blackout.
And of course Arthur C. Clarke in "The Nine Billion Names of God" wrote about the really last universal blackout.
Right - AT&T's lab is called, creatively, "AT&T Labs". Lucent's lab is "Bell Labs".
Dave