Slashdot Mirror


Text-Mining Your E-mail

Misha writes "There have been a number of weeks/months in anyone's life that called for a better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorized incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.

217 comments

  1. there is no way to win... by Anonymous Coward · · Score: 1, Interesting

    They'll end up finding a loophole in your filtering, or you'll end up filtering out real emails.

    Only way to win is to kill it from the source. End of story.

    1. Re:there is no way to win... by ldopa1 · · Score: 1

      Doesn't this effectively amount to censorship? I agree that we (as individuals) need to make spamming less cost effective, but just preventing people from emailing (which is what you'd have to do) is censorship, unless of course you're the govt, in which case it's "protecting national interests"...

      --
      The Dopester
      "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    2. Re:there is no way to win... by Anonymous Coward · · Score: 0

      You're kidding, right? Please tell me you are.

      It's no more censorship than me seeing a letter and not opening it. Censorship is when you prevent the person from ever saying it in the first place, not if you decide not to listen to them.

    3. Re:there is no way to win... by plague3106 · · Score: 1

      Um i think i have a right to say 'No, you cannot snail mail me' and 'no you cannot call me.' So why don't i have the right to say 'no you cannot email me.' It's not censorship if the audience does not want to hear what you have to say. Yes they have a right to speech, but that doesn't mean they can call me if i do not wish to listen.

      Censorship requires a 3rd party between the person 'giving' the speech and the person 'receiving' it.

    4. Re:there is no way to win... by tenman · · Score: 2

      you can. Pay for a mail server that lets you administer its configuration, and a domain name (don't start with the "I shouldn't have to pay for it" crap, you want free? you get spam!).

      Set up your mail server so that all incoming mail to your domain goes to you. Then only give out email addresses such as yourcompanyname@mydomain.com. If companyA.com sells your email address to spammers, you can shut that email address off: you can tell your mail server to reject mail sent to companyA.com@mydomain.com. This is no surefire way to stop everything, and someone who really wants to send you an email can make up any string of alphanumeric characters and send it to you at your domain. But it's a really nice way to monitor who is selling your address, and you cut the address off when you see that they have compromised your information.

      (note: many companies filter out the name of their company before they sell their address list, so your email never really makes it onto the list that the company ships)
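
The per-company address scheme described above can be sketched in a few lines. This is only an illustration of the idea, not a real mail server hook: the function name `should_reject`, the `revoked_aliases` set, and the default domain are all hypothetical, and a real setup would express the same rule in the mail server's own configuration.

```python
def should_reject(recipient, revoked_aliases, my_domain="mydomain.com"):
    """Return True if mail addressed to `recipient` should be refused.

    `revoked_aliases` holds the local parts (the text before the @) that
    were handed to companies which later leaked or sold the address.
    """
    local_part, _, domain = recipient.strip().lower().rpartition("@")
    if domain != my_domain:
        return False  # not addressed to our catch-all domain
    return local_part in revoked_aliases


# Hypothetical example: companyA.com sold the address, so that alias is cut off.
revoked = {"companya.com"}
print(should_reject("companyA.com@mydomain.com", revoked))
print(should_reject("companyB.com@mydomain.com", revoked))
```

Every other local part at the domain keeps working, which is what makes the alias useful as a tracer for who passed the address on.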

    5. Re:there is no way to win... by ldopa1 · · Score: 1

      "Um i think i have a right to say 'No, you cannot snail mail me' and 'no you cannot call me.'"

      Actually, you DON'T have that right. You DO have the right to say "I won't read what you snail mail me" and "No, I won't listen to what you have to say when you do call me." You DO (in the U.S.) have the right to say "You cannot call me again." (see the TCPA, section 11)

      Next time you see your friendly postal carrier, ask him (or her) why you have to get all that junk mail. Ask him/her to just chuck it. Guess what, he/she can't. It's illegal for anyone except the sender or the receiver to trash ANY mail intentionally, and even the sender can't trash it after it's been sent. Once it's in the hands of the USPS, it's actually no longer the sender's property...

      --
      The Dopester
      "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    6. Re:there is no way to win... by wolf- · · Score: 1
      I had it out on the phone with a "gentleman" yesterday afternoon over the "do not call me". One of his employees, for the 3rd or 4th time this week, rang ALL my POTS lines, hanging up before we could answer them. Then when we managed to answer, asked for me. My wife said "he isn't available". So he talked to her about our being homeowners (we aren't) and how his company (no good bastard telemarketers) can make us a deal on a new home or upgrades. She said she was not only not interested in doing business with him, she didn't want to be called again, and wanted on a no call list. She asked for his name, his phone number (at which point he hung up)...

      Got to love caller id. Got the name, checked the website, called up and asked to speak to the owner.

      I got David Parton on the line, who informed me that if I didn't want him to call me, that I should be put on the worthless GA No Call List. I informed him that pursuant to 47 U.S.C. Section 227, the burden of compliance was on him and not on me. After being handled rudely and unprofessionally by the original caller, the founder had the nerve to DEFEND his people, their procedures, and his attitudes.

      Why do I say the GA No Call List is worthless? It is the LINE OWNERS who have to pay to be on the list! $5 per line for 2 years. I now have to pay to NOT have my dinner rudely interrupted by an annoying unprofessional telemarketer who doesn't even have his "market" correct?

      Sure, why not. I have to manage the loads of spam that comes through our company mail servers. I have to pay for the storage, the transport and processing of this junk. Why shouldn't I have to pay for telemarketers also?

      The term "marketers" has been used loosely in regard to spammers and "telemarketers", as their shotgun methods of product placement don't match nearly anything taught in a true marketing course.

      --
      ----- LoboSoft specializes in Digital Language Lab
    7. Re:there is no way to win... by plague3106 · · Score: 1

      (don't start with the "I shouldn't have to pay for it" crap, you want free? you get spam!).

      I do pay jackass. My email comes with my internet connection.

      Should i pay EXTRA to block telemarketer calls? I don't think so, and i don't think i should have to pay not to be bothered. It is after all a right i have.

    8. Re:there is no way to win... by tenman · · Score: 2

      I do pay jackass. My email comes with my internet connection.

      I'm sorry I didn't make this clear. I'm not saying that you need to pay for email, I'm saying that you will have to pay extra to get your email filtered at the domain level.

      what you do about telemarketers is your business, and outside the scope of this thread

    9. Re:there is no way to win... by RennieScum · · Score: 2

      That sucks. While Louisiana is chock full of corrupt politicians, they occasionally make it work for their residents. Our no call list requires that businesses that make unsolicited calls to Louisiana residents subscribe to it (to the tune of $800). Hefty fines for those that call a number that's on the list.

      I get really pissed off when I get a call on my prepaid, expensive per minute cell phone, especially while I'm driving/riding my bike/sleeping/whatever. I'll have to figure out a way to simulate a car crash sound, so I can scream in agony, and then hang up.

      I'm really curious how the legality of this works...the state controls access to its phone lines under their conditions? Will they actually have the power to impose these fines on businesses that don't pay what amounts to their telemarketing tax? Our state constitution is based on Napoleonic Code...

      See also for the FAQ for weasel^H^H^H^H^H^Htelemarketers

      --
      ...Time is the best teacher, unfortunately it kills all of its students.
  2. What I want by clion999 · · Score: 2, Funny

    Here's to the researchers. I would like:

    * An email box that lets me extract the threads with my friends.
    * An email box that automatically ages the files effectively archiving them. Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

    Yes, I realize I should get off my butt and do this, but it's faster to post on slashdot.

    1. Re:What I want by Anonymous Coward · · Score: 0

      Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

      Maybe you should use a mail program written by someone who *knows* about fopen(filename, "a"), rather than by someone who thinks you have to load everything in RAM and rewrite everything to disk.
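
The point about append mode can be shown with a minimal sketch. Python's `open(path, "a")` plays the same role as `fopen(filename, "a")`: the new message is written at the end of the file and the existing contents are never read. The function name `append_message` is hypothetical, and this sketch deliberately ignores the file locking and "From " separator handling a real mbox writer needs.

```python
def append_message(mbox_path, raw_message):
    """Append one message to a single-file mailbox without loading it.

    Opening in "a" mode positions every write at the end of the file, so
    the cost depends on the size of the new message, not of the mailbox.
    """
    with open(mbox_path, "a", encoding="utf-8", newline="\n") as f:
        f.write(raw_message)
        if not raw_message.endswith("\n"):
            f.write("\n")
        f.write("\n")  # blank line between messages
```

Calling `append_message("inbox", text)` twice leaves both messages in the file in arrival order.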

    2. Re:What I want by quigonn · · Score: 2

      In fact, at least point 2 can be easily realized using mutt.

      --
      A monkey is doing the real work for me.
    3. Re:What I want by Anonymous Coward · · Score: 0

      Don't short yourself karma, tell us how! I use mutt and I didn't know this...

    4. Re:What I want by Col.+Klink+(retired) · · Score: 2

      Use nmh. Messages are stored in separate files rather than an entire folder in one file. You can then auto-archive by date with something like:

      refile `pick +inbox -before '1 apr 2002'` -src +inbox +archive

      --

      -- Don't Tase me, bro!

    5. Re:What I want by Jobe_br · · Score: 2, Informative

      For your second point:

      An email box that automatically ages the files effectively archiving them. Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

      you could switch to using the Maildir format instead of the typical single-file 'mbox' format. Maildir is popularly used by the qmail MTA as well as courier-imap. I run all my email servers in this manner and I've noticed significant speed improvements in mailboxes that have many messages.

      Maildir maintains three directories, of which 2 are significant: cur and new. Any new message delivered into the Maildir mailbox is placed in the "new" directory; once it's been read, it's moved into the "cur" directory. Each message is its own file, so no speed penalty is incurred for appending messages to mailboxes with many messages. Of course, all these different directories and such are transparent to the end-user; Maildir-capable MUAs (for console users) and of course Maildir-capable IMAP/POP systems are freely available (qmail does SMTP+sendmail wrapping and includes a basic POP3 daemon; courier-imap does IMAPv4 amongst other things; all the apps lend themselves to be used in an SSL via stunnel environment).

      Just a thought ... :)
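
The new/cur behaviour described above can be tried out with Python's standard `mailbox` module, which includes a Maildir implementation. This is a rough sketch rather than how qmail or courier-imap do it internally; the helper names `deliver`, `mark_read` and `count_files` are made up for the example.

```python
import mailbox
import os


def deliver(maildir_path, raw_message):
    """Store one message as its own file; fresh arrivals land in new/."""
    box = mailbox.Maildir(maildir_path, create=True)
    return box.add(raw_message)  # key that identifies the stored message


def mark_read(maildir_path, key):
    """Move a message from new/ to cur/, as a mail reader does after display."""
    box = mailbox.Maildir(maildir_path, create=True)
    msg = box.get_message(key)
    msg.set_subdir("cur")
    box[key] = msg  # rewriting the entry moves the file into cur/


def count_files(maildir_path):
    """Count message files in the two directories that matter to a reader."""
    return {
        sub: len(os.listdir(os.path.join(maildir_path, sub)))
        for sub in ("new", "cur")
    }
```

Delivering a message never touches the files already in the mailbox, which is where the speed difference against a large single-file mbox comes from.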

    6. Re:What I want by Bamfsog · · Score: 2

      I know this is the wrong place to point this out, but Outlook does what you are asking for.

      You can sort a folder by user/subject/date, and there is a built in thread view. You can also use the autoarchive feature, or manually archive messages in X folder(s) older than Y date.

    7. Re:What I want by nosferatu-man · · Score: 5, Informative

      Welcome to Gnus. Have a sandwich.

      (jfb)

      --
      To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
    8. Re:What I want by swordboy · · Score: 2

      Here's what I want:

      A google plug-in for my mail client.

      Thanks in advance!

      --

      Life is the leading cause of death in America.
    9. Re:What I want by antiher0 · · Score: 2

      gnus has been doing this for years... as well as other neat things like mail scoring (similar to news scoring) so that mail you don't want to read gets filtered to the bottom of your list or (if you tell it to) doesn't even show up at all. Similarly, mail that you most want to read (based on past response) gets bubbled up to the top. gnus also supports mail expiry (once again, similar to news) so that old mail gets Handled(TM).

    10. Re:What I want by doom · · Score: 2
      Use nmh. Messages are stored in separate files rather than an entire folder in one file. You can then auto-archive by date with something like: refile `pick +inbox -before '1 apr 2002'` -src +inbox +archive
      Yeah, I was wondering a bit about what "text mining" your email is supposed to be about exactly...

      Personally, I use mh (using the emacs mh-rmail frontend). I refile stuff automatically typically just based on the '-from' (using commands much like the above pick/refile). And if I'm looking for something I remember seeing awhile back, a grep on one or two mail folders (which are just directories full of text files for us mh users) does a pretty good job...

      I won't say that there's no way to improve on this, but any fancy system that someone proposes has got to beat some pretty effective simple tools...

      I mean, if you're really after identifying a burst of activity on a given topic... wouldn't a combination of text searches and visual scans of subject headers sorted by date get you 90% of the way there?

      While we're on the subject, anyone taken a look at this old jwz idea: Intertwingle
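
The "grep on one or two mail folders" approach works because an mh folder is just a directory with one text file per message. A minimal Python equivalent might look like the following; `search_folder` is a hypothetical helper, and it does a plain case-insensitive regex match over each file with no knowledge of headers or MIME.

```python
import os
import re


def search_folder(folder_path, pattern):
    """Return the names of message files in a folder whose text matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name in os.listdir(folder_path):
        path = os.path.join(folder_path, name)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        with open(path, encoding="utf-8", errors="replace") as f:
            if regex.search(f.read()):
                hits.append(name)
    return sorted(hits)
```

Something like `search_folder(os.path.expanduser("~/Mail/inbox"), "postscript")` would list the matching message numbers, assuming that path is where the folder lives.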

    11. Re:What I want by doom · · Score: 2


      Postfix is also supposed to support the maildir format.

    12. Re:What I want by Mark+Wilkinson · · Score: 1

      To spur "enterprise Linux," Big Bang, the distributed two-phase commit.

      What does this mean?

    13. Re:What I want by nosferatu-man · · Score: 2

      You're the first person who's ever asked me that.

      It's a line generated by a text disassociator that I wrote and then pointed at a bunch of articles from those stupid rah-rah-rah Business 2.0 style rags a couple of years back. I wish I'd have saved the rest of the generated text: it was hilarious.

      Best,
      'jfb

      --
      To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
  3. PS-PDF Document format conversion by Misha · · Score: 5, Informative
    --



    I was thinking of how to intentionally fail my drug test... It would make a good memoir story someday.
    1. Re:PS-PDF Document format conversion by d3xt3r · · Score: 1

      If you run linux ps2pdf works nicely as well.

    2. Re:PS-PDF Document format conversion by DeadSea · · Score: 2

      PS-PDF is great for quickly mirroring webpages. I'm surprised that I don't see more people doing it here on slashdot to get some quick karma when sites get slashdotted. You have the webpage open in your browser (because you got there before the crowd). First you print it to a postscript file (netscape does this nicely). Then you run it through ps2pdf or some other tool like this and you have the webpage (with all the pictures) mirrored in a single file. My friends were doing this on sept 11 when all the news sites were going down. Anything one of us saw, we all saw.

    3. Re:PS-PDF Document format conversion by KingKire64 · · Score: 1

      Here is a converted PDF from that website http://wheel.compose.cs.cmu.edu:8001/cgi-bin/browse/objreal/BPtemp4701.1019671810,0

      Right-click and save it as a PDF; it has a temp file name.

      --
      "All I can tell the "lesser of two evils" folks is that if they keep voting for evil, they'll keep getting evil."-Lp.org
    4. Re:PS-PDF Document format conversion by billnapier · · Score: 1

      Why not just read it using a PostScript viewer?

    5. Re:PS-PDF Document format conversion by daeley · · Score: 2

      If using Mac OS X, one could do the same thing in any printing-capable browser (or any other program). Use the Print command and click on the "Preview" button in the dialog. This automatically creates a PDF version of the document, which can be saved and uploaded.

      --
      I watched C-beams glitter in the dark near the Tannhauser gate.
    6. Re:PS-PDF Document format conversion by ravenwing_np · · Score: 1

      Alright, that explains why the load average on my server is up to 30 and 40. There is just an old Sparc4 doing the bulk of the conversion. If things go slowly, now you know.

    7. Re:PS-PDF Document format conversion by gasull · · Score: 1
      Here's a link to a terribly useful site for converting your postscripts and word docs into pdf or jpeg.

      Just click here (Converting postscript to pdf).

  4. Re:Link to a postscript file? by Anonymous Coward · · Score: 0

    just get ghostview i mean, u have to get acrobat to read pdf, right? unless you're on OS X or something... get ghostview, clicking on the ps document in mozilla magically invokes it, thus u prevail

  5. The importance of E-mail history by Phred_Johnston · · Score: 2, Insightful

    I'm sure I'm not alone in saying that it pays to have a good history of well filtered incoming mail, and especially just about all of my outgoing mail (Outbox), available for searching. My Outbox has been a lifesaver several times when someone claims that they didn't have that (electronic) discussion with me. It's great to quote "in a message sent... ...I asked you to...".

    1. Re:The importance of E-mail history by Liora · · Score: 1

      One question... I agree that it is useful to have an outbox archive, but what do you do if they still say they didn't get that or say that? Admittedly I can print out their email, but that can be fabricated and then printed, or I can forward them their own original message, but that can be fabricated, etc... Am I missing something?

      --
      Liora
    2. Re:The importance of E-mail history by ManDude · · Score: 1

      I'm on the other end of "in a message sent... ...I asked you to...". IT, corporate, and sales send a lot of email and it becomes impossible to keep track of everything that they want you to do or not do. I am to the point where I skim very quickly anything that they send. I am expecting an email any time from one of them saying "in a message sent... ...I asked you to..." and all I can say is "f?(k-you".

  6. Filtering by pyrrho · · Score: 1

    That feature in the description is not text mining, just filtering.

    --

    -pyrrho

  7. Yet another reason for.. by Dr+Caleb · · Score: 4, Informative
    Lotus Notes.

    It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.

    --
    "History doesn't repeat itself, but it does rhyme." Mark Twain
    1. Re:Yet another reason for.. by LeeZard · · Score: 2, Insightful

      That's not the point. The paper is talking about modeling spikes in topic/content of data streams over time. This is the second layer analysis of the meta-data that gets stored in the database.
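
To make "modeling spikes over time" a little more concrete, here is a rough sketch of burst detection on message arrival times. It is a simplified two-state version for illustration, not the paper's infinite-state automaton: the stream is either in a baseline state or a burst state, gaps between messages are scored with an exponential model, and a Viterbi-style pass picks the cheapest state sequence. The function name `label_bursts`, the rate multiplier `s`, the penalty weight `gamma`, and the exact cost formulas are assumptions made for this example.

```python
import math


def label_bursts(timestamps, s=2.0, gamma=1.0):
    """Label each gap between consecutive messages: 0 = baseline, 1 = burst."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    n = len(gaps)
    total = sum(gaps)
    if n == 0 or total <= 0:
        return [0] * n

    base_rate = n / total               # average arrival rate of the stream
    rates = (base_rate, s * base_rate)  # the burst state emits s times faster
    enter_cost = gamma * math.log(n)    # assumed penalty for moving 0 -> 1

    def gap_cost(state, gap):
        # Negative log-likelihood of the gap under an exponential model.
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, math.inf]              # the stream starts in the baseline state
    back = []
    for gap in gaps:
        new_cost, pointers = [], []
        for state in (0, 1):
            # Cheapest way to reach `state`: from baseline or from burst.
            via = [cost[0] + (enter_cost if state == 1 else 0.0), cost[1]]
            prev = 0 if via[0] <= via[1] else 1
            new_cost.append(via[prev] + gap_cost(state, gap))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Follow the back-pointers from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    labels = []
    for pointers in reversed(back):
        labels.append(state)
        state = pointers[state]
    labels.reverse()
    return labels
```

Feeding it the arrival times of all messages that mention a given word gives one label per gap; a run of 1s marks a stretch where that word arrived unusually fast. Raising `gamma` makes the sketch more reluctant to declare a burst, and raising `s` demands a sharper speed-up before a gap counts as bursty.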

    2. Re:Yet another reason for.. by Anonymous Coward · · Score: 0

      But notes cant ever sort a messages subject without writing an agent for it.

    3. Re:Yet another reason for.. by ConceptJunkie · · Score: 2

      Upside:

      Lotus Notes does all kinds of things automagically.

      Downside:

      It's _Lotus Notes_, the application that makes Microsoft Office look lean and mean.

      --
      You are in a maze of twisty little passages, all alike.
    4. Re:Yet another reason for.. by Anonymous Coward · · Score: 0
      Lotus Notes

      NOOOOOOOOOOOOOOOOOOOOOOOOOOO....!!!!

      I would not wish the horrific crap which is Notes upon anyone! It is the biggest, bloatedest, most non-functional, slowest, worst (in every way) excuse for a email client that I have ever had the unfortunate displeasure of having to use. It is awful. I think I am understating my position here. I hate it more than all that is evil in the world. Really. It's that bad.

      The company that owns it forces all employees to use it. Yes really they do. My productivity when forced to use Notes drops by exponential levels.

      Why does it suck so bad? Let's see:

      • Windows only. If you want to run it on another OS, you've got to use either a VM (like VMware or similar) or Wine. VMware et al. only double the bloat and Notes under Wine has many problems (and Notes doesn't need any more problems!!!).
      • Graphical only. Hello, anyone ever heard of remote access (like ssh?!?). Welp, can't check my email because Notes is graphical. Oh, but pine, mutt, etc aren't - that's why I use them (I use pine).
      • Gigantic. Notes takes up hundreds of Megs of memory. Really. If you don't have a 200+ MHz processor and more than say 128M RAM, you're not going to be doing anything with your system except Notes. Of course I guess on Windows, who multitasks anyway? Also, such a bloated program means long startup time (as compared to pine whose startup is easily less than a second).
      • Slow. Not only is it bloated (which slows it, and your entire system, down) but it is also very slow. When I click on "Mail" it sometimes takes up to 15 seconds to get to the mail screen. Then clicking on a specific mail item (to view the email) sometimes takes up to 10 seconds, or sometimes even more! Completely unacceptable.
      • Proprietary. And I don't mean the program here. I mean the email format!!! There is tons of pure crap like setting up meetings, marking things "urgent" and "confidential" which cannot be translated into real email format (besides changing to text). Notes speaks a language of its own and the Domino server's Notes-to-real-email translation is not the best. But worst, people who only use Notes do not realize that crap like coloring email doesn't translate to real email! I've gotten countless "Notes" that say "see my replies in blue" where their text is intermixed with what I sent - except in real email it ain't blue!!! I have to think back or use the context to figure out what's mine and what's theirs. And god forbid I am just a CC: on the topic, especially if I didn't see the original! I have no chance of figuring out who said what. The worst part is people do this while CC'ing (external) customers, who of course don't know what the hell they are talking about - if they've never used Notes they wonder why some idiot thinks he can color an email!
    5. Re:Yet another reason for.. by Dr+Caleb · · Score: 3, Informative
      How do you figure that?

      Lotus Notes (5.0.5), as installed on my system is 127M (no modem files etc) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M, without PPT, and that's just the Program Files\Microsoft Office directory.

      Lotus Notes is pretty clean, so most of its files are in 1 directory, not spread out over umpteen directories like Office.

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
    6. Re:Yet another reason for.. by Dr+Caleb · · Score: 2
      Notes was not meant for email. Read the History of Notes.

      Yes, it can be bloated and slow, but what isn't nowadays, taking into consideration that people need things like calendaring, meeting scheduling and collaborative tools? What else can run on multiple platforms? Outlook?

      If you get to know it, understand it and use it, you never know, you might like it.

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
    7. Re:Yet another reason for.. by Anonymous Coward · · Score: 0
      Notes was not meant for email

      Ah! Well that explains why it sucks so bad at its attempt at doing email.

      What else can run on multiple platforms?

      Uh, wha? You call Windoze and OS/2 "multiple platforms"? Uh, ok...oh, and pine and mutt are multi-platform (i.e. most UNIXes and Windoze).

      If you get to know it, understand it and use it, you never know, you might like it

      I am forced to use it by my employer (guess who) and no, I don't like it. No technical person I have ever talked to likes it. Are you management?

    8. Re:Yet another reason for.. by nvainio · · Score: 1

      My mutt is 460 kb, my grep is 75 kb.

    9. Re:Yet another reason for.. by Anonymous Coward · · Score: 0

      Halloooooooo. Newsflash. Domino does IMAP. So if you cant use the mailreader of your choice, blame your sysadmin.

    10. Re:Yet another reason for.. by Dr+Caleb · · Score: 2
      You call Windoze and OS/2 "multiple platforms"

      Those are two. I also have it running on Unix, AIX and OS/400. Kinda nice since we're an AS/400 shop.

      I am forced to use it by my employer

      {a hush falls over the crowd} You mean your employer actually tells you what you can do at work, while they are paying you! How brutal it must be for you!

      (guess who) and no, I don't like it.

      I can guess, and I probably used to work for them the past 10 years, which is where I got all my Notes training.

      No technical person I have ever talked to likes it.

      Strange. Any technical person I've ever talked to never turns their back on a good solution, and technical people who understand Notes love it. Especially if they have been "rescued" from Exchange Hell. Are you saying I'm non-technical?

      Are you management? Yes, MIS.

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
    11. Re:Yet another reason for.. by Dr+Caleb · · Score: 2
      My car is red and I like Pizza and beer. So?

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
    12. Re:Yet another reason for.. by Icculus · · Score: 1

      Believe it or not, I would use most of these words to describe Lotus Notes, or more precisely the feelings it evokes in me both as a user and a developer.

    13. Re:Yet another reason for.. by jonbrewer · · Score: 2

      All you have to do is use Lotus Notes for a few days on an aging PowerMac with 8 or 16MB ram, and you'll give up on it forever. You'll also tell everyone you know 1. what a horrible thing Lotus Notes is, and 2. what a horrible thing a Macintosh is.

      Those at Apple responsible for allowing PowerMacs to ship with System 7.5.x and less than 32MB ram should be banned from the industry. When an OS by default takes more ram than a system has, and is coupled with an application like Lotus Notes, which is hungry, nothing good can ever happen.

      This is, IMNSHO, a good part of the reason that so many corporations ditched their Macs in the mid-nineties.

  8. Re:Link to a postscript file? by jordan_a · · Score: 1

    Yeah, a link to a standard document format that you can get viewers for on almost every platform. Damn that's soo inconsiderate, where's those word documents?

  9. The ultimate spam blocker? by ldopa1 · · Score: 2

    This would be an awesome tool to block spam. If this program could look at the text of an email message and determine that it is a solicitation of some kind and then drop it into an email "pit" (you know, a folder mapped to /dev/null), that would make my life a LOT easier...

    --
    The Dopester
    "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    1. Re:The ultimate spam blocker? by set · · Score: 1

      SpamAssassin does this already, using a genetic algorithm.

    2. Re:The ultimate spam blocker? by jmb-d · · Score: 2, Informative

      This would be an awesome tool to block spam. If this program could look at the text of an email message and determine that it is a solicitation of some kind

      SpamAssassin will do this part for you.

      --
      In walking, just walk. In sitting, just sit. Above all, don't wobble.
      -- Yun-Men
    3. Re:The ultimate spam blocker? by ldopa1 · · Score: 1

      Do you have a link to this? Any more information you have would be welcome..

      --
      The Dopester
      "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    4. Re:The ultimate spam blocker? by set · · Score: 1

      A simple google search would have found it, but it's spamassassin.org.

  10. Have you tried.... by Anonymous Coward · · Score: 0

    Outlook?

    1. Re:Have you tried.... by Lisias · · Score: 0, Flamebait
      Outloko?

      Yes, I did. But Yesterday a virus came and erased all my mail boxes...

      Humm!!! It's true!! It works!! Now I have no email at all to worry about!!!!

      --
      Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
  11. a suggestion by 56ker · · Score: 1

    "that categorized incoming messages by topic as they arrive." - you can already sort messages into different folders depending on their topic by setting up rules.

    1. Re:a suggestion by spencerogden · · Score: 2

      This is a little more in depth than just matching strings in headers. This is about determining the topic of an email.

      Spencer

  12. Too much information. by abucior · · Score: 5, Funny

    Personally, I'd prefer that I simply get less email. The fact that we need NLP tools to pre-screen our email for us just shows how information-overloaded our society has become. What I really need is a tool at the sender's end that can pre-screen my email and tell the sender "Don't send this. He just doesn't care!"

    1. Re:Too much information. by ldopa1 · · Score: 1

      If our society has too much information, why did you just add that little tidbit to my stream of consciousness? Are you trying to make my head explode?

      Frankly, there is no such thing as too much information. Information is never created, knowledge is created. The information was always there (and always will be), you're just perceiving more, creating knowledge of the information.

      --
      The Dopester
      "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    2. Re:Too much information. by ConceptJunkie · · Score: 2

      Yes, but by your definition, information equals entropy, and there's definitely too much of that. We need some way to reverse that trend.

      I wonder what could be done with a really hot cup of tea...

      --
      You are in a maze of twisty little passages, all alike.
    3. Re:Too much information. by iabervon · · Score: 2

      There's plenty of information I want to get that I don't want to look at as email.

      For example, I'd like to get messages inviting me to events I'm unlikely to go to, and I'd like to have their dates get marked down so that I can see what is happening on a given day if I feel like doing something.

      I'd like to get new addresses for people, but I want to have my addressbook updated instead of seeing the message.

      It would be really convenient to have software that would figure out this sort of information from a human-readable message, since people are likely to want to send it in natural language (and the message probably includes more information that I might want to see if I decide I care).

    4. Re:Too much information. by j.e.hahn · · Score: 1

      You can't. The 2nd law of thermodynamics forbids it.

    5. Re:Too much information. by foniksonik · · Score: 1

      What about forcing marketers to register in a database? New legislation that states that any marketer not registered and still found to be actively marketing will suffer mandatory fines w/ a 3 strikes rule or something for repeat offenders.

      This would mandate and maintain opt-in and opt-out standards while still allowing initial marketing by the business. ie: they could send us one mail, we could decide if we wanted to ever receive from them again and then go to the database and choose to opt-out or later on choose to opt-in again.

      This would also give consumers recourse against 'rogue' marketers. Businesses who are legitimately marketing their product would not have any problem with this and 'rogue' marketers could be found guilty of invasion of privacy, which is what this all amounts to...

      Mod this up if you think it's a legitimate idea, or maybe even start an 'Ask Slashdot' article to see if we could hash out a workable version and push for legislation somehow.

      --
      A fool throws a stone into a well and a thousand sages can not remove it.
    6. Re:Too much information. by phaedrus · · Score: 1

      ahhh, _bored of the rings_

  13. Sort it by domain... by qurob · · Score: 1

    I can sort reports from devices, co-workers, clients....each goes in its own folder....

  14. text inside the PS file - references cut by GutBomb · · Score: 0

    Bursty and Hierarchical Structure in Streams [*]
    Jon Kleinberg [#]

    Abstract. A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise -- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

    [*] A version of this work appears as Cornell Computer Science Technical Report 02-1863 (March 2002).
    [#] Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu. Supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award, NSF ITR/IM Grant IIS-0081334, and NSF Faculty Early Career Development Award CCR-9701399.
1 Introduction Documents can be naturally organized by topic, but in many settings we also experience their arrival over time. E-mail and news articles provide two clear examples of such document streams: in both cases, the strong temporal ordering of the content is necessary for making sense of it, as particular topics appear, grow in intensity, and then fade away again. Over a much longer time scale, the published literature in a particular research field can be meaningfully understood in this way as well, with particular research themes growing and diminishing in visibility across a period of years. Work in the areas of topic detection and tracking [2, 3, 5, 61, 62], text mining [36, 56, 57, 58], and visualization [26, 43, 60] has explored techniques for identifying topics in document streams comprised of news stories, using a combination of content analysis and time-series modeling. Underlying a number of these techniques is the following intuitive premise -- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. At one level, the approach presented here can be viewed as drawing an analogy with models from queueing theory for bursty network traffic (see e.g. [32]). In addition, however, the analysis of the underlying burst patterns reveals a latent hierarchical structure that often has a natural meaning in terms of the content of the stream. My initial aim in studying this issue was a very concrete one: I wanted a better organizing principle for the enormous archives of personal e-mail that I was accumulating. 
Abundant anecdotal evidence, as well as academic research [6, 42, 59], suggested that my own experience with "e-mail overload" corresponded to a near-universal phenomenon -- a consequence of both the rate at which e-mail arrives, and the demands of managing volumes of saved personal correspondence that can easily grow into tens and hundreds of megabytes of pure text content. And at a still larger scale, e-mail has become the raw material for legal proceedings [34] and historical investigation [8, 38, 44] -- with the National Archives, for example, agreeing to accept tens of millions of e-mail messages from the Clinton White House [45]. In sum, there are several settings where it is a crucial problem to find structures that can help in making sense of large volumes of e-mail. An active line of research has applied text indexing and classification to develop e-mail interfaces that organize incoming messages into folders on specific topics, sometimes recommending further actions on the part of a user [4, 9, 13, 29, 30, 39, 46, 47, 49, 50, 51, 53, 54] -- in effect, this framework seeks to automate a kind of filing system that many users implement manually. There has also been work on developing query interfaces to fully-indexed collections of e-mail [7]. My interest here is in exploring organizing structures based more explicitly on the role of time in e-mail and other document streams. Indeed, even the flow of a single focused topic is modulated by the rate at which relevant messages or documents arrive, dividing naturally into more localized episodes that correspond to bursts of activity of the type suggested above. For example, my saved e-mail contains over a thousand messages relevant to the topic "grant proposals" -- announcements of new funding programs, planning of proposals, and correspondence with co-authors.
While one could divide this collection into sub-topics based on message content -- certain people, programs, or funding agencies form the topics of some messages but not others -- an equally natural and substantially orthogonal organization for this topic would take into account the sequence of episodes reflected in the set of messages -- bursts that surround the planning and writing of certain proposals. Indeed, certain subtopics (e.g. "the process of gathering people together for our large NSF ITR proposal") may be much more easily characterized by a sudden confluence of message-sending over a particular period of time than by textual features of the messages themselves. One can easily argue that many of the large topics represented in a document stream are naturally punctuated by bursts in this way, with the flow of relevant items intensifying in certain key periods. A general technique for highlighting these bursts thus has the potential to expose a great deal of fine-grained structure. Before moving to a more technical overview of the methodology, let me suggest one further perspective on this issue, quite distant from computational concerns. If one were to view a particular folder of e-mail not simply as a document stream but also as something akin to a narrative that unfolds over time, then one immediately brings into play a body of work that deals explicitly with the bursty nature of time in narratives, and the way in which particular events are signaled by a compression of the time-sense. In an early concrete reference to this idea, E.M. Forster, lecturing on the structure of the novel in the 1920's, asserted that . . . 
there seems something else in life besides time, something which may conveniently be called "value," something which is measured not by minutes or hours but by intensity, so that when we look at our past it does not stretch back evenly but piles up into a few notable pinnacles, and when we look at the future it seems sometimes a wall, sometimes a cloud, sometimes a sun, but never a chronological chart [17]. This role of time in narratives is developed more explicitly in work of Genette [19, 20], Chatman [11], and others on anisochronies, the non-uniform relationships between the amount of time spanned by a story's events and the amount of time devoted to these events in the actual telling of the story. Modeling Bursty Streams. Suppose we were presented with a document stream -- for concreteness, consider a large folder of e-mail on a single broad topic. How should we go about identifying the main bursts of activity, and how do they help impose additional structure on the stream? The basic point emerging from the discussion above is that such bursts correspond roughly to points at which the intensity of message arrivals increases sharply, perhaps from once every few weeks or days to once every few hours or minutes. But the rate of arrivals is in general very "rugged": it does not typically rise smoothly to a crescendo and then fall away, but rather exhibits frequent alternations of rapid flurries and longer pauses in close proximity. Thus, methods that analyze gaps between consecutive message arrivals in too simplistic a way can easily be pulled into identifying large numbers of short spurious bursts, as well as fragmenting long bursts into many smaller ones. Moreover, a simple enumeration of close-together sets of messages is only a first step toward more intricate structure.
The broader goal is thus to extract global structure from a robust kind of data reduction -- identifying bursts only when they have sufficient intensity, and in a way that allows a burst to persist smoothly across a fairly non-uniform pattern of message arrivals. My approach here is to model the stream using an infinite-state automaton A, which at any point in time can be in one of an underlying set of states, and emits messages at different rates depending on its state. Specifically, the automaton A has a set of states that correspond to increasingly rapid rates of emission, and the onset of a burst is signaled by a state transition -- from a lower state to a higher state. By assigning costs to state transitions, one can control the frequency of such transitions, preventing very short bursts and making it easier to identify long bursts despite transient changes in the rate of the stream. The overall framework is developed in Section 2. It can be viewed as drawing an analogy to the use of on-off Markov sources in modeling bursty network traffic (see for example the overview article by Kelly [32]), as well as drawing on the formalism of hidden Markov models [48]. Using an automaton with states that correspond to higher and higher intensities provides an additional source of analytical leverage -- the bursts associated with state transitions form a naturally nested structure, with a long burst of low intensity potentially containing several bursts of higher intensity inside it (and so on, recursively). For a folder of related e-mail messages, we will see in Sections 2 and 3 that this can provide a hierarchical decomposition of the temporal order, with long-running episodes intensifying into briefer ones according to a natural tree structure. This tree can thus be viewed as imposing a fine-grained organization on the sub-episodes within the message stream. 
Following this development, Section 4 focuses on a case in which the document stream is comprised not of e-mail messages but of computer science conference paper titles over the past several decades; the set of bursts in this stream corresponds roughly to the appearance and disappearance of certain terms of interest in the papers. Section 5 discusses the connections to related work in a range of areas, particularly the striking recent work of Swan, Allan, and Jensen [56, 57, 58] on overview timelines, which forms the body of research closest to the approach here. Finally, Section 6 discusses some further applications of the methodology -- how burstiness in arrivals can help to identify certain messages as "landmarks" in a large corpus of e-mail; and how the overall framework can be applied to logs of Web traffic.

2 A Weighted Automaton Model

Perhaps the simplest randomized model for generating a sequence of message arrival times is based on an exponential distribution: messages are emitted in a probabilistic manner, so that the gap $x$ in time between messages $i$ and $i + 1$ is distributed according to the "memoryless" exponential density function $f(x) = \alpha e^{-\alpha x}$, for a parameter $\alpha > 0$. (In other words, the probability that the gap exceeds $x$ is equal to $e^{-\alpha x}$.) The expected value of the gap in this model is $\alpha^{-1}$, and hence one can refer to $\alpha$ as the rate of message arrivals. Intuitively, a "bursty" model should extend this simple formulation by exhibiting periods of lower rate interleaved with periods of higher rate. A natural way to do this is to construct a model with multiple states, where the rate depends on the current state. Let us start with a basic model that incorporates this idea, and then extend it to the models that will primarily be used in what follows. A two-state model.
Arguably the most basic bursty model of this type would be constructed from a probabilistic automaton $A$ with two states $q_0$ and $q_1$, which we can think of as corresponding to "low" and "high." When $A$ is in state $q_0$, messages are emitted at a slow rate, with gaps $x$ between consecutive messages distributed independently according to a density function $f_0(x) = \alpha_0 e^{-\alpha_0 x}$. When $A$ is in state $q_1$, messages are emitted at a faster rate, with gaps distributed independently according to $f_1(x) = \alpha_1 e^{-\alpha_1 x}$, where $\alpha_1 > \alpha_0$. Finally, between messages, $A$ changes state with probability $p \in (0, 1)$, remaining in its current state with probability $1 - p$, independently of previous emissions and state changes. Such a model could be used to generate a sequence of messages in the natural way. $A$ begins in state $q_0$. Before each message (including the first) is emitted, $A$ changes state with probability $p$. A message is then emitted, and the gap in time until the next message is determined by the distribution associated with $A$'s current state. One can apply this generative model to find a likely state sequence, given a set of messages. Suppose there is a given set of $n + 1$ messages, with specified arrival times; this determines a sequence of $n$ inter-arrival gaps $x = (x_1, x_2, \ldots, x_n)$. The development here will use the basic assumption that all gaps $x_i$ are strictly positive. We can use the Bayes procedure (as in e.g. [14]) to determine the conditional probability of a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$; note that this must be done in terms of the underlying density functions, since the gaps are not drawn from discrete distributions. Each state sequence $q$ induces a density function $f_q$ over sequences of gaps, which has the form $f_q(x_1, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t)$. If $b$ denotes the number of state transitions in the sequence $q$ -- that is, the number of indices $i_t$ so that $q_{i_t} \ne q_{i_{t+1}}$ -- then the (prior) probability of $q$ is equal to

$$\Big(\prod_{i_t \ne i_{t+1}} p\Big)\Big(\prod_{i_t = i_{t+1}} (1 - p)\Big) = p^b (1 - p)^{n-b} = \Big(\frac{p}{1 - p}\Big)^b (1 - p)^n.$$

(In this calculation, let $i_0 = 0$, since $A$ starts in state $q_0$.) Now,

$$\Pr[q \mid x] = \frac{\Pr[q]\, f_q(x)}{\sum_{q'} \Pr[q']\, f_{q'}(x)} = \frac{1}{Z}\Big(\frac{p}{1 - p}\Big)^b (1 - p)^n \prod_{t=1}^{n} f_{i_t}(x_t),$$

where $Z$ is the normalizing constant $\sum_{q'} \Pr[q']\, f_{q'}(x)$. Finding a state sequence $q$ maximizing this probability is equivalent to finding one that minimizes

$$-\ln \Pr[q \mid x] = b \ln\Big(\frac{1 - p}{p}\Big) + \Big(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\Big) - n \ln(1 - p) + \ln Z.$$

Since the third and fourth terms are independent of the state sequence, this latter optimization problem is equivalent to finding a state sequence $q$ that minimizes the following cost function:

$$c(q \mid x) = b \ln\Big(\frac{1 - p}{p}\Big) + \Big(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\Big).$$

Finding a state sequence to minimize this cost function is a problem that can be motivated intuitively on its own terms, without recourse to the underlying probabilistic model. The first of the two terms in the expression for $c(q \mid x)$ favors sequences with a small number of state transitions, while the second term favors state sequences that conform well to the sequence $x$ of gap values. Thus, one expects the optimum to track the global structure of bursts in the gap sequence, while holding to a single state through local periods of non-uniformity. Varying the coefficient on $b$ controls the amount of "inertia" fixing the automaton in its current state. The next step is to extend this simple "high-low" model to one with a richer state set, using a cost model; this will lead to a method that also extracts hierarchical structure from the pattern of bursts.

An infinite-state model. Consider a sequence of $n + 1$ messages that arrive over a period of time of length $T$. If the messages were spaced completely evenly over this time interval, then they would arrive with gaps of size $g = T/n$. Bursts of greater and greater intensity would be associated with gaps smaller and smaller than $g$.
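The two-state cost function c(q | x) above can be minimized with a standard dynamic program of the kind used for hidden Markov models. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the function name `two_state_sequence` and its argument names are hypothetical, and the rates and switching probability are taken as given rather than estimated.

```python
import math

def two_state_sequence(gaps, alpha0, alpha1, p):
    """Return a minimum-cost list of state indices (0 = low, 1 = high).

    Cost of a sequence = (number of state changes) * ln((1 - p) / p)
                         + sum over gaps of -ln f_state(gap),
    where f_j(x) = alpha_j * exp(-alpha_j * x). The automaton starts in state 0.
    """
    if not (0 < alpha0 < alpha1 and 0 < p < 1):
        raise ValueError("need 0 < alpha0 < alpha1 and 0 < p < 1")
    if not gaps:
        return []
    rates = (alpha0, alpha1)
    switch = math.log((1 - p) / p)      # cost of one state change
    cost = [0.0, float("inf")]          # before the first gap we are in state 0
    back = []                           # back[t][j] = best predecessor of state j
    for x in gaps:
        new_cost, ptr = [], []
        for j in (0, 1):
            stay, move = cost[j], cost[1 - j] + switch
            ptr.append(j if stay <= move else 1 - j)
            # -ln f_j(x) = alpha_j * x - ln(alpha_j)
            new_cost.append(min(stay, move) + rates[j] * x - math.log(rates[j]))
        back.append(ptr)
        cost = new_cost
    # Trace the cheapest path backwards from the last gap.
    j = 0 if cost[0] <= cost[1] else 1
    states = [j]
    for ptr in reversed(back[1:]):
        j = ptr[j]
        states.append(j)
    states.reverse()
    return states

# Four short gaps in a row are enough to justify a burst; one short gap is not.
print(two_state_sequence([10, 10, 10, 0.1, 0.1, 0.1, 0.1, 10, 10], 0.1, 10.0, 0.1))
print(two_state_sequence([10, 0.1, 10], 0.1, 10.0, 0.1))
```

The second call illustrates the "inertia" described above: a single short gap surrounded by long ones does not repay the cost of switching up and back down.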
This suggests focusing on an infinite-state automaton whose states correspond to gap sizes that may be arbitrarily small, so as to capture the full range of possible bursts. The development here will use a cost model as in the two-state case, where the underlying goal is to find a state sequence of minimum cost. Thus, consider an automaton with a "base state" $q_0$ that has an associated exponential density function $f_0$ with rate $\alpha_0 = g^{-1} = n/T$ -- consistent with completely uniform message arrivals. For each $i > 0$, there is a state $q_i$ with associated exponential density $f_i$ having rate $\alpha_i = g^{-1} s^i$, where $s > 1$ is a scaling parameter. ($i$ will be referred to as the index of the state $q_i$.) In other words, the infinite sequence of states $q_0, q_1, \ldots$ models inter-arrival gaps that decrease geometrically from $g$; there is an expected rate of message arrivals that intensifies for larger and larger values of $i$. Finally, for every $i$ and $j$, there is a cost $\tau(i, j)$ associated with a state transition from $q_i$ to $q_j$.

[Figure 1 omitted.] Figure 1: An infinite-state model for bursty sequences. (a) The infinite-state automaton $A^*_{s,\gamma}$; in state $q_i$, messages are emitted at a spacing in time that is distributed according to $f_i(x) = \alpha_i e^{-\alpha_i x}$, where $\alpha_i = g^{-1} s^i$. There is a cost to move to states of higher index, but not to states of lower index. (b) Given a sequence of gaps between message arrivals, an optimal state sequence in $A^*_{s,\gamma}$ is computed. This gives rise to a set of nested bursts: intervals of time in which the optimal state has at least a certain index. The inclusions among the set of bursts can be naturally represented by a tree structure.
The framework allows considerable flexibility in formulating the cost function; for the work described here, $\tau(\cdot, \cdot)$ is defined so that the cost of moving from a lower-intensity burst state to a higher-intensity one is proportional to the number of intervening states, but there is no cost for the automaton to end a higher-intensity burst and drop down to a lower-intensity one. Specifically, when $j > i$, moving from $q_i$ to $q_j$ incurs a cost of $(j - i)\gamma \ln n$, where $\gamma > 0$ is a parameter; and when $j < i$, the cost is 0. [...] $> 0$, since all gaps are positive.) If $q^*$ is an optimal state sequence in $A^k_{s,\gamma}$, then it is also an optimal state sequence in $A^*_{s,\gamma}$. Before proceeding to the proof, here are two key points to note. First, in all the experiments here, an optimal state sequence in $A^*_{s,\gamma}$ can be found by restricting to a number of states $k$ that is a very small constant, always at most 25. Second, some condition requiring gaps to be positive is necessary in order for the theorem to hold, as the following example shows. Suppose that $x$ were to consist of $n$ gaps, each equal to 0, where $n$ is large enough that $s^n > n^\gamma$. Then the state sequence $q^{(j)}$ in which all $n$ states are equal to $q_j$ has cost

$$c(q^{(j)} \mid x) = j(\gamma \ln n) - n \ln f_j(0) = j(\gamma \ln n) - n \ln \alpha_j = j(\gamma \ln n) - nj \ln s + n \ln g = j(\gamma \ln n - n \ln s) + n \ln g.$$

For increasing values of $j$, these costs $c(q^{(j)} \mid x)$ form a sequence of negative numbers tending to $-\infty$, and hence there is no state sequence in $A^*_{s,\gamma}$ that achieves a cost less than or equal to that of all others. When all gaps are positive, however, no such example is possible, since Theorem 2.1 establishes that there is a state sequence in $A^*_{s,\gamma}$ achieving the minimum cost. Proof of Theorem 2.1. Let $q^* = (q_{\ell_1}, \ldots, q_{\ell_n})$ be an optimal state sequence in $A^k_{s,\gamma}$, and let $q = (q_{i_1}, \ldots, q_{i_n})$ be an arbitrary state sequence in $A^*_{s,\gamma}$. As always, set $\ell_0 = i_0 = 0$, since both sequences start in state $q_0$; for notational purposes, it is useful to define $\ell_{n+1} = i_{n+1} = 0$ as well.
The goal is to show that $c(q^* \mid x) \le$ [...] $\ge j' \ge j^* + 1$, then $-\ln f_{j''}(x_t) \ge -\ln f_{j'}(x_t)$. Since $k = \lceil 1 + \log_s T + \log_s \delta(x)^{-1} \rceil$, one has

$$\alpha_{k-1} = g^{-1} s^{k-1} = \frac{n}{T} \cdot s^{k-1} \ge \frac{1}{T} \cdot s^{\log_s T + \log_s \delta(x)^{-1}} = \frac{1}{T} \cdot \frac{T}{\delta(x)} = \frac{1}{\delta(x)}.$$

Since $\delta(x)^{-1} \ge x_t^{-1}$ for any $t = 1, 2, \ldots, n$, the index $k - 1$ is at least as large as the $j$ for which $-\ln f_j(x_t)$ is minimized. It follows that for those $t$ for which $i_t \ne i'_t$ one has $-\ln f_{i'_t}(x_t)$ [...] $i'_t = k - 1$. Combining these inequalities for the state transition costs and the gap costs, one obtains

$$c(q' \mid x) = \Big(\sum_{t=0}^{n-1} \tau(i'_t, i'_{t+1})\Big) + \Big(\sum_{t=1}^{n} -\ln f_{i'_t}(x_t)\Big)$$

[...] $> 0$. Note that although the final computation of an optimal state sequence is carried out by recourse to a finite-state model, working with the infinite model has the advantage that a number of states $k$ is not fixed a priori; rather, it emerges in the course of the computation, and in this way the automaton $A^*_{s,\gamma}$ essentially "conforms" to the particular input instance.

3 Hierarchical Structure and E-mail Streams

Extracting hierarchical structure. From an algorithm to compute an optimal state sequence, one can then define the basic representation of a set of bursts, according to a hierarchical structure. For a set of messages generating a sequence of positive inter-arrival gaps $x = (x_1, x_2, \ldots, x_n)$, suppose that an optimal state sequence $q = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$ in $A^*_{s,\gamma}$ has been determined. Following the discussion of the previous section, we can formally define a burst of intensity $j$ to be a maximal interval over which $q$ is in a state of index $j$ or higher. More precisely, it is an interval $[t, t']$ so that $i_t, \ldots, i_{t'} \ge j$ but $i_{t-1}$ and $i_{t'+1}$ are less than $j$ (or undefined if $t - 1 < 0$ or $t' + 1 > n$). It follows that bursts exhibit a natural nested structure: a burst of intensity $j$ may contain one or more sub-intervals that are bursts of intensity $j + 1$; these in turn may contain subintervals that are bursts of intensity $j + 2$; and so forth.
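To make the definitions above concrete, here is a minimal sketch of computing a minimum-cost state sequence over k states and reading off the nested bursts. It is an illustration under assumptions rather than the paper's code: the names `optimal_states` and `bursts` are hypothetical, the default for k uses the bound quoted above with delta(x) assumed to be the smallest gap, and bursts are reported as 0-based index ranges over the gap sequence.

```python
import math

def optimal_states(gaps, s=2.0, gamma=1.0, k=None):
    """Minimum-cost state indices for strictly positive gaps, using k states."""
    n = len(gaps)
    if n == 0 or min(gaps) <= 0:
        raise ValueError("need at least one gap, all strictly positive")
    total = sum(gaps)                   # T, the time spanned by the messages
    g = total / n                       # gap size under perfectly even spacing
    if k is None:
        # Assumption: delta(x) is the smallest gap in the sequence.
        k = max(1, math.ceil(1 + math.log(total, s) + math.log(1 / min(gaps), s)))
    rates = [s ** i / g for i in range(k)]      # alpha_i = s^i / g
    up = gamma * math.log(n)                    # cost per level when moving up
    inf = float("inf")
    cost = [0.0] + [inf] * (k - 1)              # the automaton starts in q_0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in range(k):
            best_i, best = 0, inf
            for i in range(k):
                # Moving up costs (j - i) * gamma * ln n; moving down is free.
                c = cost[i] + (up * (j - i) if j > i else 0.0)
                if c < best:
                    best_i, best = i, c
            new_cost.append(best + rates[j] * x - math.log(rates[j]))
            ptr.append(best_i)
        back.append(ptr)
        cost = new_cost
    j = min(range(k), key=cost.__getitem__)
    states = [j]
    for ptr in reversed(back[1:]):
        j = ptr[j]
        states.append(j)
    states.reverse()
    return states

def bursts(states):
    """List (intensity, first_index, last_index) for every maximal burst."""
    out = [(0, 0, len(states) - 1)]             # the whole stream has intensity 0
    for level in range(1, max(states) + 1):
        start = None
        for t, st in enumerate(states):
            if st >= level and start is None:
                start = t
            elif st < level and start is not None:
                out.append((level, start, t - 1))
                start = None
        if start is not None:
            out.append((level, start, len(states) - 1))
    return out

gaps = [100] * 5 + [1] * 10 + [100] * 5
states = optimal_states(gaps)
print(states)
print(bursts(states))
```

This sketch runs in time proportional to n times k squared, which is adequate for small k; since each burst of intensity j + 1 lies inside a burst of intensity j, the returned list already encodes the nesting.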
This relationship can be represented by a rooted tree $\Gamma$, as follows. There is a node corresponding to each burst; and node $v$ is a child of node $u$ if node $u$ represents a burst $B_u$ of intensity $j$ (for some value of $j$), and node $v$ represents a burst $B_v$ of intensity $j + 1$ such that $B_v \subseteq B_u$. Note that the root of $\Gamma$ corresponds to the single burst of intensity 0, which is equal to the whole interval $[0, n]$. Thus, the tree $\Gamma$ captures hierarchical structure that is implicit in the underlying stream. Figure 1(b) shows the transformation from an optimal state sequence, to a set of nested bursts, to a tree. Hierarchy in an e-mail stream. Let us now return to one of the initial motivations for this model, and consider a stream of e-mail messages. What does the hierarchical structure of bursts look like in this setting? I applied the algorithm to my own collection of saved e-mail, consisting of messages sent and received between June 9, 1997 and August 23, 2001. (The cut-off date is chosen here so as to roughly cover four academic years.) First, here is a brief summary of this collection. Every piece of mail I sent or received during this period of time, using my cs.cornell.edu email address, can be viewed as belonging to one of two categories: first, messages consisting of one or more large files, such as drafts of papers mailed between co-authors (essentially, e-mail as file transfer); and second, all other messages. The collection I am considering here consists simply of all messages belonging to the second, much larger category; thus, to a rough approximation, it is all the mail I sent and received during this period, unfiltered by content but excluding long files.
It contains 34344 messages in UNIX mailbox format, totaling 41.7 megabytes of ascii text, excluding message headers (see footnote 1). Subsets of the collection can be chosen by selecting all messages that contain a particular string or set of strings; this can be viewed as an analogue of a "folder" of related messages, although messages in the present case are related not because they were manually filed together but because they are the response set to a particular query. Studying the stream induced by such a response set raises two distinct but related questions. First, is it in fact the case that the appearance of messages containing particular words exhibits a "spike," in some informal sense, in the (temporal) vicinity of significant times such as deadlines, scheduled events, or unexpected developments? And second, do the algorithms developed here provide a means for identifying this phenomenon? In fact such spikes appear to be quite prevalent, and also rich enough that the algorithms of the previous section can extract hierarchical structure that in many cases is quite deep. Moreover, the algorithms are efficient enough that computing a representation for the bursts on a query to the full e-mail collection can be done in real-time, using a simple implementation on a standard PC. To give a qualitative sense for the kind of structure one obtains, Figures 2 and 3 show the results of computing bursts for two different queries using the automaton $A^*_2$. Figure 2 shows an analysis of the stream of all messages containing the word "ITR," which is prominent in my e-mail because it is the name of a large National Science Foundation program for which my colleagues and I wrote two proposals in 1999-2000. There are many possible ways to organize this stream of messages, but one general backdrop against which to view the stream is the set of deadlines imposed by the NSF for the first run of the program.
Large proposals were submitted in a three-phase process, with deadlines of 11/15/99, 1/5/00, and 4/17/00 for letters of intent, pre-proposals, and full proposals respectively. Small proposals were submitted in a two-phase process, with deadlines of 1/5/00 and 2/14/00 for letters of intent and full proposals respectively. I participated in a group writing a proposal of each kind. Turning to the figure, part (a) is a plot of the raw input to the automaton $A^*_2$, showing the arrival time of each message in the response set. Part (b) shows a nested interval representation of the set of bursts for the optimal state sequence in $A^*_2$; the intervals are annotated with the first and last dates of the messages they contain, and the dates of the NSF deadlines are lined up with the intervals that contain them. Note that this is a schematic representation, designed to show the inclusions that give rise to the tree $\Gamma$; the lengths and centering of the intervals in the drawing are not significant. Part (c) shows a drawing of the resulting tree $\Gamma$. The root corresponds to the single burst of intensity 0 that is present in any state sequence. One sees that the two children of the root span intervals surrounding the submission deadlines and notification dates, respectively. Moreover, the sub-tree rooted at the first of these children splits further into two sub-trees that are concentrated over a week leading up to the deadline for letters of intent (11/15/99), and four days leading up to the pre-proposal deadline (1/5/00). Finally, note that there is no burst of positive intensity over the final deadline for large proposal, since we did not continue our large submission past the pre-proposal stage.

[Footnote 1: These figures reveal that I receive less e-mail per day than many of my colleagues; one contributing factor is that I do not subscribe to any high-volume mailing lists based outside Cornell.]

[Figure 2 omitted. Panel (a) plots message number against minutes since 1/1/97; panel (b) shows the nested burst intervals; panel (c) shows the tree with intensities 0 to 5. Dates marked in the figure: 11/15 letter of intent deadline (large proposals); 1/5 pre-proposal deadline (large proposals); 2/14 full proposal deadline (small proposals); 4/17 full proposal deadline (large proposals); 7/11 unofficial notification; 9/13 official announcement of awards.] Figure 2: The stream of all messages containing the word "ITR," analyzed using the automaton $A^*_2$. (a) The raw input data: the x-axis shows message arrival time; the y-axis shows message sequence number. (b) The set of bursts in the optimal state sequence for $A^*_2$, drawn schematically to show the inclusions that form the tree $\Gamma$. (Lengths of intervals are standardized and hence not to scale.) Intervals are annotated with starting and ending dates, and the dates of the NSF ITR program deadlines are lined up with the intervals that contain them. (c) A representation of the tree $\Gamma$, showing inclusions among the bursts.

Figure 3 shows an analysis of the stream of all messages containing the word "prelim," which is the term used at Cornell for (non-final) exams in undergraduate courses.
One sees that the raw data in this example (part (a) of the figure) exhibits an arguably more regular structure than in the previous example. I taught undergraduate courses in four of the eight semesters covered by the collection of e-mail, and each of these courses had two prelims. For the first of these courses, correspondence with students was restricted almost exclusively to a special course e-mail account, and hence very little appears in my own saved e-mail. The remaining three courses are captured very cleanly by the tree $\Gamma$ computed from the optimal state sequence of $A^*_2$ (parts (b) and (c) of the figure) -- each course corresponds to a long burst, and each contains two shorter, more intense bursts for the particular prelims. Specifically, the three children of the root are centered over the semesters in which the three undergraduate courses were taught (Spring 1999, Spring 2000, and Fall 2000); and the subtrees below these children split further into two sub-trees each, concentrated either directly over or slightly preceding the two prelims given that semester. Overall, these structures suggest how a large folder of e-mail might naturally be divided into a hierarchical set of sub-folders around certain key events, based only on the rate of message arrivals.

[Figure 3 omitted. Panel (a) plots message number against minutes since 1/1/97; panel (b) shows the nested burst intervals; panel (c) shows the tree with intensities 0 to 8. Prelim dates marked in the figure: prelim 1 on 2/25/99 and prelim 2 on 4/15/99; prelim 1 on 2/24/00 and prelim 2 on 4/11/00; prelim 1 on 10/4/00 and prelim 2 on 11/13/00.] Figure 3: The stream of all messages containing the word "prelim," analyzed using the automaton $A^*_2$. Parts (a), (b), and (c) are analogous to Figure 2, but date annotations are omitted. In part (b), the dates of prelims (exams) are lined up with the intervals that contain them.
The appropriateness of Forster's comments on the time-sense in narratives is also fairly striking here: when organized by burst intensities, the period of time covered in the e-mail collection very clearly "piles up into a few notable pinnacles" [17], rather than proceeding uniformly.

4 Enumerating Bursts

Given a framework for identifying bursts, it becomes possible to perform a type of enumeration: for every word $w$ that appears in the collection, one computes all the bursts in the stream of messages containing $w$. Combined with a method for computing a weight associated with each burst, and for then ranking by weight, this essentially provides a way to find the terms that exhibit the most prominent rising and falling pattern over a limited period of time. This can be applied to e-mail, and it can be done very efficiently even on the scale of the e-mail corpus from the previous section; roughly speaking, it can be performed in a single pass over an inverted index for the collection. Here, however, I consider a different application of this technique: extracting bursts in term usage from the titles of conference papers. Two distinct sources of data will be used here: the titles of all papers from the database conferences SIGMOD and VLDB for the years 1975-2001; and the titles of all papers from the theory conferences STOC and FOCS for the years 1969-2001. The first issue that must be addressed concerns the underlying model: unlike e-mail messages, which arrive continuously over time, conference papers appear in large batches -- essentially, twenty to sixty new papers appear together every half year. As a result, the automaton $A^*_{s,\gamma}$ is not appropriate, since it is fundamentally based on analyzing the distribution of inter-arrival gaps.
Instead, one needs to model a related kind of phenomenon: documents arrive in discrete batches; in each new batch of documents, some are relevant (in the present case, their titles contain a particular word w) and some are irrelevant. The idea is thus to find an automaton model that generates batched arrivals, with particular fractions of relevant documents. A sequence of batched arrivals could be considered bursty if the fraction of relevant documents alternates between reasonably long periods in which the fraction is small and other periods in which it is large.

Suppose there are $n$ batches of documents; the $t$-th batch contains $r_t$ relevant documents out of a total of $d_t$. Let $R = \sum_{t=1}^{n} r_t$ and $D = \sum_{t=1}^{n} d_t$. Now, define an automaton $B^*_{s,\gamma}$ as follows, by close analogy with the construction of $A^*_{s,\gamma}$. For each state $q_i$ of $B^*_{s,\gamma}$, for $i \ge 0$, there is an expected fraction of relevant documents $p_i$. Set $p_0 = R/D$, and $p_i = p_0 s^i$. Since it does not make sense for $p_i$ to exceed 1, the state $q_i$ will only be defined for $i$ such that $p_i \le 1$; thus, $B^*_{s,\gamma}$ will be a finite-state automaton. One can further restrict $B^*_{s,\gamma}$ to $k$ states, resulting in the automaton $B^k_{s,\gamma}$. Viewed in a generative fashion, one can imagine state $q_i$ in these models as producing a mixture of relevant and irrelevant documents according to a binomial distribution with probability $p_i$.

The cost of a state sequence $\mathbf{q} = (q_{i_1}, \ldots, q_{i_n})$ in $B^*_{s,\gamma}$ is defined as follows. If the automaton is in state $q_i$ when the $t$-th batch arrives, a cost of

$\sigma(i, t) = -\ln\left[ \binom{d_t}{r_t} p_i^{r_t} (1 - p_i)^{d_t - r_t} \right]$

is incurred, since this is the negative logarithm of the probability that $r_t$ relevant documents would be generated using a binomial distribution with probability $p_i$. There is also a cost of $\tau(i_t, i_{t+1})$ associated with the state transition from $q_{i_t}$ to $q_{i_{t+1}}$, where this cost is defined precisely as for $A^*_{s,\gamma}$. A state sequence of minimum total cost can then be computed as in Section 2.

Figure 4: The 30 bursts of highest weight in $B^2_2$, using titles of all papers from the database conferences SIGMOD and VLDB, 1975-2001.

Word             Interval of burst
data             1975 SIGMOD -- 1979 SIGMOD
base             1975 SIGMOD -- 1981 VLDB
application      1975 SIGMOD -- 1982 SIGMOD
bases            1975 SIGMOD -- 1982 VLDB
design           1975 SIGMOD -- 1985 VLDB
relational       1975 SIGMOD -- 1989 VLDB
model            1975 SIGMOD -- 1992 VLDB
large            1975 VLDB -- 1977 VLDB
schema           1975 VLDB -- 1980 VLDB
theory           1977 VLDB -- 1984 SIGMOD
distributed      1977 VLDB -- 1985 SIGMOD
data             1980 VLDB -- 1981 VLDB
statistical      1981 VLDB -- 1984 VLDB
database         1982 SIGMOD -- 1987 VLDB
nested           1984 VLDB -- 1991 VLDB
deductive        1985 VLDB -- 1994 VLDB
transaction      1987 SIGMOD -- 1992 SIGMOD
objects          1987 VLDB -- 1992 SIGMOD
object-oriented  1987 SIGMOD -- 1994 VLDB
parallel         1989 VLDB -- 1996 VLDB
object           1990 SIGMOD -- 1996 VLDB
mining           1995 VLDB --
server           1996 SIGMOD -- 2000 VLDB
sql              1996 VLDB -- 2000 VLDB
warehouse        1996 VLDB --
similarity       1997 SIGMOD --
approximate      1997 VLDB --
web              1998 SIGMOD --
indexing         1999 SIGMOD --
xml              1999 VLDB --

In the analysis of conference paper titles here, the main goal is to enumerate bursts of
positive intensity, but not to emphasize hierarchical structure. Thus, the two-state automaton $B^2_2$ is used; given an optimal state sequence, bursts of positive intensity correspond to intervals in which the state is $q_1$ rather than $q_0$. For such a burst $[t_1, t_2]$, we can define the weight of the burst to be

$\sum_{t=t_1}^{t_2} \left( \sigma(0, t) - \sigma(1, t) \right)$.

In other words, the weight is equal to the improvement in cost incurred by using state 1 over the interval rather than state 0. Observe that in an optimal sequence, the weight of every burst is non-negative. Intuitively, then, bursts of larger weight correspond to more prominent periods of elevated activity. (This notion of weight can be naturally extended to larger numbers of states, as well as to the automaton model from Section 2.)

Figure 5: The 30 bursts of highest weight in $B^2_2$, using titles of all papers from the theory conferences STOC and FOCS, 1969-2001.

Word             Interval of burst
grammars         1969 STOC -- 1973 FOCS
automata         1969 STOC -- 1974 STOC
languages        1969 STOC -- 1977 STOC
machines         1969 STOC -- 1978 STOC
recursive        1969 STOC -- 1979 FOCS
classes          1969 STOC -- 1981 FOCS
some             1969 STOC -- 1980 FOCS
sequential       1969 FOCS -- 1972 FOCS
equivalence      1969 FOCS -- 1981 FOCS
programs         1969 FOCS -- 1986 FOCS
program          1970 FOCS -- 1978 STOC
on               1973 FOCS -- 1976 STOC
complexity       1974 STOC -- 1975 FOCS
problems         1975 FOCS -- 1976 FOCS
relational       1975 FOCS -- 1982 FOCS
logic            1976 FOCS -- 1984 STOC
vlsi             1980 FOCS -- 1986 STOC
probabilistic    1981 FOCS -- 1986 FOCS
how              1982 STOC -- 1988 STOC
parallel         1984 STOC -- 1987 FOCS
algorithm        1984 FOCS -- 1987 FOCS
graphs           1987 STOC -- 1989 STOC
learning         1987 FOCS -- 1997 FOCS
competitive      1990 FOCS -- 1994 FOCS
randomized       1992 STOC -- 1995 STOC
approximation    1993 STOC --
improved         1994 STOC -- 2000 STOC
codes            1994 FOCS --
approximating    1995 FOCS --
quantum          1996 FOCS --

In Figure 4, this framework is applied to the titles of SIGMOD and VLDB papers for the years 1975-2001.
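For readers who want to experiment, here is a minimal Python sketch of the two-state batched model and the burst weight described above. It is not the paper's code. The function names are made up for illustration, and the transition cost (gamma * ln n for moving from q0 up to q1, nothing for moving back down) is my recollection of how Section 2 defines it, so treat that as an assumption. The sketch also assumes the word occurs at least once and not in every title, so that 0 < p0 < 1.

```python
import math

def binom_cost(p, r, d):
    """sigma(i, t): negative log-probability of r relevant documents out of d
    under a binomial distribution with success probability p."""
    log_choose = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
    return -(log_choose + r * math.log(p) + (d - r) * math.log(1 - p))

def two_state_bursts(r, d, s=2.0, gamma=1.0):
    """Return (states, bursts) for batches with r[t] relevant documents out of d[t].

    bursts is a list of (first_batch, last_batch, weight) tuples.
    """
    n = len(r)
    p0 = sum(r) / sum(d)
    p1 = min(p0 * s, 0.9999)      # keep the elevated rate strictly below 1
    probs = (p0, p1)
    up = gamma * math.log(n)      # ASSUMED cost of q0 -> q1; q1 -> q0 is free

    # Dynamic programming over the two states (Viterbi-style), starting in q0.
    cost = [0.0, float("inf")]
    back = []
    for t in range(n):
        new_cost, ptr = [], []
        for j in (0, 1):
            cands = [cost[i] + (up if j > i else 0.0) for i in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + binom_cost(probs[j], r[t], d[t]))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)

    # Trace the minimum-cost state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for t in range(n - 1, 0, -1):
        state = back[t][state]
        states.append(state)
    states.reverse()

    # A burst is a maximal run of state 1; its weight is the cost saved
    # by using state 1 instead of state 0 over that run.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start = t
            while t < n and states[t] == 1:
                t += 1
            weight = sum(binom_cost(p0, r[u], d[u]) - binom_cost(p1, r[u], d[u])
                         for u in range(start, t))
            bursts.append((start, t - 1, weight))
        else:
            t += 1
    return states, bursts
```

Calling `two_state_bursts(r, d)` once per word and sorting all the returned bursts by weight gives the kind of ranking shown in the figures, under the assumptions stated above.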
For each word w (including stop-words), an input to $B^2_2$ is constructed in which $r_t$ is the number of titles at the $t$-th conference (chronologically) that contain the word w, and $d_t$ is the total number of titles at the $t$-th conference. The 30 bursts with the highest weight, over all possible words w, are then depicted in the figure, sorted by year of appearance. The bursts with no given ending date (`mining', `warehouse', `similarity', `approximate', `web', `indexing', and `xml') are those for which the interval extends to the most recent conference, suggesting terms that are in the middle of a large-weight burst at present. Note that no pre-processing is done on the titles, other than to convert each word to lower-case. One observes that the words in Figure 4 are almost all quickly recognizable as carrying technical content, even though they are the top results in an enumeration where bursts were computed and ranked for all words, including stop-words (see footnote 2). Figure 5 shows the results of the same computation on the titles of STOC and FOCS papers for the years 1969-2001.

For both these collections, it is important to note that the number of occurrences of a word w is in general a quantity that, at a local scale, changes very rapidly from one conference to the next; thus, many of the intervals depicted in the figures span conferences in which the indicated word did not appear at all, and omit ones with large numbers of occurrences. The non-trivial cost of state transitions in $B^2_2$ is crucial in making it possible for intervals of any reasonable length to form in the presence of this data.

Footnote 2: The bursts for `data,' `base,' and `bases' in the years 1975-1981 arise in large part from the fact that the term "database" was written as two words in a significant number of the paper titles during this period.

5 Related Work

The Topic Detection and Tracking (TDT) study [2, 3, 61, 62] articulated the problem of extracting significant topics and events from a stream of news articles, thereby framing the type of document stream analysis questions considered here. Much of the emphasis in the TDT study was on techniques for the on-line version of the problem, in which events must be detected in real-time; but there was also a retrospective version in which the whole stream could be analyzed. Similar issues have recently been addressed in the visualization community [26, 43, 60], where the problem of visualizing the appearance and disappearance of themes in a sequence of news stories has been explored.

Following on the TDT work, Swan, Allan, and Jensen [56, 57, 58] developed a method for constructing overview timelines of a set of news stories. For each named entity and noun phrase in the corpus, they perform a $\chi^2$ test to identify days on which the number of occurrences yields a value above a certain threshold; contiguous sets of days meeting this condition are then grouped into an interval that is added to the timeline. Thus, the high-level structure of their approach is parallel to the enumerative method in Section 4. However, the underlying methodology is quite different from the present work in two key respects. First, Swan et al. note that the use of thresholds makes it difficult to construct long intervals of activity for a single feature -- such intervals are often broken apart by brief gaps in which the feature does not occur frequently enough, and subsequent heuristics are needed to piece them together. The present work, by modeling a burst as a state transition with costs, allows for a long interval to naturally persist across such gaps; essentially, in place of thresholds, the optimization problem inherent in finding a minimum-cost state sequence adaptively groups nearby high-intensity intervals together when it is advantageous to do so. Second, the work of Swan et al.
does not attempt to infer any type of hierarchical structure in the appearance of a feature.

Lewis and Knowles analyze the dynamics of message-sending over a very short time scale, searching for features that can determine whether one message is a response to another [37]. This is applied to develop robust techniques for identifying threads, a popular metaphor for organizing e-mail and newsgroup postings [15, 22]. In a very different context, Grosz and Sidner develop structural models for discourse as a means of analyzing communication [21]; their use of stack models in particular results in a nested organization that bears an intriguing, though distant, relationship to the nested structure of bursts studied here.

The present work clearly overlaps with the large areas of time series analysis and sequence mining [10, 25]; connections to related probabilistic frameworks such as bursty on-off sources [32] and hidden Markov models [48] have already been discussed above. Ehrich and Foith [16] proposed a method for constructing a tree from a one-dimensional time series, essentially by introducing a branch-point at each local minimum and a leaf at each local maximum (see also [55]). In the context of the applications here, such an approach would yield trees of enormous complexity, due to the ruggedness of the underlying temporal data, with many local minima and maxima. The search for a minimum-cost state sequence in the automata of Sections 2 and 4 can also be viewed as a search for approximate level sets in a time series, and hence related to the large body of work on piece-wise function approximation in both statistics and data mining (see e.g. [23, 24, 27, 31, 33, 35, 41]). In a discrete framework, work on mining episodes and sequential patterns (e.g. [1, 12, 25, 40]) has developed algorithms to identify particular configurations of discrete events clustered in time, in some cases obeying partial precedence constraints on their order.
Finally, there is an interesting general relationship to work on traffic analysis in the areas of cryptography and security [52]; in that context, temporal analysis of a message stream is crucial because the content of the messages has been explicitly obscured.

6 Extensions and Conclusions

In the settings discussed above, the analysis has made use of both the temporal information and the underlying content. The role of temporal data is clear; but the content of course plays an integral role as well: Section 3 deals with streams consisting of the response set for a particular query to a larger stream; and Section 4 considers streams with batched arrivals, in which a particular subset of each batch is designated as relevant. And in fact, there is strong evidence that the interplay between content and time is crucial here -- that an arbitrary set of messages with the same sequence of arrival times would not exhibit an equally strong set of bursts. Adapting a permutation test from Swan and Jensen [58], one can start with a complete e-mail corpus having arrival times $t_1, t_2, \ldots, t_N$, choose a random permutation $\pi$, and shuffle the corpus so that message $\pi(i)$ arrives at time $t_i$ (instead of message $i$), for $i = 1, 2, \ldots, N$. The resulting shuffled corpus has the same set of arrival times and the same messages, but the original correspondence between the two is broken; do equivalently strong "spurious" bursts appear in this new sequence? In fact, they clearly do not: when the weight of bursts for all words (with respect to $A^*_2$) is computed using the e-mail corpus in Section 3, the total weight associated with the true corpus is more than an order of magnitude greater than the average total weight over 100 randomly shuffled versions (369,980 versus 25,141). Moreover, the shuffled versions exhibit almost no non-trivial hierarchical structure; the average total number of words generating bursts of intensity at least 2 (i.e.
inducing trees $\Gamma$ with two or more levels below the root) is 16.7 over the randomly shuffled versions, compared with 3865 in the true corpus.

I have also applied the overall framework developed here to Web clickstream data collected by Gay et al. [18]. The dataset in [18] was compiled as part of a study of student usage of wireless laptops: The browser clicks of roughly 80 undergraduate students in two particular classes at Cornell were collected (with consent) from wireless laptops over a period of two and a half months in Spring 2000. Bursts with respect to $A^*_{s,\gamma}$ can be computed by an enumerative method, as in Section 4: for every URL w, all bursts in the stream of visits to w are determined; the full set of bursts is then ordered by weight. Each burst, associated with a URL w, now has an additional quantity associated with it: the number of distinct users who visited w during the interval of the burst. This allows one to distinguish between collective activity involving much of the class and that of just a single user. As it turns out, if one focuses on bursts that involve at least 10 distinct users, then many of those with the highest weight involve the URLs of the on-line class reading assignments, centered on intervals shortly before and during the weekly sessions at which they were discussed.

A final observation is that the use of a model based on state transitions leads to bursts with sharp boundaries; they have clear beginnings and ends. In particular, this means that for every burst, one can identify a single message on which the associated state transition occurred. This is akin to the TDT study's notion of (retrospective) first story detection [2], although in the automaton model of the present work, identifying initial messages does not constitute a separate problem since it follows directly from the definition of the state transitions.
In the context of e-mail, the contents of such an initial message can often serve as a concentrated summary of the circumstances precipitating the burst -- in other words, there is frequently something in the message itself to frame the flurry of message-sending that is about to occur. And for messages on which bursts for several different terms are initiated simultaneously, this phenomenon is even more apparent; such messages often represent natural "landmarks" at the beginning of a long-running episode.

In many domains, we accumulate extensive and detailed records of our own behavior -- in the e-mail we send and receive, the Web pages we visit, the queries we issue to search engines. An underlying theme, of which several aspects have been developed here, is that a great number of these settings have a fundamental temporal aspect; they are punctuated by the sharp and sudden onset of particular episodes, and can be organized around rising and falling patterns of activity. There is a great amount of complexity underpinning such a picture. But by developing a better understanding of it, one can hope ultimately to find structure in the raw data that we generate through the basic process of interacting and communicating.

Acknowledgements. I thank Lillian Lee for valuable discussions and suggestions throughout the course of this work.
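The permutation test from Section 6 of the pasted text (keep the arrival times, shuffle which message arrives at each one, then compare total burst weight) can be sketched roughly as follows. This is an illustration only: `total_weight` is a hypothetical callback standing in for whatever burst-weight computation is used, and the function names are mine, not the paper's.

```python
import random

def shuffled_corpus(times, messages, rng):
    """Keep the arrival times in place but assign message perm[i] to time t_i."""
    perm = list(range(len(messages)))
    rng.shuffle(perm)
    return [(times[i], messages[perm[i]]) for i in range(len(times))]

def permutation_baseline(times, messages, total_weight, trials=100, seed=0):
    """Return (weight of the true corpus, mean weight over shuffled corpora).

    total_weight is a caller-supplied function (hypothetical here) that takes
    a list of (time, message) pairs and returns the summed weight of all bursts.
    """
    rng = random.Random(seed)
    observed = total_weight(list(zip(times, messages)))
    shuffled = [total_weight(shuffled_corpus(times, messages, rng))
                for _ in range(trials)]
    return observed, sum(shuffled) / trials
```

If the observed weight is far above the shuffled average, as the text reports for the real corpus, the bursts are unlikely to be an artifact of the arrival times alone.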

    1. Re:text inside the PS file - references cut by GutBomb · · Score: 1

      well, i tried to paste it in here but the lameness filter would not let me. try this.

    2. Re:text inside the PS file - references cut by GutBomb · · Score: 0

      oh it looks as if the filter DID let me...
      /me slams head against desk

  15. What's wrong with IMAP ? by MagicFab · · Score: 1

    IMAP (Internet Message Access Protocol) was designed to centralize email information, I believe. If stored/implemented with a database, what more would you need ?

    I think querying through SQL would satisfy most of us.. and be very useful in corporate environments (for example, query all email sent from a user to support), and it's already done by some projects like DBMAIL.

    Anybody out there with experience using these ?

    BTW, there's an extensive database of IMAP products including some that make the data accessible via LDAP... hours of fun!

    --
    Notepad specialist & FAT administrator, group training available
    1. Re:What's wrong with IMAP ? by statusbar · · Score: 5, Interesting

      DBMAIL looks cool, once it supports postgresql it would be awesome.

      I have been disappointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.

      So last year I bit the bullet and wrote a 50 line python program which imported all my mbox and Maildir format archives into a simple postgresql database. 600 megs worth over the last 4 years.
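The script itself is not posted, so the following is only a guess at its general shape, not statusbar's code. To keep it runnable with nothing but the standard library, it uses `sqlite3` in place of PostgreSQL and handles a single mbox file. The table name `msgs` and column `msg_from` are borrowed from the query quoted further down the thread; the other column names are made up.

```python
import mailbox
import sqlite3

def import_mbox(mbox_path, db_path):
    """Copy every message of one mbox file into an SQLite table; return the count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS msgs ("
        " id INTEGER PRIMARY KEY,"
        " msg_from TEXT, msg_subject TEXT, msg_date TEXT, raw TEXT)"
    )
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for msg in box:
            # Headers may be missing, so fall back to an empty string.
            conn.execute(
                "INSERT INTO msgs (msg_from, msg_subject, msg_date, raw)"
                " VALUES (?, ?, ?, ?)",
                (str(msg["From"] or ""), str(msg["Subject"] or ""),
                 str(msg["Date"] or ""), msg.as_string()),
            )
            count += 1
        conn.commit()
    finally:
        box.close()
        conn.close()
    return count
```

Storing the whole raw message in one text column keeps the sketch short; a Maildir archive could be read the same way with `mailbox.Maildir`.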

      And another simple 50 line php program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much much smaller) imap folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial strength email database system and I am surprised that it hasn't happened sooner in the open source world.

      I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.

      Jeff

      --
      ipv6 is my vpn
    2. Re:What's wrong with IMAP ? by ryochiji · · Score: 1

      Personally, I find the search command (part of the IMAP specs) to be sufficient for finding old messages. I usually have a general idea of which folder a particular message might be and some other key word or information, which is all I need to find the message in question.

      The only times I have problems with this are when I have to search through a mailing list archive containing several years' worth of digests... so, I guess a more "intelligent" solution would have its uses.

    3. Re:What's wrong with IMAP ? by Tracy+Reed · · Score: 1

      Maildir has done very well for me. Fast and reliable. A filesystem IS a database. I don't think there is much need to put email into any other sort of database. It just adds another unnecessary layer of complexity. I wouldn't mind seeing a database used to store metadata so the mail can be quickly searched but I'd prefer to leave the emails themselves in the filesystem and have the database of metadata contain pointers to the filenames in the filesystem.

    4. Re:What's wrong with IMAP ? by statusbar · · Score: 2

      Maildir DOES work great, and I use it myself for non archived emails.

      They work great until you have lots of messages.

      My postgresql email database contains 54,244 email messages. Current filesystems do not like having that many files in one directory. A filesystem is NOT a database - it only has one field (filename) that you can do queries on.

      The database allows me to properly index the fields such as 'date', 'subject', and 'from' - for instance:

      select count(*) from msgs where msg_from = '<myemailaddress@domain>';

      returns a result in a fraction of a second because it does not have to iterate through all my messages. Whereas a Maildir directory with 5000 files in it can not be grokked with wildcards. Try it!:

      $ ls *
      bash: /bin/ls: Argument list too long

      A big problem with the concept of putting only metadata in the database and the content in the filesystem is that you end up making the system even more complex as you need two different ways of accessing the data and the data is split between two sources.

      The two different ways of accessing the data is a problem when you want to access the emails from another computer. For me it is simple - my Mac OS X machine can make a postgresql connection to my linux server and do queries including message content easy and quick.

      SQL databases nowadays handle large text fields and blobs just fine and make it dead simple to back up, process, or query all the data.

      'Folders' can be just SQL VIEWs and are way more flexible than separate Maildirs for each folder.
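As a rough illustration of the "folders as views" idea, here is a small Python sketch using the standard `sqlite3` module rather than PostgreSQL. The `msgs` table and `msg_from` column come from the query above; the index name, the view name, and the helper functions are hypothetical.

```python
import sqlite3

def demo_db():
    """Build a tiny in-memory mail table with an index on the sender column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, msg_from TEXT, msg_subject TEXT)")
    conn.execute("CREATE INDEX idx_msgs_from ON msgs (msg_from)")
    conn.executemany(
        "INSERT INTO msgs (msg_from, msg_subject) VALUES (?, ?)",
        [("support@example.com", "ticket 1"),
         ("support@example.com", "ticket 2"),
         ("friend@example.org", "lunch?")],
    )
    return conn

def make_folder_view(conn, folder, sender):
    """Define a 'folder' as a view over all mail from one sender."""
    # View definitions cannot take bound parameters, so the name is restricted
    # to a plain identifier and the address is quoted by hand.
    if not folder.isidentifier():
        raise ValueError("folder name must be a plain identifier")
    quoted = sender.replace("'", "''")
    conn.execute(
        f"CREATE VIEW {folder} AS SELECT * FROM msgs WHERE msg_from = '{quoted}'"
    )
```

Because a view is only a stored query, one message can show up in any number of such folders, and new mail appears in them without being moved or copied.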

      By the way, I would LOVE to see an imap server and email client program handle 54,244 messages in one folder that I can view and search different ways without it bogging down or trying to cache 600 megs of data locally or trying to make a single list box with 54,244 items in it (and usually crashing in the process)

      Please show me one so I can use it instead of writing my own smtp to postgresql gateway.

      --Jeff

      --
      ipv6 is my vpn
  16. look by Joe+the+Lesser · · Score: 4, Funny

    Now we all know that most email is delivered promptly by gremlins, but gremlins are hungry and will eat a few bytes here and there.

    They also leave waste in the form of spam.

    So, I propose that we turn to gnomes to deliver the mail instead, as they are much cleaner, and can be satiated by attaching a file like 'Hamburger.txt'.

    --
    "I only speak the truth"
    Karma: null(Mostly affected by an unassigned variable)
  17. The joys of owning a domain by CaptainPhong · · Score: 5, Insightful
    I've found the most joy from owning my own domains, and a lot of it has to do with e-mail sorting/filtering as much as the traditional benefits (a permanent www.yourdomain.com web site address and yourname@yourdomain.com e-mail address).

    Every time you sign up for some mailing list or discussion group, create a new e-mail account or alias for just those mailings. Bam, it's automatically sorted out by itself with extreme ease. If you have limited bandwidth (or are checking, say, on your palm) sometimes, just check your important addresses frequently, and reserve your mailing lists for a once-per-day check.

    If some site asks for your e-mail address to download a piece of software, or to register, make up a new alias and give that to them. If you start getting tons of crap at that address, you can just remove that alias, and they get it all bounced back in their stupid spamming faces.

    Give one address to your cow-orkers just for work stuff. Give a different one to your Mom and other techno-nots that blocks all attachments. Give another one to your friends with brains that goes unfiltered. For people you don't want to talk to, give them the address of an autoresponder tied to Eliza.

    Be a *Happy Camper* and let your addresses be *Bubbles* and you be just *You*.

    --
    ... "Give me a woman who loves beer and I will conquer the w
    1. Re:The joys of owning a domain by berck · · Score: 1

      I use sneakmail.com for giving my address out on webforms where I'm worried about spam. Easy for those of us without FQDNs.

    2. Re:The joys of owning a domain by Anonymous Coward · · Score: 0

      And most importantly:

      "It is funny enough. Do not forget to *enjoy the sauce*!"

      (It's on topic, for those of you who haven't played Star Control II)

    3. Re:The joys of owning a domain by Anonymous Coward · · Score: 0
      Yeah, but with all of the trolls on /., I tend to want to say:


      *Ngaaaaaa*! *Squeezing juice*! You will be *sick* for the last time!


      (This happens if you talk about the Androsynth one too many times...)


      Incidentally, SC2 is still my favorite game ever... and I have played a lot of games. No game, in my opinion, has come close in terms of plot and gameplay. I _still_ like playing it... even though it was released 10 years ago!


      Speaking of which, Happy 10th anniversary, SC2! I only hope it doesn't end up suffering the fate of the Precursors...

    4. Re:The joys of owning a domain by UnknownError · · Score: 1

      Small amendment to CaptainPhong's three categories of mail senders ("Mom and other techno-nots"; "friends with brains"; and "people you don't want to talk to"): Couldn't you just say "techno-nots" and leave motherhood out of this? Giving birth doesn't lead, necessarily, to a life of spewing unwelcome email attachments. In fact, I've heard some mothers even write code....

      Thanks, though for the autoresponder-tied-to-Eliza idea.

  18. Uh... by willis · · Score: 1
    I'm not sure if this was for karma or not, but you might want to format text before dumping it like this. Just looks more professional.

    --

    there is no thing
    what else could you want?
  19. Re:Link to a postscript file? by Pfhreakaz0id · · Score: 1

    How about plain text, HTML, RTF, or PDF? Every person who's been on the internet longer than two months has Acrobat.

  20. Re:hi, i need to ask a question by ldopa1 · · Score: 1

    I used to know how to do this, but I forgot it as soon as I figured it out. Go figure...

    What was I talking about?

    --
    The Dopester
    "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
  21. Re:Link to a postscript file? by tps12 · · Score: 2

    Not to mention that a PDF would be 10x the size. I have no idea why Mac and Windows OS's are so baffled by postscript...half the printer drivers have to deal with it already, why not just bundle a damn interpreter with the OS and have a minimal frontend on it for screen viewing?

    --

    Karma: Good (despite my invention of the Karma: sig)
  22. Re:she said this, and she created COBOL too by Rorschach1 · · Score: 2

    Damn you, Admiral Hopper! I've got a huge stack of COBOL listings on my desk that I've got to translate to, of all things, vbscript (damn you, Bill Gates!)

  23. Been there, done that by ansonyumo · · Score: 1
    I worked for a dot-com that built an application to do just this. Our focus was on product management, to automatically classify incoming email about products to specific categories in a hierarchy. The classification part was very general and could have been used for just this purpose.

    I say could have because it got sucked down the drain in late 2000 with all of the companies that didn't have a damn thing to offer. Lesson #1: make sure your CEO gets along with your venture partner.

    No, I'm not bitter. Much.

  24. how 'bout common-sense? by jptwo · · Score: 1

    a nice methods paper, but mr. kleinberg doesn't use any of the free metadata that comes with email and news: to, from, subject.

    i use outlook, and cluster my mail by sender... most of the time, that tells me pretty easily whether a given piece of mail is a work email, a personal email or a mailing list. from there, i check the subject line of work emails, just to confirm my categorization of work/humor/administrivia.

    i'd want to see a comparison between a metadata-only method (rules and filters on the RFC 822 header) and mr. kleinberg's method before i'd consider using it.
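A metadata-only baseline of the kind suggested here might look roughly like the Python sketch below. The category names and the rule order are made up for illustration, and the `List-Id` and `Precedence` checks are an assumption about how list mail is usually marked, not something taken from the paper or from Outlook.

```python
from email import message_from_string
from email.utils import parseaddr

def classify(raw_message, work_domains, personal_senders):
    """Sort one message into a coarse category using only its headers."""
    msg = message_from_string(raw_message)

    # Mailing-list software commonly adds one of these headers (assumption).
    precedence = (msg["Precedence"] or "").lower()
    if msg["List-Id"] or precedence in ("list", "bulk"):
        return "mailing-list"

    # Otherwise cluster by sender address, as described above.
    _, addr = parseaddr(msg["From"] or "")
    addr = addr.lower()
    if addr in personal_senders:
        return "personal"
    if addr.rpartition("@")[2] in work_domains:
        return "work"
    return "unsorted"
```

Running something like this over the same corpus would give the header-only baseline to compare against the burst-based grouping.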

  25. Remembrance Agent by Tekmage · · Score: 5, Informative

    It's more general than e-mail, but in the wearable computing community, there's a little application called Remembrance Agent, written by Bradley Rhodes that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.

    I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)

    Check out Brad's site for more insight into the work he did and is doing.

    --
    --The more you know, the less you know.
    1. Re:Remembrance Agent by jamieo · · Score: 1

      Yes I've used this a little and it's *very* nice. You're writing an email, or any document, and the bottom of your screen has a list of other emails and documents related not only to the mail/document you're writing, but the part of the document you're writing!!!

      I first saw this a few years ago and when you first use it it blows you away. Why this hasn't become wildly popular I don't know.

      Check it out.

      Jamie

    2. Re:Remembrance Agent by xyzzy · · Score: 2

      Perhaps because it's only compatible with RMAIL through a one-of-a-kind elisp interface?

      The correct question is, why hasn't someone taken it and tried to hook it into a somewhat more common platform.

    3. Re:Remembrance Agent by Tekmage · · Score: 2

      Which would be why and what I'm playing with... :-)

      I have it hooked into the system clipboard, so getting info to and from the RA is easy via that mechanism. The true power will come when/if there happens to be a way to "watch the keys". Kind of like a key-logger, but I'd rather not have it watching from the keyboard side; it should only be watching what's visible, not everything including passwords.

      The challenge (learning curve on my part) is getting deep enough access to system level interfaces via Java... Focus-independent access to mouse and keyboard input streams. Also have to work up an ra-index wrapper; more a function of the JRAServer than JRAClient classes.

      --
      --The more you know, the less you know.
    4. Re:Remembrance Agent by Anonymous Coward · · Score: 0

      What's the point? RMAIL is dead. Use VM-mode instead - you'll love it.

  26. Re:Link to a postscript file? by tps12 · · Score: 2

    Postscript has the best reproduction accuracy for the file size. Assuming it has any kind of figures or equations, the only other reasonable alternatives are dvi and pdf. I've never seen dvi files rendered in a decent amount of time, and pdf is too fat, esp. for a paper linked to by slashdot. :)

    --

    Karma: Good (despite my invention of the Karma: sig)
  27. One use to rule them all by Col.+Panic · · Score: 3, Funny

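    use strict;
    use warnings;

    my $dir    = "/home/mail";
    my $pr0n   = "adult";
    my $spam   = "viagra";
    my $urgent = "penis enlargement";
    my (@MUST_SEE, @RAINY_DAY);

    opendir(my $inbox, $dir) or die "Damn! No fun for me: $!\n";
    my @list = readdir($inbox);
    closedir($inbox);

    foreach my $msg (@list) {
        my $path = "$dir/$msg";
        next unless -f $path;            # skip ".", ".." and subdirectories
        if ($msg =~ /\Q$spam\E/i) {      # file name mentions the spam keyword
            unlink($path);
            next;
        }
        if ($msg =~ /\Q$pr0n\E/i) {
            push @MUST_SEE, $path;
            next;
        }
        if ($msg =~ /\Q$urgent\E/i) {
            push @RAINY_DAY, $path;
            next;
        }
    }
    # or something like that ...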

  28. Re:Link to a postscript file? by Anonymous Coward · · Score: 0

    Amen, brother.

    Especially considering NeXT used postscript for its internal rendering, you'd think OSX could at least do the same.

  29. Six Degrees from Creo by PHAEDRU5 · · Score: 2


    OK, it's not a piece of Linux software, but it is a beautiful idea:

    http://www.creo.com/sixdegrees/

    --
    668: Neighbour of the Beast
  30. Re:Link to a postscript file? by SuiteSisterMary · · Score: 4, Funny
    why not just bundle a damn interpreter with the OS and have a minimal frontend on it for screen viewing?
    Gee, wouldn't that be illegally using their monopoly to muscle out third party developers? Why, if the OS had a PS viewer built in, nobody would ever buy one! Businesses would go bankrupt!
    --
    Vintage computer games and RPG books available. Email me if you're interested.
  31. PDF File by bheilig · · Score: 1

    Here is a PDF version.

    Boy could I use the Karma!

    1. Re:PDF File by bheilig · · Score: 1

      Crappy geocities... Is there someplace else I can put it?

  32. Censorship? by PCM2 · · Score: 2

    Hoo boy. Here we go again. When are you kids going to get it straight?

    - Choosing not to listen to somebody is *not* censorship.
    - Throwing your mail away before you open it is *not* censorship.
    - Choosing not to relay somebody's spam is *not* censorship.
    - Choosing not to broadcast somebody's TV program, even if you own a TV network, is *not* censorship.
    - Telling a movie producer you won't distribute his/her movie unless he/she makes cuts or changes to the subject matter is *not* censorship.
    - Rallying your church group together to burn books is *not* censorship.
    - Refusing to sell certain magazines or newspapers, if you own a newsstand, is *not* censorship.

    The only way somebody can be truly "censored" is when there is no legal means for that person to get his/her speech/art/etc. produced and disseminated to the public. Generally speaking, the only body with that type of power is the government -- because they make the laws.

    Everything else is merely an inconvenience. It may piss you off, sure, and you may wish things were different. But you can't force people to support you, encourage you, or fund you if they just don't want to. For example, people in this country (the US) *do* have a right to decide what material constitutes pornography, relative to their local community standards -- and if you don't like it, you are within your rights to move to another town.

    "No censorship" does not mean being forced to look at every piece of crap that somebody wants to throw in your face, and god help us if it did.

    --
    Breakfast served all day!
    1. Re:Censorship? by alouts · · Score: 2, Interesting
      Very valid points but:

      The post you're ranting against was a reply to one that suggests filtering is not what we should do. That spam needs to be "killed at the source". Which means legally preventing someone from creating any mail in the first place.

      Say what you will about spammers, but that IS censorship.

      ('Course there's plenty of people here who believe that censorship is fine in this case, but that's not what you're arguing, so I won't either.)

    2. Re:Censorship? by ldopa1 · · Score: 1

      I was saying that killing it at the source is censorship, not filtering it out.. So, I am in complete agreement. If I wasn't clear enough on that, I'm sorry..

      --
      The Dopester
      "Yes, I'm a Karma Whore, but I'm doing it to pay my way through school."
    3. Re:Censorship? by t · · Score: 2
      It is only censorship if "killed at the source" means to literally kill the fucknut sending the spam. Death is the most convenient and widely implemented form of censorship in places like China.

      Preventing someone from sending emails is NEVER censorship by definition. They can always go to Kinkos and make plain old paper mailings and then mail them to everyone on the planet.

      t.

  33. Finally... by Aiku1337 · · Score: 2, Funny

    Now I can automatically filter my barely-legal porn spam from my anime porn spam. Let's hear it for technology =)

  34. Postscript document by Tim+Ward · · Score: 3, Interesting

    Somewhat to my astonishment when I clicked on the link up popped a box asking me to confirm Postscript Renderer options! I had no idea that I had anything on this box that could read Postscript.

    Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.

    But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.

    1. Re:Postscript document by rgmoore · · Score: 2
      But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.

      What's so strange about it? Postscript has the great advantage that it's actually designed to describe exactly what's on the page. That lets you produce very nicely formatted documents that will render exactly the same way on any computer, which makes it the output format of choice for programs like TeX. It's great because it's easy to print, so people who prefer to see things in dead tree format can do so easily. It can be processed into PDF very easily, too, so people who like PDFs won't have any problems. Sounds like a good choice to me.

      --

      There's no point in questioning authority if you aren't going to listen to the answers.

    2. Re:Postscript document by ansonyumo · · Score: 1

      Most academic publications are delivered in either Postscript or TeX, it's not unusual.

    3. Re:Postscript document by ansonyumo · · Score: 1

      Furthermore, PS isn't a low-level printer file. It is a page description language, and a very powerful one at that. It was the language used to implement the GUIs of Sun's NeWS and, via Display PostScript, NeXT's NEXTSTEP, and its offshoot PDF is used in the Aqua GUI.

    4. Re:Postscript document by Permission+Denied · · Score: 1
      But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.

      Not bizarre at all. Is it bizarre that people post things in PDF format? Should they drop PDF and just post using MS Word instead, since Word is a far more universal and portable method for distributing professionally typeset documents?

      What's the relationship between PostScript and PDF? Look into it and you'll see PDF was created just to deal with a couple of issues that distributing PS files has, and PDF is not far removed from PS.

      Search the web for any mathematics papers, and you'll find most of them in PostScript. Recently, I've seen some people using PDF for this purpose, but PS is far more prevalent in the math and CS communities.

      Years ago, there was a move to make the IETF standardize RFCs in PostScript format since it was almost as universal among the intended audience as plain text (the current format for RFCs).

      PostScript is not a stupid low-level printer language like PCL. PostScript is a beautiful, full-fledged, powerful programming language, and contains programming constructs that are far more "high-level" than, for instance, C or C++ (like equivalence of data and code, something you usually don't find in procedural languages). It's been loved by computer professionals for years.

      If you're interested, do a google search and you'll find the "blue book" the "red book," etc. Learning PostScript will change the way you think about programming, which should really be the important reason for learning a new language.

      Don't diss PostScript.

    5. Re:Postscript document by jonbrewer · · Score: 2

      .ps generally sounds like a good idea to many science-types.

      I think it's rather tiring.

      If I didn't have a full install of Acrobat on my system, I wouldn't have bothered with it. (It configured itself to handle .ps documents by converting them into .pdf.)

      .pdf has been around for as long as the commercial Internet, and is understood by every computer I've used in the past five years. It can be created by innumerable commercial and free (as in beer and as in speech) tools. It can be read by Acrobat Reader, a fantastic free (as in beer) tool from Adobe.

      There really are no reasons to publish in .ps other than whim, elitism, or ignorance. All of those being sins in my book.

    6. Re:Postscript document by t · · Score: 2
      The reason is ignorance but not on the part of the publishers.

      Acrobat is shit.

      ggv will view .ps, .pdf, .ps.bz2, .ps.gz, probably others. Works great. There is no reason to differentiate between any of them. And if you really must ps2pdf works quite well.

      t.

    7. Re:Postscript document by Tim+Ward · · Score: 2

      Yes dear, I have written code in PostScript, both hand-coded programs (to generate forms etc) and machine generated (ie I've written PostScript printer drivers). In the mid 1980s IIRC.

      But it's still not an appropriate language to distribute documents that you want anyone other than Unix users to read. This is a historical accident. For anyone who is too young to remember, this came about because the first decent laser printer happened to be a PostScript machine, and Unix didn't develop a printer driver model - instead everyone just emulated, one way or another, the LaserWriter. (I don't know if this has changed, I haven't found it profitable to do much work on Unix graphical apps the last few years.)

      PDF is vastly more sensible as a general distribution format.

      I usually take the distribution of a document in PostScript format as a message that I'm not part of the intended audience, and I don't read it. If I'm really not part of the intended audience then that's fine, of course, and everybody's happy; but if I was intended to use it then they got the format wrong.

    8. Re:Postscript document by Permission+Denied · · Score: 1
      Yes dear, I have written code in PostScript, both hand-coded programs (to generate forms etc) and machine generated (ie I've written PostScript printer drivers). In the mid 1980s IIRC.

      Fair enough; you gain my respect. However, your original post made it seem like this was something new (We'll have people posting documents in PCL5 next, perhaps a subtle troll?).

      Like you said, whether posting PostScript is appropriate depends on your audience. This paper was from a guy in the CS department at Cornell, so I'd say it was altogether appropriate. I'd say less than one tenth of the posters here on slashdot read the paper, since it talks about automata theory and does not at all concentrate on spam busting. Those versed in automata theory have probably been through a traditional CS/Math program which means they've certainly seen papers in PostScript.

  35. Re:Link to a postscript file? by Anonymous Coward · · Score: 0

    It shouldn't take two months to get ghostscript and ghostview. In fact, it comes with most modern operating systems.

  36. Re:Link to a postscript file? by Kiaser+Zohsay · · Score: 2

    Actually, ghostscript created a PDF about half the size of the .ps file.

    -rw-r--r-- 1 kz None 239121 Apr 24 14:13 bhs.pdf
    -rw-r--r-- 1 kz None 433678 Apr 24 14:02 bhs.ps

    Of course, the PDF is Flate encoded internally, and the ps is a big fluffy text file, so the ps file would compress to well below the PDF size.

    --
    I am not your blowing wind, I am the lightning.
  37. procmail! [Re:The ultimate spam blocker?] by Styx · · Score: 5, Informative
    I use procmail, with weighted scoring.
    First, I sort out mail from the mailing lists I read.
    Then, mail from friends, and people I correspond with a lot.
    Finally, I have a weighted scoring recipe:

    :0 Bh
    * -199^0
    #Assign an initial value of -199, mail gets filtered, if the score is above 0, at the end of the recipe.
    * 50^1 ^(From|To):.*@hotmail.com
    * 50^1 ^(From|To):.*@yahoo.com
    * 50^1 ^(From|To):.*@aol.com
    * 50^1 ^(From|To):.*@msn.com
    * 50^1 ^(From|To):.*@excite.com
    * 50^1 ^(From|To):.*@netscape.net
    * 50^1 ^(From|To):.*@yahoo.co.uk
    #Most mail to and from these domains is spam, so score it.
    * 100^1 opt-out
    * 50^1 opt-in
    * 200^1 OTCBB
    * 50^1 viagra
    * 50^1 zyban
    * 50^1 propecia
    * 75^1 FREE
    * 75^1 GUARANTEED
    * 75^1 LEGAL
    * 50^2 MILLIONAIRE
    * 50^1 100%
    #Words I only see in spam.
    mail/Trash

    This works quite well for me. If any spam gets through, I try to find some words, that I don't get in normal mail, and add them to the scoring.
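    For anyone who finds the w^x notation opaque, here is a rough Python sketch of how I understand procmail's scoring (per the procmailsc man page): the first match of a pattern adds w, the second adds w*x, the third w*x^2, and so on. The rule table and function names below are made up for illustration, and this is only an approximation of what procmail does, not its actual implementation.

```python
import re

# Hypothetical rule table mirroring "* w^x regex" conditions: (weight, exponent, pattern).
RULES = [
    (50, 1, r"viagra"),
    (100, 1, r"opt-out"),
    (50, 2, r"MILLIONAIRE"),
]


def score(text, rules=RULES, initial=-199):
    """Approximate a procmail-style weighted score for one message."""
    total = initial
    for weight, exponent, pattern in rules:
        # procmail matches case-insensitively unless told otherwise.
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        # The k-th match (k = 0, 1, 2, ...) contributes weight * exponent**k.
        total += sum(weight * exponent ** k for k in range(hits))
    return total


def is_spam(text, rules=RULES, initial=-199):
    """The message is filtered when the final score ends up above zero."""
    return score(text, rules, initial) > 0
```

    With an exponent of 1 every repeat counts the same; with 2 (as on the MILLIONAIRE line) repeats escalate quickly; with 0 only the first match counts.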

    --
    /Styx
    1. Re:procmail! [Re:The ultimate spam blocker?] by bruckie · · Score: 4, Informative

      Or you could just use SpamAssassin, which is designed specifically to do this and has many more rules that have been created by others.

      --Bruce

      --
      There are 10 kinds of people in the world: those who understand binary, and those who don't.
  38. Since 5.0 it can by barzok · · Score: 3, Informative

    Message rules are very easy to set up and manage. No agents.

    1. Re:Since 5.0 it can by Milalwi · · Score: 2

      Message rules are very easy to set up and manage. No agents.

      Yeah, but you don't get any visual indication that there is new mail in your folders. You get told that there's new mail somewhere, but you have to go through your folders individually to find it. I have over 60 folders... do you think I'm going to use the message rules to automatically file them when I might not notice that they were there?

      If there's a way to provide visual indication of where the message got filed, I'm listening.

      Milalwi
  39. Re:I don't hate Jews. by fscking_coward_2001 · · Score: 0, Flamebait

    If [Jordan|Saudi Arabia|Egypt|etc] is so pro-Palestine, why don't they simply absorb all the Palestinians?

    Thanks for the brilliant point, Mr Anonymous "I'm so lame I won't post as myself" Coward. You win the Slashdot Sophistic Argument of the Day award!

  40. That's not necessarily the point.. by cheesyfru · · Score: 2, Interesting

    Spam filtering is one possible application of this type of tool, but the more useful one involves taking the mail you *do* want and sorting it into logical buckets. For instance, let's say you work on several open source projects, belong to a couple of organizations, and have a real-life job. You could toss a filter in your email that scans each incoming message and throws it in the proper bucket. This allows you to logically separate your mail to reduce confusion between non-overlapping categories.

    Procmail only goes so far, it's really only useful for simple header scanning.. I could really see a good scanner utility being a valuable tool. Maybe Google should share some of their technology.. :-)

    1. Re:That's not necessarily the point.. by 4of12 · · Score: 2

      Dynamic folders or views of your email would be a Wonderful Thing.

      I can't say how constraining it is to have statically defined folders which I have to move mail into based on my selection.

      Procmail helps to do this dynamically based on simple criteria, but when you want to have a particular piece of email show up in multiple views without having multiple copies, it really calls for associating named "views" of the whole mess with specific search and sorting criteria.

      That way, one view is "Latest Unread Messages" which has a particular message in it that might also show up in "Most Recent Messages about Project X" and in "Most Recent Messages from Boss".

      I'd love to have my email client show multiple views this way.
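      A minimal Python sketch of what I mean, assuming each view is nothing more than a named predicate over one shared mailbox. The Mail fields, the boss address and the view table are all invented for the example; no real mail client is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mail:
    sender: str
    subject: str
    received: int  # hypothetical timestamp, larger means newer
    unread: bool


# Each view is a name plus a search criterion; no message is ever copied.
VIEWS = {
    "Latest Unread Messages": lambda m: m.unread,
    "Most Recent Messages about Project X": lambda m: "project x" in m.subject.lower(),
    "Most Recent Messages from Boss": lambda m: m.sender == "boss@example.com",
}


def view(name, mailbox):
    """Return the messages matching the named view, newest first."""
    matches = [m for m in mailbox if VIEWS[name](m)]
    return sorted(matches, key=lambda m: m.received, reverse=True)
```

      The same Mail object can show up in all three views at once, which is exactly what static folders can't do without duplicating it.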

      --
      "Provided by the management for your protection."
  41. Re:Link to a postscript file? by nphillips · · Score: 1

    -rw-r--r-- 1 phillips cmb 181384 Apr 24 14:59 bhs.ps.gz

    And the winner is...gzipped postscript, which needn't be ungzipped before viewing.

  42. Re:Link to a postscript file? by MaxVlast · · Score: 2

    Alas, no. Adobe wanted ridiculous prices to license Display PostScript (DPS), the engine that NeXT used in the NEXTSTEP display system. (NeXT is a company. NEXTSTEP is an operating system.)

    Given the ridiculous licensing prices, Apple went a different way and created Display PDF for Mac OS X's drawing system.

    Ghostscript works just fine, but the lack of DPS is one of the reasons I still keep a NeXT cube on/under my desk.

    --
    There should be a moratorium on the use of the apostrophe.
    Max V.
    NeXTMail/MIME Mail welcome
  43. News for Nerds by lydon · · Score: 2, Informative

    Why are there so many people complaining about a PS link? The answer is simple: /. is news for nerds, not for geeks.

    So while the average geek keeps his favorite postscript viewer handy, the standard nerd wonders about such an ancient format and does not know how to feed his acrobat viewer with it...

    Here is the solution for those irritated ones: try this piece of ancient software on the ancient adobe format, and you can miraculously view its contents!

    Have fun and keep your google handy!

  44. and idea by Anonymous Coward · · Score: 0

    I get a lot of spam mail still... I was wondering if Hotmail could ever create MD5 sums of every e-mail and keep a database, then create some kind of popularity grid where you could, for instance, say hey i don't want e-mail that everyone else is getting...

    *shrug* maybe it'd work, maybe not... i'm just tired of getting spam..
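    Roughly what I have in mind, as a Python sketch. The class name, the threshold and the whitespace/case normalization are all my own assumptions; a real provider would need something far more robust, since spammers vary the text to defeat exact checksums.

```python
import hashlib
from collections import Counter


class PopularityFilter:
    """Count how often the same message body has been seen across all users."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical cutoff for "everyone is getting this"
        self.seen = Counter()

    @staticmethod
    def digest(body):
        # Collapse whitespace and case so trivial variations hash the same.
        normalized = " ".join(body.split()).lower()
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def record(self, body):
        """Register one delivery of this body and return its running count."""
        key = self.digest(body)
        self.seen[key] += 1
        return self.seen[key]

    def is_bulk(self, body):
        """True once the body has been delivered at least `threshold` times."""
        return self.seen[self.digest(body)] >= self.threshold
```

    The mail server would call record() on every delivery and let each user decide whether to drop anything for which is_bulk() is true.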

    1. Re:and idea by ziplux · · Score: 1

      http://www.rhyolite.com/anti-spam/dcc/

      This and other projects do what you're talking about but most are not in wide use.

  45. VM & EMACS by pmz · · Score: 3, Informative

    I have enjoyed using the VM module for Emacs. It allows sorting your entire Inbox into separate categorized mail boxes via regular expressions. Basically with one shift-A keystroke, my entire day's worth of mailing list stuff gets whisked away into a half-dozen different files. After this, I feel really sorry for people trapped in the Outlook dungeons!

    1. Re:VM & EMACS by brer_rabbit · · Score: 2

      could you give an example of your vm-auto-folder-alist? I've been using VM for quite awhile but I haven't tried this feature yet. Just curious how to set the variable to something useful.

    2. Re:VM & EMACS by Anonymous Coward · · Score: 0
      Emacs and especially XEmacs are great tools for virtually everything human-centric: email, outlines, calendar, programming IDE, web browsing (W3), filesystem (dired), databases, XML. You can tune the standard modes, reconfigure them, reprogram them, or create your own mode.

      Some of the *macs modes are very well made and tuned, like VM, the IDE and dired, while others are just the prototype of an idea, like calendar and outline. Some modes will never be finished: W3 (they will hardly ever do all of JavaScript, Java and plugins, right? They would do better to consider integrating with Mozilla, which would bring lots of benefits) and psgml/XML (they still think that XML Schema is something from the future and not ready yet, so they based everything on that stupid DTD).

      Actually, VM mode is one of the best *macs modes, and it is one of the best email programs. It works perfectly with virtual folders (the best I have seen in any mailer). As for physical ones, like an IMAP folder structure: forget the tree, forget IMAP. It works only with your local mailbox and POP3. For locally archived folders, use the function vm-visit-folder. There is another message reader, Gnus, which was designed mostly for NNTP. There is a rumor that it works somehow with IMAP, but... nobody has proved that.

      I use XEmacs/VM on the same computer that holds my mailbox. I tuned it to save/archive all drafts, sent and seen messages in folders available to the IMAP server. When for some reason I don't have SSH or X11 access to that box, I run either Mozilla or SquirrelMail (webmail). But usually I prefer vm-mode; it's too good when your fingers are used to typing code :) Seriously, *macs is an environment where working with macro commands is not an extension, it's its nature. It's like a car. You can live without it. And at first you don't know how to use it. But once you know it, you don't want to live without it. Same.

      And of course, I use procmail. It does all the dirty work, sorting all that junk, semi-junk, unknown, from-mailing-lists, from-friends, from-job, from-recruiters, from-banks etc. etc. messages into different physical folders which are, again, available to the IMAP server and to vm-mode. Once it's sorted, mostly by email address, I use vm-virtual-folders to sort it by subject and other rules.

      However, there is another problem. The documentation for *macs is ...good, and it ... exists ... but ... just be ready to get your answers from newsgroups and mailing lists :)

    3. Re:VM & EMACS by pmz · · Score: 2

      The vm-auto-folder-alist is basically a list of which fields to scan and what to do with classes of entries in those fields. A simple example is:

      (setq vm-auto-folder-alist
            '(("Sender:"
               ("mailing-list@domain" . "mailing-list.saved")
               ("mailing-list2@domain" . "mailing-list2.saved"))
              ("From:"
               ("user@domain" . "user.saved")
               ("your-e-mail@your-domain" . "sent_mail.saved"))))

      A more powerful example using regular expressions:

      (setq vm-auto-folder-alist '(("From:" ("^.*@dot[.]bomb$" . "dot.bomb.saved"))))

      This will take every e-mail whose From field matches the expression and save it into the file, dot.bomb.saved.

      I think this is by far the most useful and time-saving feature in VM, especially when subscribed to a high-volume mailing list.

  46. PDF Available Here by Jeffster98 · · Score: 1

    For those without postscript readers, a PDF version is available here.

  47. Better organizing sought by Anonymous Coward · · Score: 0

    I'd just like a way to reorganize the emails without affecting them, so that it acts more or less like a database report allowing me to group based on different criteria. For example, if I normally manually organize my emails into separate folders by Topic:Region (where each topic is arising), I'd like to be able to reorganize them by Region:Topic for a different perspective (what's going on in each region).

  48. Enfish by cnladd · · Score: 1

    Check out Enfish Onespace for those of you running MS Outlook. Not only does it do great text mining of all e-mails, it does the same with contacts and with files on your hard drive (the professional version handles network files, as well).

    It's got a clean UI, but it is a bit hard getting used to. I've found it to be a great tool for finding info in a snap - I just enter a search phrase and instantly get a list of relevant e-mails, Word docs, spreadsheets, contacts, and even websites.

    And nope, I'm not associated with them in any way - I just like the product. :)

    --
    Welcome to the land of the easily amused...

  49. I do this too + questions for other domain owners by Anonymous Coward · · Score: 1, Interesting

    I also use this technique for my externally hosted domain... I get all the mail addressed to any user in the domain, but it's easy to set up mail client filters to remove those which are addressed To:, say, potentialspammer1@mydomain.

    So, if there's any possibility of SPAM, I just invent a new user. Unfortunately, I didn't figure this out quite soon enough and I have some users which get spam and real mail, which I can't afford to filter to trash - people buying their own domains (come on, it's like $15 a year) should be thinking ahead.

    Also, it's not as neat a setup as having my own POP server bounce back the message (which might mean you get off the spam list one day!). More importantly, filtering the To: field doesn't help me most times, since spammers set the To: to "READTHIS" and use Bcc: for their spammies (is that a word?).

    ALSO

    Here's an unrelated question for anyone else who owns a domain like me, where they get a catch all POP box.

    How do you guys make sure people USE your nice domain name?

    In other words, it's okay having a POP box, mail.mydomain.com, but you never seem to get offered the services of an SMTP server through which you can send your messages From: this nice address.

    I would hazard that most people rely on Reply-To:, which is all very well, except that not all mail clients respect it, and you may want to entirely obscure the actual From:.

    Of course, mail clients like Emacs and Mozilla make it easy to arbitrarily set your From:, however you then have to get this through whatever SMTP server you have available (and in order to block spammers and other pranksters, you will increasingly find that most will only send mail if the From: agrees with your user name).

    One of the reasons I moved to linux was so I could run sendmail and not rely on other people's SMTP servers. This is okay at work, since we have direct internet access, but from home when I dial up, it doesn't work.

    I don't think my ISP likes to have people sending mail from their own computers. I get name resolution errors from sendmail when attempting to send email (but have no problem with DNS for web), so I think that perhaps the ISP's DNS servers refuse to give up MX records.

    Anyone else in a similar boat?

  50. What's wrong with PostScript? by Anonymous Coward · · Score: 2, Insightful

    Just use GhostView...

  51. Re:Link to a postscript file? by NewbieSpaz · · Score: 1

    Here ya go, in PDF format: http://www.kevindustries.com/bhs.pdf

    ps2pdf bhs.ps worked fine for me...

    --
    ------
    Random, useless fact: I type in startx entirely with my left hand.
  52. Re:Link to a postscript file? by tps12 · · Score: 2

    What is the difference between postscript and DPS? Any reason why DPS can't be integrated into X? The only effects of a DPDF renderer in OS X that I've seen are being able to view .pdf's without Acrobat and having vector-based widgets.

    --

    Karma: Good (despite my invention of the Karma: sig)
  53. pipe to mysql db? by yerdaddy · · Score: 1

    I've often thought it would be great if I could save my email to a mysql (or postgres, if you prefer) database that would automatically parse the header and body into table fields. Then when you want to search it you can use SQL queries instead of the convoluted grep commands I use now.

    Doesn't seem hard to write. Anybody know of such a thing?
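    To show how little it takes, here is a rough Python sketch. It swaps in the standard library's sqlite3 for mysql/postgres purely so the example is self-contained; the table layout and function names are invented, and it ignores attachments and encoded headers.

```python
import sqlite3
from email import message_from_string


def open_store(path=":memory:"):
    """Open (or create) the hypothetical mail table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, "
        "subject TEXT, date TEXT, body TEXT)"
    )
    return db


def plain_body(msg):
    """Return the text/plain content, joining parts for multipart mail."""
    if msg.is_multipart():
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        return "\n".join(parts)
    return msg.get_payload()


def store_message(db, raw):
    """Parse one raw RFC 822 message and insert its fields as a row."""
    msg = message_from_string(raw)
    db.execute(
        "INSERT INTO mail (sender, recipient, subject, date, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (msg["From"], msg["To"], msg["Subject"], msg["Date"], plain_body(msg)),
    )
    db.commit()
```

    After that, a search is just something like SELECT subject FROM mail WHERE sender LIKE '%boss%' AND body LIKE '%Friday%'.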

  54. Done already by Matts · · Score: 5, Informative

    "Perhaps even one of them Perl monkeys will quickly hack such a background tool."

    Been done already. Check out Mail::Miner.

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
  55. Re:Link to a postscript file? by tps12 · · Score: 2

    NM, here is this project that seems to be just that. Apparently Display Ghostscript is dead, but DPS lives on. Still don't see what the big whoop is.

    --

    Karma: Good (despite my invention of the Karma: sig)
  56. Zoe by Anonymous Coward · · Score: 0

    Have you seen this? http://homepage.mac.com/zoe_info/

  57. I Want Fewer Filter "Features" by kentborg · · Score: 2, Insightful
    Once I was at some internet tradeshow in Boston and every other booth seemed to be showing off their e-mail filtering features, each with one or more enormously complicated dialog box. Features! Features! Features!

    My reaction was to want an e-mail reading program that didn't require any filter configuration, though I imagined it would do well to be given a few hints, such as who my boss is, who my mother is, and who my wife is. Other than that, let the program figure it out.

    Imagine the canonical, old-fashioned secretary temp. She ('cause that's what the canonical version was) didn't have to know anything domain-specific to sort the morning mail. Magazines go together, bills go together, personal letters go together, etc.

    I imagine an automated version for my e-mail. Look at who it is "to" (am I on the list?), look at who is "cc"-ed (am I on that list?), look at who it is from (my boss, wife, or mother?), look at who else it is to (boss, wife, or mother?), look at the thread it is part of (is it responding to something I previously wrote?), look at the content (does it mention me, things I have written, my boss, wife, or mother?). Was it sent to a mailing list? Was it written by someone I have explicitly written to (once or many times?)? Was it written by someone who has previously sent me direct e-mail (once or many times?)? Those ideas are just the obvious ones, think of others. Think of more. (Does it talk about sex, credit card merchant accounts, stock tips, or Nigerian money?)
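    To make that concrete, a toy Python sketch of the kind of scoring I mean. The Message fields, the weights and the idea of passing in a set of VIPs (boss, wife, mother) and known correspondents are all placeholders of mine; a real program would want many more signals and would have to learn the weights.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    to: list = field(default_factory=list)
    cc: list = field(default_factory=list)
    in_reply_to_mine: bool = False  # part of a thread I previously wrote in


def importance(msg, me, vips, correspondents):
    """Add up a few of the hints above into a single importance weight."""
    score = 0
    if me in msg.to:
        score += 3          # addressed directly to me
    elif me in msg.cc:
        score += 1          # I am only copied
    if msg.sender in vips:
        score += 5          # from my boss, wife, or mother
    if any(v in msg.to or v in msg.cc for v in vips):
        score += 1          # also sent to one of them
    if msg.in_reply_to_mine:
        score += 2          # responds to something I wrote
    if msg.sender in correspondents:
        score += 2          # someone I have explicitly written to before
    return score
```

    Sorting by this number is the crude version of floating the boss's e-mail to the top of the tree.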

    Now take that and sort it by importance and similarity. Look for a way to present it to me in a descriptive summary, arranged in a hierarchy with a top-level of, say, 3 to 9 categories, a greatest depth no greater than, say, 4, and keep the sub-branching at intermediate nodes between 3 and 5--but don't max out all those dimensions at once, try to keep the total number of leaf categories to under, say, two dozen. Try to make more important items land higher in the tree and with few siblings, grouped with siblings of similar importance. (Maybe give an importance weight to each e-mail and balance the tree on that scale, that would float e-mails to me from my boss about my mother and wife really high with few siblings.)

    This summary needs to be integrated with a complete index of the e-mail so I can see how a message fits into a larger thread, how it fits into previous e-mails.

    I (the user) would need to tell the program when to make me a summary of my e-mail (e-mail reading is different when a lot comes in or just a little), and I want to be able to browse through old summaries, including deciding to see composite summaries of, say, the last several days, a week (or three), month, year, or 400 days.

    So I think it ends up being a 4-part user interface:

    List of summaries (which can be manipulated).

    A given summary.

    Exhaustive thread/date/subject/sender list (analogous to what every e-mail reader seems to have now). Note that this view could effectively be turned into an exhaustive address book. Frequent (favored) correspondents could be highlighted by me for ease in sending a new e-mail, and also to provide importance hints to the program. This is where I might say who my boss/wife/mother is.

    A body of a (or more) specific e-mail being read, written, or old e-mail (sent or received) being reviewed.

    And I could go on, but I won't. If anyone wants to write such a thing and wants to hear more, send me an, um, e-mail.

    -kb, the Kent who has been saving all his e-mail (including spam!) for a year or so, providing plenty of raw material to test any such program.

    1. Re:I Want Fewer Filter "Features" by Radix42 · · Score: 1
      I agree, I've been thinking about building such a "just do the right thing like a secretary" system for quite a while now.

      I was gonna email you about it, but your address isn't displayed (duh). Drop me a line at slashdot@arkhein.net

  58. It's already been implemented! by nikko · · Score: 1

    Please check out:

    http://homepage.mac.com/zoe_info/

    Zoe is way ahead of this curve.

  59. Check Out Phorecast by ciaweb · · Score: 1

    Phorecast downloads all your email into a database of your choosing; it is database abstracted using PHP's PEAR DB library.

    Phorecast is a web application written in PHP that combines email, calendar, and address book functions. It is language abstracted, so you can write a .tsv translation file for any language you like. Version 0.5 (on the way) improves these functions, and adds a todo list as well.

    Full disclosure: I wrote it, and I use it as my primary email client.

    --
    Try out Phorecast, open-source email, calendar,
    1. Re:Check Out Phorecast by statusbar · · Score: 2

      Looks great so far! However I guess I have to manually create the postgresql database tables....

      jeff

      --
      ipv6 is my vpn
  60. patent-esque by Anonymous Coward · · Score: 0

    Really, the diagrams help convey many of the points, but do the equations really tell you anything? Can anyone actually read those and explain them?

    If I didn't know better, I'd think these sorts of papers (which are common) were a cheap attempt to sneak obvious claims past the patent office. I guess it's this sort of stuff that got the patent office in the trouble it has seen.

  61. Re:Link to a postscript file? by MaxVlast · · Score: 2

    Well, the attentive reader would have noted that I pointed out that Adobe wanted a very high per-seat license. Apple wanted to pay a flat rate, IIRC, and the two companies didn't work it out. So Apple went a different way.

    DPS was used in a more fundamental way in NEXTSTEP. It was really amazing. There was true WYSIWYG, as the code on the screen was what was literally sent to the printer. Layout was really improved as a result, and you could mix postscript code with your drawing program efforts and see it previewed in a live fashion on-screen. It was easy to save documents in a portable fashion (PS), and a dozen other things.

    --
    There should be a moratorium on the use of the apostrophe.
    Max V.
    NeXTMail/MIME Mail welcome
  62. How about converting to html? by Anonymous Coward · · Score: 0
  63. Re:I do this too + questions for other domain owne by Permission+Denied · · Score: 1
    I don't think my ISP likes to have people sending mail from their own computers. I get name resolution errors from sendmail when attempting to send email (but have no problem with DNS for web), so I think that perhaps the ISP's DNS servers refuse to give up MX records.

    This may also be a reverse DNS resolution problem. Check that your IP resolves to your hostname and that your hostname resolves to your IP. If not, some sendmail installations will reject your mail. Also, make sure your sendmail is sending out the correct hostname - eg, you can set up your machine so that it thinks its hostname is something.domain.com instead of some-long-crap-dsl-023-094.domain.com where something.domain.com is not an actual DNS record. This works fine for everything except when sendmail starts sending out emails claiming something.domain.com as originator.
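
    The round-trip check described above (IP to hostname, then hostname back to IP) can be sketched in Python. This is only an illustration using the standard `socket` module; the function name `forward_confirmed` and the injectable `reverse`/`forward` resolvers are assumptions made for this example, and it says nothing about how any particular sendmail installation performs its own check.

```python
import socket


def forward_confirmed(ip, reverse=None, forward=None):
    """Return (ok, hostname); ok is True when ip -> hostname -> ip round-trips."""
    # Default resolvers use the system's DNS; they can be swapped out for testing.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        hostname = reverse(ip)
        addresses = forward(hostname)
    except OSError:
        # No PTR record, or the hostname does not resolve at all.
        return False, None
    return ip in addresses, hostname
```

    Calling `forward_confirmed("203.0.113.5")` with your own public address (the one shown is a documentation placeholder) would tell you whether a receiving mail server doing this kind of check is likely to see a mismatch.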

    Another thing you can do is configure sendmail to send all mail addressed to "user+any_arbitrary_string@domain.com" to "user@domain.com". This is useful since I don't have to do anything to generate a new email address. Search google.
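
    To make the "user+any_arbitrary_string" mapping concrete, here is a minimal Python sketch of the address rewriting only. The helper name `base_address` is hypothetical, and this is not sendmail configuration; it just shows which mailbox a tagged address collapses to and which tag you could filter on.

```python
def base_address(address):
    """Split 'user+tag@domain' into ('user@domain', 'tag'); tag is '' if absent."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    # Everything after the first '+' in the local part is the arbitrary tag.
    user, _, tag = local.partition("+")
    return f"{user}@{domain}", tag
```

    The returned tag is what identifies which company the address was handed to, so a filter can file or drop mail based on it.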

    I'll add that giving out a separate email addy for every company works beautifully. It also lets you know when some company sells your email address, something they will never admit to doing otherwise. I now get zero spam in my inbox.

  64. finding NEW topics by tswaterman · · Score: 2, Informative
    Many of these comments are missing the point. The paper is not really about categorizing your email.

    The main result in Kleinberg's paper relates to finding NEW topics that start to appear in the stream. Let's say you already have categorization filters (procmail, keyword filters, your own set of folder hierarchies, whatever...), but there's a new topic that starts showing up in your mail, or in your newsgroup feed, or on CNN. Kleinberg's result is a way to find that the new stuff really is NEW, and you might want to group it up together, and make a folder for it. You could do that automatically, or by hand, but first you have to know that there's a topic.
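
    To give a feel for the problem (this is NOT Kleinberg's algorithm, just a much cruder frequency-ratio heuristic swapped in for illustration), the Python sketch below flags words that are far more common in recent messages than in older ones. The function name `new_terms` and the `min_count` and `ratio` thresholds are arbitrary assumptions.

```python
from collections import Counter


def new_terms(old_messages, recent_messages, min_count=3, ratio=5.0):
    """Return words that are much more frequent in recent mail than in older mail."""
    old = Counter(w for m in old_messages for w in m.lower().split())
    recent = Counter(w for m in recent_messages for w in m.lower().split())
    old_total = sum(old.values()) or 1
    recent_total = sum(recent.values()) or 1
    flagged = []
    for word, count in recent.items():
        if count < min_count:
            continue  # too rare in the recent window to call a topic
        # Add-one smoothing so a never-before-seen word does not divide by zero.
        old_rate = (old[word] + 1) / old_total
        recent_rate = count / recent_total
        if recent_rate / old_rate >= ratio:
            flagged.append(word)
    return sorted(flagged)
```

    A heuristic like this only says "something new is showing up"; deciding that the flagged messages belong together in one folder is a separate step.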

    There's a bunch of other work in this area, which the NLP types call TDT -- "Topic Detection and Tracking".

  65. This can be done for free! by gregstoll · · Score: 1

    Check out sneakemail.com - it does basically this, but at their domain name, and you can set filters on particular addresses, or just delete them. Very useful idea, I'd definitely be willing to pay for it though...

  66. Intertwingle by geirt · · Score: 2

    jwz of Mozilla/Netscape fame has a hypothetical program called Intertwingle which is (Score:5,Interesting) ....

    --

    RFC1925
    1. Re:Intertwingle by Anonymous Coward · · Score: 0

      Check ZOE for a system inspired by these ideas. http://homepage.mac.com/zoe_info/

  67. Not new, but cool. by jefferson · · Score: 3, Informative

    There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.
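
    For readers who have not seen one, a Naive Bayesian classifier with word counts as features fits in a few lines. The Python sketch below is a generic multinomial Naive Bayes with add-one smoothing, not the code or the data from the semester project; the class name `NaiveBayes` and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(label) plus the sum of log P(word | label)
            score = math.log(docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

    Usage is two steps: call `train(folder_name, message_text)` for each message already filed, then `classify(new_message_text)` to get the most likely folder. With no training data `classify` returns `None`.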

    The paper is here.

    J.

  68. wtf??!! by Anonymous Coward · · Score: 1, Funny

    who the hell gets so much email they need to
    mine for text, christ ??!! dont change your email filtering, change your pathetic life !!
    there are plenty of other things far more worth mining than TEXT

  69. Re:there is no way to win...Maybe there is... by wolf- · · Score: 1
    That all being said...
    As much as slashdotters hate lawyers...
    And most politicians...
    Maybe we should learn how to use them...

    Use the laws in place. Sue them for the costs associated with spam. Sue them when they break the laws that exist to protect consumers. Whether it is spam, telemarketers, or Best Buys. Stiff it to them. Outsmart them. After all, aren't we geeks and nerds? By the world's definition aren't we all supposed to be smarter, or a step above the average person?

    --
    ----- LoboSoft specializes in Digital Language Lab
  70. View the document online by Hew · · Score: 1
    --
    /cj
  71. 8020 Retriever Does All This And More by Anonymous Coward · · Score: 0

    www.80-20.com (I think)

    Integrates with Outlook (not UNIX version). But it offers real-time indexing of email, contacts, local and networked files. Super cheap for what you get.

    It's saved my butt a hundred times when I can only remember a fragment of someone's contact info, message or whatever.

    They're the worst marketers in the world. They owe me for this one!

    http://www.hiredinsight.com

  72. kleinberg is rebel king!!! by complete+fallout · · Score: 1

    i'm taking a class with kleinberg right now and he's a great lecturer. if anyone is interested in algorithms of any kind, go read his papers.

  73. Re:Link to a postscript file? by tps12 · · Score: 2
    That sounds awesome. In my college days I always embedded latex in my xfig diagrams. I don't know if I'd necessarily want to go any less abstract than that (which isn't saying much, I know) under most circumstances, but it's cool that it's there, I suppose.

    Reminds me of the maps for the 3D network game for the Mac that Ambrosia made...Avara, I think? The maps were vector graphics, where different shapes meant different things and text inside the shapes was code. Very cool idea. I think there's still a lot of potential in the idea that source code doesn't necessarily need to be a simple linear text file.

    --

    Karma: Good (despite my invention of the Karma: sig)
  74. I do this too + questions for other domain owners by AFreeman · · Score: 1
    This may also be a reverse DNS resolution problem. Check that your IP resolves to your hostname and that your hostname resolves to your IP. If not, some sendmail installations will reject your mail. Also, make sure your sendmail is sending out the correct hostname - eg, you can set up your machine so that it thinks its hostname is something.domain.com instead of some-long-crap-dsl-023-094.domain.com where something.domain.com is not an actual DNS record. This works fine for everything except when sendmail starts sending out emails claiming something.domain.com as originator.

    Aha, is this why it works at work (where my hostname is correct and resolvable), but not at home with my ISP, where my hostname remains the same but could not be looked up?

    I just dial up over a modem, and it's possible I configured that kind of peculiarly, because of wanting to switch between the LAN at work and my dialup at home.

  75. Re:Link to a postscript file? by tps12 · · Score: 2
    Gee, wouldn't that be illegally using their monopoly to muscle out third party developers? Why, if the OS had a PS viewer built in, nobody would ever buy one! Businesses would go bankrupt!

    Haha, that is what I like to see. Some common sense once in a while.

    Some other transgressions: the Mac OS has forced the Apple menu on its users for nearly 20 years. Why can't I have a 3rd party menu? And sure people could download an alternative to GNOME terminal, but realistically who will exert the effort? And why don't I have a choice of who provides me with a tea timer in KDE?

    --

    Karma: Good (despite my invention of the Karma: sig)
  76. plenty of e-mail mining tools by j09824 · · Score: 2

    There are plenty of e-mail mining tools in development. This particular work takes one particular approach to mining the data. Whether this approach will turn out to be useful remains to be seen.

  77. Re:Link to a postscript file? by Pfhreakaz0id · · Score: 2

    this is perhaps the greatest example of slashdotter myopia ever. I don't give a crap about my karma, I just have to laugh at this AC:

    It shouldn't take two months to get ghostscript and ghostview. In fact, it comes with most modern operating systems.

    Clue time: 99% of people who've ever used a computer have never heard of either. If they click on the link above, they get a Windows file box for "open with" and they wonder why the author didn't include a warning of what this strange file format was and what, exactly, they are supposed to do with this file.

  78. Sneakemail: was "The joys of owning a domain" by mysta · · Score: 1

    I also have my own domain name but I'm limited to 5 forwarded email addresses. I wanted to do what you suggested a while ago but couldn't. Then I stumbled across Sneakemail and it basically did everything I had intended anyway.

    In a nutshell, you sign up for an account, giving only a contact email address (I use spam AT threewordslong DOTTY com). Once logged in you can create a new, randomized email address for each new web service that needs an email address. If one of these services spams or sells your sneakemail address you: a) know exactly who did it and cease further business with them and b) can filter on that specific email address.

    It's a great service and no, I don't work for them...

    --

    "Where is the wisdom we have lost in knowledge, and where is the knowledge we have lost in information?"-T.S.Eliot
  79. Re:Link to a postscript file? by Anonymous Coward · · Score: 0

    Well if they don't know what a postscript file is, they aren't going to be able to provide much of an objective review of a research paper either.

  80. Re:Link to a postscript file? by t · · Score: 2
    Get current man!

    158213 Apr 19 09:41 bhs.ps.bz2

    t.

  81. Re:there is no way to win...Maybe there is... by foniksonik · · Score: 1

    There are several articles on /. concerning lawsuits against spammers.

    http://slashdot.org/search.pl?query=sue+sued+suing&op=stories&author=&topic=111&section=&sort=1

    --
    A fool throws a stone into a well and a thousand sages can not remove it.
  82. It won't work by Anonymous Coward · · Score: 0

    The rogue marketers will just let another company entity send the next spam.

    I.e., start a new throw-away company each time they want to spam us.

  83. qmail dash-ext by main() · · Score: 1

    Qmail is good for this sort of thing. By default, a user receives everything at username-*@domain.com.

    So, I subscribe to amazon with username-amazon@domain.com, slashdot with username-slashdot@domain.com etc.

    You can then control the delivery location of mails to these recipients using .qmail-amazon and .qmail-slashdot files in your home directory and have a .qmail catch-all.

    This comes in handy for filing mailing-lists away, filtering out spam etc. It's also interesting to see who's sold you down the river to spammers, I recently started receiving spam to username-bsdtoday@domain.com... bastards!
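
    As a rough sketch of which file handles which address, the Python below lists the `.qmail` files that would be consulted for mail to username-<ext>. It follows the lookup order as described in the dot-qmail(5) man page (exact match first, then progressively shorter "-default" fallbacks, with dots turned into colons and uppercase lowercased); treat those details as an assumption and check your own installation. The function name `qmail_candidates` is made up for this example.

```python
def qmail_candidates(ext):
    """List the .qmail files tried, in order, for mail to username-<ext>."""
    if not ext:
        return [".qmail"]  # plain username@domain.com
    # Assumed from dot-qmail(5): dots become colons, uppercase is lowercased.
    ext = ext.lower().replace(".", ":")
    parts = ext.split("-")
    candidates = [".qmail-" + ext]
    # Fall back to progressively shorter prefixes ending in "default".
    for i in range(len(parts) - 1, -1, -1):
        candidates.append(".qmail-" + "-".join(parts[:i] + ["default"]))
    return candidates
```

    So mail to username-amazon@domain.com would be handled by `.qmail-amazon` if it exists, and otherwise by `.qmail-default`.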

    Cheers,
    Si

    1. Re:qmail dash-ext by dlc · · Score: 2

      Most of the major MTAs will do this nowadays, but with a + rather than a -. I know sendmail does this, and am pretty sure about postfix and exim as well.

      Look at this reference, for example.

      --
      (darren)
  84. Horrible interface. by arafel · · Score: 1

    I could list it all here, but it's much more efficient to just point people at:

    http://www.iarchitect.com/lotus.htm

    (Which is a site that everyone should read before doing UI stuff.)

    Sample of one of the "best" bits:

    Judging from the number of visitors who have mentioned it, the process of copying messages in Notes is perhaps its worst interface "feature". Apparently, when mail messages are copied from one folder to another, the message itself is not copied; Notes creates a "reference" to the message. Unbeknownst to the user, if you delete the reference, Notes will in turn delete the message itself. Similarly, deleting the message will cause all references to it to also be deleted.

  85. Text-Mining your bookmarks? by lvirden · · Score: 1

    I'd like to find something like this - where as I browse across a web page I could activate a program that would look at the page, and suggest a series of folders that appear to be relevant. If I agree, I click okay and go on. If I think a category is unnecessary, or missing, I would have the option of adding a category.

    I see this as a parallel need to the mining of the email.

    --
    URL: http://xanga.com/lvirden > Quote: Saving the world before bedtime. Even if explicitly stated to the contrary, n
  86. SWISH++ is a good mail indexer by pauljlucas · · Score: 1

    SWISH++ (my search engine) specifically knows to index mail/news files (including text, HTML, RTF, LaTeX files) and attachments of any of those (in quoted-printable or base64 encodings). It can also index any other kind of attachment via external filter programs. A procmail recipe for auto-splitting incoming mail is included in the distro. I also believe that my statement of SWISH++ being the fastest open-source indexer is accurate.

    --
    If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
  87. Re:Link to a postscript file? by ansonyumo · · Score: 1

    Yeah sure, embed it in the OS. More proof that any notion of what constitutes an "Operating System" has been completely destroyed by Microsoft's marketing department.

    Dude, it's real simple. Install ghostview, set up as helper app in the browser of your choice for application/postscript.

    Bundle with the OS. Feh!

  88. Re:Link to a postscript file? by devinjones · · Score: 0

    Didn't NeXT OS do this?
    IIRC, postscript was the underlying graphics transport for screens, so they had WYSIWYG everywhere.

  89. Zoe: Intertwingularity for e-mail by Anonymous Coward · · Score: 0
    Thread readers may be intrigued by Zoe, an experimental e-mail reader that builds on Ted Nelson's concept of Intertwingularity.
    Intertwingularity is not generally acknowledged -- people keep pretending they can make things deeply hierarchical, categorizable and sequential when they can't. Everything is deeply intertwingled. -- Ted Nelson
  90. SpamBouncer by gasull · · Score: 1

    SpamBouncer is a set of procmail recipes to filter spam.