I have gone several ways on this question: Mathematica for symbolic solving, and coding my own in C for mathematical modeling. But I did once use S-PLUS, which is a fairly nifty matrix algebra system that might be useful to you. An open source variant is R.
Given a particular topic of letter, this problem isn't too different from looking for spam vs. ham, and can be approached in similar ways (e.g. a Bayesian filter).
Actually, you could probably do quite well identifying boilerplate by simply dropping all punctuation, spaces, and capitalized words, and then computing a hash (say, md5) over every even letter and over every odd letter. If either hash matches either hash of another letter, that should be a very specific indication of boilerplating.
These still require a corpus of letters, though, or a way to generate one from a search.
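Here is a minimal sketch of the even/odd hashing idea in Python. The function names are made up for illustration, and "capitalized word" is assumed to mean any whitespace-separated token whose first character is uppercase; a real version would want a smarter tokenizer.

```python
import hashlib


def boilerplate_hashes(text):
    """Return the pair of md5 digests (even letters, odd letters) for a letter."""
    # Drop capitalized words (names, salutations), assuming a word counts as
    # capitalized when its first character is uppercase.
    words = [w for w in text.split() if not w[:1].isupper()]
    # Drop punctuation and spaces by keeping only alphanumeric characters.
    letters = "".join(ch for w in words for ch in w if ch.isalnum())
    even, odd = letters[0::2], letters[1::2]
    return {
        hashlib.md5(even.encode("utf-8")).hexdigest(),
        hashlib.md5(odd.encode("utf-8")).hexdigest(),
    }


def looks_boilerplate(letter_a, letter_b):
    """True if either hash of one letter matches either hash of the other."""
    return bool(boilerplate_hashes(letter_a) & boilerplate_hashes(letter_b))


a = "Dear Senator Smith, please vote no on the bill. Thank you."
b = "Dear Senator Jones: please vote no on the bill!! Thank you."
c = "Dear Senator Smith, my garden is doing well this year."
print(looks_boilerplate(a, b))
print(looks_boilerplate(a, c))
```

One thing the sketch makes visible: a single substituted letter still leaves one of the two hashes intact, but an inserted or deleted letter shifts the even/odd split and changes both.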
I'll second the nomination of SpamAssassin.
In the last 30 days, it tagged 427 messages to me as spam. No false positives, and probably about 30 or so false negatives (I use the standard threshold). I could probably tweak it to do even better.
This problem is exactly analogous to the proposal to test all married couples for HIV that went around Chicago some years back. Surprise, surprise, the base rate of HIV among to-be-married couples was quite low. More false positives than true positives. Lots of wasted time, money, and stress on re-screening.
As you may know, Bayes' Theorem (actually a statement of fact in probability theory) says:
Post-test odds = Likelihood Ratio * Pre-test odds
(Where the likelihood ratio for a positive test is the sensitivity/(1-specificity), or TP rate / FP rate)
If your pre-test odds of being a terrorist are very low (and when you consider how many terrorists fly compared to how many non-terrorists fly, they must be exceedingly low), you're going to need a very, very powerful ("highly specific" in medical terms) test if you want to reliably determine that a given person ought to be treated with greater care.
On the other hand, if they were planning to spend a lot of time and money screening people anyway, and they could improve their sensitivity (TP rate), facial recognition might be a (statistically) sound approach to screening *out* suspects. That is, once you pass a face-detection screen that has a high TP rate, you don't need to be subjected to as much extra screening; but if you fail the face-detection screen, it's not really diagnostic.
Normally, you could use my diagnostic test calculator to fool around with numbers yourself and see what the impact would be, but it appears to be down until I can get to the server (dratted dist upgrade!).
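To put rough numbers on this, here is a small Python sketch of the odds form of Bayes' Theorem. The pre-test probability (1 in a million) and the 99% sensitivity and specificity are invented for illustration, not measurements of any real system, and the function name is hypothetical.

```python
def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
    """Update a pre-test probability using post-test odds = LR * pre-test odds."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    if positive:
        # LR+ = sensitivity / (1 - specificity) = TP rate / FP rate
        lr = sensitivity / (1.0 - specificity)
    else:
        # LR- = (1 - sensitivity) / specificity = FN rate / TN rate
        lr = (1.0 - sensitivity) / specificity
    post_odds = lr * pre_odds
    # Convert odds back to a probability.
    return post_odds / (1.0 + post_odds)


# Assumed, purely illustrative numbers.
prior = 1e-6
flagged = post_test_probability(prior, 0.99, 0.99, positive=True)
cleared = post_test_probability(prior, 0.99, 0.99, positive=False)
print(flagged)
print(cleared)
```

With these made-up inputs, a person flagged by the test still has only about a 1 in 10,000 chance of being a true positive, while a person who passes drops to about 1 in 100 million. That is the sense in which a sensitive test is better at screening people out than at picking them out.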
Well, let's say I live somewhere where the local folk decide it's a good idea to have a book-burning - Harry Potter, maybe, or Catcher in the Rye. Or the local government decides certain books and those who read them are subversive and should be watched. Or the local corporations decide that if they could compile a big database of who buys certain types of books, they could "target" their marketing of associated products, and sell lists of, e.g. Kilgore Trout fans, to the highest bidder.
Be awfully convenient for them to be able to find who's got those books, and where, don't you think?
Ok, it's slightly off-topic, but just to clear the record.
I work at the College of Medicine of the University of Illinois at Chicago, which is the largest one in terms of MDs graduated annually in the US (about 400 per year).
Like many other US Medical Colleges, the oath that graduates take is the 1948 Declaration of Geneva version of the Oath of Hippocrates, which reads:
Now being admitted to the profession of medicine, I solemnly pledge to consecrate my life to the service of humanity. I will give respect and gratitude to my deserving teachers. I will practice medicine with conscience and dignity. The health and life of my patient will be my first consideration. I will hold in confidence all that my patient confides in me. I will maintain the honor and the noble traditions of my medical profession. My colleagues will be as my family. I will not permit consideration of race, religion, nationality, party politics, or social standing to intervene between my duty and my patient. I will maintain the utmost respect for human life. Even under threat I will not use my knowledge contrary to the laws of humanity. These promises I make freely and upon my honor.
As you can see, even medicine changes with the times, while trying to maintain the important features of the Oath of Hippocrates.
Some of my favorite tools are the various flavors of MUSH servers, such as the one I maintain, PennMUSH.
In many ways, MUDs can provide everything you've asked for -- categorized fora (real-time chat channels, virtual spaces, and asynchronous bulletin boards) that are user-extensible with a relatively simple initial set of commands, a clean interface (text with ANSI color), virtually no lag, and a boss-friendly appearance.
I have been involved with communities that started out of MUSHes and later evolved into off-line communities, and vice versa.
Many people who run honeypots base them directly on sendmail, by running "sendmail -bd" on systems that aren't supposed to be mailservers, as described in this page.
DNS-based blacklists are not your problem. There are no more than a dozen that are really widely used: some ORBS spinoffs like http://www.ordb.org and http://www.orbz.org; the MAPS ones at http://www.mail-abuse.org, if you're willing to pay (or can get a hobby contract); and the collection at http://relays.osirusoft.com, which includes open relays, spamhaus, and SPEWS. All of these systems have clearly published listing policies and are actively maintained, and if you're blocked by one of them, you'll likely get out sooner or later once you're clean. (In some cases, you can have them automatically retest you.) Plenty of mail admins find that using the information on these sites to protect their mail servers from spam is highly effective.
Your problem is twofold. First, while you've cleaned up your open relay, plenty of spammers and spam-friendly hosts make the same claim and lie (Rule #1: Spammers lie). So you may have to be patient.
More importantly, your server's IP may now be sitting in hundreds of private blacklists on mail servers whose admins don't like to use the centralized lists, and just reject/blackhole spammers on their own. It is the presence of well-trusted centralized blacklist services that gives you even the hope of ever having decent communication, because without them, you'd get into a thousand tiny blacklists and never get out.
(P.S. Note that if you're checking your status using the rblcheck tool at http://relays.osirusoft.com, it will tell you about a lot of blacklists that are not intended to be publicly used and not part of the usual osirusoft dnsbl, as well...)
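If you'd rather check a single list by hand, the usual DNSBL convention is to reverse the octets of the IP, append the list's zone, and see whether that name resolves. A minimal Python sketch, assuming IPv4 only; the zone `dnsbl.example.org` is a placeholder, not a real list, and both function names are made up here.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """True if the query name resolves, which by convention means 'listed'."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: not listed. Note that a resolver failure lands here
        # too, so treat False as "no listing found", not as proof.
        return False


# Placeholder zone; substitute the zone of the list you actually care about.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Each list documents its own zone name and what its return codes mean, so check the list's policy page before reading anything into a hit.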
An online publication venue for this kind of work (and a place to go to read other related work) is the Journal of Virtual Environments (formerly Journal of Mud Research).
First, a disclaimer. I like this book. Despite having printouts of most of the BOFH stuff already, I bought both the first BOFH book from Plan Nine and this one, because it's nice to have a bound copy to put on the desk to scare users.
But there are a few critical points that should be made about the second book, and that can hopefully be avoided in the next installment:
- The PFY comes from nowhere. The stories that introduce him aren't included, so if you don't already read BOFH online, he appears rather abruptly.
- Illiad's illustrations are cute, but there are only about four of them, repeated over and over, which is a real shame.
- The price, as previously noted, seems a bit much for the quality of the paper, etc. used, but obviously, the market will bear (has borne) it.
Compared to the first BOFH book, you get a lot more BOFH vs. corporation (especially accountants) and less BOFH vs. users. Depending on your outlook, this may be a very positive thing or not. :)
Last I checked, if someone patches my (source freely available) code, they've created a derived work, and I retain the copyright. Assuming that their patch can't stand alone as a separate work, it's legally mine.
Now, I'm not a geneticist, I'm a research psychologist in the area of medical judgment and decision making, where professional norms are to keep your data for years and provide it on request, but I fully understand the problem of "difficulty/convenience" -- even in finding your old data for yourself.
This is a place where research scientists could really use some good old fashioned technological and social help from programmers. Consider a typical computer-administered psychological experiment's process:
- Write code to run the experiment and log the data. If you didn't document the code or write self-documenting code, you'll have trouble when someone wants the data later.
- Run the experiment and collect those log files. If the log file format isn't self-documenting, you'll have trouble when someone wants the data later.
- Get all the log files transformed into a format that can be usefully imported into statistical software. If you didn't document all the variables and values in the resulting stat file, you'll have trouble when someone wants the data later. And most statistical packages allow you only 8 characters for variable names and make detailed labelling of variables and values highly tedious.
- Analyze the data and produce some output. If you didn't save the analysis details (as is all too easy to forget if you're doing stats with a dialog-box-based program), you'll... (well, you know the rest).
- Write a paper describing what you did and submit it to a journal. Have it accepted (hopefully) in 4-6 months. Have it appear about 6 months later. It is now probably 18-24 months since you started the study. If you're lucky, you've probably changed computers at least once by now, and possibly offices/buildings/universities, too. If you can find the data yourself and understand what it means, you're ahead of the game.
This is not too far from the problem of managing a source code project over time and across maintainers. It's not enough for professional scientists to have standards for retention and sharing of data -- we need a tutorial in documentation (and statistical and other software packages that better support it.)
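As one illustration of what a "self-documenting" log file could look like, here is a minimal Python sketch that writes the codebook into the top of the data file as comment lines, so the variable descriptions travel with the data. The study name, variable names, and descriptions are all invented for the example, and the `# name: description` header is just one possible convention, not a standard.

```python
import csv
import io

# Hypothetical codebook: one plain-language description per variable.
CODEBOOK = {
    "subject_id": "Anonymous participant number",
    "trial": "Trial index within the session, starting at 1",
    "condition": "Framing condition: 'gain' or 'loss'",
    "rt_ms": "Response time in milliseconds",
}


def write_log(stream, rows, codebook, study):
    """Write comment lines describing every variable, then the CSV data."""
    stream.write("# study: " + study + "\n")
    for name, description in codebook.items():
        stream.write("# " + name + ": " + description + "\n")
    writer = csv.DictWriter(stream, fieldnames=list(codebook), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_log(stream):
    """Return (metadata dict, list of row dicts) from a file made by write_log."""
    lines = stream.read().splitlines()
    meta = dict(line[2:].split(": ", 1) for line in lines if line.startswith("# "))
    data = list(csv.DictReader(line for line in lines if not line.startswith("#")))
    return meta, data


buf = io.StringIO()
rows = [{"subject_id": 1, "trial": 1, "condition": "gain", "rt_ms": 532}]
write_log(buf, rows, CODEBOOK, "framing-pilot")
buf.seek(0)
meta, data = read_log(buf)
print(meta["rt_ms"])
print(data[0])
```

The point is less the format than the habit: whoever opens the file in two years, including you, gets the variable definitions without hunting for a separate document.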
And by so doing, Medievia has been accused of violating the license of the DikuMUD source code from which it is (by admission of its creator as well as by inspection of source code) derived, which prohibits any commercial use.
Of course, this new Entropia project gets to write their own license, assuming they're not basing their code on one of the many fine free mud codebases (where your equipment might degrade through use, but not due to economic externalities!)
As one set of research grant deadlines for major U.S. Federal agencies falls early in the year (NSF: Jan 15, NIH: Feb 1), most Decembers find me plugging away.
For those of us in academia, especially on the tenure-track, "holidays" often mean "when you're not teaching and can get around to writing up your papers or grant proposals", although I'm pleased to say that I'm also getting to travel to see my family (hooray for the laptop and the spread of home broadband).
- Alan, Asst Professor of Clinical Decision Making
Thanks for the kind review and the useful comments. I'm pleased that the book still has some value today, though I agree that the specific MLMs covered are no longer current -- I've just sent a copy of this page to O'Reilly along with a suggestion that we do a second edition, so we'll see what they say!
My playlist for MML2 would include mailman, majordomo2, listar, and ezmlm; greater discussion of non-sendmail MTAs (qmail, postfix, maybe exim); coverage of commercial options (running your own with lyris/listproc vs. outsourcing vs. egroups.com approaches); and greater depth in terms of list policy, spam issues, legalities, and tips for specific kinds of lists (like opt-in marketing to established customers).
It was a thrill to see this on Slashdot. :)
- Alan Schwartz (author of Managing Mailing Lists)
If I'm a news administrator, I have the right to decide what to carry on my news server, right? That's why spammers complaining about UDPs don't get very far with me.
Well, the fact is, if I own a network, I probably should have the right to decide what to carry over it. In this case, Arizona owns that network, its people presumably expect to provide that network for educational purposes, and its elected representatives get to decide what can be carried on it. (Unless it's a common carrier, or, as a governmentally-owned system, the 1st amendment applies, of course).
Students may be forced to find alternative internet providers (dialups) rather than use the campus network, just as you might have to find an alternative USENET source if you didn't want to participate in a UDP.
So, the bill can be right in principle (absent the 1st amendment issue). But totally wrong in practice if the goal is to save the taxpayer's money, of course -- it will certainly cost more to enforce than it will save in reduced "porn bandwidth".
(Now, you and I know that the goal is really to return to some kind of imaginary "when I was a girl, people were proper" morality, but the argument is made on cost as well, and, if the democratic process works, will be answered that way in Arizona and this bill will go down in flames.)
For stuff like medical data, financial data, etc., I'd seriously consider looking into wipe instead, which uses Peter Gutmann's patterns.
(It's only paranoia until they get you.)
Do not taunt Happy Fun Ball, fire extinguisher model
"Super Commercials: A Mental Engineering Special" is made possible by a grant from Doubleclick.