jmason · Slashdot Mirror

the corpus was *not* classified by SA alone on Response to Gordon Cormack's Study of Spam Detection · 2004-06-24 08:09 · Score: 5, Informative

My $.02. Disclaimer: I'm one of the SA developers.

"The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":

No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).

The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.

However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.

In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.
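
To make that argument concrete, here is a minimal sketch of the adjudication loop in Python. It is not Cormack's actual tooling; the function names, message ids and filter names are all hypothetical, and the "human" is simulated by a lookup into a known-true labelling.

```python
def find_disagreements(gold, verdicts):
    """Return the ids of messages where a filter's verdict differs from the gold standard."""
    return sorted(
        msg_id for msg_id, verdict in verdicts.items()
        if gold.get(msg_id) != verdict
    )


def adjudicate(gold, filter_runs, human_label):
    """Revise the gold standard: every disagreement goes to a human, whose call is final."""
    revised = dict(gold)
    for verdicts in filter_runs.values():
        for msg_id in find_disagreements(gold, verdicts):
            revised[msg_id] = human_label(msg_id)
    return revised


# Hypothetical data: all three messages are really spam, but the first-pass
# gold standard got m2 and m3 wrong.
truth = {"m1": "spam", "m2": "spam", "m3": "spam"}
gold = {"m1": "spam", "m2": "ham", "m3": "ham"}
runs = {
    "filterA": {"m1": "spam", "m2": "spam", "m3": "ham"},
    "filterB": {"m1": "spam", "m2": "ham", "m3": "ham"},
}

revised = adjudicate(gold, runs, truth.__getitem__)
print(revised)
```

In this toy run, m2 gets fixed because filterA disagreed with the gold standard and forced a manual look. m3 stays wrong only because every filter made the same mistake as the original label, which is the residual error case described above.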

IMO, that's as good as a hand-classified corpus can get.

"old versions of software were used":

It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.

Given that, using 6-month-old release versions of the software under test seems reasonable.

SpamAssassin 2.60, the last release in which new rules were added to the released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should have had the edge. ;)

"test started with untrained filters":

IMO, that's the real world. People don't start with fully-trained filters.

In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.

"spam in the test is as old as 14 months":

Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.

"it should purge old data":

SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".

In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy instead of increasing it (at the default settings).

(Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
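
For illustration, here is a rough Python sketch of the idea of expiring by message time rather than wall-clock time. This is not SpamAssassin's actual expiry code: the whitespace tokenizer, the function names and the 90-day window are all assumptions made up for the example.

```python
from datetime import timedelta
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_time(raw_message):
    """Timestamp of the topmost Received header (the text after its last ';')."""
    received = message_from_string(raw_message).get_all("Received") or []
    if not received:
        return None
    return parsedate_to_datetime(received[0].rsplit(";", 1)[-1].strip())


def train(last_seen, raw_message):
    """Record, per token, the newest message time at which it was seen."""
    when = received_time(raw_message)
    if when is None:
        return None  # no usable timestamp, so skip this message
    body = message_from_string(raw_message).get_payload()
    for token in body.split():
        last_seen[token] = max(when, last_seen.get(token, when))
    return when


def expire_tokens(last_seen, newest_message_time, max_age):
    """Keep tokens seen within max_age of the newest message, not of 'now'."""
    cutoff = newest_message_time - max_age
    return {tok: seen for tok, seen in last_seen.items() if seen >= cutoff}


old = "Received: from a by b; Mon, 4 Aug 2003 10:00:00 +0000\n\nviagra cheap\n"
new = "Received: from a by b; Mon, 1 Mar 2004 10:00:00 +0000\n\ncheap mortgage\n"

last_seen = {}
train(last_seen, old)
newest = train(last_seen, new)
kept = expire_tokens(last_seen, newest, timedelta(days=90))
print(sorted(kept))
```

Because the cutoff is measured from the newest message in the corpus, replaying eight months of mail in one afternoon expires the same tokens as it would have in live use.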

And finally, what Henry said in comment 9520473.

--j.

Re:Anyone know how to get started with refi spam? on Confession For Two: A Spammer Spills it All · 2004-06-21 14:27 · Score: 1

Take a look at this message on the Spamassassin-users list for some interesting details of what happened when someone strung along those refi spammers. quote:

So I did talk to some of these lenders. Apparently they buy leads from www.lendergateway.com . One guy that I talked to was irritated because it costs him $100 per lead they sell him and it's supposed to only be sold to him. He apologized quite a bit and was nice enough to give me the information on who sold him the names. The number he gave me goes to voicemail which I'm going to try later. A couple other people told me what I can do with myself and one lady kept saying that she couldn't give me information on who provided her with my information.
The stupid thing is each time I talk to them I tell them I'm on a cell and that I need their name and number and I'll call them right back. They give it to me... So when they hang up I start calling again and again. I've been irritating the hell out of them...
Anyways, that's the fun story of what happens when these forms are filled out.

I agree -- this sounds like a very effective way to cause trouble for the spammers; if their customers aren't happy, they won't be ordering many more spam runs....

Re:Patents, and what they are and aren't on Microsoft Patents The Task List · 2004-06-08 19:37 · Score: 1

Actually, I'd suggest the poster learn a little more about patents, instead of ranting.

"The f**king summary" -- or at least the Claims part -- is exactly what governs what other implementations are judged to be infringing, or not. No matter how complex the further explanation is, if the claims are simple and broad, the danger for other software developers is similarly simple and broad.

(The further explanation is supposed to be a way for other implementors to easily use the patent to implement the same system, assuming they then go ahead and license it from the original inventor. Of course, in the software field, that explanation is generally never coherent or detailed enough to do so, without having to expend pretty much as much effort as if you wrote it from scratch yourself.)

good workaround: 'mail all commits' on Webmasters Pounce On Wiki Sandboxes · 2004-06-07 06:58 · Score: 1

We've also had problems on the SpamAssassin Wiki.

Our solution has been to ensure that all changes are emailed to a mailing list, where we can monitor them and remove the spam links within minutes of their arrival.

An ideal solution: Google should define an attribute for the A tag, which indicates that a URL should not be used in computing Page Rank. We could then modify our Wikis so that page links from Wikis are not included.

Same thing would work for weblog comment spamming, too.

Re:Poor bastard. on OptInRealBig Wins Restraining Order On SpamCop · 2004-05-12 07:01 · Score: 1

To save the 'poor bastard', I'd suggest getting this 10MB Quicktime .mov version instead; it loses the unfunny subtitles, and is hosted on archive.org, which can handle the traffic.

Re:Not to pick on just Microsoft... on Microsoft Assembles Patent Arsenal for Longhorn · 2004-05-04 08:07 · Score: 3, Insightful

I don't think you understand what happens if a developing country annoys the WTO by ignoring provisions of the WIPO and TRIPS treaties.

Check out what happened to Brazil when they tried to manufacture generic AZT without paying license fees to Glaxo. Here's a snippet from this doc:

'In 1996, Brazil passed a law authorizing the local production of five key anti-retroviral drugs used in the US. Some of the medications, such as AZT, an anti-retroviral drug that prevents the transmission of HIV from mother to child, were patented prior to 1995 when the WTO provisions first applied. These medicines fall outside the scope of TRIPS. Through its patent law, Brazil allows the drugs to be produced legally, without paying royalties. As a result, Brazil is able to provide free drugs to people living with HIV/AIDS. Recently, Brazil managed to persuade the US company Merck to lower the prices of two of its drugs, Crixivan and Stocrin, used to treat people with AIDS, by threatening to permit compulsory licensing if Merck did not cut prices by 50 per cent.
In the US government's view, a section of Brazil's law discriminated against foreign owners of patents. Under the law, designed to help build a national pharmaceutical industry and reduce the price of medicines, Brazil will honour a patent only if the drug is produced locally. Therefore, foreign companies must establish a presence in Brazil in order to enjoy protection. According to the US, TRIPS prohibited this kind of discrimination. The US government maintained steady diplomatic pressure on Brazil to get it to change its patent regime and medicines policy, backing up the pressure with a threat of unilateral trade sanctions.'

So, a developing country that came to the attention of a sufficiently-powerful US corporation in ignoring specific IP-related trade treaties, got slapped down with threats of unilateral trade sanctions.

For a developing country, sanctions are no small deal. Hell, even for the US, they're no small deal ;) I'd say the local software industry would quickly find out that TRIPS was back on the menu....

run the server on UNIX on Windows Source Control for the Lone Developer? · 2004-04-19 17:52 · Score: 1

Do yourself a favour -- don't try running the server on a Windows machine, it'll be a world of pain.

Just get hold of a clunky old PC, install linux, and use that as a dedicated source code control server with whatever system you want to use. You'll save yourself a lot of bother (and gain a bit more immunity to disk crashes, too).

Re:Two thoughts on Domain Based Spam Prevention? · 2004-01-28 06:19 · Score: 1

'One, wouldn't a normal Bayesian filter do this automatically? I.e., pick up that url in mail classified as spam and then weight it positively in the future?'

Yep, that's the case, in SpamAssassin 2.6x at least.
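
As a rough illustration of why this falls out of Bayesian filtering for free, here is a toy Python sketch. It is not SpamAssassin's tokenizer or its probability-combining code; the regex, the "url:" token prefix, the add-one smoothing and the example domain are all assumptions for the sake of the example.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://([^\s/]+)\S*", re.IGNORECASE)


def tokens(text):
    """Plain words, plus one 'url:<host>' token for each linked hostname."""
    hosts = {"url:" + host.lower() for host in URL_RE.findall(text)}
    words = set(re.findall(r"[a-z']+", URL_RE.sub(" ", text).lower()))
    return hosts | words


class TokenStats:
    """Per-token counts of how many spam and ham messages contained the token."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def learn(self, text, is_spam):
        (self.spam if is_spam else self.ham).update(tokens(text))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spamminess(self, token):
        """Smoothed estimate of P(spam | token); 0.5 means 'no evidence'."""
        s = (self.spam[token] + 1) / (self.n_spam + 2)
        h = (self.ham[token] + 1) / (self.n_ham + 2)
        return s / (s + h)


stats = TokenStats()
stats.learn("Refinance now at http://refi-deals.example/apply", True)
stats.learn("Lowest rates! http://refi-deals.example/rates", True)
stats.learn("Patch attached, see the list archive", False)
stats.learn("Lunch on Friday?", False)

print(stats.spamminess("url:refi-deals.example"))
print(stats.spamminess("url:never-seen.example"))
```

The hostname is just another token, so a domain that keeps turning up in mail trained as spam drifts towards a high score with no URL-specific rule needed.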

Re:Wont make a blind bit of difference on UK Spam Law Goes Live · 2003-12-11 06:57 · Score: 1

I find it amazing that, because spam comes from various offshore addresses, people always say that spam laws are pointless because the spam "all comes from overseas anyway". People say this no matter where the law is being discussed!

If it's not illegal *anywhere* then we've made no legislative progress whatsoever.

Basically, if spamming is illegal in the UK (and Ireland, and Australia, etc.) then (A) spammers cannot offshore to those countries, or outsource to spam bureaus there, so that's one set of possible spamhosting ISPs we don't have to worry about; (B) if a multinational company spams from the US to a recipient in the UK/Ireland/Australia etc., and has an office in those countries, it can still be held accountable for its spamming despite the US' weak laws; and (C) at least in Ireland, it may be possible to prosecute spammers in other European countries due to EU harmonization of laws and jurisdictions. (I think. IANAL.)

Me, I'm thinking that (B) may turn out to be handy against the serious mainsleaze spammers -- of which there are plenty, and given the CAN-SPAM act, there will be many many more quite soon.

MTA authors better watch out on IronPort Arms Both Sides In Spam War · 2003-12-03 14:03 · Score: 1

Ironport's appliances are good for sending lots of mail. That probably means they may be useful for sending lots of spam, but it also definitely means they're useful for sending legit bulk mail. It's a "dual-use" thing.

In fact, that applies to SMTP email in general. Consider all the mailing lists you read; they're "bulk email". That's one reason why spam filtering is harder in email than other protocols like IM; bulk one-to-many contact is a lot more common in the SMTP case. The IETF recognised this, and hence we have ESMTP.

Given this story, if I was Eric Allman or Wietse Venema, I'd be worried about people complaining that sendmail or postfix are spammer tools...

DEs with their own virtual filesystem layers on Freedesktop.org on KDE/Gnome, New Goals · 2003-11-24 11:41 · Score: 1

'Eugenia Loli-Queru: In your opinion, which is the hardest step to take in the road ahead for full interoperability between DEs? How far are we from the realization of this step?

'Havoc Pennington: I think the "URI namespace" or "virtual file system" issue is the ugliest problem right now. It bleeds into other things, such as MIME associations and WinFS-like functionality. It's technically very challenging to resolve this issue, and the impact of leaving it unresolved is fairly high. Here are some links on that here, here and here. '

OK -- so, unsurprisingly, having GNOME have one set of apps that can read one namespace, KDE have another set that can read another namespace, and a whole load of command line tools that can't read either, is a problem.

I still can't understand why this hasn't made it into a mainline kernel hook, or at least a shared library kludge. Something like AVFS is infinitely preferable to a filesystem that can only be accessed by a small subset of applications...

A bunch of 5-year-olds! on GameCube Tunneling Software Rivals Clash · 2003-11-14 08:59 · Score: 3, Informative

I took a look -- it's crazy.

One group seems to have written this 'Warp Pipe' tool, using Sourceforge infrastructure, declaring it under a BSD license (as far as I can make out from the comments) when they set up the SF project.

Another group then started working off that (supposedly open-source) codebase. The first group are not happy about this, and have decided it's now proprietary and want to remove rights to use that code.

(Either that, or they think users of a BSD-licensed package need 'express written consent of Warp Pipe to repackage or redistribute in any way'.)

Apparently, they didn't *actually* specify license terms in the source; but they must have claimed an open-source license in order to use Sourceforge. So at some point, they were a little 'unclear' about the license.

All very amateurish...

BTW, the sf.net project page is still there: here's a link: http://sourceforge.net/projects/cubeonline23/

And CVS: http://cvs.sourceforge.net/viewcvs.py/cubeonline23/WarpPipe/

Re:Mozilla bug on IBM Applies for Password Manager Patent · 2003-11-10 07:38 · Score: 1

doh. 'apply often and apply for anything' was what I meant to say. (mental note: use preview in future!)

Mozilla bug on IBM Applies for Password Manager Patent · 2003-11-10 07:34 · Score: 1

I reported this to Bugzilla last week. Interestingly, I also submitted it to /., but it was rejected ;)

IMO, IBM are doing the right thing in many areas, but their patent policy (apply and apply for anything) seems to be out of control.

original paper correct: blame an Excel screwup on Climate Data Re-examined (updated) · 2003-11-05 13:26 · Score: 1

Stop the presses -- the original paper looks like it was correct, as far as review of the M&M results reveals so far. It seems a screw-up somewhere resulted in exporting 159 columns of data into a 112-column Excel spreadsheet, which screwed up the analysis. (Blame MS! ;)

Also, theirs is not the only paper that supports the 'hockey stick' graph anyway -- there's quite a few others, too.

But anyway -- we're jumping the peer-review process heavily here. USA Today stories are supposed to happen after the peers do the reviewing ;)

Re:Oh, THAT eolas patent on W3C Requests Eolas Patent Re-Examination · 2003-10-29 07:27 · Score: 1

Yeah -- and I, as an Irishman, am well ashamed of that.

thumbs down on Replacing the Aging Init Procedure on Linux · 2003-10-02 11:57 · Score: 2, Insightful

Gotta say, I hate the idea. I've dealt with unusual apps in charge of starting services in the past (AIX had some kind of DCE-based service control daemon) -- and it was a world of hell. Shell scripts, by comparison, are comprehensible, tweakable, and very very easy to deal with. I know -- this sounds very unlikely -- but any system that has to deal with as many settings/dependencies/external hooks etc. as the boot scripts, is going to be that confusing anyway no matter what language it's in!

But I do like the idea of parallelization of the boot scripts, and starting X a whole lot earlier (like before the daemons are all started); I hacked up the init scripts to do this on my desktop linux machine a few years ago, and on Solaris and SunOS machines before that, and it was great for boot time.

Richard Gooch's need(8) and provide(8) tools look like a fantastic way to do this simply, comprehensibly, and without rewriting everything in a new language. That's available here, and that page notes that it should be in versions of init in util-linux since 2.10q.
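
To show what dependency-driven parallel startup buys you, here is a small Python sketch. It is not how need(8) and provide(8) work internally, and the service names and their dependencies are invented; it only groups services into "waves" where everything in one wave could be started at the same time.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later


def start_waves(needs):
    """Group services into waves; everything within one wave can start in parallel.

    `needs` maps each service to the set of services it depends on.
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical boot dependencies, for illustration only.
needs = {
    "network": set(),
    "syslog": set(),
    "xdm": {"syslog"},
    "sshd": {"network", "syslog"},
    "nfs": {"network"},
}

for number, wave in enumerate(start_waves(needs), start=1):
    print(number, wave)
```

With these made-up dependencies the display manager lands in the second wave alongside the network daemons, instead of waiting at the end of a serial list.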

Re:hmm on snopes.com's David Mikkelson Interviewed · 2003-08-03 18:19 · Score: 1

Wouldn't you know it, someone already mentioned that...

'Snopes was set up in early 1995 by the CIA as a way to debunk popular conspiracy theories. Companies and individuals can now pay to have their urban legend denied on the site, a prime beneficiary being Richard Gere.'

labelling on UK Expert Panel Split on GM Food Risks · 2003-07-22 13:39 · Score: 3, Insightful

'because you have to mark as GM anything that even could have come into contact with GM crops - this is 99.9% of American crops - nobody in the EU will buy any food exports from the US'.

Come on. Is this really a good argument? Why would you be against labelling a foodstuff as to its origin and provenance?

Sorry, I don't agree. IMO, the more info a consumer has on where their food comes from, how it was grown, what pesticides were used, whether it may contain GM pollen, how it was treated after picking, etc. -- the better.

It's simply called informing the consumer. Then the consumer can use their judgement instead of trusting some big, faceless organisation who Knows What's Good For You.

And then interested parties can persuade the consumers that GM is safe, and eating the tomatoes with the GM sticker is fine. That's OK, that makes sense. But don't use this 'information is bad' line, it's crap.

PS: re GM patents, etc. IMO the GM industry at the moment is acting like the RIAA; there's lots of good ways to use GM, but they're focused on the short term gain -- make $$$$ fast.

Opt-out - bad plan on Michigan Governor Signs Anti-Spam Bill · 2003-07-16 06:45 · Score: 1

It's worth reading Ray Everett-Church of CAUCE's comments on another opt-out based anti-spam bill:

'Any legislation that permits all of America's estimated 23 million small businesses to legally send everyone at least one email cannot be considered anti-spam. And any bill that limits a consumer's recourse to clicking an opt-out link 23 million times isn't going to make our lives any better. ....
Opt-out laws have let the problem grow to the state it is today; no one in Congress can supply an adequate explanation as to why opt-out at a national level will make any difference. Opt-out in Korea has been an unmitigated disaster and their legislature is rushing to repair the global damage their opt-out law has done to their Internet economy. California's opt-out law is being scrapped. And the European Union knew better than to waste time with a discredited approach and went straight to opt-in.'

CAUCE points out that the current proposals to Congress all suffer the same problem. Opt-in, as the EU have chosen, is the only way to reduce the flood of spam effectively, through legal means.

At least this law allows ISPs to prosecute spammers, and it does not block class action suits from multiple spam recipient consumers (AFAICS). Also the damages of $500 per message are a lot better than the proposed Texas state law's puny $10 per message.

But consider these facts: there's 23 million small businesses in the US. That means a lot of "I would like to opt out" mails you'll be sending. Multiply that by however many possible addresses you can receive mail at: foo@domain1.com, foo@[211.11.22.34], foo%domain1.com@domain1.com, root@domain1.com, postmaster@domain1.com, foo@forwardingservice.net, foo@perl.org, foo@users.sourceforge.net, etc. etc. etc.

Then there's the "tagged addressing" concept, where you "tag" the addresses you give out with additional text to identify who you gave it to, e.g. foo+amazon@domain1.com, foo+slashdot@domain1.com. Each of those is a different "e-mail address".
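
A quick Python sketch of how such tagged addresses break down, assuming the common '+' separator (some mail setups use a different character, and the function name here is made up):

```python
def split_tag(address):
    """Split 'user+tag@domain' into ('user@domain', 'tag'); tag is None if absent."""
    local, _, domain = address.rpartition("@")
    base, plus, tag = local.partition("+")
    return base + "@" + domain, (tag if plus else None)


addresses = [
    "foo+amazon@domain1.com",
    "foo+slashdot@domain1.com",
    "foo@domain1.com",
]

for address in addresses:
    print(address, "->", split_tag(address))

# One mailbox, but three distinct strings as far as an opt-out list is concerned.
mailboxes = {split_tag(address)[0] for address in addresses}
print(len(addresses), "addresses,", len(mailboxes), "mailbox")
```

The recipient sees one inbox; a sender's opt-out list sees three unrelated addresses, each needing its own opt-out request.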

Better get those typing fingers in shape :(

actually forced through TRIPS treaty on EU Sues Member Nations To Force Change In Patent Laws · 2003-07-11 13:17 · Score: 5, Informative

Should be pointed out that this was a condition of the WTO's TRIPS treaty of 1995:

Here the World Trade Organization (WTO) lent the biotech industry a shoulder to cry on by allowing the major players to formulate the Trade Related Intellectual Property Rights Agreement (TRIPS) which came into force in 1995. TRIPS aims to force all countries to take on board a menu of biotech patents and 'harmonize' their national patenting regimes accordingly - the aim is to make the world follow the US example.

This book review at Nature says: 'Central to this analysis is the account of the negotiation of TRIPS, whereby the campaign for globalized intellectual-property standards was shifted to the international trade agenda. Developing countries were persuaded to sign up to TRIPS in exchange for the liberalization of world trade markets. The subsequent failure of these markets to materialize (witness US steel tariffs and farm subsidies in the United States and Europe) also goes some way to explaining the growing disenchantment with TRIPS.'

See also why Biotech patents are patently absurd. As members of the WTO, and signatories to TRIPS, these countries really don't have a choice; they'd be in breach of the TRIPS treaty if they do not ratify these laws.

SA Public Corpus on Bayesian Filter Testing? · 2003-07-02 06:22 · Score: 1

There is one, for exactly this reason -- the SpamAssassin public corpus. I made it available for developers of spam tools to compare effectiveness using a good, recent corpus from 1 person's mail feed (as much as that was possible).

Here's the pertinent part of the README:

This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

All headers are reproduced in full. Some address obfuscation has taken place, and hostnames in some cases have been replaced with "spamassassin.taint.org" (which has a valid MX record). In most cases though, the headers appear as they were received.

All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites.

Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!

Copyright for the text in the messages remains with the original senders.

OK, now onto the corpus description. It's split into three parts, as follows:

spam: 500 spam messages, all received from non-spam-trap sources.

easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).

hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.

easy_ham_2: 1400 non-spam messages. A more recent addition to the set.

spam_2: 1397 spam messages. Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio.
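
The totals are easy to check from the per-directory counts quoted above; a few lines of Python, using the directory names from the README:

```python
corpus = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}

total = sum(corpus.values())
spam_total = sum(n for name, n in corpus.items() if name.startswith("spam"))
spam_percent = round(100 * spam_total / total, 1)

print(total, spam_total, spam_percent)  # 6047 1897 31.4
```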

You mean "Thank you, US" on EU Parliament to Vote on New Patent Rules · 2003-06-29 15:46 · Score: 2, Interesting

Actually, you can thank the US for that. ;)

One reason this has come up as an issue, is because the US (via the WTO) have been applying pressure to countries around the world to "reform" their IP systems -- to match the US' own system -- for quite a while.

The TRIPS (Trade Related Aspects of Intellectual Property Rights) treaty, and GATT, are the main methods used to do this. The FFII page on the treaty notes 'Article 27 has often been construed by patent lawyers to imply that patent claims must be allowed to extend to computer programs' (my emphasis).

FFII go on to make the case that this can be circumvented BTW; here's hoping, since all of Europe has signed up to TRIPS AFAIK.

Irish voters: MEP list on European Software Patents Vote Now June 30th · 2003-06-21 08:20 · Score: 1

As I wrote in my blog posting on the issue:

If you are a European and bothered by software patents, now is the time to write to (or even email) MEPs asking them to oppose this directive; it's the 'proposed software patentability directive as amended by JURI' (COM(2002)92 2002/0047). The letter should support the FFII/Eurolinux and/or Green position.

Irish voters: here's the list of Irish MEPs:

1. Mrs AHERN, Nuala -- Group of the Greens/European Free Alliance
2. Mr ANDREWS, Niall -- Union for Europe of the Nations Group
3. Mrs BANOTTI, Mary Elizabeth -- Group of the European People's Party (Christian Democrats) and European Democrats
4. Mr COLLINS, Gerard -- Union for Europe of the Nations Group
5. Mr COX, Pat -- Group of the European Liberal, Democrat and Reform Party
6. Mr CROWLEY, Brian -- Union for Europe of the Nations Group
7. Mr CUSHNAHAN, John Walls -- Group of the European People's Party (Christian Democrats) and European Democrats
8. Mr DE ROSSA, Proinsias -- Group of the Party of European Socialists
9. Mrs DOYLE, Avril -- Group of the European People's Party (Christian Democrats) and European Democrats
10. Mr FITZSIMONS, James (Jim) -- Union for Europe of the Nations Group
11. Mr HYLAND, Liam -- Union for Europe of the Nations Group
12. Mr McCARTIN, John Joseph -- Group of the European People's Party (Christian Democrats) and European Democrats
13. Mrs McKENNA, Patricia -- Group of the Greens/European Free Alliance
14. Mr O' NEACHTAIN, Sean -- Union for Europe of the Nations Group
15. Mrs SCALLON, Dana Rosemary -- Group of the European People's Party (Christian Democrats) and European Democrats

Please take the time to send them a letter, or even a mail. This really is a terrible proposal, and the last thing open source and small software developers need, is more software patents with an expanded range.

ghosts on Open Source Distributed Shell Tools? · 2003-06-18 12:15 · Score: 2, Informative

'ghosts' is a command which has been included with perl in the 'eg' directory since at least 4.036. It does this effectively, allowing you to do

gsh somemachines somecommand

or

gcp somefile somemachines:/etc/newfile

Worked great, last time I had to admin a large network (about 5 years ago ;). *EXTREMELY* simple, too.
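
If you just want the flavour of the idea without the tool, here is a rough Python sketch of "run one command on many machines". It is not how ghosts/gsh is implemented; it assumes passwordless ssh is already set up, and the function names and hostnames are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_command(host, command):
    """Argument list for running one shell command on one host via ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_everywhere(hosts, command, runner=subprocess.run):
    """Run the command on every host in parallel; return {host: (exit code, stdout)}."""
    def one(host):
        done = runner(ssh_command(host, command), capture_output=True, text=True)
        return host, (done.returncode, done.stdout)

    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(one, hosts))


# Example call (needs real, reachable hosts, so it is left commented out):
# print(run_everywhere(["web1.example", "web2.example"], "uptime"))
```

The `runner` parameter exists only so the function can be exercised without touching the network; in normal use the default `subprocess.run` does the work.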

http://outflux.net/unix/software/gsh/ seems to be an updating of this tool.

User: jmason

Comments · 64