Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method. Instead he was looking at MDL - Minimum Description Length. This is a technique for discovering features in a corpus that let you describe the classification of that corpus in the minimum number of details.
Basically it's a way to discover features in emails using compression techniques. Rather than having us SpamAssassin developers carefully and manually examine emails to see what's new and interesting about them, MDL techniques can detect these features automatically.
Jason Rennie's web page (talk and paper available) about this is here. Please do read it as it's extremely interesting.
The one downside is that, as Jason said at the end of his talk, it's extremely slow at doing the feature detection. When asked how slow, he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).
Compared to Bayesian techniques, the MDL technique is of great interest to me - primarily because I work for a company doing spam filtering at the internet level, so we can't feasibly do the personal training that makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.
I'm afraid you don't know what you're talking about. SpamAssassin 2.50 uses *exactly* the same tokenising techniques as Graham's filters (in fact some more advanced ones too) as well as SpamAssassin's original set of heuristics. These are combined together to get an overall better picture of the email. See my presentation on this topic that I gave at the recent spam conference.
And yet, all these years on, there's still no core (distributed with Java) way to get documentation directly on the command line (aside from generating a bunch of HTML files and calling lynx on them). The man page for javadoc even says: "javadoc parses the declarations and documentation comments in a set of Java source files and produces a corresponding set of HTML pages describing (by default) the public and protected classes, inner classes, interfaces, constructors, methods, and fields." And most of that is too low level for your average programmer, who just wants a synopsis.
I'll take my Perl man pages over javadoc any day of the year. Perl ships with tools to turn your documentation into man pages, text, HTML and LaTeX, and CPAN is full of tools to convert to many other formats. They may not be cross-referenced quite as automatically, or have whizzy features like tables and the other things that javadoc covers, but they are available right there where I program - in a shell. No browser required. And they work just fine over ssh, thank you. Not only that, but Perl documentation generally seems easier to follow to me, because it encourages you to include a synopsis of how the module should be used. Java programmers seem happy because the Java doc tools are better than what C or C++ offers, but there's a whole other world out there that you're missing.
I wrote perl's XML::SAX::PurePerl - a pure perl recursive descent parser entirely with the aid of print statements. I never use a debugger - I can't recall the last time I used one except for getting a stack trace.
However, I used to use debuggers. I just find that the more mature a programmer I become, the more I know almost exactly where the bug is as soon as it's described.
Spoken by someone who has never tried to run an open source business, I'm guessing.
Should you some day try to do this you'll realise very quickly that it's only the very largest projects that can make money in the ways you've stated above. Personally I have several dozen open source projects, including some rather popular ones. I've been soliciting money for a number of those projects in all the ways stated above, and the only one that actually brought money in was writing articles - simply because it didn't require the customer to come to me and lay money down directly.
A lot of small business people like myself trying to make open source work for them have realised this, and have had to switch to doing something proprietary to make a living. There's nothing wrong with that - it just turns out that open source isn't a very good source of income (it's still a great source of software though).
ActiveState started out porting Perl to Windows for Microsoft to put on the Resource Kit CD. Microsoft has funded and helped quite a lot of ActiveState's development. They have not yet killed ActiveState, nor shown any sign that they wish to.
Yes, but the poster you replied to was absolutely correct. No email MTA keeps a message in the queue after a 5xx error, as this is a hard bounce (retrying after a 5xx error would be a violation of the relevant RFCs). Only a soft bounce error (4xx) will keep the message in the queue. So the statement in the original message about using up disk space is totally invalid.
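To make the 4xx/5xx distinction concrete, here's a minimal sketch in Python. It is not taken from any real MTA; the function names and the sample queue are made up for illustration, and it only looks at the first digit of the final SMTP reply code.

```python
def disposition(code: int) -> str:
    """Map a final SMTP reply code to what a sending MTA does with the message.

    Hypothetical helper for illustration only, not any real MTA's API.
    """
    if 200 <= code <= 299:
        return "delivered"  # accepted by the remote side, leaves the queue
    if 400 <= code <= 499:
        return "retry"      # soft bounce: stays queued for a later attempt
    if 500 <= code <= 599:
        return "bounce"     # hard bounce: removed from the queue, sender notified
    raise ValueError(f"unexpected final SMTP reply code: {code}")


def remaining_queue(attempts: dict[str, int]) -> list[str]:
    """Return the message ids still occupying the queue after one delivery run."""
    return [msg for msg, code in attempts.items() if disposition(code) == "retry"]


if __name__ == "__main__":
    # Made-up message ids and reply codes.
    run = {"msg-a": 250, "msg-b": 451, "msg-c": 550}
    print(remaining_queue(run))
```

Only the message that got a 4xx reply is still taking up disk space after the run; the 5xx one is gone from the queue straight away.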
Could be some other room's. I tend to take my wireless hub with me when I go to stay at hotels, so I can plug it in and surf from the comfort of the couch or the bed. You could just be picking up the one belonging to the guy next door.
We also do this at the O'Reilly open source conferences - we make sure geeks are located close to each other, and then share one net connection bill between us. Sometimes we'll even set up wireless repeaters.
Nonsense. As the author of one of the available OpenOffice to HTML (and DocBook) converters out there, I can honestly say we did most of the work without the Schema in front of us (especially since that Schema is a 400+ page PDF). We just used plain old reverse engineering principles most of the time. Works damn well, and XML makes it infinitely simpler than a binary format.
Re:Please listen up to my noteworthy advice (on Professional PHP4) · Score: 4, Informative
Ah, hopefully grasshopper you will learn in time that embedding code inside your HTML is a really bad idea. Sadly PHP (and JSP and ASP and a whole host of other languages) encourages this behaviour.
What you really want is a good templating system.
(Yes, I know PHP can do templating systems, thank you).
You might want to consider Apache's AxKit. While parts of AxKit are written in Perl to make it faster for us to write, the key parts that do the heavy lifting are written in C for performance reasons. Some Fortune 500 companies are already running their sites on it, and we get an awful lot of people coming to us from Cocoon for performance reasons.
It's a full Apache project (under the XML umbrella), just like Cocoon is, and incorporates the same technologies (XSLT and XSP), but a lot of people skip over it because it's not Java. Maybe now is the time to re-evaluate that decision.
There's no way in hell we're going to be that lucky. A 50% increase in 5 years would make me jump with joy.
The truth is it's increasing at a much faster rate than that. Recent research has shown that it's going up about 400% per year!!! And my personal email account verifies that sort of increase.
I suspect Jupiter is going to be eating its own words. In 5 years I suspect we'll be seeing perhaps 50 times more spam, not 50% more.
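For a sense of scale, here's a quick sketch of the compound-growth arithmetic in Python. It assumes "up 400% per year" means the volume multiplies by five each year, and that the rate holds steady for the whole five years; both are assumptions for illustration, not figures from the research, and the function names are made up.

```python
def growth_factor(annual_increase_pct: float, years: int) -> float:
    """Total multiplier after compounding a yearly percentage increase."""
    return (1 + annual_increase_pct / 100) ** years


def annual_increase_pct(total_factor: float, years: int) -> float:
    """Yearly percentage increase that compounds to the given total multiplier."""
    return (total_factor ** (1 / years) - 1) * 100


if __name__ == "__main__":
    # A 400% yearly increase, sustained for 5 years.
    print(growth_factor(400, 5))
    # The yearly increase that would give 50 times the volume after 5 years.
    print(round(annual_increase_pct(50, 5), 1))
    # The yearly increase behind a 50% rise spread over 5 years.
    print(round(annual_increase_pct(1.5, 5), 1))
```

Even a yearly rate far below 400% compounds to around 50 times the volume over five years, while a 50% rise over the same period works out to only a few percent a year.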
Switching accounts isn't always that easy. My name is Matt Sergeant, and my email address is matt@sergeant.org. I'm just not changing that because someone who lives in a million dollar home thinks my address is his public shit can.
Luckily my job is detecting spam (I'm a SpamAssassin developer too), so I'm actually quite happy to get my address harvested loads of times :-) Bring it on, spammers.
But yes, I get lots of spam. About 100 a day. Not including mailing list subscriptions I get about 5 to 10 regular pieces of email a day. That's a hell of a ratio.
Try OpenOffice. My company sells an XSLT based filter that will turn OpenOffice documents (if using sensible styles) into DocBook XML. You may have to tweak it a bit to get exactly what you desire, but that's going to be the case with any tool.
just look at how Bruce Perens' threats to fork Web standards made the W3C reject RAND licensing.
You know, this is pretty hysterical. You're looking at the world through the microcosm that is Slashdot. The W3C didn't reject RAND because of Bruce Perens, they rejected it because their members didn't want it, and because many people in different communities (the XML community, the Web community, the vector graphics community, etc) fought against it. RAND was really unpopular in all but the larger members of the W3C (e.g. Microsoft, Adobe - and even they fought internally about it), because companies realised that they were better off with RF specifications.
There are a few options for content management systems built on top of AxKit.
First, if your needs are really simple you can try the AxKit wiki, which is the only wiki out there that allows you to enter data in either XML (sdocbook), WikiWiki text, or Perl's POD format. Although right now the wiki is extremely simplistic (no versioning or user management), it's quite extensible.
Next up the ladder of complexity is CallistoCMS, which has a really cool online editor component, basically allowing you to do almost WYSIWYG editing of XML content live in the web browser (it all just uses pure HTML+CSS+JS+DOM, no ActiveX or Java plugins involved).
Finally there's XIMS, which is basically what you might consider a full blown CMS, including versioning, metadata, workflow, etc.
Re:Parsing HTML in Perl (on Perl & LWP) · Score: 3, Interesting
Try XML::LibXML instead. It parses HTML, uses a DOM tree, and is all in C code, so uses about twice the memory of your source document, instead of about 8 times for a pure perl DOM.
Install SpamAssassin. It has an ok_languages option that you can set to only allow through languages you want, and it's pretty accurate at guessing.
SA2.50 (released soon - but nightly builds work well for most people) includes a Bayesian component just like POPFile, spambayes, bogofilter etc.
And according to their contacts page, Guru Rajan is their Chief Architect.
Hotmail just started using Brightmail, hence the drop in spam. It's nothing to do with blocklists or Verio.
[Disclaimer: I work in AV]
If cost is even slightly an issue, I can recommend using qpsmtpd and clamav. The clamav team are pretty fast at adding new virus signatures to their database, and they catch most of the common viruses out there. I've written a qpsmtpd plugin for clamav which you can find here.
I can't honestly recommend Sophos for gateway scanning. They are better on the desktop. If you can, I would go for NAI, who have the best gateway scanning of the commercially available scanners (according to our live tests).
Alternatively, if a 100% guarantee appeals to you, the company I work for, MessageLabs, will give you a 100% guarantee against letting through an email virus. We'll also do spam scanning for you. Yes, I'm biased.
FWIW, SpamAssassin 2.50 will include a statistical filter that works like other Bayesian filters.
It should be pretty cool, in that it will automatically train on SpamAssassin results, as well as allowing you to add or remove spam and non-spam messages yourself.
Matt (a SpamAssassin developer)
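For anyone wondering what a filter of this kind looks like, here's a minimal naive Bayes sketch in Python. It is emphatically not SpamAssassin's implementation; the class, its method names and the crude tokeniser are all made up for illustration. It just shows the general idea of learning token counts from spam and non-spam, and of being able to remove a message again.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam scorer (hypothetical, for illustration only)."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Crude tokeniser: lowercase words, each counted once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Add a message to the training data as 'spam' or 'ham'."""
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def forget(self, text, label):
        """Remove a message that was previously learned with this label."""
        self.tokens[label].subtract(self.tokenize(text))
        self.messages[label] -= 1

    def spam_probability(self, text):
        """Combine per-token evidence into one probability (Laplace smoothed)."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    f = TinyBayes()
    f.learn("cheap pills buy now", "spam")
    f.learn("buy cheap watches", "spam")
    f.learn("meeting notes for monday", "ham")
    f.learn("lunch on monday", "ham")
    print(f.spam_probability("cheap pills"))
    print(f.spam_probability("monday meeting"))
```

Training automatically on the results of other tests would just mean calling something like `learn` with whatever label those tests produced, and `forget` is the hook for correcting a message that was trained the wrong way.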
Full details of why this has the potential to break things are on the XML news site Cafe Con Leche.
Please read that before making uninformed comments - news.com isn't where you'll find technical information about this problem.
It's already being worked on.
Matt. (a SpamAssassin developer)