Domain: jgc.org
Stories and comments across the archive that link to jgc.org.
Stories · 16
-
Names That Break Computers (bbc.com)
Reader Thelasko writes: The BBC has a story about people with names that break computer databases. "When Jennifer Null tries to buy a plane ticket, she gets an error message on most websites. The site will say she has left the surname field blank and ask her to try again." Thelasko compares it to the XKCD comic about Bobby Tables, though it's a real problem that's also been experienced by a Hawaiian woman named Janice Keihanaikukauakahihulihe'ekahaunaele, whose last name exceeds the 36-character limit on state ID cards. And in 2010, programmer John Graham-Cumming complained about web sites (including Yahoo) which refused to accept hyphenated last names. Programmer Patrick McKenzie pointed the BBC to a 2011 W3C post highlighting the key issues with names, along with his own list of common mistaken assumptions. "They don't necessarily test for the edge cases," McKenzie says, noting that even when filing his own income taxes in Japan, his last name exceeds the number of characters allowed. -
HTTP/2.0 Opens Every New Connection It Makes With the Word 'PRISM' (jgc.org)
An anonymous reader writes: British programmer and writer John Graham-Cumming has spotted what appears to be a 'code-protest' in the next generation of the hypertext protocol. Each new connection forged by the HTTP/2.0 protocol spells out the word 'PRISM' obliquely, though the word itself is obscured to the casual observer by coded returns and line-breaks. Work on the hidden message in HTTP/2.0 seems to date back to nine days after the Snowden revelations broke, with the final commit completed by July of 2013. In July 2013 one of the protocol's architects appealed to the development group to reconsider design principles in the light of the revelations about the NSA's worldwide surveillance program. -
UK Government Owns 16.9 Million Unused IPv4 Addresses
hypnosec writes "The Department of Work and Pensions in the UK has a /8 block of IPv4 addresses that is unused. An e-petition was created asking the DWP to sell off the block to ease the IPv4 address scarcity in the RIPE region. John Graham-Cumming, the person who first discovered the unused block, discovered that these 16.9 million IP addresses were unused after checking in the ASN database." -
A Hacked WiFi Router, an API, and a Toy Bus: It's the Ambient Bus Arrival Monito
JohnGrahamCumming writes "In this simple project, a hacked Linksys WRT54GL talks to a public API to get real-time bus information, and displays the times of the next buses on a model bus. Never miss the bus again! 'It's possible to reflash the Linksys with a custom Linux installation that lets me control the box completely (and still use it as a wireless router). There are various project, but I used OpenWRT. With OpenWRT it's possible to SSH into the box and treat it as any Linux server (albeit a rather slow one). But there's plenty of power to grab bus times and update an LED display connected to the WRT54GL's serial port. " -
A Hacked WiFi Router, an API, and a Toy Bus: It's the Ambient Bus Arrival Monito
JohnGrahamCumming writes "In this simple project, a hacked Linksys WRT54GL talks to a public API to get real-time bus information, and displays the times of the next buses on a model bus. Never miss the bus again! 'It's possible to reflash the Linksys with a custom Linux installation that lets me control the box completely (and still use it as a wireless router). There are various project, but I used OpenWRT. With OpenWRT it's possible to SSH into the box and treat it as any Linux server (albeit a rather slow one). But there's plenty of power to grab bus times and update an LED display connected to the WRT54GL's serial port. " -
Haystack and the Myth of the Boy Wizard
Jamie sent in an interesting writeup about The Myth of the Boy Wizard. No, it's not about Hogwarts, but rather about Haystack and its creator, Austin Heap. Last summer the media covered the programmer, the software, and its supposed effect on Iranian censorship. But as is often the case, truth is less interesting than reality. What happened is that the story managed to press some magic buttons, and the media ran with it. This one is worth a read. -
Breaking the Fermilab Code
-
Subliminal Spam Using an Animated GIF
JohnGrahamCumming writes "Everyone's noticed the recent flood of image spam (including the SpamAssassin developers who are working on an OCR-extension to beat it), but take a look at this spam containing a subliminal message flashed every 17 seconds to try to entice you to buy the stock being pumped. Does this work? Warning: link shows the actual spam; don't blame me if you lose money on this stock!" -
Subliminal Spam Using an Animated GIF
JohnGrahamCumming writes "Everyone's noticed the recent flood of image spam (including the SpamAssassin developers who are working on an OCR-extension to beat it), but take a look at this spam containing a subliminal message flashed every 17 seconds to try to entice you to buy the stock being pumped. Does this work? Warning: link shows the actual spam; don't blame me if you lose money on this stock!" -
People Suck at Spotting Phishing
JohnGrahamCumming writes "Initial results at SpamOrHam.org show that people don't fare well when trying to spot spams and phishes. This blog entry shows some actual spams and phishes that people fell for, as well as genuine messages that they think are spam." The thing about these s[cp]ams is that they must work sometimes. When I see the messages, I can't fathom 'how'. -
Bayesian Filters Predict Sundance
JohnGrahamCumming writes "The LA Times reports on a company's use of Bayesian filtering to predict the winners at the Sundance Film Festival. They use a modified POPFile email filter and claim an 81% success rate." -
Ending Spam
Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification author Jonathan A. Zdziarski pages 312 publisher No Starch Press rating 8 reviewer Shalendra Chhabra ISBN 1593270526 summary Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters
Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.
Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.
The first section of the book discusses the history of spam, spam kings, different approaches for fighting spam such as blacklisting, whitelisting, heuristic filtering, challenge response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, spammer fingerprinting, etc. However, the author omitted any mention of locally-sensitive hash functions (such as Nilsimsa Hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).
In the next chapter, the author clearly explains various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with the description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinsons Inverse Chi Square (including the source code for the inversion function), and some other tricks for optimizing spam- filtering accuracy.
The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Innoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address harvesting bots.
The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.
The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. The number of technical errors are very few in this print, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic curve (ROC) would have been very helpful.
Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.
William S Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a Graduate Student in Department of Computer Science and Engineering at University of California, Riverside. He is on the development team of CRM114 Discriminator and has presented his work at MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Ending Spam
Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification author Jonathan A. Zdziarski pages 312 publisher No Starch Press rating 8 reviewer Shalendra Chhabra ISBN 1593270526 summary Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters
Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.
Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.
The first section of the book discusses the history of spam, spam kings, different approaches for fighting spam such as blacklisting, whitelisting, heuristic filtering, challenge response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, spammer fingerprinting, etc. However, the author omitted any mention of locally-sensitive hash functions (such as Nilsimsa Hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).
In the next chapter, the author clearly explains various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with the description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinsons Inverse Chi Square (including the source code for the inversion function), and some other tricks for optimizing spam- filtering accuracy.
The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Innoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address harvesting bots.
The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.
The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. The number of technical errors are very few in this print, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic curve (ROC) would have been very helpful.
Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.
William S Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a Graduate Student in Department of Computer Science and Engineering at University of California, Riverside. He is on the development team of CRM114 Discriminator and has presented his work at MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Microsoft Researchers on Stopping Spam
TheBackBencher writes "Scientific American today has a very interesting article about "Stopping Spam" by Joshua Goodman, David Hackerman and Robert Rounthwaite from Microsoft Research. They talk about different types of spam -- spam with emails, spam on IMs, spamlinks on web pages and image based spam. They mention different techniques for spam filtering mainly fingerprinting matching techniques, n grams model, naive bayesian approach, optical character recognition, challenge/response systems and Human Interacted Proofs (HIP) in a very lucid style. They however do not mention fingerprinting approach of using Nilsimsa Hash to tackle addition of random words by spammers in emails or hypertextus interruptus technique used by spammers of splitting words using HTML comments, pairs of zero width tags, or bogus tags. Also, Spam-Research is reporting the SplitFit Technique that Spammers are using to fool Yahoo! Mail SpamGuard." -
The Spam Conference 2005
dos_dude writes "This year's Spam Conference is over. As usual, the MIT provides low and high bandwidth webcasts. The talks featured a full spectrum of anything possible. From absurd to sound, from boring to entertaining, and from dead-horse-beating to brand-new. Highlights: John Graham-Cumming presented the results of the survey he did with the help of many Slashdot readers, Jon Praed gave the details of the trial against spammer Jeremy Jaynes and friends, Brian McWilliams posed the question what will happen when all spam is finally filtered, and Matthew Prince plugged Project Honeypot in a very entertaining way. Shameless but useful plug: here's the final schedule with links to the webcasts." -
DeCSS Loses Free Speech Shield
JohnGrahamCumming writes "BusinessWeek/CNET is reporting that the California Supreme Court has ruled that 'a Web publisher could be barred from posting DVD-copying code online without infringing on his free speech rights.' They also say that 'the state Supreme Court ruled that property and trade secrets rights outranked free speech rights in this case.'" According to the article, this "...overturned an earlier decision that said blocking Web publishers from posting the controversial piece of software called DeCSS, which can be used to help decrypt and copy DVDs, would violate their First Amendment rights."