Pre-coffee fog. Sorry. Typing got ahead of brain. Tripped up confounding the words-as-symbols/bytes-as-symbols distinction with the model markovity.
You are correct about the order-1 assertion. That should indeed have been order-N, where N is the length of the longest prefix string maintained explicitly or implicitly by a Ziv-Lempel dictionary or backpointer set. The Ziv-Lempel engines can be regarded as using shortened N-grams to represent classes of longer, yet-unseen N-grams; and they do use Markov models, where the stationary and transition probabilities are all set equal. In these cases, the probabilities only count for being zero or non-zero.
A "Bayesian Spam Filter" is order-0 if it relies only on token frequencies, where the tokens are complete strings, and not on conditional occurrences of word pairs. The assertion is that a spam filter mechanism would be improved if it relied on a higher-order underlying model, and if the symbols were taken to be bytes and not words. Under an order-N model, the probability of a string is the product of the conditional probabilities of its symbols, each conditioned on the N symbols before it. But any higher-order model, even one using within-message word digrams or trigrams, would probably be an improvement.
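Spelled out, the order-N product reads roughly as follows. The notation is mine: x_1 ... x_n are the symbols of the message, and contexts are simply truncated at the start of the string.

```latex
P(x_1 \dots x_n) = \prod_{i=1}^{n} P\left(x_i \mid x_{i-N} \dots x_{i-1}\right)
```

With N = 0 the conditioning drops out and this is just the product of plain symbol frequencies, which is the order-0 case.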
A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.
A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.
The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
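For what it's worth, here is a rough sketch of that combination in Python. It is not gzip's actual algorithm: it is a plain order-N byte model with add-one smoothing, one copy trained on spam and one on non-spam, compared by log-likelihood ratio plus a prior. The names (`ByteMarkovModel`, `spam_log_odds`), the order of 2, and the toy training strings are all made up for illustration.

```python
import math
from collections import defaultdict


class ByteMarkovModel:
    """Order-N model over bytes with add-one smoothing (illustrative only)."""

    def __init__(self, order=2):
        self.order = order
        # counts[context][byte] = how often `byte` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))
        # totals[context] = how often `context` was seen at all
        self.totals = defaultdict(int)

    def train(self, data: bytes) -> None:
        for i in range(len(data)):
            # Up to `order` preceding bytes; shorter at the start of the data.
            ctx = data[max(0, i - self.order):i]
            self.counts[ctx][data[i]] += 1
            self.totals[ctx] += 1

    def log_prob(self, data: bytes) -> float:
        """Sum of log conditional probabilities of each byte given its context."""
        lp = 0.0
        for i in range(len(data)):
            ctx = data[max(0, i - self.order):i]
            seen = self.counts.get(ctx, {}).get(data[i], 0)
            total = self.totals.get(ctx, 0)
            # Add-one smoothing over the 256 possible byte values, so
            # unseen bytes and unseen contexts never get probability zero.
            lp += math.log((seen + 1) / (total + 256))
        return lp


def spam_log_odds(message: bytes, spam_model, ham_model, prior_spam=0.5) -> float:
    """Bayes' rule in log form: positive means 'more likely spam'."""
    prior = math.log(prior_spam) - math.log(1.0 - prior_spam)
    return prior + spam_model.log_prob(message) - ham_model.log_prob(message)


# Toy training data, invented for this example.
spam_model = ByteMarkovModel(order=2)
spam_model.train(b"buy cheap pills now buy cheap pills now")

ham_model = ByteMarkovModel(order=2)
ham_model.train(b"meeting at noon about the project schedule")

print(spam_log_odds(b"cheap pills", spam_model, ham_model))
print(spam_log_odds(b"project meeting", spam_model, ham_model))
```

Since the symbols are bytes, nothing in there cares whether the input is readable text or arbitrary binary. A real filter would need far more training data and probably a better smoothing scheme than add-one.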
...it was *Mrs. Glitch* who should really get the credit.
...I'd be *pissed off*. All this scheme is going to do is boost the value of used CDs.
Wonder if they're going to try this with Classical too. The incentive to go for a new album would shrink from minuscule to zero.
Sony is killing its own children.