Gmail Now Rejects Emails With Misleading Combinations of Unicode Characters
An anonymous reader writes: Google today announced it is implementing a new effort to thwart spammers and scammers: the open standard known as Unicode Consortium's "Highly Restricted" specification. In short, Gmail now rejects emails from domains that use what the Unicode community has identified as potentially misleading combinations of letters. The news today follows Google's announcement last week that Gmail has gained support for accented and non-Latin characters. The company is clearly okay with international domains, as long as they aren't abused to trick its users.
...
...of the e-mail. Any attempt to block spam or phishing on the basis of mixing character sets would have to confront the fact that some people do need to mix character sets. Typically representations of Mari in the Latin alphabet, for example, also make use of the Greek letters beta and eta. In fact, eta is used in Latin representations of several minority languages of Russia. And the Reddit crowd loves making weird smilies in their English-language writing by means of symbols drawn from Indian scripts.
If this spells death to those ridiculous smilies then it's ok with me.
I routinely substitute Cyrillic letters for Latin on Disqus and other forums to get around their filters (which block for more than mere "profanity").
Slashdot does not allow non-ASCII characters — although it does not attempt to screen out profanity either.
In Soviet Washington the swamp drains you.
OK, good. Now if ICANN applied that tougher standard to domain name registrars, we'd make progress. But no, ICANN still allows registrars to register domain names without forcing them to comply with the most restrictive profile.
This looks like fun, I probably wouldn't catch that bank example and family certainly wouldn't. Looks like pretty much any word could substitute one letter.
No idea exactly what these "combinations" are. The example used one letter substitution. Using this example and the little display of new letters there would appear to be billions of potentially misleading combinations.
If I start a business with a unicode domain, and if later a scammer registers an ascii domain that is similar looking, then Gmail will blackhole my business, not the scammer, because I'm the one using unicode.
And the latest round of whack-a-mole begins...
...absolutely nothing! The scammers will just find some other way to create their automated email garbage.
q898(^*$*EUIDXEZ{Pm;vd80eGUIOIO:>P{
{}.
det6767ir6768P)I*)&%B(()_}K>?YIBV$WCJ!!!!!
Or perhaps more practically, needing to send email with multiple translations in them. Either as a courtesy to your audience who may speak English or French, or German, and you're not quite sure which they're more comfortable with. So you send your email with all three languages in it.
North American based companies may do English, French and Spanish in their email.
Though perhaps one area where they could block in the body is in HTML tags - if there's a restricted character in a link, perhaps that's a reason to block.
Good that this applies to from: and not the body of the e-mail.
That's not at all good, and filtering the body is exactly what I want.
Spammers already spoof the from: domain and then link you out to exactly the type of domain that Gmail is now filtering.
There's no reason Gmail can't flag [body] links to domains that use mixed character sets.
[Fuck Beta]
o0t!
Damn, now I see it's just domains; I thought they killed all my German and French spammers.
90% of the population would be better off with a white listed email account, i.e. if you are not on their list the email does not get through. END OF STORY.
It would seem to be more efficient to filter mail IN than to filter it out. Most people would have 20 or so people they actually want mail from.
I have mail accounts strictly for family and my local email rules enforce this
I have mail accounts for "sign up" sessions for competitions that I know are going to get spammed to hell
I have mail account for work, another for my business , etc etc all with differing contacts.
White listing would pretty much kill off spam: if there is zero chance of it getting through, what is the point? Currently spammers get through because of outdated spam lists, new tricks to get around Bayesian filters, etc etc etc. White lists would negate the need.
Google, if you set up a white listed email system, my friends and family will happily sign up.
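The allow-list idea in the comment above can be sketched in a few lines. This is only an illustration: the function name `is_allowed` and the example addresses are made up here, and because the From: header is trivially spoofable (as other commenters point out), a real setup would also need sender authentication.

```python
from email.utils import parseaddr

def is_allowed(from_header, allowlist):
    """Return True only if the address in a From: header is on the allow-list.

    `allowlist` is assumed to hold lowercase addresses.
    """
    _display_name, address = parseaddr(from_header)
    return address.lower() in allowlist

# Hypothetical allow-list for a family-only mailbox.
family = {"mum@example.org", "dad@example.org"}

print(is_allowed("Mum <Mum@Example.org>", family))
print(is_allowed("Win a prize <promo@example.net>", family))
```

Everything not on the list is dropped, which is what makes the scheme strict, and also why new acquaintances cannot reach you until you add them.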
If you use Unicode for domains, addresses, certificates and whatnot you are begging for an endless cascade of support problems and glitches, not to mention security vulnerabilities. Let others exercise all these broken code paths for you while you avoid the fail. Eventually, after most of the broken code gets cycled out of use, many years from now, you may then safely allow this stuff into real systems.
Unicode breaks all sorts of stuff in subtle and unfixed ways. A fine example from a widely used Microsoft system (W2K8 R2 SP1, three years old) is this gem: http://support.microsoft.com/kb/2597665; IIS can't handle Unicode attributes in x509 certs. You have to "hotfix" that broken OS to deal with Unicode.
Just leave it be another decade or so, if you can.
For those of you frothing at the mouth to write "BUT BUT I HAVE TOO!!!!1" re-read the end of that last sentence over and over till it sinks in; not everyone can avoid dealing with this. My sympathies. I'm writing for those that can.
As much as I can appreciate the intent and the fact that this will solve 99.999% of people's problems for this type of spamming and create 00.0000000001% of problems for legitimate users, it still feels a little like Google is trying to be the thought police on this one; you know free speech and all.
Seriously,
most filters are now "very good". And, I make new acquaintances, connections and friends. They have new email addresses that aren't in the whitelist. But, the filters pretty much just work.
IME, Gmail is rejecting a lot of legitimate mail nowadays.
Their filters used to be good, but they completely fucked it up lately.
Knowledge is power; knowledge shared is power lost.
Unfortunately those aren't likely to be mistaken for latin characters... They probably get a free pass.
I never did see a domain with non-Latin characters in spam. I have seen Russian, Chinese or Japanese text in the body and subject line.
...unless they're in code page 1252.
Any sufficiently unpopular but cohesive argument is indistinguishable from trolling.
will I talk to ZALGO!
As an interesting background fact, I heard that Google has an advanced Al doing all this stuff completely autonomously.
His real name is Albert, by the way.
They're reason enough for me to almost believe whoever designed ASCII was a genius.
Addresses should be simple and easy to learn and transmit over as many means of transport as possible. We had a working world-wide de-facto standard: 7-bit ASCII. Sure, there were no accented letters, no support for Asian scripts, etc., but it worked. Addresses are infrastructure. You can send anything you want as content. If you need to write Hindi in an email, then do so. That should not require all mail masters to upgrade their software to handle Hindi.
(I write this as someone whose native language has letters beyond ASCII.)
GMail doesn't accept all comers. Get too many complaints and they'll reject you... this is just new ideas to add to that filter. There's a list of words you can't say on GMail without it getting read, they don't publish those lists because that'll never be said to them.
And so this "standard" was designed in this way because country A didn't want its script mixed up with country B, introducing vulnerabilities into the DNS system in the process. As in '' '' and 'A' all encode to different unicode er .. codes.
Slashdot does not allow non-ASCII characters...
Óh réällý?
I live ze unknown. I love ze unknown. I am ze unknown.
That's pretty cool. I guess, the entire ISO-8859-15 is Ok? But not Cyrillics :-( Or else, you would've seen some Ukrainian-Russian conflict right here...
In Soviet Washington the swamp drains you.
They are right to do so. There are letters in different alphabets that look very, very similar, or that are in fact drawn exactly the same, depending on the font used.
This can be exploited for interesting uses. For example, "E" and "Ε"** are respectively the Latin "e" and the Greek "epsilon" vowels, but they are indistinguishable in caps, at least in Arial font. The second one is Unicode code point U+0395. My name has an "E" in it, and for my email signature I spell my name using the traditional Latin letter from the keyboard when the email is important and should be archived. By contrast, when the email is mostly irrelevant for future use (such as meeting arrangement emails, which are useless after the meeting takes place) I spell my name using the Greek epsilon letter (hint: 395 followed by Alt+X in most Windows programs). There is no obvious difference for the receiver, but a search tool can be used to quickly find all sent emails which can be deleted safely.
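For anyone who wants to check the two letters rather than trust their eyes, here is a small Python sketch. The Greek letter is written as an escape sequence so it survives pages that strip non-ASCII text; the helper name `describe` is just for illustration.

```python
import unicodedata

latin_e = "E"        # typed from a Latin keyboard
greek_e = "\u0395"   # Greek capital epsilon, written as an escape

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(latin_e))   # U+0045 LATIN CAPITAL LETTER E
print(describe(greek_e))   # U+0395 GREEK CAPITAL LETTER EPSILON
print(latin_e == greek_e)  # False
```

The two glyphs may render identically, but a string comparison or a search tool sees two different code points, which is exactly what the signature trick relies on.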
While the previous is a somewhat "legit" use, in general any word which combines letters from different alphabets could be used to confuse and trick the receiver, for example by creating an email account which reads exactly the same as the one from another person. There is a nice image of 5 letters a-b-c-d-e in different alphabets in the linked post. I agree with Google in preventing such combinations for email accounts. It would be interesting to know the exact policy used to forbid account names, which is not detailed.
** At the time of writing, these two letters look exactly the same. Classic Slashdot lacks Unicode support and does not represent the greek Unicode letter from my comment. I tried logging into Slashdot Beta (first time, I swear it!!) and it seems to represent a different letter... Please try this on your own computer!
I found it amusing that you are aware of the existence of different fields in email and then used your post to demonstrate that you have no fucking clue how to use them. Sentences should not be split across the subject and body fields.
I must have about 50 filters to auto-delete some of the really basic, obvious spam that Google accepts to my gmail account, and I still get 20-30 spams a day. Auto-filtered into my spam folder, but even so I still have to look at it because I do get the occasional false positive.
Contrast that with the DNS-RBL+SpamAssassin+procmail setup I have on the MTA for my own domain, which successfully turns away or drops about 99.99% of the spam that arrives at that email address.
I thought those guys at google were supposed to be smart. If they're so smart, why can't their mail system recognize the obvious spam?
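For readers who have not run a DNS-RBL: the lookup is just a DNS query for the sender's IP address with its octets reversed, under the blocklist's zone. A minimal sketch of building that query name follows. The zone `dnsbl.example.org` is a placeholder, not a real list, and the function name is invented for this example.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name a blocklist lookup would resolve for an IPv4 sender."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```

By convention, an A record in the answer (usually somewhere in 127.0.0.0/8) means the address is listed, and NXDOMAIN means it is not. The MTA can then reject the connection before the message body is ever accepted.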
Heuristics could pretty easily determine if someone communicates only in English in their e-mails, and as such, any legitimate e-mails that contain large amounts of non-English words or characters should be viewed with greater suspicion. For those that routinely communicate in more than one language and use non-ASCII sets, the heuristic should be able to account for that fact.
These sorts of rules are always fuzzy by nature. Obviously, whether an e-mail is determined to be legitimate or not is due to many different factors. This could simply be one of those contributing factors.
Irony: Agile development has too much inertia to be abandoned now.
Boo, slashdot discards Esperanto and Polish letters; surprisingly lame!
It's not as if UTF-8 is some crazy new thing, sheesh.
It allows combinations of Latin + Han + Hiragana + Katakana; Latin + Han + Bopomofo; or Latin + Han + Hangul.
There are a lot of equally safe combinations - what about Latin + Devanagari + Tamil? There would be no look-alike characters and it would allow a lot of people to put their name in multiple scripts that are likely to be meaningful to certain audiences (e.g. someone from Tamil Nadu sending an email to people throughout India and internationally). I'm sure that there are many other combinations that wouldn't have "look alike" issues but which would be useful
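To make the rule concrete, here is a rough Python sketch of a check limited to the script combinations listed above. It is not Gmail's code and not the full Unicode specification. As a loudly stated assumption, it guesses each character's script from the first word of its Unicode name, because the standard library does not expose the real Script property. All names in it are invented for illustration.

```python
import unicodedata

# ASSUMPTION: the first word of the character name approximates its script.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin", "GREEK": "Greek", "CYRILLIC": "Cyrillic",
    "CJK": "Han", "HIRAGANA": "Hiragana", "KATAKANA": "Katakana",
    "BOPOMOFO": "Bopomofo", "HANGUL": "Hangul",
    "DEVANAGARI": "Devanagari", "TAMIL": "Tamil",
}

# The three mixes named in the comment above.
ALLOWED_MIXES = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]

def scripts_in(label):
    """Collect the (approximate) scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if ch.isascii() and not ch.isalpha():
            continue  # digits, hyphen, dot: treated as script-neutral
        prefix = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        scripts.add(NAME_PREFIX_TO_SCRIPT.get(prefix, "Other:" + prefix))
    return scripts

def passes_highly_restricted(label):
    """Single-script labels pass; mixed ones must fit an allowed mix."""
    scripts = scripts_in(label)
    if len(scripts) <= 1:
        return True
    return any(scripts <= allowed for allowed in ALLOWED_MIXES)

print(passes_highly_restricted("mybank"))            # True
print(passes_highly_restricted("myb\u0430nk"))       # False (Cyrillic a)
print(passes_highly_restricted("tokyo\u6771\u4eac")) # True (Latin + Han)
print(passes_highly_restricted("a\u0905\u0b85"))     # False (Latin + Devanagari + Tamil)
```

The last line shows the complaint above: a Latin + Devanagari + Tamil name is rejected by this profile even though none of those letters resemble each other.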
I'm curious about why you need to get around the filters. If you disagree with filtering in general, do you really think rebelling against it on some Internet forums is going to make a difference? Why not just move on to less-restricted forums or stay and follow the rules?
The "highly restricted" spec is meant to catch suspicious combos like the one in the mybank example, but it does not catch all-ASCII trickery (ASCII-only being an even more restrictive level) like tvvitter.com (notice the two "v" chars). That combo in particular is now well known, but it goes to show that trickery does not need charsets larger than 7-bit... some people simply get caught by hsbc.net...
-- "Simplicity is prerequisite for reliability." --Dijkstra
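A quick sketch of how a pure-ASCII lookalike such as tvvitter.com could be flagged: fold known confusable sequences into one "skeleton" and compare skeletons. The table below is a hypothetical, hand-picked handful of pairs, nowhere near a complete confusables list, and the function names are made up for this example.

```python
# Hypothetical, hand-picked ASCII sequences that can pass for another letter.
ASCII_LOOKALIKES = [
    ("vv", "w"),
    ("rn", "m"),
    ("0", "o"),
    ("1", "l"),
]

def skeleton(domain):
    """Fold lookalike sequences so visually similar domains collapse together."""
    s = domain.lower()
    for sequence, replacement in ASCII_LOOKALIKES:
        s = s.replace(sequence, replacement)
    return s

def looks_like(candidate, trusted):
    """True if candidate is a different string that folds to the same skeleton."""
    return (candidate.lower() != trusted.lower()
            and skeleton(candidate) == skeleton(trusted))

print(looks_like("tvvitter.com", "twitter.com"))  # True
print(looks_like("twitter.com", "twitter.com"))   # False
print(looks_like("hsbc.net", "hsbc.com"))         # False
```

Note the last case: swapping the TLD involves no lookalike characters at all, so no amount of character filtering catches it.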
That we have a supposedly "universal" character set that is not universally usable without considerable blacklists and whitelists bolted on as an afterthought. In fact, Spotify already learned the hard way that the standard ways to compare Unicode strings just don't cut it, and inventing your own is fraught with peril. There's much more that is slightly, subtly, insidiously "off" with Unicode, before we even consider the cost in code size and its associated costs.
In other words, it's not really suitable for real-world use, for you can only (and then only so-so) trust it if you generated it yourself. As soon as the unicode comes from elsewhere it's a liability to safely reading the input.
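To illustrate the comparison problem in general terms (this is not a reconstruction of what Spotify did), here is a short Python sketch. It shows why a plain `==` is not enough, what the commonly suggested normalize-then-casefold recipe fixes, and one thing it does not fix. The helper name `canonical` is invented for the example.

```python
import unicodedata

def canonical(s):
    """NFKC-normalize, then case-fold. A common recipe, but not a complete one."""
    return unicodedata.normalize("NFKC", s).casefold()

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

print(composed == decomposed)                        # False
print(canonical(composed) == canonical(decomposed))  # True

# Compatibility forms fold too: U+FB01 is the "fi" ligature.
print(canonical("\ufb01le") == "file")               # True

# But cross-script lookalikes survive normalization: U+0430 is Cyrillic a.
print(canonical("p\u0430ypal") == canonical("paypal"))  # False
```

So normalization handles different encodings of the same text, while lookalikes from another script need a separate check, which is the extra machinery being complained about above.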
...and should be hanged by his nuts
He's trying to post on local news websites (TV stations) is my guess, they all have their comments farmed out to Disqus or Topix these days. The problem is you can't engage in a conversation without editing your comment 10 times, with no indication at any point what is actually being flagged as inappropriate, or else by subverting the filter as GP mentioned. They aren't just filtering vulgarity. Words like bribe and corrupt are blocked by my TV station's Topix comments, so it's hard to discuss politicians for example.
Google, if you set up a white listed email system, my friends and family will happily sign up.
They did, it's called Google+. Nobody seems to like it.
Wealth is the gift that keeps on giving.