Unicode Encoding Flaw Widespread

Limited impact. by shird · 2007-05-21 18:38 · Score: 3, Informative

This appears to be limited to content scanning, and isn't really a vulnerability in itself. Relying on content scanning to prevent an exploit from reaching an exploitable system is a pretty bad idea; it's much better to fix the system than to lean on the extra layer of defense on the outside.

Content scanning is mostly useful for filtering known exploits, and is hardly meant to be your primary defense. Being able to bypass this scanning won't buy you much. If the content scanner is aware of an exploit it scans for, chances are the systems being targeted are aware of it too and have been patched to protect against it.

--
I.O.U One Sig.

Re:Limited impact. by TheRaven64 · 2007-05-21 22:27 · Score: 4, Informative

Windows makes no distinction between privileged and unprivileged ports, so any application that can open sockets can listen on port 80. That said, every port number (and every other object in the NT kernel) has an associated ACL, so it is possible to limit them on an individual basis. I've never seen this exposed to the UI though, so I've no idea how you'd go about doing it. Filesystem objects also have ACLs, so I'd imagine that IIS is not allowed access to the filesystem outside the tree it is sharing.
The NT kernel provides a lot of facilities that are very useful for writing secure code. I often wonder if the application developers at Microsoft ever noticed that they weren't writing code on top of DOS anymore...

--
I am TheRaven on Soylent News
Re:Limited impact. by Ravnen · 2007-05-21 22:44 · Score: 2, Insightful

The Network Service account on Windows has similar privileges to a normal user, which means it can't access files owned by other users, but can of course read some files owned by the system. The notion of reserved ports doesn't exist on Windows, so no software makes security assumptions based on whether or not a port is below 1024, and the ability to open port 80 doesn't imply any higher privileges than the ability to open any other port.
At any rate, running in a chroot jail is arguably better in some ways than just running as an unprivileged user. Vista has some sandboxing features, using 'integrity levels' and redirecting various file and registry accesses to a 'virtual store', but I'm not really familiar with them, except for the basics, and I don't know if IIS uses them anyway.
Re:Limited impact. by fatphil · 2007-05-21 23:32 · Score: 4, Insightful

I think you've missed his point. There are now two ways that, for example, a quote character can be passed as user input to your program: either as " or as %ublah.

Your program, sitting below the layer performing the unicode translations, doesn't need to do anything differently from before, as it doesn't matter which of the two methods was used. If you _relied on_ the layers above you to strip out, reject, escape, or whatever, quote characters, then you're writing teabag code, and should get a job selling flowers instead, as software engineering is beyond you.

Always validate user input to your own specification. Never rely on something external to do it.

This exploit hasn't changed the rules one little bit, it's just highlighted the fact that some idiots don't follow them.
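
As a rough illustration of validating to your own specification, here is a minimal Python sketch. The username rule (3 to 16 ASCII letters, digits or underscores) and the names USERNAME_RE and is_valid_username are made up for the example; nothing here comes from the advisory.

```python
import re

# Hypothetical specification: 3-16 ASCII letters, digits or underscores.
# The ranges are spelled out because \w would also accept fullwidth letters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,16}")


def is_valid_username(value: str) -> bool:
    """Accept only input that matches the whole allow-list pattern."""
    return USERNAME_RE.fullmatch(value) is not None


print(is_valid_username("fatphil"))        # plain ASCII name
print(is_valid_username("o'brien"))        # ASCII apostrophe
print(is_valid_username("o\uff07brien"))   # FULLWIDTH APOSTROPHE, U+FF07
```

Because the check is an allow-list, both quote forms are rejected without the code needing to know about either of them.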

--
Also FatPhil on SoylentNews, id 863
Re:Limited impact. by CastrTroy · 2007-05-22 00:57 · Score: 2, Insightful

Is this another problem with unescaped quotes? When will people learn? Not an hour goes by that a system doesn't get attacked by SQL injection attacks. Why do programmers continue to not use things like prepared statements, which are invulnerable against such attacks? I blame it on the people writing the tutorials. Every beginner tutorial on the web shows queries being constructed at runtime, and doesn't have any mention of how insecure doing things like this is. It's hard to break the habit once you've been programming like that for so long.
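
For anyone who has only seen the string-concatenation tutorials, a parameterized query looks roughly like this. It is a minimal sketch using Python's built-in sqlite3 module; the users table and the find_user helper are invented for the example.

```python
import sqlite3


def find_user(conn, name):
    """Look up a user by name with a bound parameter instead of string pasting."""
    # The ? placeholder passes `name` as data; it is never parsed as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

rows_ok = find_user(conn, "alice")
rows_attack = find_user(conn, "' OR '1'='1")

print(rows_ok)       # the one matching row
print(rows_attack)   # no rows: the payload is compared as a literal string
```

The classic injection string simply fails to match any name, whichever kind of quote character it contains.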

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Limited impact. by flydpnkrtn · 2007-05-22 01:03 · Score: 2, Informative

I think he meant getting to "port object permissions" on a programmatic level... with an API. What you are describing are filesystem Access Control Lists. He's talking about using ACLs on ports. Everything being an object in NT, and being able to have ACLs applied to "everything," is a good idea. As the grandparent said, the application developers at MS just have to use them.

Basically the "Security tab" you see for files could be applied to individual ports.

--
Here's to the crazy ones
Re:Limited impact. by jZnat · 2007-05-22 02:56 · Score: 2, Insightful

Well, the way I see it, there are three ways to handle Unicode characters (one of which is wrong): store as full two-byte Unicode values (inefficient when using mostly ASCII characters like in english), store in a UTF character set such as UTF-8 (useful for primarily ASCII text as it is a superset of ASCII), or pretend it isn't Unicode and treat it as two (or three if input is in UTF-8 for example) separate ASCII characters (bad).

So, perhaps if data was all stored and represented in UTF-8, for example, this wouldn't be a problem? Or perhaps stored as raw Unicode characters via wchar_t (or language equivalent like u"" in Python)?
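
A quick check in Python suggests the storage encoding on its own doesn't settle the question. This is only a sketch; the variable names are arbitrary.

```python
ascii_text = "quote"
fullwidth_quote = "\uff02"  # FULLWIDTH QUOTATION MARK

# Pure ASCII text is byte-for-byte identical in UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# The same text takes two bytes per character in UTF-16.
utf16_length = len(ascii_text.encode("utf-16-le"))

# The fullwidth quote stays a separate code point whatever the encoding.
fullwidth_utf8 = fullwidth_quote.encode("utf-8")
is_ascii_quote = fullwidth_quote == '"'

print(same_bytes, utf16_length, fullwidth_utf8, is_ascii_quote)
```

In UTF-8 or UTF-16 the fullwidth quote never compares equal to the ASCII one. The trouble only appears when something folds it down to the ASCII character.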

--
'Yes, firefox is indeed greater than women. Can women block pops up for you? No. Can Firefox show you naked women? Yes.'
Re:Limited impact. by rabtech · 2007-05-22 03:11 · Score: 4, Informative

The NT kernel has a root namespace for everything in the system (from local filesystems to network drives to sockets to synchronization objects like mutexes), and in fact treats everything as a file (just like Unix) underneath.

Using the Native (NT Executive) API you can read or set the ACL on any object in the namespace, assuming you have the appropriate user rights and you own the object (or the ACL allows you to modify the permissions). NT kernel objects can also be case-sensitive (though that can confuse some Win32 programs). Often, you can delete, move, etc files that are locked by the Win32 subsystem, which can be useful in certain situations (though in Vista they made the IO system capable of cancelling outstanding IOs on its own so the zombie process bug that ends up locking files doesn't happen anymore. It's unfortunate Vista is so DRM-laden, or I'd try upgrading.)

The APIs are NtQuerySecurityObject and NtSetSecurityObject and I believe the devices are in \Device\Tcp, \Device\Ip, \Device\RawIp, \Device\Udp, etc. Check out http://undocumented.ntinternals.net/ for more details on what is in the native API (ntdll). This API provides everything necessary to implement a full POSIX layer, which is exactly what Services for Unix does, installing itself as a new runtime subsystem right next to the Win32 subsystem. (With Server 2003 R2 SP2 they shipped it as an available component as part of the install; I've even got setuid support and GCC installed as part of the package.)

--
Natural != (nontoxic || beneficial)
Re:Limited impact. by MikeB90 · 2007-05-22 06:13 · Score: 2, Interesting

The point is that you, in your own program, might have escaped or regexed items incorrectly and be open to this attack. Of course you don't blindly depend on some "magic" function. Duh! But you yourself are mortal too. And I doubt many people knew about fullwidth/halfwidth Unicode transforms. The fact that one of the articles linked to this says they did a successful SQL injection SHOWS there are issues. BTW, insulting people is not normally a useful technique.

Re:Send your claim in now by QuantumG · 2007-05-21 18:47 · Score: 4, Funny

IIS 6 hasn't had a public remotely exploitable bug in it. Ever. That's bullshit anyway, I've got dozens of remote exploits for IIS 6.

Oh, you said public.. hehe, forget I said anything.

--
How we know is more important than what we know.

Incident response by Anonymous Coward · 2007-05-21 19:11 · Score: 4, Interesting

I work incident response in a large web company (hence anonymous posting, natch) and currently we're treating this as "interesting, but case not proven". We test that our web apps filter all input, so I'm adding double-width Unicode to our security regression test cases; however I'm happy to let the FD posters lab it out between them in the short term. These alleged IIS exploits don't work for us - which is not to say that we don't have some system, somewhere, for which this is an issue. At the end of the day it's just a clear restatement of something that's obvious to anyone - you need to filter input carefully, and you need to be aware of issues around alternative encodings. But it's not a "BRB" (big-red-button, ie emergency stop and all hands to the pumps to fix a vulnerability) issue for us - yet. The last time we had one of those, it was the Microsoft DNS server remote root... because most of our internal domain controllers were also running DNS servers.
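
Generating the double-width variants for a regression suite can be as simple as the sketch below. The payload list and the to_fullwidth helper are hypothetical; the only fact relied on is that U+FF01 to U+FF5E mirror ASCII 0x21 to 0x7E at a fixed offset.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII 0x21..0x7E


def to_fullwidth(text: str) -> str:
    """Replace each printable ASCII character with its fullwidth form."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


# Hypothetical payloads that a regression suite might already contain.
payloads = ["<script>", "' OR '1'='1"]
test_cases = payloads + [to_fullwidth(p) for p in payloads]

for case in test_cases:
    print(ascii(case))
```

Spaces are left alone here; whether to also substitute U+3000 (the ideographic space) is a judgment call for whoever owns the test suite.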

"Not vulnerable" by iamacat · 2007-05-21 19:20 · Score: 2, Informative

According to the advisory, Apple products do not provide HTTP content filtering and are therefore not vulnerable. This will do nothing to help someone build a functioning protection system.

bypassing great firewall? by z-j-y · 2007-05-21 19:48 · Score: 2, Interesting

I'm wondering if the great firewalls (Cisco product?) are also vulnerable to this. At least it'll force them to do longer string matching.

Re:Smelly foreigners by Anonymous Coward · 2007-05-21 20:36 · Score: 2, Insightful

To think that even English fits in 7-bit ASCII is naïve.

Re:Smelly foreigners by Hognoxious · 2007-05-21 20:39 · Score: 2, Interesting

Would some of the things that led to computers - morse code, telegraphy etc have been feasible using, say, Chinese in its normal written form? Are computers biased towards English (and other languages using the same or similar alphabets) because they were largely invented by English speakers, or is the language fundamentally more amenable to small, simple encoding?

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Re:Not a surprise... by etnu · 2007-05-21 20:42 · Score: 5, Insightful

You'd prefer securing against vulnerabilities in dozens, if not hundreds of different encodings? The only people who are against Unicode are those that have never had to work with more than one written language in the same project. Yes, it's a lot easier to secure stuff when you only accept ASCII or ISO8859-1/Windows CP-1252, but then you're limiting your software to about a third of the world (if that). Crappy engineers are going to write crappy code no matter what the encoding. No sense compromising for the sake of poorly written software.

Re:Apple and HP? by Gadget_Guy · 2007-05-21 20:52 · Score: 2, Funny

Ooh, poor old BSD sounds really sick there. I hope that it doesn't die!

Re:Smelly foreigners by ettlz · 2007-05-21 20:57 · Score: 5, Funny

To think that English doesn't fit in 7-bit ASCII is na\"ive.

Re:Not a surprise... by KiloByte · 2007-05-21 21:23 · Score: 3, Insightful

Wrong, the flaw in Cisco's "security" software and IIS is due to them converting things to 8-bit charsets, not due to Unicode. In fact, the whole idea of "code pages" is fundamentally broken, as it assumes all data only ever moves to other places in the same region.

The idea of double-width characters is broken too, yeah, and they are there only to appease the users of some broken Chinese/Japanese software -- but there's nothing wrong with having strange characters in file names. They don't match any file they are not supposed to unless you try to shoehorn them into a limited character set.

So, it's a flaw in the software, not Unicode by itself.

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.

Depends on alphabet size by Viol8 · 2007-05-21 21:49 · Score: 2

If you want to represent a language on a computer (and not just numbers) then you need a way to enter and store all the characters that language uses. Obviously the fewer characters the better. The Latin alphabet with all its variations, the Cyrillic, Hebrew, Arabic and Korean all lend themselves to this quite easily since they all have a manageable number of letters. Languages such as Chinese and Japanese don't; they don't even use alphabets, they use characters for each object/concept, which as you can imagine is a bugger when you need a keyboard of a manageable size, not to mention the memory to store each character bitmap (not an issue now, but 40 years ago it would have been a nightmare).

This is only a guess on my part but I suspect computers would have developed completely differently if they'd been developed by a culture that used symbols rather than alphabets.

Re:Depends on alphabet size by TheRaven64 · 2007-05-21 22:24 · Score: 2, Informative

Chinese ideographs are so numerous and difficult to remember that they are considered one of the reasons for China's incredibly low literacy rate. If you want some evidence of this, then take a look at what happened to Korea when it dropped the Chinese ideograms in favour of a new, home-grown phonogram-based alphabet.

--
I am TheRaven on Soylent News
Re:Depends on alphabet size by rabtech · 2007-05-22 03:27 · Score: 4, Interesting

IIRC, China was on its way to moving to an alphabet system (certain characters can be used for their alphabetic sounds in various circumstances) and so was Japan (look at Katakana/Hiragana).

It is likely that the introduction of the printing press (and later mass media like TV/radio and computers) have "arrested" this natural evolution. It may also be possible that the development of a national identity and cohesive society tends to put the brakes on some developments as well - if a single unified language is mandated by culture or a central authority then local variations are much less important.

Romaji (and to a certain extent English itself) is definitely influencing the Japanese; the younger generations even more so. Japan may end up using an alphabet for day to day needs almost exclusively within the next 100 years. The situation in China is much less clear but it will probably happen eventually.

If we look into the past, nearly all societies with ideographic/logographic writing systems eventually moved to an alphabetic system. Hell, even Ancient Egyptian Hieroglyphs were partially syllabic much like Katakana. Much as previous posters have pointed out, changing to an alphabetic system from Chinese-characters has allowed Korea to dramatically raise literacy rates. There is only so much time for schooling and memorization, and only so much effort to expend on literacy. If a simpler writing system is more accessible then that is a net gain, even if there are a few things that logographic writing systems do better than alphabetic ones.

--
Natural != (nontoxic || beneficial)
Re:Depends on alphabet size by ShakaUVM · 2007-05-22 09:05 · Score: 2, Informative

You're missing the key roadblock to simply replacing characters with pinyin, or any other romanization: Chinese is a heavily overloaded language. While there are a few homophones in English, *every* word in Chinese is a homophone, with something like 13 different homophones per sound on some of them. We differentiate some homophones by writing them differently (layed, laid, etc.); pinyin *cannot* differentiate these homophones -- it's an exact transcription of the sound. Chinese differentiate their written words with characters. When having a conversation you can get by with spoken Chinese or pinyin, since you can always ask the other person which character they meant, if there's confusion, and Chinese will make do with pinyin in a pinch, but it's more or less impossible to ask them to switch to a Romanisation for all purposes.
Re:Depends on alphabet size by loyukfai · 2007-05-22 11:35 · Score: 3, Interesting

IIRC, China was on its way to moving to an alphabet system (certain characters can be used for their alphabetic sounds in various circumstances)...
I'm a Chinese but I have never heard of this. Would you be so kind to educate me on this...? Where did you hear such things?
I'm serious.

Re:Not a surprise... by kahei · 2007-05-21 22:19 · Score: 5, Insightful

Down below this post, there's a troll writing something like 'lol if u cant just use ASCII u shud let ur language die u foreign creeps lol k thx'.

And a whole bunch of people then jump on the troll and criticize him for his US-centrism, and so on, and the troll is at -1.

Yet the post I'm replying to, which is at +4, really comes to the same thing as this troll; it's simply UNIX 8-bit centric rather than USA ASCII centric.

The fact is, computers are used for text, and much if not most text is non-ASCII. How would you rather represent that text:

--With Unicode
--With KOI-8, KOI-8R, KOI-8RU, EBCDIC, EUC-KR, EUC-JP, shift-JIS, Shift-JIS-the-Jphone-version, ISCII, VISCII, ISO-2022-*, and the many many other encodings that have evolved in different times and environments.

Seriously, which is going to be easier to secure (and otherwise manage) -- one encoding (which is HEAVILY documented and discussed) or a large number of encodings (the actual number being ever-changing and impossible to really know) many of which are not well documented and have forgotten ramifications and assumptions?

Right -- so now you know why people use Unicode so much.

But the interesting question is, why is one error ("All teh world is teh USA lol! Shouldn't you learn to speak English?") rightly jumped on and pounded flat, whereas another form that's actually more problematic ("All teh world is C on UNIX lolz!! Shouldn't you stop wanting dangerous extra features?") isn't?

Actually, I see in another window that some people have indeed been pounding the parent poster flat, so perhaps my question isn't valid after all.

--
Whence? Hence. Whither? Thither.

Re:Hmmmm.... by Anonymous Coward · 2007-05-21 22:22 · Score: 2, Funny

4) You are an idiot
5) You are an asshole

Nothing to see, move along ... by udippel · 2007-05-21 22:24 · Score: 4, Insightful

It is a vulnerability, in the strict sense.
It is a self-inflicted misbehaviour as in common sense.
It is like those silly Cisco content inspectors on port 25, that try to avoid attacks on flimsy MTAs.
It is like someone dying from a jab against measles: the jab protected that person from contracting measles, actually.
It is like those stupid anti-virus programs that are more vulnerable than the daemons they profess to protect.

When the attacker uses a codepage different from the one that you think she ought to use, she can circumvent your content filter. Which ought not be an attack vector, in any case.

As I said: nothing to see, move along ...

Re:Smelly foreigners by TempeTerra · 2007-05-21 22:51 · Score: 3, Interesting

The notable difference between Chinese and English (or most other written languages) is that several English characters combine to form syllables, which combine to form words (i.e., we use an alphabet). In Chinese, each character corresponds directly with a word (each character is a logogram). If you're interested you can look up Alphabet on Wikipedia as a starting point, although I must admit I find the article hard to follow even though I know what it should be saying.

The practical result of this is that English is normally encoded as a long sequence of 0-25 values (a-z), whereas Chinese would be encoded as a shorter sequence of 0-~100,000 values (Wikipedia reports Chinese dictionaries with 85,000 characters). Naturally, there would be fewer Chinese characters required for a message as each character corresponds to an entire word.

I guess that since morse code is rather like binary and English letters can be encoded using 5 bits, Chinese morse codes would need to be... about 20 bits long? It's late at night, brain not work so good. It seems to me that morse codes using 20 dots/dashes would be extremely difficult to learn; but on the other hand it shouldn't be any more difficult than learning Chinese characters in the first place.
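
A back-of-envelope check of those bit counts, assuming a fixed-length binary code (which real Morse is not) and the 85,000-character dictionary figure quoted above:

```python
import math


def bits_needed(symbol_count: int) -> int:
    """Bits in a fixed-length binary code that can tell symbol_count symbols apart."""
    return math.ceil(math.log2(symbol_count))


letters = bits_needed(26)          # a-z
characters = bits_needed(85_000)   # dictionary size quoted above

print(letters, characters)
```

So "about 20" is in the right ballpark; under these assumptions the minimum works out slightly lower.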

I wouldn't be surprised if English morse codes were more robust against poor data, siny Englxsh is stvll reahible even if sew2eral cheracter; are wrong.

Disclaimer: I don't know anything about the subject, I'm talking out of my elbow for the sake of discussion.

--
.evom ton seod gis eht

Re:Hmmmm.... by peragrin · 2007-05-21 22:55 · Score: 4, Interesting

1) unicode is better than having a hundred other encodes to debug
2) there are nearly two billion Chinese and Indians who can't use your encoding.
3)I get just as much spam from US companies as I do foreign ones

--
i thought once I was found, but it was only a dream.

Re:Smelly foreigners by Hognoxious · 2007-05-21 23:40 · Score: 2, Funny

There are no accent marks in English.

è is sometimes used to indicate that the e in a past participle is pronounced, eg learnèd (rhymes with Bernard) as opposed to learned (rhymes with burned).

When loan words with accent marks come into English, the accent marks are dropped.

The umlaut in naïve is retained to indicate that it doesn't rhyme with glaive.

Loan words that have been in English long enough even tend to have their pronunciations and/or spellings Anglicized.

Yes, that's why I'm posting from an internet caffay.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Re:Not a surprise... by gnasher719 · 2007-05-21 23:46 · Score: 2, Informative

Unicode is of course not the problem at all.

The problem is using character sets that can represent huge amounts of different characters, and among them characters that have similar looking glyphs. That is at the same time a feature that people really really want.

So spam filters will have a problem. They filter out "Viagra" but they don't filter out sequences of letters that look the same. Well, tough. If you follow the rule not to follow any links in emails but type them in yourself, that gets you mostly around it.

The other "problem" is filtering to prevent SQL injection and all that crap. There I'd have to say two things: 1. It is just common sense if you accept Unicode to translate it into a canonical form first, either precomposed canonical or predecomposed canonical (by the way, predecomposed canonical UTF8 is what the MacOS X file system uses). Once that is done, nothing unexpected should slip through. 2. Why would you need to filter out anything at all? This is a completely brain-damaged approach in the first place, using user input to form commands that could potentially be dangerous and filtering out user input that would produce dangerous commands. Instead, there shouldn't be any commands that could be dangerous in the first place.
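
Point 1 can be tried out with Python's standard unicodedata module. One caveat worth seeing for yourself: the canonical forms (NFC/NFD) compose and decompose accents but leave fullwidth characters alone, while the compatibility form NFKC is what folds a fullwidth apostrophe to the ASCII one. A minimal sketch:

```python
import unicodedata

decomposed = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
fullwidth_apostrophe = "\uff07"

nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed

# Canonical normalization does not touch the fullwidth apostrophe...
nfc_fullwidth = unicodedata.normalize("NFC", fullwidth_apostrophe)
# ...but compatibility normalization folds it to the ASCII apostrophe.
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth_apostrophe)

print(nfc_matches, nfd_matches, ascii(nfc_fullwidth), ascii(nfkc_fullwidth))
```

Which form is appropriate depends on the application; the point is only that any folding has to happen before a filter or escaper looks at the text, not after.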

Re:Smelly foreigners by vtcodger · 2007-05-21 23:55 · Score: 2, Informative

***Would some of the things that led to computers - morse code, telegraphy etc have been feasible using, say, Chinese in its normal written form?***

The answer would seem to be -- sort of ... maybe. See http://www.njstar.com/tools/telecode/jim-reeds-ctc.htm.

Summary: For telegraphy, Chinese characters are assigned numeric codes in radical-stroke count order. That's the way that Japanese, and -- I assume -- Chinese, dictionaries are arranged.

It may seem inefficient to use 20 bits (sort of) to encode a character, but remember that each character is a word, not a letter, and that composite words like "Beijing" or "paleontology" are only two words. That means that most "words" will be either 2.5 or 5 eight "bit" characters. Conventional telegraphy is really a trinary rather than a binary code -- pause, short, and long, and the 'digits' differ in length -- so bit count isn't really all that accurate an analogy.

So, no, the Chinese language probably wouldn't have made the development of computers by the Chinese all that much more difficult than European languages did. And the classic Chinese numeric notation is not as convenient as 'arabic' notation. But it's much less unwieldy than say Roman numerals, so I don't think it would have been an insurmountable hurdle either.

--
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey

Re:Another likely example of OSS? by Frankie70 · 2007-05-22 02:51 · Score: 2, Informative

Apparently, Vista's networking stack has been rewritten from scratch -- which does make you wonder how much of the reason for that was technical, and how much was MS wanting to be seen to get rid of all the BSD/*nix code in Windows in preparation for their patent offensive...

Why should using BSD code come in the way of their patent offensive?
Using BSD code isn't infringing on BSD's or someone else's patent.

Re:IIS's fault by spitzak · 2007-05-22 08:43 · Score: 2, Informative

They are there for compatibility with some Japanese and Chinese character sets, which contained most of the ascii characters in both "half" and "full width" forms. The full-width ones were twice as wide to match the square characters, which was useful for lining up columns.

This is all pointless now with proportionally-spaced fonts (and multiple fonts, you could easily select the "wide" font to print those characters instead). However Unicode had as a design requirement that translating from any common encoding to unicode and back again would be the identity transform. Thus if any character set existed with two ways of representing the same character, then there had to be two ways to represent it in Unicode. Therefore the full-width characters. This is also why Unicode has hundreds of random accented characters even though the combining characters would allow all of them to be represented easily with only a few dozen characters.
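
The round-trip requirement is easy to see with Shift_JIS, one of the legacy encodings that carries both forms. A small sketch using Python's built-in codec:

```python
text = "A\uff21"  # ASCII 'A' followed by FULLWIDTH LATIN CAPITAL LETTER A

encoded = text.encode("shift_jis")
decoded = encoded.decode("shift_jis")

print(len(encoded))             # one byte for 'A' plus two for the fullwidth form
print(decoded == text)          # the round trip is the identity
print(decoded[0] == decoded[1]) # the two letters remain distinct
```

If Unicode had merged the two letters, the second one could not come back out of the round trip as a two-byte character.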

Re:Smelly foreigners by jc42 · 2007-05-22 10:19 · Score: 2, Informative

The notable difference between Chinese and English (or most other written languages) is that several English characters combine to form syllables, which combine to form words (i.e., we use an alphabet). In Chinese, each character corresponds directly with a word (each character is a logogram).

Actually, this is pretty much a myth that originated from people with very little knowledge of Chinese language and writing. In all the Chinese languages ("dialects";-), most of the vocabulary is two-syllable words, as in English. Three-syllable words aren't uncommon. The writing system is actually a sort of syllabary, and the meaning of most two-character words can't be inferred from knowing what the syllables mean as standalone words.

It's similar to how lots of English words, e.g. "insight", can be parsed as two words ("in"+"sight"), but this doesn't really help you understand what the word actually means. Or, an example that shows how such things evolve is the English word "upstairs". If I say I'm going upstairs and take the elevator, did I lie to you? Of course not, because "upstairs" doesn't mean going up stairs. It did a few centuries ago, but hasn't meant that during the lifetime of anyone alive now. Similarly, proto-Chinese of N thousand years ago may have been mostly single-syllable words, but this hasn't been true for at least the few thousand years that we have readable examples of the writing system.

For a Mandarin example, which I'll write in pinyin (or pin1yin1;-) to get past the /. filters, consider the word zi4ran2. The zi4 syllable is a word, and means "from" or "since" (and is also used like "-ly" to form adverbs). The ran2 syllable is also a word, and basically means "correct" or "yes". The zi4ran2 combination means "nature" or "naturally". Like "insight", you might be able to kludge some sort of connection here, but in reality you just have to learn zi4ran2 as a separate word unrelated to its two syllables. It may have been a two-word idiom several thousand years ago; it's a two-syllable word now.

For an entertaining debunking of both this myth and a very common trope among Western pseudo-intellectuals and pop psychologists, read this article at languagelog. After chuckling at that particular bit of silliness about Chinese writing, you can find other articles there that go into the general problem in more detail. A number of experts in East-Asian linguistics regularly contribute to that blog, and they've been pushing for a campaign to debunk the nonsense that Westerners insist on saying about these languages.

Oh, well; I haven't yet heard any claim that Chinese doesn't have a word for "freedom". But I wouldn't be surprised. (Hint: the word starts with the same character as the above "zi4ran2", but has a different second character. ;-)

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Re:IIS's fault by phasm42 · 2007-05-22 11:31 · Score: 2, Insightful

here are 2 ways of producing the < glyph: you can use character code x8B or xFF1C.

Shouldn't that be x3C?

I'm not sure if that's right or wrong, if there is a right and wrong way to handle this issue (I suppose that means it's excellent grounds for a religious war)--it's just important that it be handled consistently.

I thought about this a little more, and I think the difference will be in what it is used for. In HTML, the "<" glyph has a special meaning, so it makes sense that a different version (in this case, full-width) of the character should have a different meaning. From an application perspective, perhaps they should be the same. IIS translates the full-width version to the regular version, probably reasoning that if a full-width angle bracket was submitted to the webserver, such as in "<something@somewhere.com>", a regular one was intended. However, this isn't a safe assumption, which leads me to another question -- anyone know if this is optional behavior in IIS, and if so, is it defaulted to on or off?
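
For reference, the code points under discussion can be listed with Python's unicodedata module. Reading the quoted "x8B" as the Windows-1252 byte 0x8B is an assumption on my part; the quoted text doesn't say which character set it means.

```python
import unicodedata

candidates = [
    "\u003c",                     # the ordinary ASCII character
    "\uff1c",                     # the fullwidth form
    b"\x8b".decode("cp1252"),     # byte 0x8B, assuming Windows-1252
]

names = [unicodedata.name(ch) for ch in candidates]

for ch, name in zip(candidates, names):
    print(f"U+{ord(ch):04X}", name)
```

Only the first of the three is the character that HTML treats as markup; the other two merely look similar until something translates them.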

--
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner

Re:half-wit encoding? by HeroreV · 2007-05-22 12:13 · Score: 2, Insightful

UTF-16 (fixed-width double byte encoding)

UTF-16 is a variable-width encoding. Code points from plane 0 are encoded in 16 bits and code points from planes 1 through 16 are encoded as two 16 bit surrogates. Many developers, like you, aren't aware of this, so it's very common for software to choke on UTF-16 with surrogate pairs.
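
The surrogate arithmetic is short enough to show. This is a sketch with a hand-rolled helper (utf16_units is my own name), checked against Python's built-in codec:

```python
def utf16_units(ch: str) -> list:
    """Return the UTF-16 code units for a single code point."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                    # plane 0: one 16-bit unit
    cp -= 0x10000
    return [
        0xD800 + (cp >> 10),           # high surrogate: top 10 bits
        0xDC00 + (cp & 0x3FF),         # low surrogate: bottom 10 bits
    ]


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a plane 1 code point

print([hex(u) for u in utf16_units("A")])
print([hex(u) for u in utf16_units(clef)])
print(clef.encode("utf-16-be").hex())
```

Code that assumes one 16-bit unit per character will count the clef as two characters and can split it in the middle.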

I don't understand how mistaking one character for another is going to break anything

scenario:
1) You escape a Unicode string that contains fullwidth characters. The fullwidth characters have no special properties, so they aren't escaped.
2) You translate the escaped Unicode string into ASCII. Fullwidth characters are translated into halfwidth characters. Some of those halfwidth characters, like quotes, have special properties.

The fullwidth quote was perfectly safe, because it wasn't treated like a quote. It was treated the same as an "A" or "b". But when it was translated to a "normal" quote, it went from being a plain old character to being a quote character, with a completely different meaning.

The lesson here is that you should never translate fullwidth characters into halfwidth characters unless you know whether they should be escaped or not, and you should escape them during translation if they need to be. Also, it's not a good idea to translate an escaped string between character sets.
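
The scenario can be reproduced in a few lines. This is a sketch only: escape_quotes is a toy escaper invented for the example, and NFKC normalization stands in for whatever fullwidth-to-halfwidth translation the server actually performs.

```python
import unicodedata


def escape_quotes(text: str) -> str:
    """Toy escaper (illustrative only): doubles ASCII single quotes."""
    return text.replace("'", "''")


def fold(text: str) -> str:
    """Stand-in for the translation step: NFKC folds fullwidth forms to ASCII."""
    return unicodedata.normalize("NFKC", text)


user_input = "x\uff07 OR 1=1"  # contains FULLWIDTH APOSTROPHE, U+FF07

escaped_then_folded = fold(escape_quotes(user_input))   # steps 1 and 2 above
folded_then_escaped = escape_quotes(fold(user_input))   # the safer order

print(escaped_then_folded)
print(folded_then_escaped)
```

In the first result the apostrophe arrives unescaped; in the second it has been doubled, because the escaper saw the character in the form it would finally take.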

Re:Smelly foreigners by jc42 · 2007-05-23 03:23 · Score: 2, Informative

[T]he classic Chinese numeric notation is not as convenient as 'arabic' notation. But it's much less unwieldy than say Roman numerals, so I don't think it would have been an insurmountable hurdle either.

Actually, classical Chinese numbers are only slightly worse than Arabic notation (which apparently developed in India but was spread by Arab traders who knew a good accounting system when they saw it). The Chinese notation was far better than any of the Western number notations that the Arabic notation supplanted, such as the Greek or Hebrew notation. Roman was probably the worst notation ever invented, and nobody ever really used it for accounting.

The basis of the Chinese system was symbols for 1 to 9, and symbols for powers of 10. To illustrate with ascii characters, the symbol for 10 looks like a large '+' sign, so we can use + for 10, H for hundred, T for thousand. We'd write the number 5347 as 5T3H4+7. Unused powers of 10 are omitted, so 2007 would be 2T7. 1024 would be T2+4 or 1T2+4. And so on. There are symbols for a few more powers of 10, and they can be chained to get higher powers of 10, so HT could be used for 100,000.
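
The rule described above is mechanical enough to write down. A small Python sketch using the same ASCII stand-ins (T, H, +); the function name and the 1 to 9999 range limit are my own choices, and it always writes the leading 1 (so 1T2+4 rather than T2+4):

```python
def to_power_notation(n: int) -> str:
    """Write n (1..9999) as digit/power-of-ten pairs, omitting unused powers."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only handles 1..9999")
    parts = []
    for value, symbol in ((1000, "T"), (100, "H"), (10, "+")):
        digit, n = divmod(n, value)
        if digit:                      # unused powers of ten are omitted
            parts.append(f"{digit}{symbol}")
    if n:                              # remaining units digit, if any
        parts.append(str(n))
    return "".join(parts)


for number in (5347, 2007, 1024):
    print(number, to_power_notation(number))
```

Note that no zero is ever needed: a missing power of ten is simply not written.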

Nowadays, most numeric work in East Asia is done using the Western version of Arabic notation. But you also see a hybrid form that uses the Chinese 1-9 characters plus the Western 0. Converting between this notation and the traditional Chinese notation is essentially trivial and can be done as fast as you can write the numbers. But for arithmetic on paper, the Arabic form (or Arabic with Chinese digits) is a bit simpler than the traditional Chinese notation, since using 0 as a place holder results in correct alignment in columns of numbers, and the digits 1-9 are a bit faster to write than the Chinese digits.

An interesting aspect of the Chinese system is that the basic symbols have alternate "fancy" forms with a lot more strokes. These characters have the property that you can't add strokes to convert them to a different character. So they're an anti-tampering, fraud-proof way of writing numbers. I don't know of another numeric notation with this feature. Asian financial documents have historically used these fancy forms of numbers.

Actually, the Chinese and Arabic notations are the 3rd and 2nd easiest numeric notation that various societies have invented. A few years ago, Scientific American had an interesting article explaining the Mayan number system, and included an explanation of why it was a lot easier to use than the Arabic system. For example, instead of the big multiplication table that we memorized in school, the Mayan system really only needs one rule: 5x5=15. (This makes sense if you understand that they used base 20.) The rest of the rules for adding, subtracting and multiplying consist of the techniques for "carrying" and "borrowing", and are essentially similar to what you do with an abacus.
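
The 5x5=15 rule checks out once the product is written in base 20. A tiny sketch, assuming a plain positional base-20 system (to_base20 is just an illustrative helper):

```python
def to_base20(n: int) -> list:
    """Return the base-20 digits of a non-negative integer, most significant first."""
    digits = []
    while True:
        n, digit = divmod(n, 20)
        digits.append(digit)
        if n == 0:
            break
    return digits[::-1]


product_digits = to_base20(5 * 5)
print(product_digits)   # one twenty and five units, written "15"
```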

But I suppose we're stuck with the Arabic system. It's good enough, really, for the remaining uses where we don't bother with a computer.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
