Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
Re: Lol
So you are saying "fix the library". I am saying "sanitize input for library".
Both work, but I would argue that sanitizing for the library usually causes far fewer problems.
"Programming for international environments is hard, let's go shopping!"
I would argue that you have perhaps not considered all the possible problems and have thus perhaps miscounted the problems with "work around a broken library by transforming perfectly legitimate Unicode character sequences into sequences that might not represent what the person sending the message intended", that being the correct description of the second approach to this problem in the list above.
Yeah, correctly truncating a message that could be an arbitrary sequence of text in multiple languages with combining character sequences and bidirectional text isn't easy, but, well, if you want to be thought of as a company that makes stuff that "just works", you'd better figure out how to make that complicated process "just work".
Maybe iOS 8.3.1 needs to have a quick fix of some sort, but iOS 10, if not iOS 9, should fix the truncation code.
-
Re: Lol
In this case, the illegal UTF-8 sequence is the string after you have blown part of its funny foreign squiggle.
Where has it been proven that the bug is the trashing of a UTF-8 sequence?
First of all, Apple tends to use UTF-16 in the higher-level frameworks, e.g. that's how CFString/NSString work internally.
Second of all, processing entire characters rather than bytes is something I suspect Apple got right fairly early in the process. I suspect the problem is either that 1) when truncating the message for display, they're not processing entire graphemes, they're processing entire characters or 2) they're not taking bidirectionality into account or 3) they're not handling a combination of both issues.
He's saying that thing you call with your newly minted mangled string shouldn't fail.
Which is one way to solve it.
There are multiple things here that should be fixed. That's one of them - the renderer shouldn't crash if handed a bad string, it should fail more softly, e.g. put in a REPLACEMENT CHARACTER for all bad sequences and, if possible, log the error in a way that indicates that routine XXX has handed a bad character sequence to it.
I would argue, if the thing you call mangles strings, sanitize its inputs so it doesn't get a string with a bad character (a Unicode character of whatever format it uses internally, post-mangle).
And I would argue (all the way to the heat death of the universe) that, if you know that the thing you call mangles strings, and if it's produced by somebody else working on the same OS, you get it fixed so that it doesn't do that; you don't mangle user input (which includes text messages from other users) in released software, unless you don't have time to fix the underlying problem for the release.
-
Re:Lol
No you don't. You are demonstrating the typical moronic attempts to deal with UTF-8.
Here is how you do it:
Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there.
As long as you're not splitting a sequence of multiple characters (multiple characters, some of which might be encoded in multiple bytes with UTF-8) some of which are combining characters. Don't split a character from a combining character following it. Splitting a sequence like that can introduce more rendering errors into the string than are already there.
(I suspect that's what the problem is in this bug, given that there are several combining characters in the string as shown in various places.)
(And you don't want to split it after N characters, if the goal is to limit the display length of the string you're displaying, as not all characters are the same width - and, of course, a base character followed by several combining characters might just have the width of the base character.)
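A minimal Python sketch of the two rules above, assuming the input is already valid UTF-8. The function names `truncate_utf8` and `truncate_keeping_marks` are made up for illustration, and the mark check is only a rough stand-in for real grapheme-cluster segmentation (Unicode Standard Annex #29), which covers more cases than this.

```python
import unicodedata


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 bytes at or before max_bytes without splitting a code point."""
    if max_bytes >= len(data):
        return data
    cut = max(max_bytes, 0)
    backed_up = 0
    # A continuation byte looks like 10xxxxxx. Back up at most 3 times,
    # since a UTF-8 sequence has at most 3 continuation bytes.
    while cut > 0 and backed_up < 3 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
        backed_up += 1
    return data[:cut]


def truncate_keeping_marks(text: str, max_code_points: int) -> str:
    """Cut a string so the cut never lands directly before a combining mark."""
    if max_code_points >= len(text):
        return text
    cut = max(max_code_points, 0)
    # General categories Mn, Mc and Me are the combining marks. If the first
    # dropped code point is a mark, move the cut back to before its base.
    while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
        cut -= 1
    return text[:cut]
```

Neither function says anything about display width, so limiting what fits on screen still needs measurement by the rendering layer.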
-
Re:What is the string?
In hex, the string is:
506f 7765 7220 d984 d98f d984 d98f d8b5 d991 d8a8 d98f d984 d98f d984 d8b5 d991 d8a8 d98f d8b1 d8b1 d98b 20e0 a5a3 20e0 a5a3 6820 e0a5 a320 e0a5 a320 e586 97
That's the string encoded as UTF-8, so it's more like
50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97
If we turn that into a sequence of (21-bit) Unicode code points, it becomes
000050 00006f 000077 000065 000072 000020 000644 00064f 000644 00064f 000635 000651 000628 00064f 000644 00064f 000644 000635 000651 000628 00064f 000631 000631 00064b 000020 000963 000020 000963 000068 000020 000963 000020 000963 000020 005197
which, encoded as UTF-16, is
0050 006f 0077 0065 0072 0020 0644 064f 0644 064f 0635 0651 0628 064f 0644 064f 0644 0635 0651 0628 064f 0631 0631 064b 0020 0963 0020 0963 0068 0020 0963 0020 0963 0020 5197
As UTF-16, there are no surrogate pairs, so the bug presumably isn't a problem with handling UTF-16-encoded Unicode characters bigger than 00FFFF.
I suspect that the string is probably being processed as UTF-16, because that's how CFString/NSString are encoded internally and because code handling UTF-8 that can't handle multi-byte characters couldn't handle anything other than ASCII.
U+0963 is DEVANAGARI VOWEL SIGN VOCALIC LL, which is a nonspacing mark; my guess is that it (or perhaps some other character in that sequence that's a combining character) is getting split, by the ellipsis, from the character with which it's supposed to combine, and that the rendering code is blowing up because of that.
If so, this has nothing to do with UTF-16 being too hard to handle correctly, or with the code not being able to handle characters that are "too many bytes", it has to do with sequences of characters sometimes having to be handled specially, and not just blithely split between characters.
It starts with "Power ", but I guess that's not important.
It might make the string long enough that the code displaying it on the main screen would abbreviate it and thus insert an ellipsis.
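The decoding steps in this comment can be reproduced with a short Python sketch. It takes the hex dump exactly as quoted above and uses only the standard library; the variable names are illustrative. Decoding the full dump gives 35 code points, all in the Basic Multilingual Plane.

```python
import unicodedata

# The hex dump exactly as quoted in the comment (UTF-8 bytes).
HEX_DUMP = """
506f 7765 7220 d984 d98f d984 d98f d8b5 d991 d8a8 d98f d984 d98f d984 d8b5
d991 d8a8 d98f d8b1 d8b1 d98b 20e0 a5a3 20e0 a5a3 6820 e0a5 a320 e0a5 a320
e586 97
"""

raw = bytes.fromhex("".join(HEX_DUMP.split()))
text = raw.decode("utf-8")

code_points = [ord(ch) for ch in text]

# If every code point fits in one UTF-16 unit, there are no surrogate pairs.
utf16_units = len(text.encode("utf-16-le")) // 2
has_surrogates = utf16_units != len(code_points)

# Combining marks (general categories Mn, Mc, Me) present in the string.
mark_code_points = sorted(
    {ord(ch) for ch in text if unicodedata.category(ch).startswith("M")}
)

if __name__ == "__main__":
    print(" ".join(f"{cp:04x}" for cp in code_points))
    print("surrogate pairs needed:", has_surrogates)
    for cp in mark_code_points:
        print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

The marks it reports are the Arabic vowel signs and U+0963, which fits the guess that a combining sequence is being split.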
-
Re:Type "bush hid the facts" into Notepad.
hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?
The byte order mark is part of the Unicode standard, and is used all over the place besides Windows. Your question answers itself.
-
Scrapping DST worldwide for 24 time zones
Never mind just America, let's work to scrap DST worldwide. DST (or daylight saving time) is a great source of confusion. It complicates administration, as well as making life tough for programmers and everyday people who need to make sure their clocks are reset twice a year.
However, if we scrapped DST (along with 15 or 30 minute offsets), we would only have 24 time zones - one for each hour! This is a reduction from the hundreds we currently have in use around the world. Each location would simply be assigned to an offset from UTC (0-24).
For many reasons, it'd be nice if everyone used UTC as their only time, but in the meantime, twenty-four consistent, simple and clear zones should be enough for everybody.
-
Re:utf-32/ucs-4
It's obvious you have little real experience with Unicode, because saying 'just convert to UTF-32' just papers over the problems without solving them. UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in UTF-32. So you still have all the length and normalization problems you have with UTF-8 (and even with ASCII, though people often ignore it there -- are 'a' and 'A' the same character? How do they sort?)
The real 'length' problem is that people insist on using the term ambiguously -- you have string storage space and string rendering size, and the two are completely independent.
Actually, there's three!
1. Byte count (storage space)
2. Codepoint count - the number of Unicode codepoints present in the string, regardless of whether or not they are rendered.
3. Grapheme count - the number of user-perceived characters (roughly, rendered glyphs).
But before you start counting those...you are normalizing your user input to your internal normalization form, right? Wait...you haven't decided what that normalization form is yet?
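A small Python sketch of the three counts, with the normalization form as a parameter. The helper name `lengths` is hypothetical, and the third number is only an approximation: it counts every code point that is not a combining mark, whereas real grapheme-cluster counting follows Unicode Standard Annex #29.

```python
import unicodedata


def lengths(text: str, form: str = "NFC") -> tuple:
    """Return (UTF-8 byte count, code point count, approximate grapheme count)."""
    normalized = unicodedata.normalize(form, text)
    byte_count = len(normalized.encode("utf-8"))
    code_point_count = len(normalized)
    # Approximation: treat each non-mark code point as starting a new grapheme.
    approx_graphemes = sum(
        1 for ch in normalized if not unicodedata.category(ch).startswith("M")
    )
    return byte_count, code_point_count, approx_graphemes


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    print(lengths(decomposed, "NFC"))
    print(lengths(decomposed, "NFD"))
```

The same visible character gives different byte and code point counts depending on the normalization form, which is why the form has to be chosen before counting.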
-
Re:Same reason blu-ray didn't take off
On the 35" the text is too small to read comfortably for any length of time
Text size has no relation to the display size. Text size is generally specified in "points", where one point is approximately 1/72 inch. If you find the text too small to read, the obvious solution is to increase the size. Display size affects how much text you can display given a certain text size. E.g., you might get 40 lines of 10 point text on a 24" monitor, and 45 lines of 10 point text on a 32" monitor.
I don't see how reading on a 27" is going to work unless you increase your font size which reduces the benefits of the higher resolution.
Why wouldn't reading on a 27" work? A long time ago, I had a 15" CRT and was able to read text on it without any problems. And even further back, there were 9" screens, and even smaller ones. You just couldn't get as much text on them (e.g., 40 columns across).
The benefit of higher resolution is that text is sharper, since you can use more pixels to draw the characters while keeping the same point size. E.g., instead of using 8x12 pixels to draw a character, you can use 16x24, which looks a lot better. It's even more noticeable if you work with Chinese/Japanese/Korean text, where the characters are much more detailed than the Roman alphabet. Some characters (such as this one) turn into an indistinct mess if you have to squeeze it into a 12x12 pixel cell, but if you have 24x24 to work with, it looks a lot better.
In any case, this Dell monitor sounds interesting... I was considering their previous 4K 24" monitor, but the way it faked being two half-screens (to work around HDMI limitations?) seemed annoying and glitch-prone, and I heard that the next generation of monitors wouldn't have to do that. I currently have a 24" monitor, and am looking for something the same size, but I suppose 27" isn't too much bigger.
-
Re:Next wave of phishing?
I think that's the way to go - only allow characters from a single Unicode script in the username and in the domain name. The domain name part is currently handled by registrars, so that may not need any additional rules.
However, this really should be part of the RFC, or else anyone banning mixed names would be "non compliant". If the RFC does not specify this, then the best that Gmail (or any other system) could do would be to prevent people from registering mixed names themselves and to give a warning (and maybe colour the characters) if email is received from an address with mixed scripts.
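As a rough illustration of the single-script rule, here is a Python heuristic. Python's standard `unicodedata` module does not expose the Unicode Script property, so this sketch guesses the script from the first word of each letter's character name. That is an assumption made for the example, not how a production check should work. `scripts_used` and `is_mixed_script` are hypothetical names.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Guess the scripts in text from the first word of each letter's name."""
    found = set()
    for ch in text:
        if ch.isalpha():  # skip digits, dots, hyphens and other common characters
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])  # e.g. LATIN, CYRILLIC, GREEK
    return found


def is_mixed_script(text: str) -> bool:
    """True when letters from more than one guessed script appear."""
    return len(scripts_used(text)) > 1


if __name__ == "__main__":
    print(is_mixed_script("paypal"))       # all Latin letters
    print(is_mixed_script("p\u0430ypal"))  # second letter is CYRILLIC SMALL LETTER A
```

A real implementation would use the Script and Script_Extensions data published by Unicode instead of character names.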
-
Re:The bashing is sometimes justified...
I can also show a swastika on my U.S.-hosted site and criticize public officials without fear of ridiculously heavy-handed libel/defamation laws. And don't even get me started with the bullshit cultural and language laws in France. It's amazing anything gets done in that country at all.
Oh, I dunno; I've seen any number of sites similar to this one, whose information is mirrored at zillions of locations on the web, including many outside the US. There are historical and cultural reasons for including the symbols at code points 534D and 5350 in Unicode, and I doubt that anyone has ever been prosecuted for installing full Unicode charsets or lookup software on their web sites.
I haven't looked for such pages on French sites, but I'd be surprised if they don't exist (with the text in French rather than English), and I'd also be surprised if the French government has tried to suppress such character codes in the Unicode lookups.
It's possible that such things have happened and I just haven't read about them. Does anyone know of cases of official harassment for including pages like the above on a web site? For example, has any Islamic or other religious government ever harassed people for allowing the U+271D char code on a web page?
(And yes, I do have a couple of experimental dictionaries on my own web sites, including one dealing with Chinese characters which includes an entry for the swastika characters. Nobody has even suggested that these glyphs shouldn't be there. Possibly it's because nobody has ever looked at my dictionaries, but still ... ;-)
-
Re:Middle finger
I believe you are referring to U+1F595.
-
Re:It's kind of long and meandering
Too contrived. The only book one needs is the UTF specification.
-
Re:Wow, an amazing co-incidence
How is it a "huge problem"? ASCII has a number of control characters too. A whitelist is a great idea, but why is the whitelist so restrictive? Just grab a copy of the current Unicode Data file and whitelist all current non-control characters. And if you're concerned that Zalgo might come, I suppose you could omit any non-spacing chars from the whitelist without people complaining too much (though perhaps it'd be good to include the ones that are actual letters in various Indic scripts).
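One way to sketch that whitelist in Python is to use the general categories from the Unicode data that ships with the interpreter (the `unicodedata` module) instead of parsing the data file by hand. The cap on consecutive non-spacing marks is a variation on the suggestion above, not the same rule: it keeps a few marks per base character so Indic text survives, and drops the rest. `sanitize` and `max_marks` are hypothetical names.

```python
import unicodedata


def is_allowed(ch: str) -> bool:
    """Allow everything except the 'Other' categories (Cc, Cf, Cs, Co, Cn)."""
    return not unicodedata.category(ch).startswith("C")


def sanitize(text: str, max_marks: int = 2) -> str:
    """Drop disallowed characters and long runs of non-spacing marks."""
    out = []
    marks_in_a_row = 0
    for ch in text:
        if not is_allowed(ch):
            continue
        if unicodedata.category(ch) == "Mn":
            marks_in_a_row += 1
            if marks_in_a_row > max_marks:
                continue  # Zalgo-style pile-up: skip the excess marks
        else:
            marks_in_a_row = 0
        out.append(ch)
    return "".join(out)
```

Note that this also strips newlines and tabs, since they are control characters. A real filter would probably let a few of those through explicitly.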
-
Re:Emoticons are already free and open source.
Why bother with emoji, though? Just use Chinese ideographs. They're the natural final progression of this idea, after all. Moreover, if you're just after basic emoticons, there's a Unicode range from 1F600 to 1F64F.
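For what it's worth, checking whether a character falls in that range is a one-liner. A minimal Python sketch; `is_emoticon` is a hypothetical helper name, and the bounds are the ones quoted above.

```python
import unicodedata

# Range quoted in the comment: the Emoticons block.
EMOTICONS_FIRST = 0x1F600
EMOTICONS_LAST = 0x1F64F


def is_emoticon(ch: str) -> bool:
    """True when the single character ch lies in U+1F600..U+1F64F."""
    return EMOTICONS_FIRST <= ord(ch) <= EMOTICONS_LAST


if __name__ == "__main__":
    for cp in (0x1F600, 0x1F638):
        print(f"U+{cp:04X}", unicodedata.name(chr(cp), "<no name in this Python>"))
```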
-
Re:Mahjongg?
-
Re:Favourite unicode character
Actually, it was ISO/IEC 10646 that started out as a Han Unification project. Unicode actually began as a universal character encoding standard. Between version 1.0 and 1.01, Unicode merged with 10646, and they became one big squabbling family, where everyone got to act like they were Unicode, but got named after 10646. The Tibetans got lost when they moved into the new house, and somehow the Koreans ended up being triplets, but they eventually found their way back home. Eventually the Cherokees brought some native flair, while the Mormons made everyone stop drinking, at least for a while. Eventually the Chinese decided they needed a place for all their ancestors' ashes, and the Japanese kids spread lolcats all over the place. Of course, now we've got French stenographers and old Hungarians knocking at the door, trying to get in, not to mention a bunch of African tribesmen and some more Minoans trying to force the gate.
PS, the "pictographs" are encoded because of a need to catalogue all of those emoji-laden text messages they send in Japan.
-
Re:Checking for the release of a new version
🙋 If I were writing such a parser, I don't know how I'd get it to automatically check for the release of a new version of the standard and determine which code points are new bidi characters to be popped.
Bidi ranges are already set by the Unicode roadmaps. It's just a range check.
-
Re:emoticons?
Well then, we need to include:
- heck, instead of just the suit symbols why not 52 glyphs for a standard deck of cards
- throw the Major Arcana tarot cards in there too
-
Re:So, The Crutch for Bad Writers...
>
...has just been given more padding and legitimacy?
What do you mean? Emoticons have plenty of legitimacy already [U+1F638 GRINNING CAT FACE WITH SMILING EYES]
For anybody else who (like me) thought that this was a joke:
Unicode emoticon section (PDF for people who don't have the font support).
-
Re:Project Gutenberg
This is untrue.
First off, Simplified and Traditional characters are separated in Unicode.
Second off, Cyrillic characters and Latin characters have always been considered two different scripts, while Chinese logographs are considered to be the same script, used in different contexts.
See http://unicode.org/notes/tn26/.
In any event, it would make good sense for programming environments to be able to handle Unicode source.
-
Re:Oilean Ruadh
jvonk is right; this should be done in Unicode:
- Á: Á
- á: á
- É: É
- é: é
- Í: Í
- í: í
- Ó: Ó
- ó: ó
- Ú: Ú
- ú: ú
From the Latin-1 Supplement Character Code Chart.
-
Re:"just google it"
That's because you have to enter the name in Akkadian (ISO 15924 code "Xsux") rather than pseudo-phonetic English. Here is a crib sheet: http://www.unicode.org/charts/PDF/U12000.pdf.
-
U+0161 already exists
http://www.unicode.org/charts/PDF/U0100.pdf
"Small Latin Letter S with Caron"
"Czech, Estonian, Finnish, Slovak, and many other languages."-molo
-
Re:Its nice to see
http://www.unicode.org/charts/PDF/U0900.pdf 0930 would do for the time being!
-
Re:India is the 5th country...
India is the 5th country...to get a symbol for its currency.
Ummm... The Unicode Code Charts show many more than 5 countries' currency symbols. And the currency code section has room for 23 more currency symbols.
-
Re:Some quasi-scientific experiments
I'm curious about that an2 that I found in the dictionary. It won't show up here, due to /.'s anti-UTF-8 policy, but you can see it at U+557D. Like the wiktionary.org and mandarintools.com entries, this includes a mention of Cantonese in the definition, probably meaning that it's one of those chars that really isn't used in Mandarin (although a Mandarin pronunciation is given). This interpretation is encouraged by the 'kou' radical in the char, which is common in the collection of Cantonese-only chars. OTOH, a Kangxi index is listed for it, so it's an old character. Maybe it has died out in Mandarin.
There are a lot of obscure nooks and corners in Chinese writing. If they had any sense, they'd've gotten rid of it centuries ago. But I suppose that'll happen about the same time that English adopts a phonetic spelling system. ;-)
-
Use Word 2007's Equation support
This is not the old Equation Editor 3.0 from Word 2003, which is a crippled version of MathType, but rather a brand new equation facility in Word 2007, which is also the basis for the new equation support in the OneNote 2010 beta another poster has referred to.
The Word 2007 equation editor supports a "linear format" for completely keyboard-based input, which is based on TeX-like commands like "\sum" and "\int" and is documented in this Unicode technical note: Unicode Nearly Plain-Text Encoding of Mathematics
I've been using this for my math classes since last semester, with great success. Once you master the linear format, it's not difficult to keep up if you have a reasonable typing rate to begin with.
-
Re:"three strikes"
The Unicode standard includes a database (trivial to parse in Perl) that allows one to filter "safe" characters with the greatest of ease. There really is no excuse here, other than "we can't maintain the code".
-
Re:TFA says "18 microseconds", not "18 seconds"
It used to support Unicode, but apparently, due to people using control characters (RTL overrides and such) to do clever things, anything non-ASCII is now filtered or mangled. Too bad they went overboard--seems like the easy way to fix this would be to only filter out control characters. Unicode publishes a handy database that you can use to find out which characters are control characters.
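A narrower filter along those lines can be sketched in Python with the bidirectional classes from the Unicode database bundled in the `unicodedata` module. This sketch removes only the explicit embedding, override and isolate controls and leaves ordinary right-to-left letters alone. `strip_bidi_controls` is a hypothetical name, and this is one possible policy, not what the site actually does.

```python
import unicodedata

# Bidirectional classes of the explicit formatting controls
# (embeddings, overrides, isolates and their terminators).
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}


def strip_bidi_controls(text: str) -> str:
    """Remove explicit bidi controls, keeping all other characters."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


if __name__ == "__main__":
    spoofed = "Score\u202e5:etad\u202c"  # RIGHT-TO-LEFT OVERRIDE ... POP
    print(repr(strip_bidi_controls(spoofed)))
```

The implicit marks U+200E and U+200F have classes L and R, so this filter keeps them. Whether that is acceptable depends on the site.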
-
Re:So?
Python uses UTF-16 internally on Windows and UTF-32 on Unix (I think "UCS-4" implies that values greater than 0x10ffff are allowed, but Python's converters to UTF-16 and UTF-8 do not handle this, so it is better to say it supports UTF-32).
Python uses UCS-2 or UCS-4 internally, but sys.maxunicode maxes out at 1114111 (0x10FFFF) on UCS-4 builds because there aren't any defined characters that high. Read Include/unicodeobject.h in Python's source code to see for yourself.
I believe that if they had used UTF-8 from the start none of this crap would be happening.
Storing unicode strings internally as UTF-8 is madness. Imagine the trivial case of s[3], which bytes would we return? We'd have to walk the string to find the start of that fourth character, then walk it some more to get the whole thing. It'd transform a whole pile of O(1) operations into O(n) ones, and performance would suffer greatly for it.
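To make the cost concrete, here is a sketch of what indexing looks like when the only storage is UTF-8 bytes: finding the n-th code point means scanning from the start. `code_point_at` is a hypothetical helper and assumes valid UTF-8. The narrow/wide build distinction discussed above applies to older interpreters; since Python 3.3, `sys.maxunicode` is always 0x10FFFF.

```python
import sys


def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of UTF-8 data by walking it: O(n)."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1      # number of code points started so far, minus one
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a new code point starts
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


if __name__ == "__main__":
    print(hex(sys.maxunicode))
    sample = "a\u00f1\u20ac\U0001F600z".encode("utf-8")
    print(code_point_at(sample, 3))
```

A native `str` gives the same answer with `s[3]` in constant time, which is the trade-off the comment is pointing at.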
-
Re:amen. sort of.
Tag characters? Like described in this document, and the linked PDF? http://unicode.org/faq/languagetagging.html
If so, that's an interesting feature. (That I'm never gonna use 'cause they say not to, natch.)
-
Time is money
Bidirectional codes are dangerous, as the erocS cases I mentioned above demonstrate. Some other codes look more useful for ASCII art and aren't dangerous as much as lame, such as U+0B08 ORIYA LETTER II from Oriya, which looks more like the head of a Smurf than a letter to English speakers.
Is IPA somehow dangerous?
X-SAMPA is a workaround.
Cyrillic?
Yes. Remember the IDN homoglyph attack?
Or the Euro sign?
&euro; produces €.
I get the feeling that the whitelist was put in place by someone who doesn't really know Unicode (and/or didn't want to spend time with it)
Correct. It's easier to whitelist Latin-1 than to comb the entire Basic Multilingual Plane looking for anything that's not part of a bidirectional or complex script. If something won't increase SourceForge, Inc.'s ad revenue, it's not worth spending time on.
-
Re:I love Ruby and Rails, don't get me wrong...
since Unicode is actually somewhat broken for the language (not all needed characters are actually defined).
Do you have any more details on that? My understanding was that all characters in traditional/legacy Japanese encodings are included in Unicode (with some exceptions, as mentioned in the standard (PDF warning)). Do you have an example of a character encoded in, e.g., ISO-2022-JP, that's not in Unicode?
-
Re:Nano?
(Anonymously, since I don't want to lose the mod points I used)
Unfortunately, ISO-8859-1 is quite limited, and does not include support for the Greek letter mu, nor the micro symbol, which look identical, but actually each have their own code in Unicode.
You might want to check your facts better, before posting. "MICRO SIGN", unicode code point 0x00B5, 0xB5 in ISO8859-1.
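This is easy to check with Python's built-in codecs. The variable names below are illustrative.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # MICRO SIGN
GREEK_MU = "\u03bc"    # GREEK SMALL LETTER MU

# The micro sign is in ISO-8859-1 and encodes to the single byte 0xB5.
micro_latin1 = MICRO_SIGN.encode("latin-1")

# The Greek letter is not in ISO-8859-1, so encoding it fails.
try:
    GREEK_MU.encode("latin-1")
    mu_fits_latin1 = True
except UnicodeEncodeError:
    mu_fits_latin1 = False

# Compatibility normalization (NFKC) folds the micro sign into the Greek letter.
folded = unicodedata.normalize("NFKC", MICRO_SIGN)

if __name__ == "__main__":
    print(micro_latin1, mu_fits_latin1, folded == GREEK_MU)
```

So both posters are partly right: the two characters have separate code points, and only one of them exists in ISO-8859-1.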
-
Re:"Design Patterns" says it all
How often do you see a Factory or Observer in Java? How often do you see them in Python? Or even C# for that matter?
Quite frankly, I think the problem stems from the lack of closures.
Closures are a tool that can be used to implement patterns like Factory or Observer. The patterns are not a replacement for closures. When, in C#, I do something like "SomeButton.OnClick += SomeButton_OnClick;", I'm making the current object an Observer of things that happen in SomeButton. Sure, it's a better syntax, but the basic idea's still the same.
Instead of being able to sort a list of strings case-insensitively by just calling list.sort(x,y -> string.compare(x.toLower(), y.toLower())), you have to create a class that implements Comparator, put your sort function in there, and then call list.sort() with an instance of that class. All those contortions result in bloat and complexity which makes the language suck.
I don't think it's that bad. Sure, I'd prefer being able to do it the way you wrote it, and I know there are languages out there that allow that kind of syntax. But my experience is that they all run a lot slower than Java / C# do. In many applications, that performance is necessary. And I don't think this:
Arrays.sort(list, new Comparator() { public int compare(Object x, Object y) { return ((String) x).compareToIgnoreCase((String) y); } });
is enough worse to warrant the difference.
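For comparison, this is roughly what the closure-style version looks like in a language that has it. It is a Python sketch, not a claim about Java or C# performance. It shows both a key function and a two-argument comparator wrapped with `functools.cmp_to_key`; the word list is made up. `str.casefold` is a plain per-string case mapping, not language-aware collation.

```python
from functools import cmp_to_key

words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Default sort compares code points, so all uppercase letters come first.
by_code_point = sorted(words)

# Key-function style: fold case once per element, then compare the results.
by_casefold = sorted(words, key=str.casefold)


def compare_ignoring_case(x: str, y: str) -> int:
    """Two-argument comparator in the style of Comparator.compare."""
    a, b = x.casefold(), y.casefold()
    return (a > b) - (a < b)


# Comparator style, adapted to sorted() with cmp_to_key.
by_comparator = sorted(words, key=cmp_to_key(compare_ignoring_case))

if __name__ == "__main__":
    print(by_code_point)
    print(by_casefold)
    print(by_comparator)
```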
(BTW, you can't correctly implement case-insensitive sorting by uppercasing all strings like that, because there are cases where this approach does not work; e.g., the German sharp-s character (ß) uppercases to 'SS', but should sort as a different letter. See here for a discussion.)
-
Re:AHAHAHA (This guy is a joke)
-
Successful computer industry alliances
- VESA
- The Open Group
- IEEE
- GSM
- The Unicode Consortium
- Bluetooth SIG
- CAN
- EIA (responsible for, among other things, JEDEC, who are responsible for DDR and related standards)
-
Re:Equation Editor/Matlab
OpenOffice.org's math editor is just barely OK; the STIX fonts may help somewhat if they're integrated. (But why did they need to invent that awful syntax?) The math editor that comes with MS Office 2007 is much better. See Unicode Technical Note #28 (http://www.unicode.org/notes/tn28/) for an explanation of the syntax and an example document created with that system.
-
Re:mathml
There's a (longish) explanation of how it works in this pdf (1 MB) - the quadratic formula isn't there, but there's some examples on pp. 5 and 6. As far as I can tell, all of it is set with the Cambria (Math) font. To me it looks better on screen than the CM font, but I haven't compared them on paper.
Word still has a lot of problems with equation numbering and so on - it's possible, but it's an ugly hack. I'd be ecstatic if there was some interpreter that could convert from the Word-input method to LaTeX (I use a Danish keyboard, which doesn't help at all - the { and } are located at ctrl+alt+7 and ctrl+alt+0, for instance).
-
Re:no unicode support?
Not yet, though they are on their way to being in the standard. As for Sumerian cuneiform, it is already in Unicode, as part of the ancient languages section.
"One character encoding to rule them all." ;-)