unicode.org · Domains · Slashdot Mirror

Re: Lol by Guy+Harris · 2015-05-28 06:21 · Score: 1 · on A Text Message Can Crash An iPhone and Force It To Reboot

So you are saying "fix the library". I am saying "sanitize input for library".

Both work, but I would argue that sanitizing for the library usually causes a lot fewer problems.

"Programming for international environments is hard, let's go shopping!"

I would argue that you have perhaps not considered all the possible problems and have thus perhaps miscounted the problems with "work around a broken library by transforming perfectly legitimate Unicode character sequences into sequences that might not represent what the person sending the message intended", that being the correct description of the second approach to this problem in the list above.

Yeah, correctly truncating a message that could be an arbitrary sequence of text in multiple languages with combining character sequences and bidirectional text isn't easy, but, well, if you want to be thought of as a company that makes stuff that "just works", you'd better figure out how to make that complicated process "just work".

Maybe iOS 8.3.1 needs to have a quick fix of some sort, but iOS 10, if not iOS 9, should fix the truncation code.

Re: Lol by Guy+Harris · 2015-05-28 06:06 · Score: 1 · on A Text Message Can Crash An iPhone and Force It To Reboot

In this case, the illegal UTF-8 sequence is the string after you have blown part of its funny foreign squiggle.

Where has it been proven that the bug is the trashing of a UTF-8 sequence?

First of all, Apple tends to use UTF-16 in the higher-level frameworks, e.g. that's how CFString/NSString work internally.

Second of all, processing entire characters rather than bytes is something I suspect Apple got right fairly early in the process. I suspect the problem is either that 1) when truncating the message for display, they're not processing entire graphemes, they're processing entire characters or 2) they're not taking bidirectionality into account or 3) they're not handling a combination of both issues.

He's saying that thing you call with your newly minted mangled string shouldn't fail.

Which is one way to solve it.

There are multiple things here that should be fixed. That's one of them - the renderer shouldn't crash if handed a bad string, it should fail more softly, e.g. put in a REPLACEMENT CHARACTER for all bad sequences and, if possible, log the error in a way that indicates that routine XXX has handed a bad character sequence to it.
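As a rough illustration of "fail more softly" (a Python sketch, not Apple's renderer; the names `decode_softly` and `caller` are made up for the example), a decoder can log which routine handed it a bad sequence and substitute U+FFFD REPLACEMENT CHARACTER instead of blowing up:

```python
import logging

logger = logging.getLogger("renderer")


def decode_softly(raw: bytes, caller: str = "unknown") -> str:
    """Decode UTF-8; on malformed input, log the offender and substitute U+FFFD."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Record who passed the bad data and where it went wrong.
        logger.warning("%s passed a bad UTF-8 sequence at byte %d", caller, err.start)
        # errors="replace" turns each malformed sequence into U+FFFD.
        return raw.decode("utf-8", errors="replace")


# "é" is C3 A9 in UTF-8; chopping off the A9 leaves a dangling lead byte.
broken = "café".encode("utf-8")[:-1]
print(repr(decode_softly(broken, caller="truncate_preview")))
```

The same idea applies to UTF-16 input with an unpaired surrogate; only the codec name changes.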

I would argue, if the thing you call mangles strings, sanitize its inputs so it doesn't get a string with a bad character (a Unicode character of whatever format it uses internally, post-mangle).

And I would argue (all the way to the heat death of the universe) that, if you know that the thing you call mangles strings, and if it's produced by somebody else working on the same OS, you get it fixed so that it doesn't do that; you don't mangle user input (which includes text messages from other users) in released software, unless you don't have time to fix the underlying problem for the release.

Re:Lol by Guy+Harris · 2015-05-27 19:06 · Score: 1 · on A Text Message Can Crash An iPhone and Force It To Reboot

No you don't. You are demonstrating the typical moronic attempts to deal with UTF-8.

Here is how you do it:

Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there.

As long as you're not splitting a sequence of multiple characters (multiple characters, some of which might be encoded in multiple bytes with UTF-8) some of which are combining characters. Don't split a character from a combining character following it. Splitting a sequence like that can introduce more rendering errors into the string than are already there.

(I suspect that's what the problem is in this bug, given that there are several combining characters in the string as shown in various places.)

(And you don't want to split it after N characters, if the goal is to limit the display length of the string you're displaying, as not all characters are the same width - and, of course, a base character followed by several combining characters might just have the width of the base character.)
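To make the two levels concrete, here is a minimal Python sketch. `truncate_utf8` follows the quoted byte-level rule (back up over continuation bytes, at most 3 times), and `truncate_keeping_marks` additionally refuses to separate a base character from the marks that follow it. Both function names are invented for the example. The second one only looks at the general category (Mn, Mc, Me), which is an approximation of real grapheme-cluster segmentation, and it does nothing about display width or bidirectional text.

```python
import unicodedata


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Keep at most `limit` bytes without splitting a multi-byte UTF-8 character."""
    if limit >= len(data):
        return data
    cut = limit
    # data[cut] is the first byte that would be dropped. Continuation bytes
    # look like 10xxxxxx; if we landed on one, back up (at most 3 times).
    for _ in range(3):
        if cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        else:
            break
    return data[:cut]


def truncate_keeping_marks(text: str, limit: int) -> str:
    """Keep at most `limit` code points without cutting a base off from its marks."""
    if limit >= len(text):
        return text
    cut = limit
    # If the first dropped code point is a mark, it belongs to the character
    # before it, so keep backing up until the cut falls before a non-mark.
    while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
        cut -= 1
    return text[:cut]


print(truncate_utf8("aé".encode("utf-8"), 2))            # cut falls inside "é"
print(repr(truncate_keeping_marks("ae\u0301", 2)))       # cut falls before U+0301
```

In the second call the "e" is dropped along with its combining acute accent, rather than being left behind without it.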

Re:What is the string? by Guy+Harris · 2015-05-27 13:36 · Score: 1 · on A Text Message Can Crash An iPhone and Force It To Reboot

In hex, the string is:

506f 7765 7220 d984 d98f d984 d98f d8b5 d991 d8a8 d98f d984 d98f d984 d8b5 d991 d8a8 d98f d8b1 d8b1 d98b 20e0 a5a3 20e0 a5a3 6820 e0a5 a320 e0a5 a320 e586 97

That's the string encoded as UTF-8, so it's more like

50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97

If we turn that into a sequence of (21-bit) Unicode code points, it becomes

000050 00006f 000077 000065 000072 000020 000644 00064f 000644 00064f 000635 000651 000628 00064f 000644 00064f 000644 000635 000651 000628 00064f 000631 000631 00064b 000020 000963 000020 000963 000068 000020 000963 000020 000963 000020 005197

which, encoded as UTF-16, is

0050 006f 0077 0065 0072 0020 0644 064f 0644 064f 0635 0651 0628 064f 0644 064f 0644 0635 0651 0628 064f 0631 0631 064b 0020 0963 0020 0963 0068 0020 0963 0020 0963 0020 5197

As UTF-16, there are no surrogate pairs, so the bug presumably isn't a problem with handling UTF-16-encoded Unicode characters bigger than 00FFFF.
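The conversion can be checked mechanically. The following Python sketch decodes the UTF-8 bytes listed above, prints the code points, and tests whether any of them would need a surrogate pair in UTF-16 (the variable names are just for the example):

```python
utf8_hex = (
    "50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f "
    "d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 e0 a5 a3 "
    "20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97"
)

raw = bytes.fromhex(utf8_hex)
text = raw.decode("utf-8")  # raises UnicodeDecodeError if the bytes were malformed

code_points = [ord(ch) for ch in text]
print(" ".join(f"{cp:06x}" for cp in code_points))

# Each code point above U+FFFF takes two UTF-16 code units (a surrogate pair),
# so the unit count exceeds the code point count only if such a pair exists.
utf16_units = len(text.encode("utf-16-be")) // 2
has_surrogate_pairs = utf16_units != len(code_points)
print("surrogate pairs:", has_surrogate_pairs)
```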

I suspect that the string is probably being processed as UTF-16, because that's how CFString/NSString are encoded internally and because code handling UTF-8 that can't handle multi-byte characters couldn't handle anything other than ASCII.

U+0963 is DEVANAGARI VOWEL SIGN VOCALIC LL, which is a nonspacing mark; my guess is that it (or perhaps some other character in that sequence that's a combining character) is getting split, by the ellipsis, from the character with which it's supposed to combine, and that the rendering code is blowing up because of that.

If so, this has nothing to do with UTF-16 being too hard to handle correctly, or with the code not being able to handle characters that are "too many bytes", it has to do with sequences of characters sometimes having to be handled specially, and not just blithely split between characters.
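For anyone who wants to check which code points in that string are combining marks, Python's `unicodedata` module exposes the general category from the Unicode Character Database. A small sketch (the helper name `is_nonspacing_mark` is made up here):

```python
import unicodedata


def is_nonspacing_mark(ch: str) -> bool:
    """True if the character's general category is Mn (Mark, nonspacing)."""
    return unicodedata.category(ch) == "Mn"


# A few of the code points that appear in the string.
for cp in (0x0644, 0x064F, 0x0651, 0x064B, 0x0963):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

U+0644 is an ordinary letter (Lo); the other four are nonspacing marks, so a split that leaves one of them without its base is the kind of thing the renderer would have to cope with.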

It starts with "Power ", but I guess that's not important.

It might make the string long enough that the code displaying it on the main screen would abbreviate it and thus insert an ellipsis.

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-21 13:30 · Score: 0 · on OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?

The byte order mark is part of the Unicode standard, and is used all over the place besides Windows. Your question answers itself.
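For what it's worth, handling a BOM is not Windows-specific in practice either. A small Python sketch using the standard `codecs` constants (the `sniff_bom` helper and the `_BOMS` table are invented for this example):

```python
import codecs
from typing import Optional

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


data = codecs.BOM_UTF8 + "hello".encode("utf-8")
plain = data.decode("utf-8")         # the BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # the "utf-8-sig" codec drops it
print(sniff_bom(data), repr(plain), repr(stripped))
```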

Scrapping DST worldwide for 24 time zones by Twinbee · 2015-03-07 07:18 · Score: 1 · on Daylight Saving Time Change On Sunday For N. America

Never mind just America, let's work to scrap DST worldwide. DST (or daylight saving time) is a great source of confusion. It complicates administration, as well as making life tough for programmers and everyday people who need to make sure their clocks are reset twice a year.

However, if we scrapped DST (along with 15 or 30 minute offsets), we would only have 24 time zones - one for each hour! This is a reduction from the hundreds we currently have in use around the world. Each location would simply be assigned to a whole-hour offset from UTC (0-23).

For many reasons, it'd be nice if everyone used UTC as their only time, but in the meantime, twenty-four consistent, simple and clear zones should be enough for everybody.

Re:utf-32/ucs-4 by Anonymous Coward · 2015-01-12 09:06 · Score: 0 · on NetHack Development Team Polls Community For Advice On Unicode

It's obvious you have little real experience with Unicode, because saying 'just convert to utf-32' just papers over the problems without solving them. UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in UTF-32. So you still have all the length and normalization problems you have with UTF-8 (and even with ASCII, though people often ignore it there -- are 'a' and 'A' the same character? How do they sort?)

The real 'length' problem is that people insist on using the term ambiguously -- you have string storage space and string rendering size, and the two are completely independent.

Actually, there's three!

1. Byte count (storage space)
2. Codepoint count - the number of Unicode codepoints present in the string, regardless of whether or not they are rendered.
3. Grapheme count - the number of user-perceived characters (grapheme clusters), roughly what ends up rendered as one glyph each.

But before you start counting those...you are normalizing your user input to your internal normalization form, right? Wait...you haven't decided what that normalization form is yet?
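A quick Python sketch of the three counts and the effect of normalization. The `lengths` helper is made up for the example, and its third number is only an approximation: it counts code points that are not combining marks, which is not full grapheme-cluster segmentation.

```python
import unicodedata


def lengths(s: str):
    """Return (UTF-8 byte count, code point count, approximate grapheme count)."""
    byte_count = len(s.encode("utf-8"))
    code_point_count = len(s)
    # Approximation: every code point that is not a mark starts a new cluster.
    cluster_count = sum(
        1 for ch in s if not unicodedata.category(ch).startswith("M")
    )
    return byte_count, code_point_count, cluster_count


decomposed = "e\u0301"                               # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(lengths(decomposed))  # (3, 2, 1)
print(lengths(composed))    # (2, 1, 1)
```

Same text on screen, different byte and code point counts, which is why the normalization form has to be settled before any counting or comparing is done.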

Re:Same reason blu-ray didn't take off by Dahan · 2014-09-05 21:51 · Score: 1 · on Dell Demos 5K Display

On the 35" the text is too small to read comfortably for any length of time

Text size has no relation to the display size. Text size is generally specified in "points", where one point is approximately 1/72 inch. If you find the text too small to read, the obvious solution is to increase the size. Display size affects how much text you can display given a certain text size. E.g., you might get 40 lines of 10 point text on a 24" monitor, and 45 lines of 10 point text on a 32" monitor.

I don't see how reading on a 27" is going to work unless you increase your font size which reduces the benefits of the higher resolution.

Why wouldn't reading on a 27" work? A long time ago, I had a 15" CRT and was able to read text on it without any problems. And even further back, there were 9" screens, and even smaller ones. You just couldn't get as much text on them (e.g., 40 columns across).

The benefit of higher resolution is that text is sharper, since you can use more pixels to draw the characters while keeping the same point size. E.g., instead of using 8x12 pixels to draw a character, you can use 16x24, which looks a lot better. It's even more noticeable if you work with Chinese/Japanese/Korean text, where the characters are much more detailed than the Roman alphabet. Some characters (such as this one) turn into an indistinct mess if you have to squeeze it into a 12x12 pixel cell, but if you have 24x24 to work with, it looks a lot better.

In any case, this Dell monitor sounds interesting... I was considering their previous 4K 24" monitor, but the way it faked being two half-screens (to work around HDMI limitations?) seemed annoying and glitch-prone, and I heard that the next generation of monitors wouldn't have to do that. I currently have a 24" monitor, and am looking for something the same size, but I suppose 27" isn't too much bigger.

Re:Next wave of phishing? by Chrisq · 2014-08-05 20:48 · Score: 2 · on Gmail Recognizes Addresses Containing Non-Latin Characters

I think that's the way to go - only allow characters from a single Unicode script in the username and in the domain name. The domain name part is currently handled by registrars so that may not need any additional rules.

However this really should be part of the RFC, or else anyone banning mixed names would be "non compliant". If the RFC does not specify this then the best that Gmail (or any other system) could do would be to prevent people registering mixed names themselves and to give a warning (and maybe colour the characters) if email is received from an address with mixed scripts.
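A rough sketch of such a mixed-script check in Python. Note the big assumption: the standard `unicodedata` module does not expose the Unicode Script property, so this takes the first word of each letter's character name ("LATIN", "CYRILLIC", "GREEK", ...) as a stand-in for the script. That is a heuristic for illustration, not what a real implementation should rely on, and all the function names are invented.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Approximate the set of scripts in `label`, looking at letters only."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # Heuristic: first word of the character name stands in for the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(label: str) -> bool:
    return len(scripts_used(label)) <= 1


def address_is_single_script(address: str) -> bool:
    """Check the username and the domain part separately."""
    local, _, domain = address.rpartition("@")
    return is_single_script(local) and is_single_script(domain)


lookalike = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(scripts_used("paypal"), scripts_used(lookalike))
print(address_is_single_script(lookalike + "@example.com"))
```

Digits and punctuation are ignored here, so something like "user123" still counts as single-script.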

Re:The bashing is sometimes justified... by jc42 · 2014-07-31 06:28 · Score: 2 · on Countries Don't Own Their Internet Domains, ICANN Says

I can also show a swastika on my U.S.-hosted site and criticize public officials without fear of ridiculously heavy-handed libel/defamation laws. And don't even get me started with the bullshit cultural and language laws in France. It's amazing anything gets done in that country at all.

Oh, I dunno; I've seen any number of sites similar to this one, whose information is mirrored at zillions of locations on the web, including many outside the US. There are historical and cultural reasons for including the symbols at code points 534D and 5350 in Unicode, and I doubt that anyone has ever been prosecuted for installing full Unicode charsets or lookup software on their web sites.

I haven't looked for such pages on French sites, but I'd be surprised if they don't exist (with the text in French rather than English), and I'd also be surprised if the French government has tried to suppress such character codes in the Unicode lookups.

It's possible that such things have happened and I just haven't read about them. Does anyone know of cases of official harassment for including pages like the above on a web site? For example, has any Islamic or other religious government ever harassed people for allowing the U+271D char code on a web page?

(And yes, I do have a couple of experimental dictionaries on my own web sites, including one dealing with Chinese characters which includes an entry for the swastika characters. Nobody has even suggested that these glyphs shouldn't be there. Possibly it's because nobody has ever looked at my dictionaries, but still ... ;-)

Re:Middle finger by DaphneDiane · 2014-06-16 19:37 · Score: 1 · on Unicode 7.0 Released, Supporting 23 New Scripts

I believe you are referring to U+1F595.

Re:It's kind of long and meandering by melikamp · 2014-01-01 13:22 · Score: 1 · on Ask Slashdot: What Are the Books Everyone Should Read?

Too contrived. The only book one needs is the UTF specification.

Re:Wow, an amazing co-incidence by Dahan · 2013-07-18 18:54 · Score: 1 · on ICANN Approves First Set of New gTLDs

How is it a "huge problem"? ASCII has a number of control characters too. A whitelist is a great idea, but why is the whitelist so restrictive? Just grab a copy of the current Unicode Data file and whitelist all current non-control characters. And if you're concerned that Zalgo might come, I suppose you could omit any non-spacing chars from the whitelist without people complaining too much (though perhaps it'd be good to include the ones that are actual letters in various Indic scripts).
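A sketch of what that whitelist could look like in Python, driven by the general category from the Unicode data that ships with the interpreter. Assumptions beyond the post: besides control characters (Cc) this also rejects format, surrogate, private-use and unassigned code points (Cf, Cs, Co, Cn), and the names `DISALLOWED`, `is_allowed` and `is_clean` are made up. Rejecting Cf also rejects ZWJ/ZWNJ, which some scripts need, so a real list would want exceptions.

```python
import unicodedata

# Assumption: treat all "Other" categories as off the whitelist, not just Cc.
DISALLOWED = {"Cc", "Cf", "Cs", "Co", "Cn"}


def is_allowed(ch: str, allow_nonspacing: bool = True) -> bool:
    """True if `ch` is on the whitelist; optionally exclude nonspacing marks (Mn)."""
    category = unicodedata.category(ch)
    if category in DISALLOWED:
        return False
    if category == "Mn" and not allow_nonspacing:
        return False
    return True


def is_clean(text: str, allow_nonspacing: bool = True) -> bool:
    return all(is_allowed(ch, allow_nonspacing) for ch in text)


print(is_clean("hello"))                            # True
print(is_clean("a\x00b"))                           # False: U+0000 is a control
print(is_clean("e\u0301", allow_nonspacing=False))  # False: U+0301 is Mn
```

Dropping all of Mn is the blunt anti-Zalgo option; as noted, it would also drop marks that Indic scripts treat as ordinary parts of words.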

Re:Emoticons are already free and open source. by martin-boundary · 2013-02-23 15:03 · Score: 1 · on Open Source Emoji Project Wants Money For Icons

Why bother with emoji, though? Just use Chinese ideographs. They're the natural final progression of this idea, after all. Moreover, if you're just after basic emoticons, there's a Unicode range from 1F600 to 1F64F.
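Checking whether a character falls in that range is a one-liner; a small Python sketch (the names `EMOTICONS` and `is_emoticon` are just for the example):

```python
# The Emoticons block, U+1F600 through U+1F64F inclusive.
EMOTICONS = range(0x1F600, 0x1F64F + 1)


def is_emoticon(ch: str) -> bool:
    """True if the single character `ch` lies in the Emoticons block."""
    return ord(ch) in EMOTICONS


print(len(EMOTICONS))              # 80 code points in the block
print(is_emoticon("\U0001F600"))   # True
print(is_emoticon("A"))            # False
```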

Re:Mahjongg? by Anonymous Coward · 2012-02-01 19:38 · Score: 0 · on Unicode 6.1 Released

Done as well.

Re:Favourite unicode character by Anonymous Coward · 2012-02-01 13:32 · Score: 0 · on Unicode 6.1 Released

Actually, it was ISO/IEC 10646 that started out as a Han Unification project. Unicode actually began as a universal character encoding standard. Between version 1.0 and 1.01, Unicode merged with 10646, and they became one big squabbling family, where everyone got to act like they were Unicode, but got named after 10646. The Tibetans got lost when they moved into the new house, and somehow the Koreans ended up being triplets, but they eventually found their way back home. Eventually the Cherokees brought some native flair, while the Mormons made everyone stop drinking, at least for a while. Eventually the Chinese decided they needed a place for all their ancestors' ashes, and the Japanese kids spread lolcats all over the place. Of course, now we've got French stenographers and old Hungarians knocking at the door, trying to get in, not to mention a bunch of African tribesmen and some more Minoans trying to force the gate.