unicode.org · Domains · Slashdot Mirror

Re:Cry for relevency by tepples · 2007-07-25 06:53 · Score: 1 · on W3C Considering An HTML 5

But the same applies to all the arrows used, e.g., in pagination widgets (first, previous, next, last). The alt attribute of a navigation arrow could be the corresponding arrow in one of the Unicode symbol collections, such as Arrows (U+2190) (PDF).

Re:case-insensitive: performance, i18n, safety by netfunk · 2007-06-07 09:40 · Score: 3, Insightful · on Sun CEO Says ZFS Will Be 'the File System' for OSX

(I don't know what various filesystems actually do, this is just how I would assume it's done, at least on systems designed for case-insensitivity...ext2 or FFS probably would suffer from the issues you mention about scanning the whole directory.)

On a case-insensitive filesystem, you're done if you're lucky. If not lucky, you need to do a linear scan of the whole damn directory.

And yet Windows and Mac OS have had case-insensitive filesystems for years and somehow they are usable, even with Unicode filenames.

You can't restore the original case of a string afterwards, but you can always make it lowercase. This is called "case folding." You can fold two strings to a lowercase form, and then compare them for equality or whatnot. Works with Unicode, too.

Then there is the issue of internationalization. For example, consider "I" and "i". Some places have an uppercase with the dot, and other places have a lowercase without the dot. The rules for uppercasing and lowercasing differ from what most people are used to. Oh crap! This issue doesn't exist on a case-sensitive filesystem.

While folding Unicode chars is frequently presented as an unsolvable problem ("what do you do with the letter with the squiggly thing above it? Or converting that German capital 'B' thing to two lowercase 's' chars? There are MILLIONS OF THESE!") ... there are actually very few cases in the grand scheme of things. Most languages don't have upper and lower case, after all.

Here's the whole list of characters that need to be "folded" to a lowercase form, accounting for instances where it will cause the string to grow (like that German 'B' thing):

http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt

(And you can hash those chars too, so folding a string doesn't involve hundreds of conditionals.)
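A minimal sketch of that table-driven approach, in Python. The three-entry table is a hand-picked, illustrative subset and not the real CaseFolding data file, and `fold` is a made-up name. The point is only that folding costs one dictionary (hash) lookup per character, and that one character may fold to several.

```python
# Illustrative subset of a case-folding table (an assumption for this
# sketch); a real table would be generated from the Unicode data file.
FOLD_TABLE = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S grows to two chars
    "\u00c9": "\u00e9",  # LATIN CAPITAL LETTER E WITH ACUTE
    "\u0391": "\u03b1",  # GREEK CAPITAL LETTER ALPHA
}

def fold(s: str) -> str:
    """Fold a string with one hash lookup per non-ASCII character."""
    out = []
    for ch in s:
        if "A" <= ch <= "Z":
            out.append(chr(ord(ch) + 32))       # ASCII fast path
        else:
            out.append(FOLD_TABLE.get(ch, ch))  # table lookup, default: unchanged
    return "".join(out)

# Both spellings fold to the same string, so they compare equal.
print(fold("STRASSE") == fold("Stra\u00dfe"))
```

Python's built-in `str.casefold()` does the full version of this, so the hand-rolled table is only there to show the mechanism.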

If you don't care about Unicode, case folding an English ASCII char is 2 lines of C code, and a few more if you want extended ASCII.

Once you have a filename, you can store it in the filesystem as the specifically-entered characters, so you don't lose the original casing, but also store with it a hash of the case-folded version. Now whenever you need to look up a specific filename, you case-fold it, hash that folded string, and look it up that way against the hash you previously calculated when creating the file. Now it's as fast as the case-sensitive filesystem, minus the overhead of folding a small string.
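Roughly what I mean, as a toy in Python. This is a sketch under assumptions, not how any real filesystem lays things out: the `Directory` class is hypothetical, a Python dict stands in for the on-disk hash, and `str.casefold()` stands in for the folding step.

```python
class Directory:
    """Toy case-insensitive, case-preserving directory (hypothetical)."""

    def __init__(self):
        # folded name -> (name as originally entered, file contents)
        self._entries = {}

    def create(self, name: str, data: bytes) -> None:
        key = name.casefold()          # fold once, then hash via the dict
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = (name, data)

    def lookup(self, name: str):
        # Same fold-then-hash path as create(); no directory scan.
        return self._entries[name.casefold()]

    def listing(self):
        # Listings still show the casing the user typed.
        return [original for original, _ in self._entries.values()]
```

Looking up `readme.txt` after creating `ReadMe.TXT` hits the same dict slot, and the listing still reports `ReadMe.TXT`.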

Because of the way directory listings are done (read then look up stats) you can generally square the above numbers. Ouch.

The way directory listings are done doesn't change...readdir() is the same in all cases, and your lookup is still a hash. If you had to scan, the first run is slow anyhow due to disk bandwidth and seek speeds, but then a modern OS can cache the inodes to speed this up for the next run.

App needs to make a file. App sees that file does not seem to exist. App writes file. Complex international case rules mean that no, the file DOES exist, and it gets clobbered.

I would think that stat(filename) would not report the file doesn't exist if open() would then clobber it, at least not for case-sensitivity issues.

If your app decides about a file's existence by using readdir() until it finds it, and doesn't properly case-fold, and didn't call open() with O_EXCL, then not only did you go the long way about it, you got what you deserved for clobbering the file.

Actually, if you don't just open(O_CREAT | O_EXCL) to check for existence and create if missing in one step, then you'll have an atomicity problem anyhow. Use the services the OS provides, they are there for a reason.
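A minimal sketch of the one-step check-and-create, written in Python rather than C. `os.open` passes the flags through to the underlying open() call; `create_exclusive` is just a name picked for this example.

```python
import os

def create_exclusive(path: str, data: bytes) -> bool:
    """Create path and write data; return False if it already exists.

    O_CREAT | O_EXCL makes the existence check and the creation a single
    atomic step, so there is no window between "check" and "write".
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```

The existence decision is made by the filesystem itself, so whatever case rules it applies are the ones that count, and the app never has to guess.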

--ryan.

Re:Word processors seem unsuited for this by jc42 · 2007-06-03 03:50 · Score: 1 · on Some Journals Rejecting Office 2007 Format

If you want a nearly-inexhaustible supply of characters, the Chinese have the answer!

Of course, they do have a few examples of characters that are easy to confuse. For example, compare Unicode chars 5E02 and 5DFF. Those really are different characters, with different pronunciations and meanings. They even have different stroke counts.

But even with a 24x46 char size, there's a limit to the number of distinct glyphs you can draw (and there are more recognized Chinese characters than that ;-).

Re:IIS's fault by DrVomact · 2007-05-22 06:04 · Score: 1 · on Unicode Encoding Flaw Widespread

"Full width" vs. "Half width" (or, as I prefer, "half-wit") characters exist for typographical convenience in rendering Japanese characters. (Take a look at the Unicode spec, section 10.3 for example http://www.unicode.org/book/ch10.pdf/). This does not, however, explain why certain symbols that are already defined in other parts of the Unicode standard, such as the less-than symbol (or left angle bracket) are duplicated there. I suspect that it has something to do with possible confusions that might arise when parsing or transcoding mixed double-byte and single-byte characters...but that's just a guess.

In any case, the effect of this is that there are 2 ways of producing the < glyph: you can use character code x3C or xFF1C. However, your experiments have shown that browsers do not treat these two codes as being the same character...even though they look the same. I'm not sure if that's right or wrong, if there is a right and wrong way to handle this issue (I suppose that means it's excellent grounds for a religious war)--it's just important that it be handled consistently. From what you found, IE and FF are consistent with each other, while IIS handles the two codes as identical characters. I would think that IIS would at least be on the same page with IE...but wait, that's MS we're talking about.
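For what it's worth, here is a small Python sketch of how the two code points relate. It only shows that they are distinct as stored and become identical under compatibility normalization (NFKC). Whether a server or browser ought to apply that normalization is exactly the open question, and `same_after_nfkc` is a name made up for this example.

```python
import unicodedata

ascii_lt = "\u003c"      # LESS-THAN SIGN
fullwidth_lt = "\uff1c"  # FULLWIDTH LESS-THAN SIGN

def same_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(ascii_lt == fullwidth_lt)                 # False: different code points
print(same_after_nfkc(ascii_lt, fullwidth_lt))  # True: NFKC maps one to the other
```

So a component that normalizes before filtering and one that filters before normalizing will disagree about whether the input contains a "<".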

Re:Not a surprise... by Intron · 2007-05-22 05:02 · Score: 1 · on Unicode Encoding Flaw Widespread

http://www.unicode.org/versions/

Any time a standard has been changed, you will have some outdated, but perfectly correct software. Hence, two pieces of software may not agree on the meaning of a Unicode string even without a software error.

Re:Not a surprise... by jhol13 · 2007-05-22 03:49 · Score: 1 · on Unicode Encoding Flaw Widespread

unicode allows more than one representation for some characters

Unicode states how normalization should occur: http://www.unicode.org/unicode/reports/tr15/. Are there some problems with this, or what are you referring to?

Re:Big Trouble in Little China. Don't use UCS-2. by Anonymous Coward · 2007-05-12 20:14 · Score: 0 · on Migrate a MySQL Database Preserving Special Characters

Precomposed forms only exist for lossless round-trips to/from legacy character sets. NFC is frozen, so there aren't going to be precomposed forms for any new diacritical marks. You can use NFD or NFKD to remove everything precomposed from your data. They recommend NFC (everything precomposed where possible) for the Web, though.
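A quick illustration with Python's standard `unicodedata` module (a sketch, using e-acute as an arbitrary example character):

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = unicodedata.normalize("NFD", precomposed)

# NFD splits it into a base letter plus a combining mark.
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# NFC puts it back together, so the round trip is lossless.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```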

What you were proposing is a dictionary compression scheme to fit each combining character sequence into a single array slot, but that doesn't even buy you much. You still can't sort or display strings by splitting them arbitrarily or processing one combining character sequence at a time because of ligatures, digraphs (e.g., "ch" sorts after "h" in a Slovak locale), bidi, and weird stuff like soft hyphen and combining grapheme joiner. The library routines need to see the whole string at once to give the right answers in context.

Re:Picture by NetHead026 · 2007-03-06 05:40 · Score: 1 · on The Blackest Material

Here you go (posted as is since /. will strip it out otherwise):

■ (PDF Warning)

Re:Improved multi-byte support? by VGPowerlord · 2007-01-20 16:57 · Score: 1 · on Ruby On Rails 1.2 Released

No, they're not. UCS-2 and UTF-32 are fixed width encodings, not multi-byte.

UCS-2 was a bad example, as it has been phased out in favor of UTF-16.

The technical introduction to Unicode states "The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit)."

You'll notice that only the first is listed as byte-oriented. That's because a word as they have defined it is two bytes long. Two bytes is, of course, more than one byte, thus the term "multi-byte." The UTF-8, UTF-16, UTF-32 & BOM FAQ has a nice table with the minimum and maximum bytes/character that each encoding takes.
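You can reproduce the gist of that table yourself. A small Python sketch (`encoded_sizes` is a made-up helper; the big-endian codec names are used only so that no byte order mark gets counted):

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character occupies in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # -be: no BOM prepended
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in ("A", "\u20ac", "\U0001d11e"):  # letter A, euro sign, musical G clef
    print(hex(ord(ch)), encoded_sizes(ch))
```

"A" comes out as 1, 2 and 4 bytes, the euro sign as 3, 2 and 4, and the G clef (outside the BMP) as 4, 4 and 4, so every one of these encodings uses more than one byte for at least some characters.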

(For reference, the Unicode standard refers to the full size of a character as a "code unit" or "code value," rather than a byte.)

And if UTF-8 is not eventually supported natively by Ruby, then the Rails implementation will still be needed. The rest of the internet is not going to drop UTF-8 just because Ruby does not support it.

This slide, from a presentation given by Ruby's author, Yukihiro "Matz" Matsumoto, indicates upcoming support for UTF-8.

Easy to fix. by rumith · 2006-11-21 05:32 · Score: 1 · on ICANN Under Pressure Over Non-Latin Characters

Just introduce a restriction according to which a valid URL can only contain symbols from one alphabet. I believe it's not too hard to determine (http://www.unicode.org/charts/) which character set a given code point belongs to, and whether the URL uses more than one.
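A rough sketch of such a check in Python. Big assumption: the standard library does not expose the Unicode script property, so this uses the first word of each letter's character name (LATIN, CYRILLIC, GREEK, ...) as a stand-in. That is a heuristic, not a proper script lookup, and both function names are made up.

```python
import unicodedata

def scripts_used(label: str) -> set:
    """Heuristic: the first word of each letter's Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def is_single_script(label: str) -> bool:
    """True if all letters in the label appear to share one script."""
    return len(scripts_used(label)) <= 1

print(is_single_script("paypal"))       # all Latin letters
print(is_single_script("p\u0430ypal"))  # second letter is CYRILLIC SMALL LETTER A
```

Digits, dots and hyphens are skipped by the `isalpha()` filter, so they don't count as a second alphabet.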

4 things by Anonymous Coward · 2006-10-14 03:37 · Score: 5, Interesting · on Firefox Accepting Feature Suggestions for Version 3

1. A fix for this javascript DoS attack:
for(;;) alert("Please restart your browser.");

2. Make hotkeys work everywhere, all the time. (You know when you hit CTRL+L and nothing happens)

3. Make it possible to open javascript links in new tabs.

4. Support for soft hyphens.

Re:colon in Mac OS X file names by Guy+Harris · 2006-07-10 14:50 · Score: 1 · on Linux/Mac/Windows File Name Friction

In Terminal.app

...or in an X11-based app (i.e., in anything that uses standard UN*X calls to operate on files and doesn't use Apple file dialogs)...

you can create file names with colon, but such character is mapped to a forward slash when seen in Finder.

...or in standard Apple dialogs.

Historically, Mac OSes use colon to separate folder names in a path.

...which is why the Carbon layer does colon slash mapping for file/path names passed to UN*X calls or file names and file/path names returned by UN*X calls.

There is a subtle restriction in HFS+. All files in HFS+ have their names in normalized unicode

Normalization Form D, to be precise - unlike Normalization Form C, which Windows and most other UN*Xes use. This can cause some additional problems.

colon in Mac OS X file names by pikine · 2006-07-10 03:29 · Score: 3, Insightful · on Linux/Mac/Windows File Name Friction

OS X supports up to 255 characters and can use the same characters as Linux, except for a colon (:).

In Terminal.app, you can create file names with colon, but such character is mapped to a forward slash when seen in Finder. On the other hand, you can use forward slash in Finder, and it is mapped to a colon in the command line.
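The mapping described above amounts to swapping the two characters within a single file name. A tiny Python illustration of that idea (not Apple's code; `swap_colon_slash` is a made-up name):

```python
# Translation table that exchanges ':' and '/'.
_SWAP = str.maketrans({":": "/", "/": ":"})

def swap_colon_slash(component: str) -> str:
    """Swap ':' and '/' in one file name (a path component, not a full path)."""
    return component.translate(_SWAP)

print(swap_colon_slash("notes 10:07"))  # what a ':' name would show as with '/'
```

Applying it twice gives back the original name, which is why the two views can stay consistent.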

Historically, Mac OSes use colon to separate folder names in a path.

There is a subtle restriction in HFS+. All files in HFS+ have their names in normalized unicode, and in order to normalize in the first place, file names must be in valid UTF-8 encoding. You cannot use an arbitrary byte string as a file name.

There is no such restriction for UFS on Mac OS X. I think UFS supports roughly the same characters as in BSD and Linux and any other Unices. If you're transferring files from Linux with names in a legacy encoding, you can create a UFS disk image and convert file names to UTF-8 before copying them to HFS+.
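The conversion step could look something like this in Python. It is a sketch only: `to_utf8_name` is a made-up helper, and Latin-1 as the legacy encoding is an assumption, since you have to know (or guess) what encoding the old names are really in.

```python
def to_utf8_name(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Return a file name as valid UTF-8 bytes.

    Names that already decode as UTF-8 are left alone; anything else is
    assumed to be in legacy_encoding and re-encoded.
    """
    try:
        text = raw.decode("utf-8")          # already valid UTF-8
    except UnicodeDecodeError:
        text = raw.decode(legacy_encoding)  # assumed legacy encoding
    return text.encode("utf-8")

print(to_utf8_name(b"caf\xe9"))  # Latin-1 e-acute becomes the two bytes C3 A9
```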

Re:Unforseen problems by clarkcox3 · 2006-04-17 04:26 · Score: 1 · on Is It Time For .tel?

Actually, no. I do not believe incorrectly. Chang is indeed a surname. I am quite familiar with the order used in Chinese names; please check your facts before you attempt to correct someone else.