Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
Re:Cry for relevency
But the same applies to all the arrows used, e.g., in pagination widgets (first, previous, next, last). The alt attribute of a navigation arrow could be the corresponding arrow in one of the Unicode symbol collections, such as Arrows (U+2190) (PDF).
-
Re:case-insensitive: performance, i18n, safety
(I don't know what various filesystems actually do, this is just how I would assume it's done, at least on systems designed for case-insensitivity...ext2 or FFS probably would suffer from the issues you mention about scanning the whole directory.)
On a case-insensitive filesystem, you're done if you're lucky. If not lucky, you need to do a linear scan of the whole damn directory.
And yet Windows and Mac OS have had case-insensitive filesystems for years and somehow they are usable, even with Unicode filenames.
You can't restore the original case of a string afterwards, but you can always make it lowercase. This is called "case folding." You can fold two strings to a lowercase form, and then compare them for equality or whatnot. Works with Unicode, too.
Then there is the issue of internationalization. For example, consider "I" and "i". Some places have an uppercase with the dot, and other places have a lowercase without the dot. The rules for uppercasing and lowercasing differ from what most people are used to. Oh crap! This issue doesn't exist on a case-sensitive filesystem.
While folding Unicode chars is frequently presented as an unsolvable problem ("what do you do with the letter with the squiggly thing above it? Or converting that German capital 'B' thing to two lowercase 's' chars? There are MILLIONS OF THESE!") ... there are actually very few cases in the grand scheme of things. Most languages don't have upper and lower case, after all.
Here's the whole list of characters that need to be "folded" to a lowercase form, accounting for instances where it will cause the string to grow (like that German 'B' thing):
http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt
(And you can hash those chars too, so folding a string doesn't involve hundreds of conditionals.)
If you don't care about Unicode, case folding an English ASCII char is 2 lines of C code, and a few more if you want extended ASCII.
Once you have a filename, you can store it in the filesystem as the specifically-entered characters, so you don't lose the original casing, but also store with it a hash of the case-folded version. Now whenever you need to look up a specific filename, you case-fold it, hash that folded string, and look it up that way against the hash you previously calculated when creating the file. Now it's as fast as the case-sensitive filesystem, minus the overhead of folding a small string.
Because of the way directory listings are done (read then look up stats) you can generally square the above numbers. Ouch.
The way directory listings are done doesn't change...readdir() is the same in all cases, and your lookup is still a hash. If you had to scan, the first run is slow anyhow due to disk bandwidth and seek speeds, but then a modern OS can cache the inodes to speed this up for the next run.
App needs to make a file. App sees that file does not seem to exist. App writes file. Complex international case rules mean that no, the file DOES exist, and it gets clobbered.
I would think that stat(filename) would not report the file doesn't exist if open() would then clobber it, at least not for case-sensitivity issues.
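The folded-hash lookup described earlier in this comment (keep the name as entered, index it by its case-folded form) could be sketched roughly as follows. This is only an illustration in Python, not how any real filesystem does it: the class name `FoldedDirectory` is hypothetical, the "directory" is an in-memory dict, and `str.casefold()` stands in for the CaseFolding table (it does not apply the Turkic dotted/dotless "i" rules).

```python
class FoldedDirectory:
    """Toy in-memory directory keyed by the case-folded file name."""

    def __init__(self):
        # The dict hashes the folded key; the value keeps the name as entered.
        self._entries = {}

    def create(self, name):
        key = name.casefold()
        if key in self._entries:
            # A name that differs only by case already exists.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name):
        """Return the stored original name, or None if absent."""
        return self._entries.get(name.casefold())


d = FoldedDirectory()
d.create("Straße.txt")
print(d.lookup("STRASSE.TXT"))  # Straße.txt
```

The lookup costs one fold plus one hash probe, and the original casing survives because only the key is folded.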
If your app decides about a file's existence by using readdir() until it finds it, and doesn't properly case-fold, and didn't call open() with O_EXCL, then not only did you go the long way about it, you got what you deserved for clobbering the file.
Actually, if you don't just open(O_CREAT | O_EXCL) to check for existence and create if missing in one step, then you'll have an atomicity problem anyhow. Use the services the OS provides, they are there for a reason.
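For illustration, the check-and-create-in-one-step idea might look like this from Python, which exposes the same flags through `os.open`. The helper name `create_exclusive` and the 0o644 mode are assumptions made for the sketch.

```python
import os


def create_exclusive(path, data):
    """Create path and write data only if it does not exist yet.

    Returns True when the file was created, False when it already existed.
    The existence check and the creation happen in a single open() call.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        fd = os.open(path, flags, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```

Because the kernel performs the existence test, whatever name-matching rules the filesystem applies are the ones that decide whether the file "exists".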
--ryan. -
Re:Word processors seem unsuited for this
If you want a nearly-inexhaustible supply of characters, the Chinese have the answer!
Of course, they do have a few examples of characters that are easy to confuse. For example, compare Unicode chars 5E02 and 5DFF. Those really are different characters, with different pronunciations and meanings. They even have different stroke counts.
But even with a 24x46 char size, there's a limit to the number of distinct glyphs you can draw (and there are more recognized Chinese characters than that ;-). -
Re:IIS's fault
"Full width" vs. "Half width" (or, as I prefer, "half-wit") characters exist for typographical convenience in rendering Japanese characters. (Take a look at the Unicode spec, section 10.3 for example http://www.unicode.org/book/ch10.pdf/). This does not, however, explain why certain symbols that are already defined in other parts of the Unicode standard, such as the less-than symbol (or left angle bracket) are duplicated there. I suspect that it has something to do with possible confusions that might arise when parsing or transcoding mixed double-byte and single-byte characters...but that's just a guess.
In any case, the effect of this is that there are 2 ways of producing the < glyph: you can use character code x3C or xFF1C. However, your experiments have shown that browsers do not treat these two codes as being the same character...even though they look the same. I'm not sure if that's right or wrong, if there is a right and wrong way to handle this issue (I suppose that means it's excellent grounds for a religious war)--it's just important that it be handled consistently. From what you found, IE and FF are consistent with each other, while IIS handles the two codes as identical characters. I would think that IIS would at least be on the same page with IE...but wait, that's MS we're talking about.
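The two behaviours described here (treating the codes as distinct versus folding them together) can be reproduced with Python's `unicodedata` module. This is a sketch, not a description of what any browser or IIS does internally; the helper name `fold_width` is hypothetical, and compatibility normalization (NFKC) is assumed as the folding step.

```python
import unicodedata

ASCII_LT = "\u003c"      # LESS-THAN SIGN
FULLWIDTH_LT = "\uff1c"  # FULLWIDTH LESS-THAN SIGN


def fold_width(text):
    """Map fullwidth/halfwidth compatibility forms to their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


# As raw code points the two characters are different.
print(ASCII_LT == FULLWIDTH_LT)              # False
# After compatibility normalization they compare equal.
print(fold_width(FULLWIDTH_LT) == ASCII_LT)  # True
```

A filter that inspects input before such a folding step, while a later component inspects it after, will disagree about whether a "<" is present.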
-
Re:Not a surprise...
http://www.unicode.org/versions/
Any time a standard has been changed, you will have some outdated, but perfectly correct software. Hence, two pieces of software may not agree on the meaning of a Unicode string even without a software error. -
Re:Not a surprise...
unicode allows more than one representation for some characters
Unicode states how normalization should occur: http://www.unicode.org/unicode/reports/tr15/. Is there some problem with this, or what are you referring to?
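As a small illustration of the point under discussion, the same visible letter can be stored as one code point or as a base letter plus a combining mark, and normalization makes the two compare equal. The helper name `same_text` is hypothetical; the sketch uses Python's standard `unicodedata` module.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(precomposed == decomposed)           # False
print(same_text(precomposed, decomposed))  # True
```

Software that compares the raw strings and software that normalizes first will reach different answers without either one having a bug in the usual sense.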
-
Re:Big Trouble in Little China. Don't use UCS-2.
Precomposed forms only exist for lossless round-trips to/from legacy character sets. NFC is frozen, so there aren't going to be precomposed forms for any new diacritical marks. You can use NFD or NFKD to remove everything precomposed from your data. They recommend NFC (everything precomposed where possible) for the Web, though.
What you were proposing is a dictionary compression scheme to fit each combining character sequence into a single array slot, but that doesn't even buy you much. You still can't sort or display strings by splitting them arbitrarily or processing one combining character sequence at a time because of ligatures, digraphs (e.g., "ch" sorts after "h" in a Slovak locale), bidi, and weird stuff like soft hyphen and combining grapheme joiner. The library routines need to see the whole string at once to give the right answers in context. -
Re:Picture
Here you go (posted as is since /. will strip it out otherwise): ■ (PDF Warning)
-
Re:Improved multi-byte support?
No, they're not. UCS-2 and UTF-32 are fixed width encodings, not multi-byte.
UCS-2 was a bad example, as it has been phased out in favor of UTF-16.
The technical introduction to Unicode states "The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit)."
You'll notice that only the first is listed as byte? That's because a word as they have defined it is two bytes long. Two bytes is, of course, more than one byte, thus the term "multi-byte." The UTF-8, UTF-16, UTF-32 & BOM FAQ has a nice table with the minimum and maximum bytes/character that each encoding takes.
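The bytes-per-character ranges that the FAQ table summarizes can be checked directly. A minimal sketch in Python follows; the function name `bytes_per_char` is hypothetical, and the big-endian codec variants are used only so that no byte order mark is counted.

```python
def bytes_per_char(ch):
    """Return how many bytes one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


# ASCII letter, accented Latin letter, CJK ideograph, supplementary-plane symbol
for ch in ("A", "\u00e9", "\u5f35", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", bytes_per_char(ch))
```

UTF-8 ranges from 1 to 4 bytes, UTF-16 takes 2 or 4, and UTF-32 is always 4, so only UTF-32 is fixed width over the whole code space.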
(For reference, the Unicode standard refers to the full size of a character as a "code unit" or "code value," rather than a byte.)
And if UTF-8 is not eventually supported natively by Ruby, then the Rails implementation will still be needed. The rest of the internet is not going to drop UTF-8 just because Ruby does not support it.
This slide, from a presentation given by Ruby's author, Yukihiro "Matz" Matsumoto, indicates upcoming support for UTF-8. -
Easy to fix.
Just introduce a restriction according to which a valid URL can only contain symbols from one alphabet. I believe it's not too hard to determine (http://www.unicode.org/charts/) which character set a given code point belongs to, and whether the URL uses more than one.
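A rough sketch of such a one-alphabet check, in Python. The standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of each letter's character name ("LATIN", "CYRILLIC", ...). That shortcut is an assumption made for illustration, and the function names `scripts_used` and `is_mixed_script` are hypothetical.

```python
import unicodedata


def scripts_used(label):
    """Guess the scripts in a label from the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # digits, hyphens and dots are shared by all scripts
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label):
    """True when the letters of the label come from more than one script."""
    return len(scripts_used(label)) > 1


print(is_mixed_script("paypal"))       # False
print(is_mixed_script("p\u0430ypal"))  # True: the second letter is Cyrillic
```

A check like this still accepts a name written entirely in one non-Latin script, so it flags mixing without treating ASCII as the only acceptable form.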
-
4 things
1. A fix for this javascript DoS attack:
for(;;) alert("Please restart your browser.");
2. Make hotkeys work everywhere, all the time. (You know when you hit CTRL+L and nothing happens)
3. Make it possible to open javascript links in new tabs.
4. Support for soft hyphens. -
Re:colon in Mac OS X file names
In Terminal.app
...or in a X11-based app (i.e., in anything that uses standard UN*X calls to operate on files and doesn't use Apple file dialogs)...
you can create file names with colon, but such character is mapped to a forward slash when seen in Finder.
...or in standard Apple dialogs.
Historically, Mac OSes use colon to separate folder names in a path.
...which is why the Carbon layer does colon slash mapping for file/path names passed to UN*X calls or file names and file/path names returned by UN*X calls.
There is a subtle restriction in HFS+. All files in HFS+ have their names in normalized unicode
Normalization Form D, to be precise - unlike Normalization Form C, which Windows and most other UN*Xes use. This can cause some additional problems.
-
colon in Mac OS X file names
OS X supports up to 255 characters and can use the same characters as Linux, except for a colon (:).
In Terminal.app, you can create file names with colon, but such character is mapped to a forward slash when seen in Finder. On the other hand, you can use forward slash in Finder, and it is mapped to a colon in the command line.
Historically, Mac OSes use colon to separate folder names in a path.
There is a subtle restriction in HFS+. All files in HFS+ have their names in normalized unicode, and in order to normalize in the first place, file names must be in valid UTF-8 encoding. You cannot use random character string for file names.
There is no such restriction for UFS on Mac OS X. I think UFS supports roughly the same characters as in BSD and Linux and any other Unices. If you're transferring files from Linux with names in a legacy encoding, you can create a UFS disk image and convert file names to UTF-8 before copying them to HFS+.
-
Re:Unforseen problems
Actually, no. I do not believe incorrectly. Chang is indeed a surname. I am quite familiar with the order used in Chinese names; please check your facts before you attempt to correct someone else.
So, when Luo Guanzhong (U+7F85 U+8CAB U+4E2D) was writing The Romance of the Three Kingdoms, he was mistaken in writing Zhang Fei's ( U+5F35 U+98DB) name? And Zhang Fei misnamed his own son Zhang Bao (U+5F35 U+5BF6)?
Note that in the above names, Chang/Zhang (U+5F35) is the first character in the name; so you can't blame this on some western-influenced re-ordering.
(Curse slashdot for not letting me use Unicode characters in comments) -
Japanese "aoi" and "midori".
It's possible that the reason Japanese has a word that means blue-green is by association with the Chinese word that also means blue-green. The kanji symbol that means blue-green has "aoi" as its "kun" (native Japanese) reading. The "on" readings (ancient Chinese-derived pronunciations) are sei and sho'. Did "aoi" exist before it was associated with the kanji, or was it invented afterward, giving rise to a Japanese word for a Chinese-derived concept? Which came first, midori or aoi?
The "midori" kanji also has Chinese-derived "roku" and "ryoku" readings which are used in some compounds, so that "light green" can be read as "asamidori" (kun) or "senryoku" (on)!
-
Re:How Lucky You Are To Get Mail In English
http://www.unicode.org/reports/tr29/#Word_Boundaries
Its ability to pick out words in Japanese is pretty primitive (relies on the transition from katakana to kanji/hiragana and the transition from hiragana to kanji), but (the spec) works relatively well.
You need heuristics for Chinese though. -
No need
we're still using clunky notation like '<>', '^=', or 'NE' to represent inequality
Unicode contains characters such as U+2260 (NOT EQUAL TO). Unicode has certainly caught on; all HTML documents use that character set, for instance. So why the need for a special character set?
Perhaps you are asking why people don't choose to use such characters - I guess it's just ignorance. After all, if somebody who has gone to the trouble of submitting an Ask Slashdot doesn't know about these characters, why would the average person? Take a look at the code charts sometime.
-
Re:Textism
The demoronizer is b0rk3n. See http://www.unicode.org/faq/unicode_web.html#2
-
"Buy" & "Sell" in Chinese are confusingly simi
I wonder if the user interface was in Chinese. In Chinese, the characters for buy and sell look similar. Yes, a Chinese person could easily tell them apart, but I can understand that after many hours of staring at a computer screen, one's eyes or brain might get tired and slip up.
To add to the confusion, their pronunciation differs only in tone! When I was younger, I once asked my dad to sell my HP stock. He misheard me and bought some instead. I'm sure that this is not the first such misunderstanding that has occurred in Chinese.
-
Re:Rouge?
-
Re:That's False
Or they could just use the Unicode facilities for doing just that, as described in the Unicode Standard Annex #15 - Unicode Normalization Forms... I think it's a good question why the IDN committee didn't do that in the first place. Or why registries allow registrations for domains that are approximately equal to already existing ones.
-
Re:How about selective INT Domain Filtering?
To my knowledge, there is only one way to encode the latin letters in UTF-8. They don't have any redundant code positions in Unicode, do they?
They don't, but they do have multiple code points that are commonly rendered to the same glyph (yet have different collation behavior, etc.). In these example exploits, the Cyrillic "o" (о = U+043E [*]) is used in place of the Latin "o". It looks identical, but it's a different domain.
[*] - It's in this Unicode code chart.
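To make the distinction concrete, the two letters can be inspected with Python's `unicodedata` module. This is just an illustrative sketch; the helper name `describe` and the sample domain are made up for the example.

```python
import unicodedata

LATIN_O = "\u006f"
CYRILLIC_O = "\u043e"


def describe(ch):
    """Return the code point and official character name, e.g. for logging."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(LATIN_O))     # U+006F LATIN SMALL LETTER O
print(describe(CYRILLIC_O))  # U+043E CYRILLIC SMALL LETTER O

# Two domain labels that render alike but are different strings.
real = "google"
spoof = "g" + CYRILLIC_O * 2 + "gle"
print(real == spoof)         # False

# The Cyrillic letter has no decomposition, so normalization leaves it as it is.
print(unicodedata.normalize("NFKC", CYRILLIC_O) == CYRILLIC_O)  # True
```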
-
Fix from Unicode
The Unicode consortium has a paper on this and a suggested fix:
http://www.unicode.org/reports/tr36/tr36-1.html
The fix is to "process domain names to convert compatibility-equivalent characters into a unique form;". Opera 8 beta already does this. -
Unicode has already fixed this problem
There is already a fix for this IDN problem in the unicode spec, if people would just use it:
Before resolving, all domain names should be normalized according to normalization form KC. (see http://www.unicode.org/unicode/reports/tr15/) Once that's done, anything that looks like an "a" really will be an "a", and not something that looks identical in Cyrillic.
That simple (SIMPLE!) step would avoid this problem, almost completely. There'd still be an issue with people using "paypál" instead of "paypal", but at least then the user has some vague chance of seeing the difference in the URL in the browser window.
It would also be good if responsible registrars refused to accept domain registrations for domains not normalized according to NFKC, but asking companies to refuse business simply because someone else would get hurt is probably not going to be effective. -
Re:Known for years....
The problem has already been solved in Unicode: there are well-defined normalization techniques which ensure that strings that look the same end up being the same.
The fault lies squarely at the feet of the IDN committee for not including normalization in the standard. -
Flag mixing character groups.
It's intentional that there are multiple glyphs that look the same, but represent different characters in Unicode. (for sorting order, spell checking, etc.)
So you just need to work off of that strength, and flag when someone's mixed any two groups of characters. (I'm not sure what the official Unicode name is for them ... the different sets assigned to each language or function).
Anyway, you start with the assumption that a domain name is going to contain only characters from one of those groups, and you report if it's otherwise. Now, there are still problems with people not looking closely, and confusing 'resume.com' with 'résumé.com' or something similar, but you'll fix the problems with identical glyphs.
The important thing to do is to not assume that ASCII is the only 'good' form, as that would make it rather English-centric (I'm not sure what other languages can map all of their characters into ASCII) -
Re:And again, OSS has the solution
Yeah, don't we live in an age of Unicode?!
$.75 Confusing! $0.75 A bit better, at least proper numerical representation 75 Ahh... Sweet cent symbol. What, you mean Slashdot does not support UTF-8 or HTML Entities (numeric/named)?? -
Winamp IS dead ...
for me. Once I tried foobar2000 there was no going back.
Features
* Open component architecture allowing third-party developers to extend functionality of the player
* Audio formats supported "out-of-the-box": WAV, AIFF, VOC, AU, SND, Ogg Vorbis, MPC, MP2, MP3, MPEG-4 AAC
* Audio formats supported through official addons: FLAC, OggFLAC, Monkey's Audio, WavPack, Speex, CDDA, TFMX, SPC, various MOD types; extraction on-the-fly from RAR, 7-ZIP & ZIP archives
* Full Unicode support on Windows NT
* ReplayGain support
* Low memory footprint, efficient handling of really large playlists
* Advanced file info processing capabilities (generic file info box and masstagger)
* Highly customizable playlist display
* Customizable keyboard shortcuts
* Most of standard components are opensourced under BSD license (source included with the SDK)
If you've ever tried writing a plugin for Winamp you'll fall in love with the fb2k SDK, it's like heaven compared to the other player. ;-) -
Bug (was Re:GCJ slower than a native JVM?)
A character is not a byte. Don't use FileReader unless you're absolutely sure that either:
- The default character encoding resulting from your particular combination of JVM and platform will be correct and non-lossy every time the program is run (e.g. not in Windows, which defaults to ISO8859-1), or;
- You're certain that the file contains only 7-bit ASCII.
If you want to read characters from a file (or socket) you need to come up with some way to agree on the character encoding and specify it precisely. Not even HTTP does a good job of this--you don't know the character encoding of a request or response until the Content-Type header has been transferred, and often not even then.
What's the character encoding for URLs and domain names? Convention seems to be settling on UTF-8 but AFAIK it's just that.
The equivalent technique that's less risky (but of course much more verbose) is:
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("foo"), "UTF-8"));
String line;
while ( (line = r.readLine()) != null ) {
    // etc...
}
Where "UTF-8" is a sane default non-lossy character encoding. If you don't know the encoding that was used to write the file you're about to read, you're sort of screwed. You can use heuristics to try to detect its encoding, or if you're "lucky" you might find a Unicode Byte Order Mark.
Note that none of this headache is particular to Java, it's just that the designers of Java knew early on that a character is not a byte and formalized that distinction (poorly at first) in the language and libraries.
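The Byte Order Mark case mentioned above is the one heuristic that is easy to sketch. Here is a Python version (the headache is the same in any language); the names `sniff_bom` and `decode_with_bom` are hypothetical, and falling back to UTF-8 when no BOM is found is an assumption, not a rule.

```python
import codecs

# Longer marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback="utf-8"):
    """Decode bytes using the BOM if there is one, else the fallback encoding."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Without a BOM the function is only guessing, which is exactly the situation the comment warns about.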
-
Re:why?
I do, in fact, research my posts before I hit "Submit". I'm fully aware of just how big the Supplementary Multilingual Plane is getting in the newest specs -- hell, the undeciphered Indus valley script made it in there, albeit only as a preliminary proposal so far. However, there are still a handful of languages out there that aren't represented, including a few obscure but living languages (unfortunately I can't recall any of them, but we're talking <100 speakers left).
-
Re:why?
You should look at the latest Unicode specs which now allow the encoding of just about every kind of character ever conceived of by humanity (though some are not yet allocated code-points) instead of speculating wildly.
-
Re:Extend the character set?
Well, actually the Dingbats are part of Unicode already...
-
Re: I tought...
Or maybe BEI4 YAO3 WU2 FU2 [4ff6 302d 4ad3 3695], which (I think) means 'completely unfathomable vast big-head'.
(however, I don't speak Mandarin, this is just from looking up syllables in the Unihan Database)
-
Octothorpe and Sharp
The US keyboard has # above the 3, where the pound sign should be. That is why they often mistakenly call # 'pound' instead of octothorpe (official designation) or hash (common colloquial term) or sharp (Microsoftism).
Actually, this is hardly a Microsoftism, though Microsoft makes total fools out of themselves writing "C#" and saying "C sharp." The sharp sign is used in music (as in Waltz No. 7 in C sharp minor, Op. 64 No. 2 by Frederic Chopin) where C sharp (or Cis) means a tone between C and D (the same as D flat, or Des) and is a totally different glyph than the octothorpe. The octothorpe is '#', 0x23 in ASCII and Unicode, and it has two horizontal and two diagonal lines, while the sharp sign is U+266F in Unicode and has two vertical and two diagonal lines. There is no sharp sign in ASCII. See the U1D100 Unicode chart, page 3, section Accidentals with music flat sign, music natural sign and music sharp sign. Summary: Microsoft hasn't invented "sharp." They are still fools nonetheless.
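The two characters can be told apart by their official names rather than by eye. A small Python sketch follows, assuming the sign meant here is the one Unicode names MUSIC SHARP SIGN; the helper name `code_point_of` is hypothetical.

```python
import unicodedata


def code_point_of(name):
    """Look a character up by its Unicode name and format its code point."""
    return f"U+{ord(unicodedata.lookup(name)):04X}"


print(unicodedata.name("#"))              # NUMBER SIGN
print(code_point_of("NUMBER SIGN"))       # U+0023
print(code_point_of("MUSIC SHARP SIGN"))  # U+266F
```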
-
Re:I call fake blog
He's real. Just google for him. Here's an old biography at a non-microsoft site.
-
Re:A first step, but Unicode support is incomplete
Funny you should mention that. What you said is both true and completely false at the same time.
Allow me to quote RFC 3629:
ISO/IEC 10646 [ISO.10646] defines a large character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. The same set of characters is defined by the Unicode standard [UNICODE], which further defines additional character properties...
So, yes, the U means universal, but it refers to the same character set as Unicode.
And, for your reference, here's the link to RFC 3629, and the link to the Unicode web site.
-
Re:Google incompatible
I call my new operating system \u7985. It is precisely one character, but it says a lot!
(Also, hats off to anyone who can convince slashdot.org to enable HTML entities in such a way that page-widening posts still don't work)