Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
Re: Too bad slashdot used to cause these
The vertical tick used as an apostrophe was a temporary measure put in place to simplify keyboards and to simplify the character set when every bit and byte was counted. Even the Unicode consortium recommend that a curly apostrophe be used for printed materials.
http://www.unicode.org/version...
Encoding Characters with Multiple Semantic Values. Some of the punctuation characters in the ASCII range (U+0020..U+007F) have multiple uses, either through ambiguity in the original standards or through accumulated reinterpretations of a limited code set. For example, 27₁₆ is defined in ANSI X3.4 as apostrophe (closing single quotation mark; acute accent), and 2D₁₆ is defined as hyphen-minus. In general, the Unicode Standard provides the same interpretation for the equivalent code points, without adding to or subtracting from their semantics. The Unicode Standard supplies unambiguous codes elsewhere for the most useful particular interpretations of these ASCII values; the corresponding unambiguous characters are cross-referenced in the character names list for this block.
Apostrophes
U+0027 apostrophe is the most commonly used character for apostrophe. For historical reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modifier or acute accent). Punctuation marks generally break words; modifier letters generally are considered part of a word.
When text is set, U+2019 right single quotation mark is preferred as apostrophe, but only U+0027 is present on most keyboards. Software commonly offers a facility for automatically converting the U+0027 apostrophe to a contextually selected curly quotation glyph. In these systems, a U+0027 in the data stream is always represented as a straight vertical line and can never represent a curly apostrophe or a right quotation mark.
Punctuation Apostrophe. U+2019 right single quotation mark is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.” In this latter case, U+2019 is also referred to as a punctuation apostrophe.
As you said, language evolves, and we've reached the stage where the systems we use have outgrown the constraints that forced a single character to serve as apostrophe, right single quotation mark, prime and acute accent. We can now use the correct character instead of overloading a single ASCII code point.
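The automatic conversion described in the quoted passage can be sketched roughly as follows. This is a minimal illustration in Python, not how any particular word processor does it: the function name `curl_apostrophes` is made up for the example, and it only handles the contraction case (a U+0027 sitting between two word characters), leaving opening and closing quotes alone.

```python
import re

RSQUO = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the preferred punctuation apostrophe


def curl_apostrophes(text: str) -> str:
    """Replace U+0027 with U+2019 when it sits between two word characters."""
    # Lookbehind/lookahead keep the neighbouring characters out of the match,
    # so only the apostrophe itself is substituted.
    return re.sub(r"(?<=\w)'(?=\w)", RSQUO, text)


sample = "We've been here before, and it's fine."
converted = curl_apostrophes(sample)
print(converted)
```

Real smart-quote features need more context than this (leading apostrophes as in 'tis, nested quotes, primes), which is part of why U+0027 stays ambiguous in stored text.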
Most people, in the software they use on a daily basis, will end up using the correct unicode character without even knowing it as commonly used software will automatically and by default substitute curly quotes in place of straight quotes. Of course text editors used for programming where semantics are critical will not perform substitutions like this but they're not the most common use case - general purpose word processing is far more common.
-
Re: Too bad slashdot used to cause these
The apostrophe has been around a lot longer than computer and typewriter keyboards. The character called an apostrophe by ASCII is named that for (recent) historical reasons and it is not a typographically correct apostrophe. The Unicode consortium recommend using U+2019, the Right Single Quotation Mark, as an apostrophe; however, U+0027 is the character that exists on most keyboards.
From: http://www.unicode.org/version...
Apostrophes
U+0027 apostrophe is the most commonly used character for apostrophe. For historical reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modifier or acute accent). Punctuation marks generally break words; modifier letters generally are considered part of a word.
When text is set, U+2019 right single quotation mark is preferred as apostrophe, but only U+0027 is present on most keyboards. Software commonly offers a facility for automatically converting the U+0027 apostrophe to a contextually selected curly quotation glyph. In these systems, a U+0027 in the data stream is always represented as a straight vertical line and can never represent a curly apostrophe or a right quotation mark.
-
Re: A UTF8 processing failure?
A lot of embedded systems will behave strangely if you feed them a lot of characters like this
https://en.wiktionary.org/wiki...
http://www.unicode.org/cgi-bin...
That character is four bytes in UTF-8, which kills systems that assume a maximum of three - which used to be true for Chinese and Japanese, but isn't now.
It's also two UTF-16 code units (a surrogate pair), which will mess up systems that assume each character is a single code unit.
Now you'll say "Those systems are all buggy". That's true now, but it wasn't true when a lot of them were designed - Unicode used to be limited to 64K characters, which meant UCS-2 was a fixed-width encoding and three bytes was the most UTF-8 needed for any assigned character.
When it grew those ceased to be true. Which is fine for systems that are maintained - the vendor would find bugs created by the standard change and push an update. Unfortunately a lot of systems - particularly embedded ones - aren't like that. Hell, Android isn't like that. Google push updates out to vendors but if your machine is EOL you're SOL.
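To make the byte counts concrete, here is a small Python sketch. The character linked above is elided, so U+20BB7 is only a stand-in picked for illustration (an assumption); any code point above U+FFFF should behave the same way.

```python
# Stand-in character (assumption): U+20BB7, a CJK ideograph outside the BMP.
ch = "\U00020BB7"

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

utf8_len = len(utf8)           # number of bytes in UTF-8
utf16_units = len(utf16) // 2  # number of 16-bit code units in UTF-16

# Split the UTF-16 bytes into 16-bit code units to show the surrogate pair.
units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]

print(f"U+{ord(ch):05X}: {utf8_len} UTF-8 bytes, {utf16_units} UTF-16 code units")
print("UTF-16 code units:", " ".join(units))
```

A buffer sized for three bytes per character, or code that indexes UTF-16 by code unit and calls the result a character, is exactly what breaks on input like this.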
-
Re:Raise your child properly
Unicode presents an unnecessary security risk for the sake of emoji. We don't need that crap here...
-
Re:Text-only Email safe?
Heh... What about unicode ?
-
Re: Glad I opted out of...
HFS+ is shit and is dangerous. It's based on very old standards and is a total mess under the hood, not so different than NTFS.
https://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html
And just because St. Linus spews out garbage, you lap it up like the good Apple-Hater you are:
But here's da facts, Jack. Read 'em and weep:
https://slashdot.org/comments....
APFS also has huge Unicode issues:
https://eclecticlight.co/2017/04/06/apfs-is-currently-unusable-with-most-non-english-languages/
Bullshit. APFS supports Unicode 9.0. PLENTY of multilanguage support!
https://developer.apple.com/li...
http://unicode.org/versions/Un...
Further, APFS is still very new. Apple is a multinational company. Do you REALLY think they won't be ironed-out sooner, rather than later?
Btrfs is still in development and has quite a while to go. Filesystems are very difficult and are something you cannot fuck up on! You need years of testing and verifiability before you push a new fs to market.
And yet, Synology, to name a company with a LOT to lose by embracing a new filesystem, has gone all-in on btrfs on their new OS. Are THEY being foolhardy? Why not whine about THEM? They migrated from ext4 to btrfs virtually overnight!
I hope Apple at least fixed all the Unicode bugs in this APFS release. I think I'll stick with ext4.
Of course you will, you good little Linux fanboi...
-
Re:Well, i don't know...
Android, iOS and even Windows 10 supports it, but Slashdot doesn't.
Yes, and that is a feature. There is no need to take unnecessary risks.
-
Re:No unicode on Slashdot.
A pox on Unicode!
TFTFY.
-
Re:No unicode on Slashdot.
A pox on unicode!
-
Re:there's simply no foolproof way to kil
Yes, unicode is a horrible idea! Even extended ASCII is pretty scary. So don't be a slob. Clean up your posts! (Especially you so-called 'editors')
-
Re:the could develop it at least a little further
Unicode has combining characters - look up "zero width joiner" or "combining accent". Something as simple as an é has multiple renditions in Unicode: 'LATIN SMALL LETTER E WITH ACUTE' (U+00E9), and 'LATIN SMALL LETTER E' (U+0065) followed by 'COMBINING ACUTE ACCENT' (U+0301). Both display as the same single letter, but unless an explicit Unicode normalization step has been taken, they are different code-point sequences.
Note that these are issues at the Unicode 'layer'; it doesn't matter whether the encoding is UCS-2, UTF-16, UCS-4, UTF-8 or UTF-32. They are different code point sequences for the same 'character'.
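The é case is easy to check with Python's standard `unicodedata` module. This is just a sketch of the point being made; the variable names are made up for the example.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Same letter on screen, but different code-point sequences.
raw_equal = precomposed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print("raw comparison:", raw_equal)
print("after NFC:     ", nfc_equal)
print("lengths:", len(precomposed), len(decomposed))
```

The comparison only works because both sides were normalized to the same form first; which form (NFC or NFD) matters less than being consistent.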
Don't think 32 bit encoding fixes these issues, sadly.
Heck, there are emoji characters that are a sequence of many code points. The sequence U+1F469 U+1F3FD U+200D U+1F393 (FOUR Unicode code-points) is a single emoji, 'woman student: medium skin tone'. How many characters is that? The literal 'translation' is "woman", "medium skin tone", "zero-width joiner", "graduation cap".
In fact, that depends on your implementation. It might recognise the sequence as 'woman student: medium skin tone' and render that as a single glyph, but it might equally go "I don't understand the zero-width joiner in there" and render this as 'woman: medium skin tone' 'graduation cap', two separate glyphs - a perfectly acceptable fallback for this code-point sequence. In a 32-bit encoding, that's 16 bytes for a single character. In big-endian UTF-16, that's D83D DC69 D83C DFFD 200D D83C DF93, 14 bytes.
So... is that a string of length one? length two on fallback? Four code points? Three code points that aren't zero-width? Even in 32 bit, you cannot meaningfully tell the length of a string without specialised decoding. And you cannot arbitrarily 'chop' the string on 32 bit boundaries either - the second code point literally means 'render the previous code point with this skin tone' -- it is meaningless on its own, and the first code point on its own is a different (albeit generic) colour!
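The counts for that four-code-point sequence can be checked with a few lines of Python. This is only a sketch: Python's `len()` counts code points, and telling that the whole thing is one user-perceived character would need a grapheme-cluster library, which is not shown here.

```python
# woman + medium skin tone + zero-width joiner + graduation cap
seq = "\U0001F469\U0001F3FD\u200D\U0001F393"

code_points = len(seq)                      # Python counts code points
utf32_bytes = len(seq.encode("utf-32-be"))  # 4 bytes per code point
utf16 = seq.encode("utf-16-be")
utf16_bytes = len(utf16)
utf8_bytes = len(seq.encode("utf-8"))

# Split the UTF-16 bytes into 16-bit code units.
units = [utf16[i:i + 2].hex().upper() for i in range(0, utf16_bytes, 2)]

print("code points:", code_points)
print("UTF-32 bytes:", utf32_bytes)
print("UTF-16 bytes:", utf16_bytes, "->", " ".join(units))
print("UTF-8 bytes:", utf8_bytes)
```

None of these numbers is the "length" a user would expect, which is the point: the encoding changes the byte count, not the answer to "how many characters".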
The only thing the encoding affects is 'bytes per code-point', but the number of code-points is irrelevant to virtually everyone - it's not the same as the 'length of the string'.
And many people consider UTF-16 to be wasteful on bytes. After all, ASCII is good enough for a lot of uses.
Summary: Nope, there's no magical fixed-width Unicode encoding either. Unicode *itself* is variable length.
-
Re:Don't single out Facebook
Regular ASCII is good enough for everybody. Don't rock the boat. Unicode is a virus
-
Re:"Now available to download" link
1. In the emoji font there's "Raised Hand With Part Between Middle And Ring Fingers" - WhyTF is that not called "live long and prosper"? Some emoji are named for how they look while others are named for what they mean. A bit inconsistent but I guess that's more of a Unicode consortium issue.
2. Some of the hand emoji like "White Left Pointing Backhand Index" are all called "white..." even though they've clearly done the race/skin tone colour spectrum ala whatsapp.
2b. The colours are a second unicode code point (emoji modifier sequence) following the emoji, ranging from U+1F3FB (white/pale) to U+1F3FF (black/dark). (Btw, that's counter intuitive to programmers since RGB colour codes have "#00" being dark and "#FF" being light.) P.S. I haven't decided if the skin colour aspect of emoji is racist or not. There may be some people who found the default yellow emoji racist.
Names of symbols such as BLACK MEDIUM SQUARE or WHITE MEDIUM SQUARE are not meant to indicate that the corresponding character must be presented in black or white, respectively; rather, the use of “black” and “white” in the names is generally just to contrast filled versus outline shapes, or a darker color fill versus a lighter color fill. Similarly, in other symbols such as the hands U+261A BLACK LEFT POINTING INDEX and U+261C WHITE LEFT POINTING INDEX, the words “white” and “black” also refer to outlined versus filled, and do not indicate skin color.
and
General-purpose emoji for people and body parts should also not be given overly specific images: the general recommendation is to be as neutral as possible regarding race, ethnicity, and gender. Thus for the character U+1F477 CONSTRUCTION WORKER, the recommendation is to use a neutral graphic like [image] (with an orange skin tone) instead of an overly specific image like [image] (with a light skin tone). This includes the emoji modifier base characters listed in Sample Emoji Modifier Bases. The emoji modifiers allow for variations in skin tone to be expressed.
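The names involved can be listed with Python's `unicodedata` module, which may make the "white means outlined" versus "modifier means skin tone" distinction easier to see. A minimal sketch, assuming a Python build whose Unicode database includes the emoji modifiers (Unicode 8.0 or later):

```python
import unicodedata

# The five emoji modifiers, U+1F3FB through U+1F3FF.
modifiers = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]
names = [unicodedata.name(m) for m in modifiers]

for m, n in zip(modifiers, names):
    print(f"U+{ord(m):04X}  {n}")

# The older hand symbol whose "WHITE" means outlined, not skin colour.
hand_name = unicodedata.name("\u261C")
print(f"U+261C  {hand_name}")

# A modifier sequence is simply base character followed by modifier.
backhand_with_tone = "\U0001F448" + modifiers[2]
print(len(backhand_with_tone), "code points in the modifier sequence")
```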
-
Standards and Definition
Unicode defines the glyph (U+1F52B in the "Miscellaneous Symbols and Pictographs" block) as "pistol" with the keywords "gun, handgun, pistol, revolver, tool, weapon." This is unambiguously not a symbol for a toy. Apple is in the wrong here for not adhering to standards, but being wrong and not adhering to standards has never stopped them before, so this shouldn't really be a surprise.
-
A really bad idea
This is a really bad idea, if you consider what emoji are.
Emoji were never meant as a way to send tiny graphics; the idea is to have a standard codepoint for each smiley. Everyone can implement them and use their own icon set for display, whether to build a strong brand, to let users theme them, or simply because they have no rights to other vendors' graphics.
There were already problems. Take the pistol icon, which pointed in different directions in different sets. I do not know which sets those were, but let's assume it's Apple vs. Google.
Now they take it a step further: Apple replaces the pistol with a water gun, while Google has not done so yet.
Here is a smiley example:
(smiley)(pistol)(female smiley)
Apple interpretation: I shoot water at my girlfriend (and we have fun)
Google interpretation: I shoot myself in the head with a pistol while she's there (maybe because we have trouble)
Considering that smileys were meant to convey emotions words cannot, it's not only silly to have smileys for objects, but even more stupid to use non-matching icon sets. Remember the hairy heart trouble?
https://www.engadget.com/2014/...
Even the Unicode consortium recommends finding better long-term solutions:
http://www.unicode.org/reports...
> The longer-term goal for implementations should be to support embedded graphics, in addition to the emoji characters. Embedded graphics allow arbitrary emoji symbols, and are not dependent on additional Unicode encoding. Some examples of this are found in Skype and LINE—see the emoji press page for more examples.
-
Re:Change history Commrade? Da or Nyet?
Why do that when you can demand that Unicode version N+1 include new codepoints for each of those; or even go all the way and define some new "emoji modifiers" so that the user can specify water pistol + fluid it is filled with for hundreds of distinct combinations!
-
Re:I'm confused
I question that unicode is a security risk. In fact I deny it.
You shouldn't
You do need to take a few precautions, but not many, it's simple.
Or you could just stay away from it and not have to give it a second thought. We don't want pictographs spreading viruses around. The only safe sex is no sex. There is no need for unicode in a text forum. Leave it alone.
-
Of course they like working on emoji...
... it draws attention from, for example, this mess: http://www.unicode.org/reports...
Did you know that you cannot compare strings in Unicode?
Or well, I suppose there are 3 living people who understand that.
-
Don't Fear The Emoji.
This query to stackoverflow is four years old, but that doesn't really change things very much.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes x 65,536 characters per plane - 2048 surrogates - 66 noncharacters
109,384 code points are actually assigned in Unicode 6.0.
How many characters can be mapped with Unicode?
There is plenty of room for growth here.
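The arithmetic in the quoted answer is easy to reproduce. A quick Python sketch, using the well-known code point ranges (surrogates U+D800..U+DFFF; noncharacters U+FDD0..U+FDEF plus the last two code points of every plane):

```python
PLANES = 17
PER_PLANE = 0x10000  # 65,536 code points per plane

total = PLANES * PER_PLANE

# Surrogates: U+D800..U+DFFF, reserved for UTF-16.
surrogates = len(range(0xD800, 0xE000))

# Noncharacters: U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in each plane.
noncharacters = len(range(0xFDD0, 0xFDF0)) + 2 * PLANES

mappable = total - surrogates - noncharacters
assigned_in_6_0 = 109_384  # figure quoted in the answer above

print(f"{total:,} - {surrogates:,} - {noncharacters} = {mappable:,}")
print(f"share assigned as of Unicode 6.0: {assigned_in_6_0 / mappable:.1%}")
```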
Unicode 8 supports 120 scripts and 14 collections of other symbols of which Emoji is one and typographical decorations --- dingbats --- another. Once you admit that a Unicode graphic can be purposeful, decorative or both, the battle against the admission of Emoji is lost.
U 9.0 and Post 9.0 Emoji Candidates
Emoji is explicitly Asian in origin --- and that seems to be one of things ticking off the geek here --- but combining words and pictures in casual messaging to provide a touch of color or save some space is very old in the Western world, and doesn't really need a defense.
The geek who complains about this sort of thing tends to come across as humorless and prissy and a bit out of touch.
-
Re:Toldja so, you morons!
There are plenty of fonts available cross-platform (either because they are liberally licensed or because they are considered vital enough or cheap enough that more or less everyone throws them in), but fonts have never been Unicode's problem (their documentation does include examples drawn from one or more fonts, not sure who they use as a supplier and under what terms in order to provide examples; but those are explicitly noted to be non-normative and purely for the reader's convenience).
For the surprisingly messy business of font formats, we have other standards; and when transferring font/color/size and similar formatting information is important (as in any word processing application) the application's file format provides some mechanism for doing that.
We could theoretically blob all this together into one gigantic Ur-Standard; but creating hideously overlarge standards tends not to actually solve anything; it just moves you from a situation with a bunch of standards, only some of which are implemented on any given platform, to a single standard which is only partially implemented (and always different parts) on any platform; which is basically the same disaster, but with uglier documentation.
It probably doesn't help that, even if there were a canonical-and-MIT-licensed emoji font, UI look and feel is one of the things that all the mobile platform vendors have enthusiastically sought to differentiate themselves on. It's not at all clear that Apple, Google, or Microsoft would be particularly happy if their pet UI's chosen font were interspersed with lowest-common-denominator-identical-across-vendors emoji; given how much attention each has paid to carving out a distinct look. If those vendors actually wanted a cross-platform emoji font; it would have been pretty trivial for them to make it happen: ~1,700 characters wouldn't be something you could get a foundry to do for free; but it would hardly be a crushing burden for the three of them to pay somebody enough to hack together the necessary font and offer it under a free, nonexclusive, license to anyone who wished to include it. That would pretty much be the end of the story.
As it is, though, various vendors have gone their separate ways; and don't even seem particularly interested in trying to emulate one another as closely as copyright law allows (here is a list of the emoji codepoints and their representations, if they exist, on various platforms; they are not always even consistent within a given vendor's product lineup).
-
Re:Post to undo an accidental moderation
I just don't see the urgency. The system works. There are also some small security issues that maybe they don't want to deal with.
-
Re:So a national emergency gets declared and...
It's not unicode. Unicode is "unsafe"...
-
It is pretty crazy
Aren't many emoji combinations or modifications of other emoji? I seem to recall this was done (for among other reasons) to accommodate different skin colors and such?
This was the best I could find after a bunch of googling:
-
Re:Who proposed tem?
That's a different proposal and it's been declined.
-
Re:Betteridge's law of headlines says ... no
Please note that emoticons (which you demonstrate) and emoji (which are not just purely graphical representations of already existing emoticons) are different (relevant Unicode document).
we still mainly use the emoji native to the West.
This is simply wrong. The characters that Unicode classifies as being 'Emoji' are directly descended from the characters defined by the major Japanese telcos in the 2000s.
-
Re:Betteridge's law of headlines says ... no
This is especially interesting given the way the consortium addresses the issue of different symbols representing the same character in different parts of East Asia. From http://unicode.org/faq/han_cjk...:
Q: If the character shapes are different in different parts of East Asia, why were the characters unified?
A: The Unicode Standard is designed to encode characters, not glyphs. Even where there are substantial variations in the standard way of writing a character from locale to locale, if the fundamental identity of the character is not in question, then a single character is encoded in Unicode.
Characters, not glyphs. So emoji are characters, while various Asian writing styles are glyphs, I guess. And a couple lines further down in the same answer...
There are occasional instances of unified characters whose typical Chinese glyph and typical Japanese glyph are distinct enough that the Chinese glyph will be unfamiliar to the typical Japanese reader, e.g., U+76F4. To prevent legibility problems for Japanese readers, it is advisable to use a Japanese-style font when presenting Unihan text to Japanese readers.
So if you're Japanese and want to see Japanese characters, you're told to use a Japanese font. But, you'll never be forced to choose between a male and female dancing emoji, you deserve to have BOTH in your character set. Why are emoji more important to Unicode than the Japanese language?
-
Re:There was a point where Unicode needed to stop
Actually, it's in a proposal, right next to owls and bats...
-
They're trying to unify *similar* characters
A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)
The issue:
There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.
Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.
(Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)
A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.
An advantage of this approach is that support for new languages and dialects can be added without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.
Here are some example symbols:
https://en.wikipedia.org/wiki/...
unicode.org's FAQ also has clarifications:
If the character shapes are different in different parts of East Asia, why were the characters unified?
http://www.unicode.org/faq/han...
Isn't it true that some Japanese can't write their own names in Unicode?
http://www.unicode.org/faq/han...
(All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)
-
Re:Nitpick
How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.
Except font encoding has never been part of the character encoding, you might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher level encoding like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> and not plaintext Unicode. That's what the Unicode consortium says and if you express it as simply a style issue, it actually sounds plausible.
On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid, since the CJK styles are distinctly different and so any comprehensive font should have three variations; it shouldn't take three fonts to make a mixed CJK document look correct, just one. On that view this information belongs on the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have, otherwise everything will need a document structure and not just a string.
I don't think they should "unmerge" and duplicate all the han characters, that'd be silly. What they should do is add CJK indicators - say HANC, HANJ, HANK like for bi-directional text, only simpler with no nesting just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao and the former will render as a Japanese han, the latter as a Chinese. If it doesn't have any indicator, well take a guess. Am I missing something blindingly obvious or would this trivially solve the problem?
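To make the proposal concrete, here is a rough sketch of how a renderer might consume such indicators. Unicode defines no HANC/HANJ/HANK characters; the three Private Use Area code points and the `split_runs` helper below are assumptions invented purely for illustration.

```python
# Hypothetical indicators: NOT real Unicode characters. Private Use Area
# code points are borrowed here only to illustrate the idea.
HANC, HANJ, HANK = "\uE000", "\uE001", "\uE002"
INDICATORS = {HANC: "zh", HANJ: "ja", HANK: "ko"}


def split_runs(text, default=None):
    """Split text into (language, run) pairs.

    An indicator applies until superseded by the next one (no nesting).
    Text before any indicator gets `default`, i.e. "take a guess".
    """
    runs = []
    lang = default
    buf = []
    for ch in text:
        if ch in INDICATORS:
            if buf:  # close the run that was in progress
                runs.append((lang, "".join(buf)))
                buf = []
            lang = INDICATORS[ch]
        else:
            buf.append(ch)
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


print(split_runs(HANJ + "konnichiwa " + HANC + "ni hao"))
```

Each run could then be handed to whichever glyph variant matches its language tag; runs tagged with the default would still need a guess.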
-
Re:CJK is Unicode's big failing
But in reality it's actually causing problems since the same symbol is expected to look in one way for Chinese and slightly different for Japanese.
Well, a large number of people (including myself) believe it's the right thing to do. People like you lost that argument, that's why Unicode is the way it is. I'm simply explaining it, and I'm telling you that the justification isn't Western imperialism or American ignorance or whatever other cultural b.s. people like to attach to it.
By the way, there are plenty of examples where the same symbols have different code points intended for different contexts (Greek letters used for math etc). There are even Latin letters that look slightly different in different language contexts like U+0152 (filtered out by Slashdot), Ø and Ö (they all stem from a combination of O and E, Ö from the convention of writing the E above the O). Agreeing on one of the symbols for all affected languages would be logical and fully intelligible for everyone, but it would look wrong.
Yes, and Unicode CJK support does exactly the same thing that Latin script does for Latin alphabets: characters that look similar enough to be recognizable are shared, and characters that look significantly different and would be unintelligible get different codepoints. Since this is a much harder problem for CJK, they keep adding new codepoints.
Unicode used to have language contexts, as well as other contexts. But markup standards like HTML and XML simply ignored the Unicode facilities. Having two separate standards for marking up regions of texts, possibly conflicting, overlapping, and inconsistently, was a problem. And people weren't using the Unicode facilities. So they were deprecated, then dropped.
It's all a big mess. Not unicode specifically, but human writing in general.
No, most writing systems are pretty simple: they have a few hundred symbols that are usually arranged in linear ways. In fact, even CJK isn't all that different and could easily be encoded in a few hundred codepoints (here); it was mostly a policy decision not to do that.
-
Re:Unicode can go fuck themselves
You may or may not be pleased to know that this latest release of Unicode adds glyphs for volleyball and cricket.
Hah!
It's almost like the consortium just bought him off by careful planning or time travel. Bravo!
-
Re:Unicode can go fuck themselves
You may or may not be pleased to know that this latest release of Unicode adds glyphs for volleyball and cricket.
-
Re:Already = 65K characters
"...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"
There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
- 1. UTF-16 is *not* two bytes per character
- 2. Therefore a "character" in Java, C#, Javascript sometimes only holds half a Unicode character
- 3. Even a whole unicode character may be only part of a grapheme cluster, which means that taking arbitrary substrings may not result in readable text.
But wasn't UTF-16 supposed to cover all the practical languages (I'm not talking about Klingon or other languages created out of movies)? In which case, the 65k should have covered it. Why does Unicode need weirdass characters for playing cards or stuff of that nature? Just stick to their original roles - supporting the implementation of written & spoken languages in computers, and leave it at that.
-
Getting carried away?
I know bits are cheap, but...really? Font designers have to actually implement the characters - specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".
Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.
-
Already = 65K characters
"...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"
There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
- 1. UTF-16 is *not* two bytes per character
- 2. Therefore a "character" in Java, C#, Javascript sometimes only holds half a Unicode character
- 3. Even a whole unicode character may be only part of a grapheme cluster, which means that taking arbitrary substrings may not result in readable text.
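The three points above can be checked with a few lines of Python. This is only a sketch: Python strings count code points, so the UTF-16 code unit count (what Java, C# and JavaScript report as string length) is computed here by encoding. The helper name `utf16_code_units` is made up for the example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store `text` in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


astral = "\U0001F600"   # one code point above U+FFFF
accented = "e\u0301"    # "e" + COMBINING ACUTE ACCENT, one visible character

# Points 1 and 2: one code point, but two UTF-16 code units (a surrogate pair).
print(len(astral), utf16_code_units(astral))

# Point 3: two code points forming one grapheme cluster; a naive
# one-code-point substring silently drops the accent.
print(len(accented), repr(accented[:1]))
```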
-
Re:CJK is Unicode's big failing
There are certainly plenty of "repeat" characters in different contexts.
For example the math alphanumerics: http://unicode.org/charts/PDF/...
-
Re:Goddamnit
Every other forum on the internet has emoticons, why not slashdot? Sooo behind the times.
Just adding Unicode support would provide emoji as well.
-
Re:Lol
There's no law that says they can't pad the variable length input to fixed length
I'm not sure you quite understand the problem, it's not the input length, it is the encoding of each of the characters. So are you suggesting turning all single-byte encoded characters into multi-byte encoding of some arbitrary maximum length? If you can already identify the problem at this level then you would just do that in the parser that is truncating the string.
...and then make sure you're handling combining character sequences and bidirectional text correctly.
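As a rough illustration of the combining-sequence half of that, here is a sketch of a truncation that backs up rather than cutting between a base character and its marks. It assumes that "mark" means general category Mn, Mc or Me; that is much cruder than real grapheme cluster segmentation (it knows nothing about emoji joiner sequences, Hangul jamo, and so on) and it does nothing for bidirectional text. The function name and the `limit` parameter are invented for the example.

```python
import unicodedata


def truncate_keep_marks(text: str, limit: int, ellipsis: str = "\u2026") -> str:
    """Keep at most `limit` code points, never orphaning a combining mark.

    The ellipsis is appended on top of `limit` when truncation happens.
    """
    if len(text) <= limit:
        return text
    cut = limit
    # If the first dropped code point is a mark, its base would be kept
    # without it; back up until the cut lands in front of a base character.
    while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
        cut -= 1
    return text[:cut] + ellipsis


# "e" + COMBINING ACUTE ACCENT straddles the limit, so the whole pair is dropped.
print(repr(truncate_keep_marks("abe\u0301z", 3)))
```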
-
Re: Lol
It is not hard and it seems really obvious, but for some reason Unicode turns some otherwise really smart programmers into total idiots.
There's a lot to know, and people might not be aware of all of it and all the issues involved.
...and "really smart" might actually be a handicap if it means "I'm smart, I know how to do this, it's easy!", and not bother to Read The Fine Manual, whereas somebody less smart might find Unicode scary and actually bother to RTFM.
-
Re: Lol
And since some characters have different lengths, even counting characters might not be good enough.
And some characters might not always have a length and the length of some characters isn't "how much do you move to the right", it's "how much do you move to the left", so truncating is tricky.
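A small sketch of both effects, using only the character properties Python's `unicodedata` module exposes. The rule "Mn, Me and Cf take no space, everything else takes one column" is an assumption made for the example; real advance widths come from the font and the layout engine, and real direction handling is the full bidirectional algorithm. Both helper names are invented here.

```python
import unicodedata

# Assumed for this sketch: these general categories occupy no column.
ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")


def rough_columns(text: str) -> int:
    """Very rough column count: zero-width categories contribute nothing."""
    return sum(
        0 if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES else 1
        for ch in text
    )


def is_rtl(ch: str) -> bool:
    """True for strongly right-to-left characters (bidi class R or AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


print(rough_columns("e\u0301"))      # base letter plus a mark with no width of its own
print(is_rtl("\u05D0"), is_rtl("a"))  # HEBREW LETTER ALEF vs. a Latin letter
```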
-
Re: Lol
From that description it does sound like the string is still valid. However if the display is crashing on a certain sequence containing an ellipsis, I am not clear why you can't construct that string directly, rather than rely on the insertion of the ellipsis.
Yup.
It does sound like they maybe rely on "sanitizing" but of a far more complex scheme than I was aware of.
Not to me, unless by "sanitizing" you mean "shortening so it'll fit in Notification Center".
This is still wrong, maybe far worse, as they are detecting and rejecting patterns containing ellipsis and some other character
I've seen nothing to indicate that they're doing anything specific with ellipses, other than "sticking them in at the point of truncation to let the user know that the full message isn't being displayed".
About all I'd assume is that certain sequences of characters are not being handled correctly by some part of Core Text; perhaps it's assuming, explicitly or implicitly, that those sequences "can't happen" and, instead of drawing them, crashing, perhaps in an assert.
In this case their glyph layout should simply not crash on any possible arrangement of bytes or words in the incoming string.
Correct.
It is not hard and it seems really obvious, but for some reason Unicode turns some otherwise really smart programmers into total idiots.
There's a lot to know, and people might not be aware of all of it and all the issues involved.
-
Re: Lol
So you are saying "fix the library". I am saying "sanitize input for library".
Both work, but I would argue that sanitizing for the library is usually a lot fewer problems.
"Programming for international environments is hard, let's go shopping!"
I would argue that you have perhaps not considered all the possible problems and have thus perhaps miscounted the problems with "work around a broken library by transforming perfectly legitimate Unicode character sequences into sequences that might not represent what the person sending the message intended", that being the correct description of the second approach to this problem in the list above.
Yeah, correctly truncating a message that could be an arbitrary sequence of text in multiple languages with combining character sequences and bidirectional text isn't easy, but, well, if you want to be thought of as a company that makes stuff that "just works", you'd better figure out how to make that complicated process "just work".
Maybe iOS 8.3.1 needs to have a quick fix of some sort, but iOS 10, if not iOS 9, should fix the truncation code.