Unicode Consortium Releases Unicode 8.0.0

Ithought by rossdee · 2015-06-19 17:41 · Score: 4, Funny

That slashdot didn't support unicode

Re:Ithought by Anonymous Coward · 2015-06-19 18:40 · Score: 2, Informative

> That slashdot didn't support unicode
However Soylent News has had full unicode support since last year.
Here is a recent thread with lots of greek.
Re:Ithought by hcs_$reboot · 2015-06-19 18:52 · Score: 4, Funny

Slashdot supports Unicode / UTF8 from 0x20 to 0x7F.

--
Slashdot, fix the reply notifications... You won't get away with it...
Re:Ithought by sound+vision · 2015-06-19 20:17 · Score: 3, Informative

Your post... has a Facebook icon next to it.
I knew Slashdot was going in a different direction, but... Facebook? The alt-text says "From Facebook". I'm not even completely sure what that means, but I don't want anything "from Facebook" in here. I hope you know your post has fucked up my world. I'm going to have a hard time sleeping now. Bro... your post has a Facebook icon on it! However you managed to get that to appear, don't do it ever again! And tell all your friends not to. Together, we can make Slashdot sane again...
Re:Ithought by tlhIngan · 2015-06-19 20:46 · Score: 4, Interesting

That slashdot didn't support unicode
It does. It's actually fully Unicode-compliant. It's just that the input side and, more recently (as of a couple of years ago), the output side pass through a Unicode whitelist.
You see, a Unicode codepoint is not necessarily a character. It can be a character modifier. So you can be handling a string containing multiple codepoints, and yet on screen it only resolves to one character. Some of these include right-to-left overrides (which alter the flow of text on the screen so you can write a string and the display agent will reverse it). There are other modifiers that include flourishes, and Unicode 8 adds "skin type modifiers" as well for emoji. As in, if you display a face on its own, the font should use a "non-human shading" (Apple chose a Simpsons-like yellow, Microsoft chose a pale zombie-ish hue). But a skintone/diversity modifier, when combined with the emoji codepoint, can give you a variety of skin tones.
And it's also what screwed up iOS - the string you send is full of modifiers which makes it extremely hard to decide where to break the line. (Arabic is one where there are lots of modifiers because a character can appear differently based on the characters that appear before and after it).
And what does this have to do with /.? Easy - a lot of commenters abused the modifiers to screw with the website. And unless you know how to handle Unicode, it's really hard to properly reset the parser state. /. used to be able to display the screwed-up comments - if you Google for the oddball string n"5:erocS" it would show it (because Google ignores modifiers). If you wonder, that's the string "Score:5" that commenters used to fake-moderate their posts. But since /. strips Unicode on display now, you get to see the messed-up post as it was typed out.
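To make the "codepoint is not necessarily a character" point concrete, here is a minimal Python sketch. It is not how Slashcode filters comments; the helper names and the exact set of stripped controls are assumptions chosen for illustration.

```python
import unicodedata

# Bidirectional control characters: embeddings, overrides and isolates.
# Which characters a real whitelist rejects is site policy; this set is an assumption.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_bidi_controls(text: str) -> str:
    """Drop bidi control characters so they cannot reorder the surrounding text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def count_base_characters(text: str) -> int:
    """Count code points that are not combining marks (a rough proxy for visible characters)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# RIGHT-TO-LEFT OVERRIDE followed by reversed text: a bidi-aware display shows "Score:5".
fake_score = "\u202e5:erocS"
# 'e' followed by COMBINING ACUTE ACCENT: two code points, one character on screen.
accented = "e\u0301"

print(unicodedata.name(fake_score[0]))
print(strip_bidi_controls(fake_score))
print(len(accented), count_base_characters(accented))
```

With the override stripped, the reader sees the text exactly as it was typed, which matches the behaviour described above.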
Re:Ithought by KiloByte · 2015-06-19 22:16 · Score: 4, Insightful

That slashdot didn't support unicode
It does. It's actually fully Unicode-compliant
No, Slashdot's database works in ISO-8859-1. You're confusing Slashcode which can do Unicode with Slashdot which still hasn't deployed it.

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:Ithought by Smallpond · 2015-06-20 02:20 · Score: 2

You're just mad because your post didn't get an AOL icon.

CJK is Unicode's big failing by Anonymous Coward · 2015-06-19 18:04 · Score: 5, Interesting

CJK in Unicode really kills me. I once had to write an application that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.

What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to refactor the controller that was generating the mixed language strings.
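A "look around" heuristic of that general kind might look like the Python sketch below. This is not the code from that project: the rule (nearby kana means Japanese, otherwise assume Chinese), the window size and the function names are all assumptions, and it only covers the basic CJK Unified Ideographs block.

```python
def is_kana(ch: str) -> bool:
    """True for Hiragana (U+3040..U+309F) or Katakana (U+30A0..U+30FF)."""
    return "\u3040" <= ch <= "\u30ff"


def is_han(ch: str) -> bool:
    """True for the basic CJK Unified Ideographs block only (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def guess_languages(text: str, window: int = 4) -> list:
    """Return one guess per character: 'ja', 'zh', or None for non-CJK characters."""
    guesses = []
    for i, ch in enumerate(text):
        if is_kana(ch):
            guesses.append("ja")
        elif is_han(ch):
            # Look a few characters to either side for kana, which Chinese text does not use.
            nearby = text[max(0, i - window): i + window + 1]
            guesses.append("ja" if any(is_kana(c) for c in nearby) else "zh")
        else:
            guesses.append(None)
    return guesses


print(guess_languages("日本語の本"))
print(guess_languages("中文"))
```

A run of Japanese written entirely in kanji, with no kana inside the window, gets misclassified as Chinese, which is one way such a heuristic ends up "never working very well".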

But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.

Re:CJK is Unicode's big failing by mwvdlee · 2015-06-19 18:20 · Score: 2

There are certainly plenty of "repeat" characters in different contexts.
For example the math alphanumerics: http://unicode.org/charts/PDF/...

--
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
Re:CJK is Unicode's big failing by gustygolf · 2015-06-19 20:45 · Score: 4, Informative

In short:
To render text properly in Japanese, you need a Japanese font. To render text properly in Chinese, you need a Chinese font. It's not just because of character coverage, but because of a thing called Han unification the consortium did.
The Unicode consortium decided to map similar characters to the same code-point. Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.
And this is problematic when there are two CJK languages mixed in the same document -- in the GP's case, Chinese and Japanese --, or when a program must automatically decide which font to render things in.
Take a web browser for example. It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.
(Typically, software bases its guesses on the user's locale. It's pretty accurate -- Chinese users tend to view Chinese documents, Japanese users Japanese ones. But the problems start when someone tries viewing a 'foreign' document...)
It's really quite ironic that the consortium decided on codepoint unification for the three languages that would most benefit from Unicode.
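The "language tag first, locale as a guess" logic can be sketched roughly as follows in Python. No particular browser is claimed to work this way; the function name, the font table and the default font are illustrative assumptions.

```python
from typing import Optional

# Illustrative mapping from primary language subtag to a font family (an assumption).
FONT_BY_LANGUAGE = {
    "ja": "Noto Sans CJK JP",
    "zh": "Noto Sans CJK SC",
    "ko": "Noto Sans CJK KR",
}
DEFAULT_FONT = "Noto Sans"


def pick_cjk_font(document_lang: Optional[str], user_locale: str) -> str:
    """Prefer the document's language tag; fall back to the user's locale as a guess."""
    for tag in (document_lang, user_locale):
        if tag:
            # Reduce "zh-CN" or "ja_JP" to its primary subtag.
            primary = tag.replace("_", "-").split("-")[0].lower()
            if primary in FONT_BY_LANGUAGE:
                return FONT_BY_LANGUAGE[primary]
    return DEFAULT_FONT


print(pick_cjk_font("zh-CN", "ja_JP"))  # tagged page: the tag wins
print(pick_cjk_font(None, "ja_JP"))     # untagged page: guessed from the locale
```

The second call is the failure case: an untagged Chinese page viewed under a Japanese locale gets the Japanese font.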

--
"Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.

Unicode is badly designed by Anonymous Coward · 2015-06-19 18:14 · Score: 2, Interesting

Is Unicode supposed to separate characters that look the same but are semantically different?

Looks like the answer is yes...
'LATIN CAPITAL LETTER A' (U+0041)
'GREEK CAPITAL LETTER ALPHA' (U+0391)

Looks like the answer is no...
'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
(An apostrophe and closing a quotation are two very different things.)
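Both observations can be checked with Python's standard `unicodedata` module. The variable names in this small sketch are just for illustration.

```python
import unicodedata

latin_a = "\u0041"
greek_alpha = "\u0391"
apostrophe_or_quote = "\u2019"

# Two look-alike letters get two distinct code points and names.
for ch in (latin_a, greek_alpha):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# One code point serves as both apostrophe and closing quote; its general
# category is Pf (final quote punctuation) whichever way it is used.
print(f"U+{ord(apostrophe_or_quote):04X} {unicodedata.name(apostrophe_or_quote)}",
      unicodedata.category(apostrophe_or_quote))
```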

Re:bloatware by Zontar+The+Mindless · 2015-06-19 18:16 · Score: 2

Most languages can be written with English characters (ie. plain latin).

Name a language written with Latin characters (other than English) that does not use any special characters or diacritical marks whatsoever.

(Even English requires extensions to the Latin character set, which originally had no "U", "J", or "W".)

--
Il n'y a pas de Planet B.

Re: I'm going back to ASCII by OrangeTide · 2015-06-19 18:32 · Score: 2, Insightful

Sorry, why do we need multiple languages again?

--
“Common sense is not so common.” — Voltaire

Re: bloatware by ciaran2014 · 2015-06-19 18:48 · Score: 2

You mean "ij"? The unified ij character isn't used by anyone. Not sure if it's even recommended by any body.

But Dutch does have accents (één, vóór, ...). News headlines this morning:

"Verstekeling valt boven Londen uit vliegtuig na 11u lange vlucht, één overleeft"

"Grieken demonstreren ook vóór de euro"

--
Help build the anti-software-patent wiki

Already > 65K characters by divec · 2015-06-19 18:53 · Score: 4, Informative

"...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"

There were already 113K characters in Unicode version 7.0, which is more than 2^16 characters, so remember:

1. UTF-16 is *not* two bytes per character
2. Therefore a "character" in Java, C#, JavaScript sometimes only holds half a Unicode character
3. Even a whole Unicode character may be only part of a grapheme cluster, which means that taking arbitrary substrings may not result in readable text.
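
The three points above can be checked with a short Python sketch. Python strings index by code point, so the UTF-16 view is computed explicitly; the helper name `utf16_code_units` is an assumption for illustration.

```python
import struct


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"          # one code point outside the Basic Multilingual Plane
high, low = struct.unpack("<2H", emoji.encode("utf-16-le"))

print(len(emoji), utf16_code_units(emoji))   # 1 code point, 2 UTF-16 code units
print(hex(high), hex(low))                   # the surrogate pair a Java/C#/JS "char" would see

# A grapheme cluster built from several code points: 'e' + COMBINING ACUTE ACCENT.
accented = "e\u0301"
print(len(accented), repr(accented[:1]))     # slicing at 1 silently drops the accent
```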

--

perl -e 'fork||print for split//,"hahahaha"'

Re:I'm going back to ASCII by Dutch+Gun · 2015-06-19 20:33 · Score: 4, Interesting

Don't let the door hit you in the ass on the way out to the pasture of obsolescence. The rest of us will continue to use Unicode, which, despite some flaws, such as their mess-up with Han Unification, does a pretty good job at solving the problem of language intercommunication. If anyone thinks they can do a better job (not counting reverting back to English-only ASCII), have at it.

So go ahead and use ASCII, don't type in any of those dern foreign charcturs, and pretend you're back in the happy past where we had a mess of incompatible standards, and no way to easily discern which of the many possible encodings was actually used, resulting in the scrambled text we always used to see (notice you *don't* actually see that much anymore?). And most software just ignored the rest of the non-English-speaking world anyhow because of that mess.

Personally, I'm thankful people are willing to take on largely thankless (and mind-numbing to most of us) tasks such as these.

--
Irony: Agile development has too much inertia to be abandoned now.

Re: I'm going back to ASCII by prefec2 · 2015-06-19 20:55 · Score: 2

No, we would not do well with only one language; we would lose a lot of culture. It would be like one standard food for everyone. Furthermore, your proposition is ludicrous, as language changes all the time. That's why new street languages pop up and then evolve into something different. Language is a reflection of culture. It is not like a programming language. If you want to communicate with other people you should learn additional languages. And while you are at it, also try to learn something about their culture. Otherwise you would not be able to understand them. So even with only one language, you would not be able to understand them.

BTW: Most problems between societies are not based on religion. Religion is only used as a vehicle to transport the hostility. It is about greed, ignorance, stupidity, and frustrations.

Re:Seems like it, but doesn't by Dutch+Gun · 2015-06-19 21:21 · Score: 4, Insightful

Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.

Nonsense. The Unicode encoding formats are stable, and have been for a very long time. New characters are added all the time, but the underlying OS and its fonts are typically upgraded to support these, and so most programs need to do absolutely nothing once their support is in place. The vast, vast majority of applications that support Unicode don't actually explicitly need to use those "official" Unicode libraries (which are monstrously complex), because all modern operating systems provide most of the support they need. For simple conversions, there are a number of excellent, easy-to-use free and open-source libraries (many languages have standard libraries available), or you can just use the OS-specific versions.

If you're concerned about size, just use UTF-8. There's no need to "switch encodings on the fly", because that's what variable-width encodings already do for you. And the vast majority of common characters, even in Asian languages, take only 16 bits in UTF-16, not 24 or 32. The issue of inefficiency of text size with Asian languages is greatly exaggerated, and becoming less and less relevant anyhow with our machines with gigabytes of RAM and processors efficient enough to compress and decompress text on the fly. BTW, you can use UTF-8 internally just fine even in Microsoft and Apple environments. It just means you need to transcode from UTF-8 to UTF-16 or back again at any API boundary that takes text, and this is fairly simple to do. I've written my own cross-platform code this way because UTF-8 is a much easier encoding to work with internally IMO.
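
As a rough illustration of transcoding at an API boundary, here is a Python sketch. The two helper names are assumptions, and a real C++ codebase would use its own conversion routines; the byte counts show the size trade-off for a short Japanese and a short English string.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no BOM), e.g. before a UTF-16 API call."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16 (little-endian, no BOM) bytes back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


japanese = "日本語".encode("utf-8")
english = "Unicode".encode("utf-8")

# Three common kanji: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
print(len(japanese), len(utf8_to_utf16(japanese)))
# Seven ASCII letters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
print(len(english), len(utf8_to_utf16(english)))
```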

I don't think anyone would try to argue that Unicode is a perfect solution, but it's a damn sight better than what we used to have. Your comparison to USB is pretty good, in fact. Ask just about any PC user what they'd prefer - modern USB devices or the old system of parallel, serial, PS/2, and joystick ports. Whatever faults USB has, it's a hell of an improvement over the old system.

--
Irony: Agile development has too much inertia to be abandoned now.

Re: bloatware by Ilgaz · 2015-06-19 23:31 · Score: 2

This is exactly why it took decades and crazy hacks for people to write their own language electronically.

Thank God the virtually failed (but ultimately victorious) Plan 9 (UNIX2) came along, with idealistic developers who respected other cultures and came up with Unicode, and that companies like IBM/Microsoft/Adobe, along with Free software, supported it.

Who knows if the software/hardware/network combination you use had a line coded by a person who is from those "computer illiterate" regions?

Re:I'm going back to ASCII by Hognoxious · 2015-06-20 02:03 · Score: 2

What are you doing these days? When Doves Cry was part of my childhood.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Re:Seems like it, but doesn't by Dutch+Gun · 2015-06-20 14:49 · Score: 2

Instead we are stuck with UTF16 as the default, and even the larger encodings use modifiers etc.

Who's "we"? Windows and Mac use UTF-16, while Linux and the web use the vastly superior UTF-8. Internally, assuming you're in a language that supports it like C++, you can actually use any encoding you want - it just means you need to transcode strings at API boundaries. You'd have to do this for one or more of your target platforms anyhow if you're writing cross-platform code (all three major PC OSes).

A lot of Windows programmers think "Unicode == UTF-16", which is not the case at all. In my own applications, I use UTF-8 as the native format, even on Windows and Mac. When I need to render glyphs (I write games, so I have my own low-level bitmap-based glyph rendering system), I convert them to UTF-32 code points for simple mapping. If you want to, nothing is stopping you from using UTF-32 internally as you'd seem to prefer, but I've found there's really no need, because you can always convert between formats on the fly as needed.
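
For readers who want to see the "UTF-8 in, UTF-32 code points for glyph lookup" idea, here is a Python sketch rather than the actual C++ game code. The atlas table, its cell coordinates and the function names are hypothetical.

```python
def utf8_to_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer code points (what UTF-32 stores directly)."""
    return [ord(ch) for ch in data.decode("utf-8")]


# Hypothetical glyph atlas: code point -> (x, y) cell in a bitmap font sheet.
GLYPH_ATLAS = {
    0x0041: (0, 0),    # LATIN CAPITAL LETTER A
    0x3042: (16, 0),   # HIRAGANA LETTER A
}
MISSING_GLYPH = (0, 16)


def glyph_cells(data: bytes) -> list:
    """Map each code point of a UTF-8 string to its atlas cell, with a fallback cell."""
    return [GLYPH_ATLAS.get(cp, MISSING_GLYPH) for cp in utf8_to_code_points(data)]


text = "Aあ?".encode("utf-8")
print([hex(cp) for cp in utf8_to_code_points(text)])
print(glyph_cells(text))
```

Each code point is a fixed-width integer at this stage, so the lookup is a plain table access. This sketch ignores combining sequences, which a full renderer would need to handle.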

--
Irony: Agile development has too much inertia to be abandoned now.
