Unicode 6.1 Released
An anonymous reader writes "The latest version of the Unicode standard (v. 6.1.0) was officially released January 31. The latest version includes 732 new characters, including seven brand new scripts. It also adds support for distinguishing emoji-style and text-style symbols and emoticons with variation selectors, updates to the line-breaking algorithm to more accurately reflect Japanese and Hebrew texts, and updates other algorithms and technical notes to reflect new characters and newly documented text behaviors."
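For anyone wondering what "distinguishing emoji-style and text-style symbols with variation selectors" looks like in practice, here is a minimal sketch. The helper name `with_presentation` is made up for illustration, and U+2614 is assumed here to be one of the base characters with a standardized sequence; whether a renderer honors the selector depends on the font and platform.

```python
# VS15 and VS16 are the variation selectors used to request a presentation style.
TEXT_STYLE = "\uFE0E"   # VARIATION SELECTOR-15: request text presentation
EMOJI_STYLE = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation


def with_presentation(base: str, emoji: bool) -> str:
    """Append a variation selector to one base character (hypothetical helper)."""
    return base + (EMOJI_STYLE if emoji else TEXT_STYLE)


umbrella = "\u2614"  # UMBRELLA WITH RAIN DROPS, assumed to have a standardized sequence
text_form = with_presentation(umbrella, emoji=False)
emoji_form = with_presentation(umbrella, emoji=True)

# Each form is two code points: the base character followed by the selector.
print([f"U+{ord(c):04X}" for c in text_form])
print([f"U+{ord(c):04X}" for c in emoji_form])
```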
13 new emoticons1!1! http://www.unicode.org/charts/PDF/Unicode-6.1/U61-1F600.pdf
Take a good look at glyph 27cb aka \diagup part of the Misc Math Symbols. People are gonna try embedding that in html now. Can't wait.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
has got to be the Love Hotel.
Does anyone know why this is even there?
seriously, why would people need more than 256 characters? and why would they need more than 640k of memory?
Before anyone chimes in complaining that Slashdot doesn't even support an old version of Unicode, this is for several reasons. For one thing, there was once a fad of posting pornographic ASCII art on Slashdot, so it appears Slashdot disallows any character that would be more useful for glyph art than for English text. For another, there was once a fad of using bidirectionality override control characters for turning text backwards, which would break the layout and allow spoofing a comment's moderation score.
Yeah but can you write a pile of poo in ASCII?
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Slashdot seems to believe so, seeing that we can't type accents and whatnot without jumping through a few hoops
For justice, we must go to Don Corleone
Yes. +2D3cqQ
Seriously, emoticons? Who ever thought it a good idea to include those in a standard? Should we have an encoding for hearts as dots over lower case i as well? And little horseys, too? And y with a big tail that wraps around to the front of the word?
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Raise your hand if you couldn't code a parser that detects those characters and takes appropriate action, such as popping bidi characters.
🙋 If I were writing such a parser, I don't know how I'd get it to automatically check for the release of a new version of the standard and determine which code points are new bidi characters to be popped.
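One possible answer to the "how would the parser know about new bidi characters" question is to lean on the Unicode database that ships with the language runtime instead of a hand-maintained list. A rough Python sketch, with `strip_bidi_controls` as a made-up name; it only knows about whatever Unicode version the interpreter was built with, and it deliberately leaves the implicit marks (LRM, RLM) alone.

```python
import unicodedata

# Bidi classes for explicit embeddings, overrides, isolates, and their terminators.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def strip_bidi_controls(text: str) -> str:
    """Drop characters whose bidi class marks them as explicit direction controls."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


# The classification is only as current as the runtime's Unicode data.
print("Unicode data version:", unicodedata.unidata_version)
print(strip_bidi_controls("Score: 5\u202Egnilliks\u202C"))
```

Upgrading the runtime picks up newly assigned controls without touching the filter, though that still leaves a window between a new release of the standard and the upgrade.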
I'd love to be able to write IPA when discussing pronunciation
It'd be nice but not necessary: X-SAMPA.
or actually write out words in other languages
I guess the rationale is that most moderators would not be able to read foreign words without transliteration into Latin characters.
pound and yen signs for currency
£ is Alt+0163 on a Windows machine, and ¥ is Alt+0165. They're probably Ctrl+Shift+U A 3 Enter and Ctrl+Shift+U A 5 Enter on a Linux machine, but I don't have one in front of me right this minute with which to test.
Trolls gonna troll; that's what moderation is for.
At one point, ASCII art spammers were filling pages with sexually explicit ASCII art, such as Goatse, male masturbation, and birds perched on a penis, so fast that moderators could not keep up.
So filter those character ranges.
Blacklisting doesn't work because the next version of the standard, such as Unicode 6.1, may introduce more undesirable character ranges.
Seriously, emoticons? Who ever thought it a good idea to include those in a standard?
Unicode had to be able to round-trip (losslessly encode and decode) all old popular encodings. This includes encoding now called "code page 437", introduced with the first IBM PC, which includes a smile emoticon at code value 0x01. It also includes the encodings associated with the widely distributed system fonts Zapf Dingbats and Wingdings.
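To make "round-trip" concrete, here is a small sketch using the cp437 codec from Python's standard library. One caveat: that codec maps the bytes below 0x20 to control characters rather than to the graphic glyphs the PC displayed, so the smiley at 0x01 comes back as U+0001 here, not as a smiley.

```python
# Decode every possible byte value from code page 437 into Unicode, then encode back.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("cp437")
round_tripped = decoded.encode("cp437")

# Lossless: the bytes that come back are the bytes that went in.
print(round_tripped == all_bytes)

# Byte 0xDB is the full block in code page 437; Unicode has it at U+2588.
full_block = b"\xdb".decode("cp437")
print(f"U+{ord(full_block):04X}")
```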
English also has the second-worst spelling system on the planet (only outdone by Japanese). I may have to use it on /. but I'm happy I don't have to resort to it for daily usage. And even if your idiotic proposal were to be universally accepted (which it won't, it's like asking everyone to use DOS) we'd still be in need of a way to encode historical documents and such.
Because my browser doesn't support Unicode 6.1 yet...
all the Tetris pieces
The polyominoes up to five squares can be composed from U+2580 (upper half block), U+2584 (lower half block), and U+2588 (full block) characters. Unicode tends not to introduce precomposed ligatures except when needed for round-tripping with pre-Unicode encodings.
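As a sketch of composing a piece from those three block characters, the following packs two grid rows into each line of text. The `render` function and the `#`/`.` grid notation are invented for this example.

```python
UPPER, LOWER, FULL, EMPTY = "\u2580", "\u2584", "\u2588", " "


def render(grid):
    """Render a grid of '#' (filled) and '.' (empty) cells, two rows per text line."""
    rows = list(grid)
    if len(rows) % 2:
        # Pad with an empty row so the rows pair up evenly.
        rows.append("." * len(rows[0]))
    lines = []
    for top, bottom in zip(rows[0::2], rows[1::2]):
        line = ""
        for t, b in zip(top, bottom):
            if t == "#" and b == "#":
                line += FULL
            elif t == "#":
                line += UPPER
            elif b == "#":
                line += LOWER
            else:
                line += EMPTY
        lines.append(line)
    return lines


t_piece = ["###", ".#."]
print("\n".join(render(t_piece)))
```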
glyphs of game pieces of all well known games
A lot of well-known pre-1923 tabletop games' game pieces already exist in Unicode. Chess is U+2654 through U+265F, and Checkers is U+26C0 through U+26C3. A lot of game pieces are simple enough in form that the Geometric Shapes (U+25A0 through U+25FF) represent them just fine. For example, Othello is U+25CB and U+25CF, as is Connect Four. Even the enemy in Fast Eddie for Atari 2600 is in Miscellaneous Technical (U+237E) as is home plate in Baseball (U+2302).
heck, instead of just the suit symbols why not 52 glyphs for a standard deck of cards
Those can already be composed from a Basic Latin letter or number and a suit symbol. Unicode tends not to introduce precomposed ligatures except when needed for round-tripping with pre-Unicode encodings.
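A throwaway sketch of that composition, building all 52 cards from a rank string plus one of the four black suit symbols (U+2660, U+2665, U+2666, U+2663). The variable names are just for illustration.

```python
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["\u2660", "\u2665", "\u2666", "\u2663"]  # spade, heart, diamond, club

# Each card is just a rank followed by a suit symbol; no precomposed glyph needed.
deck = [rank + suit for suit in SUITS for rank in RANKS]

print(len(deck))
print(" ".join(deck[:13]))
```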
throw the Major Arcana tarot cards in there too
I don't know about Tarot, but all twelve signs of the zodiac are in Miscellaneous Symbols, even the "69" looking sign of Cancer (U+264B).
gang symbols
The symbol of "Folk Nation" gangs is similar to that of Judaism: a Star of David (U+2721). The symbol of "People Nation" gangs is similar to that of Islam: a 5-point star and crescent (U+262A).
http://xkcd.com/927/
Unicode has different *pages*. You can filter by page.
New versions of Unicode introduce new pages. If you're blocking a page for some reason, the next version of Unicode might introduce another page that extends the functionality of the old page, reintroducing the behavior that led you to block the old page.
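The alternative to blocking pages is allowing only known-good ranges, so that anything added by a later version of the standard stays out until someone reviews it. A minimal sketch follows; the ranges in `ALLOWED_RANGES` are picked purely for illustration and are not anything Slashdot is known to use.

```python
# Hypothetical allowlist: only these code point ranges get through.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # printable Basic Latin
    (0x00A0, 0x00FF),  # Latin-1 Supplement (accents, pound and yen signs)
    (0x2010, 0x2027),  # dashes, curly quotes, ellipsis; stops before the bidi controls
]


def is_allowed(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def filter_comment(text: str) -> str:
    """Keep allowlisted characters and newlines; silently drop everything else."""
    return "".join(ch for ch in text if ch == "\n" or is_allowed(ch))


print(filter_comment("caf\u00e9 \u202Eevil \U0001F4A9"))
```

The cost is exactly what the rest of the thread complains about: every script not on the list is dropped too.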
What's stopping us from just creating a Greasemonkey script that translates back and forth from HTML with square brackets and allows the full HTML set
Slashdot's lameness filter would probably confuse those square brackets with ASCII art, and even if not, the comment would likely draw negative moderations from moderators who haven't installed the Greasemonkey script.
by putting every message in its own e.g. IFRAME
There was a time when hundreds of <iframe> elements on a page would cause the browser to become unusably slow or even crash. I reported this to bugzilla.mozilla.org as Bug 103649, and a decade later it's still not RESOLVED FIXED. And are you going to put the subject of a comment in its own iframe too?
and force a maximum size on the comment content.
Until April 2014, when IE 6 passes out of extended support, one can't assume that all supported browsers support CSS max-width.
If they wrote a brilliant paragraph a day ago, then deleted it in the morning, they can view the document as it existed yesterday, copy the paragraph back out, and be done with it.
For one thing, an application that saves (and sends) a document's undo history along with the document can disclose things that the document's author did not want to disclose. I seem to vaguely remember scandals with Word's AutoRecover being used to recover redacted parts of a document. For another, how much of the limited space on the drive should be dedicated to saving a document's undo history since creation, especially when the document is a large layered picture or multitrack audio project?
And that's because people forget to save - why not have the OS do it for them?
I agree, but how often should the OS spin up the hard drive to do so?
Yeah, it's fantastic that Cyrillic or Katakana or Devanagari scripts can be so beautifully supported in ASCII. Speaking of which, does HTML5 have a complete character list for Unicode, or is it still restricted to ASCII?
ASCII leaves off a lot of English punctuation, and accents that are, in fact, used in English (sure, in words of foreign origin, but they are still used.)
Well said, that man. If you feel the desire to "write" with stick figures and squiggles use a bastarding graphic, for fuck's sake.
Eklinóringëon my arse.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Àçcênts aré easy (if you have Windows). See http://vulpeculox.net/ax.
Works for 'any' application. Free. No stupid picking or codes.
English also has the second-worst spelling system on the planet (only outdone by Japanese).
??? WTF are _YOU_ on about? English does not have the worst spelling system on the planet, and Japanese certainly doesn't qualify as the worst. "But they have three different scripts: two syllabaries, and an ideographic set" but...
Look, perhaps I better just demonstrate to you what a real bad spelling system looks like; go look at Irish.
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
ASCII leaves off a lot of English punctuation, and accents that are, in fact, used in English (sure, in words of foreign origin, but they are still used.)
Some that aren't foreign as well. "Coöperate" is an archaic spelling. Basically, any prefix that ends in "o" that is attached to a word that starts with an "o" can archaically be spelled with a diaeresis, in the French/Dutch method of "this vowel should be pronounced separately, and not as part of a diphthong".
ASCII is just 128 characters.
Write boring code, not shiny code!
??? WTF are _YOU_ on about?
Can you concisely explain why the English word "psyche" is pronounced the way it is to a non-native speaker of the language?
In a toolbar full of icons, the word "Save" or its localization without an icon will probably look out of place. Is this out-of-placeness somehow superior to the use of a floppy disk icon?
There's a bug in WebKit on the Mac that stops font fallback working properly.
Reported by me in Chrome, reported up the chain to Apple.
It works fine in Chrome for Linux, so it's something weird and Mac-specific Apple will probably need to fix.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
This is Slashdot, I'm sure you can find any number of examples of people who've written a pile of poo in ASCII.
What the hell are you talking about? Excuse me, are you from the past?!
They've got symbols for a love hotel, a horse, and a steaming pile of poo, along with emoticons, and they still haven't accepted the Tengwar draft that's been around since '93? Where are these people's priorities!?
Yeah but can you write a pile of poo in ASCII?
As far as I know, Windows was originally written in ASCII... :)
Can you explain why any word in french is pronounced the way it is?
It seems like they have different rules for what letters to pronounce for every word.
Anyway the reason you pronounce psyche like that is because it sounds better than psitsh.
"It needed to be flexible, so it's a VM now."
I fear this is the next step. The right to left and line wrapping BS is complicated enough that I'd welcome a specialized VM with loadable bytecode & glyph data. Yes, from a security standpoint this could create a wider attack surface. However, I'd argue it would be less attack surface considering that the VM for my unlimited precision scientific & programming calculator is smaller than my UTF-8 text display implementation.
I'd also argue that it would be faster to adopt new glyphs and behaviors if all I needed was to drop in a new batch of bytecode.
I'd also argue just to argue... because, well this IS Unicode we're talking about.
Something is wrong with the Java code for this, though: Character.getNumericValue() is documented as returning -1 for this character, when quite clearly it should be the number 2.
Professor Karmadillo Songs of Science
Can you concisely explain why the English word "psyche" is pronounced the way it is to a non-native speaker of the language?
ps: Pronounced s. Whenever you see the letters 'p' and 's' together at the start of a word, do not pronounce the 'p', for example in pseudonym, psilocybin, or psst!
y: Pronounced eye or, more simply i. Why? Exactly! Um...
che. Pronounced key. Because without it everything would remain locked up in your brain.
And if that doesn't work there's always the fall back: Because it is!
I'm particularly fond of this set:
1F648 SEE-NO-EVIL MONKEY
1F649 HEAR-NO-EVIL MONKEY
1F64A SPEAK-NO-EVIL MONKEY
The only thing better would be a smoking monkey character. Because there ain't nothing funnier than a smoking monkey!
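A quick way to check code points like these against the character names in the standard is Python's `unicodedata` module. This is a sketch that assumes the interpreter's Unicode data is at least version 6.1; `monkey_names` is a made-up helper.

```python
import unicodedata


def monkey_names():
    """Look up the official names of the three monkey code points."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)))
        for cp in (0x1F648, 0x1F649, 0x1F64A)
    ]


for code_point, name in monkey_names():
    print(code_point, name)
```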
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
It's a loan-word from Greek. It follows the basic English rules for borrowing Greek words.
The rules for regular English words are no better, to be honest. It's like someone was trying to come up with the most perverted way to make a letter represent something as different as possible from what it does in most European languages (and Latin, where it originates). The only language that's possibly worse in that regard is French, but at least they are consistent in the way they mutilate their phonemes (and most of it is just dropping them altogether), whereas in English you have to guess which of the possible 2-4 radically different pronunciations of a single letter applies (like, say, /e/ vs /i/ vs /ai/), often with no rules other than "you just kinda look at which words are similar and go from there, but mind the exceptions!".
I understand that this is not really the fault of the language itself, it's just that its spelling effectively represents how the words were pronounced several centuries ago (circa Chaucer), rather than how they are pronounced today. It would be great to fix that, but it's probably too late by now - it already is the "world language" in its present shape, and all those millions of people are not going to re-learn. So all that we can do is rant about it.
....oh there it is.
Can you explain why any word in french is pronounced the way it is? It seems like they have different rules for what letters to pronounce for every word.
You know, "better than French" is not a great achievement. Indeed, one of the reasons why English is in such a sorry shape is because it absorbed an unhealthy dose of French poison as part of its history.
Anyway, the rule of thumb in French seems to be, if you don't know how to pronounce any given letter, just skip it altogether - >50% chance of you getting it right in that case. ~
Anyway the reason you pronounce psyche like that is because it sounds better than psitsh.
Technically, it should be /psixe/, which sounds reasonable to me.
I'm sure we could have found some way to get along without "Mathematical Rising Diagonal" and "Kissing Face".
That is all.
Drop the accents, people will know what you mean... and in a long enough period of time, only historians will care.
I'm pretty sure that in HTML5, as in HTML4, the document is considered to be made up of Unicode characters, and other charsets are considered encodings of Unicode. Of course the HTML5 spec doesn't list all Unicode characters explicitly; that would be insane.
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
)
,
(
___)\
(_____)
(_______)
A glyph for an ice cream cone but still no half-stars to do movie ratings?
We already have a two-character icon for wanker: Latin capital letter W (U+0057), followed by Anchor (U+2693).
But that defeats the purpose of Unicode, doesn't it? I'm not expecting that HTML5 support, for instance, Wingdings, but if someone, for whatever reason, in an English document needs to type a foreign character outside ASCII, such as a word in Cyrillic, or Mandarin, and can't, what's the good of making the spec Unicode, as opposed to ASCII compliant? I'd just want all the characters in all languages to be supported, but things like card symbols, or emoticons are okay not to support.
Done as well.
The character entities in HTML are only to try to get around legacy encodings. And since you can specify numerical Unicode entities, all of the Unicode set is accessible, there is no need for explicit names for everything.
If you aren't constrained to legacy encodings, then the obvious approach is just to set the encoding to something sensible, for example UTF-8. There are several ways to do this in HTML. http://www.w3.org/TR/html5-diff/#character-encoding
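To illustrate the numeric-entity route, here is a small sketch using only Python's standard library: text is squeezed into pure ASCII with numeric character references, then turned back into the original characters. `to_ascii_html` is a hypothetical helper name, and the sketch does not escape markup characters such as `<` or `&`.

```python
import html


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "\u00a35 and \u00a5500"   # pound sign, yen sign
encoded = to_ascii_html(original)    # pure ASCII, safe in a legacy encoding
decoded = html.unescape(encoded)     # what a browser would reconstruct

print(encoded)
print(decoded == original)
```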
Specifying the "document character set" as Unicode means that even if the charset you are writing your document in doesn't support the character you want, you can still enter it as a numeric (or named, if one is defined) entity. Whether it will be displayed is mostly a matter of whether appropriate fonts are installed, but generally I'd expect someone who writes Chinese to have Chinese fonts installed.
Generally it's the GUI system's job to handle input and output of text, not an individual application's. Is it reasonable to expect browsers to ship a massive font full mostly of characters that most of its users will either find meaningless or have already? Is it reasonable to expect browsers to implement their own input methods in case the operating system's one is deficient? Is it reasonable to expect them to implement their own font rendering for the same reason?
??? WTF are _YOU_ on about?
Can you concisely explain why the English word "psyche" is pronounced the way it is to a non-native speaker of the language?
The word, being originally from Greek and pronounced /psyxe/, was transliterated and taken into English. English phonology does not allow for a word to start with /ps/, and so the rules change that to a /s/. English phonology does not allow for a /x/, and so the rules change that to a "k". English phonology does not allow for a word to end with /e/, and so the rules change that to either a /ej/ or an /i/, but more commonly /i/ (e.g. Japanese "sake" is typically pronounced /saki/). All that is left is the /y/ which also cannot occur in English phonology, and thus the rules treat it as if it were orthographically an "i", and then apply phonological rules based on this. Since the "i" would be long (CVCe rule) it is pronounced /aj/.
Thus, after a whole bunch of interference from English phonology /psyxe/ comes out as /sajki/.
Oh, you wanted a concise answer: "Because English can't pronounce 'psyche' properly, and fuck it up." The same way "keyboard" becomes "kiiboodo" in Japanese, and "Merry Christmas" becomes "Mele Kalikimaka" in Hawai'ian.
Can you explain why any word in french is pronounced the way it is?
It seems like they have different rules for what letters to pronounce for every word.
Actually, French orthography guarantees that if you know how something is spelled, then you can pronounce it, but if you only know how it is pronounced, then you cannot know how to spell it.
So, while it might be difficult for some people learning the language, it is at least consistent in spelling to pronunciation (unlike English).
It's like someone was trying to come up with the most perverted way to make a letter represent something as different as possible from what it does in most European languages
No, seriously... go look at Irish Gaelic spelling. You will be amazed at how much more unpredictably the system works, and how incredibly variant the letters are from the way that they're used by everyone else.
Playing Exalted, I get a head-explosion every time someone talks about "geis" as /giz/ rather than as /geS/... (I am at least understanding of their inability to pronounce /J\/ properly, as I don't think I have any experience producing it properly either.)
The classic in English is the ough ending. Count how many different ways it can be pronounced. (Oooh, Uff, Ow, Oh, Uh... any more? And I have no explanation for a non-native speaker as to how that came about.)
-- The Grand Teddy Bear has Spoken: "Windows 8 Source Code Available NOW! more disgusting than your pr..."
It's not always so clear. When you see "resume" typed out, for example, don't you ever stop and have to think if it means resume or résumé?
Two quite different words, even pronounced differently, that can be muddled by dropping the accents in text.
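The muddle is easy to reproduce. Below is a sketch of the usual "drop the accents" approach (decompose, then discard combining marks), which shows the two words collapsing into the same string. `strip_accents` is a made-up name, and the method only handles accents that decompose into a base letter plus a combining mark.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop the combining marks that carried the accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("r\u00e9sum\u00e9"))            # the noun, accents removed
print(strip_accents("r\u00e9sum\u00e9") == "resume")  # now indistinguishable from the verb
print(strip_accents("co\u00f6perate"))
```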
I think you missed the first part of the sentence:
Standardizing on ASCII, even accents aside, would be insufficient for English. There's some punctuation used in English in the high end of Latin-1 (outside of the low-end which is ASCII), and even more in the Unicode general punctuation range (2000-206F).
Such as? Is it that important to have your quote marks angled?
Well, you get one of the biggies on your own:
Sure, it greatly improves readability. Same thing with visual distinction between hyphens, various forms of dashes, and minus signs. There is a reason why professionally-published documents rarely restrict themselves to the subset of English punctuation supported by ASCII.
The difference is that the Japanese and Hawaiians actually write (the kana for) "kiiboodo" or "Mele Kalikimaka" according to the phonology of their own languages.