Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Stories · 13
-
Big Tech Warns of 'Japan's Millennium Bug' Ahead of Akihito's Abdication (theguardian.com)
MightyMartian shares a report from The Guardian: On April 30, 2019, Emperor Akihito of Japan is expected to abdicate the chrysanthemum throne. The decision was announced in December 2017 so as to ensure an orderly transition to Akihito's son, Naruhito, but the coronation could cause concerns in an unlikely place: the technology sector. The Japanese calendar counts up from the coronation of a new emperor, using not the name of the emperor, but the name of the era they herald. Akihito's coronation in January 1989 marked the beginning of the Heisei era, and the end of the Shwa era that preceded him; and Naruhito's coronation will itself mark another new era. But that brings problems. For one, Akihito has been on the throne for almost the entirety of the information age, meaning that many systems have never had to deal with a switchover in era. For another, the official name of Naruhito's era has yet to be announced, causing concern for diary publishers, calendar printers and international standards bodies. It's why some are calling it "Japan's Y2K problem." "The magnitude of this event on computing systems using the Japanese Calendar may be similar to the Y2K event with the Gregorian Calendar," said Microsoft's Shawn Steele. "For the Y2K event, there was world-wide recognition of the upcoming change, resulting in governments and software vendors beginning to work on solutions for that problem several years before January 1, 2000. Even with that preparation many organizations encountered problems due to the millennial transition. Fortunately, this is a rare event, however it means that most software has not been tested to ensure that it will behave with an additional era."
Unicode's Ken Whistler wrote in a message earlier this month: "The [Unicode Technical Committee] cannot afford to make any mistakes here, nor can it just *guess* and release the code point early. All of this is pointing directly to the necessity of issuing a Unicode 12.1 release sharply on the heels of Unicode 12.0, incorporating the addition of the new Japanese era name character, which all vendors will be under great pressure to immediately support in 2019 software releases." -
Google's New Emoji Aimed At Promoting Gender Equality Are Coming (arstechnica.com)
An anonymous reader writes: Based largely on a proposal from Unicode Consortium member Google, Unicode Consortium has announced plans to support new emoji aimed at promoting gender equality. There will be "11 new 'professional' emoji [that] will depict both men and women performing different jobs, and there will be both male and female versions of 33 existing emoji that currently depict either a man or a women but not both," writes Ars Technica. "The new professions include, in the Unicode Consortium's words: a farmer, welder, mechanic, health worker, scientist, coder, business worker, chef, student, teacher, and rockstar." What's unusual about the new emoji is that they're created using combinations of existing emoji to avoid waiting for Unicode version 10.0 to be finalized in June of 2017. By using a special "zero-width joiner" (ZWJ) character between two or more emoji, operating systems that support it know to put out a different composite emoji rather than a series of separate emoji. "The new emoji for professions start with either a man or woman emoji, then a ZWJ character, then another character related to the job," reports Ars. "Emoji that were previously one specific gender (the dancing woman or the man running) can be joined to a male or female symbol with a ZWJ character to create emoji of either gender. And all of these emoji can be combined with the existing skin tone modifiers to produce diverse versions of either gender." We may see these combined emoji before the end of the year as software companies begin to integrate them into their operating systems. -
Google's New Emoji Aimed At Promoting Gender Equality Are Coming (arstechnica.com)
An anonymous reader writes: Based largely on a proposal from Unicode Consortium member Google, Unicode Consortium has announced plans to support new emoji aimed at promoting gender equality. There will be "11 new 'professional' emoji [that] will depict both men and women performing different jobs, and there will be both male and female versions of 33 existing emoji that currently depict either a man or a women but not both," writes Ars Technica. "The new professions include, in the Unicode Consortium's words: a farmer, welder, mechanic, health worker, scientist, coder, business worker, chef, student, teacher, and rockstar." What's unusual about the new emoji is that they're created using combinations of existing emoji to avoid waiting for Unicode version 10.0 to be finalized in June of 2017. By using a special "zero-width joiner" (ZWJ) character between two or more emoji, operating systems that support it know to put out a different composite emoji rather than a series of separate emoji. "The new emoji for professions start with either a man or woman emoji, then a ZWJ character, then another character related to the job," reports Ars. "Emoji that were previously one specific gender (the dancing woman or the man running) can be joined to a male or female symbol with a ZWJ character to create emoji of either gender. And all of these emoji can be combined with the existing skin tone modifiers to produce diverse versions of either gender." We may see these combined emoji before the end of the year as software companies begin to integrate them into their operating systems. -
Unicode Consortium Looks At Symbols For Allergies
AmiMoJo writes: A proposal (PDF) submitted by a Google engineer to the Unicode Consortium asks that food allergies get their own emojis and be added to the standard. The proposal suggests the addition of peanuts, soybeans, buckwheat, sesame seeds, kiwi fruit, celery, lupin beans, mustard, tree nuts, eggs, milk products and gluten. According to TNW: "This proposal will take a little longer to become reality — it's still in very early stages and needs to be reviewed by the Unicode Consortium before it can move forward, but it'll be a great way for those with allergies to quickly express them." -
Unicode Consortium Releases Unicode 8.0.0
An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japan and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda. -
Unicode 7.0 Released, Supporting 23 New Scripts
An anonymous reader writes "The newest major version of the Unicode Standard was released today, adding 2,834 new characters, including two new currency symbols and 250 emoji. The inclusion of 23 new scripts is the largest addition of writing systems to Unicode since version 1.0 was published with Unicode's original 24 scripts. Among the new scripts are Linear A, Grantha, Siddham, Mende Kikakui, and the first shorthand encoded in Unicode, Duployan." -
Unicode 6.1 Released
An anonymous reader writes "The latest version of the Unicode standard (v. 6.1.0) was officially released January 31. The latest version includes 732 new characters, including seven brand new scripts. It also adds support for distinguishing emoji-style and text-style symbols and emoticons with variation selectors, updates to the line-breaking algorithm to more accurately reflect Japanese and Hebrew texts, and updates other algorithms and technical notes to reflect new characters and newly documented text behaviors." -
Unicode 6.1 Released
An anonymous reader writes "The latest version of the Unicode standard (v. 6.1.0) was officially released January 31. The latest version includes 732 new characters, including seven brand new scripts. It also adds support for distinguishing emoji-style and text-style symbols and emoticons with variation selectors, updates to the line-breaking algorithm to more accurately reflect Japanese and Hebrew texts, and updates other algorithms and technical notes to reflect new characters and newly documented text behaviors." -
International Longest Tweet Contest Seeks Entries
An anonymous reader writes "The 1st International Longest Tweet Contest is open for submissions until April 12. It looks to be a take-off of the famous Obfuscated C Contest. So far the record is 4.2 kilobits encoded per tweet, based on exploiting the fact that Twitter actually passes the full 31 bits of ISO 10646 (the international standard that Unicode is based on), not the roughly 20.08 bits/character of Unicode itself." -
C++ GUI Programming with Qt 3
Alex Moskalyuk writes "Before Sun monopolized the notion of 'write once, run everywhere,' those who enjoy programming in C++ had the choice of using Qt libraries that provide cross-platform GUI support. C++ GUI Programming with Qt3 is written by the employees of TrollTech, the company that created and currently distributes the Qt environment." Read on for the rest of Alex's review. C++ GUI Programming with Qt 3 author Jasmin Blanchette, Mark Summerfield pages 464 publisher Prentice Hall PTR rating 9 reviewer Alex Moskalyuk ISBN 0131240722 summary Practical introduction into GUI programming with QtThe first question that came to mind when I got this book - is there any need for it? Qt's Documentation is detailed and extensive with how-to's and an API reference available online for free. I have done GUI development in .NET (with C#) and Tk (with Perl) environments, and even though I've never tried Qt, the site with tutorials looked like a sufficiently good resource to start.
However, after getting through the first few chapters, religiously trying out the code, my opinions on whether a separate book is needed have changed. Jasmin Blanchette and Mark Summerfield's book can take a sufficiently clueless newbie with some C++ knowledge and guide him through the intricacies of GUI building, providing practical advice and some bits of experience on the way. You learn about the practicality of this book by turning to page 3 (with page 1 being the title) and seeing a code example as the second paragraph of the first chapter. Writing a basic GUI application in C++/Qt is attractively easy, to win you over and make you read the rest of the chapter, as well as finish the basic introduction by creating a windowed application with SpinBox and Slider widgets.
The table of contents is available on the publisher's Web site and looks fairly simple. Each chapter takes about 20-30 pages, with screenshots and code examples provided as part of the text. Reading the first 5 chapters, which comprise the "Basic Qt" section and take up 110 pages, should be enough for any C++ developer to build a sufficiently complex GUI application if all that's required is some graphical interface slapped on top of the functionality that's already there.
The rest of the book -- "Intermediate Qt" chapters -- take the reader into the common problems of GUI development, providing some insight into more advanced topics as well. Supporting networking, working with graphics and images, internationalization of the software application, interacting with help, reading XML through SAX and DOM APIs, accessing databases and doing inter-process communication are all covered here. The authors tended to avoid inserting huge amounts of reference material into the book, and, for example, in the XML chapter when working with Unicode you will be told to go online and download the numeric values of the Unicode characters instead of dedicating valuable book pages to it.
The language of the book is simple to follow; there are plenty of code examples (with discussion following each), and when the authors make certain choices, they also explain why. The diagrams and screenshots are clear (although not in color), and the code examples can be easily separated from the text. This is the first official TrollTech guide to Qt 3.2 programming, and the authors promise that the techniques will work with Qt 4.
Perhaps part of the positive impression that this book left is the fact that programming in Qt is easy and straightforward. At the early stages of my education, I started learning GUI programming with MFC, which left an indelible image of complexity and will probably increase psychiatrist bills in the future (to be fair to Microsoft, Windows Forms with .NET is a huge step forward). The book and the Qt library made some complex things sound quite simple and enjoyable to program. As Matthias Ettrich notes in the foreword to this book, the most important point in reasoning why Qt is so popular is "because programmers like it."
The book comes with a CD that contains non-commercial version of Qt 3.2 for Windows/Mac/Linux, Borland C++ 5.5 (Non-Commercial) and trial version of Borland C++ 6.0 compilers, SQLite database engine and book source code. The non-commercial version of Qt 3.2 for Windows can be installed for Borland C++ 5.5, Borland C++ 6.0, Microsoft Visual C++ 6 and Microsoft Visual C++.NET environments. The examples are quite conveniently located in folders with chapter numbers, followed by subfolders with example names.
Whether you're looking for general introduction to GUI development with C++ or trying to learn Qt, having worked with other libraries and toolkits before, this book is a good source of practical information and reference. The book is part of Perens' Open Source Series.
Alex Moskalyuk enjoys reading and reviewing books on programming and tech industry in general. You can read his other reviews on his personal site. You can purchase C++ GUI Programming with Qt 3from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Migrating Large Scale Applications from ASCII to Unicode?
bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and managment does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (note: we currently handle the web pages in multiple language sets but require the data to be in ascii form.) I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)." -
Why Unicode Will Work On The Internet
Ken Whistler sent in the lengthy and rather pointed response below to the article "Why Unicode Won't Work On The Internet," which was posted to Slashdot on Tuesday.I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."
Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He make a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.
Here are some specific comments on items in the article which are either misleading or outright false.
Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe any sound the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul are not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).
In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate every single written language on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is not about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.
Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the current version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
Even if Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accomodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."
Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:
http://www.unicode.org/unicode/uni2book/appA.pdf
The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable ..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.
Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main Japanese national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.
Your (Mr. Carroll's) editorial observation that "It is only when you get all the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have not been neglected.
And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are the major references for Classical Chinese -- the Siku Quanshu is the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).
Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires without taking Han unification into account. In fact, many more than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.
Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two separate 16-bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.
The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on my the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the solution to the problem, enabling worldwide interoperability, rather than obstructing it.
And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.
Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.
Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.
Whistler is Technical Director for Unicode, Inc. and co-editor of The Unicode Standard, Version 3.0. He holds a B.A. in Chinese and a Ph.D. in Linguistics. -
How Do You Handle Unicode?
spectecjr asks: "The word on the street is that Unicode is the panacea for programming globalizable applications. The thing is, which operating systems support it? And how much work do you need to do? Allegedly, KDE and GNOME both support Unicode. Windows 9x supports it, but in a rather fragmented manner. Windows 2000 is Unicode to the extreme, even supporting Dvengali, Thai and Arabic script on all versions. But what are the pitfalls? What do you have to be aware of? What makes it different than talking to ASCII? How do you handle whitespace? How do you make your API display the characters in the right fonts? All of these issues are becoming more important as the world becomes more switched on, and the boundaries shrink between places. But what does it really mean for Joe Q. Developer?"For some related articles on this subject, you might want to check out the following: