CJKV Information Processing 2nd ed.
stoolpigeon writes "At the end of last year, I made a move from an IT shop focused on supporting the US side of our business to a department that provides support to our operations outside the US. This was the first time I've worked in an international context and found myself, on a regular basis, running into long-time assumptions that were no longer true. My first project was implementing a third-party, web-based HR system for medium-sized offices. I found myself constantly missing important issues because I had such a narrow approach to the problem space. Sure, I've built applications and databases that supported Unicode, but I've never actually implemented anything with them but the same types of systems I'd built in the past with ASCII. But a large portion of the world's population is in Asia, and ASCII is certainly not going to cut it there. Fortunately, a new edition of Ken Lunde's classic CJKV Information Processing has become available, and it has really opened my eyes." Keep reading for the rest of JR's review.
CJKV Information Processing 2nd ed.
author
Ken Lunde
pages
898
publisher
O'Reilly Media, Inc.
rating
10/10
reviewer
JR Peck
ISBN
978-0-596-51447-1
summary
Chinese, Japanese, Korean and Vietnamese computing.
CJKV Information Processing has a long history that actually goes back into the 1980s. It began as a simple text document JAPAN.INF, available via FTP on a number of servers. This document was excerpted and refined and published as Lunde's first book in 1993, Understanding Japanese Information Processing. Shortly after JAPAN.INF became CJK.INF and the foundation for the first edition of CJKV Information Processing was born. The first edition was published in 1999, and it is safe to say that a number of important things have changed over the last 10 years. Lunde states four major developments that prompted this second edition in the preface. They are the emergence of Unicode, OpenType and the Portable Document Format (PDF) as preferred tools and lastly the maturity of the web in general to use Unicode and deal with a wider range of languages and their character sets.
Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, "..this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way..." Taking into account the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, outside a heavy bias for Unicode, is extremely impressive. A glance over the table of contents show just how true this is. Chapter 9, Information Processing Techniques has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java but that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors doesn't just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly the Word Processor system covers AbiWord and KWord but not OpenOffice.org The point stands that anyone looking to support CJKV, this book will probably cover your platform and give you at the very least a starting point with your chosen tool set.
That said, an extremely specific implementation is not what Lunde is out to offer up. This is the very opposite of a 'cook book' approach. This also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV but the principles will hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics as well as pointing out areas of concern and where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times down right staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.
The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered and the character set standards that exist. This was a fascinating glimpse, once again into CJKV languages and how other languages are dealt with as well. I think there is even a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer that will be working with one of the languages. That's only the first quarter of the book, so I don't know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde's approach.
The style is very readable, but I wouldn't just hand this to someone who didn't have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience I had was very similar to when I studied ancient Greek in school. Learning Greek I learned much more about English grammar than I had ever picked up prior. Reading CJKV Information Processing I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically I've found that Lunde's work is indispensable. It is not just my go to reference, it's essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.
There are thirteen indexes including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a url pointing to a PDF with the information. It seemed odd, but each URL gets it's own page. This means there are nine pages with nothing but the title of the index and a url. Fortunately they are all in the same directory, which can be reached directly from the books page at the O'Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It's a minute flaw in what is a great book."
You can purchase CJKV Information Processing 2nd ed. from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, "..this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way..." Taking into account the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, outside a heavy bias for Unicode, is extremely impressive. A glance over the table of contents show just how true this is. Chapter 9, Information Processing Techniques has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java but that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors doesn't just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly the Word Processor system covers AbiWord and KWord but not OpenOffice.org The point stands that anyone looking to support CJKV, this book will probably cover your platform and give you at the very least a starting point with your chosen tool set.
That said, an extremely specific implementation is not what Lunde is out to offer up. This is the very opposite of a 'cook book' approach. This also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV but the principles will hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics as well as pointing out areas of concern and where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times down right staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.
The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered and the character set standards that exist. This was a fascinating glimpse, once again into CJKV languages and how other languages are dealt with as well. I think there is even a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer that will be working with one of the languages. That's only the first quarter of the book, so I don't know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde's approach.
The style is very readable, but I wouldn't just hand this to someone who didn't have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience I had was very similar to when I studied ancient Greek in school. Learning Greek I learned much more about English grammar than I had ever picked up prior. Reading CJKV Information Processing I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically I've found that Lunde's work is indispensable. It is not just my go to reference, it's essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.
There are thirteen indexes including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a url pointing to a PDF with the information. It seemed odd, but each URL gets it's own page. This means there are nine pages with nothing but the title of the index and a url. Fortunately they are all in the same directory, which can be reached directly from the books page at the O'Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It's a minute flaw in what is a great book."
You can purchase CJKV Information Processing 2nd ed. from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
http://lmgtfy.com/?q=CJKV
Sent from your iPad.
CJK is a collective term for Chinese, Japanese, and Korean, which constitute the main East Asian languages. The term is used in the field of software and communications internationalization.
The term CJKV means CJK plus Vietnamese, which in the past used Hán t/Chinese characters and Ch Nôm prior to adopting Quc Ng.
http://en.wikipedia.org/wiki/CJK_characters
is likely a limitation of the use of FrameMaker to compose the document and an unwillingness to set up new styles to put them together (unfortunately O'Reilly hasn't use TeX for a title since _Making TeX Work_) and was probably let stand since they needed a particular page count to come out to even signatures anyway.
William
Sphinx of black quartz, judge my vow.
Cue the idiots saying that computers should only support English, because otherwise it allows those other people to isolate themselves from us/not get on with the program/just stop existing already...
Are you adequate?
When I was working on my JavaCC book I bought Jukka Korpela's Unicode Explained and it was *extremely* helpful. After reading it I actually felt comfortable using various tools to convert from one encoding to another, discussing multibyte character sets, and so forth. It helped me write the Unicode chapter in my book with some confidence. It was the first time I had used vi to enter Unicode characters... fun times.
That said, it sounds like "CJKV Information Processing" covers some of the same ground. Has anyone read both?
The Army reading list
I used the first ed years ago, and sure enough, Unicode, OTF, anf PDF dominate my world now. The only thing that is complicated enough to need additional exposition would be Arabic, with it's ability to not only combine RTL and LTR text (Hebrew does as well) but has to be shaped contextually.
-- John
It's going to have a big overlap, but the additional, crucially important material with CJKV processing is the non-Unicode encoding systems that have been used for those scripts, and the input methods that are used to enter the scripts into the computer. A general-purpose Unicode book will not go into a lot of depth about either of these topics.
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is also a very good -- but very much shorter -- introduction to Unicode.
http://www.joelonsoftware.com/articles/Unicode.html
I frequently send this to people that I need to work with who don't "get" it.
"Good news, everyone!"
why modded troll?
...nearly every week there will be a new O'Reilly book on something you've never heard of.
Please write article intros that state the topic in the 1st of 2nd sentence. Rules of writing suggest you describe acronyms at their FIRST usage. Otherwise it just sounds like media driven news outlets, "New killer virus threatens entire humanity, more on that after these messages."
Anyone???
"third-party, web-based HR system"
You fool, you've damned us all.
I own the first edition of CJKV but I find Fonts and encodings to be far more useful. Obviously if you are working heavily in any of these languages the 2nd best book is worth having but I'd say that F&E feels like a systematic treatment while CJKV feels like 1000 pages of webarticles on the topic.
Interesting slant.
you have to support Turkish! where simple things like
... path="C:\Program Files"
... path=path.toUpper
will cause PathDoesNotExistException.
You need to go through the whole code base and remove any case-changes that happen with the letter "i" or letter "I".
Because Turkish is the ONLY alphabet where the uppercase version of 7bit "i" has 8 bits! Undotted i
Actually Arabic (and Persian, using almost the same alphabet) isn't the only such case; there are lots of complicated issues with S Asian and SE Asian scripts (not to mention Mongolian, which like Arabic has initial, medial and final forms--but is, properly, written vertically).
.sig withheld by request
AFAIK it's not possible to do it in a 100% reliable fashion, but there are technical solutions where the file doesn't need to be read twice. Java, despite all of its flaws, handles this sort of thing pretty well, so I'll use that as an example.
In Java, there is a distinction between byte-based and character-based I/O. InputStream and OutputStream are byte-based I/O classes; Reader and Writer are character-based. Then you have classes like InputStreamReader that bridge the two worlds; an InputStreamReader is a reader that pulls bytes from an InputStream and passes them through a CharsetDecoder to converts them to the system's internal string representation (which is UTF-16).
So in Java, to read and validate the file in one pass, you just need to hook up your InputStream/InputStreamReader/CharsetDecoder pipeline so that the decoder throws an exception when the file does not conform to the encoding. This is one of various built-in strategies for CharsetDecoder; others are to ignore the invalid data and try to recover, or to insert some predefined character.
People coming from Perl or similar systems, when they see this for the first time, tend to think that this is much too complicated, especially when they notice all the associated extra classes like CoderResult and CodingErrorAction. It might be a bit more complicated than it needs to be, but certainly the best solution to reading characters from files is going to be more complex than what these people want.
Are you adequate?
Doh, nope! :) Actually, it's not the character for "wake" or "cause", i.e. okiru or okosu, but rather the character for "exceed" or "pass through", as in the Japanese words koeru or kosu.
The etsu part in Japanese is pronounced yuè in Mandarin Chinese (link). The "u" is kinda pinched in pronunciation, such that the sound isn't all too far from a tight viet pronunciation. I have a sneaking hunch the pronunciation in Cantonese would be even closer to the Vietnamese.
The nan or nam morpheme shows up in a lot of Asian place names, and usually means "south" -- Hainan, Vietnam, Nanjing, Shônan, etc.
So the name for Vietnam (as written in Chinese characters, at least) ultimately means something like "deep south" -- about right, geographically speaking, from a Chinese perspective. :)
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."