CJKV Information Processing 2nd ed.

Posted by samzenpus on Wednesday July 8, 2009 @06:00AM from the read-all-about-it dept.

stoolpigeon writes "At the end of last year, I made a move from an IT shop focused on supporting the US side of our business to a department that provides support to our operations outside the US. This was the first time I've worked in an international context, and I found myself, on a regular basis, running into long-time assumptions that were no longer true. My first project was implementing a third-party, web-based HR system for medium-sized offices. I found myself constantly missing important issues because I had such a narrow approach to the problem space. Sure, I've built applications and databases that supported Unicode, but I've never actually implemented anything with them but the same types of systems I'd built in the past with ASCII. But a large portion of the world's population is in Asia, and ASCII is certainly not going to cut it there. Fortunately, a new edition of Ken Lunde's classic CJKV Information Processing has become available, and it has really opened my eyes." Keep reading for the rest of JR's review.

CJKV Information Processing 2nd ed.
author: Ken Lunde
pages: 898
publisher: O'Reilly Media, Inc.
rating: 10/10
reviewer: JR Peck
ISBN: 978-0-596-51447-1
summary: Chinese, Japanese, Korean and Vietnamese computing.

CJKV Information Processing has a long history that goes back to the 1980s. It began as a simple text document, JAPAN.INF, available via FTP on a number of servers. This document was excerpted, refined and published as Lunde's first book in 1993, Understanding Japanese Information Processing. Shortly afterward JAPAN.INF became CJK.INF, and the foundation for the first edition of CJKV Information Processing was born. The first edition was published in 1999, and it is safe to say that a number of important things have changed over the last 10 years. In the preface, Lunde names four major developments that prompted this second edition: the emergence of Unicode, OpenType and the Portable Document Format (PDF) as preferred tools, and the maturing of the web to the point where it uses Unicode and handles a wider range of languages and their character sets.

Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, "...this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way..." Given the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, aside from a heavy bias toward Unicode, are extremely impressive. A glance over the table of contents shows just how true this is. Chapter 9, Information Processing Techniques, has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java, but the fact that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors, doesn't just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments, such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly, the word processor section covers AbiWord and KWord but not OpenOffice.org. The point stands: for anyone looking to support CJKV, this book will probably cover your platform and give you, at the very least, a starting point with your chosen tool set.

That said, an extremely specific implementation is not what Lunde is out to offer up. This is the very opposite of a 'cook book' approach. This also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV but the principles will hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics as well as pointing out areas of concern and where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times downright staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.

The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered, and the character set standards that exist. This was a fascinating glimpse, once again, into CJKV languages and into how other languages are dealt with as well. I think there is even a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer who will be working with one of the languages. That's only the first quarter of the book, so I don't know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde's approach.

The style is very readable, but I wouldn't just hand this to someone who didn't have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience I had was very similar to when I studied ancient Greek in school. Learning Greek, I learned much more about English grammar than I had ever picked up prior. Reading CJKV Information Processing, I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically I've found that Lunde's work is indispensable. It is not just my go-to reference, it's essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.

There are thirteen appendixes, including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a URL pointing to a PDF with the information. It seemed odd, but each URL gets its own page. This means there are nine pages with nothing but the title of the appendix and a URL. Fortunately they are all in the same directory, which can be reached directly from the book's page at the O'Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It's a minute flaw in what is a great book."

You can purchase CJKV Information Processing 2nd ed. from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

52 comments

QUE? by Em+Emalb · 2009-07-08 06:07 · Score: 2, Funny

http://lmgtfy.com/?q=CJKV

--
Sent from your iPad.
1. Re:QUE? by ManuelH · 2009-07-08 07:10 · Score: 1
  
  Or CJKV
  
  --
  Mother used to said If you want you find a way But mother never danced through fire shower
CJKV is.... by ForexCoder · 2009-07-08 06:13 · Score: 5, Informative

CJK is a collective term for Chinese, Japanese, and Korean, which constitute the main East Asian languages. The term is used in the field of software and communications internationalization.
The term CJKV means CJK plus Vietnamese, which in the past used Hán tự/Chinese characters and Chữ Nôm prior to adopting Quốc Ngữ.
http://en.wikipedia.org/wiki/CJK_characters
1. Re:CJKV is.... by Anonymous Coward · 2009-07-08 12:02 · Score: 0
  
  This post is far more informative than the fucking review was.
2. Re:CJKV is.... by kumanopuusan · 2009-07-08 13:56 · Score: 1
  
  Yeah, I was wondering about the characters on the cover, behind the CJKV.
  
  I was familiar with the first three, the kanji used to represent China (the character for middle), Japan (the character for the sun) and Korea (the character for... Korea), but I didn't realize the last one was the character for Vietnam. It normally means to wake or cause. FWIW, the old name for Vietnam in Japanese seems to be "etsunan", which I guess is pretty close phonetically.
  
  --
  Use of the words "good", "bad" or "evil" is almost invariably the result of oversimplification.
3. Re:CJKV is.... by karlconnors · 2009-07-08 14:39 · Score: 1
  
  Thanks!
  And to think someone thought that CJKV was a programming language :)
one page min. per index / appendix / chapter by WillAdams · 2009-07-08 06:19 · Score: 3, Interesting

is likely a limitation of the use of FrameMaker to compose the document and an unwillingness to set up new styles to put them together (unfortunately O'Reilly hasn't used TeX for a title since _Making TeX Work_) and was probably let stand since they needed a particular page count to come out to even signatures anyway.
William

--
Sphinx of black quartz, judge my vow.
1. Re:one page min. per index / appendix / chapter by Anonymous Coward · 2009-07-08 09:08 · Score: 0
  
  Actually, the second edition used InDesign: http://unicode.org/mail-arch/unicode-ml/y2009-m01/0063.html
2. Re:one page min. per index / appendix / chapter by WillAdams · 2009-07-09 00:02 · Score: 1
  
  Just after I posted I wondered if O'Reilly was still so wedded to FM and wished I could've taken the time to research the matter.
  Thanks for correcting my wrong assumption and setting the record straight.
  William
  
  --
  Sphinx of black quartz, judge my vow.
Oh boy... by Estanislao+Martínez · 2009-07-08 06:31 · Score: 0, Offtopic

Cue the idiots saying that computers should only support English, because otherwise it allows those other people to isolate themselves from us/not get on with the program/just stop existing already...

--
Are you adequate?
1. Re:Oh boy... by Anonymous Coward · 2009-07-08 06:38 · Score: 0
  
  > Cue the idiots saying that computers should only support English, because otherwise it allows those other people to isolate themselves from us/not get on with the program/just stop existing already...
  computers should only support English, because otherwise it allows those other people to isolate themselves from us/not get on with the program/just stop existing already...
2. Re:Oh boy... by Octorian · 2009-07-08 07:24 · Score: 1
  
  Sure, that was intended as a joke, but a lot of protocols in the computer world sure feel like they were invented with the assumption that everyone communicated in nothing but 7-bit US-ASCII.
  After all, why else would we need Quoted-Printable and Base64 encoding, which let you put non-7bit data into 7-bit US-ASCII?
  And then we have character sets. It's a total mess. It started (most likely) with US-ASCII, and eventually ended up at the all-encompassing Unicode. But along the way, we gained dozens of "legacy character sets" that are inconsistently supported and no one wants to use, but are still output by plenty of software.
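To make the Base64 point concrete, here is a minimal Java sketch. It assumes the text is first serialized as UTF-8 (mail software of the era often used other charsets), and the class and method names are made up for illustration. It only shows that the encoded form stays inside 7-bit US-ASCII and survives a round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only.
class SevenBitDemo {
    // Serialize the text as UTF-8 (an assumption) and wrap the bytes in Base64.
    static String toBase64(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Reverse the wrapping: Base64 back to bytes, bytes back to text.
    static String fromBase64(String encoded) {
        byte[] utf8 = Base64.getDecoder().decode(encoded);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // True when every UTF-16 code unit of the string is below 0x80.
    static boolean isSevenBit(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // three kanji, written as escapes
        String wire = toBase64(original);
        System.out.println(wire);
        System.out.println("7-bit safe: " + isSevenBit(wire));
        System.out.println("round trip ok: " + fromBase64(wire).equals(original));
    }
}
```

Quoted-Printable does the same job with a different trade-off: it leaves ASCII text readable and only escapes the bytes that do not fit.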
3. Re:Oh boy... by SL+Baur · 2009-07-08 09:04 · Score: 2, Interesting
  
  I'm not sure why this was modded offtopic.
  s/English/ASCII/ and I got plenty of complaints along those lines in my mailbox over the years. Supporting Asian languages can be expensive in terms of processing time. Japanese companies *can* be insular, been there done that. I have no experience with the CKV part.
  Fortunately the state of the art in computing hardware has improved over the years and it's not as expensive as it used to be.
  Their English web presence leaves something to be desired, but I agree with their mission statement - http://www.m17n.org/index.html Those are the guys who first did Asian language support for emacs. I worked with them for a year in Japan.
Overlaps with "Unicode Explained"? by tcopeland · 2009-07-08 06:33 · Score: 3, Interesting

When I was working on my JavaCC book I bought Jukka Korpela's Unicode Explained and it was *extremely* helpful. After reading it I actually felt comfortable using various tools to convert from one encoding to another, discussing multibyte character sets, and so forth. It helped me write the Unicode chapter in my book with some confidence. It was the first time I had used vi to enter Unicode characters... fun times.
That said, it sounds like "CJKV Information Processing" covers some of the same ground. Has anyone read both?

--
The Army reading list
1. Re:Overlaps with "Unicode Explained"? by slarrg · 2009-07-08 07:51 · Score: 1
  
  Gee, my methods were different than most: I married a Ukrainian woman. Having a wife who knows several languages, each with different 8-bit encodings, using computers in your house on a daily basis makes you appreciate Unicode in a hurry.
2. Re:Overlaps with "Unicode Explained"? by tcopeland · 2009-07-08 07:58 · Score: 1
  
  > Gee, my methods were different than most: I married a Ukrainian woman.
  Hehe, yeah, actually, my wife is Romanian, so all my JavaCC Unicode examples involve s-with-cedilla and stuff like that :-) Bună ziua!
  
  --
  The Army reading list
Great Book. Could use an Arabic supplement. by jholder · 2009-07-08 06:37 · Score: 3, Informative

I used the first ed years ago, and sure enough, Unicode, OTF, and PDF dominate my world now. The only thing that is complicated enough to need additional exposition would be Arabic, which not only combines RTL and LTR text (Hebrew does as well) but also has to be shaped contextually.

--
-- John
Overlaps, sure; equivalent, I don't think so. by Anonymous Coward · 2009-07-08 06:40 · Score: 3, Insightful

It's going to have a big overlap, but the additional, crucially important material with CJKV processing is the non-Unicode encoding systems that have been used for those scripts, and the input methods that are used to enter the scripts into the computer. A general-purpose Unicode book will not go into a lot of depth about either of these topics.
1. Re:Overlaps, sure; equivalent, I don't think so. by Anonymous Coward · 2009-07-08 23:02 · Score: 0
  
  gconv( "moonspeak", in, "utf-8", out );
The Absolute Minimum..." by KlaymenDK · 2009-07-08 06:44 · Score: 5, Informative

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is also a very good -- but very much shorter -- introduction to Unicode.
http://www.joelonsoftware.com/articles/Unicode.html
I frequently send this to people that I need to work with who don't "get" it.

--
"Good news, everyone!"
1. Re:The Absolute Minimum..." by bcrowell · 2009-07-08 10:11 · Score: 4, Interesting
  
  Nice article -- thanks for providing the link! I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
  This is not a hard problem to solve in the case of email and web pages, which can have encoding given in headers. (If you validate your page using the w3c validator, it will warn you if you didn't supply an encoding.) It's also not an insanely hard problem for strings in memory; the encoding can be either set by your encoding convention or handled behind the scenes by your language (as in perl).
  What really sucks is files. For instance, I wrote this extremely simple terminal-based personal calendar program in perl, and it's actually attracted a decent number of users. It's internationalized in 11 languages. Well, one day a user sends me an email complaining that the program is giving him mysterious error messages. He sends me his calendar file, which is a plain text file with some Swedish in it. I run the program on my machine with his calendar file, and it works fine. I can't reproduce the bug. We go through a few rounds of confused communication before I finally realize that he must have had the file encoded in Latin-1 on his end, whereas my program is documented as requiring utf-8. So now my program has to include the following cruft:
  
  sub file_is_valid_utf8 {
    my $f = shift;
    open(F, "<:raw", $f) or return 0;
    local $/;
    my $x = <F>;
    close F;
    return is_valid_utf8($x);
  }

  # What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
  # That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
  sub is_valid_utf8 {
    my $x = shift;
    return utf8::decode(my $dummy = $x);
  }
  
  Yech. It requires reading the file twice, and it's not even 100% reliable.
  This is the kind of situation where the Unix philosophy, based on plain text files and little programs that read and write them, really runs into a problem. With hindsight, it would have been really, really helpful if Unix filesystems could have included just a smidgen more metadata, enough to specify the character encoding.
  
  --
  Find free books.
2. Re:The Absolute Minimum..." by ld+a,b · 2009-07-08 12:18 · Score: 1
  
  If you work with them it is easier, hopefully you can try to get them fired or at least coerced into doing it right.
  With free software programmed by volunteers it is even worse. Many such volunteers are great coders but they come from ASCII countries and as such don't "get" why tail should perform worse than it used to, or why they should care about character width instead of strlen, or why they should update an algorithm they borrowed from K&R 30 years ago.
  Truth is, with UTF-8 while you lose the convenient at times 1:1 char/character equivalence, most of your legacy code can remain unchanged because of its great design.
  Only very badly hard-coded routines will need any significant investment to convert, while most applications can be adapted in days by someone who has any idea about encodings.
  People, you really need to learn about UTF-8, UTF-16(+UCS-2) and UTF-32 and the relevant library functions in your platform.
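One reason so much legacy byte-oriented code survives UTF-8 is that every byte of a multi-byte sequence has its high bit set, so a scan for an ASCII byte such as '/' cannot match inside a non-ASCII character. The Java sketch below illustrates that property; the class name and the sample path are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class AsciiSafeUtf8 {
    // Legacy-style scan: count how often one ASCII byte occurs in raw data.
    static int countByte(byte[] data, char ascii) {
        int n = 0;
        for (byte b : data) {
            if (b == (byte) ascii) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // A path with kanji and katakana components, written as escapes.
        String path = "/home/\u65E5\u672C/\u30D5\u30A1\u30A4\u30EB";
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);
        // The byte-level scan finds exactly the three real separators.
        System.out.println(countByte(utf8, '/'));
    }
}
```

Some older multi-byte encodings do not have this property. In Shift_JIS, for example, a trail byte can equal 0x5C, the ASCII backslash, which is one reason naive byte scans were riskier there.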
  
  --
  10 little-endian boys went out to dine, a big-endian carp ate one, and then there were -246.
3. Re:The Absolute Minimum..." by simcop2387 · 2009-07-08 12:38 · Score: 1
  
  you could also always open the file for reading and writing with the utf8 encoding, that way it wouldn't matter what the user sets up for their environment.
4. Re:The Absolute Minimum..." by spitzak · 2009-07-08 13:11 · Score: 2, Interesting
  
  What you are encountering is a typical moron implementation of UTF-8.
  For some reason otherwise intelligent programmers lose their minds when presented with UTF-8. They act as though the program will crash instantly if they ever make a pointer that points at the middle of a character, or if they fail to correctly count the "characters" in a string and dare to use an offset or number of bytes. I am not really certain what causes these diseases but being exposed to decades of character==byte ASCII programming seems responsible.
  One way I try to correct this is to get them to think about "words" the same way they are thinking about "characters". Do they panic that there is not a fast method of moving by N words? Do they panic that it is possible to split a string in the middle of a word and thus produce two incorrectly-spelled words? Do they think that copying text in fixed-sized blocks from one location to another will somehow garble it because at the midpoint a word got split into two parts? No, not if they have any brains at all. However for some reason when they see multibyte characters all sense goes out the window.
  Here is how you solve it: you keep the text as UTF-8 and you treat it as an array of BYTES!. Smoke will NOT come out of your computer because you don't continuously think about the "characters", in fact it will, amazingly enough, be remembered and the bits will not change because you failed to continuously look at the character boundaries or you dared to not count how many there were! When it comes time to display it, you parse out each UTF-8 character and draw it on the screen (and also do all that complex Pango-like layout). At the same time, through an amazing ability of the UTF-8 decoder to recognize that it can't decode something, you will, FOR FREE, find the errors. You can then render the bytes of the error sequence in another way, perhaps by choosing the matching character from CP1252.
  This completely avoids the need for "metadata" and "BOM" and all that other crap, and magically works when a user accidentally pastes text from different encodings together, something that no metadata can ever solve.
  This isn't rocket science or magic, but for some reason it appears to be for a lot of people. You included, and many many other intelligent people. Come on, everybody, please think a little!
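The decode-late, fall-back-on-error idea can be sketched in Java roughly as follows. This is one reading of the approach described above, not the poster's actual code. It assumes the JDK provides the windows-1252 charset (common JDKs do, but only a handful of charsets are guaranteed), and the class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class LenientUtf8 {
    // Decode bytes as UTF-8; bytes that are not valid UTF-8 are
    // reinterpreted one at a time as windows-1252 (assumed available).
    static String decode(byte[] bytes) {
        Charset fallback = Charset.forName("windows-1252");
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never yields more chars than bytes, so this capacity suffices.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        while (true) {
            CoderResult result = utf8.decode(in, out, true);
            if (!result.isError()) {
                break; // all input consumed
            }
            // The decoder stopped in front of the bad bytes; map each one
            // through the fallback charset, then resume UTF-8 decoding.
            for (int i = 0; i < result.length(); i++) {
                byte[] one = { in.get() };
                out.put(new String(one, fallback));
            }
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] mixed = { 'c', 'a', 'f', (byte) 0xE9 }; // last byte is CP1252, not UTF-8
        System.out.println(decode(mixed).equals("caf\u00E9"));
    }
}
```

The stored bytes are never modified; the fallback only affects how they are shown, which matches the point about treating the text as an array of bytes until display time.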
5. Re:The Absolute Minimum..." by oasisbob · 2009-07-08 14:58 · Score: 1
  
  > What really sucks is files.
  Indeed. Which is why Bush hid the facts.
6. Re:The Absolute Minimum..." by david.given · 2009-07-08 16:37 · Score: 2, Interesting
  
  > I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
  My Unicode mantra is:
  "You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you."
  This is because a Unicode printable thing can span multiple bytes and multiple code points. You can't find the nth character in a string, firstly because Unicode doesn't really have such a concept as a character, and secondly because you don't know where it is. This Java code:
  char c = s.charAt(4);
  ...doesn't do what people think it does --- it returns the UTF-16 code unit thingamajig at index 4 (the fifth one, since indexing starts at 0), which may actually contain only part of a code point, and that code point may actually only contain part of a glyph, and trying to string slice without first checking you're at the end of a glyph is going to cause people from countries that use combining characters to hate you, because your app will break.
  So in essence, in order to manipulate strings, you need to step through them from one glyph to the next, each of which may occupy an arbitrary number of bytes. So you might as well use UTF-8.
  A while back I wrote a word processor using this technique: WordGrinder. It worked surprisingly well; the whole thing is 6300 lines of code and the first version took a month to write. I'll admit that I chickened out with RTL and entry and display of combining characters, but the text storage core can cope with them just fine.
  But it does require a rather different philosophy for managing text than in the good old ASCII days, which is a pain in the arse sometimes...
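The three layers mentioned above (UTF-16 code units, code points, and user-visible glyphs) can be counted separately in Java. The sketch below uses java.text.BreakIterator as an approximation of glyph boundaries; the class name is made up, and how closely BreakIterator tracks the Unicode grapheme rules varies between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class TextUnits {
    // Number of UTF-16 code units, which is what charAt() indexes.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "e" + combining acute accent, then U+20BB7 as a surrogate pair.
        String s = "e\u0301\uD842\uDFB7";
        System.out.println(codeUnits(s) + " code units");
        System.out.println(codePoints(s) + " code points");
        System.out.println(graphemes(s) + " graphemes");
        // charAt(2) lands on half of the surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(2)));
    }
}
```

For this sample string the three counts differ (4, 3 and 2), which is why slicing by index is only safe at boundaries reported by an iterator of this kind.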
7. Re:The Absolute Minimum..." by bcrowell · 2009-07-09 03:50 · Score: 1
  
  > You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you.
  
  It's been interesting reading different people's replies to my post. One thing I've noticed is that each of us is talking about the language he's most familiar with. I was writing about a situation I encountered with perl. You're talking about java. Other people are talking about C.
  Your comment applies to java but not to perl. In perl, you really can do random access on strings. All the internal details of the implementation are hidden, but it really does Just Work, under certain conditions. The main condition is that when you first create the string, e.g., by reading from a file, perl has to know what the encoding is. If you don't tell perl what the encoding is, it can misinterpret the encoding. For instance, if you try to read latin-1, but it's actually a unicode encoding such as utf-8, you're going to get errors or garbled data. I think this is a point that a lot of people replying to my post didn't understand. Note that latin-1 is not a type of unicode -- it's a completely different character encoding than unicode.
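  The same mismatch is easy to reproduce outside perl. The following Java sketch (class and method names are made up for illustration) decodes the same bytes with the wrong charset in both directions, using the Swedish letter å as the sample.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Mojibake {
    // Interpret raw bytes under a caller-chosen charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String swedish = "\u00E5"; // the letter a-with-ring

        // UTF-8 bytes read as Latin-1: one letter turns into two junk characters.
        byte[] utf8 = swedish.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAs(utf8, StandardCharsets.ISO_8859_1).length());

        // Latin-1 bytes read as UTF-8: the lone byte is invalid, so
        // new String(...) substitutes U+FFFD, the replacement character.
        byte[] latin1 = swedish.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeAs(latin1, StandardCharsets.UTF_8).equals("\uFFFD"));
    }
}
```

  Neither direction raises an error with these convenience constructors, which is part of why the problem tends to surface late, as garbled text or as a confusing failure somewhere downstream.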
  
  --
  Find free books.
8. Re:The Absolute Minimum..." by david.given · 2009-07-09 23:15 · Score: 1
  
  Actually, I do it mostly in C --- I picked Java for that example because it has a really simple example of getting it wrong.
  And when you say Perl supports random access of Unicode strings, are you sure it's not just giving you random access to an array of Unicode code points --- which is also wrong? Remember that a single Unicode glyph can be made up of an arbitrary number of code points.
  Even in European languages, trying to split a string between the combining accent code point and the base character code point will have really weird results. In Asian languages things just get worse. The only sensible thing to do is to treat each glyph as an atomic substring of variable length. Which, of course, means you don't know where they are in the string...
9. Re:The Absolute Minimum..." by bcrowell · 2009-07-10 03:58 · Score: 1
  
  > And when you say Perl supports random access of Unicode strings, are you sure it's not just giving you random access to an array of Unicode code points --- which is also wrong? Remember that a single Unicode glyph can be made up of an arbitrary number of code points.
  
  Interesting point. Some documentation: man perlunicode, man perluniintro, Unicode::Normalize. I spent some time studying these, and concluded that I didn't understand enough to answer your question :-)
  
  --
  Find free books.
uh by Anonymous Coward · 2009-07-08 07:09 · Score: 1, Insightful

why modded troll?
Spungo's Law says... by spungo · 2009-07-08 07:15 · Score: 1

...nearly every week there will be a new O'Reilly book on something you've never heard of.
1. Re:Spungo's Law says... by karlconnors · 2009-07-08 14:36 · Score: 1
  
  I love that law, very funny!!
Seriously, state the topic in 1st or 2nd sentence by Anonymous Coward · 2009-07-08 07:20 · Score: 0

Please write article intros that state the topic in the 1st or 2nd sentence. Rules of writing suggest you describe acronyms at their FIRST usage. Otherwise it just sounds like media driven news outlets, "New killer virus threatens entire humanity, more on that after these messages."
How's this different from ASDF processing? by robert899 · 2009-07-08 07:28 · Score: 1

Anyone???
1. Re:How's this different from ASDF processing? by radtea · 2009-07-08 07:53 · Score: 1
  
  It's well known that CJKV is more like QPZA than ASDF, although TYRX process is probably better documented than either.
  Recent developments in RWRI technology have seen a lot of uptake by the IRWR community, leading some to believe that ASDF is on its way out entirely.
  That's a completely clear and informative SUMMARY of the issue, right?
  
  --
  Blasphemy is a human right. Blasphemophobia kills.
2. Re:How's this different from ASDF processing? by Anonymous Coward · 2009-07-08 09:10 · Score: 0
  
  Perhaps ASDF = American, Spanish, Dominican Republic, French languages. CJKV = Chinese, Japanese, Korean, Vietnamese languages (the hard ones to implement on a computer).
aka the Seventh Seal by Anonymous Coward · 2009-07-08 08:07 · Score: 0

"third-party, web-based HR system"
You fool, you've damned us all.
Fonts and encoding by jbolden · 2009-07-08 08:25 · Score: 3, Interesting

I own the first edition of CJKV but I find Fonts & Encodings to be far more useful. Obviously if you are working heavily in any of these languages the 2nd best book is worth having, but I'd say that F&E feels like a systematic treatment while CJKV feels like 1000 pages of web articles on the topic.
1. Re:Fonts and encoding by jholder · 2009-07-08 11:19 · Score: 1
  
  This looks really good, I'd definitely have gotten this book if it existed back when I had to learn all this stuff starting in 1997-2000.
  
  --
  -- John
"and it has really opened my eyes" by Anonymous Coward · 2009-07-08 09:29 · Score: 2, Funny

Interesting slant.
And just when you think you mastered it all.... by idji · 2009-07-08 10:51 · Score: 1

you have to support Turkish! where simple things like
... path="C:\Program Files"
... path=path.toUpper
will cause PathDoesNotExistException.

You need to go through the whole code base and remove any case-changes that happen with the letter "i" or letter "I".
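Because Turkish (along with a few related alphabets such as Azerbaijani) is the rare case where the uppercase version of 7-bit "i" is not 7-bit: it is the dotted capital İ (U+0130), and the lowercase of "I" is the undotted ı (U+0131).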
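In Java the effect is easy to reproduce, because String.toUpperCase() without arguments uses the default locale. A minimal sketch follows, with a made-up class name; it assumes the identifier or path should be cased the same way on every machine, which is why it passes Locale.ROOT explicitly.

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class TurkishCase {
    // Upper-case using an explicit locale instead of the platform default.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String path = "C:\\Program Files";

        // Under Turkish rules "i" becomes dotted capital I (U+0130).
        System.out.println(upper(path, turkish).equals("C:\\PROGRAM F\u0130LES"));

        // Locale.ROOT gives the locale-neutral result on every machine.
        System.out.println(upper(path, Locale.ROOT).equals("C:\\PROGRAM FILES"));
    }
}
```

Auditing a code base for this usually means searching for toUpperCase() and toLowerCase() calls that take no locale argument, then deciding for each whether the text is user-facing (locale-sensitive) or machine-facing (Locale.ROOT).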
1. Re:And just when you think you mastered it all.... by corsec67 · 2009-07-08 12:38 · Score: 2, Insightful
  
  Changing the case of a path SHOULD cause it to refer to a different path.
  Here is 5 cents, go buy yourself a better computer.
  
  --
  If I have nothing to hide, don't search me
2. Re:And just when you think you mastered it all.... by Anonymous Coward · 2009-07-08 15:28 · Score: 0
  
  You shouldn't be upper-casing Windows long pathnames - they can't be treated like DOS 8.3 pathnames. It (nearly always) works by accident on western systems, but that doesn't make it right. Indeed, you shouldn't be performing *any* case conversions on long pathnames, ever. Note that NTFS supports Unicode file/folder names, so you could run into similar problems on western systems, too.
  Even if you have a long pathname where a folder/file name is "short" but has lowercase letters (e.g. C:\HASiANDu\), use Windows API functions to translate between long & short pathnames. Sloppy shortcuts like upper-casing pathnames should be left in the 16-bit DOS era where they belong.
  Hmmm...why would you upper-case a long pathname anyway?
  - T
3. Re:And just when you think you mastered it all.... by idji · 2009-07-09 19:12 · Score: 1
  
  Only if you are in the *nix world; on Windows that is not so. It has nothing to do with a new computer. Turkey is largely an HP + Microsoft world in government and large business.
Re:Great Book. Could use an Arabic supplement. by brusk · 2009-07-08 12:43 · Score: 1

Actually Arabic (and Persian, using almost the same alphabet) isn't the only such case; there are lots of complicated issues with S Asian and SE Asian scripts (not to mention Mongolian, which like Arabic has initial, medial and final forms--but is, properly, written vertically).

--
.sig withheld by request
Java I/O is better here. by Estanislao+Martínez · 2009-07-08 12:46 · Score: 1

> Yech. It requires reading the file twice, and it's not even 100% reliable.
AFAIK it's not possible to do it in a 100% reliable fashion, but there are technical solutions where the file doesn't need to be read twice. Java, despite all of its flaws, handles this sort of thing pretty well, so I'll use that as an example.
In Java, there is a distinction between byte-based and character-based I/O. InputStream and OutputStream are byte-based I/O classes; Reader and Writer are character-based. Then you have classes like InputStreamReader that bridge the two worlds; an InputStreamReader is a reader that pulls bytes from an InputStream and passes them through a CharsetDecoder to convert them to the system's internal string representation (which is UTF-16).
So in Java, to read and validate the file in one pass, you just need to hook up your InputStream/InputStreamReader/CharsetDecoder pipeline so that the decoder throws an exception when the file does not conform to the encoding. This is one of various built-in strategies for CharsetDecoder; others are to ignore the invalid data and try to recover, or to insert some predefined character.
People coming from Perl or similar systems, when they see this for the first time, tend to think that this is much too complicated, especially when they notice all the associated extra classes like CoderResult and CodingErrorAction. It might be a bit more complicated than it needs to be, but certainly the best solution to reading characters from files is going to be more complex than what these people want.
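The single-pass pipeline described above might be wired up roughly like this. It is a sketch under stated assumptions: the class and method names are made up, the expected encoding is taken to be UTF-8, and the whole stream is read into memory for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class StrictUtf8Reader {
    // Read the stream as UTF-8 and fail with a CharacterCodingException
    // (an IOException subclass) as soon as invalid bytes are encountered.
    static String readStrict(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { 'f', (byte) 0xE5, 'r' }; // a Swedish word saved as Latin-1
        try {
            readStrict(new ByteArrayInputStream(latin1));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

Swapping CodingErrorAction.REPORT for REPLACE or IGNORE selects the other built-in strategies mentioned above without changing the rest of the pipeline.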

--
Are you adequate?
Hanzi/Kanji for "Viet", and other trivia by zooblethorpe · 2009-07-08 14:45 · Score: 1

> ...but I didn't realize the last one was the character for Vietnam. It normally means to wake or cause.

Doh, nope! :) Actually, it's not the character for "wake" or "cause", i.e. okiru or okosu, but rather the character for "exceed" or "pass through", as in the Japanese words koeru or kosu.

> FWIW, the old name for Vietnam in Japanese seems to be "etsunan", which I guess is pretty close phonetically.
The etsu part in Japanese is pronounced yuè in Mandarin Chinese. The "u" is kinda pinched in pronunciation, such that the sound isn't all too far from a tight viet pronunciation. I have a sneaking hunch the pronunciation in Cantonese would be even closer to the Vietnamese.
The nan or nam morpheme shows up in a lot of Asian place names, and usually means "south" -- Hainan, Vietnam, Nanjing, Shônan, etc.
So the name for Vietnam (as written in Chinese characters, at least) ultimately means something like "deep south" -- about right, geographically speaking, from a Chinese perspective. :)
Cheers,

--
"What in the name of Fats Waller is that?"
"A four-foot prune."
1. Re:Hanzi/Kanji for "Viet", and other trivia by kumanopuusan · 2009-07-08 15:03 · Score: 1
  
  Yeah, of course you're right. Is it a feeble excuse to say that I'm used to reading with okurigana? :-(
  
  --
  Use of the words "good", "bad" or "evil" is almost invariably the result of oversimplification.
2. Re:Hanzi/Kanji for "Viet", and other trivia by zooblethorpe · 2009-07-08 15:57 · Score: 1
  
  Well, FWIW, the koeru kanji looks not too far from the okiru kanji; they both have the same bushu or radical, the bit going down the left and extending across underneath, which happens to be one of the larger bushu too. :) And, for that matter, there are two kanji used for koeru / kosu, one with the on reading (i.e., the reading(s) generally used in compounds and that came originally from Chinese) of etsu, and the other read as chô, as in chô kawaii!
  So no worries, hey, it's Japanese. Whee!
  Cheers,
  
  --
  "What in the name of Fats Waller is that?"
  "A four-foot prune."
3. Re:Hanzi/Kanji for "Viet", and other trivia by Anonymous Coward · 2009-07-08 19:06 · Score: 0
  
  In this context, "Viet" is an old name for a region of China; it does not mean "exceed".
  Therefore Vietnam means something like "to the south of China".