Unicode and the Unix Console?

Posted by Cliff on Thursday December 19, 2002 @12:10PM from the moving-beyond-ASCII dept.

Phactorial asks: "In their current state, most UNIX consoles I have dealt with (not graphical terminal emulators; mlterm is out for this) do not handle Unicode properly. This is essential when dealing with languages that require characters outside the ASCII set. I was wondering if anyone out there is developing a solution for non-Linux platforms. I know the Arabeyes project is currently working on a project called 'Akka', which provides UTF-8 (kinda) support and even shaping and bidirectional text (essential for many languages in the East; the program works fine and I am working on getting a FreeBSD port out). However, I was pondering: how are other UNIX consoles doing? Do any of them fully support Unicode, even bidirectional characters? Shaping? (A great many of today's UNIX applications lack many if not all of these ;(). If you know of such applications or are working on support for a platform, could you give feedback on your experiences and thoughts on the current state of the UNIX console?"

2 of 57 comments

Uhh... by jensend · 2002-12-19 13:06 · Score: 5, Insightful

From text of question:

(not graphical terminal emulators, mlterm is out for this)
I was wondering if anyone out there is developing a solution for non-Linux platforms.

The answer "Sure, there's this graphical terminal emulator in a recent Linux distro!" seems somewhat inappropriate to the question.

Nonsense by GCP · 2002-12-19 19:31 · Score: 5, Insightful

Not flamebait, just nonsense.

Unix isn't byte based, it's text based. Of course one layer deeper, it's byte based, but so is every other OS, and below that it's transistor based, etc.

What distinguishes Unix from other OSes is an emphasis on working in text with text utilities, often thru windows (telnet clients) on other machines -- windows whose only supported datatype is text.

In Unix, as in XML, text is sort of considered the ultimate data type. Bytes are just the medium used to represent the text under the surface. If the bytes were what mattered, people would usually work in a hex editor and do hex I/O, but they don't. They work at the text level of abstraction most of the time. It's the text that matters, not the bytes used to digitize it.

For text to reach its full potential, though, you have to say goodbye to grampa's ASCII and move on to a rich, universal form of text: Unicode. It's ludicrous for someone to say that speakers of non-Western languages should never have the ability to use the full range of Unix the way Westerners can. People who make comments like that are usually unaware of the problems that even English speakers have with single-byte encodings. (The second most powerful currency on earth is the Euro. Where is the Euro sign in Latin-1? Where are the curly quotes used by almost all English-language press? What happens when a press release destined for Time Magazine gets piped thru a series of single-byte Unix utilities? Undefined!)
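To make the Latin-1 point concrete, here is a minimal Python sketch. The helper names `fits_latin1` and `truncate_bytes` are hypothetical, and the truncating function only stands in for a generic byte-oriented tool; it is an assumption for illustration and does not model any specific Unix utility.

```python
# Characters mentioned above: the euro sign and the curly quotes.
SAMPLES = {
    "euro sign": "\u20ac",
    "left curly quote": "\u201c",
    "right curly quote": "\u201d",
}


def fits_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in ISO 8859-1 (Latin-1)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def truncate_bytes(data: bytes, n: int) -> bytes:
    """Hypothetical byte-oriented tool: keep the first n bytes, ignoring characters."""
    return data[:n]


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        # None of these characters has a Latin-1 code point; each takes 3 bytes in UTF-8.
        print(name, "| in Latin-1:", fits_latin1(ch), "| UTF-8:", ch.encode("utf-8").hex())

    # "Price: " is 7 bytes, so cutting at byte 8 splits the 3-byte euro sign.
    cut = truncate_bytes("Price: \u20ac5".encode("utf-8"), 8)
    print(cut.decode("utf-8", errors="replace"))
```

The last line shows the kind of damage a tool that counts bytes rather than characters can do to UTF-8 text: the truncated sequence no longer decodes cleanly, so a replacement character appears where the euro sign was.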

XML, another system that considers text the universal data type, is Unicode based. They understand the concept of "universal". Same for HTML now. More and more Web pages are going to UTF-8, even for English, to avoid weird problems with Macs vs. PCs, Euro signs, curly quotes, embedded non-English text, etc. Are such pages really supposed to be out of reach of standard Unix utilities?

Java and .NET are 100% Unicode. Windows and Macintosh are now all Unicode based.

IETF and W3C have made it clear that no non-Unicode-based text protocols will be considered from now on.

Oracle is recommending Unicode as the format for all database text for new databases. So what happens when you cripple Unix so that it can't handle Oracle data in default form?

AT&T considers Unicode the future of Unix (cf. Plan 9), Sun has made the conversion to full Unicode fundamental to the future of Solaris, and as we speak the Free Standards Group is preparing to do the same for an upcoming version of the LSB (Linux Standard Base) common core that all major Linux vendors have committed to.

It's unfortunate that so many Unix users still think that ASCII was good enough for grampa, so it should be good enough for every Unix user on earth from now on, but fortunately those who drive the standards have abandoned that kind of thinking forever.

--
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."