NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
utf-8
If masochist, just UTF-16. If slashdot coder use ASCII.
who cares? This only affects naming your character and displaying stuff on the map.
I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same (or allow it as a compile time option at the very least) and just use UTF for the character name.
For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
Little Brother, watching the watchers
First off, UTF-32 is least likely to cause bugs, since all chars are the same length, so you can determine memory usage simply by multiplying the char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.
That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.
Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.
The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
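To make the 4-byte point concrete, here is a minimal sketch of an encoder that covers the full range up to U+10FFFF. The name utf8_encode is made up for illustration and is not from the NetHack source; the sketch assumes the caller supplies a 4-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[4].
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* CESU-8 writes supplementary characters as two 3-byte encoded
         * surrogates; well-formed UTF-8 never encodes a surrogate. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* The 4-byte form: everything outside the BMP lands here. */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x1F600, buf) should fill buf with a 4-byte sequence, which is exactly the case a 3-byte cap would lose.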
Morphing Software
UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
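A small sketch of that difference, using the word "naïve" as sample data. The array names and the helper bytes_seen_by_strlen are hypothetical, picked only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "naive" with a diaeresis, in UTF-8: U+00EF becomes the two bytes
 * C3 AF, and no byte in the string is zero. */
static const char name_utf8[] = "na\xC3\xAFve";

/* The same text as UTF-16LE bytes: every character here carries a
 * 0x00 high byte, so a zero byte appears right after the first letter. */
static const char name_utf16le[] = {
    'n', 0, 'a', 0, '\xEF', 0, 'v', 0, 'e', 0, 0, 0
};

/* Hypothetical helper: how many bytes legacy C string code would see. */
size_t bytes_seen_by_strlen(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

With these buffers, bytes_seen_by_strlen(name_utf8) covers the whole string (6 bytes), while bytes_seen_by_strlen(name_utf16le) stops after the first byte. Searching for an ASCII byte with strchr also keeps working on the UTF-8 buffer, because ASCII values never appear inside a multi-byte sequence.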
UTF-32 should make memory allocation more predictable as every character is guaranteed to be 32 bits.
In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
i don't see a real argument here. "considering the length". how long is it?
Check the game history. Literally decades between major releases.
"some of the silliness". what silliness is this exactly? external storage of utf-32 requires that one deal with an endian character set. every time any text is touched, you'll get to endian convert.
Everybody has already settled on the little-endian presentation.
isn't that awesome? utf-8 does not have this issue. and one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are. for example, to map the character to a font. the fast path is the common path (ascii), and just requires a single test ((c&0x80) == 0).
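For what it's worth, the single-test fast path mentioned above could look roughly like this in a decoder. This is a sketch only: utf8_decode is a hypothetical name, and the function checks continuation bytes but does not reject overlong forms or surrogates, which a production decoder would need to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a NUL-terminated
 * UTF-8 buffer. Stores the code point in *cp and returns the number of
 * bytes consumed (1 to 4), or 0 if the sequence is malformed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    unsigned char c = s[0];

    if ((c & 0x80) == 0) {          /* fast path: plain ASCII */
        *cp = c;
        return 1;
    }

    size_t n;
    uint32_t v;
    if ((c & 0xE0) == 0xC0)      { n = 2; v = c & 0x1Fu; }
    else if ((c & 0xF0) == 0xE0) { n = 3; v = c & 0x0Fu; }
    else if ((c & 0xF8) == 0xF0) { n = 4; v = c & 0x07u; }
    else return 0;                  /* stray continuation or invalid lead */

    for (size_t i = 1; i < n; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (s[i] & 0x3Fu);
    }
    *cp = v;
    return n;
}
```

Only the code that needs a code point (for example, a font lookup) would call this; everything else can keep passing the bytes through untouched.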
With UCS-4 you do not even need any tests.
Extracting a character - trivial.
Length of string - trivial.
Normalization - much simpler than the utf-8.
The sad reality is that the libraries I have seen actually implement utf-8 handling by using utf-32 internally. You can't avoid it: Unicode is specified in code points, which as you point out are already as good as 32 bits long.
sure the gnu c library has had bad wchar_t conversion routines in the past, but it's a free country. you can implement your own.
Frankly, I haven't even used C library for the purpose. We had already one library developed in-house, because portable support for utf-8 is patchy at best.
The sanest portable approach is to link with iconv and convert everything from some internal presentation to external. Because you can never know what encoding user needs. Unless you really need to save the RAM (one has shitload of string data), utf-8 simply sucks as internal presentation.
P.S. I have had very little experience with Unicode. But several months of dealing with it have simply convinced me that if one has to deal with l10n/i18n, then utf-16/utf-32 are very good choices. Ditto if one has to deal with Unicode. If the application really doesn't care what it prints or reads, then pass-through binary (utf-8) works too. But as soon as one has to take the length of a utf-8 string (real length), then it is time to start switching from utf-8 to utf-32.
All hope abandon ye who enter here.
What use are those characters anyway? You don't need funny accents on letters to play Nethack.
For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.
I don't care if it's 90,000 hectares. That lake was not my doing.
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .
UnNetHack: NetHack Improved!
Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
Let me answer with a koan: 'What is the real length of a soft hyphen?'
It's obvious you have little real experience with unicode, because saying 'just convert to utf-32' just papers over the problems without solving them. UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in utf-32. So you still have all the length and normalization problems you have with utf-8 (and even with ASCII, though people often ignore it there -- are 'a' and 'A' the same character? How do they sort?) The real 'length' problem is that people insist on using the term ambiguously -- you have string storage space and string rendering size, and the two are completely independent.
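A tiny sketch of the multi-code-point problem in UTF-32, using 'e' followed by U+0301 (combining acute accent). Both function names are hypothetical, and the second one is deliberately crude: it only skips the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption made for brevity. Real segmentation follows the Unicode text segmentation rules and is far more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: code points in a zero-terminated UTF-32 string.
 * This is the "trivial" length, and it equals the storage size / 4. */
size_t utf32_code_points(const uint32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: rough count of user-visible characters.
 * ASSUMPTION: only U+0300..U+036F are treated as combining marks;
 * this is an illustration, not a grapheme cluster algorithm. */
size_t utf32_approx_visible(const uint32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0x0300 || *s > 0x036F)
            n++;
    }
    return n;
}
```

For the string { 0x0065, 0x0301, 0 } the first function reports 2 and the second reports 1, so even with fixed-width code units the storage length and the on-screen length disagree.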
AFAICT the original query came from the actual DevTeam. The blog post in the submission is from the NetHack4 guy, who I suspect is also the anonymous submitter.
Ooh, moderator points! Five more idjits go to Minus One Hell!
Delendae sunt RIAA, MPAA et Windoze
Characters in Thai are rendered in display order, and not logical order.
so, for example ( mina would be imna) and requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display, and rendering either gibberish/bakamoji, or unrelated characters/syllables.
Malayalam:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code point by code point, the grapheme displayed would change 4 times.
KA
K'
KSA
KSAA
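Here is a rough sketch of why truncation is the dangerous operation for the KA VIRAMA SA AA sequence above (0d15 0d4d 0d38 0d3e). The function utf8_truncate is a hypothetical helper that cuts a string to a byte budget without splitting a code point. It is still wrong for display purposes, which is the point: it knows nothing about graphemes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shorten s in place to at most max_bytes bytes,
 * backing up so the cut never lands inside a multi-byte code point.
 * Returns the new length in bytes. It does NOT respect grapheme
 * boundaries, so a visible character can still be cut in half. */
size_t utf8_truncate(char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;

    size_t cut = max_bytes;
    /* Continuation bytes look like 10xxxxxx; step back past them. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    s[cut] = '\0';
    return cut;
}
```

Encoded as UTF-8, the four code points take 12 bytes (3 each). Truncating that buffer to 7 bytes leaves the 6 bytes for KA VIRAMA: every remaining code point is intact, yet what gets displayed is the second form in the list above rather than a prefix of the intended character.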
UTF-8 is designed to be treated as a byte stream - even when detecting character boundaries. If a byte is >0x7F and <0xC0, then it is not a character boundary. If you want to be really strict, filter out the invalid bytes (0xC0, 0xC1, >0xF4), then everything else is a character boundary.
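The boundary rule above fits in one mask: a byte that is >0x7F and <0xC0 has the bit pattern 10xxxxxx, so (b & 0xC0) == 0x80 identifies it. A minimal sketch, with hypothetical function names and the assumption that the input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte.
 * Assumes the input is valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: given a byte index i into s, return the index
 * of the first byte of the code point that contains it. */
size_t utf8_align_back(const char *s, size_t i)
{
    assert(s != NULL);
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

Neither function decodes anything; both treat the string purely as a byte stream, which is why this kind of scanning is cheap.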
Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.
What does "character" mean?
Something represented by one unicode codepoint? (making your statement a tautology)
Grapheme cluster? (what most users would consider a character)
A position in the character grid of a console?
Which brings us to the real question. to what extent do you want to support unicode? do you care about
* Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
* Right to left languages? (hebrew, arabic etc)
* Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
* Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (Chinese, Japanese, Korean)
* Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)
Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.
UTF-32 seems enticing initially but turns out to be fairly pointless, by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc and it will massively increase the size of the vast majority of text.
UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" much longer which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later after basing much more code on them). UTF-8 is smaller for text with lots of Latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by Windows and Java.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
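The size tradeoff can be sketched per code point. The three function names are hypothetical, and each assumes it is handed a valid code point (not a surrogate, not above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: storage cost in bytes of one valid code point
 * in each encoding form. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* most accented Latin, Greek, Cyrillic */
    if (cp < 0x10000) return 3;   /* rest of the BMP, including most CJK */
    return 4;                     /* outside the BMP */
}

size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}

size_t utf32_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return 4;                     /* always one 32-bit unit */
}
```

So 'A' costs 1, 2 and 4 bytes respectively, a BMP Han character such as U+4E2D costs 3, 2 and 4, and anything outside the BMP costs 4 in all three.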
Definite a canadian
http://en.wikipedia.org/wiki/A...
Confucius say, "Find worm in apple - bad. Find half a worm - worse."