NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes After years of relative silence, the development team behind the classic roguelike game NetHack has posteda question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.
That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.
Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .
UnNetHack: NetHack Improved!
Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.
wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.
And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
while (int l = utf8towc(&c, s))
{
s += l;
do_something(c);
}
vs UCS-4's simpler:
for (; *s; s++)
{
do_something(*s);
}
but UCS-4 blows up most your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Characters in Thai are rendered in display-oredr, and not logical order.
so, for example ( mina would be imna) and requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependant on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display; and rendering either jibbrish/bakamoji, or unrelated characters/syllables because.
Malay:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code-point by code point, the grapheme displayed would changes 4 times.
KA
K'
KSA
KSAA