NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .
UnNetHack: NetHack Improved!
For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.
wchar_t is always damage; it shouldn't be used except in wrappers that do actual I/O. You need such wrappers because the standard-compliant functions are buggy to the point of uselessness on Windows, where you need SomeWindowsInventedFunctionW() for everything if you want Unicode.
And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
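    // utf8towc() decodes one code point into c and returns the bytes consumed
    while (int l = utf8towc(&c, s))
    {
        s += l;
        do_something(c);
    }
vs UCS-4's simpler:
    // each array element is already a whole code point
    for (; *s; s++)
    {
        do_something(*s);
    }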
but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
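For anyone who wants to try the UTF-8 loop without Crawl's sources at hand, here is a minimal sketch of a decoder with the same calling shape. The names decode_utf8() and count_code_points() are made up for this example and are not Crawl's or NetHack's actual functions; the choice to return U+FFFD and consume one byte on malformed input is likewise just an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for utf8towc(): decodes one code point from the
// NUL-terminated UTF-8 string s into *out and returns the number of bytes
// consumed (0 at the terminating NUL). Malformed input yields U+FFFD and
// consumes a single byte, so the caller always makes progress.
int decode_utf8(char32_t *out, const char *s)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80) { *out = p[0]; return 1; }   // plain ASCII

    int len;        // total length of the sequence in bytes
    char32_t c;     // payload bits collected so far
    char32_t min;   // smallest value allowed for this length (rejects overlongs)
    if ((p[0] & 0xE0) == 0xC0)      { len = 2; c = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; min = 0x10000; }
    else { *out = 0xFFFD; return 1; }             // stray continuation or bad lead byte

    for (int i = 1; i < len; i++)
    {
        // Stops at the first non-continuation byte, so it never reads past the NUL.
        if ((p[i] & 0xC0) != 0x80) { *out = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    {
        *out = 0xFFFD;                            // overlong, out of range, or surrogate
        return 1;
    }
    *out = c;
    return len;
}

// Same loop shape as in the comment above, here just counting code points.
std::size_t count_code_points(const char *s)
{
    std::size_t n = 0;
    char32_t c;
    while (int l = decode_utf8(&c, s))
    {
        s += l;
        n++;
    }
    return n;
}
```

The loop body stays three lines long; all the extra complexity lives in one function that is written once. Note that this counts code points, not displayed characters.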
My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Characters in Thai are stored in display order, not logical order.
So, for example, "mina" would be stored as "imna", and the text requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So consonants and vowels are encoded separately, and combine fully into one logical graphical character.
Sinhala:
0dc1 0dca   200d 0dbb 0dd3
SHA  VIRAMA ZWJ  RA   VOWEL-SIGN-II
These combine to form a single displayable character (Sri).
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display, and rendering either gibberish (mojibake) or unrelated characters/syllables.
Malayalam:
0d15 0d4d   0d38 0d3e
KA   VIRAMA SA   AA
One displayable character.
If you display it code point by code point, the grapheme shown goes through four different forms:
KA
K'
KSA
KSAA
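To make the "grapheme is the unit of truncation" point concrete, here is a rough sketch that groups the two sequences above into display units. Everything in it is an assumption made for illustration: the helper names are invented, the code point ranges are approximate and cover only Sinhala and Malayalam, and the joining rule (a Malayalam virama joins the next consonant by default, a Sinhala one only when a ZWJ follows) is a simplification that mirrors the examples in this comment. It is not the Unicode text segmentation algorithm; real code would want a proper library such as ICU, and the final shape still depends on the font and shaping engine.

```cpp
#include <cstddef>

// Rough, illustration-only classification for Malayalam and Sinhala.
static bool is_virama(char32_t c)     { return c == 0x0D4D || c == 0x0DCA; }
static bool is_zwj(char32_t c)        { return c == 0x200D; }
static bool is_vowel_sign(char32_t c) { return (c >= 0x0D3E && c <= 0x0D4C)
                                            || (c >= 0x0DCF && c <= 0x0DDF); }
static bool is_consonant(char32_t c)  { return (c >= 0x0D15 && c <= 0x0D39)
                                            || (c >= 0x0D9A && c <= 0x0DC6); }

// Returns the index one past the display unit that starts at s[i].
std::size_t cluster_end(const char32_t *s, std::size_t n, std::size_t i)
{
    if (i >= n)
        return n;
    i++;                                       // the base code point
    while (i < n)
    {
        if (is_vowel_sign(s[i]) || is_zwj(s[i]))
            i++;                               // decorations stay with the base
        else if (is_virama(s[i]))
        {
            bool joins = (s[i] == 0x0D4D);     // assumption: Malayalam joins by default
            i++;
            if (i < n && is_zwj(s[i])) { joins = true; i++; }
            if (joins && i < n && is_consonant(s[i]))
                i++;                           // the consonant pulled into the conjunct
        }
        else
            break;                             // next base starts a new unit
    }
    return i;
}

// Number of display units in s[0..n).
std::size_t count_clusters(const char32_t *s, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i = cluster_end(s, n, i))
        count++;
    return count;
}

// How many code points to keep so that at most max_clusters whole units remain.
std::size_t truncate_clusters(const char32_t *s, std::size_t n, std::size_t max_clusters)
{
    std::size_t i = 0;
    while (i < n && max_clusters-- > 0)
        i = cluster_end(s, n, i);
    return i;
}
```

With these rules the five-code-point Sinhala sequence counts as one unit, the same sequence without the ZWJ counts as two, and the Malayalam sequence counts as one, so truncating to one unit never cuts through the middle of a syllable.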