NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes After years of relative silence, the development team behind the classic roguelike game NetHack has posteda question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .
UnNetHack: NetHack Improved!
Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.
wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.
And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
while (int l = utf8towc(&c, s))
{
s += l;
do_something(c);
}
vs UCS-4's simpler:
for (; *s; s++)
{
do_something(*s);
}
but UCS-4 blows up most your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Characters in Thai are rendered in display-oredr, and not logical order.
so, for example ( mina would be imna) and requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependant on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display; and rendering either jibbrish/bakamoji, or unrelated characters/syllables because.
Malay:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code-point by code point, the grapheme displayed would changes 4 times.
KA
K'
KSA
KSAA
Terminoligy needs to be fixed.
All Codepoints are 4 bypes
All characters (defined as a single conceptual, and graphical display unit) range from 1 to 6 code-points. (so, 4-24bytes)
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri) (kinda a fancy item; but different from without the ZWJ which would display two graphemes. (S', and RII)
And Lituanian:
"However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301."
And there are many other cases where there is no single code-point to represent a single grapheme.
So for string truncation and line-splitting, (and anything dealing with arabic or indic scripts), you need to never crop in the middle of a codepoint-sequence that defined a single grapheme; or else the visual display is incorrect, or bakamoji (jibbrish).