NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
utf-8
I started playing nethack before it was nethack, when it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same (or allow it as a compile time option at the very least) and just use UTF for the character name.
As for which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
Little Brother, watching the watchers
The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't artificially cap UTF-8 at a max of 3 bytes per character and then treat the 4-byte form as some special alternate UTF-8.
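To make the 4-byte point concrete, here is a minimal sketch of an encoder that covers the whole U+0000 to U+10FFFF range. The name utf8_encode is made up for illustration and is not from the NetHack source; rejecting surrogates is what keeps the output from turning into CESU-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not
 * a valid scalar value (a UTF-16 surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Encoding surrogates individually is what produces CESU-8. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* The 4-byte form: plain UTF-8, not a special case. */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, U+1F4A9 comes out as the four bytes F0 9F 92 A9, while a lone surrogate such as U+D800 is refused.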
Morphing Software
UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
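A small sketch of what "just work" means in practice. The helper count_ascii is hypothetical, written in the style of legacy byte-oriented code; it stays correct on UTF-8 input because every byte of a multibyte sequence has its high bit set, so it can never be mistaken for an ASCII character or for the terminating zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical legacy-style helper: count occurrences of an ASCII
 * byte c in a NUL-terminated string. Nothing here knows about UTF-8,
 * yet it is safe on UTF-8 text: bytes below 0x80 only ever appear as
 * standalone ASCII characters, never inside a multibyte sequence. */
size_t count_ascii(const char *s, char c)
{
    size_t n = 0;

    if (c == '\0')
        return 0; /* the terminator is not a character to count */

    for (s = strchr(s, c); s != NULL; s = strchr(s + 1, c))
        n++;
    return n;
}
```

Note that strlen on such a string still returns a byte count, not a character count, which matters for buffer sizes but not for finding the end of the string.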
In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
What use are those characters anyway? You don't need funny accents on letters to play Nethack.
For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.
I don't care if it's 90,000 hectares. That lake was not my doing.
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .
UnNetHack: NetHack Improved!
Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
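To illustrate the gap between bytes, code points and visible glyphs, here is a rough sketch. The function name utf8_count_codepoints is invented for this example, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes valid UTF-8; it does no validation of its own. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++; /* a lead byte or an ASCII byte starts a code point */
    }
    return n;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301 COMBINING ACUTE ACCENT), strlen reports 3 bytes and this function reports 2 code points, yet the user sees one glyph. A UTF-32 array would hold 2 elements for the same text, so fixed-width code units do not get you a glyph count either.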
Don't you want to name your fruit U+1F4A9? (can't write this as a literal because Slashdot)
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Characters in Thai are rendered in display order, and not logical order.
So, for example, "mina" would be "imna", and sorting requires reordering.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display, and rendering either gibberish/mojibake or unrelated characters/syllables.
Malayalam:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code point by code point, the grapheme displayed would change 4 times.
KA
K'
KSA
KSAA
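On the truncation point, a minimal sketch of the easy half of the problem: cutting a UTF-8 string at a byte limit without splitting a code point. The name utf8_safe_prefix is hypothetical and the input is assumed to be valid UTF-8. It does not know about grapheme clusters, so it can still cut the four-code-point sequence above in the middle and leave KA or K' on screen.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length in bytes of the longest
 * prefix of s that fits in max_bytes and ends on a code point
 * boundary. Assumes valid UTF-8. This protects code points only,
 * NOT grapheme clusters. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut;

    if (len <= max_bytes)
        return len;

    /* s[cut] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut falls inside a code
     * point, so back up to that code point's lead byte. */
    cut = max_bytes;
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

The Malayalam example is 12 bytes in UTF-8 (E0 B4 95, E0 B5 8D, E0 B4 B8, E0 B4 BE). Asking for at most 7 bytes gives 6, which is a clean code point boundary but only KA plus VIRAMA. Truncating on whole graphemes needs the Unicode segmentation rules, which is a job for a proper library rather than a loop like this.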