NetHack Development Team Polls Community For Advice On Unicode

Posted by timothy on Sunday January 11, 2015 @04:10AM from the pressing-issues dept.

An anonymous reader writes After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)

The answer is... by Anonymous Coward · 2015-01-11 04:13 · Score: 5, Insightful

utf-8

More importantly, by Anonymous Coward · 2015-01-11 04:17 · Score: 3, Insightful

who cares? This only affects naming your character and displaying stuff on the map.

UTF-8 by Ark42 · 2015-01-11 04:39 · Score: 4, Insightful

The answer is UTF-8. It's pretty much going to be the de facto character encoding now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't treat them as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
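To make the 4-byte point concrete, here is a minimal Python sketch. The helper `cesu8_encode` is a hypothetical name written only for illustration, not any library's API. It shows that a code point above U+FFFF takes 4 bytes in real UTF-8, but 6 bytes if each UTF-16 surrogate is encoded separately, which is what CESU-8 does.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as if it were its own 3-byte UTF-8 sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


astral = "\U0001F600"  # any code point above U+FFFF would do
utf8_bytes = astral.encode("utf-8")
cesu8_bytes = cesu8_encode(astral)
print(len(utf8_bytes), len(cesu8_bytes))
```

A decoder that only accepts sequences of up to 3 bytes would reject `utf8_bytes`, even though it is perfectly valid UTF-8.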

--
Morphing Software

Go with the majority by namgge · 2015-01-11 04:54 · Score: 4, Insightful

In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
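As a rough illustration of the "equivalent characters" issue, the sketch below uses Python's standard `unicodedata` module. The helper name `same_text` is made up for this example, and NFC is just one reasonable choice of normalization form, not something the thread prescribes.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw comparison of code points
print(same_text(composed, decomposed))  # comparison after normalization
```

Collation order is a separate problem that normalization alone does not solve; that is where a full Unicode library earns its keep.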

Re:utf-32/ucs-4 by Anonymous Coward · 2015-01-11 06:28 · Score: 3, Insightful

Let me answer with a koan: 'What is the real length of a soft hyphen?'

Re: Short of memory? by petermgreen · 2015-01-11 15:47 · Score: 3, Insightful

What does "character" mean?
Something represented by one unicode codepoint? (making your statement a tautology)
Grapheme cluster? (what most users would consider a character)
A position in the character grid of a console?
Which brings us to the real question: to what extent do you want to support unicode? Do you care about
* Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
* Right to left languages? (hebrew, arabic etc)
* Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
* Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (chinese, japanese, korean)
* Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)
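For a grid-based display, two of the questions above (combining marks and double-width characters) can be sketched roughly as follows. `display_columns` is a hypothetical helper for illustration; it assumes combining marks occupy no column and that East Asian Wide/Fullwidth characters occupy two, and it deliberately ignores right-to-left text, joiners, control characters and ambiguous-width characters.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate how many console columns a string occupies (simplified)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # "double width" CJK-style cell
        else:
            columns += 1
    return columns


for sample in ("abc", "日本語", "e\u0301"):
    print(repr(sample), len(sample), display_columns(sample))
```

Note that the column count differs from the code point count in two of the three samples, which is why neither count works as a stand-in for the other.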
Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.
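A small Python sketch of the different "lengths" being compared here. The function name `lengths` and the sample strings are illustrative choices only.

```python
def lengths(text: str) -> dict:
    """Return the length of a string measured three different ways."""
    return {
        "code_points": len(text),                           # Python str counts code points
        "utf8_units": len(text.encode("utf-8")),            # 1 unit = 1 byte
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 1 unit = 2 bytes
    }


accented = "e\u0301"   # one user-perceived character, two code points
astral = "\U0001F600"  # one code point outside the BMP

print(lengths(accented))
print(lengths(astral))
```

None of the three numbers equals "one character" for both samples, which is the point: counting code points gets you no closer to what a user sees than counting code units does.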
UTF-32 seems enticing initially but turns out to be fairly pointless: by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc, and it will massively increase the size of the vast majority of text.
UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" for much longer, which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later, after basing much more code on them). UTF-8 is smaller for text with lots of latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by Windows and Java.
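The size tradeoff is easy to check for yourself. This is only a sketch: `encoded_sizes` is a hypothetical helper, the two sample strings are arbitrary, and the little-endian codec names are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte size of a string in each of the three encodings."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


print(encoded_sizes("hello"))   # Latin text
print(encoded_sizes("日本語"))   # CJK text inside the BMP
```

For the Latin sample UTF-8 is the smallest, for the CJK sample UTF-16 is, and UTF-32 is the largest in both cases.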

--
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register