Slashdot Mirror


NetHack Development Team Polls Community For Advice On Unicode

An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)

27 of 165 comments (clear)

  1. The answer is... by Anonymous Coward · · Score: 5, Insightful

    utf-8

    1. Re:The answer is... by KiloByte · · Score: 5, Informative

      For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.

      wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.

      And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
      while (int l = utf8towc(&c, s))
      {
              s += l;
              do_something(c);
      }

      vs UCS-4's simpler:
      for (; *s; s++)
      {
              do_something(*s);
      }

      but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
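For readers who want to see what sits behind a call like utf8towc() in the loop above, here is a minimal sketch in C. The names utf8_decode and utf8_count are hypothetical, not Crawl's or NetHack's actual functions, and returning U+FFFD for malformed input is an assumed policy, not something the comment specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8towc(): decodes one code point from s into
 * *out and returns the number of bytes consumed. Returns 0 at the
 * terminating NUL. Malformed input yields U+FFFD and consumes one byte
 * (an assumed error policy). */
int utf8_decode(uint32_t *out, const char *s)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c = p[0];
    int len, i;

    if (c == 0)
        return 0;
    if (c < 0x80) {                 /* ASCII fast path */
        *out = c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0)      { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else { *out = 0xFFFD; return 1; }   /* stray continuation or invalid lead */

    for (i = 1; i < len; i++) {
        /* A NUL fails this check too, so we never read past the terminator. */
        if ((p[i] & 0xC0) != 0x80) { *out = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *out = 0xFFFD;
        return 1;
    }
    *out = c;
    return len;
}

/* Counts code points using the same loop shape as the comment above. */
size_t utf8_count(const char *s)
{
    uint32_t c;
    size_t n = 0;
    int l;

    while ((l = utf8_decode(&c, s)) != 0) {
        s += l;
        n++;
    }
    return n;
}
```

The caller's loop stays as short as the one quoted above; all the variable-length handling lives in one function.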

      My credentials: I'm the guy who added Unicode support to Dungeon Crawl.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  2. Re:Short of memory? by Anonymous Coward · · Score: 3, Funny

    If masochist, just UTF-16. If slashdot coder use ASCII.

  3. More importantly, by Anonymous Coward · · Score: 3, Insightful

    who cares? This only affects naming your character and displaying stuff on the map.

    1. Re:More importantly, by KiloByte · · Score: 4, Funny

      Don't you want to name your fruit U+1F4A9? (can't write this as a literal because Slashdot)

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  4. Use utf if you must, for character names, only. by Little+Brother · · Score: 4, Interesting

    I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same, (or allow it as a compile time option at the very least) and just use UTF for the character name.

    For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.

    --

    Little Brother, watching the watchers

    1. Re:Use utf if you must, for character names, only. by DahGhostfacedFiddlah · · Score: 2

      Don't worry, that would never happen.

      G's are only ever gnomes (of differing ranks), but a g might be a gargoyle, flying gargoyle, or gremlin.

      I hope that clears things up. And for god's sake, don't genocide G's if you're playing as a gnome.

    2. Re:Use utf if you must, for character names, only. by DahGhostfacedFiddlah · · Score: 2

      (I may well hold the record for longest time playing without an ascension)

      I think I started Hack around '92, and finally ascended in 2009. I've been trying to ascend about once a year since then.

  5. Fonts missing in action by Anonymous Coward · · Score: 2, Informative

    First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.

    That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.

    Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.

    1. Re:Fonts missing in action by IcyWolfy · · Score: 3, Informative

      Terminology needs to be fixed.

      All code points are 4 bytes (in UTF-32).
      All characters (defined as a single conceptual and graphical display unit) range from 1 to 6 code points (so, 4-24 bytes).

      Sinhala:
      0dc1 0dca 200d 0dbb 0dd3
      ZHA VIRAMA ZWJ RA VOWEL-SIGN-II

      Combine to form a single displayable character (Sri). It is kind of a fancy item, but different from the sequence without the ZWJ, which would display as two graphemes (S' and RII).

      And Lithuanian:
      "However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301."

      And there are many other cases where there is no single code-point to represent a single grapheme.
      So for string truncation and line-splitting (and anything dealing with Arabic or Indic scripts), you must never crop in the middle of a code-point sequence that defines a single grapheme; otherwise the visual display is incorrect, or mojibake (gibberish).
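The Lithuanian example can be turned into a small C sketch of why truncating by code-point count is unsafe. This is a deliberate simplification and not real grapheme segmentation: it assumes only the Combining Diacritical Marks block (U+0300 to U+036F) attaches to the preceding code point, so it does not handle the ZWJ and virama sequences from the Sinhala example. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only U+0300..U+036F counts as "combines with the
 * previous code point". Real segmentation rules cover far more than this. */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Approximate number of user-visible characters in a UTF-32 buffer:
 * every code point that is not a combining mark starts a new cluster. */
size_t approx_cluster_count(const uint32_t *cps, size_t n)
{
    size_t i, clusters = 0;

    for (i = 0; i < n; i++)
        if (i == 0 || !is_combining_mark(cps[i]))
            clusters++;
    return clusters;
}

/* Largest prefix length (in code points) not exceeding max that does not
 * separate a base character from the marks that follow it. */
size_t safe_truncate(const uint32_t *cps, size_t n, size_t max)
{
    size_t cut = max < n ? max : n;

    while (cut > 0 && cut < n && is_combining_mark(cps[cut]))
        cut--;
    return cut;
}
```

For the sequence U+012F, U+0307, U+0301, the buffer holds three code points (12 bytes in UTF-32) but one cluster, and asking for a two-code-point prefix yields an empty prefix, because any shorter cut would strip an accent off the letter.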

  6. UTF-8 by Ark42 · · Score: 4, Insightful

    The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.

    An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
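A minimal C sketch of an encoder that handles the 4-byte case, assuming the current U+0000 to U+10FFFF range. The name utf8_encode is hypothetical. Refusing surrogate code points is the relevant detail: CESU-8 represents a character above U+FFFF as two 3-byte sequences, one per UTF-16 surrogate, and that is not valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes cp into buf (room for 4 bytes) and returns the number of bytes
 * written, or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates are never encoded on their own */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* The 4-byte form that a 3-byte cap would silently lose. */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, U+1F4A9 comes out as the four bytes F0 9F 92 A9, while a lone surrogate such as U+D83D is rejected.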

  7. UTF-8 by Anonymous Coward · · Score: 5, Interesting

    UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
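A small C sketch of the point about zero bytes. Both function names are hypothetical, and the second one exists only to show what legacy byte-oriented code would do if it were handed a UTF-32 buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Storage size of a UTF-8 string in bytes: plain strlen() is still correct,
 * because no byte of a multi-byte sequence is ever zero. */
size_t utf8_byte_length(const char *s)
{
    return strlen(s);
}

/* What byte-oriented code sees when given UTF-32: for any code point below
 * U+0100, three of its four bytes are zero, so the scan stops almost
 * immediately (after 1 byte on little-endian, 0 on big-endian). */
size_t naive_strlen_on_utf32(const uint32_t *s)
{
    return strlen((const char *)s);
}
```

Note that utf8_byte_length returns bytes, not characters; that is the right number for buffer sizing, which is what most existing C string code uses it for.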

  8. Re: Short of memory? by jones_supa · · Score: 2

    UTF-32 should make memory allocation more predictable as every character is guaranteed to be 32 bits.

  9. Go with the majority by namgge · · Score: 4, Insightful

    In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.

    The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.

  10. Re:utf-32/ucs-4 by ThePhilips · · Score: 2

    i don't see a real argument here. "considering the length". how long is it?

    Check the game history. Literally decades between major releases.

    "some of the silliness". what silliness is this exactly? external storage of utf-32 requires that one deal with an endian character set. every time any text is touched, you'll get to endian convert.

    Everybody has already settled on the little-endian presentation.

    isn't that awesome? utf-8 does not have this issue. and one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are. for example, to map the character to a font. the fast path is the common path (ascii), and just requires a single test ((c&0x80) == 0).

    With UCS-4 you do not even need any tests.

    Extracting a character - trivial.

    Length of string - trivial.

    Normalization - much simpler than the utf-8.

    The sad reality is that the libraries I have seen actually implement utf-8 handling by using utf-32 internally. You can't avoid it: Unicode is specified in code points, which as you point out are already as good as 32 bits long.

    sure the gnu c library has had bad wchar_t conversion routines in the past, but it's a free country. you can implement your own.

    Frankly, I haven't even used C library for the purpose. We had already one library developed in-house, because portable support for utf-8 is patchy at best.

    The sanest portable approach is to link with iconv and convert everything from some internal presentation to external. Because you can never know what encoding user needs. Unless you really need to save the RAM (one has shitload of string data), utf-8 simply sucks as internal presentation.

    P.S. I have had very little experience with Unicode. But several months of dealing with it have simply convinced me that if one has to deal with l10n/i18n, then utf-16/utf-32 are very good choices. Ditto, if one has to deal with Unicode itself. If the application really doesn't care what it prints or reads, then pass-through binary (utf-8) works too. But as soon as one has to take the length of a utf-8 string (real length), then it is time to start switching from utf-8 to utf-32.

    --
    All hope abandon ye who enter here.
  11. Re:Better to not support it at all by Jeremi · · Score: 5, Funny

    What use are those characters anyway? You don't need funny accents on letters to play Nethack.

    For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  12. Re: Short of memory? by nmoore · · Score: 5, Informative

    There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...

  13. Re:utf-32/ucs-4 by bhaak1 · · Score: 5, Informative

    Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.

    The current NetHack wiki is at http://nethackwiki.com/ .

  14. Re:utf-32/ucs-4 by PhrostyMcByte · · Score: 4, Informative

    Extracting a character - trivial. Length of string - trivial.

    I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.

    When you extract characters and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?

    The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.

    Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.

  15. Re:utf-32/ucs-4 by Anonymous Coward · · Score: 3, Insightful

    Let me answer with a koan: 'What is the real length of a soft hyphen?'

  16. Re:utf-32/ucs-4 by Chris+Dodd · · Score: 2

    It's obvious you have little real experience with unicode, because saying 'just convert to utf-32' just papers over the problems without solving them. UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in utf-32. So you still have all the length and normalization problems you have with utf-8 (and even with ASCII, though people often ignore it there -- are 'a' and 'A' the same character? How do they sort?) The real 'length' problem is that people insist on using the term ambiguously -- you have string storage space and string rendering size, and the two are completely independent.

  17. Re:NetHack Development Team Not Dead by L.+J.+Beauregard · · Score: 2

    AFAICT the original query came from the actual DevTeam. The blog post in the submission is from the NetHack4 guy, who I suspect is also the anonymous submitter.

    --
    Ooh, moderator points! Five more idjits go to Minus One Hell!
    Delendae sunt RIAA, MPAA et Windoze
  18. Re:utf-32/ucs-4 by IcyWolfy · · Score: 5, Informative

    Characters in Thai are rendered in display order, and not logical order.
    So, for example, mina would be imna, which requires reordering for sorting.

    Characters in many Indic languages are still all syllable based.
    So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.

    Sinhala:
    0dc1 0dca 200d 0dbb 0dd3
    ZHA VIRAMA ZWJ RA VOWEL-SIGN-II

    Combine to form a single displayable character. (Sri)

    If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
    So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
    Otherwise one will be cropping portions of a character on display, and rendering either gibberish/mojibake or unrelated characters/syllables.

    Malayalam:
    0d15 0d4d 0d38 0d3e
    KA VIRAMA SA AA

    One displayable character.
    If you display code point by code point, the grapheme displayed would change 4 times.
    KA
    K'
    KSA
    KSAA

  19. Re:utf-32/ucs-4 by jrumney · · Score: 2

    one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are.

    UTF-8 is designed to be treated as a byte stream - even when detecting character boundaries. If a byte is >0x7F and <0xC0, then it is not a character boundary. If you want to be really strict, filter out the invalid bytes (0xC0, 0xC1, >0xF4), then everything else is a character boundary.
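The boundary rule above fits in a few lines of C. This is a minimal sketch that assumes the input is already valid UTF-8 (it does not filter the invalid bytes mentioned above); the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A byte starts a new code point unless it is a continuation byte,
 * i.e. unless it lies in 0x80..0xBF. */
int utf8_is_boundary(unsigned char b)
{
    return b < 0x80 || b >= 0xC0;
}

/* Number of code points = number of boundary bytes. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (utf8_is_boundary((unsigned char)*s))
            n++;
    return n;
}

/* Largest byte length not exceeding max_bytes that does not cut a
 * multi-byte sequence in half. */
size_t utf8_truncate(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut;

    if (len <= max_bytes)
        return len;
    cut = max_bytes;
    /* s[cut] is the first byte dropped; if it is a continuation byte,
     * the cut falls inside a sequence, so back up to its lead byte. */
    while (cut > 0 && !utf8_is_boundary((unsigned char)s[cut]))
        cut--;
    return cut;
}
```

This works without decoding anything, which is the self-synchronizing property of UTF-8: from any byte offset, the nearest boundary is at most three bytes back.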

  20. What the web was built on by RogueWarrior65 · · Score: 3, Funny

    Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.

  21. Re: Short of memory? by petermgreen · · Score: 3, Insightful

    What does "character" mean?

    Something represented by one unicode codepoint? (making your statement a tautology)
    Grapheme cluster? (what most users would consider a character)
    A position in the character grid of a console?

    Which brings us to the real question. to what extent do you want to support unicode? do you care about

    * Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
    * Right to left languages? (hebrew, arabic etc)
    * Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
    * Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (Chinese, Japanese, Korean)
    * Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)

    Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.

    UTF-32 seems enticing initially but turns out to be fairly pointless, by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc and it will massively increase the size of the vast majority of text.

    UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" much longer which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later after basing much more code on them). UTF-8 is smaller for text with lots of latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by windows and Java.
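The size trade-off can be checked with a short C sketch that computes how many bytes a given code point occupies in each encoding. The function names are hypothetical, and the input is assumed to be a valid code point up to U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Bytes needed in UTF-16: one 16-bit unit inside the BMP,
 * a surrogate pair (two units) outside it. */
size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Bytes needed in UTF-32: always one 32-bit unit. */
size_t utf32_bytes(uint32_t cp)
{
    (void)cp;
    return 4;
}

/* Total storage for a sequence of code points under a given sizing rule. */
size_t total_bytes(const uint32_t *cps, size_t n, size_t (*rule)(uint32_t))
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        total += rule(cps[i]);
    return total;
}
```

An ASCII letter costs 1, 2, and 4 bytes respectively; a BMP CJK character such as U+4E2D costs 3, 2, and 4; a non-BMP character such as U+1F4A9 costs 4 bytes in all three.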

    --
    note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  22. Re: None by Hognoxious · · Score: 2
    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."