Slashdot Mirror


NetHack Development Team Polls Community For Advice On Unicode

An anonymous reader writes After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)

7 of 165 comments

  1. The answer is... by Anonymous Coward · · Score: 5, Insightful

    utf-8

    1. Re:The answer is... by KiloByte · · Score: 5, Informative

      For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.

      wchar_t is always damage; it shouldn't be used except in wrappers that do actual I/O. You need such wrappers because the standard-compliant functions are buggy to the point of uselessness on Windows, where you need SomeWindowsInventedFunctionW() for everything if you want Unicode.

      And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
      while (int l = utf8towc(&c, s))
      {
              s += l;
              do_something(c);
      }

      vs UCS-4's simpler:
      for (; *s; s++)
      {
              do_something(*s);
      }

      but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
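
For readers who want to see what sits behind a call like utf8towc(), here is a minimal sketch in C of the pattern described above: UTF-8 for strings, one UCS-4 value per decoded character. The names utf8_decode and utf8_count are hypothetical and are not Dungeon Crawl's actual functions. The sketch assumes NUL-terminated input and replaces malformed sequences with U+FFFD one byte at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, modeled on the utf8towc() call in the comment above.
 * Decodes one code point from s into *out and returns the number of bytes
 * consumed, or 0 at the terminating NUL. Malformed input yields U+FFFD and
 * consumes a single byte, so the caller always makes progress. */
int utf8_decode(uint32_t *out, const char *s)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c;
    int len, i;

    if (p[0] == 0) { *out = 0; return 0; }
    if (p[0] < 0x80) { *out = p[0]; return 1; }

    if ((p[0] & 0xE0) == 0xC0)      { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else { *out = 0xFFFD; return 1; }   /* stray continuation or invalid lead */

    for (i = 1; i < len; i++) {
        /* A NUL fails this check too, so we never read past the terminator. */
        if ((p[i] & 0xC0) != 0x80) { *out = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *out = 0xFFFD;
        return 1;
    }
    *out = c;
    return len;
}

/* Same loop shape as the UTF-8 example above, here used to count code points. */
size_t utf8_count(const char *s)
{
    uint32_t c;
    size_t n = 0;
    int l;

    while ((l = utf8_decode(&c, s)) != 0) {
        s += l;
        n++;
    }
    return n;
}
```

The loop in utf8_count is the "slightly longer code" in question: the only extra work compared with the UCS-4 loop is advancing by the returned length instead of by one.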

      My credentials: I'm the guy who added Unicode support to Dungeon Crawl.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  2. UTF-8 by Anonymous Coward · · Score: 5, Interesting

    UTF-8 is easily adopted by C-based software like NetHack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and UTF-32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
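
A small C sketch of the point above, under the assumption that the input is valid UTF-8. The function utf8_codepoints and the array utf32le_ab are illustrative names only. strlen() keeps working on UTF-8 (it reports bytes, not characters), while the same byte-oriented call stops early on UTF-32 data.

```c
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* "AB" encoded as little-endian UTF-32, followed by a 4-byte terminator.
 * Spelled out byte by byte so the example does not depend on host endianness. */
const unsigned char utf32le_ab[12] = {
    0x41, 0x00, 0x00, 0x00,
    0x42, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
};
```

With these definitions, strlen("na\xC3\xAFve") is 6 (bytes) while utf8_codepoints() on the same string gives 5, and strlen((const char *)utf32le_ab) is 1 because the scan hits the zero byte inside the first character.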

  3. Re:Better to not support it at all by Jeremi · · Score: 5, Funny

    What use are those characters anyway? You don't need funny accents on letters to play Nethack.

    For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.

    --
    I don't care if it's 90,000 hectares. That lake was not my doing.
  4. Re: Short of memory? by nmoore · · Score: 5, Informative

    There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...

  5. Re:utf-32/ucs-4 by bhaak1 · · Score: 5, Informative

    Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.

    The current NetHack wiki is at http://nethackwiki.com/ .

  6. Re:utf-32/ucs-4 by IcyWolfy · · Score: 5, Informative

    Characters in Thai are stored in display order, not logical order.
    So, for example, (mina would be imna), and the text requires reordering for sorting.

    Characters in many Indic languages are still all syllable based.
    So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.

    Sinhala:
    0dc1 0dca 200d 0dbb 0dd3
    ZHA VIRAMA ZWJ RA VOWEL-SIGN-II

    Combine to form a single displayable character. (Sri)

    If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
    So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
    Otherwise one will be cropping portions of a character on display, and rendering either gibberish (mojibake) or unrelated characters/syllables.

    Malayalam:
    0d15 0d4d 0d38 0d3e
    KA VIRAMA SA AA

    One displayable character.
    If you display code point by code point, the grapheme displayed would change four times.
    KA
    K'
    KSA
    KSAA
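
To make the truncation problem concrete, here is a rough C sketch that encodes the two sequences quoted above as UTF-8 and builds the prefixes a code-point-at-a-time display would pass through. The names utf8_encode, utf8_prefix, malayalam_ksaa and sinhala_sri are made up for this illustration. The code does not find grapheme boundaries; it only shows that one displayed character can span several code points and many bytes, which is why cutting at an arbitrary code point is unsafe.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes required).
 * Returns the number of bytes written, or 0 for surrogates and values
 * above U+10FFFF. */
size_t utf8_encode(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Writes the first n code points of cps to out as a NUL-terminated UTF-8
 * string and returns its length in bytes. out must hold 4 * n + 1 bytes. */
size_t utf8_prefix(const uint32_t *cps, size_t n, char *out)
{
    size_t len = 0, i;

    for (i = 0; i < n; i++)
        len += utf8_encode(cps[i], out + len);
    out[len] = '\0';
    return len;
}

/* The code point sequences quoted in the comment above. */
const uint32_t malayalam_ksaa[4] = {0x0D15, 0x0D4D, 0x0D38, 0x0D3E};
const uint32_t sinhala_sri[5] = {0x0DC1, 0x0DCA, 0x200D, 0x0DBB, 0x0DD3};
```

Calling utf8_prefix(malayalam_ksaa, k, buf) for k = 1 to 4 produces the four intermediate strings listed above, at 3, 6, 9 and 12 bytes. Only the last one is the intended character. Deciding where it is safe to cut needs real grapheme cluster segmentation, such as the rules in Unicode Standard Annex #29, which this sketch does not attempt.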