NetHack Development Team Polls Community For Advice On Unicode

The answer is... by Anonymous Coward · 2015-01-11 04:13 · Score: 5, Insightful

utf-8

Re:The answer is... by KiloByte · 2015-01-11 06:49 · Score: 5, Informative

For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.
wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.
And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
    while (int l = utf8towc(&c, s)) { s += l; do_something(c); }
vs UCS-4's simpler:
    for (; *s; s++) { do_something(*s); }
but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
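
The UTF-8 loop above depends on a decoding helper that the post does not show. Here is a minimal C sketch of what such a helper might look like. The names `utf8_decode` and `utf8_count`, the return convention (bytes consumed, 0 at the terminator) and the U+FFFD fallback are assumptions for illustration, not Dungeon Crawl's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the NUL-terminated
   UTF-8 string s into *out. Returns the number of bytes consumed, or 0
   at the end of the string. A malformed lead or continuation byte yields
   U+FFFD and consumes one byte. Overlong forms and surrogates are not
   rejected in this sketch. */
static int utf8_decode(uint32_t *out, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int len, i;

    assert(out != NULL && s != NULL);
    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80) {                 /* ASCII fast path */
        *out = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        len = 2; cp = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {
        len = 3; cp = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {
        len = 4; cp = p[0] & 0x07;
    } else {                           /* stray continuation or invalid lead */
        *out = 0xFFFD;
        return 1;
    }
    for (i = 1; i < len; i++) {
        /* A NUL terminator also fails this test, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) {
            *out = 0xFFFD;
            return 1;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *out = cp;
    return len;
}

/* Count code points by walking the string the same way the loop above does. */
static size_t utf8_count(const char *s)
{
    uint32_t c;
    size_t n = 0;
    int l;

    while ((l = utf8_decode(&c, s)) != 0) {
        s += l;
        n++;
    }
    return n;
}
```

With a helper of this shape the per-string loop stays a one-liner, which is the point being made: the extra cost of UTF-8 is one small function, not a different program structure.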

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:The answer is... by LostMyBeaver · 2015-01-11 10:52 · Score: 1

I just read the source and agree. Unless they think it's just so much fun to completely reimplement character handling on 20 platforms...or unless they intend to port entirely to Qt, UTF-8 is their only option.
Re:The answer is... by Pinhedd · 2015-01-11 17:22 · Score: 1

Came here to say pretty much the same thing.
UTF-8 is pretty easy to work with from a memory management perspective and will make it easier when upgrading an established ASCII base.
Re:The answer is... by lorimer · 2015-01-13 11:39 · Score: 1

for DCSS, how much tedious rewriting-every-single-instance-of-a-change BS did you have to wade through?

Re:Short of memory? by Anonymous Coward · 2015-01-11 04:15 · Score: 3, Funny

If masochist, just UTF-16. If slashdot coder use ASCII.

More importantly, by Anonymous Coward · 2015-01-11 04:17 · Score: 3, Insightful

who cares? This only affects naming your character and displaying stuff on the map.

Re:More importantly, by KiloByte · 2015-01-11 06:51 · Score: 4, Funny

Don't you want to name your fruit U+1F4A9? (can't write this as a literal because Slashdot)

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:More importantly, by Anonymous Coward · 2015-01-11 09:26 · Score: 1

You like to eat smiling piles of poo?
Re:More importantly, by jonadab · 2015-01-13 13:48 · Score: 1

Honestly, it'd probably taste better than some of the other stuff NetHack characters eat.

--
Cut that out, or I will ship you to Norilsk in a box.

Better to not support it at all by johanw · 2015-01-11 04:19 · Score: 1

What use are those characters anyway? You don't need funny accents on letters to play Nethack. 7 bits should be enough for any character set! Hardcore hackers who want a workaround can just use LaTeX codes.

Re:Better to not support it at all by Jeremi · 2015-01-11 05:29 · Score: 5, Funny

What use are those characters anyway? You don't need funny accents on letters to play Nethack.

For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.

--

I don't care if it's 90,000 hectares. That lake was not my doing.

utf-32/ucs-4 by ThePhilips · 2015-01-11 04:30 · Score: 1

Considering the length of their release cycle, seems to be a safe choice.

It's not like the difference between 1/2/4 bytes would make much of a performance difference for an application like NetHack.

Using the utf-32 internally would save them from some of the silliness the alternatives like utf-8 bring with them.

--
All hope abandon ye who enter here.

Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 05:24 · Score: 2

i don't see a real argument here. "considering the length". how long is it?
Check the game history. Literally decades between major releases.

"some of the silliness". what silliness is this exactly? external storage of utf-32 requires that one deal with an endian character set. every time any text is touched, you'll get to endian convert.
Everybody has already settled on the little-endian presentation.

isn't that awesome? utf-8 does not have this issue. and one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are. for example, to map the character to a font. the fast path is the common path (ascii), and just requires a single test ((c&0x80) == 0).
With UCS-4 you do not even need any tests.
Extracting a character - trivial.
Length of string - trivial.
Normalization - much simpler than the utf-8.
The sad reality is that the libraries I have seen actually implement UTF-8 handling by using UTF-32 internally. You can't avoid it: Unicode is specified in code points, which, as you point out, are already as good as 32 bits long.

sure the gnu c library has had bad wchar_t conversion routines in the past, but it's a free country. you can implement your own.
Frankly, I haven't even used C library for the purpose. We had already one library developed in-house, because portable support for utf-8 is patchy at best.
The sanest portable approach is to link with iconv and convert everything from some internal presentation to external. Because you can never know what encoding user needs. Unless you really need to save the RAM (one has shitload of string data), utf-8 simply sucks as internal presentation.
P.S. I have had very little experience with Unicode. But several months of dealing with it have simply convinced me that if one has to deal with l10n/i18n, then utf-16/utf-32 are very good choices. Ditto, if one has to deal with the Unicode. If application really doesn't care what it prints or reads - then pass-through binary (utf-8) works too. But as soon as one has to take the length of utf-8 string (real length), then it is time to start switching from utf-8 to utf-32.

--
All hope abandon ye who enter here.
Re:utf-32/ucs-4 by bhaak1 · 2015-01-11 05:48 · Score: 5, Informative

Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ .

--
UnNetHack: NetHack Improved!
Re:utf-32/ucs-4 by PhrostyMcByte · 2015-01-11 06:23 · Score: 4, Informative

Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
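
To make the code point versus visible character distinction concrete, here is a rough C sketch over a 0-terminated UTF-32 array. The function names are made up for illustration, and the combining test only covers the Combining Diacritical Marks block (U+0300 to U+036F); real grapheme segmentation follows the Unicode text segmentation rules and needs proper data tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deliberately crude: treats only U+0300..U+036F as combining marks.
   This is an assumption made to keep the sketch short. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Number of code points in a 0-terminated UTF-32 string. */
static size_t utf32_len(const uint32_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Approximate number of visible characters: every code point that is
   not a combining mark starts a new one. */
static size_t utf32_visible_len(const uint32_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != 0; s++)
        if (!is_combining_mark(*s))
            n++;
    return n;
}
```

For a sequence such as U+012F U+0307 U+0301 the first function reports three and the second reports one, so even a fixed-width encoding needs a scan to answer "how many characters is this?".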
Re:utf-32/ucs-4 by Anonymous Coward · 2015-01-11 06:28 · Score: 3, Insightful

Let me answer with a koan: 'What is the real length of a soft hyphen?'
Re:utf-32/ucs-4 by Chris+Dodd · 2015-01-11 06:40 · Score: 2

It's obvious you have little real experience with unicode, because saying 'just convert to utf-32' just papers over the problems without solving them. UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in utf-32. So you still have all the length and normalization problems you have with utf-8 (and even with ASCII, though people often ignore it there -- are 'a' and 'A' the same character? How do they sort?) The real 'length' problem is that people insist on using the term ambiguously -- you have string storage space and string rendering size, and the two are completely independent.
Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 06:41 · Score: 1

It is the same problem as with the fancy acute/agrave/etc special symbols.
And the special white-space/no-space characters. And the special writing direction change characters.
They are generally removed during normalization/conversion into canonical presentation.
The thing is, after the normalization, which is needed for any Unicode text anyway, UCS-4 becomes a plain array of characters. But UTF-8 - still not.

--
All hope abandon ye who enter here.
Re:utf-32/ucs-4 by TheRaven64 · 2015-01-11 06:47 · Score: 1

The thing is, after the normalization, which is needed for any Unicode text anyway, UCS-4 becomes a plain array of characters. But UTF-8 - still not.
It becomes a plain array of codepoints. Some things still require multiple codepoints to represent, though they're relatively rare. The main advantage of UTF-32 is that if you're only broken for the things that are multiple codepoints, then most people won't notice or care. If you're broken for things that require multibyte UTF-8 characters, then a lot of people will notice.

--
I am TheRaven on Soylent News
Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 06:47 · Score: 1

It's obvious you have little real experience with unicode, because saying 'just convert to utf-32' just papers over the problems without solving them.

Indeed I've only scratched surface. And that alone gave me headaches for months.

UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in utf-32.

For example?

--
All hope abandon ye who enter here.
Re:utf-32/ucs-4 by IcyWolfy · 2015-01-11 07:59 · Score: 1

"However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301."
Re:utf-32/ucs-4 by Anonymous Coward · 2015-01-11 08:04 · Score: 1

A google search tells me that E with dot below and acute accent, used in Yoruba language, has no precomposed codepoint.
Re:utf-32/ucs-4 by IcyWolfy · 2015-01-11 08:12 · Score: 5, Informative

Characters in Thai are rendered in display-order, and not logical order.
so, for example ( mina would be imna) and requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display, rendering either gibberish (mojibake) or unrelated characters/syllables.
Malayalam:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code point by code point, the grapheme displayed would change 4 times.
KA
K'
KSA
KSAA
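
The storage side of these examples can be checked with a small UTF-8 encoder. This is a sketch in C; `utf8_encode` and `utf8_size` are names invented for illustration. It shows that one displayed grapheme can occupy many bytes regardless of encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into buf (room for 4 bytes needed).
   Returns the number of bytes written, or 0 for a surrogate or a value
   above U+10FFFF. */
static int utf8_encode(uint32_t cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Total UTF-8 size in bytes of n code points (invalid ones count as 0). */
static size_t utf8_size(const uint32_t *cps, size_t n)
{
    unsigned char tmp[4];
    size_t total = 0, i;

    assert(cps != NULL);
    for (i = 0; i < n; i++)
        total += (size_t)utf8_encode(cps[i], tmp);
    return total;
}
```

The four-code-point sequence 0d15 0d4d 0d38 0d3e comes to 12 bytes in UTF-8 and 16 bytes in UTF-32, and in both cases it is still a single unit for display and truncation.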
Re:utf-32/ucs-4 by Anonymous Coward · 2015-01-11 08:31 · Score: 1

So, based on your few months of experience with Unicode, which apparently gives you a headache, you are pushing for them to implement an easier short-term solution that you admit won't work in some cases.
Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 08:33 · Score: 1

Characters in Thai are rendered in display-order, and not logical order. [...]
Ha! Not relevant to me, actually. But very informative. Thanks.
Overall, most customers are aware of the problems (and in my experience better than me). Simple handling I had in my software had worked and was sufficient.
The Thai language specifically is a cool example. Why not relevant? My company refused to do Thai localization. (And thanks to you now I know fully why.) To do the localization we were told that we have to buy a special Thai language library. The library costs huge money. When we told customer that they would have to pay for it, they have refused and canceled the project, because for them it was too expensive.

--
All hope abandon ye who enter here.
Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 08:41 · Score: 1

So what do you propose?
Go with utf-8 which doesn't alleviate any of the problems? But adds its own one?
Beside, I doubt very much that anybody is going to use any of the fancy characters in the NetHack.

--
All hope abandon ye who enter here.
Re:utf-32/ucs-4 by dodobh · 2015-01-11 12:53 · Score: 1

Combining characters (and the rest of the crap) pretty much never occur in real life.
Depends on the scripts and languages. It's fairly common in Indic scripts

--
I can throw myself at the ground, and miss.
Re:utf-32/ucs-4 by jrumney · 2015-01-11 13:02 · Score: 2

one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are.
UTF-8 is designed to be treated as a byte stream - even when detecting character boundaries. If a byte is >0x7F and <0xC0, then it is not a character boundary. If you want to be really strict, filter out the invalid bytes (0xC0, 0xC1, >0xF4), then everything else is a character boundary.
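
The boundary rule just described fits in a few lines of C. This is a sketch with invented function names; it only classifies bytes and does not validate whole sequences.

```c
#include <assert.h>
#include <stddef.h>

/* A byte starts a character unless it is a continuation byte,
   i.e. has the bit pattern 10xxxxxx (0x80..0xBF). */
static int utf8_is_boundary(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Bytes that never appear in well-formed UTF-8: 0xC0, 0xC1, 0xF5..0xFF. */
static int utf8_is_invalid_byte(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b > 0xF4;
}

/* Move byte offset i back to the start of the character that contains
   it. Handy when truncating at a byte limit without splitting a
   multi-byte sequence. */
static size_t utf8_align_back(const char *s, size_t i)
{
    assert(s != NULL);
    while (i > 0 && !utf8_is_boundary((unsigned char)s[i]))
        i--;
    return i;
}
```

Because the test is local to a single byte, code can resynchronise from any offset in the stream, which is what makes byte-stream handling workable.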
Re: utf-32/ucs-4 by Anonymous Coward · 2015-01-11 14:19 · Score: 1

Do you realize that ICU is free? And is capable of parsing Thai and other crazy languages based on all the insane rules? ICU abstracts things like lines, paragraphs, etc, based on the Unicode rules.
Pay indeed.... The only thing you need to pay for is somebody to QA your app who can read the foreign language. And maybe for half a clue about learning this stuff, because apparently you're too lazy to read all the free material on the web that explains this stuff.
Re:utf-32/ucs-4 by Antique+Geekmeister · 2015-01-11 17:41 · Score: 1

> Everybody has already settled on the little-endian presentation.
What makes you think this? There are plenty of old Motorola architecture based systems still in legacy environment use, preserved for stable scientific or business computing environments. NASA has a great deal of it still in use, because they've been forced to keep old earthbound hardware in use to support old spacebound mission hardware. And there is a significant amount of new, bi-endian hardware being produced now.
I'm afraid I have quite a lot of experience with Unicode compatibility and cross compatibility. Frankly, for a multi-platform tool like Nethack, I'd stay with the 8-bit, one byte, extremely stable 'POSIX' standard.
Re:utf-32/ucs-4 by ThePhilips · 2015-01-11 20:58 · Score: 1

Everybody has already settled on the little-endian presentation.
What makes you think this? There are plenty of old Motorola architecture based systems still in legacy environment use, preserved for stable scientific or business computing environments.
Man, I come from the BE world. You do not need to tell me that there is still abundance of the BE hardware.

And there is a significant amount of new, bi-endian hardware being produced now,
Most modern CPUs I had to deal with, except the Intel, are bi-endian. BUT. Most (by model number) are used in BE mode. (But since ARM also has settled on the LE, now it is effectively a LE world.)
Yet.
1st. The endianness of the CPU is not related to the endianness of a data exchange format.
2nd. The endianness of the data exchange format does not relate to the internal presentation of the data in the application's memory.
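
One way to see both points in C is to serialize code points with shifts instead of copying memory. The helper names below are invented for this sketch; the byte order on the wire is fixed by the code, whatever CPU it runs on.

```c
#include <assert.h>
#include <stdint.h>

/* Write cp as 4 bytes in big-endian order (UTF-32BE), independent of
   the host CPU's byte order. */
static void put_utf32be(uint32_t cp, unsigned char out[4])
{
    assert(out != 0);
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* Read 4 big-endian bytes back into a code point. */
static uint32_t get_utf32be(const unsigned char in[4])
{
    assert(in != 0);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* The little-endian variant differs only in which byte goes first. */
static void put_utf32le(uint32_t cp, unsigned char out[4])
{
    assert(out != 0);
    out[0] = (unsigned char)cp;
    out[1] = (unsigned char)(cp >> 8);
    out[2] = (unsigned char)(cp >> 16);
    out[3] = (unsigned char)(cp >> 24);
}
```

In memory the application keeps plain `uint32_t` values in whatever layout the CPU prefers; only the I/O boundary cares about byte order.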

I'm afraid I have quite a lot of experience with Unicode compatibility and cross compatibility. Frankly, for a multi-platform tool like Nethack, I'd stay with the 8-bit, one byte, extremely stable 'POSIX' standard.
You folks lump it all together. There are two sides to it: internal presentation and external conversion.
For internal presentation, one goes with whatever makes your life as developer easier. UCS-4 is definitively an option. UTF-8 (aka "I do not care, just passing data through") is also OK. Most applications fall into the latter category. But if one ever starts pondering use of the widechars, when one needs to actually peek at the data, then there is simply no point using the UTF-16. And UTF-8 has disadvantages when .
For external conversions, all what matters that the internal format can be easily converted into the widely used encodings. Application doesn't have any direct control over it - it is user controlled. User might pick UTF-8. Or JIS. Or win-1257. And application has to make sure that when it spews the data to outside, they come out in the encoding requested by the user.
The naive notion that utf-8 is used by everybody is extremely naive. And IMO it is rooted in the same arrogance which held back the *nix world for decades in the dark ages of the 7-bit ASCII.

--
All hope abandon ye who enter here.
Re: utf-32/ucs-4 by Rich0 · 2015-01-12 00:06 · Score: 1

Pay indeed.... The only thing you need to pay for is somebody to QA your app who can read the foreign language. And maybe for half a clue about learning this stuff, because apparently you're too lazy to read all the free material on the web that explains this stuff.
I imagine the problem for many software companies is that localization is probably an afterthought unless they start out in Asia, and even then Thai is probably not a priority for them so while they'll certainly handle Unicode, they might not handle all of it.
Re:utf-32/ucs-4 by Millennium · 2015-01-12 01:32 · Score: 1

Even in the real world, a surprising number of languages contain characters that still have no single-point normalized forms. But the most widely-known case of multi-point characters doesn't correspond to any real-world language at all: look up "Zalgo" for more information on this.
Re:utf-32/ucs-4 by Antique+Geekmeister · 2015-01-12 01:56 · Score: 1

> For external conversions, all what matters that the internal format can be easily converted into the widely used encodings.
And this is the difficulty. It's not the _ease_. It's the consistency, predictability, and portability. Many external displays of Unicode content have varied between platforms in alarming ways, especially due to mishandled character displays which the programmer has little control over. It may have gotten better since my last go-around with it, but even simple layout issues like column alignment have been screwed up, especially when the legitimate Unicode character generates an erroneous on-screen error code instead of a single character display. And _that_ can ruin Nethack layouts, in ways unpredictable to the maintainers.
Re:utf-32/ucs-4 by jonadab · 2015-01-13 13:53 · Score: 1

Five bytes. In decimal, it'd be 38 115 104 121 59. HTH.HAND.

--
Cut that out, or I will ship you to Norilsk in a box.

Use utf if you must, for character names, only. by Little+Brother · 2015-01-11 04:31 · Score: 4, Interesting

I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same, (or allow it as a compile time option at the very least) and just use UTF for the character name.

For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.

--

Little Brother, watching the watchers

Re:Use utf if you must, for character names, only. by Anonymous Coward · 2015-01-11 05:08 · Score: 1

Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.
Re:Use utf if you must, for character names, only. by PhrostyMcByte · 2015-01-11 06:12 · Score: 1

For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
UTF-8, while originally only defined to 31 bits and now defined to 21 bits, actually has room to trivially extend up to 43 bits. One could say it's more future-proof than UTF-32. Not that it really matters -- we're only using 17 bits right now so I doubt we'll ever get past 21. Maybe when we encounter intelligent alien life.
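
The 21-bit and 31-bit figures follow from the lead-byte pattern and can be reproduced with a short calculation. The C sketch below (function names invented here) covers only the 1 to 6 byte forms of the original definition; it does not model any longer hypothetical extension.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bits carried by an n-byte sequence (n = 1..6) under the
   original UTF-8 lead-byte scheme: 0xxxxxxx, 110xxxxx, 1110xxxx, ...
   Each continuation byte contributes 6 bits. Returns -1 for other n. */
static int utf8_payload_bits(int n)
{
    if (n < 1 || n > 6)
        return -1;
    if (n == 1)
        return 7;
    return (7 - n) + 6 * (n - 1);
}

/* Largest value representable in n bytes, or 0 if n is out of range. */
static uint64_t utf8_max_value(int n)
{
    int bits = utf8_payload_bits(n);

    assert(bits < 64);
    if (bits < 0)
        return 0;
    return ((uint64_t)1 << bits) - 1;
}
```

Four bytes give 21 bits, which comfortably covers the current limit of U+10FFFF, and six bytes give 31 bits, matching the older upper bound of 0x7FFFFFFF.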
Re:Use utf if you must, for character names, only. by lgw · 2015-01-11 08:30 · Score: 1

Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.
Thanks for reminding me why I don't play Nethack - briefly I was tempted.

--
Socialism: a lie told by totalitarians and believed by fools.
Re:Use utf if you must, for character names, only. by DahGhostfacedFiddlah · 2015-01-11 08:36 · Score: 2

Don't worry, that would never happen.
G's are only ever gnomes (of differing ranks), but a g might be a gargoyle, flying gargoyle, or gremlin.
I hope that clears things up. And for god's sake, don't genocide G's if you're playing as a gnome.

--
Last post!
Re:Use utf if you must, for character names, only. by DahGhostfacedFiddlah · 2015-01-11 08:38 · Score: 2

(I may well hold the record for longest time playing without an ascension)
I think I started Hack around '92, and finally ascended in 2009. I've been trying to ascend about once a year since then.

--
Last post!
Re:Use utf if you must, for character names, only. by jonadab · 2015-01-13 13:58 · Score: 1

Actually, that could happen, if you've eaten a purple F, for example. (Admittedly, the fact that "gnome" and "gargoyle" both start with g would be an irrelevant coincidence in such a case.)

--
Cut that out, or I will ship you to Norilsk in a box.
Re:Use utf if you must, for character names, only. by DahGhostfacedFiddlah · 2015-01-14 13:59 · Score: 1

Well, clearly *that's* an exception. I'd forgotten because in that case, I'd normally a a u's (.

--
Last post!

Fonts missing in action by Anonymous Coward · 2015-01-11 04:37 · Score: 2, Informative

First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.

That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.

Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.

Re:Fonts missing in action by Ark42 · 2015-01-11 04:47 · Score: 1

The font issue is a silly thing to worry about. The same thing can be said of ASCII and of Windows-1252. I'm sure lots of early fonts, and probably even some you find today, that claim to support all glyphs in Windows-1252, are missing the Euro sign at codepoint 0x80, because they added it later on. Even for a small character set restricted to 256 max characters, as you can see, things change over time, and fonts don't always keep up.

--
Morphing Software
Re:Fonts missing in action by drinkypoo · 2015-01-11 05:10 · Score: 1

Shouldn't the font system just solve this for me in the case of display use? Sure, for typography you probably don't want magical mystery substitutions, but why can't the system figure out which of my fonts is most similar to the font I'm using and sub in missing glyphs?

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Fonts missing in action by Ark42 · 2015-01-11 06:28 · Score: 1

I'm pretty sure most font systems already DO do this. In fact, this was the reason I rooted my Android phone - I wanted to change the font-fallback order so that certain Kanji would display with a Japanese font instead of Chinese one. An example is http://jisho.org/kanji/details... which is drawn completely differently in Chinese fonts, to the point where Japanese readers would not know the symbol, yet both are supposed to be represented by the same codepoint, because they're the same character.
But anyway, fonts and display aren't a character set encoding issue. It doesn't matter how you represent the glyph on disk or in memory, if your fonts are all missing a rendering for the character, you're going to just see a placeholder box no matter what.

--
Morphing Software
Re:Fonts missing in action by IcyWolfy · 2015-01-11 08:18 · Score: 3, Informative

Terminology needs to be fixed.
All codepoints are 4 bytes.
All characters (defined as a single conceptual and graphical display unit) range from 1 to 6 code points (so, 4-24 bytes).
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character (Sri). It's kind of a fancy item, but it differs from the sequence without the ZWJ, which would display two graphemes (S' and RII).
And Lithuanian:
"However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301."
And there are many other cases where there is no single code-point to represent a single grapheme.
So for string truncation and line-splitting (and anything dealing with Arabic or Indic scripts), you must never crop in the middle of a codepoint sequence that defines a single grapheme; otherwise the visual display is incorrect, or mojibake (gibberish).
Re:Fonts missing in action by jrumney · 2015-01-11 13:22 · Score: 1

I'm pretty sure most font systems already DO do this.
Usually not the font systems themselves, as the font system API needs to be designed to let you use fonts in the way that suits your application, and not have random substitutions happen behind your back (though the font system provides the API functions to figure out what a good substitution font will be). But higher level UI libraries, like GTK, Qt, MFC, Windows Forms, Core Text, Skia etc will do it.

UTF-8 by Ark42 · 2015-01-11 04:39 · Score: 4, Insightful

The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.

An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.

--
Morphing Software

Re:UTF-8 by Anonymous Coward · 2015-01-11 05:11 · Score: 1

Hell, there's 5-byte UTF-8 characters too (how would we represent UTF-32 characters in UTF-8 otherwise)..
The nicest thing would of course be if the world could hurry up and switch to English so we could just have ASCII for most everything, and UTF-32 for museums and whatnot that needed to store Linear-B or ancient hieroglyphs ;-] [*]
[*] Written by a non-English person, in English, using a US keyboard. It's all good, bros.
Re:UTF-8 by Anonymous Coward · 2015-01-11 06:14 · Score: 1

UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
Short of someone deliberately using a mismatched non-character fixed width type like uint8_t to check for a zero terminator in exactly 8-bit wide units or something equally brain-dead, the scenario above can't happen since the character width checked for is set by the type. Flipping the type from char to char16_t and char32_t should "just work".
This is not to say that there aren't any problems that can't occur (the blind assumption that sizeof(string) == countof(string) and referencing memory off of that, for example) but the PP's comment is just ridiculous.
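
Both sides of this exchange can be checked by looking at the raw bytes. The C sketch below uses invented helper names and only inspects memory; it makes no claim about how NetHack's own code handles strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if any of the first n bytes of buf is zero. */
static int has_embedded_zero(const void *buf, size_t n)
{
    assert(buf != NULL);
    return memchr(buf, 0, n) != NULL;
}

/* Count the zero bytes among the first n bytes of buf. */
static size_t count_zero_bytes(const void *buf, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t zeros = 0, i;

    assert(buf != NULL);
    for (i = 0; i < n; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}
```

Text stored as UTF-8 has no zero byte before its terminator, so `char`-based routines such as `strlen` keep working. The same text stored as 32-bit units is full of zero bytes when viewed through a `char *`, but is perfectly fine when every routine touching it uses the wider type, which is the distinction the reply is drawing.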
Re:UTF-8 by Ark42 · 2015-01-11 06:24 · Score: 1

The official spec limits UTF-8 to 10FFFF to help it play nice with UTF-16, so no 5 or 6 byte sequence is valid anymore. There aren't any characters defined above 10FFFF yet anyway. But in the future, if those ranges are defined, it would be easy to have programs using UTF-8 utilize those characters. If you use UTF-16 like Windows, you'd be out of luck though.

--
Morphing Software
Re:UTF-8 by TheRaven64 · 2015-01-11 06:51 · Score: 1

If you use UTF-16 like Windows, you'd be out of luck though.
UTF-16 doesn't have the problem. UCS-2 (which Windows still mostly uses, even where it pretends to use UTF-16) does. UTF-16 combines the worst of both worlds: a space-inefficient variable length encoding.

--
I am TheRaven on Soylent News
Re:UTF-8 by OrangeTide · 2015-01-11 07:19 · Score: 1

libuncursed exists because ncursesw is so bad. libuncursed is not great, but it's simple. Just the sort of band-aid we need on *nix until someone rewrites ncursesw.

--
“Common sense is not so common.” — Voltaire
Re:UTF-8 by voights · 2015-01-11 12:05 · Score: 1

Oh, really? Name one. (;
Re:UTF-8 by tlhIngan · 2015-01-11 17:18 · Score: 1

UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
Incorrect. UTF-8 works for 8-bit chars.
Here's the truth for C:
sizeof(char) <= sizeof(short int) <= sizeof(int) <= sizeof(long) <= sizeof(long long)
In the past, chars were often not 8 bits long. And there are many architectures where the smallest type is not 8 bits, but 16 or 32 bits.
It's why we have the exact-size specifiers of (u)int8_t, (u)int16_t, (u)int32_t and (u)int64_t.
Re:UTF-8 by Rich0 · 2015-01-12 00:11 · Score: 1

The problem with that is that there are certain thoughts and concepts that can't be expressed in English.
Are you suggesting that one of the most promiscuous languages on the planet wouldn't just add a butchered version of the original word if that were truly the case? Words mean whatever we want them to mean - they're arbitrary combinations of symbols, kind of like unicode. :)
Re:UTF-8 by Ark42 · 2015-01-12 03:40 · Score: 1

UTF-16 is terrible, yes, but Windows does support it. I'm sure naive programmers create bad code by assuming UCS-2 and all characters being 2 bytes, but surrogate pairs like Emoticons U+1F600 - U+1F64F work just fine.
And by "out of luck" I was referring to possible future codepoints above U+10FFFF. UTF-16 can only support up to that by using surrogate pairs. It does not have any way to represent higher codepoints, whereas UTF-8 can easily be extended with 5 and 6 byte sequences.
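For reference, a rough sketch of how a surrogate pair is computed, which also shows where the U+10FFFF ceiling comes from (20 payload bits on top of the 0x10000 offset). The function name `utf16_encode` is hypothetical, not a Windows API.

```c
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                      /* not representable in UTF-16 */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Passing U+1F600 yields the pair D83D DE00; passing 0x110000 returns 0 because there is nowhere to put the extra bit.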

--
Morphing Software
Re:UTF-8 by Eunuchswear · 2015-01-12 07:36 · Score: 1

Citation needed

--
Watch this Heartland Institute video
Re:UTF-8 by ais523 · 2015-01-13 10:54 · Score: 1

I wrote libuncursed specifically for NH4 (but intended to work for other roguelikes too), because curses solves the wrong problem nowadays (the problem of "how to talk to an obscure terminal from the 1980s that uses nonstandard terminal codes", rather than the problem of "how to talk to a modern-day terminal emulator that's incompatible with xterm but nonetheless claims to be compatible"). I wrote more about the problems here.
Vanilla/mainline NetHack doesn't use libuncursed or curses, but rather a homerolled terminal codes library.

--
(1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
Re:UTF-8 by jonadab · 2015-01-13 13:27 · Score: 1

For practical purposes, you can think of libuncursed as the display layer of NetHack 4, replacing an older curses library that NitroHack used, which in turn replaced the extensive and rather complicated set of platform-specific user interfaces NetHack 3.4.3 used, which were never entirely consistent with one another, due to being separately maintained.

libuncursed is distributed with the game, as part of it, and I think it is even linked in statically by default. Yes, it was written as a highly-generalized support library, so that it *could* be used by other projects if desired and could probably even be made a dynamic library. But if all you want to do is build and run NetHack 4, that doesn't matter.

But in any case the original question from the Dev Team is about what to do in the vanilla codebase that may eventually lead to a new vanilla release (with a number yet to be announced, but 3.6 is probable; the number 3.5 will not be used for reasons explained on nethack.org). The vanilla codebase does not use libuncursed and in a number of additional ways is far more similar to 3.4.3 than it is to NetHack 4.

Although, the NetHack 4 devs are probably following this thread as well and may also implement Unicode in a larger way. (Unicode graphics for map display are already supported there, but things like player names, fruit names, object names, and level annotations are still treated as ASCII, I think, the same as in 3.4.3.)

Another thing not mentioned in the post is that the Dev Team is known to have already implemented some Unicode support, using wchar_t, which you can find in the leaked code (a tarball made from the tip of the dev team's internal repository from a few months ago now), if you hunt down a copy of that. But apparently they have not entirely settled on that implementation as the final solution.

--
Cut that out, or I will ship you to Norilsk in a box.

UTF-8 by Anonymous Coward · 2015-01-11 04:43 · Score: 5, Interesting

UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
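A small sketch of what "just works" means in practice, assuming the input is valid UTF-8. `strlen` still finds the terminator and reports the byte length; the helper `utf8_codepoints` (a made-up name, not a standard function) shows the one place the old "bytes == characters" assumption breaks.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes, which always have the bit pattern 10xxxxxx.
 * Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                       /* lead byte or plain ASCII */
    }
    return n;
}
```

For the string `"\xC3\xA9" "clair"` (that is, "éclair"), `strlen` reports 7 bytes while `utf8_codepoints` reports 6, so buffer sizing keeps working but anything that assumes one byte per screen column needs auditing.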

Re: Short of memory? by jones_supa · 2015-01-11 04:43 · Score: 2

UTF-32 should make memory allocation more predictable as every character is guaranteed to be 32 bits.

Go with the majority by namgge · 2015-01-11 04:54 · Score: 4, Insightful

In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.

The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.

Re: Short of memory? by Anonymous Coward · 2015-01-11 05:11 · Score: 1

Every codepoint, not character. Big difference. No normalization form guarantees one character per codepoint. Well, except Perl's NFG, but that requires dynamic mapping.

Language by Tim+Locke · 2015-01-11 05:20 · Score: 1

Use what your programming language supports. If it supports Unicode, use UTF8 as it saves space. UTF32 isn't "one character per 32 bits" so it's no easier than UTF8.

--
*** On the Internet, no one knows you're using a VIC-20

Re:Language by Anonymous Coward · 2015-01-11 06:00 · Score: 1

A code point isn't necessarily a character...

UTF-8 Already Works to Name Your Pet by Anonymous Coward · 2015-01-11 05:23 · Score: 1

I'm not sure why they need to do anything, I can successfully name my pet in Nethack 3.4.3 using UTF-8 characters.

Re:UTF-8 Already Works to Name Your Pet by L.+J.+Beauregard · 2015-01-11 07:47 · Score: 1

It may work well with pets; but name your fruit "éclair", with the accent, and see what happens.

--
Ooh, moderator points! Five more idjits go to Minus One Hell!
Delendae sunt RIAA, MPAA et Windoze
Re:UTF-8 Already Works to Name Your Pet by cruff · 2015-01-11 15:11 · Score: 1

It works just fine.
Re:UTF-8 Already Works to Name Your Pet by jdschulteis · 2015-01-12 10:59 · Score: 1

Yeah but how many wand charges does it take to engrave your pet's name? Huh?
That's why I always name my pet Elbereth...

Re: Short of memory? by jones_supa · 2015-01-11 05:35 · Score: 1

The UTF-32 form of a character is a direct representation of its codepoint.

Unicode-related poll on Slashdot?! by excelsior_gr · 2015-01-11 05:35 · Score: 1

I don't know if I'm supposed to marvel at the submitter's sarcastic nerve or laugh at the irony.

I think I'm gonna do both.

Re: Short of memory? by nmoore · 2015-01-11 05:41 · Score: 5, Informative

There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
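To make that concrete, here is a rough sketch over UTF-32 data. It is deliberately simplified: `is_combining_mark` only knows about the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption for illustration; real code would use the Unicode character database. Both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplification: treat only U+0300..U+036F as combining marks.
 * Real Unicode has many more combining characters than this. */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count code points that are NOT combining marks in that block,
 * i.e. a crude approximation of "characters the user sees". */
size_t count_base_chars(const uint32_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_combining_mark(s[i]))
            n++;
    }
    return n;
}
```

The sequence {U+0065, U+0301} ("e" plus combining acute) has two code points but one base character, the same as the single precomposed code point U+00E9, so even in UTF-32 the array length is not the character count.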

NetHack Development Team Not Dead by mlkj · 2015-01-11 05:50 · Score: 1

All of my fucking YES.

It wasn't just a rumor.

Re:NetHack Development Team Not Dead by L.+J.+Beauregard · 2015-01-11 07:40 · Score: 2

AFAICT the original query came from the actual DevTeam. The blog post in the submission is from the NetHack4 guy, who I suspect is also the anonymous submitter.

--
Ooh, moderator points! Five more idjits go to Minus One Hell!
Delendae sunt RIAA, MPAA et Windoze
Re:NetHack Development Team Not Dead by chispito · 2015-01-11 13:00 · Score: 1

All of my fucking YES.
It wasn't just a rumor.
Just switch to a different game or a variant. You'll be happier.

--
The Daddy casts sleep on the Baby. The Baby resists!
Re:NetHack Development Team Not Dead by ais523 · 2015-01-13 10:52 · Score: 1

It wasn't me who submitted this story. I would have done if I thought there was a chance of it being accepted - more opinions are always good and Slashdot has lots of technically-inclined users who likely have relevant opinions - but it seemed a little offtopic.
I did write the blog post for the purpose of being linked to news aggregators so that people would have more than a bare post from the devteam to introduce the issues, though.

--
(1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"

Re:Short of memory? by Anonymous Coward · 2015-01-11 05:51 · Score: 1

Plan for the future, man.. UTF-64

Fuck unicode by russotto · 2015-01-11 06:07 · Score: 1

Unicode is a clusterfuck. 7 bits is good enough for anyone.

Re: Short of memory? by jones_supa · 2015-01-11 06:08 · Score: 1

Hmm...

Fonts missing in action by bhaak1 · 2015-01-11 06:08 · Score: 1

First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4.

The memory usage of UTF-8 is also at most char count multiplied by 4. The 5- and 6-byte sequences were declared invalid when Unicode was restricted to have no character above U+10FFFF.
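A sketch of an encoder that follows the current restriction, so the 4-byte bound is visible in the code. The name `utf8_encode` is made up here; it is not taken from NetHack or any particular library.

```c
#include <stdint.h>

/* Encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or above U+10FFFF, in which case nothing is written. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                          /* 5- and 6-byte forms are invalid */
}
```

So a buffer of `4 * code_point_count + 1` bytes is always enough for the UTF-8 form of a string.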

--
UnNetHack: NetHack Improved!

The one true encoding by reanjr · 2015-01-11 06:33 · Score: 1

The answer is always UTF-8. It doesn't matter what project, or country, or language. Anything other than UTF-8 will cause completely avoidable problems. I wish more programmers would learn this rule, as it would make all our jobs easier.

Re:The one true encoding by Shados · 2015-01-11 06:50 · Score: 1

for ease of use and storage efficiency with flexibility, yeah, UTF-8 is always best.
For certain type of work with specific performance characteristics however, not so. Thats usually the problem.
Re:The one true encoding by ledow · 2015-01-11 07:08 · Score: 1

Very few places are going to be dealing in UTF-32 just because of the performance.
And certainly not Nethack.
In all my projects, I use UTF-8. Any performance hit is so far off the radar, it's just not worth worrying about.
Re:The one true encoding by Shados · 2015-01-11 08:06 · Score: 1

Absolutely. I was just saying that it was basically the only time anything other than UTF8 matters (especially since in the time when it matters, switching from one to the other is HELL).
My wife used to work on a faceted search system made to handle a few petabyte of data... the difference was pretty huge.
Since I personally never did something like that, I never had issues just using UTF8 :)

Re:Short of memory? by davester666 · 2015-01-11 06:39 · Score: 1

short-term thinker!

UTF-128 FTW!

--
Sleep your way to a whiter smile...date a dentist!

UTF-8 by jmccue · 2015-01-11 07:09 · Score: 1

I am a bit concerned about the statement on "libuncursed", which does not seem to be in many distros. To me it seems the change is being made to cater to non UN*X systems and hoping to move away from curses. So given the way I read the articles, I would prefer UTF-8 and try to use 'standard' libs. This way data is easily moved between different system types and the change will still support older under-powered hardware.

UTF-32 would save memory in some cases by OrangeTide · 2015-01-11 07:14 · Score: 1

A UTF-8 string would require a pointer to it, on a 64-bit system that's 8 bytes, plus the overhead of dynamic allocation (typically 8 bytes). But if you only needed a single character, then a UTF-32 could accomplish that in 4 bytes. Effectively making UTF-32 one quarter the size of a typical UTF-8 implementation, when operating with the constraint that there is a single character per data structure/item/tile/object/whatever.
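A back-of-the-envelope version of that comparison in C. Everything here is hypothetical: the struct and function names are invented, the allocator overhead is passed in as a parameter because it varies by platform, and the 80x21 grid in the usage note is only an illustrative size.

```c
#include <stddef.h>
#include <stdint.h>

/* One code point stored inline in the tile. */
struct tile_utf32 {
    uint32_t glyph;
};

/* A pointer to a separately allocated, NUL-terminated UTF-8 string. */
struct tile_utf8_ptr {
    char *glyph;
};

/* Total bytes for a grid of inline UTF-32 tiles. */
size_t grid_bytes_utf32(size_t cols, size_t rows)
{
    return cols * rows * sizeof(struct tile_utf32);
}

/* Total bytes for a grid of pointer tiles: the pointer itself, the
 * allocator's per-block overhead (caller's estimate), and the string
 * bytes plus terminator (avg_len is the average UTF-8 length). */
size_t grid_bytes_utf8_ptr(size_t cols, size_t rows,
                           size_t alloc_overhead, size_t avg_len)
{
    return cols * rows *
           (sizeof(struct tile_utf8_ptr) + alloc_overhead + avg_len + 1);
}
```

For an 80x21 grid, `grid_bytes_utf32(80, 21)` is 6720 bytes on a platform with 8-bit bytes, while `grid_bytes_utf8_ptr(80, 21, 8, 1)` comes out several times larger on a 64-bit system.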

--
“Common sense is not so common.” — Voltaire

Re:UTF-32 would save memory in some cases by wiredlogic · 2015-01-11 08:08 · Score: 1

As mentioned above this idea fails when combining characters are needed. This is the advantage of UTF-8 since you are forced to deal with variable length characters anyway. Support for combining chars won't be overlooked in most cases.

--
I am becoming gerund, destroyer of verbs.
Re:UTF-32 would save memory in some cases by petermgreen · 2015-01-11 15:53 · Score: 1

when operating with the constraint that there is a single unicode code point per data structure/item/tile/object/whatever.
Fixed that for you.
So you'd support rare chinese characters but exclude unusual letter/diacritic combinations.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:UTF-32 would save memory in some cases by OrangeTide · 2015-01-12 12:58 · Score: 1

Yes, your tiled data would be limited to situations where the NFC form (Normalization Form C: canonical decomposition followed by canonical composition) is a single code point. It's extremely difficult to find exceptions that are valid in an extant natural language, but they do exist.
If the tiles themselves require multiple code points, where the number of code points is greater than about 3, then UTF-8+pointer is the more compact solution than a fixed UTF-32 array. (by my rough napkin estimations). A fixed UTF-8 array doesn't really offer any advantage over UTF-32, as the utf8 version can hold fewer code points in the worst case as the encoding can be up to 50% larger in some weird corner cases.
My comments were only about tiled data. Naming items (strings) in the game seems easiest to do as UTF-8. As a lot of the existing functions can work, and you don't have to alter the file format for bones and saves. The utf-8 versions might even show up correctly on old binaries, although the positioning on the screen would likely be screwed up.

--
“Common sense is not so common.” — Voltaire

So this is why Windows uses UTF-16? by jsrjsr · 2015-01-11 08:13 · Score: 1

Pack of sadists at Microsoft?

Re:So this is why Windows uses UTF-16? by DickBreath · 2015-01-12 02:49 · Score: 1

Masochists use code pages from Microsoft.

--

I'll see your senator, and I'll raise you two judges.

It's the library, not the encoding by Cacadril · 2015-01-11 08:57 · Score: 1

Wrong question. What text-handling (string-handling) library do you want to use? Then use whatever that library supports. If still in doubt, go for UTF-8. Then you will be less tempted to think you can handle things yourself in C code. If you do think so, you will invariably write buggy code because you don't know enough about the issues.

--
There is no substitute for common sense. Especially, no body of rules will do.

Re: Short of memory? by TheCarp · 2015-01-11 09:56 · Score: 1

I have worked with some people who would consider this :)

Actually a while back I found someone was passing around instructions on how to set up some software that needed a random key for a symmetric cipher. It used a 256 bit block cipher so it needed a 256 bit key.

The instructions being passed around were clearly cut and pasted from a web site (they might have even had the url) but they remembered that we had key policies for other things and so they changed the dd command to make a 1024 bit key....because we use at least 1024 bit keys by policy right?

A little bit of knowledge can be such an amusing thing.

--
"I opened my eyes, and everything went dark again"

What the hell? by DrXym · 2015-01-11 10:34 · Score: 1

It's still the same strings as far as the end user is concerned. A UTF-16 encoded string looks the same to the end user as a UTF-8 encoded string. But given that the codebase is legacy the only sensible choice is UTF-8.

What the web was built on by RogueWarrior65 · 2015-01-11 14:40 · Score: 3, Funny

Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.

Re: Short of memory? by petermgreen · 2015-01-11 15:47 · Score: 3, Insightful

What does "character" mean?

Something represented by one unicode codepoint? (making your statement a tautology)
Grapheme cluster? (what most users would consider a character)
A position in the character grid of a console?

Which brings us to the real question. To what extent do you want to support unicode? Do you care about

* Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
* Right to left languages? (hebrew, arabic etc)
* Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
* Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (Chinese, Japanese, Korean)
* Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)

Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.

UTF-32 seems enticing initially but turns out to be fairly pointless, by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc and it will massively increase the size of the vast majority of text.

UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" much longer which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later after basing much more code on them). UTF-8 is smaller for text with lots of latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by windows and Java.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Re: None by Hognoxious · 2015-01-11 20:26 · Score: 2

Definite a canadian

http://en.wikipedia.org/wiki/A...

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Re:Perhaps it's about translations? by bhaak1 · 2015-01-11 21:35 · Score: 1

No, it's not about translations (Source: I'm a NetHack fork developer who's somewhat involved in this DevTeam revival thing).

Translations of games like NetHack are inherently hard. You can't use the standard approaches as the program assembles the sentences out of several parts and usually, e.g. with gettext, you translate whole sentences. But here, we have dynamic sentences where this approach can't work.

For my German translation NetHack-De, I used internally latin-1 (so I could continue to use the char* strings) and for the output transformed the text into ASCII, latin-1, or utf-8 (depending on configuration).
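The latin-1 to utf-8 output step is small enough to sketch here. This is an illustration of the general technique, not the actual NetHack-De code; the function name `latin1_to_utf8` and the buffer contract are assumptions.

```c
#include <stddef.h>

/* Convert a NUL-terminated Latin-1 string to UTF-8.
 * The caller must provide room for 2 * strlen(in) + 1 bytes in out,
 * since every Latin-1 byte becomes at most two UTF-8 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t latin1_to_utf8(const char *in, char *out)
{
    size_t n = 0;
    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            out[n++] = (char)c;                    /* ASCII passes through */
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));    /* lead byte C2 or C3 */
            out[n++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

This works because Latin-1 byte values are identical to the first 256 Unicode code points, so the internal strings can stay as plain `char*` and only the output layer needs to know about the terminal's encoding.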

--
UnNetHack: NetHack Improved!

Re: Short of memory? by SkepticalEmpiricist · 2015-01-12 02:06 · Score: 1

> Grapheme cluster? (what most users would consider a character)

Is there any easy way to tell where one grapheme cluster ends, and another begins? With UTF-8, it's easy to count the bits to see where one codepoint begins and ends, I hope there is something equally simple for grapheme clusters. Or perhaps it's all complicated and is different for each language?
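For the code point half of that, the "counting the bits" trick looks roughly like this. It is a sketch with a made-up name (`utf8_seq_len`), and it only inspects the lead byte; it does not validate the continuation bytes that follow.

```c
/* Length in bytes of the UTF-8 sequence that starts with lead byte b,
 * or 0 if b cannot start a sequence (a continuation byte 0x80..0xBF,
 * the overlong leads 0xC0/0xC1, or anything above 0xF4). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                      /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                      /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                      /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                      /* 11110xxx, capped at U+10FFFF */
    return 0;
}
```

Stepping through a string is then `s += utf8_seq_len((unsigned char)*s)` after checking for a 0 return.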

Also, if I do accidentally split a grapheme cluster in two (while respecting codepoint boundaries), what will happen? If I attempt to display the two strings, can I expect a sensible result, or will the result be garbage?

Re:Short of memory? by hoggoth · 2015-01-12 02:32 · Score: 1

128 bits should be enough for anyone.

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)

Re:Short of memory? by DickBreath · 2015-01-12 02:54 · Score: 1

No need for UTF-64. You could double the huge space of UTF-32 by just introducing UTF-33. Unlike UTF-8 or UTF-16 there would be no unpredictable memory allocation problems. Every character would get a nice clean 33 bits. Be sure to bitshift and pack characters so that no memory is wasted. That should make the world a wonderful place and everyone will be happy.

--

I'll see your senator, and I'll raise you two judges.

Add Nethack icon set to unicode and then use it by nonos · 2015-01-12 03:35 · Score: 1

The X Window version of Nethack has a nice default icon set for character classes, monsters, floors, walls... Why not propose this icon set as an addition to the Unicode categories and then use those characters ?

Re:Short of memory? by davester666 · 2015-01-12 09:37 · Score: 1

bytes! there are some crazy alien languages out there!

--
Sleep your way to a whiter smile...date a dentist!

Screw that, where's the emoji? by cdensch · 2015-01-12 11:17 · Score: 1

Wanna see a little ol poop clod all running around.

Re:Perhaps it's about translations? by jonadab · 2015-01-13 14:21 · Score: 1

Actually, there are German and Japanese variants. (The German one is a translation of UnNetHack, done I think by the same guy who did the English version of that variant. The Japanese one, somewhat older, is called NetHack Brass and seems to be mainly a flavor variant, i.e., it changes much more than just language.)

--
Cut that out, or I will ship you to Norilsk in a box.

Re: Short of memory? by petermgreen · 2015-01-19 07:36 · Score: 1

Is there any easy way to tell where one grapheme cluster ends, and another begins? With UTF-8, it's easy to count the bits to see where one codepoint begins and ends, I hope there is something equally simple for grapheme clusters. Or perhaps it's all complicated and is different for each language?

As I understand it, it comes down to table lookups. The details of full unicode support are unfortunately not trivial and there's a reason libraries like ICU are as big as they are.

Also, if I do accidentally split a grapheme cluster in two (while respecting codepoint boundaries), what will happen? If I attempt to display the two strings, can I expect a sensible result, or will the result be garbage?

As I understand it normally the base character is first and then things added to it follow.

So if you cut the end off a string and cut in the middle of a cluster then the last character may be missing some bits but the string is likely to be otherwise OK.

If you cut the start off a string and cut in the middle of a cluster things get messier. You then have combining characters at the start of the string with nothing to combine with. If you just ask a display library to display it then it's going to be down to the display library what happens but I expect the combiners will either be not displayed at all or displayed with no base. If you add the cut string to the end of another string then the combiners will combine with whatever was at the end of the string you combined it with.

All in all you will probably end up with something "ugly but usable".

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
