NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
utf-8
If masochist, just UTF-16. If slashdot coder use ASCII.
who cares? This only affects naming your character and displaying stuff on the map.
What use are those characters anyway? You don't need funny accents on letters to play Nethack. 7 bits should be enough for any character set! Hardcore hackers who want a workaround can just use LaTeX codes.
Considering the length of their release cycle, seems to be a safe choice.
It's not like the difference between 1, 2, or 4 bytes would make much performance difference for an application like NetHack.
Using UTF-32 internally would save them from some of the silliness that alternatives like UTF-8 bring with them.
All hope abandon ye who enter here.
I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same, (or allow it as a compile time option at the very least) and just use UTF for the character name.
As for which UTF encoding to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
Little Brother, watching the watchers
First off, UTF-32 is least likely to cause bugs, since all chars are the same length, so you can determine memory usage simply by multiplying the char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.
That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.
Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.
The answer is UTF-8. It's pretty much going to be the de facto character encoding now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
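To make the 4-byte point concrete, here is a small Python sketch. The helper names `utf8_bytes` and `cesu8_bytes` are made up for illustration, and the CESU-8 encoder is hand-rolled here (Python has no built-in codec for it): it encodes each UTF-16 code unit separately, which is what turns one 4-byte character into two 3-byte surrogate sequences.

```python
def utf8_bytes(text):
    """Standard UTF-8: a code point above U+FFFF becomes one 4-byte sequence."""
    return text.encode("utf-8")


def cesu8_bytes(text):
    """CESU-8 style (hypothetical helper): encode every UTF-16 code unit on its
    own, so a non-BMP character becomes two 3-byte surrogate sequences."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the normally forbidden surrogate bytes
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"  # a code point outside the Basic Multilingual Plane
print(utf8_bytes(emoji).hex())   # f09f9880 (4 bytes)
print(cesu8_bytes(emoji).hex())  # eda0bdedb880 (6 bytes)
```

A strict UTF-8 decoder rejects the second form, which is why mixing the two causes trouble.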
Morphing Software
UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
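A quick way to see the embedded-zero problem without writing any C: the sketch below uses a made-up helper, `c_strlen`, that mimics what C's `strlen` does (count bytes up to the first zero byte) and runs it over the same name in three encodings.

```python
def c_strlen(buf):
    """Mimic C strlen: number of bytes before the first zero byte."""
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx


name = "Zoë"
as_utf8 = name.encode("utf-8")       # 5a 6f c3 ab
as_utf16 = name.encode("utf-16-le")  # 5a 00 6f 00 eb 00
as_utf32 = name.encode("utf-32-le")  # 5a 00 00 00 ...

print(c_strlen(as_utf8), len(as_utf8))    # 4 4: the whole string is seen
print(c_strlen(as_utf16), len(as_utf16))  # 1 6: stops after the first byte
print(c_strlen(as_utf32), len(as_utf32))  # 1 12
```

Note that the UTF-8 length is in bytes, not characters; byte-oriented code keeps working, but anything that assumes one byte per character still needs a look.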
UTF-32 should make memory allocation more predictable as every character is guaranteed to be 32 bits.
In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
Every codepoint, not character. Big difference. No normalization form guarantees one character per codepoint. Well, except Perl's NFG, but that requires dynamic mapping.
Use what your programming language supports. If it supports Unicode, use UTF8 as it saves space. UTF32 isn't "one character per 32 bits" so it's no easier than UTF8.
*** On the Internet, no one knows you're using a VIC-20
I'm not sure why they need to do anything, I can successfully name my pet in Nethack 3.4.3 using UTF-8 characters.
The UTF-32 form of a character is a direct representation of its codepoint.
I don't know if I'm supposed to marvel at the submitter's sarcastic nerve or laugh at the irony.
I think I'm gonna do both.
There are combined characters that are not represented by a single codepoint: http://en.wikipedia.org/wiki/U...
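The codepoint-versus-character distinction is easy to check with Python's standard `unicodedata` module. This is only an illustration; the sample strings are my own choice, picked on the assumption that "q" with a combining tilde has no precomposed form in Unicode.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
no_precomposed = "q\u0303"  # "q" + COMBINING TILDE

# Both render as one character, but the code point counts differ.
print(len(composed), len(decomposed))  # 1 2

# NFC normalization merges the pair where a precomposed form exists...
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# ...but it cannot merge a pair that has no precomposed code point.
print(len(unicodedata.normalize("NFC", no_precomposed)))  # 2

# In UTF-32 each code point is one 4-byte unit, so the visible character
# "q with tilde" still takes two units.
print(len(no_precomposed.encode("utf-32-le")) // 4)  # 2
```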
All of my fucking YES.
It wasn't just a rumor.
Plan for the future, man.. UTF-64
Unicode is a clusterfuck. 7 bits is good enough for anyone.
Hmm...
> First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4.
The memory usage of UTF-8 is also at most char count multiplied by 4. The 5- and 6-byte sequences were declared invalid when Unicode was restricted to have no character above U+10FFFF.
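The 4-byte upper bound can be sketched as follows. `utf8_len` is a hypothetical helper that applies the standard UTF-8 range boundaries; the loop cross-checks it against Python's own encoder at the edges of each range.

```python
def utf8_len(code_point):
    """Bytes needed to encode one code point in UTF-8 (capped at U+10FFFF)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond U+10FFFF: not a valid Unicode code point")


boundaries = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    # The real encoder agrees with the table, and UTF-32 is always 4 bytes.
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    assert len(chr(cp).encode("utf-32-le")) == 4

worst_case = max(utf8_len(cp) for cp in boundaries)
print(worst_case)  # 4
```

So a buffer of 4 bytes per code point is sufficient for either encoding; UTF-8 just usually needs less.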
UnNetHack: NetHack Improved!
The answer is always UTF-8. It doesn't matter what project, or country, or language. Anything other than UTF-8 will cause completely avoidable problems. I wish more programmers would learn this rule, as it would make all our jobs easier.
short-term thinker!
UTF-128 FTW!
Sleep your way to a whiter smile...date a dentist!
I am a bit concerned about the statement on "libuncursed", which does not seem to be in many distros. To me it seems the change is being made to cater to non UN*X systems and hoping to move away from curses. So given the way I read the articles, I would prefer UTF-8 and try to use 'standard' libs. This way data is easily moved between different system types and the change will still support older under-powered hardware.
A UTF-8 string would require a pointer to it, on a 64-bit system that's 8 bytes, plus the overhead of dynamic allocation (typically 8 bytes). But if you only needed a single character, then a UTF-32 could accomplish that in 4 bytes. Effectively making UTF-32 one quarter the size of a typical UTF-8 implementation, when operating with the constraint that there is a single character per data structure/item/tile/object/whatever.
“Common sense is not so common.” — Voltaire
Pack of sadists at Microsoft?
Wrong question. What text-handling (string-handling) library do you want to use? Then use whatever that library supports. If still in doubt, go for UTF-8. Then you will be less tempted to think you can handle things yourself in C code. If you do think so, you will invariably write buggy code because you don't know enough about the issues.
There is no substitute for common sense. Especially, no body of rules will do.
I have worked with some people who would consider this :)
Actually a while back I found someone was passing around instructions on how to set up some software that needed a random key for a symmetric cipher. It used a 256 bit block cipher so it needed a 256 bit key.
The instructions being passed around were clearly cut and pasted from a web site (they might have even had the url) but they remembered that we had key policies for other things and so they changed the dd command to make a 1024 bit key....because we use at least 1024 bit keys by policy right?
A little bit of knowledge can be such an amusing thing.
"I opened my eyes, and everything went dark again"
It's still the same strings as far as the end user is concerned. A UTF-16 encoded string looks the same to the end user as a UTF-8 encoded string. But given that the codebase is legacy the only sensible choice is UTF-8.
Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.
What does "character" mean?
Something represented by one unicode codepoint? (making your statement a tautology)
Grapheme cluster? (what most users would consider a character)
A position in the character grid of a console?
Which brings us to the real question: to what extent do you want to support unicode? Do you care about
* Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
* Right to left languages? (Hebrew, Arabic etc)
* Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
* Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (Chinese, Japanese, Korean)
* Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)
Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.
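To illustrate why a code point count says little about what lands on a character grid, here is a rough Python sketch. `display_columns` is a made-up helper and a deliberate simplification: it treats East Asian Wide/Fullwidth characters as two columns and combining marks as zero, and ignores everything else a real terminal library has to handle.

```python
import unicodedata


def display_columns(text):
    """Very rough estimate of terminal columns used by text (simplified)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # a combining mark stacks on the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


for sample in ("abc", "日本", "e\u0301"):
    # code points, UTF-8 code units, estimated columns
    print(len(sample), len(sample.encode("utf-8")), display_columns(sample))
# 3 3 3
# 2 6 4
# 2 3 1
```

None of the three numbers can be derived from the others, which is the point being made above.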
UTF-32 seems enticing initially but turns out to be fairly pointless, by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc and it will massively increase the size of the vast majority of text.
UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" much longer which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later after basing much more code on them). UTF-8 is smaller for text with lots of Latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by Windows and Java.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
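The size tradeoff is easy to measure. The sketch below (helper name `encoded_sizes` and sample strings are my own) reports the byte length of the same text in each encoding, without a byte order mark.

```python
def encoded_sizes(text):
    """Byte length of text in each encoding (no byte order mark)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes("NetHack"))
# {'utf-8': 7, 'utf-16-le': 14, 'utf-32-le': 28}
print(encoded_sizes("日本語"))
# {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```

For a game whose messages are mostly ASCII, the first line is the more representative case.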
Definitely a Canadian
http://en.wikipedia.org/wiki/A...
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
No, it's not about translations (Source: I'm a NetHack fork developer who's somewhat involved in this DevTeam revival thing).
Translations of games like NetHack are inherently hard. You can't use the standard approaches as the program assembles the sentences out of several parts and usually, e.g. with gettext, you translate whole sentences. But here, we have dynamic sentences where this approach can't work.
For my German translation NetHack-De, I used latin-1 internally (so I could continue to use the char* strings) and for the output transformed the text into ASCII, latin-1, or utf-8 (depending on configuration).
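The general shape of that approach (one single-byte encoding internally, conversion only at the output boundary) might look like the sketch below. This is not NetHack-De's actual code: the function name `render` is hypothetical, and replacing unrepresentable characters with "?" for ASCII output is an assumption of mine; the real implementation may transliterate instead.

```python
def render(internal, target):
    """Convert latin-1 internal bytes to the configured output encoding."""
    text = internal.decode("latin-1")  # every byte maps to exactly one character
    if target == "ascii":
        # Assumption: substitute "?" for anything ASCII cannot represent.
        return text.encode("ascii", errors="replace")
    return text.encode(target)


internal = "Zauberstäbe".encode("latin-1")  # one byte per character
print(render(internal, "utf-8"))    # b'Zauberst\xc3\xa4be'
print(render(internal, "latin-1"))  # b'Zauberst\xe4be'
print(render(internal, "ascii"))    # b'Zauberst?be'
```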
UnNetHack: NetHack Improved!
> Grapheme cluster? (what most users would consider a character)
Is there any easy way to tell where one grapheme cluster ends, and another begins? With UTF-8, it's easy to count the bits to see where one codepoint begins and ends, I hope there is something equally simple for grapheme clusters. Or perhaps it's all complicated and is different for each language?
Also, if I do accidentally split a grapheme cluster in two (while respecting codepoint boundaries), what will happen? If I attempt to display the two strings, can I expect a sensible result, or will the result be garbage?
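For reference, the "count the bits" trick for UTF-8 mentioned above looks like this in Python. The helper names are made up; the rule itself is standard UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a new code point.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def code_point_starts(buf):
    """Byte offsets at which a new code point begins."""
    return [i for i, b in enumerate(buf) if not is_continuation(b)]


data = "aé日😀".encode("utf-8")  # sequences of 1, 2, 3 and 4 bytes
print(code_point_starts(data))  # [0, 1, 3, 6]
```

This finds code point boundaries only; it says nothing about where a grapheme cluster ends.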
128 bits should be enough for anyone.
- For the complete works of Shakespeare: cat
No need for UTF-64. You could double the huge space of UTF-32 by just introducing UTF-33. Unlike UTF-8 or UTF-16 there would be no unpredictable memory allocation problems. Every character would get a nice clean 33 bits. Be sure to bitshift and pack characters so that no memory is wasted. That should make the world a wonderful place and everyone will be happy.
I'll see your senator, and I'll raise you two judges.
The X Window version of Nethack has a nice default icon set for character classes, monsters, floors, walls... Why not propose this icon set as an addition to the Unicode categories and then use those characters ?
bytes! there are some crazy alien languages out there!
Wanna see a little ol poop clod all running around.
Actually, there are German and Japanese variants. (The German one is a translation of UnNetHack, done I think by the same guy who did the English version of that variant. The Japanese one, somewhat older, is called NetHack Brass and seems to be mainly a flavor variant, i.e., it changes much more than just language.)
Cut that out, or I will ship you to Norilsk in a box.
> Is there any easy way to tell where one grapheme cluster ends, and another begins? With UTF-8, it's easy to count the bits to see where one codepoint begins and ends, I hope there is something equally simple for grapheme clusters. Or perhaps it's all complicated and is different for each language?
As I understand it, it comes down to table lookups. The details of full unicode support are unfortunately not trivial and there's a reason libraries like ICU are as big as they are.
> Also, if I do accidentally split a grapheme cluster in two (while respecting codepoint boundaries), what will happen? If I attempt to display the two strings, can I expect a sensible result, or will the result be garbage?
As I understand it normally the base character is first and then things added to it follow.
So if you cut the end off a string and cut in the middle of a cluster then the last character may be missing some bits but the string is likely to be otherwise OK.
If you cut the start off a string and cut in the middle of a cluster things get messier. You then have combining characters at the start of the string with nothing to combine with. If you just ask a display library to display it then it's going to be down to the display library what happens but I expect the combiners will either be not displayed at all or displayed with no base. If you add the cut string to the end of another string then the combiners will combine with whatever was at the end of the string you combined it with.
All in all you will probably end up with something "ugly but usable".
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
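The two cases described above (cutting the end off versus cutting the start off) can be reproduced with Python's `unicodedata` module. The `safe_cut` helper is hypothetical and handles only combining marks with a non-zero combining class; it is not a full grapheme cluster segmenter, which needs the table lookups mentioned earlier.

```python
import unicodedata


def starts_with_combining(text):
    """True if the first code point is a combining mark with nothing before it."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def safe_cut(text, index):
    """Split near index, moving forward so combining marks stay with their base.
    Simplified: does not implement the full grapheme cluster rules."""
    while index < len(text) and unicodedata.combining(text[index]):
        index += 1
    return text[:index], text[index:]


word = "e\u0301x"  # "e" + COMBINING ACUTE ACCENT + "x"

head, tail = word[:1], word[1:]     # naive cut in the middle of the cluster
print(head == "e")                  # True: the base survives, minus its accent
print(starts_with_combining(tail))  # True: an orphaned accent leads the tail

print(safe_cut(word, 1) == ("e\u0301", "x"))  # True: the accent stays attached
```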