NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
If short of memory use UTF-8. If not use UTF-32.
utf-8
who cares? This only affects naming your character and displaying stuff on the map.
What use are those characters anyway? You don't need funny accents on letters to play Nethack. 7 bits should be enough for any character set! Hardcore hackers who want a workaround can just use LaTeX codes.
Does this mean they're going to release an update with more features? Or are they just trying to modernize the codebase?
I probably wouldn't even notice any major changes since I haven't really played in ages and didn't even scratch the surface of that massive game.
Either way I am excite.
Considering the length of their release cycle, seems to be a safe choice.
It's not like the difference between 1, 2, or 4 bytes per character would make much of a performance difference for an application like NetHack.
Using UTF-32 internally would save them from some of the silliness that alternatives like UTF-8 bring with them.
All hope abandon ye who enter here.
I started playing nethack before it was nethack, when it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same (or allow it as a compile-time option at the very least) and just use UTF for the character name.
As for which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
Little Brother, watching the watchers
ASCII is bloated shit solving a non-existent problem.
First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.
That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require you to implement more than zero characters.
Calling unicode a standard - well it's true, of course. But it doesn't mean what people think it means.
The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
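To make the 4-byte point concrete, here is a rough Python sketch. The helper `cesu8_encode` is made up for this illustration (it is not a standard library function). It encodes a character outside the Basic Multilingual Plane as real UTF-8 (4 bytes), and as CESU-8, where the character is first split into a UTF-16 surrogate pair and each half is written as its own 3-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: CESU-8 encoding, for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair, then write each
            # surrogate as a separate 3-byte sequence.
            cp -= 0x10000
            for s in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
    return bytes(out)

dragon = "\U0001F409"            # U+1F409 DRAGON, outside the BMP
utf8 = dragon.encode("utf-8")    # 4 bytes: f0 9f 90 89
cesu8 = cesu8_encode(dragon)     # 6 bytes: two 3-byte surrogate halves
ascii_same = "Elbereth".encode("utf-8") == "Elbereth".encode("ascii")
```

The last line shows the ASCII compatibility mentioned above: pure ASCII text has the same bytes in UTF-8. A strict UTF-8 decoder rejects the 6-byte CESU-8 form, which is why mixing the two causes trouble.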
Morphing Software
UTF-8 is easily adopted by C-based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and UTF-32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
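A small sketch of the zero-byte issue, in Python rather than C so it stays short. `c_strlen` is a made-up stand-in that mimics what C's `strlen()` does on a byte buffer, and the sample name is just an example.

```python
def c_strlen(buf: bytes) -> int:
    """Hypothetical stand-in for C strlen(): bytes before the first zero."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

name = "Zo\u00eb the Valkyrie"        # 16 characters, one of them non-ASCII

utf8 = name.encode("utf-8")           # 17 bytes, no zero bytes anywhere
utf16 = name.encode("utf-16-le")      # 'Z' is stored as 5a 00
utf32 = name.encode("utf-32-le")      # 'Z' is stored as 5a 00 00 00

utf8_len = c_strlen(utf8)             # whole string is seen
utf16_len = c_strlen(utf16)           # stops after the first byte
utf32_len = c_strlen(utf32)           # stops after the first byte
```

Byte-oriented code sees the full UTF-8 string but gives up after one byte of the UTF-16 or UTF-32 data, which is the audit problem described above.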
Come on guys, isn't it time you discovered sunlight and real women?
I have a mental image of some guy aged about 68 living in the basement of his 94-year-old mother's house with a green screen VT102, coding away trying to work out how to implement unicode! ;-)
In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
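For the "equivalent characters" issue, a minimal sketch using Python's standard `unicodedata` module (the variable names are just for illustration). The same visible letter can be stored as one code point or as a base letter plus a combining mark, and a naive comparison treats the two as different until both are normalized.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# A plain comparison looks at code points, so these differ.
raw_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them match.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

Here `raw_equal` is False and `nfc_equal` is True. Collation order is a separate, locale-dependent problem and is not covered by this sketch.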
http://utf8everywhere.org/
Dig around; the least common denominator I think is UTF-8. And lots of code handling "char" is then compatible with UTF-8 without any code changes.
Hah..."Unicode is bloated". We have found the neckbeard. ;)
then think some more about what to do.
UTF implementation is what made a mess of Python 3. After many years, very few companies and few important libraries have been updated to 3.
Yes, "everyone is doing it", but that does not necessarily make it a good idea. I do think that considering whether one would want to implement support for unicode at all would come first, and that the answer isn't quite as obvious as one would like.
Didn't the VT102 have a white on black display?
Use what your programming language supports. If it supports Unicode, use UTF8 as it saves space. UTF32 isn't "one character per 32 bits" so it's no easier than UTF8.
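A quick sketch of why a 32-bit unit is not the same thing as a character on screen. `rough_visible_length` is a made-up helper that simply skips combining marks; it is only an approximation and not a full grapheme-cluster algorithm.

```python
import unicodedata

def rough_visible_length(text: str) -> int:
    """Hypothetical helper: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

word = "nai\u0308ve"                  # "naive" with a combining diaeresis on the i
utf32 = word.encode("utf-32-le")

code_points = len(utf32) // 4         # 6 fixed-width units
visible = rough_visible_length(word)  # 5 letters as a reader sees them
```

So even in UTF-32, "take unit number N" does not reliably mean "take the Nth character the player sees".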
*** On the Internet, no one knows you're using a VIC-20
I'm not sure why they need to do anything, I can successfully name my pet in Nethack 3.4.3 using UTF-8 characters.
The ç®ç hits! You die... --More--
Do you want your possessions identified?
I don't know if I'm supposed to marvel at the submitter's sarcastic nerve or laugh at the irony.
I think I'm gonna do both.
All of my fucking YES.
It wasn't just a rumor.
Unicode is a clusterfuck. 7 bits is good enough for anyone.
First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4.
The memory usage of UTF-8 is also at most char count multiplied by 4. The 5- and 6-byte sequences were declared invalid when Unicode was restricted to have no character above U+10FFFF.
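The sizes are easy to check with a short Python sketch (the sample code points are just picked to cover each length class):

```python
# One sample code point from each UTF-8 length class, up to the
# highest code point Unicode allows (U+10FFFF).
samples = [0x41, 0xE9, 0x20AC, 0x10FFFF]

utf8_sizes = [len(chr(cp).encode("utf-8")) for cp in samples]
utf32_sizes = [len(chr(cp).encode("utf-32-le")) for cp in samples]

# utf8_sizes  -> [1, 2, 3, 4]
# utf32_sizes -> [4, 4, 4, 4]
```

So a buffer of 4 bytes per code point is a safe upper bound for either encoding; UTF-8 just usually needs less.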
UnNetHack: NetHack Improved!
The answer is always UTF-8. It doesn't matter what project, or country, or language. Anything other than UTF-8 will cause completely avoidable problems. I wish more programmers would learn this rule, as it would make all our jobs easier.
Nethack needs a FSM type! It attacks with acidic spaghetti sauce!
I am a bit concerned about the statement on "libuncursed", which does not seem to be in many distros. To me it seems the change is being made to cater to non-UN*X systems in the hope of moving away from curses. So given the way I read the articles, I would prefer UTF-8 and try to use 'standard' libs. This way data is easily moved between different system types and the change will still support older, under-powered hardware.
A UTF-8 string would require a pointer to it, on a 64-bit system that's 8 bytes, plus the overhead of dynamic allocation (typically 8 bytes). But if you only needed a single character, then a UTF-32 could accomplish that in 4 bytes. Effectively making UTF-32 one quarter the size of a typical UTF-8 implementation, when operating with the constraint that there is a single character per data structure/item/tile/object/whatever.
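A rough way to check those per-tile sizes, using Python's `ctypes` to lay out C-style structs. The struct names are invented for this sketch and are not taken from the NetHack source. The third layout is an extra assumption of mine: since one UTF-8 encoded code point never needs more than 4 bytes, it can also be stored inline without a pointer.

```python
import ctypes

class TileUtf32(ctypes.Structure):
    # Hypothetical tile holding one code point as a 32-bit integer.
    _fields_ = [("glyph", ctypes.c_uint32)]

class TileUtf8Pointer(ctypes.Structure):
    # Hypothetical tile holding a pointer to a separately allocated string.
    _fields_ = [("glyph", ctypes.c_char_p)]

class TileUtf8Inline(ctypes.Structure):
    # Hypothetical tile holding one UTF-8 encoded code point inline.
    _fields_ = [("glyph", ctypes.c_char * 4)]

utf32_size = ctypes.sizeof(TileUtf32)          # 4 bytes
pointer_size = ctypes.sizeof(TileUtf8Pointer)  # pointer width, 8 on 64-bit
inline_size = ctypes.sizeof(TileUtf8Inline)    # 4 bytes
```

The pointer layout does not include the heap allocation it points to, so its real cost per tile is higher than `pointer_size` alone.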
“Common sense is not so common.” — Voltaire
Pack of sadists at Microsoft?
Wrong question. What text-handling (string-handling) library do you want to use? Then use whatever that library supports. If still in doubt, go for UTF-8. Then you will be less tempted to think you can handle things yourself in C code. If you do think so, you will invariably write buggy code because you don't know enough about the issues.
There is no substitute for common sense. Especially, no body of rules will do.
Everyone seems to be assuming this is all about wacky names for your character, or drawing monsters with more characters. I guess it could be used for that, but how about something much simpler?
Translation. Play the game in French, or Russian, or heck, Chinese? There are plenty of off the shelf solutions for translating per language, but they all require a unicode front end to display the text.
It's still the same strings as far as the end user is concerned. A UTF-16 encoded string looks the same to the end user as a UTF-8 encoded string. But given that the codebase is legacy the only sensible choice is UTF-8.
Also possibly an American
Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.
Definitely a Canadian
http://en.wikipedia.org/wiki/A...
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
The X Window version of Nethack has a nice default icon set for character classes, monsters, floors, walls... Why not propose this icon set as an addition to the Unicode categories and then use those characters ?
Unicode is essentially a 32 bit unsigned integer number for each character. EXTERNAL representations are UTF-8, -16, 32, ASCII, Windows-1252, etc... and UTF-8 is probably the most efficient byte-wise as it only uses as many bytes as it needs to represent each character.
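A short Python sketch of that split between the code point (a plain integer) and its external byte representations; the sample character is arbitrary:

```python
ch = "\u00e7"                 # LATIN SMALL LETTER C WITH CEDILLA
code_point = ord(ch)          # the abstract number: 0xE7 (231)

# The same code point in several external encodings.
encoded = {
    "utf-8": ch.encode("utf-8"),          # c3 a7
    "utf-16-le": ch.encode("utf-16-le"),  # e7 00
    "utf-32-le": ch.encode("utf-32-le"),  # e7 00 00 00
    "cp1252": ch.encode("cp1252"),        # e7
}

# Every encoding decodes back to the same code point.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())
```

Plain ASCII cannot represent this character at all, so `ch.encode("ascii")` raises `UnicodeEncodeError`.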
Wanna see a little ol poop clod all running around.