Mr. Pike, Tear Down This ASCII Wall!
theodp writes "To move forward with programming languages, argues Poul-Henning Kamp, we need to break free from the tyranny of ASCII. While Kamp admires programming language designers like the Father-of-Go Rob Pike, he simply can't forgive Pike for 'trying to cram an expressive syntax into the straitjacket of the 95 glyphs of ASCII when Unicode has been the new black for most of the past decade.' Kamp adds: 'For some reason computer people are so conservative that we still find it more uncompromisingly important for our source code to be compatible with a Teletype ASR-33 terminal and its 1963-vintage ASCII table than it is for us to be able to express our intentions clearly.' So, should the new Hello World look more like this?"
The thing with ASCII is that it's easy to write on standard keyboards, and does not require a specialized layout. Once someone can cram the necessary unicode symbols into a keyboard so that I don't have to remember arcane meta-codes or fiddle with pressing five different dead keys to get one symbol, I'm all for it.
I'm sorry, I only accept criticism in the form of sed expressions.
"Syntactic sugar causes cancer of the semicolon" - Alan Perlis.
Michael decided to use this huge amount of computer time to search the public domain books that were stored in our libraries, and to digitize these books. He also decided to store the electronic texts (eTexts) in the simplest way, using the plain text format called Plain Vanilla ASCII, so they can be read easily by any machine, operating system or software.
- Marie Lebert
Since its humble beginnings in 1971 Project Gutenberg has reproduced and distributed thousands of works to millions of people in - ultimately - billions of copies. They support ePub now and simple HTML, as well as robo-read audio files, but the one format that has been stable this whole time has been ASCII. It's also the format that is likely to survive the longest without change. Project Gutenberg texts can now be read on every e-reader, smartphone, tablet and PC.
If you want to use Rich Text format, or XML, or PostScript or something else then fine - please do. But don't go trying to deprecate ASCII.
Help stamp out iliturcy.
so we should start coding in Chinese?
Seems easier to spell words with a small set of symbols than to learn a new symbol for every item in a huge set of terms.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
I can express my intentions just fine with ASCII. They have cunningly invented a system for that. It's called language and it comes in very handy. The only thing I would consider missing is a pile of shit-character. I could use that one right now.
Yes, it's the next fad that just _everyone_ has to wear this season. Within 5 years, it will be something else, and given the ability of major vendors like Microsoft to get Unicode _wrong_, it's not stable for mission critical applications. If you want your code to remain parseable and cross-platform compatible and stable in both large and small tools, write it in flat, 7-bit ASCII. You also get a significant performance benefit from avoiding the testing and decoding and localization and most especially the _testing_ costs for multiple regions.
Look up "microsoft unicode error" on Google for hundreds if not thousands of examples. ASCII for code is like flat text for email. It assures that you're not simply publishing coding spam, and actually wrote what you meant.
Everyone who tried to do something useful in APL, put up your hand.
...the character set isn't the problem.
And I say this as an old APL coder.
(There aren't many new APL coders.)
-=Maggie Leber=-
So, what are his ideas?
EBCDIC?
How silly of us to be compiling to binary all this time!
We've been relegating ourselves to only two different options for decades!
I reckon that a memory cell and single bit of a processor opcode should have --at least-- 7000 different possibilities. Think of everything a computer could accomplish *then*!
Seriously, someone tell this guy you're allowed to use more than one character to represent a concept or action, and that these groups of characters represent things rather well.
Let's take our precious time on this planet to fix what's broken, not break what has clearly worked.
but fuck no.
I eagerly await comments saying how anglo-centric, racist, bigoted, culturally-imperialist the insistence of using ASCII is.
The nuanced indignation is salve for my frantic masturbation.
(If my post is the only one that mentions this, all the better)
The Chinese have trouble learning their own language because it has so many signs; that makes it unnecessarily complex.
26 letters let you write anything. You don't need more letters, really. Ask any novelist.
Also, programming languages are international, and not all keyboards have all keys. Even keys like { or } are not on all keyboards, so trying to use funny characters like ñ would make programming really hard for some people.
All in all, this is not a very smart idea, IMHO.
-Woof woof woof!
Programming languages usually have too much syntax and too much expressiveness, not too little. We don't need them to be even more cryptic and even more laden with hidden pitfalls for someone who is new, or imperfectly vigilant, or just makes a mistake.
If anything, programming needs to be less specific. Tell the system what you're trying to do and let the tools write the code and optimize it for your architecture.
We don't need longer character sets. We don't need more programming languages or more language features. We need more productive tools, software that adapts to multithreaded operation and GPU-like processors, tools that prevent mistakes and security bugs, and ways to express software behavior that are straightforward enough to actually be self-documenting or easily explained fully with short comments.
Focusing on improving programming languages is rearranging the deck chairs.
Because I don't want to have to own a 2000 key keyboard, or alternatively learn a shitload of special key combos to produce all sorts of symbols. The usefulness of ASCII, and just of the English/Germanic/Latin character set and Arabic numerals in general is that it is fairly small. You don't need many individual glyphs to represent what you are talking about. A normal 101 key keyboard is enough to type it out and have enough extra keys for controls that we need.
To see the real absurdity of it, apply the same logic to the numerals of the character set. Let's stop using Arabic numerals, let's use something more. Let's have special symbols to denote commonly used values (like 20, 25, 100, 1000). Let's have different number sets for different bases so that a 3 can be told what base it's in just by the way it looks! ...
Or maybe not. Maybe we should stick with the Arabic numerals. There's a reason they are so widely used: The Indians/Arabs got it right. It is simple, direct, and we can represent any number we need easily. Combining them with simple character indicators like H to indicate hex works just fine for base as well.
You might notice that even languages that don't use the English/ASCII character set tend to use keyboards that use it. Japanese and Chinese enter transliterated expressions that the computer then interprets as glyphs. It doesn't have to be that way; they could use different keyboards, some of them rather large depending on the character set being used, but they don't. It is easy and convenient to just use the smaller, widely used character set.
Now none of this means that you can't use Unicode in code, that strings can't be stored using it, that programs can't display it. Indeed most programs these days can handle it, just fine. However to start coding in it? To try and design languages to interpret it? To make things more complex for their own sake? Why?
I am just trying to figure out what he thinks would be gained here. Remember also that the programming languages and the compilers would need to be changed at a low level. Compilers do not tolerate ambiguity: if a command is going to change from a string of ASCII characters to a single Unicode one, that has to be changed in the compiler, made clear in the language specs, and so on.
ASCII art is cool!
Sun's Fortress language allowed you to use real, LaTeX-formatted math as source code. They reasoned, correctly I think, that for the mathematically literate, this would make the programs far clearer. Google for Fortress Programming Language Tutorial.
Fortress allows you to code in UTF-8. However it has a multi-char ASCII equivalent for every Unicode mathematical symbol that you can use, so there is a bijective map between the Unicode and ASCII versions of the source, and you can view/edit in either. That is the only acceptable way to advocate using Unicode anywhere in programming source other than string constants. Programming languages that use ASCII have done well over those that don't, for the same reason that Unicode has done well over binary formats.
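A rough sketch of what such a two-way mapping looks like, working on an already-tokenized source. The ASCII spellings below are made up for illustration and are not Fortress's actual table; `to_unicode` and `to_ascii` are hypothetical helper names.

```python
# Hypothetical ASCII spellings for a few symbols (NOT Fortress's real table).
ASCII_TO_UNICODE = {
    "<=": "\u2264",     # less-than-or-equal sign
    ">=": "\u2265",     # greater-than-or-equal sign
    "=/=": "\u2260",    # not-equal sign
    "OMEGA": "\u03a9",  # Greek capital letter omega
}

# Build the inverse table and check that no two ASCII spellings share a symbol,
# which is what makes the mapping reversible.
UNICODE_TO_ASCII = {u: a for a, u in ASCII_TO_UNICODE.items()}
assert len(UNICODE_TO_ASCII) == len(ASCII_TO_UNICODE)


def to_unicode(tokens):
    """Replace each ASCII spelling with its Unicode symbol; leave other tokens alone."""
    return [ASCII_TO_UNICODE.get(t, t) for t in tokens]


def to_ascii(tokens):
    """Replace each Unicode symbol with its ASCII spelling; leave other tokens alone."""
    return [UNICODE_TO_ASCII.get(t, t) for t in tokens]


source = ["if", "x", "<=", "OMEGA", "then", "y"]
pretty = to_unicode(source)
print(" ".join(pretty))
print(" ".join(to_ascii(pretty)))
```

The round trip only holds if a file sticks to one form at a time, so a real tool would have to decide what to do with sources that mix both spellings.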
Bullshit from the article:
OmegaZero is at least something everybody will recognize. And why would you name a variable like that anyway? It's programming, not math, use descriptive names.
Because we're not using the same IDE?
... WHAT? If you don't express your intentions clearly in a program it won't work!
vim does Unicode just fine. And from the Wikipedia entry on the author (http://en.wikipedia.org/wiki/Poul-Henning_Kamp):
Irony? Why does this guy come off as an idiot who got annoyed by VB in this article when he clearly should know better?
Sure, but Perl is often derided as a "write only language", and Perl 6 is simply continuing the tradition.
From Idiocracy: Keyboard for hospital admissions
Wired: Wingdings of Disease
Unicode has the entire gamut of Greek letters, mathematical and technical symbols, brackets, brockets, sprockets, and weird and wonderful glyphs such as "Dentistry symbol light down and horizontal with wave" (0x23c7). Why do we still have to name variables OmegaZero when our computers now know how to render 0x03a9+0x2080 properly?
Well, let's think. Possibly because nobody knows what 0x03a9+0x2080 does without looking it up, and nobody seeing the character it produces would know how to type said character again without looking it up? I know consulting a wall-sized "how to type X" chart is the first thing I want to do every 3 lines of code.
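For anyone who does want to look them up, here is a small sketch that builds the two-code-point sequence from the article and asks Python's standard `unicodedata` module what each code point is called. The variable names are just for illustration.

```python
import unicodedata

# The sequence quoted from the article: 0x03a9 followed by 0x2080.
omega_zero = chr(0x03A9) + chr(0x2080)

# Look up the official character name of each code point.
names = [unicodedata.name(ch) for ch in omega_zero]

# Also print the dentistry symbol the article mentions (0x23c7).
for ch in omega_zero + chr(0x23C7):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```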
While we are at it, have you noticed that screens are getting wider and wider these days, and that today's text processing programs have absolutely no problem with multiple columns, insert displays, and hanging enclosures being placed in that space? But programs are still decisively vertical, to the point of being horizontally challenged. Why can't we pull minor scopes and subroutines out in that right-hand space and thus make them supportive to the understanding of the main body of code?
If you actually look at word processing programs, the document is also highly vertical. The horizontal stuff is stuff like notes, comments, revisions, and so on. Putting source code comments on the side might be a useful idea, but putting the code over there won't be unless the goal is to make it harder to read. (That said, widescreen monitors suck for programming.)
And need I remind anybody that you cannot buy a monochrome screen anymore? Syntax-coloring editors are the default. Why not make color part of the syntax? Why not tell the compiler about protected code regions by putting them on a framed light gray background? Or provide hints about likely and unlikely code paths with a green or red background tint?
So anybody who has some color-blindness (which is not a small number) can't understand your program? Or maybe we should make a red + do something different than a blue +? That's great once you do it six times, then it's just a mess. (Now if you want to have the code editor put protected regions on a framed light gray background, sure. But there's nothing wrong with sticking "protected" in front of it to define what it is.) It seems like he's trying to solve a problem that doesn't really exist by doing something that's a whole lot worse.
-- "So they told me that using the download page to download something was not something they anticipated." - Bill Gates
From TFA, apparently he wants to be able to use (Omega) to name a variable, and ÷ (Division Sign) as an operator. My interpretation of his opinion is that a descriptive name for a variable is inferior to using Greek letters, and that using mathematical operators that take an extra five or so keystrokes is superior to the standard +-*/^ set that people have become accustomed to.
IMHO, if you use more than 26 single letter variables something is seriously wrong, and trying to make mathematical formulas pretty in code isn't practical without a whole lot of unneeded complexity. Sure, having an eight line formula with fractions within fractions and tiny exponent numbers might be (slightly) better than five layers of parentheses, but you aren't going to get that with just Unicode (AFAIK), and the pain of dealing with a slightly misplaced term confounding the Unicode to math converter isn't one I'd like to experience. Unicode or even LaTeX code for comments might be useful though.
Because that's what you find in JIS X 0213:2000. Even if you simplify it to just what is needed for basic literacy, you are talking 2000 characters. If you have that many characters your choices are either a lot of keys, a lot of modifier keys, or some kind of transliteration, which is what is done now. There is just no way around this. You cannot have a language that is composed of a ton of glyphs but yet also have some extremely simple, small, entry system.
You can have a simple system with few characters, like we do now, but you have to enter multiple ones to specify the glyph you want. You could have a direct entry system where one keypress is one glyph, but you'd need a massive amount of keys. You could have a system with a small number of keys and a ton of modifier keys, but then you have to remember what modifier, or modifier combination, gives what. There is no easy, small, direct system, there cannot be.
Also, is it any more tedious than any Latin/Germanic language that only uses a small character set? While you may enter more characters than final glyphs, do you enter more characters than you would to express the same idea in French or English?
the point has been entirely missed, and blame placed on ASCII [correlation is not causation]. when you look at the early languages - FORTH, LISP, APL, and later even Awk and Perl, you have to remember that these languages were living in an era of vastly less memory. FORTH interpreters fit into 1k with room to spare for goodness sake! these languages tried desperately to save as much space and resources as possible, at the expense of readability.
it's therefore easy to place blame onto ASCII itself.
then you have compiled languages like c, c++, and interpreted ones like Python. these languages happily support unicode - but you look at free software applications written in those languages and they're still by and large kept to under 80 chars in length per line - why is that? it's because the simplest tools are not those moronic IDEs; the simplest programming tools for editing are straightforward ASCII text editors: vi and (god help us) emacs. so by declaring that "Thou Shalt Use A Unicode Editor For This Language" you've just shot the chances of success of any such language stone dead: no self-respecting systems programmer is going to touch it.
not only that, but you also have the issue of international communication and collaboration. if the editor allows Kanji, Cyrillic, Chinese and Greek, contributors are quite likely to type comments in Kanji, Cyrillic, Chinese and Greek. the end-result is that every single damn programmer who wants to contribute must not only install Kanji, Cyrillic, Chinese and Greek unicode fonts, but also they must be able to read and understand Kanji, Cyrillic, Chinese and Greek. again: you've just destroyed the possibility of collaboration by terminating communication and understanding.
then, also, you have the issue of revision control, diffs and patches. by moving to unicode, git svn bazaar mercurial and cvs all have to be updated to understand how to treat unicode files - which they can't (they'll treat it as binary) - in order to identify lines that are added or removed, rather than store the entire file on each revision. bear in mind that you've just doubled (or quadrupled, for UCS-4) the amount of space required to store the revisions in the revision control systems' back-end database, and bear in mind that git repositories such as linux2.6 are 650mb if you're lucky (and webkit 1gb) you have enough of a problem with space for big repositories as it is!
but before that, you have to update the unix diff command and the unix patch command to do likewise. then, you also have to update git-format-patch and the git-am commands to be able to create and mail patches in unicode format (not straight SMTP ASCII). then you also have to stop using standard xterm and standard console for development, and move to a Unicode-capable terminal, but you also have to update the unix commands "more" and "less" to be able to display unicode diffs.
there are good reasons why ASCII - the lowest common denominator - is used in programming languages: the development tools revolve around ASCII, the editors revolve around ASCII, the internationally-recognised language of choice (english) fits into ASCII. and, as said right at the beginning, the only reason why stupid obtuse symbols instead of straightforward words were picked was to cram as much into as little memory as possible. well, to some extent, as you can see with the development tools nightmare described above, it's still necessary to save space, making UNICODE a pretty stupid choice.
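How much extra space Unicode actually costs depends on the encoding picked. A quick sketch that measures one ASCII-only line of source under four encodings (the sample line and the `sizes` dictionary are just illustrative choices):

```python
# One ASCII-only line of source code, used as the sample to measure.
line = 'printf("hello, world\\n");'

# Encoded length in bytes under each encoding. The "-le" variants are used
# so that no byte-order mark is added to the count.
sizes = {
    enc: len(line.encode(enc))
    for enc in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}

for enc, size in sizes.items():
    print(f"{enc:10s} {size:3d} bytes")
```

For text that stays within ASCII, the doubling and quadrupling show up with the 16-bit and 32-bit encodings, while UTF-8 comes out the same size as plain ASCII.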
lastly it's worth mentioning python's easy readability and its bang-per-buck ratio. by designing the language properly, you can still get vast amounts of work done in a very compact space. unlike, for example java, which doesn't even have multiple inheritance for god's sake, and the usual development paradigm is through an IDE not a text editor. more space is wasted through fundamental limitations in the language and the "de-facto" GUI development environment than through any "blame" attached to ASCII.
COBOL was originally designed so that managers and customers could read it. But in practice they rarely did because programming logic is typically too low-level and requires knowing the technical context to understand by a non-programmer and/or non-team member anyhow. Being "English-like" or grammatically proper didn't really help that goal in practice. This is why the idea was abandoned in later languages.
Perhaps it's comparable to legalese. Making it proper English doesn't necessarily improve readability by non-lawyers. It's still gibberish to most of us without a legal background.
It's not worth-while to slow down production programmers in a trade for the rare case where non-programmers will want to read code for an actual need (not just curiosity). Thus, it's an uneconomical requirement as long as there is such a trade-off.
Table-ized A.I.
Grep on ASCII is more than 100x faster for complex string expressions. There are a lot of good reasons not to use Unicode.
Some drink at the fountain of knowledge. Others just gargle.
Unicode has the entire gamut of Greek letters, mathematical and technical symbols, brackets, brockets, sprockets, and weird and wonderful glyphs such as "Dentistry symbol light down and horizontal with wave" (0x23c7). Why do we still have to name variables OmegaZero when our computers now know how to render 0x03a9+0x2080 properly?
The go spec is defined in terms of unicode, and specifically gives non-ascii characters as example identifiers. Go source code is defined to be UTF-8.
I've read that story before, and it's very neat. It's just too bad there's so little truth to it. Here's an example where it really falls apart: "As the railroads were built they were built using the same standard width of all the wagons since the tools had been standardized to that width." Anybody with casual knowledge of railway history should remember the crazy profusion of different -- widely varying -- gauge standards in the early days.
And, yes, me too: I wrote this in vi(1), which is why the article does not have all the fancy Unicode glyphs in the first place.
Excuse me - vim can handle utf-8 just fine. utf-8 file names and utf-8 content. on a vanilla slackware 13.1.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#apps [cam.ac.uk]
# Vim (the popular clone of the classic vi editor) supports UTF-8 with wide characters and up to two combining characters starting from version 6.0.
# Emacs has quite good basic UTF-8 support starting from version 21.3. Emacs 23 changed the internal encoding to UTF-8.
And svn can handle utf-8 as well - http://svnbook.red-bean.com/en/1.4/svn.advanced.l10n.html [red-bean.com].
The repository stores all paths, filenames, and log messages in Unicode, encoded as UTF-8.
All it requires is ... set your locale and lang. "export LANG=en_DK.utf8" in "/etc/profile.d/lang.sh" (Slackware 13.1) and add some better fonts maybe.
I apologize for repeating myself. I've written the same thing further down already in reply to another user's post. But I just read tfa and felt the need to reply to the author of tfa.
visual programming has stagnated because it produces crap. Exhibit A, Microsoft Windows. Exhibit B, all Microsoft Applications not acquired by Microsoft.
GUI code wizard 'tards, hated to have them on my coding teams....
I blame the cult of Unix/Linux to some degree. The whole OS and all its tools and standards are based on ASCII text
you ever heard of the nls_utf8 kernel module? ever seen the "LANG" environment variable? set it to "en_DK.utf8" for example and you're ready to go.
vim, svn, rm, mv, cp can handle utf8 just fine. this being on slackware 13.1.
I worked for a Canada-based company and one of the magazines in the break room was Forces Quebec. It was something about packaging technology and had the articles written in both English and French, as is standard in Canada.
The bilingual nature isn't what caught my eye, though. What caught my eye was the fact that the typeface for the French articles was just plain smaller in order to fit more text in a certain space. It looked to me like the same page real estate was dedicated to each language, but the typeface for the French text was set to a smaller point size with tight kerning and spacing.
No wonder French people talk so fast. They have to!
In fact, when I mentioned the same thing to one of my coworkers, a Mexico native, he wasn't surprised at all. He said the same is true for Spanish as well.
When he told me that, I remembered Cheech Marin's "Born in East L.A." where he sings about being deported to Mexico despite being a US citizen "Next thing I know I'm in a foreign land. People talkin so fast I could not understand."
In a world of the blind, the one-eyed man is king--and the two-eyed man is a heretic.
Visual programming isn't big for the same reason people talk rather than use drawings to communicate in day-to-day life. A decent, well-explained and understood language is faster, universal and more convenient. Drawings are used in situations where you can't communicate through a spoken or written language, as a replacement tool. They're very basic, since with a spoken or written language you can convey your intentions far more precisely and uniformly. The same goes for visual programming at this moment in time. I won't say there isn't a future for it, but as a replacement for the tried and tested programming environments it has a long way to go. Come up with a visual programming system for writing actually sophisticated code and you might have yourself a winner. The only party that comes to mind is LabVIEW from NI.
Using full Unicode for programming causes lots of problems; even string equality is a tricky proposition for Unicode, let alone precise parsing. Most people don't even know how to enter Unicode characters not found in their own language. And once you allow Unicode, people will do things like they did in APL.
The only place Unicode should be allowed--if at all--is in comments. Everything else should be in ASCII.
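The string-equality point is easy to reproduce. A minimal sketch using Python's standard `unicodedata` module, with "é" as an arbitrary example character that has two valid spellings:

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Both render as the same glyph, but a plain comparison sees different code points.
naive_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them compare equal.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(naive_equal, normalized_equal)
```

A language that allowed such characters in identifiers would need to say whether it normalizes them, and to which form.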
I wouldn't consider Mr. Pike an authority on programming language design. At Google, he's known for designing Sawzall (described here: http://static.googleusercontent.com/externIal_content/untrusted_dlcp/research.google.com/en/us/archive/sawzall-sciprog.pdf) - a language that's so feature poor, esoteric, and ass-backwards, that Google engineers curse at length every time they have to use it. And use it they have, since it's darn near impossible, for various reasons, to do certain things without it. Try as I may, I don't see anything in Go that would make it better than half a dozen existing alternatives. It's like reinventing the bicycle again, but this time with square wheels and without the saddle. Yes, you guessed it right, that's where that pipe goes on this particular bicycle.
... it should be good enough for anyone. Just sayin'...
garethw
Ok, so everyone agrees this is a stupid idea... but are there ANY pros? I just don't understand the premiss at all...
This has come up in the context of domain names, where a long, painful set of rules has been devised to try to prevent having two domain names which look similar but are different to DNS. If exact equality of text matters, it's helpful to have a limited character set for identifiers.
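A small sketch of the look-alike problem, using a made-up name in which the Latin "a" is swapped for the Cyrillic one. The helper `non_ascii_chars` is hypothetical, written only for this example.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # same look, but with CYRILLIC SMALL LETTER A


def non_ascii_chars(text):
    """Return (index, Unicode name) for every character outside ASCII."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


print(latin == mixed)
print(non_ascii_chars(mixed))

# Compatibility normalization does not help here: the Cyrillic letter is a
# distinct character, not an alternate spelling of the Latin one.
still_different = unicodedata.normalize("NFKC", mixed) != latin
```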
There's currently a debate underway on Wikipedia over whether user names with unusual characters should be allowed. This isn't a language question; the issue is willful obfuscation by users who choose names with hard-to-type characters.
As for having more operators, it's probably not worth it. It's been tried; both MIT and Stanford had, at one time, custom character sets, with most of the standard mathematical operators on the keys. This never caught on. In fact, operator overloading is usually a lose. Python ran into this. "+" was overloaded for concatenation. Then somebody decided that "*" should be overloaded, so that "a" + "a" was equivalent to 2*"a". The result is thus "aa". This leads to results like 2*"10" being "1010". The big mistake was defining a mixed-mode overload.
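The Python behaviour described above can be checked directly. A short sketch (the helper name `double` is just for illustration):

```python
def double(value):
    """Return 2 * value, whatever type value happens to be."""
    return 2 * value


# String repetition: 2 * "a" is the same as "a" + "a".
assert "a" + "a" == 2 * "a" == "aa"

text_result = double("10")         # repeats the string
number_result = double(int("10"))  # multiplies the number
print(repr(text_result), number_result)

# Mixed-mode "+" is not defined in Python 3: adding a str and an int raises TypeError.
try:
    "10" + 2
    mixed_add_allowed = True
except TypeError:
    mixed_add_allowed = False
```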
In C++, mixed-mode overloads are fully supported by the template system and a nightmare when reading code.
In Mathematica, the standard representation for math uses long names for functions, completely avoiding the macho terseness the math community has historically embraced.
I'm truly saddened to see so many people took this article summary so literally. If you read TFA, it's actually a very bright, intelligent, humorous example of programming insight. I found it a very delightful read and I wholeheartedly felt that the article presented its thoughts lightheartedly and without expectation of seriousness. To hear all the commenters here, it's as if the article ran puppies over with a steamroller.
Please guys - I'm all for silly commentary. But read the article if you're going to pretend to write something clever. It's thoroughly tongue-in-cheek.
One thing many people aren't aware of is that for several years now (since GCC3), GCC and G++ accept UTF-8 as their default input encoding, and internally store narrow and wide strings as UTF-8 and UTF-32, respectively. It's recoded to the output stream locale when you do any output. This means you can write your source code in Unicode (in strings and comments at least) and it all works perfectly. It has full support in the C and C++ standard libraries. I've been using it for years; it works perfectly. It would be nice to get support for UTF-8 symbols in the linker, so we can have UTF-8 variable names as well. The same applies to Perl, though perl6 even gives you the ability to have Unicode operators, and possibly variable names.
I do routinely use UTF-8 symbols in R (example: "deltaCt" can be replaced with the actual Delta symbol [Slashdot ate the Unicode--seriously poor!]). It makes the code more readable, and entry isn't the massive issue people make it out to be. AltGr/compose keys handle the common symbols, and you can look up the few odd ones that aren't in the compose tables.
Having the ability to use Unicode does not in any way detract from the ability to use ASCII. Since ASCII is a strict Unicode subset, the ability to use Unicode imposes zero overhead on those who wish to stick with ASCII, so the extent of the hate seen for wanting a bit of progress is a bit shocking. People pointed out how unreadable code could be made, but the reality is that when used sensibly and judiciously, it can make code more concise and readable.
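The subset claim can be checked at the byte level for UTF-8. A minimal sketch, with an arbitrary line of C as the ASCII sample and the Delta symbol as the non-ASCII one:

```python
ascii_source = "int main(void) { return 0; }\n"

# An ASCII-only file has the same bytes whether it is called ASCII or UTF-8.
raw = ascii_source.encode("ascii")
same_bytes = raw == ascii_source.encode("utf-8")

# Decoding those bytes as UTF-8 gives back the original text unchanged.
round_trip = raw.decode("utf-8") == ascii_source

# Non-ASCII characters are encoded as bytes with the high bit set, so they
# can never be mistaken for ASCII characters.
delta_bytes = "\u0394".encode("utf-8")  # GREEK CAPITAL LETTER DELTA
high_bit_only = all(b >= 0x80 for b in delta_bytes)

print(same_bytes, round_trip, delta_bytes, high_bit_only)
```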
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776 for information about some of the issues.
Having native Unicode support end-to-end by default is still a goal we want to achieve; the ASCII C locale is the last holdout. Getting a UTF-8 C locale is the last remaining step, though it'll take a few years to get there.
Regarding editing Unicode sources, both Emacs and vim have pretty decent Unicode support, and Linux distributions have had unicode support for a decade now, and really good support for at least six years. Broken tools are no longer an excuse for not using Unicode.
Regards,
Roger
One problem is that word-for-word translations don't work. Other languages have both cases and genders applied to words, and often a different sentence structure too. Should "LET A=10" become "A=10 LASSEN"? What about Russian where the gender is significant? Or Japanese, where the status between speaker and listener determines the word? And what about right-to-left languages? Or top-to-bottom ones? But the biggest problems are, of course, compatibility and maintainability. You can't hire consultants who don't speak the language. And what if you branch out from Iceland to Sweden? Will you hire Swedes who speak Icelandic, or port all your apps to Swedish and maintain two different versions and prohibit unported e-mail attachments? Ask yourself why Microsoft doesn't have localized Office Basic anymore.
I really, really don't think so. Different tools for different jobs - a language for writing reliable infrastructure should look very very different from a language for exploration of datasets, for example - the first one must place emphasis on reliability and performance, the second on flexibility. Eg adding members to data structures on the fly is a great idea in the second case, but not in the first.
Sure you can try to sweep that under 'different paradigms', and indeed you could mix two arbitrary languages in the same file using some delimited blocks for example, and call it 'one language with different paradigms', but why would you want to? The convoluted multi-paradigm monstrosity that is C++ is a terrible example to us all there, in my opinion.
I think instead the shape of the future will be more like all those different languages that compile on the JVM - jython, Scala, Lua, and whatnot. They compile into interoperable modules without extra hassle, so in each module you can use the right tool for the job at hand.