Will We Ever Get Rid Of ASCII?

Posted by Cliff on Wednesday May 10, 2000 @06:42PM from the making-way-for-unicode dept.

GeZ asks: "When will Unicode finally replace ASCII? When will 7-bit-encoded text finally disappear? When will 'extended' chars (like 'é' or 'ß', etc) be recognized as 'alphanumerics', letting us use all the characters we want for file names, function names, and DNS names? Most top-level modern apps and standards use Unicode, so it deserves to be integrated at the lowest level, now. I really think old ASCII is too limited and fragmented to be useful. Using metachars in an ASCII file (a la HTML entity) is a boring way to solve the problem. A perfect integration with OSes (and base libraries) will "magically" make nearly all apps Unicode compliant, no? Yes, text chars will be encoded in 16 bits instead of 7 or 8, which would double text file size, but is this really troublesome, given today's storage media?" Do any of you think that Unicode will completely replace ASCII or are there reasons why it's still in use as the primary way to represent text characters?

Ideo scripts need characters. by yerricde · 2000-05-11 20:55 · Score: 2
Who really needs 65536 different characters?
- Chinese and Japanese scripts use Han brand characters (Japanese calls them kanji). There are estimated to be about 50,000 characters in Chinese, but only a small fraction of those are commonly used, and an even smaller fraction (about 5,000 or so) in Japanese.
- Sinhala, Devanagari, Greek, Cyrillic, Tengwar, Arabic, Hebrew, and other scripts need character codes; all are present and accounted for (except for Tengwar) in Unicode 3.0. (Tengwar can be found in a de facto con-script standard somewhere on the Net that specifies characters in the Private Use areas of Unicode.)
again, a char is supposed to be the same size as a byte

Nowhere in the C standard does it say that bytes must be 8 bits. Some C/C++ compilers for DSP architectures set char = short = int = long = 32 bits and still comply with the standard. There's also the wchar_t data type.
--
Will I retire or break 10K?
Re:Worse by Alex+Belits · 2000-05-12 02:13 · Score: 2

Unicode is one of the "semi-proprietary" standards -- the documents aren't available for free (be it the ones from ISO or from Unicode), however there are no legal barriers to making an implementation -- just the size of the table makes the job of creating fonts unreasonably huge. OTOH, the tables necessary for determining what the characters are, are available for free.

The problem however is different -- people already use their own charsets, and those charsets were designed to reflect the structure of the language, or just to be most convenient for it, and sometimes differ considerably from the part of Unicode that is supposed to be used for the same language. If instead of trying to _convert_ everything to Unicode, people adopted a reasonable (iso 2022 isn't reasonable) way to label which charset and which language are where in their strings, implementations would be able to use all known charsets. Programs that aren't concerned with operations that depend on the charset could just ignore the whole thing and treat text as a sequence of bytes until charset-specific procedures are called to process/display/compare/convert/input/... the text, at which point the "real" size and mapping of characters emerge -- and those procedures can be language-dependent, replaceable and expandable if there is an easy mechanism for mapping charset/language names to sets of procedures. Unicode could be used as one of the possible charsets, and UTF-8 as one of the possible encodings in such a system, but it wouldn't be "the" thing that everyone is supposed to support and be aware of. At most, some programs would have to know what the label delimiters look like.

It can be a very easy solution for the real problem, however it requires an agreement on how charsets/languages should be labeled (their "real" names should be used to make the thing expandable, however how those things should be separated from "normal" text remains a question).
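
A rough sketch of what such labeling could look like, in Python. Nothing here is an existing standard: the LabeledText type, its field names and the sample string are all made up for illustration, and the question of how labels are delimited inside a byte stream is left open, exactly as in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledText:
    """Hypothetical container: raw bytes plus charset/language labels."""
    charset: str    # "real" charset name, e.g. "koi8_r"
    language: str   # language tag, e.g. "ru"
    data: bytes     # opaque payload; most code never looks inside

    def to_unicode(self) -> str:
        # Charset-specific step: only called by code that must
        # display, compare or convert the text.
        return self.data.decode(self.charset)


def store_and_forward(item: LabeledText) -> LabeledText:
    # Stand-in for a program that does not care about charsets:
    # it passes the bytes through untouched.
    return LabeledText(item.charset, item.language, bytes(item.data))


greeting = LabeledText("koi8_r", "ru", "мир".encode("koi8_r"))
passed = store_and_forward(greeting)
print(len(passed.data), passed.to_unicode())
```

Only to_unicode needs to know anything about koi8-r; store_and_forward treats the payload as a plain byte string.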

--
Contrary to the popular belief, there indeed is no God.
OK... by cr0sh · 2000-05-11 06:19 · Score: 2

Cool - now I am going to have to find my punch card and look at it (I have one I found in an IBM 740 (or 704?) training manual - I guess someone was using it as a bookmark). The image you gave, though, was clear enough to see what you meant by zone holes.

Your explanation helps a lot - not that I have any use for such info, but I was curious about it. Between your explanation and the byte conversion array the other guy gave, I should be able to figure it out further.

Now, what does this say about EBCDIC and ASCII, about which came first? It sounds like ASCII came first - but what is the real answer?

--
Reason is the Path to God - Anon
Thank you! by cr0sh · 2000-05-11 06:21 · Score: 2

I wish I could mod you back up - you were most certainly on topic (code is fine by me)...

--
Reason is the Path to God - Anon
Worse by Alex+Belits · 2000-05-11 06:58 · Score: 2

Well, mebbe not. I am waiting to hear more comments from non-English slashdotters on this subject, the comments so far reflect a definite world view -- the English world.

Non-English slashdotters who use iso8859-1 most likely see Unicode (or UTF-8) as a good thing because the first 256 characters of Unicode are the same as iso8859-1, and they don't give a damn about everything else, while non-English slashdotters who use other local encodings/charsets (like me, whose native language is Russian, with koi8-r as the charset used on unixlike systems) see Unicode as a monstrosity, forced on them by a bunch of dumbasses at the Unicode Consortium, ISO and software vendors that benefit from every incompatibility that can force people to upgrade.

If charset/language labeling were standardized, everyone would be able to use their own charset, and all software that is not directly involved in text editing/displaying would be able to continue working as it did before. However, by a STUPID decision made by "standards bodies", priority is given to sticking "should support Unicode/UTF-8" into every standard in place of "should pass the data as a stream of bytes regardless of the actual size of characters, their encoding and their possible meaning, except special characters involved in the protocol", which would actually accomplish something.

--
Contrary to the popular belief, there indeed is no God.
Before converting, we need an interface... by jezzball · 2000-05-12 13:36 · Score: 2

Sure, it's all fine and dandy. We're mostly programmers here.

But someone, please, tell me the easiest way to type ü (u-umlaut) in Windows? One of the things that I do on my Mac that shocks people is just type foreign characters without breaking flow (opt-u, u is u-umlaut, opt-u, e is e-umlaut, etc). I think one of the reasons no one wants to move from ASCII to anything else is because it's rather hard to type in anything else.

Just my .02

--
ls: .sig: File not found.
(A)bort, (R)etry, (I)gnore?
Never kill ASCII, please! by Rob+Kaper · 2000-05-10 13:58 · Score: 2

While I agree that texts, filenames, etc should by default support 8-bit characters, there are advantages to keeping ASCII around:
The limitations of ASCII make searching texts and code a lot easier. I _like_ restrictions for function and variable names.
Of course something like is_ascii might just be enough for such a backwards compatibility hack.
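
A minimal sketch of such a check, in Python. The name is_ascii is taken from the post; the implementation and the sample identifiers are assumptions for illustration only.

```python
def is_ascii(name: str) -> bool:
    # True when every character fits in the 7-bit ASCII range (0..127).
    return all(ord(ch) < 128 for ch in name)


# Hypothetical identifiers a tool might want to vet.
for candidate in ["parse_header", "größe", "café_count"]:
    verdict = "ok" if is_ascii(candidate) else "non-ASCII"
    print(candidate, verdict)
```

A project that wants to keep restricted names could run a check like this over its identifiers while still allowing arbitrary characters in text and file contents.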
I think unicode would be best, due to utf-8 by pimaniac · 2000-05-10 14:04 · Score: 3

Utf-8 is the name of the set of all characters formed by the lower 8 bits of unicode, which are all the ascii characters.
Since unicode is a variable length encoding, utf-8 can look exactly like ascii to an ascii machine.
The best part is that utf-8 requires no change. All ascii programs can read utf-8 and all utf-8 programs can read ascii. So therefore all unicode programs can read and write ascii. And all ascii programs can read and write a unicode subset.
To top it off, if a file does use the extended unicode stuff (>8 bits) then it will just look like line noise to an ascii machine, and a normal document in whatever language to a unicode machine.
The file size increase won't happen for ascii characters, but an additional 8 bits is needed for extended characters.
In conclusion, Unicode will completely replace ascii, and almost no one (in english speaking countries at least) will notice. :)

Example:
ascii A == 65, or 1000001.
unicode/utf-8 A == 65, or 1000001.
There won't be any problems here. :)
1. Re:I think unicode would be best, due to utf-8 by randombit · 2000-05-10 20:50 · Score: 2
  
  Utf-8 is the name of the set of all characters formed by the lower 8 bits of unicode, which are all the ascii characters. Since unicode is a variable length encoding, utf-8 can look exactly like ascii to an ascii machine.
  
  Not quite right. Unicode is fixed-size (16 bits); UTF-8 is a variable-length encoding of Unicode which, _if_ the text consists entirely of the 7-bit ASCII subset, will look exactly like ASCII. Other characters (in the larger range, around 0x6000 to 0xFFFF [I'm guessing]) will take up to 3 (maybe 4?) bytes to represent.
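
  The byte counts are easy to check. A small Python sketch follows (the sample characters are arbitrary picks): in UTF-8, code points up to U+007F take 1 byte, U+0080 to U+07FF take 2, U+0800 to U+FFFF take 3, and anything above U+FFFF takes 4.

```python
def utf8_len(ch: str) -> int:
    # Number of bytes this single character occupies in UTF-8.
    return len(ch.encode("utf-8"))


# Arbitrary sample characters, one per length class.
samples = ["A", "é", "ß", "中", "\U00010348"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")

# Pure 7-bit ASCII text is byte-for-byte identical in both encodings.
text = "plain old text"
same_bytes = text.encode("utf-8") == text.encode("ascii")
print(same_bytes)
```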
Magically compliant... by ghutchis · 2000-05-10 21:09 · Score: 2

A perfect integration with OSes (and base libraries) will "magically" make nearly all apps Unicode compliant, no?

No.

Remember, there's a large amount of plain ol' text lying around. Heck, all of the web (including Slashdot) is essentially just ASCII with SGML entities. Nobody will suggest converting all of this to straight Unicode.

This is why there's UTF-8, a variable-length version of Unicode that's essentially backwards-compatible.

But that's not the whole problem. You mention implementing Unicode/UTF-8 in libraries and OS'es to get "magical compliance." No such luck. If you take a lot of code out there (including some of my own), it makes assumptions that byte=char. So people use char * and perform pointer additions and so on to parse. This is fine when you have 8-bit text. But what happens when you go to 16-bit text or in the case of UTF-8 variable-length chars? Things break.
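
To make the breakage concrete, here is a small Python sketch (the function names and the sample string are invented for illustration). Cutting a UTF-8 byte string at an arbitrary byte offset, which is what byte=char code effectively does, can split a character in half; backing up over continuation bytes (those of the form 10xxxxxx) is one way to avoid it.

```python
def truncate_bytes_naively(data: bytes, n: int) -> bytes:
    # What byte=char code does: cut at a byte offset, no questions asked.
    return data[:n]


def truncate_utf8_safely(data: bytes, n: int) -> bytes:
    # Move the cut left while it would land on a continuation byte.
    while 0 < n < len(data) and (data[n] & 0xC0) == 0x80:
        n -= 1
    return data[:n]


text = "naïve café"
raw = text.encode("utf-8")
print(len(text), "characters,", len(raw), "bytes")

try:
    truncate_bytes_naively(raw, 3).decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False  # the cut landed in the middle of 'ï'

safe = truncate_utf8_safely(raw, 3).decode("utf-8")
print(naive_ok, repr(safe))
```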

However, getting good solid implementations of UTF-8 in core libraries and OS'es will help a lot. Right now there really isn't one standard API for treating UTF-8 text. The new glibc has a good implementation, but if you want to write portable code, this is a problem--you don't have glibc on all systems (e.g. *BSD, Solaris...).

But the day will soon come when programs that are not Unicode/UTF-8 compliant are in the minority.

-Geoff
One thing I have wondered (slightly offtopic)... by cr0sh · 2000-05-11 00:30 · Score: 2

Was the reasoning behind EBCDIC - from what I have seen, it is nearly totally different from ASCII (or maybe it is the other way around - which came first?). I have not ever been able to find an EBCDIC to ASCII conversion chart/table/code, nor have I ever seen more than a subset of an EBCDIC chart. On top of this, I have never been able to find a history or anything on how EBCDIC came about or why. Can anyone point me to a resource?

--
Reason is the Path to God - Anon
Re:One thing I have wondered (slightly offtopic).. by vrmlguy · 2000-05-11 05:00 · Score: 2

EBCDIC was a method of translating punched-cards into binary. Here is a picture of a punched card. (The image comes from here.) EBCDIC means "Extended BCD Interchange Code", and BCD means "Binary Coded Decimal". In BCD, the digit "0" is encoded with a low-order nybble of "0000" and "9" is "1001". On a punched card, 0-9 were encoded as single punches, and A-Z were encoded as 1-9 with additional "zone" punches. As a result, the EBCDIC encoding for the letters followed the encoding for digits, so when expressed in binary, there are gaps between "I" ("yyyy1001") and "J" ("xxxx0001"), and again between "R" ("xxxx1001") and "S" ("zzzz0010").
BTW, my pseudo-values for the high-order nybbles follow from the zone punches that were overpunched. The top row was the "Y" zone, then came the "X" zone, and then the "zero" zone.
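
The gaps can be seen directly with Python's built-in cp037 codec. This is a sketch under one assumption: cp037 (the US/Canada EBCDIC code page) stands in for "EBCDIC" in general, and the helper names are invented for illustration.

```python
def ebcdic(ch: str) -> int:
    # EBCDIC (code page 037) byte value of a single character.
    return ch.encode("cp037")[0]


def ebcdic_to_ascii(data: bytes) -> bytes:
    # Convert EBCDIC bytes to ASCII; raises if a character has no ASCII form.
    return data.decode("cp037").encode("ascii")


# Print the high (zone) and low (digit) nybbles around each gap.
for ch in "IJRS":
    code = ebcdic(ch)
    print(ch, f"0x{code:02X}", f"{code >> 4:04b} {code & 0xF:04b}")

print(ebcdic_to_ascii(bytes([0xC8, 0xC9])))
```

The low nybble runs 1 to 9 within each zone ("I" and "R" both end in 1001) and restarts at the next zone, which is where the holes in the alphabet come from.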

--
Nothing for 6-digit uids?