spitzak · Slashdot Mirror

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 08:33 · Score: 1

Maybe I should clear this up a bit more.

If your editor inserted the UTF-8 encoding of two bytes (0xc2,0xa3 I think) the result should be those same two bytes. However I/O routines when told to print the string should then decode the UTF-8 and produce the pound sign. If the compiler is producing something other than UTF-8 (such as current Python does if you put a 'u' before the quote) then the compiler does the conversion, not the I/O routine. My main argument is that I think this is a job for I/O, not the compiler, and I don't like Python changing the default.

It is a requirement that if you actually put the 8 characters "\xc2\xa3" into the compiler input then you get the same two bytes. The primary reason for this is compatibility with existing compilers where this is the only reliable way to quote UTF-8. However this is also necessary so that you can make string constants with invalid UTF-8 encodings in them. I think doing u"\xc2\xa3" should produce a UTF-16 string with two characters in it as well.

Giving the compiler the text "\xa3" must result in that byte being in the string, despite the fact that it is not a valid UTF-8 encoding. u"\xa3" should result in a UTF-16 string with a single pound sign in it.
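A minimal sketch of the literals discussed above, written for Python 3 (an assumption: there b"..." is the byte string and a plain quote is the Unicode string, whereas in 2.x the plain quote was bytes and u"..." was Unicode). The variable names are illustrative only.

```python
# Byte-string literals: the escapes name bytes.
two_bytes = b"\xc2\xa3"      # the UTF-8 encoding of the pound sign
one_byte = b"\xa3"           # a lone byte; not valid UTF-8 on its own

# Decoding is the step the post argues belongs in I/O, not the compiler.
pound = two_bytes.decode("utf-8")    # one code point, U+00A3

# Unicode-string literals: the same escapes name code points instead.
two_chars = "\xc2\xa3"       # U+00C2 followed by U+00A3
one_char = "\xa3"            # just U+00A3, the pound sign
```

The byte literal keeps exactly what was typed, including the invalid single byte, which is the property the post cares about.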

I think your question was if they inserted the pound sign as an actual 0xa3 byte in the source file. IMHO this should result in a single byte. Some people disagree, they say this should either produce an error (as it is not UTF-8) or it should be turned into the UTF-8 encoding of the pound or an error indicator. They may be right.

Re:raise UnicodeError on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 08:14 · Score: 1

Throwing exceptions on bad UTF-8 strings is great if they are strings you control. It is not useful for strings provided by the outside environment. I can assure you that users want that data copied even if it contains errors, and they only want to see an error message when the data is interpreted.

The best that could be done with exceptions is make some kind of union of the UTF-16 and the bytes (or perhaps convert the bytes by just padding each out to 16 bits), along with a flag indicating if the data converted right. Though it is possible you could save the overhead of repeatedly testing if the conversion works, I suspect most programs will have to leave the data as bytes.
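The "union plus a flag" idea could be sketched roughly as below, assuming Python 3. MaybeText and its members are hypothetical names invented for this illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaybeText:
    """Hypothetical container: the raw bytes plus a decoded view if one exists."""
    raw: bytes
    text: Optional[str]  # None when raw is not valid UTF-8

    @classmethod
    def from_bytes(cls, raw: bytes) -> "MaybeText":
        # Test the conversion once and remember the outcome.
        try:
            return cls(raw, raw.decode("utf-8"))
        except UnicodeDecodeError:
            return cls(raw, None)

    @property
    def converted_ok(self) -> bool:
        # The "flag indicating if the data converted right".
        return self.text is not None
```

The raw bytes are always kept, so copying never depends on whether the decode succeeded.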

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 06:42 · Score: 1

Why is it so important that "number of characters" (actually number of Unicode code points) is O(1), but "number of words", "number of sentences", "number of lines", "number of glyphs", and a zillion other possible questions are O(n)?

This is the basic question that everybody here refuses to answer. They just blindly state that "it is really important for it to be fast to figure out the 'number of characters'".

Please give an actual real example of source code where you *use* the "number of characters". You are either going to realize that you are not using the "number of characters" or you are going to make a fool of yourself, possibly by saying string[number_of_character(string)-1] or something. To avoid making a fool of yourself, please use at least two completely unrelated strings (where there is absolutely no relationship between the contents), where one of them is the one where you measured this "number of characters" and the other is the one you somehow apply this answer to, without in fact measuring a "number of characters" in this replacement string. Think very very very hard, to see if you can come up with an example where "number of bytes" (or "number of words" for UTF-16) would NOT work.

IMHO the problem is that programmers have for decades been using ASCII where "number of characters" is O(1) and thus they think it is "important". In fact what is important is "number of bytes", despite your glib comment right at the start that even you seem to think it is unimportant.
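To make the cost comparison concrete, here is a small sketch, assuming Python 3 and well-formed UTF-8 input. The function names are made up for the example. Counting code points is one O(n) scan of the bytes, the same order of work as counting words, while the byte count is available directly.

```python
def count_code_points(utf8: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)


def count_words(utf8: bytes) -> int:
    """Count whitespace-separated words; also a single pass over the bytes."""
    return len(utf8.split())


sample = "caf\u00e9 \u00a35".encode("utf-8")  # "cafe" with an accent, a space, pound sign, 5
byte_count = len(sample)                      # what you need to copy or allocate
```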

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 06:33 · Score: 1

If you actually have the byte 163 in the file, it almost certainly will be an invalid UTF-8 encoding (it would have to be directly preceded by an accented letter in ISO-8859-1 for it to look like legal UTF-8).

One of the big reasons why I want the strings to remain bytes is because of exactly this. Yes the compiler can convert, but, believe it or not, we really do read text produced by other programs, often with incorrect UTF-8 encoding. Only by leaving it as bytes can we properly analyze this. It is relatively ok if when we draw your string we get an error box where your pound sign is. It is NOT ok if when we read your string it is *converted* to an error box and the fact that you attempted to put a pound sign in is irretrievably lost!
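A short sketch of that loss, assuming Python 3 and its "replace" error handler as the stand-in for "converted to an error box". The variable names are illustrative.

```python
original = b"price: \xa35"   # a Latin-1 pound sign, which is invalid as UTF-8

# Converting on read: the bad byte becomes U+FFFD, the replacement character.
shown = original.decode("utf-8", errors="replace")
round_tripped = shown.encode("utf-8")

# Keeping the bytes: a later analysis step can still try another encoding.
recovered = original.decode("latin-1")
```

After the conversion, round_tripped contains the three-byte encoding of U+FFFD where 0xa3 used to be, so nothing downstream can tell what the writer intended.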

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 06:29 · Score: 1

Sorry but I was pissed at him for calling me stupid: "We're too stupid to fix the glaring encoding errors in our product..."

I should not be trading insults, you are right.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 06:27 · Score: 1

If "characters" are important, then the combining characters and invisible formatting ones in Unicode mean that UTF-32 and every other way of encoding Unicode is useless as well, they are *all* variable length. It is in fact far preferable to use UTF-8 as this forces programmers to understand variable length right away.

I would also like a really clear explanation as to why "characters" are important, but "words", "sentences", "paragraphs", "lines", and all kinds of other structures that most readers of text think are important are ok to be variable-sized. Maybe we should be making *all* of them fixed-size, since they are "important".
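As an illustration of the combining-character point, a sketch assuming Python 3 and its standard unicodedata module: the same visible letter can be one code point or two, so even a fixed-width encoding such as UTF-32 is variable length per displayed character.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Both render as the same letter, but they differ in length and compare unequal.
lengths = (len(composed), len(decomposed))
utf32_sizes = (len(composed.encode("utf-32-le")),
               len(decomposed.encode("utf-32-le")))

# Normalization is needed before the two forms compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```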

Currently use of UTF-16 is strongly biased against full Chinese and against any language that uses combining characters because it encourages a very Western interpretation of text as individual characters, despite the fact that a lot of the push for UTF-16 is due to a misguided attempt to be "fair" to foreign languages.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 06:20 · Score: 1

Are you sure it is doing this?

In Python 2.5.2 this works:

>>> u"abc"=="abc"
True

So it would appear some kind of conversion is done automatically.

In my opinion this means programs will port easily, but it is going to open a whole lot of nasty holes as non-equal byte strings can appear equal when converted to UTF-16.
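For comparison, a sketch of the same test in Python 3, where (as far as this example goes) a Unicode string and a byte string never compare equal without an explicit decode. The variable names are illustrative.

```python
text = "abc"
data = b"abc"

# No implicit conversion: str and bytes are simply different objects.
equal_without_decode = (text == data)

# The conversion has to be spelled out, along with the encoding.
equal_after_decode = (text == data.decode("utf-8"))
```

Running the interpreter with the -b option additionally warns about such mixed comparisons.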

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 15:09 · Score: 1

You are describing UTF-16. The characters outside the BMP take 2 words and thus len is 2.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 15:07 · Score: 1

The fact that len returns 2 for a non-BMP character indicates that UTF-16 *is* being used. len is returning the number of words that the string occupies. This is a useful number (it indicates how much memory is needed to copy the string). The number of "characters" is completely useless, it causes crashes if you think it has something to do with memory usage, and it is useless for analyzing text unless you believe all the letters in Unicode are like fixed-pitch Latin letters.

len(x) when x is a UTF-8 string should return the number of bytes as well, and in fact this is how Python works.
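A sketch of the three different "lengths" of one non-BMP character, assuming Python 3.3 or later (where len of a Unicode string counts code points rather than UTF-16 words, unlike the builds discussed in this thread).

```python
s = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

utf16_words = len(s.encode("utf-16-le")) // 2   # 16-bit units: a surrogate pair
utf8_bytes = len(s.encode("utf-8"))             # bytes needed to copy it as UTF-8
code_points = len(s)                            # code points (Python 3.3+)
```

The first two numbers are the ones that tell you how much memory a copy needs.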

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 15:00 · Score: 1

Well in a lot of ways that (not doing any automatic conversion) is the only correct solution if they really want plain quotes to be Unicode and not bytes/utf-8. It will be such a pain to fix existing code, though, that I would not have thought they would do that.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:51 · Score: 3, Insightful

I think the lesson is that there are ONLY byte sequences.

The fact that some code can interpret that byte sequence and draw something on the screen that the user thinks of as "text" is completely irrelevant and should not be a fundamental datatype of a programming language. This should be part of the code that draws the text. Imagine if every other type of data, such as image pixels, or sound samples, had a different IO routine and you could never read a file with the wrong routine because the conversion was lossy.

The real problem is that everybody's mind has been polluted by decades of ASCII where there was no difference between characters and bytes. All I can suggest is to try to think of text as words or sentences. Nobody would suggest that it would be good to make all words use the same amount of storage, or that it is important that you be unable to split a string except at word boundaries. But there has been so much use of ASCII that people think this is important for "characters".

I also believe there is a serious political-correctness problem. Otherwise logical programmers are consumed with guilt because Americans get the "better" short encodings, and therefore feel they have to punish themselves by making the conversion to I18N as painful as possible so that Americans have just as much trouble as anybody else. The fact that they have actually made I18N far harder for everybody and thus actually discouraged it is the ironic result of this guilt.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:42 · Score: 2, Interesting

Spoken like somebody that's never had to deal with encoding issues. Using UTF-8 internally is fine, but exposing it to the programmer is insane and error-prone. And if the programmer then proceeds to manipulate that raw byte buffer as a string, he's an idiot.

The compiler will turn "unicode" into the utf-8 encoding. The programmer does not see \xnn sequences of the utf-8 bytes. Try some modern compilers with utf-8 support some day before you say anything stupid again.

Any programmer that modifies UTF-16 as a raw array of words is an idiot. Besides surrogate pairs, there are combining characters and bidirectional indicators and lots of other trouble. In fact I prefer UTF-8 exactly because it discourages such misuse of strings, which are really made of words, sentences, etc.

If you try to convert bytes that aren't in UTF-8 using a UTF-8 codec, an error will be raised. This behavior is proper -- if you don't know what format your input is in, there's no way to perform text-based operations on it.

You have just introduced a massive DoS hole into your programs. Or do you really think you should run a "is this correct UTF-8" call before any attempt to convert? Sorry, it is not going to raise an error, it will instead convert to error UTF-16 characters.
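For reference, a sketch of the two behaviours being argued about, assuming Python 3's bytes.decode: the default "strict" handler raises, while "replace" substitutes the error character U+FFFD. Both helper names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """The "is this correct UTF-8" check: attempt a strict decode."""
    try:
        data.decode("utf-8")   # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


def decode_for_display(data: bytes) -> str:
    """Never raises; invalid sequences become U+FFFD error characters."""
    return data.decode("utf-8", errors="replace")
```

Which handler a given call path uses decides whether bad input is an exception or a silently altered string.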

Every developer I know uses Unicode strings already. The new behavior is just one less character to type in front of literals.

You know that Python will convert your bytes from UTF-8 to "Unicode" automatically when needed? No you didn't? Might want to study up on that...

Otherwise said as: "We're too stupid to fix the glaring encoding errors in our product..."

The encoding errors are not in our product. They are in the files we are attempting to read (metadata attached to images, mostly). Dumbass

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:33 · Score: 0, Troll

No, Python is using UTF-16 nowadays. At least be somewhat informed before trying to argue with me about this.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:32 · Score: 2, Insightful

People expect a string to be a sequence of characters. Please notice the first word in that sentence.

"People" are not computers. "people" LOOK at the display. People are not trying to copy the data literally from one place to another or do comparisons of strings or read files that might (horrors) not contain correct UTF-8 data. There is no reason to mangle the data until the very last moment before it is put on the display.

I can quite confirm that if you have more than one way to represent the same sequence (such as different ways of producing the same UTF-8 error) you will produce a MAJOR screw up, quite likely an exploitable security hole. It also is not nice if "copy" mangles data just because it had a sequence that could not be converted correctly to glyphs.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:28 · Score: 2, Interesting

You might not be aware of this, but computers are used for more than just transmitting text. I don't want my binary streams being rewritten to gibberish because some I/O routine was written to be too clever

Thank you for explaining exactly why I want UTF-8 to be used, while thinking you were arguing against it.

Data is NOT just text. Therefore we should not be mangling it because we think it is text. We have enough trouble with MSDOS inserting \r characters. This crap is a million times worse.

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:18 · Score: 1

Interesting. I was afraid they were making all these functions return strings. If they are returning bytes as well it would certainly make things a lot better. However I would expect them to have the same trouble I am having.

Let's assume read returns a string of bytes. What I am worried about is that the following example text will not work as expected:

if file.read()=="utf8 string" ...

I expect this will automatically convert the result of file.read() to UTF-16 and then do the comparison. This will not produce the correct test if in fact the UTF-8 is an invalid encoding. Even if it turns the result into a string with error characters, it will still match the other string if it had error characters in the same place resulting from a different wrong utf-8 string.

From your description it sounds like the following will do the correct thing, which is better than I thought from my reading:

if file.read()==b"utf8 string" ...

So at least this can be achieved. However I am worried that users will be tempted to type the incorrect code because it is easier.

It's possible that the == test will not work unless the compared string is a bytes string, but I would think that would break far too many Python programs. The other possibility is that failures to convert the utf8 to Unicode will throw an error, but then you have just introduced a million DoS flaws into everybody's programs.
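The collision worry can be sketched as follows, assuming Python 3 and a decode that uses the "replace" handler. The two byte strings are made-up examples.

```python
a = b"utf8 \xa3 string"   # one invalid byte
b = b"utf8 \xa4 string"   # a different invalid byte in the same place

# Compared as bytes, the two are correctly seen as different.
differ_as_bytes = (a != b)

# Compared after a lossy decode, both bad bytes have become U+FFFD.
match_after_decode = (a.decode("utf-8", errors="replace")
                      == b.decode("utf-8", errors="replace"))
```

Comparing against a b"..." constant avoids the problem because no decode happens at all.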

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 13:11 · Score: 2, Interesting

No, think a little harder.

Imagine a file system that names the files with strings of bytes.

It is absolutely vital that if I ask for a list of files and then try to open them, that this all work, no matter what byte sequence has managed to get in there as a filename.

It is also *nice* but nowhere near as vital that I be able to show these names to users and they read them as Unicode strings.
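A sketch of that split between "vital" and "nice", assuming Python 3, where os.listdir given a bytes path returns the names as bytes. The function names are hypothetical.

```python
import os


def read_every_file(directory: bytes) -> dict:
    """Vital part: open each file using the exact bytes the OS returned."""
    contents = {}
    for name in os.listdir(directory):   # bytes path in, bytes names out
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                contents[name] = f.read()
    return contents


def name_for_display(name: bytes) -> str:
    """Nice part: a readable label; never used to reopen the file."""
    return name.decode("utf-8", errors="replace")
```

Only the display helper interprets the name as text, so an undecodable name can still be listed and opened.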

String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 11:41 · Score: 3, Interesting

Reading the release, they have decided to really push 16-bit strings (they call this "Unicode" but it really is what is called UTF-16). I think this is a serious mistake.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.

The problem with UTF-16 is you cannot losslessly convert a string that *might* be UTF-8 to UTF-16 and then back again. This is because any illegal UTF-8 byte sequences will be lost or altered. This is a MAJOR problem for code that wants to process data that is likely to be text but must not be altered under any circumstances; in effect such programs are forced to be ASCII-only, even though UTF-8 is purposely designed so that such programs could display all the Unicode characters. Note that bad UTF-16 (i.e. with mismatched surrogate pairs) can be losslessly converted to UTF-8 and back.
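The last point can be sketched in Python 3 with the "surrogatepass" error handler (an assumption about tooling; the bytes it produces follow the UTF-8 bit layout but are not strictly valid UTF-8). The second half shows the "surrogateescape" handler, which later Python 3 releases offer for carrying undecodable bytes through a Unicode string.

```python
# Bad UTF-16 (a lone high surrogate) to UTF-8-style bytes and back.
lone = "\ud800"
as_bytes = lone.encode("utf-8", errors="surrogatepass")
back = as_bytes.decode("utf-8", errors="surrogatepass")

# Bad UTF-8 through a Unicode string and back, using surrogateescape.
raw = b"\xa3"
smuggled = raw.decode("utf-8", errors="surrogateescape")   # becomes U+DCA3
restored = smuggled.encode("utf-8", errors="surrogateescape")
```

Both round trips only work if every conversion along the way is told to use the matching handler.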

This has been a real pain so far in our use of Python, and I am quite alarmed to see that they are changing the meaning of plain quotes in 3.0 to "Unicode". This is really a serious step backwards, as we will be forced to tell anybody using our system to put 'b' before all their string constants and I suspect there will be a lot less automatic conversion of these strings to unicode when we want to display them. Note that Qt is also causing a lot of trouble here too.

Re:Should lead to possibly great advertisements on How Kernel Hackers Boosted the Speed of Desktop Linux · 2008-10-03 06:41 · Score: 1

I think 3.2G is the actual location that the video memory starts at. This is a limit of physical memory for a 32-bit address bus, as the video memory and other hardware reserves all the space from there up to 4G. As others have stated, modern processors avoid this for physical memory by having more than 32 address bits.

The virtual memory is limited to 4G by the chip, as the actual instructions can only produce 32-bit addresses. However the virtual memory hardware can map these addresses to larger ones and thus more than 4G can be used if multiple processes are running.

There is a further limit of virtual memory because the operating system reserves addresses so you can jump into the kernel and it can still see your data. On Windows through XP and on the first versions of Linux this was the top 2 Gigabytes. Modern Linux and I believe Vista have reduced this to only 1 gigabyte. There were also Linux versions that cut this to only a few K, but the speed hit was unacceptable and since the advent of 64 bit processors interest in making this any better has pretty much disappeared.
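The arithmetic behind those figures, as a small Python sketch. The 2 and 1 gigabyte kernel reservations are the ones named above; the variable names are illustrative.

```python
GIB = 2 ** 30

address_space = 2 ** 32                      # what 32-bit addresses can reach
user_with_2g_kernel = address_space - 2 * GIB
user_with_1g_kernel = address_space - 1 * GIB
```

So each process sees 2G of its own addresses under the older split and 3G under the newer one.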

Re:stop and go on Plug-in Hybrids May Not Go Mainstream, Toyota Says · 2008-10-03 06:17 · Score: 1

Did you know how gas stations get the gas out of the underground tanks?

Hint: it involves electricity.

You sir are one of the biggest idiots I have ever seen posting here. This is by far the STUPIDEST argument possible.

Re:Electric Gas Cans? on Plug-in Hybrids May Not Go Mainstream, Toyota Says · 2008-10-03 06:02 · Score: 1

Very funny, but in fact there is a separate normal 12v (or maybe 6) battery that runs the starter and lights and radio when the car is turned off. If that is dead you jump-start it like a normal car.

Re:This should be interesting... on How the LSB Keeps Linux One Big Happy Family · 2008-09-22 10:25 · Score: 1

WHOOSH!

Re:About the Candidates on Obama Significantly Revises Technology Positions · 2008-09-22 06:53 · Score: 1

Hey I'm no fan of McCain (at least not after he showed such idiocy in choosing his running mate) but I don't think it is very fair to say he "collaborated with the enemy". He signed a meaningless "confession" in order to avoid even more torture. Apparently the Vietnamese didn't even bother trying to release it as propaganda (at least I have never heard of them publishing his "confession", while they did do others) probably because it was so obvious that it was signed to avoid torture.

He also turned down an offer of early release because it was obvious the Vietnamese would use it for propaganda purposes. This is despite the fact that he knew quite well that turning it down would lead to worse conditions for him and was not likely to lead to somebody else being released in his place. I would say that is a very brave and active act of *not* collaborating with the enemy.

Re:I call bullshit on Obama Significantly Revises Technology Positions · 2008-09-22 06:33 · Score: 1

This story is STILL bullshit, despite your desperate attempt to remedy it. The old text says "priciple" as well.

Re:No I didn't Read TFA on Japanese Begin Working On Space Elevator · 2008-09-22 04:41 · Score: 1

The plane of the equator does not vary during the year (ignoring slow stuff like precession and wobble).

However it is at an angle to the ecliptic, so there is quite a range of angles to the ecliptic you can throw something off the end of the tether at, and it varies through the whole range every 24 hours.

User: spitzak

Comments · 5,741