I'll bet this is going to be patched in the git repository within a half hour.
But I'm not sure if posting Slashdot stories is the best way to get a bug fixed. But if it is the only one that works, might as well do it.
I still feel the original poster should have put *something* on that bug report in all the time since January 16th!
Goddamn that was painful, but I found the actual patch:
http://cgit.freedesktop.org/xo...
I would say it is rather shocking that this Peter Hutterer actually did about 90% of the work, then posted something that is not a clue as to how to see the answer.
And that the original poster (who I assume made this Slashdot story) did not post any followup for 3 months, probably leading Peter to forget all about fixing this.
Somebody has already narrowed the problem down to a specific patch:
Comment 7 Peter Hutterer 2014-01-16 05:43:43 UTC
bisected to this commit:
commit 11319a922575f1da1d3c5774728c0dee12bab069
Author: Peter Hutterer
Date: Thu Oct 11 16:03:33 2012 +1000
xkb: ProcesssPointerEvent must work on the VCP if it gets the VCP
It would help if that number was a link to the git log.
What?
In your scenario, originally you had $100, and the other person had a stock share that he could trade for $100. Therefore there was $200 in total value. After the stock dropped to $10 value, one person has $100 and the other has a stock worth $10, therefore the total is $110. $90 of value was lost! Inflation/deflation of dollars does not matter, the result is that you now have $90 less of value, whatever $90 is now worth.
Also your claim that "the balance of transactions is $110" is pretty bogus. If the two decided to trade the stock back and forth 5,000 times for $100 then by your calculation "the balance of transactions is $500,000". That number is obviously meaningless. The actual amount of money moved around is $100.
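The arithmetic can be checked with a quick sketch. The variable names are made up for illustration; the figures are the ones from the scenario, and the 5,000 trades are the hypothetical from the paragraph above.

```python
# Figures from the scenario: one person holds cash, the other holds one share.
cash = 100
stock_before = 100
stock_after = 10

total_before = cash + stock_before        # value in the system before the drop
total_after = cash + stock_after          # value in the system after the drop
value_lost = total_before - total_after   # what actually disappeared

# Adding up the face value of repeated trades grows with the number of trades
# and says nothing about how much value exists.
trades = 5000
price = 100
gross_volume = trades * price

print(total_before, total_after, value_lost, gross_volume)
```

The gross volume depends only on how many times the share changes hands, which is why it is a meaningless measure here.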
Oh no! What if your split() worked in Unicode code points, and split a combining pair? What would you do? Surely your computer will instantly self-destruct in a devastating explosion! What if your split() split an English word in two? What if your split() cut a UTF-16 surrogate pair in half (which EVERY single alternative to UTF-8 does!!!!!!) Yikes! Disaster! Um, well, maybe not...
Stop making up non-existent problems.
1. Splitting is done after pattern searching. It is TRIVIAL to make your pattern search (which is likely doing something like "find the next space") only match at the boundaries of complete UTF-8 sequences. In fact it will help get you to write stuff that matches more complex structures such as combining pairs.
2. If you are splitting at totally arbitrary points, it is because you are copying the data to a fixed-sized buffer. Virtually every use of this later pastes the contents of the buffers together (think of buffered file I/O) and thus it is harmless.
3. This splitting is 100% detectable because *both* ends will be invalid UTF-8.
4. For some reason nobody seems to worry about this for UTF-16. Hmmmm, I wonder why?
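A small sketch of points 1 through 3, using plain Python bytes. The helper name `is_valid_utf8` and the sample text are assumptions for illustration only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "na\u00efve caf\u00e9".encode("utf-8")

# Point 1: searching for an ASCII byte such as a space can never match inside
# a multi-byte sequence, because every byte of such a sequence is >= 0x80.
words = text.split(b" ")

# Points 2 and 3: cut in the middle of the two-byte encoding of U+00EF.
cut = text.index("\u00ef".encode("utf-8")) + 1
head, tail = text[:cut], text[cut:]

print(words)
print(is_valid_utf8(head), is_valid_utf8(tail))  # both pieces are detectably broken
print(head + tail == text)                       # pasting them back is harmless
```

The head ends with a lead byte that is missing its continuation, and the tail starts with a continuation byte that has no lead, so either piece on its own reveals the bad cut.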
Maybe you should design your own platform where strings will be represented internally as UTF-8. It would be an interesting exercise.
FLTK and Nuke, and the project I am doing at R&H all use UTF-8 with tolerance for encoding errors for all internal storage. It is really easy, far easier than dealing with two types of text.
About 90% of the work is to get around default converters in Python and Qt that screw up the UTF-8.
The real bummer would have been if the plane had hit your base's self-destruct mechanism.
Stupid software that thinks it has to convert to UTF-16 is about 95% of the problem.
UTF-16 cannot losslessly store invalid UTF-8. It also cannot losslessly store an odd subset of arrangements of Unicode code points (it can't store a high surrogate followed by a low surrogate as two separate code points, because this pattern is reserved to mean a non-BMP code point). It also forces a weird cutoff at 0x10FFFF which a lot of programmers get wrong (either using 0x1FFFF or 0x1FFFFF). UTF-16 is also variable-sized and has invalid sequences, thus it has NO advantages over UTF-8, so the entire scheme is a waste of time.
Unfortunately a bunch of people are so enamored with all the work they did to convert everything to 16-bit that they are refusing to admit they made a mistake. One way is to declare invalid UTF-8 as throwing errors and thus make it virtually impossible to manipulate text in UTF-8 form. Note that they don't throw exceptions on invalid UTF-16, care to explain that??? HMM????
UTF-8 can store all possible UTF-16 strings losslessly (including lone surrogates which are considered "invalid" in UTF-16), as well as storing invalid UTF-8. It can encode a continuous range of code points from 0-0x10FFFF, or 0x1FFFFF with a trivial change (it can do up to 0x7FFFFFFF if you use the original UTF-8 design).
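To make the range claim concrete, here is a minimal sketch of an encoder for the original UTF-8 layout (up to six bytes per code point). The function name `encode_original_utf8` is hypothetical, and this is an illustration of the bit layout, not code from any of the libraries mentioned. It deliberately does not reject surrogates, so a lone surrogate gets an ordinary three-byte encoding.

```python
def encode_original_utf8(cp: int) -> bytes:
    """Encode one code point using the original UTF-8 layout (1 to 6 bytes)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range for the original design")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, exclusive upper limit, lead-byte prefix)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, prefix in layouts:
        if cp < limit:
            tail = []
            # Peel off six bits at a time for the continuation bytes.
            for _ in range(length - 1):
                tail.append(0x80 | (cp & 0x3F))
                cp >>= 6
            return bytes([prefix | cp] + tail[::-1])
    raise AssertionError("unreachable")

print(encode_original_utf8(0xE9).hex())        # ordinary BMP character
print(encode_original_utf8(0xD800).hex())      # lone surrogate
print(encode_original_utf8(0x10FFFF).hex())    # current Unicode maximum
print(encode_original_utf8(0x7FFFFFFF).hex())  # maximum of the original design
```

For comparison, Python's own codec produces the same bytes for a lone surrogate when asked to with the `surrogatepass` error handler.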
PEP 393 does NOT solve the problem. The "ascii" is limited to only 7-bit characters and thus cannot store UTF-8 (valid or not).
There is a "utf-8" entry in the PEP 393 strings but it appears the current design requires it to be translated to UTF-16 and back to UTF-8 to store there, thus disallowing invalid strings. My proposal is that converting bytes to a string copies the data unchanged to this UTF-8 storage, and checking for encoding errors be deferred until there actually is a reason to look at Unicode code points, which is VERY VERY RARE, despite the impression of amateur programmers. I also propose some small changes to how the parser interprets "\xNN" and "\uNNNN" in string constants so that it is possible to swap between bytes and "unicode" strings without having to change the contents of the constant.
Aha! Somebody who really does not have a clue.
No, substr() does not require decoding, because offsets can be in code units.
No, replace() does not require decoding, because pattern matching does not require decoding, since UTF-8 is self-synchronizing.
No, split() does not require decoding, because offsets can be in code units.
No, join() does not require decoding (and in fact I cannot think of any reason you would think it does, at least the above have beginning-programmer mistakes/assumptions).
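A rough illustration with Python `bytes`, where every offset is a code unit (a byte) and nothing is ever decoded. The sample data, including the stray `0xFF` byte, is invented for the example.

```python
# UTF-8 text with non-ASCII keys plus one field containing an invalid byte.
raw = "gr\u00f6\u00dfe=10;\u540d\u524d=x".encode("utf-8") + b";bad=\xff"

# split() and join(): pure byte operations, lossless even with the bad byte.
fields = raw.split(b";")
rejoined = b";".join(fields)

# replace(): the pattern is itself UTF-8, and because UTF-8 is
# self-synchronizing it can only match at a real character boundary.
replaced = raw.replace("\u540d\u524d".encode("utf-8"), b"name")

# substr(): offsets found by searching are byte offsets.
eq = raw.index(b"=")
semi = raw.index(b";")
value = raw[eq + 1:semi]

print(fields)
print(replaced)
print(value)
```

None of these calls ever notices that `raw` as a whole is not valid UTF-8, which is the point.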
Well the first thing you need to do to clean up the invalid UTF-8, for instance in filenames, is to detect it.
If reading the filename causes it to immediately throw an exception and dispose of the filename, I think we have a problem. Right now you cannot do this in Python unless you declare it "bytes" and give up on actually looking at the Unicode in the vast majority of filenames that *are* correct.
It is also necessary to pass the incorrect filename to the rename() function, along with the correction. That is impossible with Python 3.0's library, and is probably the more serious problem.
Both of these problems are trivial to fix if it would just consider arbitrary byte sequences valid values for strings, and defer complaining about incorrect encoding until the string actually needs to be *decoded*, which actually is only really needed to display it, and sometimes for parsing in the rare cases that non-ASCII has syntactic value and is not just treated as letters.
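One way to sketch the detect-then-rename workflow is with the `surrogateescape` error handler that Python releases after 3.0 added (PEP 383). This is not the proposal above, only the closest thing in the standard codecs; the helper names and the sample filename are assumptions.

```python
def decode_name(raw: bytes) -> str:
    """Decode a filename, smuggling undecodable bytes through as U+DC80..U+DCFF."""
    return raw.decode("utf-8", "surrogateescape")

def encode_name(name: str) -> bytes:
    """Recover the exact original bytes of a name produced by decode_name()."""
    return name.encode("utf-8", "surrogateescape")

def has_bad_bytes(name: str) -> bool:
    """Detect names that contained invalid UTF-8."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)

def repaired(name: str) -> str:
    """One possible correction: swap each bad byte for U+FFFD."""
    return encode_name(name).decode("utf-8", "replace")

raw_name = b"report-\xff.txt"          # invented filename with one bad byte
name = decode_name(raw_name)

print(has_bad_bytes(name))             # detection works on the str
print(encode_name(name) == raw_name)   # exact bytes of the existing file survive
print(repaired(name))                  # candidate new name for a rename
```

Both halves of the rename are available: the untouched byte sequence that names the existing file, and a corrected name to give it.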
It sounds like he is complaining that affirmative action is *not* applying to Asians.
So I still stand by my statement that the claim that "95% of people think affirmative action is to help Asians" has got to be incorrect.
I really doubt a majority of people think affirmative action helps Asians. It helps underrepresented minorities, and in most jobs and schools Asians are not underrepresented. It seems incredibly unlikely that 95% of people (whether they approve or disapprove of affirmative action) think it helps Asians.
I suspect you actually made a typo of some sort but am curious what exactly you were trying to say there.
I'm arguing against a design that is the equivalent of saying "you can't run cp on this file because it contains invalid XML".
There is nothing wrong with the xml interpreter throwing an error AT THE MOMENT YOU TRY TO READ DATA FROM THE STRING.
There is a serious problem that just saying "this buffer is XML" causes an immediate crash if you put non-xml into it.
God damn you people are stupid.
I am trying to PREVENT denial of service bugs. If a program throws an unexpected exception on a byte sequence that it is doing nothing with except reading into a buffer, then it is a denial of service. If you really think that invalid UTF-8 can lead to an exploit you seem to completely misunderstand how things work. All decoders throw errors when they decode UTF-8, including for overlong sequences and all other such bugs. So any code looking at the Unicode code points will still get errors. And if you think there is some exploit that relies on the byte pattern that somehow only works for invalid UTF-8 then you have quite a fantastic imagination but no knowledge of reality.
The program should produce an error AT THE MOMENT IT TRIES TO EXTRACT A Unicode CODE POINT. Not before, and not after.
If the program reads the invalid string from one file and does not check it and writes it to another file, I expect, and REQUIRE, that the invalid byte sequence be written to the new file. It should not be considered any more of a problem than the fact that programs don't fix spelling mistakes when copying strings from one place to another.
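The behaviour described above can be sketched as a tiny wrapper type. `LazyText` is a hypothetical class written for this illustration, not an existing Python type: it stores bytes with a plain copy and only complains when somebody asks for code points.

```python
class LazyText:
    """Hypothetical string type: holds raw bytes, validates only on decode."""

    def __init__(self, raw: bytes):
        self._raw = bytes(raw)   # block copy, no validation

    def raw(self) -> bytes:
        """Return the stored bytes unchanged."""
        return self._raw

    def code_points(self) -> list:
        """Extract Unicode code points; this is the only place an error can occur."""
        return [ord(ch) for ch in self._raw.decode("utf-8")]

good = LazyText("caf\u00e9".encode("utf-8"))
bad = LazyText(b"caf\xe9")       # invented input: Latin-1 byte in a UTF-8 stream

# Copying from one place to another never fails and never alters the data.
copied = LazyText(bad.raw()).raw()
print(copied)

# The error appears at the moment of extraction, not before.
print(good.code_points())
try:
    bad.code_points()
except UnicodeDecodeError as exc:
    print("error at decode time:", exc.reason)
```

A program that only moves `bad` around never sees an exception; one that inspects its code points gets the error exactly where it matters.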
The text is 99.9999999% UTF-8.
What I want to do is gracefully handle tiny mistakes in the UTF-8 without having to rewrite every function and every library function it calls to take a "bytes" instead of a "string", and thus completely abandon useful Unicode handling!
Come on, it is blindingly obvious why this is needed, and I cannot figure out why people like you seem to think that physically possible arrangements of bytes will not appear in files. The fact that all serious software cannot use Unicode and has to resort to byte twiddling should be a clue, you know.
No, all that means is that EVERYTHING has to be changed to use the bytes type.
I mean every single library function that takes a unicode string, every use of ParseTuple that translates to a string, etc. Pretty much the entire Python library must be rewritten, or a wrapper added around every function that takes a string argument.
Everybody saying that "it's good to catch the error earlier" obviously has ZERO experience programming. Let's see, would it be a good idea if attempting to read a text file failed if there was a spelling error? Or perhaps it might be a good idea to defer this problem until it actually makes a difference?
This crazy belief that somehow some physically possible patterns of bytes will just magically not happen because you said they are "invalid" is inexplicable. No other system than UTF-8 seems to cause this weird brain damage, no other system is so totally unprepared for invalid storage and pretends that all storage will be valid. I cannot explain it except that it seems like exposure to ASCII where all byte sequences are always valid has rotted people's minds so that they dismiss the problem.
This exactly.
If your UTF-8 string is not completely valid, Python 3 barfs in useless and unpredictable ways. This is not a problem with Python 2.x.
Until they fix the string so that an arbitrary sequence of bytes can be put into it and pulled out *UNCHANGED* without it throwing an exception then it cannot be used for any serious work. Bonus points if this is actually efficient (ie it is done by storing the bytes with a block copy).
Furthermore it would help if "\xNN" produced raw byte values rather than the UTF-8 encoding of "\u00NN" which I can get by typing (gasp!) "\u00NN".
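For reference, this is how Python 3 treats the two kinds of literal today, which is the behaviour being complained about. The variable names are just for the example.

```python
text_literal = "\xe9"     # in a str, \xNN names the code point U+00NN
bytes_literal = b"\xe9"   # in bytes, \xNN is the raw byte value

print(text_literal == "\u00e9")          # the two spellings are the same string
print(text_literal.encode("utf-8"))      # two bytes: the UTF-8 encoding of U+00E9
print(bytes_literal)                     # one byte
print(text_literal.encode("latin-1") == bytes_literal)
```

So the same escape sequence means different byte content depending on the literal's prefix, and moving a constant between `bytes` and `str` changes what it contains.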
Besides China, I think that if Russia is capable of reducing the US to Radioactive Ash, then the United States is capable of doing it to itself, if for some reason it decided to do so. So he is obviously wrong with his claim that only Russia can do it.
No, it is obvious if you follow development mailing lists that the announcement of Mir was a big kick in the pants for the Wayland developers and they started actually working on the real thing. So I think Mir did a good thing.
Yea I would agree that it seems more fair if the company instead made a 50/50 split, so the employee is now paying $100 and the company another $100. The main reason this seems fair is that I'll bet that if the cost went *up* they would not eat all the extra but would have split the higher cost so the employee paid more.
Real answer: I have had or experienced medical care in England, Spain, and the US. Despite horror stories I saw no difference and the English medical care at an Emergency room was far faster and got directly to the solution rather than using referrals. They tried to get me to stay overnight and I kind of got out of that but I now feel (having later had to spend a significant stay in a very new American hospital and realizing the English one was just as clean and new-looking) that perhaps I had been scared by propaganda. Spain was a completely free clinic even though the patient (not me) was a visiting tourist and was also really fast and friendly. But that was not a major medical emergency.
In England there certainly are complaints about the Dental system. The NHS is not paying enough and dentists can get out of serving NHS patients so there are either huge lines or you pay a lot. I did not experience it so I can't say first-hand, but this is the one area where I believe the US system is superior. There were some other posts here pointing out that how Dental works here with users actually able to and having a motive to do price comparisons may be an explanation. I also know first-hand (being across from the USC Dental School) that the poor are served by these for free, though I am unsure if this is enough to make up for the lack of an NHS-style government program to serve them.
I am unsure how that could be applied to major medical however: if your deductible is $3000 then you don't care if the hospital is going to charge $10000 or $50000, that's a good deal different from comparing a $50 or $100 cleaning. Maybe it could apply to doctor's visits but then people just don't go at all if it is not free, while they will get their teeth cleaned because it is an obvious service, not just somebody looking at you.
By far the worst place I ever saw was when I was a kid and went with my father to an emergency room in Vegas. We went to the public hospital and it was a Kafka scene, pretty horrible. After hours we finally saw somebody, who realized my father had insurance and said we were at the wrong hospital, and sent us to the really nice and clean and completely empty private one where he was treated within 30 minutes of arrival (it was a fractured ankle). This was before Reagan signed the law that said all emergency rooms must treat all incoming patients. I think it is interesting that this has not turned all emergency rooms into this scene, instead the ones I have been to since seem to be as nice as that empty private one was.
Transient-for hint works across process boundaries.
He just said the employees were paying half of $550 (ie $275), and now they are paying $130-$250. So yes, the employees are saving money (assuming he is telling the truth).
COBRA has a limited period it works for.
And previously you could not get that private policy if you had a preexisting condition.
Not everything is wine and roses with Obamacare, but ignoring what was happening before is not helping any arguments about it.