Also, stuff like QString are actually better than C++ stdlib equivalents - providing reference counting, copy-on-write and unicode support.
std::string was certainly designed to support reference counting and copy on write. It is true that the G++ and Windows versions do not do this, but I believe they ran timing tests and determined that it was slower. Reference counting on modern machines dirties the memory that the source string was in, which can cause considerable slowness on modern multiprocessors as it requires a sync between all the processor caches.
More importantly std::string supports Unicode just fine, due to UTF-8. If you don't believe me then you have not tried programming with true UTF-8 without any APIs that require conversion.
In fact QString and all the other UTF-16 things are a huge hindrance to Unicode, because of the simple fact that you cannot easily put a UTF-8 string (such as is stored in every file and Internet API used everywhere) into one, because it will be lossy (which can lead to security bugs) or throw an exception (which leads to DoS bugs) if there are invalid encodings in there. Storage as bytes allows the invalid encodings to be preserved and deferred until the last moment, when it is much easier to manage errors.
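The three outcomes (exception, lossy conversion, or bytes kept as they are) can be seen in a few lines of Python 3. This is only a minimal sketch: the sample bytes and the variable names are made up for illustration.

```python
# Bytes as they might arrive from a file or socket: valid UTF-8 ("caf\u00e9")
# followed by one stray byte that is not valid UTF-8.
raw = b"caf\xc3\xa9 \xff"

# Strict conversion throws on the bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "Forgiving" conversion is lossy: the bad byte becomes U+FFFD and the
# original value cannot be recovered from the result.
lossy = raw.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")

# Keeping the bytes defers the problem: copying, comparing and searching
# all work without any decoding at all.
copy = bytes(raw)
found = raw.find("caf\u00e9".encode("utf-8"))
```

Only the code that finally has to display or interpret the text needs to decide what to do about the bad byte.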
UTF-16 makes most programmers give up and pretty much resort to only supporting ISO-8859-1 (or worse, only ASCII) when reading bytes. Don't believe me? Just yesterday, a quite smart programmer where I work, in response to encountering an invalid UTF-8 string, "fixed" it by running every string through a filter that replaced every byte with the high bit set with "\xNN" (where NN is the byte in Hex). That was considered good enough, as English still printed. They do not give a damn about Unicode, and all those politically-correct idiots pushing UTF-16 need to learn that they are causing far more damage to Unicode than they are helping!
The main constraint is that the LGPL requires that the Qt library be distributed as a shared library (that is not really what the LGPL says, but it is the only practical way to do it). I think it is likely Nokia will add an exception to allow static linking very soon, as they already had a big list of exceptions to their GPL license that allowed that (though not for commercial products).
The other constraint which is very unlikely to change is that if you modify the library itself, you have to include the source code for the modifications. Thus you cannot make some secret change to Qt. However in reality there is really no reason to do that. It is extremely unlikely you have some tricky piece of code you want secret and you are unable to arrange things so that it is not part of Qt itself.
Since the LGPL requires those companies to publish their "patch set", they have absolutely no reason not to try to contribute it back to Nokia, since they can't keep it secret anyway. And as you said, it is a pain to maintain a patch set so they also have that incentive to get their changes into the main version.
Nonsense. "Compatible" means "they can be made to work together". Yes the result is GPL.
They would only be "incompatible" if it was impossible to combine the code. That is obviously false.
Personally I think the LGPL is not doing what it was intended to do, because when it was written they were thinking only about libc and not libraries that might not be included with the operating system.
Static linking should be allowed. The requirement that should be enforced is that if you modify the code in the LGPL library itself, you have to distribute the modifications. The rules are a bit more complicated so that you are not allowed to modify it to call a pointer that is set to point at a secret implementation, and other tricks.
The LGPL requirement for shared libraries is actually a big hindrance to complex libraries. It pretty much requires the binary api to be frozen. As anybody who has tried to write anything that complicated knows, that is quite impossible.
The other popular solution is to add a "linking exception" to the GPL/LGPL. Something called Classpath has the most popular wording. This pretty much makes the LGPL work like most people expect. One problem with this is that the "linking exception" completely hides all the differences between the GPL and LGPL (ie the result is exactly the same if you add the linking exception to either one). But without the name "LGPL" people don't think they can use the code in closed source. I think it would help a lot if GNU would standardize the wording of the "linking exception" and make it commonly known, so that people who see "GPL+linking exception" would know that it is even more useful than LGPL.
The main problem is that the LGPL forces you to use Qt as a shared library if your program is closed-source. You cannot statically link with it as this violates the LGPL (technically it requires that end users have some method of relinking with a new version of the LGPL code, but the only practical solution is a shared library, as object files make your distribution twice as big and makes reverse-engineering your program a lot easier).
This is a problem for commercial developers who want to distribute a program to machines that don't have Qt installed.
It also is a problem for developers that want to modify Qt as they have to figure out how to make it call their own version and not cause DLL hell.
Wrong. Qt's license did not allow modification of the Qt code, and did not allow commercial distribution of the result (whether or not the source was included). This is not a BSD/GPL type of comparison. Qt was in fact completely unusable by an open source application.
Qt did not put any license that allowed commercial redistribution until long after Gnome was forced to use GTK because of this.
The Linux kernel has flourished without any real competitors for a while now...
I seem to remember there is something called BSD. Maybe you should look it up.
Actually an interesting thing I have noticed is that in open source there tends to be exactly TWO "popular" versions of everything. Whether there are a hundred competing versions (such as chat programs and music players) or very few (toolkits), it seems that exactly two are used by 90% of the users or more. Not one, not three.
Just re-read your post and I think you're making a few mistakes.
Actually the data is 8x as dense in that example, whoever wrote the page is not doing their geometry right... Plus in the above comparison they ignored the white line which is another pixel of height.
For the scale of the drawing, the data is merely 4x density in the MS Tag.
I did the geometry right, but I was being stupid about the bits. 4 colors is 2 bits, not 4. So the data density (if the sides of the objects are the same size) is 4x. Also the caption never says 4x, it just says "4 symbols", so I did not read it very carefully.
The space used by the white line in this image is far smaller than the ones in the actual codes. The graphical image implies the vertical spacing is equal to the side of a triangle. But their actual examples are 6 triangles wide but only 5 rows high, so the vertical spacing is really 6/5 the size of a triangle.
Also the QR code example is 4x the size of any real ones I ever saw printed in a paper.
That's okay -- the bigger the better.
No, I mean that it has 4x as many squares in it as any actual QR I have seen (I have seen plenty printed in UK newspapers). It also has 6 small targets in it instead of 1 that I have seen in all examples.
I don't know why they did not take a normal QR and print it a lot larger. What I meant by "honesty" is that the only explanation I can think of is that there is a printing resolution limit (probably due to color printing alignment issues). They don't mention this limit but don't want to lie in their example, so to make the QR bigger they made it have a lot more data than necessary.
The biggest dishonesty is that they are comparing a very long URL with a Microsoft number being looked up on their servers.
I think you're mistaking the redirecting service for the data encoded in the tag...
I don't know what you are talking about, and you seem to be ignoring my TinyURL example. TinyURL is a "redirecting service" that can be used by QR codes. The Microsoft one says "123456" and you store the actual link on Microsoft's servers. A TinyURL QR would say "http://tinyurl.com/123456" and you store the actual link on TinyURL's servers. The TinyURL is 19 bytes larger, but you are not locked into using TinyURL!
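The 19-byte figure is just the length of the service prefix, which is easy to check (the payload strings are the illustrative ones from the paragraph above, not real tags):

```python
# Payload a Microsoft-style tag would carry versus the equivalent TinyURL.
microsoft_payload = "123456"
tinyurl_payload = "http://tinyurl.com/123456"

# Both are plain ASCII, so one character is one byte.
overhead = len(tinyurl_payload.encode("ascii")) - len(microsoft_payload.encode("ascii"))
```

Here `overhead` comes out as 19, the length of "http://tinyurl.com/".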
I do think it is somewhat dishonest to try to claim that their color is what makes it so tiny, when in fact it is that it is storing very little data.
Actually, they are claiming exactly what you said -- that they have to store very little data in their tag. That's why they're able to make it so small. That's also one of the reasons pattern recognition works so much better for them.
I'm sorry, please point to the text on the page that says the amount of data is smaller. I have just reread it carefully and it NEVER says that. All it talks about is "High Capacity Bar Codes"
The very paper you linked to shows that the conversion built into Python does UTF-16, not UCS-2:
>>> char = u"\N{MUSICAL SYMBOL G CLEF}"
>>> len(char)
2
If it did UCS-2 the assignment would have to produce some kind of error, or at least produce a 1-character incorrect string.
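The session above is from a Python 2 build that stores strings as UTF-16. On Python 3.3 and later `len()` of that string is 1, but the same surrogate pair can still be seen by encoding to UTF-16 explicitly. A small sketch (variable names are mine):

```python
clef = "\N{MUSICAL SYMBOL G CLEF}"  # U+1D11E, outside the 16-bit range

# Encode without a byte-order mark and count 16-bit code units.
utf16 = clef.encode("utf-16-le")
code_units = len(utf16) // 2

# The two units are a surrogate pair, which UCS-2 has no way to express.
high = int.from_bytes(utf16[0:2], "little")
low = int.from_bytes(utf16[2:4], "little")
```

`code_units` is 2, with `high` equal to 0xD834 and `low` equal to 0xDD1E.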
It seems to me you don't know what you are talking about.
And I really want you to explain how you plan to handle UTF-8 data that may have errors in it. Are you thinking that somehow the real world outside the program is some kind of magical perfect place that does not produce incorrect strings? Do you think people are going to catch errors? Or do you think, like I do, that a million Python programmers are going to give up on Unicode completely and treat all 8-bit data as ISO-8859-1, which is the easy solution?
However I think their comparisons are quite bogus.
The first image shows how using color (actually it probably uses brightness) to store 2 bits per triangle makes the data 4x as dense. Actually the data is 8x as dense in that example, whoever wrote the page is not doing their geometry right.
However their second example comparing the actual QR code to their code, for some reason (probably honesty) prints the QR code so small that each Microsoft triangle is approximately the size of a 3x3 rectangle of QR code. Plus in the above comparison they ignored the white line which is another pixel of height. So a 3x4 rectangle of QR code is the same size as 2 triangles (not 1 triangle as the rectangle contains 1/2 of two other triangles), thus the QR code has 12 bits in the same area as Microsoft has 8 bits. Also the QR code example is 4x the size of any real ones I ever saw printed in a paper.
The biggest dishonesty is that they are comparing a very long URL with a Microsoft number being looked up on their servers. If the QR code was a TinyURL it would itself be almost as small (it will have the overhead of the TinyURL website name). I do think it is somewhat dishonest to try to claim that their color is what makes it so tiny, when in fact it is that it is storing very little data. Also artificially inflating the size of the QR code is not very honest either.
I do suspect that they have made a better form of pattern recognition. The QR code does seem to be a rather amateur attempt and I was surprised when I first saw them that such an obvious pattern was used. I would prefer however if they had worked on storing arbitrary data such as a URL; relying on keeping your information on Microsoft's servers in order to use this does not really sound like something everybody will want.
I posted about this before in a previous Python 3.0 article and a lot of people attacked me. However I very much feel that Pythons treatment of Unicode as UTF-16 is a HUGE problem that will cause no end of pain. I think a far cleaner solution to Unicode is to do the following:
- Make unmarked plain quoted strings produce byte strings just like they do now. Unless there are backslashes, the contents are precisely the bytes that are in the input file. Keep the automatic casting of byte strings to unicode strings.
- Force the encoding to be UTF-8 by default, or at least make it trivial to turn this mode on (in Python2.x the default init deletes the api to do this!)
- The sequence \uXXXX in a byte string constant should turn into the correct UTF-8 sequence. And the sequence \xXX in a Unicode string should be interpreted as bytes and converted from UTF-8 to unicode. This is necessary so that a string constant can easily be changed between bytes and Unicode.
- We must have lossless conversion of UTF-8 to UTF-16. The most popular method I have seen is to turn invalid bytes into 0xDCxx (which is invalid UTF-16, as these are unpaired lower-half surrogates). Oddly enough this makes the UTF-16 API useless because the reverse conversion is not lossless. I have looked into this and it may be fixable but is complex: the to-UTF-8 converter must not translate a sequence of these to a legal UTF-8 sequence and instead convert that sequence to the typical 3-byte encoding of that number, and the from-UTF-8 converter must treat these typical 3-byte encodings as invalid byte sequences except when they are arranged such that the back converter would make them! This is messy but I see no other way to be able to use backends that insist on UTF-16 (in particular Windows filenames and its clipboard).
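For the bytes-to-Unicode direction, released Python 3 has an error handler called `surrogateescape` (PEP 383) that maps each undecodable byte to a lone U+DCxx surrogate. A minimal sketch of the round trip, with made-up sample bytes:

```python
# Two bytes in the middle are not valid UTF-8.
raw = b"ok \xff\xfe end"

# Each bad byte 0xNN becomes the unpaired surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the bytes.
back = text.encode("utf-8", errors="surrogateescape")
```

Here `back == raw`, so nothing is lost going bytes to text to bytes. This says nothing yet about the opposite direction, which is the hard part described in the last bullet.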
The reason for this is that real Python programs need to handle arbitrary data that is *PROBABLY* UTF-8. Note that by "PROBABLY" I mean that the programmer really really wants to think of it as a sequence of unicode characters, not as a "byte sequence", but it must NOT compare any two different byte sequences as being equal.
I'm very afraid that Python3.0 as designed will encourage byte sequences to be treated as ISO-8859-1 rather than UTF-8 (because when you set the translation to that it is lossless and no errors are thrown, and \xXX does the same thing in both constants). IMHO this would be very, very bad for internationalization efforts. Believing the programmers will not take this easy solution, and instead rewrite their interfaces to the new byte/unicode naming and correctly handle exceptions thrown by converters is, I think, quite ignorant.
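Why ISO-8859-1 is the tempting shortcut can be shown directly: every byte value decodes without error and encodes back unchanged, but real UTF-8 text silently turns into garbage. A sketch in Python 3 (variable names are mine):

```python
# All 256 byte values survive an ISO-8859-1 round trip, with no exceptions.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("iso-8859-1")
lossless = as_latin1.encode("iso-8859-1") == every_byte

# But a UTF-8 cent sign read that way becomes two unrelated characters.
cent_utf8 = "\u00a2".encode("utf-8")        # b"\xc2\xa2"
mojibake = cent_utf8.decode("iso-8859-1")   # U+00C2 followed by U+00A2
```

Nothing ever throws, which is exactly why it looks like a fix.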
I am not joking or trolling about this. This has bitten me already and forced us to change all our use of Python from Unicode to byte strings. And we are just reading metadata from image files. Searching for comments on Python 3.0 on the web, it is apparent that web programmers are encountering this far more often and are very worried about this, and they certainly are trying to handle many orders of magnitude more data from sources that may be actively trying to exploit security holes.
I certainly put all the Windows machines I use to the Classic look. It has nothing to do with speed or interaction, it is because all the new stuff starting with XP is incredibly ugly!
Those candy colors must have appealed to the idiots they use for user testing. But it is shameful they would switch to something that looks like the worst of the Enlightenment themes from the 90's. Personally I find those colors and the shininess appallingly tasteless and very distracting.
Really, what happened to the designers who did Win95?
I did in fact defeat some DRM back in the age you are talking about.
We had cable and wanted to watch it on the TV in the kitchen as well as the one in the living room. The cable company solution was to buy a second cable connection.
I bought a switch box from Radio Shack (it had three rotating knobs on the front to send any of several inputs to any of three outputs). I also got some coax and ran it across the basement ceiling to connect the living room to the kitchen.
Then I had to defeat the "DRM" that the cable company provided. This was in the form of a special connection with a long sleeve over the coax screw-on connector on the cable coming from outside. After much fiddling I managed to undo it with needle nosed pliers from the cable box.
I could then connect this cable to the Radio Shack switcher, and from there send the signal over the new cable to the kitchen TV. The non-encrypted channels were then available there (I think the reason I did not get the encrypted ones is that the output of the decoder box was a TV signal with channel 2/3 modulation; the larger TV in the living room was too old to have direct cable input).
Tabs are automatically aligned with each other to occupy the same screen space. That is a big difference from separate overlapping windows. Also Microsoft was using tabs in some programs (especially the IDEs) for quite a while so it is rather difficult to claim they wanted the taskbar to be used for this purpose.
The taskbar really replaced the "iconized" form of windows that was used by earlier Windows and copied from Unix desktops. The taskbar *does* have one of the major Microsoft innovations however, the fact that the "icon" did not disappear when the window was "deiconized". All popular previous systems had (at least by default) the idea that a window was switched between the "icon" and "window".
I do agree with your hate of MDI, but MDI was a development by Microsoft; all previous systems I ever saw kept each document and the control panels in different windows. They initially did it to make it much less likely that another application not being used would be swapped in (since the MDI enforced a single rectangular area that was controlled by the foreground program, and users could move the subwindows without exposing any of another app's window). However they kept at it because of their stupid decision to make click always raise a window (which pretty much means overlapping windows are useless). This stupid decision was copied by Linux, unfortunately. Try a really old Unix such as Irix to see how it should be working.
Your post seemed to imply "using Linux means you are using Emacs and VI", that is what I was complaining about. It is exactly as legitimate to say "using Windows means you are using Emacs and VI".
You do know that Emacs and VI work on Windows as well, right? Emacs in particular is quite popular. So your argument is just stupid. You might as well say "learning Word is no good because it won't teach you Emacs, which is also used on Windows".
If your input file is supposed to be UTF-8 text, and is not, then surely it's an error?
UTF-8 with errors is STILL UTF-8. It just is not "valid UTF-8" which is a mostly uninteresting subset. The set of UTF-8 strings is every single possible byte sequence. The set of "valid UTF-8" strings is a SUBSET that a tiny portion of software (mostly validators) should have to care about.
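In code terms, "valid UTF-8" is just a predicate over byte strings that a validator can apply when it actually matters. A minimal sketch in Python 3 (the function name `is_valid_utf8` is my own, not a library call):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is in the strictly valid subset of UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Any byte sequence can be stored, copied and compared; only some pass.
samples = [b"plain ascii", b"\xc2\xa2", b"\xc2", b"\xff"]
results = [is_valid_utf8(s) for s in samples]
```

Code that only stores, copies or compares the data never needs to call it.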
People are trying to make this far more difficult than it really is by somehow saying that we must restrict ourselves to that subset at a very low level. That is wrong and is the main reason why there is so much confusion about UTF-8. Nobody seems to care that UTF-16 can have illegal sequences (Python handles them without complaint) and nobody cared for 10 years that the Japanese encodings could have illegal sequences. But for some reason UTF-8 brings out this complaint over and over again. I suspect the problem is that people have invested too much effort in UTF-16 and don't want to admit they made a huge mistake, and the only way is to try to make UTF-8 hard.
But, of course, as soon as you want to start treating it as an actual string - so that you can say things such as "give me the 10th character" (and not "10th byte") - it has to be valid, otherwise all string-specific operations would simply be undefined.
Well of course. Therefore THAT function should throw the damn exception! Not every single string manipulation!!!!
Also you amazingly did the same bogus example of "move by 10 characters" I have seen before. Please look at real software and you will see that NOBODY EVER MOVES BY "10 CHARACTERS". 1 maybe. Otherwise the only use EVER of such code is because "10 characters" was previously calculated by another function looking at the EXACT SAME STRING and therefore a byte offset or UTF-16 word offset or whatever will work just as well.
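The point about offsets can be illustrated with a search: the position comes from looking at the same string, so a byte offset does the job and nothing ever counts "characters". A sketch in Python 3 with made-up sample text:

```python
# UTF-8 bytes for "na\u00efve caf\u00e9 menu"; two of the letters take 2 bytes each.
data = "na\u00efve caf\u00e9 menu".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

# The offset is computed from the string itself, in bytes...
start = data.find(needle)
end = start + len(needle)

# ...and used to slice the same string, again in bytes.
rest = data[end:]
```

`start` is 7 here even though the match begins at the 7th character (index 6), and the slice is still correct because both numbers are in the same unit.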
L"\xC2\xA2" is not a cent sign in either C or C++. It's a wide string with two characters.
It is byte values converted using ISO-8859-1 encoding. What I want is the ability to change that encoding.
The compiler isn't assuming UTF-8, the code which reads the file as a sequence of characters (before lexing, much less parsing, takes place) does that.
That is wrong, because it would not be possible to create a byte string containing an invalid UTF-8 sequence. This would break any software that has a string constant with ISO-8859-1 encoding in it (the programmer will still need to put a 'b' in front of it, but that is a lot easier and readable than going and replacing all the foreign letters with \x sequences).
In any case I don't see any reason why the Lexer should assume a different locale than the parser. That would be pretty confusing.
I remember the frustration of using a Linux desktop in the morning. All night various things had been running and as they were loaded it swapped out the desktop programs. But then it stayed like that, because nobody was using the desktop programs.
When you tried to use the desktop in the morning, it was *really* slow, because it now had to swap out all that stuff it had loaded, and then swap in the desktop programs. After a while it would return to normal speed.
I believe Windows tried to address this. They tried to get the "nightly" programs to at least write themselves out to the swap file, so that getting the desktop programs back required only 1/2 as much work (only reading them in, and not having to write out the nightly programs). Probably more significant, they keep track of the "active" process and continuously swap in that process even if it is doing nothing.
This is all from 8 or so years ago. Nowadays Linux seems to work fine. I don't know if it is because the swapping was fixed, or because a modern machine has so much memory that the desktop programs don't get swapped out. It is also possible that some of the crap that ran at night has been eliminated. Meanwhile it does seem that Windows' attempts to solve this have backfired because they cause far more swap writing than is necessary.
The scheme of turning errors into U+DCxx is called "UTF-8b"
As currently stated it has a big problem in that it destroys the current lossless conversion of UTF-16 to UTF-8. This would mean that a system using UTF-16 but translating to/from byte streams using UTF-8b could not safely operate a backend that uses UTF-16, though it can now operate a backend that uses UTF-8. This is somewhat counter-intuitive.
But I'm wondering if in fact a true bidirectional lossless scheme between UTF-8 and UTF-16 is possible. I'm trying to figure it out, but it makes my head hurt:
1. Invalid UTF-8 would convert each byte to U+DCxx.
2. The conversion from UTF-16 should undo U+DCxx to the matching bytes. But it has to look at the result and make sure it is *not* a valid UTF-8 encoding. If so then it should translate the first one as in CESU to 3 bytes.
3. The UTF-8/CESU encoding of U+DCxx normally should be considered invalid. But the UTF-8 decoder has to look at the following bytes and determine if the result would be something the encoder would turn into CESU according to rule 2 above.
That is as far as I can figure it out. Unfortunately my impression is that each rule requires the opposite direction to detect more cases. I cannot tell if there is a stable result that is practical to implement.
Reading the changelog, it sure does sound like b"abc"=="abc" will produce an error. I do find this extremely surprising as I would think this would break enormous amounts of software.
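For what it is worth, a quick check in a released Python 3 interpreter suggests the comparison itself is not an error; it just never matches. Mixing the two types in other operations is what raises. A sketch (run without the `-b` warning flag; variable names are mine):

```python
# Equality between bytes and str is allowed but is always False.
same = (b"abc" == "abc")

# Concatenating the two types is what raises.
try:
    b"abc" + "abc"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

So old code that compares the two types keeps running but quietly takes the "not equal" branch, which may be harder to notice than an exception.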
It sounds like Python 3.0 will throw an error if you read a file that contains invalid UTF-8, until the program is rewritten to read the file as "bytes". Then it will throw errors when you convert the bytes to "str", until you rewrite the functions reading the files to return bytes instead of str. Then the users will hit this problem in that their code will no longer compile. I can't see this being any good.
Checking the web pages, I am certainly not alone in this worry. A more popular solution however seems to be to stop throwing errors. The conversion to Unicode would instead translate invalid bytes to U+DCxx (ie unpaired UTF-16 lower-half surrogates). This would avoid the exceptions and also make the translation lossless. I have examined this before and it has a big problem in that the translation of (possibly invalid) UTF-16 to UTF-8 is no longer lossless (imagine the UTF-16 had a sequence of these invalid symbols that actually match a valid UTF-8 encoding), which might lead to bad security holes.
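The lossy direction can be reproduced with Python 3's `surrogateescape` error handler, which is an implementation of the U+DCxx idea. This is only a sketch of the failure case described above, using a made-up input:

```python
# A Unicode string holding two unpaired low surrogates, as might come
# from a backend that hands over (invalid) UTF-16.
original = "\udcc2\udca2"

# Converting to bytes turns them into 0xC2 0xA2 ...
as_bytes = original.encode("utf-8", errors="surrogateescape")

# ... which happens to be the valid UTF-8 encoding of a cent sign, so
# converting back gives a different string than we started with.
decoded = as_bytes.decode("utf-8", errors="surrogateescape")
```

`decoded` is U+00A2, not the two surrogates, so two different strings on the Unicode side can become identical after a round trip through bytes.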
if it's invalid, it's no longer UTF-8, right?
You are parroting the same crap used by people who don't like UTF-8 and try to make it more difficult than it really is. It is indeed UTF-8; just because it has errors in it does not make it not be UTF-8, any more than a misspelled word makes this post not be English.
It's not true for most post-Java mainstream and/or generally well-known languages
You seem to have forgotten languages called "C" and "C++". I heard they were pretty popular...
I think you might also check exactly what some of those languages do, you can't put more than \xff into most of them so they are actually doing exactly what I am saying, except they are assuming ISO-8859-1 as the encoding. If the encoding can be changed to UTF-8 then it would work exactly like I am stating. (if values greater than 0xff are accepted they could ignore the encoding and you would remain compatible).
What you are saying is that there is no difference between \x and \u, which seems pretty stupid to me.
The main reason I want this is so that a string constant can be changed between bytes and unicode by just changing the 'b' to a 'u'. This is also why I want \uXXXX to work in byte strings.
On b"\u00A2": Well, of course it's invalid - it's a byte array, not a string! And why do you think that it would have to be UTF-8 even if it was allowed? Why not UTF-16 or UCS4?
The compiler is already assuming UTF-8 when it parses u"abÂ" so I see no reason it can't assume UTF-8 here as well.
Damn you are right. They are not copying it or modifying it. What they are doing is violating the EULA, and whether that is illegal is quite questionable!
I got flamed for this before, but I am very concerned about their use of UTF-16 in the string constants by default. But if anybody more informed can correct me, please tell me.
The problem is that I have an arbitrary byte string that *MIGHT* be UTF-8. I want to test if it is a particular Unicode string. I do this:
if byte_string=="UTF-8 constant":
The above describes the actual sequence of bytes in my Python source file. Where I say "UTF-8 constant" I mean the Python source file has the correct sequence of bytes to encode some piece of Unicode in UTF-8.
In Python 2.0, the string constant is converted to a byte string without change. The UTF-8 will then compare exactly like I intended.
In Python 3.0, I am very unsure what happens. It appears the compiler will have converted the UTF-8 string constant into a Unicode string long before this statement is executed. So what happens? Here are the possibilities:
1. The statement is an error as the types don't match. Quite a few people claimed this in response to my previous posts. But I find it hard to believe this as it would break vast amounts of Python software and I don't see this mentioned in any of the porting guides.
2. The byte string is converted to Unicode before comparison. This will fail to do what I want if the current translation is not UTF-8. I am willing to set it to UTF-8 (though it would be great if Python defaulted to that!). But then there is the problem of what to do if the byte_string contains invalid UTF-8. It cannot be translated to Unicode. But I don't want an exception, I obviously intend this to return false in that case!
3. In this example the Unicode could be converted to a byte string before comparison, and it would work (provided I set the current translation to UTF-8). However this does not work for the much more common case of a function defined to take a "string" parameter, which would require #1 or #2 above.
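Possibility 3 can at least be written by hand in Python 3: encode the constant rather than decode the data, so invalid input simply fails to match. A minimal sketch (the helper name `matches` is hypothetical):

```python
def matches(byte_string: bytes, text: str) -> bool:
    """Compare possibly-invalid UTF-8 bytes against a Unicode constant.

    The constant is encoded, so the input bytes are never decoded and
    no exception can come from bad data.
    """
    return byte_string == text.encode("utf-8")


good = matches("\u00a2 off".encode("utf-8"), "\u00a2 off")
bad_bytes = matches(b"\xff off", "\u00a2 off")
wrong_encoding = matches("\u00a2 off".encode("iso-8859-1"), "\u00a2 off")
```

This returns False for the invalid and the ISO-8859-1 inputs without throwing, but it only helps where the caller controls the comparison, not for a function that insists on a "string" argument.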
It also appears to be impossible to make an unadorned string constant that contains an *invalid* UTF-8 encoding, since the translation is done at compile time, so no changes to the current encoding will help.
I also see serious difficulties with programs that use backslash escapes to insert UTF-8 into string constants. In Python 2.0 and in most other languages "\xC2\xA2" is a cent-sign (or at least the UTF-8 encoding of a cent sign). In Python 3.0 it is two Unicode characters, and does not compare equal to b"\xC2\xA2"!!!
Also the documentation claims that b"\u00A2" is invalid, but that makes it really difficult to make byte string constants containing arbitrary UTF-8 in a more readable way. It would be really nice if they fixed this.
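The difference between the escapes can be checked in a Python 3 interpreter. A small sketch (variable names are mine):

```python
two_chars = "\xC2\xA2"    # str: U+00C2 followed by U+00A2
cent = "\u00A2"           # str: a single cent sign
cent_bytes = b"\xC2\xA2"  # bytes: the UTF-8 encoding of a cent sign

# The \x form in a str only matches the bytes if you pretend the
# encoding is ISO-8859-1; the \u form matches them through UTF-8.
via_latin1 = two_chars.encode("iso-8859-1")
via_utf8 = cent.encode("utf-8")
```

Both `via_latin1` and `via_utf8` equal `cent_bytes`, yet `two_chars` and `cent` are different strings, so changing a 'b' prefix on the constant changes what the same escapes mean.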
I know a lot of people don't believe me, but I see nothing but grief from this decision. If you can actually state how the above work and/or why they are not a problem I would love to hear it.
This might be clearer if you showed the equivalent Python syntax. I think you are saying that "f x, g y" might either mean "(f(x),g(y))" or "f(x,g(y))". However I really can't see any reason for the first interpretation; it looks to me that it is unambiguously the second one.
I do agree however there must be ambiguous statements, but they are more complex than this. One area is that two string constants seem to concatenate, this appears to be done by the tokenizer, not the parser?
Since the purpose is to provide backward compatibility with the print statement, maybe only the first token is special. "f a,b,c" turns into "f(a,b,c)" but "a+f a,b,c" is a syntax error just like now.
Somebody else pointed out that the Python shell could do this without changing Python internals at all and most people would be happy.
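A shell-level version of the "only the first token is special" rule could be a plain text rewrite applied to each input line before it reaches the compiler. This is a rough sketch under that assumption; the function name `autocall` and the exact set of exclusions are made up for illustration, and real input would need far more care.

```python
import keyword
import re

# Words that can follow a name in an ordinary expression ("a in b").
_INFIX_WORDS = {"in", "is", "and", "or", "if", "else", "not", "for"}

# A bare name, some whitespace, then the rest of the line.
_LINE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)[ \t]+(\S.*)")


def autocall(line: str) -> str:
    """Rewrite 'f a,b,c' as 'f(a,b,c)'; leave everything else alone."""
    m = _LINE.fullmatch(line)
    if m is None:
        return line  # first token is not a bare name, e.g. "a+f a,b,c"
    name, args = m.group(1), m.group(2)
    if keyword.iskeyword(name):
        return line  # "if x: ...", "import os", ...
    if args.split()[0] in _INFIX_WORDS or args[0] in "=+-*/%<>!&|^@,.:([{":
        return line  # already an expression or assignment, or ambiguous
    return f"{name}({args})"
```

With this, `autocall("f a,b,c")` gives `"f(a,b,c)"`, while `"a+f a,b,c"` is passed through untouched and stays a syntax error.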
Replying to myself:
Personally I think the LGPL is not doing what it was intended to do, because when it was written they were thinking only about libc and not libraries that might not be included with the operating system.
Static linking should be allowed. The requirement that should be enforced is that if you modify the code in the LGPL library itself, you have to distribute the modifications. The rules are a bit more complicated so that you are not allowed to modify it to call a pointer that is set to point at a secret implementation, and other tricks.
The LGPL requirement for shared libraries is actually a big hindrance to complex libraries. It pretty much requires the binary api to be frozen. As anybody who has tried to write anything that complicated knows, that is quite impossible.
The other popular solution is to add a "linking exception" to the GPL/LGPL. Something called Classpath has the most popular wording. This pretty much makes the LGPL work like most people expect. One problem with this is that the "linking exception" completely hides all the differences between the GPL and LGPL (ie the result is exactly the same if you add the linking exception to either one). But without the name "LGPL" people don't think they can use the code in closed source. I think it would help a lot if GNU would standardize the wording of the "linking exception" and make it commonly known, so that people who see "GPL+linking exception" would know that it is even more useful than LGPL.
The main problem is that the LGPL forces you to use Qt as a shared library if your program is closed-source. You cannot statically link with it as this violates the LGPL (technically it requires that end users have some method of relinking with a new version of the LGPL code, but the only practical solution is a shared library, as object files make your distribution twice as big and make reverse-engineering your program a lot easier).
This is a problem for commercial developers who want to distribute a program to machines that don't have Qt installed.
It also is a problem for developers that want to modify Qt as they have to figure out how to make it call their own version and not cause DLL hell.
Wrong. Qt's license did not allow modification of the Qt code, and did not allow commercial distribution of the result (whether or not the source was included). This is not a BSD/GPL type of comparison. Qt was in fact completely unusable by an open source application.
Qt did not put any license that allowed commercial redistribution until long after Gnome was forced to use GTK because of this.
The Linux kernel has flourished without any real competitors for a while now...
I seem to remember there is something called BSD. Maybe you should look it up.
Actually an interesting thing I have noticed is that in open source there tends to be exactly TWO "popular" versions of everything. Whether there are a hundred competing versions (such as chat programs and music players) or very few (toolkits), it seems that exactly two are used by 90% of the users or more. Not one, not three.
Anybody have any explanation for this?
Just re-read your post and I think you're making a few mistakes.
Actually the data is 8x as dense in that example, whoever wrote the page is not doing their geometry right... Plus in the above comparison they ignored the white line which is another pixel of height.
For the scale of the drawing, the data is merely 4x density in the MS Tag.
I did the geometry right, but I was being stupid about the bits. 4 colors is 2 bits, not 4. So the data density (if the sides of the objects are the same size) is 4x. Also the caption never says 4x, it just says "4 symbols", so I did not read it very carefully.
The space used by the white line in this image is far smaller than the ones in the actual codes. The graphical image implies the vertical spacing is equal to the side of a triangle. But their actual examples are 6 triangles wide but only 5 rows high, so the vertical spacing is really 6/5 the size of a triangle.
Also the QR code example is 4x the size of any real ones I ever saw printed in a paper.
That's okay -- the bigger the better.
No, I mean that it has 4x as many squares in it as any actual QR I have seen (I have seen plenty printed in UK newspapers). It also has 6 small targets in it instead of 1 that I have seen in all examples.
I don't know why they did not take a normal QR and print it a lot larger. What I meant by "honesty" is that the only explanation I can think of is that there is a printing resolution limit (probably due to color printing alignment issues). They don't mention this limit but don't want to lie in their example, so to make the QR bigger they made it have a lot more data than necessary.
The biggest dishonesty is that they are comparing a very long URL with a Microsoft number being looked up on their servers.
I think you're mistaking the redirecting service for the data encoded in the tag...
I don't know what you are talking about, and you seem to be ignoring my TinyURL example. TinyURL is a "redirecting service" that can be used by QR codes. The Microsoft one says "123456" and you store the actual link on Microsoft's servers. A TinyURL QR would say "http://tinyurl.com/123456" and you store the actual link on TinyURL's servers. The TinyURL is 19 bytes larger, but you are not locked into using TinyURL!
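The "19 bytes larger" figure is easy to check. A quick sketch, using the same made-up payloads as in the post ("123456" is not a real tag or TinyURL id):

```python
tag_payload = "123456"                       # what the Microsoft tag would store
tinyurl_payload = "http://tinyurl.com/123456"  # what the QR code would store

# Both payloads are plain ASCII, so bytes and characters are the same count.
overhead = len(tinyurl_payload.encode("ascii")) - len(tag_payload.encode("ascii"))
print(overhead)
```

The whole difference is the "http://tinyurl.com/" prefix, which is the price of not being tied to one lookup service.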
I do think it is somewhat dishonest to try to claim that their color is what makes it so tiny, when in fact it is that it is storing very little data.
Actually, they are claiming exactly what you said -- that they have to store very little data in their tag. That's why they're able to make it so small. That's also one of the reasons pattern recognition works so much better for them.
I'm sorry, please point to the text on the page that says the amount of data is smaller. I have just reread it carefully and it NEVER says that. All it talks about is "High Capacity Bar Codes"
The very paper you linked to shows that the conversion built into Python does UTF-16, not UCS-2:
>>> char = u"\N{MUSICAL SYMBOL G CLEF}"
>>> len(char)
2
If it did UCS-2 the assignment would have to produce some kind of error, or at least produce a 1-character incorrect string.
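The session above is from a narrow Python 2 build. As a rough sketch of the same point in Python 3 (where, as far as I know, len() of that string is 1 since 3.3 because the internal storage changed), you can still see the UTF-16 surrogate pair by encoding explicitly:

```python
clef = "\N{MUSICAL SYMBOL G CLEF}"  # U+1D11E, outside the 16-bit range

# Encode to UTF-16 (little-endian, no BOM) and count 16-bit code units.
utf16 = clef.encode("utf-16-le")
code_units = len(utf16) // 2

# Split into the two 16-bit words: a high surrogate followed by a low one.
high = int.from_bytes(utf16[0:2], "little")
low = int.from_bytes(utf16[2:4], "little")
print(code_units, hex(high), hex(low))
```

UCS-2 has no way to represent this character at all, so getting two code units back means a surrogate pair, which means UTF-16.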
It seems to me you don't know what you are talking about.
And I really want you to explain how you plan to handle UTF-8 data that may have errors in it. Are you thinking that somehow the real world outside the program is some kind of magical perfect place that does not produce incorrect strings? Do you think people are going to catch errors? Or do you think, like I do, that a million Python programmers are going to give up on Unicode completely and treat all 8-bit data as ISO-8859-1, which is the easy solution?
It may say "UCS-2" but Python on Windows uses UTF-16, by the simple fact that it copies the strings unchanged to the Windows API, and that is UTF-16.
The web page does mention QR codes:
http://www.microsoft.com/tag/content/overview/
However I think their comparisons are quite bogus.
The first image shows how using color (actually it probably uses brightness) to store 2 bits per triangle makes the data 4x as dense. Actually the data is 8x as dense in that example, whoever wrote the page is not doing their geometry right.
However their second example comparing the actual QR code to their code, for some reason (probably honesty) prints the QR code so small that each Microsoft triangle is approximately the size of a 3x3 rectangle of QR code. Plus in the above comparison they ignored the white line which is another pixel of height. So a 3x4 rectangle of QR code is the same size as 2 triangles (not 1 triangle as the rectangle contains 1/2 of two other triangles), thus the QR code has 12 bits in the same area as Microsoft has 8 bits. Also the QR code example is 4x the size of any real ones I ever saw printed in a paper.
The biggest dishonesty is that they are comparing a very long URL with a Microsoft number being looked up on their servers. If the QR code was a TinyURL it would itself be almost as small (it will have the overhead of the TinyURL website name). I do think it is somewhat dishonest to try to claim that their color is what makes it so tiny, when in fact it is that it is storing very little data. Also artificially inflating the size of the QR code is not very honest either.
I do suspect that they have made a better form of pattern recognition. The QR code does seem to be a rather amateur attempt and I was surprised when I first saw them that such an obvious pattern was used. I would prefer however if they had worked on storing arbitrary data such as a URL, relying on keeping your information on Microsoft's servers in order to use this does not sound really like something everybody will want.
I posted about this before in a previous Python 3.0 article and a lot of people attacked me. However I very much feel that Python's treatment of Unicode as UTF-16 is a HUGE problem that will cause no end of pain. I think a far cleaner solution to Unicode is to do the following:
- Make unmarked plain quoted strings produce byte strings just like they do now. Unless there are backslashes, the contents are precisely the bytes that are in the input file. Keep the automatic casting of byte strings to unicode strings.
- Force the encoding to be UTF-8 by default, or at least make it trivial to turn this mode on (in Python2.x the default init deletes the api to do this!)
- The sequence \uXXXX in a byte string constant should turn into the correct UTF-8 sequence. And the sequence \xXX in a Unicode string should be interpreted as bytes and converted from UTF-8 to unicode. This is necessary so that a string constant can easily be changed between bytes and Unicode.
- We must have lossless conversion of UTF-8 to UTF-16. The most popular method I have seen is to turn invalid bytes into 0xDCxx (which is invalid UTF-16, as it is an unpaired lower-half surrogate). Oddly enough this makes the UTF-16 API useless, because the reverse conversion is not lossless. I have looked into this and it may be fixable, but it is complex: the to-UTF-8 converter must not translate a sequence of these to a legal UTF-8 sequence, and must instead convert that sequence to the typical 3-byte encoding of that number; and the from-UTF-8 converter must treat these typical 3-byte encodings as invalid byte sequences except when they are arranged such that the back converter would make them! This is messy, but I see no other way to be able to use backends that insist on UTF-16 (in particular Windows filenames and its clipboard).
The reason for this is that real Python programs need to handle arbitrary data that is *PROBABLY* UTF-8. Note that by "PROBABLY" I mean that the programmer really really wants to think of it as a sequence of unicode characters, not as a "byte sequence", but it must NOT compare any two different byte sequences as being equal.
I'm very afraid that Python3.0 as designed will encourage byte sequences to be treated as ISO-8859-1 rather than UTF-8 (because when you set the translation to that it is lossless and no errors are thrown, and \xXX does the same thing in both constants). IMHO this would be very, very bad for internationalization efforts. Believing the programmers will not take this easy solution, and instead rewrite their interfaces to the new byte/unicode naming and correctly handle exceptions thrown by converters is, I think, quite ignorant.
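Here is a small sketch of why ISO-8859-1 is the tempting easy way out. It uses only the standard codecs in Python 3; the cent sign is just an example character.

```python
# Every possible byte value decodes under ISO-8859-1, and it round-trips.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("iso-8859-1")   # never raises
round_trip = as_latin1.encode("iso-8859-1")

# The price: real UTF-8 text is silently turned into the wrong characters.
utf8_cent = b"\xc2\xa2"                       # UTF-8 encoding of U+00A2
mis_decoded = utf8_cent.decode("iso-8859-1")  # two characters, not one
correct = utf8_cent.decode("utf-8")           # the single cent sign
```

No exception, no data loss, and English still prints, which is exactly why I expect people to pick it.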
I am not joking or trolling about this. This has bitten me already and forced us to change all our use of Python from Unicode to byte strings. And we are just reading metadata from image files. Searching for comments on Python 3.0 on the web, it is apparent that web programmers are encountering this far more often and are very worried about this, and they certainly are trying to handle many orders of magnitude more data from sources that may be actively trying to exploit security holes.
I certainly put all the Windows machines I use to the Classic look. It has nothing to do with speed or interaction, it is because all the new stuff starting with XP is incredibly ugly!
Those candy colors must have appealed to the idiots they use for user testing. But it is shameful they would switch to something that looks like the worst of the Enlightenment themes from the 90's. Personally I find those colors and the shininess appallingly tasteless and very distracting.
Really, what happened to the designers who did Win95?
I did in fact defeat some DRM back in the age you are talking about.
We had cable and wanted to watch it on the TV in the kitchen as well as the one in the living room. The cable company solution was to buy a second cable connection.
I bought a switch box from Radio Shack (it had three rotating knobs on the front to send any of several inputs to any of three outputs). I also got some coax and ran it across the basement ceiling to connect the living room to the kitchen.
Then I had to defeat the "DRM" that the cable company provided. This was in the form of a special connection with a long sleeve over the coax screw-on connector on the cable coming from outside. After much fiddling I managed to undo it with needle nosed pliers from the cable box.
I could then connect this cable to the Radio Shack switcher, and from there send the signal over the new cable to the kitchen TV. The non-encrypted channels were then available there (I think the reason I did not get the encrypted ones is that the output of the decoder box was a TV signal with channel 2/3 modulation; the larger TV in the living room was too old to have direct cable input).
Tabs are automatically aligned with each other to occupy the same screen space. That is a big difference from separate overlapping windows. Also Microsoft was using tabs in some programs (especially the IDEs) for quite a while, so it is rather difficult to claim they wanted the taskbar to be used for this purpose.
The taskbar really replaced the "iconized" form of windows that was used by earlier Windows and copied from Unix desktops. The taskbar *does* have one of the major Microsoft innovations however: the fact that the "icon" did not disappear when the window was "deiconized". All popular previous systems had (at least by default) the idea that a window was switched between the "icon" and "window".
I do agree with your hate of MDI, but MDI was a development by Microsoft; all previous systems I ever saw kept each document and the control panels in different windows. They initially did it to make it much less likely that another application not being used would be swapped in (since the MDI enforced a single rectangular area that was controlled by the foreground program, and users could move the subwindows without exposing any of another app's window). However they kept at it because of their stupid decision to make click always raise a window (which pretty much means overlapping windows are useless). This stupid decision was copied by Linux, unfortunately. Try a really old Unix such as Irix to see how it should be working.
Your post seemed to imply "using Linux means you are using Emacs and VI", that is what I was complaining about. It is exactly as legitimate to say "using Windows means you are using Emacs and VI".
You do know that Emacs and VI work on Windows as well, right? Emacs in particular is quite popular. So your argument is just stupid. You might as well say "learning Word is no good because it won't teach you Emacs, which is also used on Windows".
If your input file is supposed to be UTF-8 text, and is not, then surely it's an error?
UTF-8 with errors is STILL UTF-8. It just is not "valid UTF-8" which is a mostly uninteresting subset. The set of UTF-8 strings is every single possible byte sequence. The set of "valid UTF-8" strings is a SUBSET that a tiny portion of software (mostly validators) should have to care about.
People are trying to make this far more difficult than it really is by somehow saying that we must restrict ourselves to that subset at a very low level. That is wrong and is the main reason why there is so much confusion about UTF-8. Nobody seems to care that UTF-16 can have illegal sequences (Python handles them without complaint) and nobody cared for 10 years that the Japanese encodings could have illegal sequences. But for some reason UTF-8 brings out this complaint over and over again. I suspect the problem is that people have invested too much effort in UTF-16 and don't want to admit they made a huge mistake, and the only way is to try to make UTF-8 hard.
But, of course, as soon as you want to start treating it as an actual string - so that you can say things such as "give me the 10th character" (and not "10th byte") - it has to be valid, otherwise all string-specific operations would simply be undefined.
Well of course. Therefore THAT function should throw the damn exception! Not every single string manipulation!!!!
Also you amazingly did the same bogus example of "move by 10 characters" I have seen before. Please look at real software and you will see that NOBODY EVER MOVES BY "10 CHARACTERS". 1 maybe. Otherwise the only use EVER of such code is because "10 characters" was previously calculated by another function looking at the EXACT SAME STRING and therefore a byte offset or UTF-16 word offset or whatever will work just as well.
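A sketch of what I mean, in Python 3 (the sample text is made up). The offset always comes from a search on the same string, so a byte offset into the UTF-8 does the job just as well as a "character" offset:

```python
text = "price: 10¢ or €1"
data = text.encode("utf-8")          # the same string, stored as UTF-8 bytes

# Find the euro sign both ways.
char_offset = text.find("€")                 # offset counted in characters
byte_offset = data.find("€".encode("utf-8"))  # offset counted in bytes

# The numbers differ (the cent sign takes 2 bytes), but each offset is
# only ever used on the string it was computed from, so both work.
tail_from_chars = text[char_offset:]
tail_from_bytes = data[byte_offset:].decode("utf-8")
```

Because UTF-8 is self-synchronizing, a match for a valid encoded character always starts on a character boundary, so the byte slice never cuts a character in half.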
L"\xC2\xA2" is not a cent sign in either C or C++. It's a wide string with two characters.
It is byte values converted using ISO-8859-1 encoding. What I want is the ability to change that encoding.
The compiler isn't assuming UTF-8, the code which reads the file as a sequence of characters (before lexing, much less parsing, takes place) does that.
That is wrong, because it would not be possible to create a byte string containing an invalid UTF-8 sequence. This would break any software that has a string constant with ISO-8859-1 encoding in it (the programmer will still need to put a 'b' in front of it, but that is a lot easier and readable than going and replacing all the foreign letters with \x sequences).
In any case I don't see any reason why the Lexer should assume a different locale than the parser. That would be pretty confusing.
I remember the frustration of using a Linux desktop in the morning. All night various things had been running and as they were loaded it swapped out the desktop programs. But then it stayed like that, because nobody was using the desktop programs.
When you tried to use the desktop in the morning, it was *really* slow, because it now had to swap out all that stuff it had loaded, and then swap in the desktop programs. After a while it would return to normal speed.
I believe Windows tried to address this. They tried to get the "nightly" programs to at least write themselves out to the swap file, so that getting the desktop programs back required only 1/2 as much work (only reading them in, and not having to write out the nightly programs). Probably more significant, they keep track of the "active" process and continuously swap in that process even if it is doing nothing.
This is all from 8 or so years ago. Nowadays Linux seems to work fine. I don't know if it is because the swapping was fixed, or because a modern machine has so much memory that the desktop programs don't get swapped out. It is also possible that some of the crap that ran at night has been eliminated. Meanwhile it does seem that Windows' attempts to solve this have backfired, because they cause far more swap writing than is necessary.
The scheme of turning errors into U+DCxx is called "UTF-8b"
As currently stated it has a big problem in that it destroys the current lossless conversion of UTF-16 to UTF-8. This would mean that a system using UTF-16 but translating to/from byte streams using UTF-8b could not safely operate a backend that uses UTF-16, though it can now operate a backend that uses UTF-8. This is somewhat counter-intuitive.
But I'm wondering if in fact a true bidirectional lossless scheme between UTF-8 and UTF-16 is possible. I'm trying to figure it out, but it makes my head hurt:
1. Invalid UTF-8 would convert each byte to U+DCxx.
2. The conversion from UTF-16 should undo U+DCxx to the matching bytes. But it has to look at the result and make sure it is *not* a valid UTF-8 encoding. If so then it should translate the first one as in CESU to 3 bytes.
3. The UTF-8/CESU encoding of U+DCxx normally should be considered invalid. But the UTF-8 decoder has to look at the following bytes and determine if the result would be something the encoder would turn into CESU according to rule 2 above.
That is as far as I can figure it out. Unfortunately my impression is that each rule requires the opposite direction to detect more cases. I cannot tell if there is a stable result that is practical to implement.
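For what it is worth, rule 1 can be tried out in Python 3.1 and later through the "surrogateescape" error handler (PEP 383), which as far as I can tell is the same idea as UTF-8b: each undecodable byte becomes U+DC00 plus the byte value. This is only a sketch of the bytes-to-text-to-bytes direction; it does not implement rules 2 and 3.

```python
# "\xff" is never valid in UTF-8; "\xc2" is a lead byte with no continuation.
raw = b"ok \xff\xc2 bad"

# Rule 1: invalid bytes become U+DCxx instead of raising or being replaced.
text = raw.decode("utf-8", errors="surrogateescape")

# Going back turns each U+DCxx into the original byte again.
back = text.encode("utf-8", errors="surrogateescape")
```

So bytes that start out as (possibly invalid) UTF-8 survive the trip. The trouble is the other starting point, text that already contains U+DCxx, which is what rules 2 and 3 are trying to deal with.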
Reading the changelog, it sure does sound like b"abc"=="abc" will produce an error. I do find this extremely surprising as I would think this would break enormous amounts of software.
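For anyone who wants to check this on a released Python 3: as far as I can tell the comparison itself does not throw under the default settings, it just quietly returns False (starting the interpreter with -b or -bb turns it into a warning or an error). Mixing the types in other operations does throw. A minimal sketch:

```python
# Comparing bytes with str: no exception by default, simply never equal.
result = (b"abc" == "abc")

# Comparing after an explicit decode works as expected.
decoded_equal = (b"abc".decode("utf-8") == "abc")

# Other mixed operations, such as concatenation, raise TypeError.
try:
    b"abc" + "abc"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

A silent False is arguably worse than an error for old code, since the test just stops matching without any message.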
It sounds like Python 3.0 will throw an error if you read a file that contains invalid UTF-8, until the program is rewritten to read the file as "bytes". Then it will throw errors when you convert the bytes to "str", until you rewrite the functions reading the files to return bytes instead of str. Then the users will hit this problem in that their code will no longer compile. I can't see this being any good.
Checking the web pages, I am certainly not alone in this worry. A more popular solution however seems to be to stop throwing errors. The conversion to Unicode would instead translate invalid bytes to U+DCxx (ie unpaired UTF-16 lower-half surrogates). This would avoid the exceptions and also make the translation lossless. I have examined this before and it has a big problem in that the translation of (possibly invalid) UTF-16 to UTF-8 is no longer lossless (imagine the UTF-16 had a sequence of these invalid symbols that actually match a valid UTF-8 encoding), which might lead to bad security holes.
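The lossy case can be shown with the "surrogateescape" error handler in Python 3.1 and later (PEP 383), which I am assuming behaves like the U+DCxx scheme described above. Start from text holding two unpaired surrogates whose byte values happen to form a valid UTF-8 sequence:

```python
# Two "escaped" bytes, 0xC2 and 0xA2, which together are valid UTF-8 for U+00A2.
original = "\udcc2\udca2"

# Text to bytes: each U+DCxx is turned back into its byte.
as_bytes = original.encode("utf-8", errors="surrogateescape")

# Bytes to text: the pair is now a perfectly valid cent sign.
round_trip = as_bytes.decode("utf-8", errors="surrogateescape")

lossless = (round_trip == original)
```

Two different strings (the pair of surrogates and a real cent sign) end up as the same bytes, which is the kind of collision that worries me.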
if it's invalid, it's no longer UTF-8, right?
You are parroting the same crap used by people who don't like UTF-8 and try to make it more difficult than it really is. It is indeed UTF-8, just because it has errors in it does not make it not be UTF-8, anymore than a misspelled word makes this post not be English.
It's not true for most post-Java mainstream and/or generally well-known languages
You seem to have forgotten languages called "C" and "C++". I heard they were pretty popular...
I think you might also check exactly what some of those languages do, you can't put more than \xff into most of them so they are actually doing exactly what I am saying, except they are assuming ISO-8859-1 as the encoding. If the encoding can be changed to UTF-8 then it would work exactly like I am stating. (if values greater than 0xff are accepted they could ignore the encoding and you would remain compatible).
What you are saying is that there is no difference between \x and \u, which seems pretty stupid to me.
The main reason I want this is so that a string constant can be changed between bytes and unicode by just changing the 'b' to a 'u'. This is also why I want \uXXXX to work in byte strings.
On b"\u00A2": Well, of course it's invalid - it's a byte array, not a string! And why do you think that it would have to be UTF-8 even if it was allowed? Why not UTF-16 or UCS4?
The compiler is already assuming UTF-8 when it parses u"ab¢" so I see no reason it can't assume UTF-8 here as well.
Damn you are right. They are not copying it or modifying it. What they are doing is violating the EULA, and whether that is illegal is quite questionable!
I got flamed for this before, but I am very concerned about their use of UTF-16 in the string constants by default. But if anybody more informed can correct me, please tell me.
The problem is that I have an arbitrary byte string that *MIGHT* be UTF-8. I want to test if it is a particular Unicode string. I do this:
if byte_string=="UTF-8 constant":
The above describes the actual sequence of bytes in my Python source file. Where I say "UTF-8 constant" I mean the Python source file has the correct sequence of bytes to encode some piece of Unicode in UTF-8.
In Python 2.0, the string constant is converted to a byte string without change. The UTF-8 will then compare exactly like I intended.
In Python 3.0, I am very unsure what happens. It appears the compiler will have converted the UTF-8 string constant into a Unicode string long before this statement is executed. So what happens? Here are the possibilities:
1. The statement is an error as the types don't match. Quite a few people claimed this in response to my previous posts. But I find it hard to believe this as it would break vast amounts of Python software and I don't see this mentioned in any of the porting guides.
2. The byte string is converted to Unicode before comparison. This will fail to do what I want if the current translation is not UTF-8. I am willing to set it to UTF-8 (though it would be great if Python defaulted to that!). But then there is the problem of what to do if the byte_string contains invalid UTF-8. It cannot be translated to Unicode. But I don't want an exception, I obviously intend this to return false in that case!
3. In this example the Unicode could be converted to a byte string before comparison, and it would work (provided I set the current translation to UTF-8). However this does not work for the much more common case of a function defined to take a "string" parameter, which would require #1 or #2 above.
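Possibility 3 can at least be written by hand. A minimal sketch in Python 3, where matches_text is a made-up helper name and UTF-8 is assumed as the encoding: convert the constant to bytes and compare bytes, so invalid input simply compares unequal instead of throwing.

```python
def matches_text(byte_string: bytes, text: str) -> bool:
    """Return True only if byte_string is exactly the UTF-8 encoding of text."""
    # Encoding the constant cannot fail for ordinary text, and comparing
    # bytes with bytes never raises, whatever garbage byte_string holds.
    return byte_string == text.encode("utf-8")


valid = matches_text(b"\xc2\xa2", "\u00a2")      # correct UTF-8 cent sign
garbage = matches_text(b"\xff\xfe", "\u00a2")    # invalid UTF-8
latin1 = matches_text(b"\xa2", "\u00a2")         # ISO-8859-1 cent sign
```

This does nothing for functions that insist on a "string" parameter, which is the more common case described in point 3.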
It also appears to be impossible to make an unadorned string constant that contains an *invalid* UTF-8 encoding, since the translation is done at compile time, so no changes to the current encoding will help.
I also see serious difficulties with programs that use backslash escapes to insert UTF-8 into string constants. In Python 2.0 and in most other languages "\xC2\xA2" is a cent-sign (or at least the UTF-8 encoding of a cent sign). In Python 3.0 it is two Unicode characters, and does not compare equal to b"\xC2\xA2"!!!
Also the documentation claims that b"\u00A2" is invalid, but that makes it really difficult to make byte string constants containing arbitrary UTF-8 in a more readable way. It would be really nice if they fixed this.
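A short sketch of how the escapes behave in Python 3, for anybody who wants to try it. The last line is the workaround I know of for writing a byte constant with \u escapes: encode a text literal explicitly.

```python
text_literal = "\xC2\xA2"    # in a str, each \xNN is one code point
bytes_literal = b"\xC2\xA2"  # in bytes, each \xNN is one byte
cent = "\u00A2"              # the actual cent sign

# The str literal is two characters (U+00C2 and U+00A2), not a cent sign.
two_chars = len(text_literal)

# The bytes literal is the UTF-8 encoding of the cent sign.
decoded = bytes_literal.decode("utf-8")

# The str literal only matches the bytes if you pretend it is ISO-8859-1.
as_latin1 = text_literal.encode("iso-8859-1")

# Workaround: build the byte constant from a text literal.
readable_bytes = "\u00A2".encode("utf-8")
```

So changing a constant between bytes and text is not just a matter of adding or removing the prefix.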
I know a lot of people don't believe me, but I see nothing but grief from this decision. If you can actually state how the above work and/or why they are not a problem I would love to hear it.
This might be clearer if you showed the equivalent Python syntax. I think you are saying that "f x, g y" might either mean "(f(x),g(y))" or "f(x,g(y))". However I really can't see any reason for the first interpretation; it looks to me like it is unambiguously the second one.
I do agree however there must be ambiguous statements, but they are more complex than this. One area is that two string constants seem to concatenate, this appears to be done by the tokenizer, not the parser?
Since the purpose is to provide backward compatibility with the print statement, maybe only the first token is special. "f a,b,c" turns into "f(a,b,c)" but "a+f a,b,c" is a syntax error just like now.
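A rough sketch of the "only the first token is special" idea, done as a text rewrite on a single unindented shell line rather than in the parser. The function name and the string-based approach are my own assumptions for illustration; a real implementation would work on tokens.

```python
def rewrite_leading_call(line: str, name: str = "print") -> str:
    """Turn 'name a, b' into 'name(a, b)' only when name is the first token."""
    stripped = line.strip()
    prefix = name + " "
    # Anything that does not start with the special name is left alone,
    # so "a + print b" stays a syntax error just like now.
    if not stripped.startswith(prefix):
        return line
    args = stripped[len(prefix):].strip()
    # Already written as a call, e.g. "print (a, b)": leave it alone.
    if args.startswith("("):
        return line
    return f"{name}({args})"


converted = rewrite_leading_call("print a, b, c")
untouched = rewrite_leading_call("a + print a, b, c")
already_call = rewrite_leading_call("print(a)")
```

Since this only looks at the start of the line, it cannot run into the "f x, g y" ambiguity discussed above.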
Somebody else pointed out that the Python shell could do this without changing Python internals at all and most people would be happy.