Python 2.6 to Smooth the Way for 3.0, Coming Next Month
darthcamaro writes "Some programming languages just move on to major version numbers, leaving older legacy versions (and users) behind, but that's not the plan for Python. Python 2.6 has the key goal of trying to ensure compatibility between Python 2.x and Python 3.0, which is due out in a month's time. From the article: 'Once you have your code running on 2.6, you can start getting ready for 3.0 in a number of ways,' Guido Van Rossum said. 'In particular, you can turn on "Py3k warnings," which will warn you about obsolete usage patterns for which alternatives already exist in 2.6. You can then change your code to use the modern alternative, and this will make you more ready for 3.0.'"
Here are the changes.
I really have to check out the multiprocessing package. Too bad that I have to wait for the print function and the new division handling.
And if it's like some other languages you might have a long time to wait before 3.0.
Given that the first release candidate of Python 3.0 is already out, I doubt we'll be in for a very long wait.
I think the point is that with 2.6, your old code will work but will tell you what to change. If you move to 3.0, unless you have those changes already, it just won't work.
The problem is that there are three kinds of string-like objects in Python: UTF-16 strings, ASCII strings, and uninterpreted arrays of 8-bit bytes. Python 2.5 sort of supports all 3, with "array of bytes" the least well supported. Since this is a language without declarations, the semantics of this gets messy.
The most common problem was that functions like ".read()" yielded strings, not arrays of bytes. This follows C standard library semantics, but is a bad fit to Python. In 3.0, ".read()" yields an array of bytes, not a string. If the data read is to be converted to a string, "decode" is required. That's the right answer.
This is consistent with modern thinking about data representation. Consider SQL, which makes a similar distinction between "TEXT" and "BLOB".
You might not be aware of this, but computers are used for more than just transmitting text. I don't want my binary streams being rewritten to gibberish because some I/O routine was written to be too clever. Furthermore, not every system uses UTF-8. Some may even need to send data over a *gasp* network! Good luck getting every other computer in the world to start using UTF-8 immediately.
If you try to convert bytes that aren't in UTF-8 using a UTF-8 codec, an error will be raised. This behavior is proper -- if you don't know what format your input is in, there's no way to perform text-based operations on it.
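A minimal sketch of that behavior in Python 3. The sample bytes and the helper name try_decode are made up for illustration; they are not from the thread or the article.

```python
# Hypothetical sample: "cafe" with an accented e, encoded as Latin-1.
# The resulting byte 0xE9 is not a valid UTF-8 sequence on its own.
latin1_bytes = "caf\u00e9".encode("latin-1")

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes do not fit the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode(latin1_bytes, "utf-8"))    # wrong codec: decoding fails
print(try_decode(latin1_bytes, "latin-1"))  # right codec: text comes back
```

The point of the sketch is only that the failure is loud: the codec refuses the input instead of silently producing gibberish.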
Every developer I know uses Unicode strings already. The new behavior is just one less character to type in front of literals.
Otherwise said as: "We're too stupid to fix the glaring encoding errors in our product, so we'll just use bytes everywhere and pretend it's all working". Also, Unicode strings in Python are implemented with either UTF-16 or UCS-4 depending on platform.
Anthony Baxter gave a pretty good talk on the implications at LCA 2008 earlier this year.
http://video.google.com/videoplay?docid=4264641260805367198&hl=en
"Everything is adjustable, provided you have the right tools"
Python does not use UTF-16 strings; it uses UCS-2 strings. The difference is that in UCS-2, every character is represented by exactly two bytes, while in UTF-16, some characters, those outside Plane 0 (the BMP), are represented by a "surrogate pair" of two 16-bit code units, totaling four bytes. UCS-2 does not provide any representation for characters outside the BMP. In other words, UCS-2 is a straightforward fixed-length encoding, while UTF-16 is a more complex variable-length encoding.
Python can in fact use either of two internal representations for text: UCS-2 or UTF-32 = UCS-4. If you give the option --enable-unicode=ucs4 to configure when building Python, you will get a Python that supports all of Unicode rather than just the BMP.
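To make the fixed-length versus variable-length distinction concrete, here is a small Python 3 sketch that splits a string into its UTF-16 code units. The helper name utf16_code_units and the two sample characters are my own choices for illustration.

```python
bmp_char = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

def utf16_code_units(text):
    """Split text into its 16-bit UTF-16 code units (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print([hex(u) for u in utf16_code_units(bmp_char)])     # one code unit
print([hex(u) for u in utf16_code_units(astral_char)])  # a surrogate pair
```

A narrow (UCS-2) build of the 2.x era exposes those two code units as separate items in the string. As far as I know, Python 3.3 and later no longer have the narrow/wide split and treat such a character as a single item.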
"slightly resemble python"? Python 3.0 code looks just like the Python that's been around for years. Maybe there's some handy new syntax (with), but it's still Python.
This is not about fundamentally changing Python. This is about cleaning up warts, some of which have been around since Python 1.x.
If you're going to modify a language, you *must* do it in a compatible manner, otherwise what you're doing is making a new language that will require an entirely new community. Names notwithstanding, and resemblance beyond incompatibilities notwithstanding.
From what I've seen, the Python devs have put together about the best possible migration path while still actually making the changes that need to be made.
Here's the picture, in case it's not clear: Python 2.6 is just as backwards compatible as the other 2.x releases. Which is to say that porting from 2.5 to 2.6 is pretty trivial. I'd expect any actively used and maintained library to be 2.6 compatible within weeks (and a great many probably didn't break at all).
2.6 lets you use many of 3.0's features that don't break compatibility (and there are many). It also has a warnings mode to help you spot 3.0 incompatible code. And it lets you selectively turn on 3.0 features within a module.
Want to start using the new print function?
from __future__ import print_function
Voila! The print keyword goes away and you have the new print function. Certainly bits of new Python 3.0 syntax work now as well:
try:
    1/0
except ZeroDivisionError as e:
    pass
The "as e" bit is new.
Finally, there's actually a "2to3" tool that makes many of the changes in an automated fashion.
The single biggest change from a compatibility standpoint is that "foo" is a unicode object in 3.0 and a string (set of bytes) in 2.x. You can even prepare for that switch:
from __future__ import unicode_literals
foo = "foo" # this will be unicode
bar = b"bar" # this is a set of bytes
unibar = bar.decode("utf-8") # get a unicode from the bytes
They have put *a lot* of thought into how to make this transition. People will gradually shift to 2.6, just as they did with 2.5. And, over time, they will change to using the new features. They'll probably upgrade to 2.7 (yes, there will be one), and use the new features even more. And eventually their code will just be 3.0 code and the switch will be a no brainer.
If not, why wouldn't I just wait for 3.0 and then just fix everything ONCE?
Well, first of all, 2.6 and 3.0 come out at the same time and share many of the same new features... so there's no "just wait for 3.0" possible, it's either/or right now.
The advantage is that if you have a big pile of 2.5 code right now, you can slowly turn on the "use 3.0 style" switches in 2.6 and migrate your code one little switch at a time over a long period of time.
That way, a few years from now when they decide to stop supporting new features in the 2.x path and you really "must have" some new feature in the 3.x path, it will be significantly easier for you to switch if you've turned on the "use 3.0" switches previously.
From What's new in Python 3.0: The str and bytes types cannot be mixed; you must always explicitly convert between them, using the str.encode() (str -> bytes) or bytes.decode() (bytes -> str) methods.
That's the right way to do it, but I agree that as a retrofit to existing code, it's a headache.
Worse, it's a problem that's detected at run time, not compile time, at least with the CPython implementation.
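A rough Python 3 sketch of both points: the mix is rejected, and it is only rejected when the offending line actually runs. The function join_header and the sample values are hypothetical.

```python
def join_header(name, value):
    # Nothing is checked when the function is defined; the types only
    # matter once the concatenation actually runs.
    return name + ": " + value

text_value = "text/html"
byte_value = b"text/html"   # e.g. something read from a socket

print(join_header("Content-Type", text_value))

try:
    join_header("Content-Type", byte_value)
except TypeError as exc:
    print("mixing str and bytes failed at run time:", exc)

# The explicit conversion the release notes ask for:
print(join_header("Content-Type", byte_value.decode("ascii")))
```

So a code path that only sees bytes in production, and never in testing, will not complain until it is hit.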
In fact I am better informed than you are. When not compiled to use UCS-4, Python uses what is properly called UCS-2, with half-baked extensions for treating it as UTF-16. Certain functions know about surrogate pairs, such as those that convert between UTF-8 and the internal representation. However, such basic functions as len do not know about surrogate pairs. Try giving a character outside the BMP as the argument to len. It will return 2, not 1.
Reading the release, they have decided to really push 16-bit strings (they call this "Unicode" but it really is what is called UTF-16). I think this is a serious mistake.
The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.
I'm going to try once more, slightly differently. Two other people apparently have tried and failed.
Python 3.0's handling of strings is basically the same as Java's, because it has proven to work quite well there.
For webapps, and the rules may be a little different on the desktop, "best practices" in Python for some time have been that you use unicode objects everywhere internally when you are representing text. When you hit a boundary (a file on disk, the net), you encode that unicode string into whatever encoding makes sense (often UTF-8). So far, so good, I hope?
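Something like the following Python 3 sketch is what I mean by decoding and encoding at the boundaries. The helper names load_text and save_text are invented for the example, and io.BytesIO stands in for a real file or socket.

```python
import io

def load_text(binary_stream, encoding="utf-8"):
    """Boundary in: read raw bytes and decode them once."""
    return binary_stream.read().decode(encoding)

def save_text(binary_stream, text, encoding="utf-8"):
    """Boundary out: encode the text once, right before writing."""
    binary_stream.write(text.encode(encoding))

# Stand-in for bytes arriving from disk or the network.
incoming = io.BytesIO("na\u00efve caf\u00e9".encode("utf-8"))

text = load_text(incoming)
shouted = text.upper()        # internal processing works on text, not bytes

outgoing = io.BytesIO()
save_text(outgoing, shouted)
print(outgoing.getvalue())
```

Everything between the two helpers deals only in text; the encoding is named exactly twice, once on the way in and once on the way out.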
Python's internal representation of unicode objects is only relevant in that you need it to support whatever code points you care about. I don't think there are any code points that you can represent in UTF-8 that Python will screw up after decoding/encoding. I'm sure there are many people who would be interested to see such a test case.
If you have a bunch of bytes that *might* be UTF-8, you're screwed. "process data that is likely to be text but must not be altered"? What do you mean by text? 7-bit ASCII? UTF-8? And where is the text coming from? Unless you tell Python the encoding of the file, you're going to get bytes out, not unicode objects.
The whole point is that Python unicode objects know how to represent code points. If you get a set of bytes from somewhere, you *have* to know what encoding it is in to be able to treat it as a bunch of text characters. Python unicode objects will not be "bad UTF-16". How they're stored is not generally important. What's important is that Python internally keeps track of the code points and will either successfully convert to whatever encoded sequence of bytes you want or raise an exception because the encoding you've chosen doesn't have one of the characters in your string.
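Here is a quick Python 3 sketch of "convert successfully or raise", using an arbitrary sample string that contains a euro sign. The variable names are just for illustration.

```python
text = "price: 10 \u20ac"   # contains U+20AC EURO SIGN

# UTF-8 can represent every code point, so this round-trips cleanly.
utf8_bytes = text.encode("utf-8")

# ASCII has no euro sign, so the encoder raises instead of guessing.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes.decode("utf-8") == text)
print(ascii_ok)
```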
Python 3.0 makes this all clearer. When you talk about a "string", you're talking about a bunch of unicode characters. Anything else is a collection of bytes.
By the way, you can specify what encoding a Python source file is in so that your string literals are all properly decoded.
For further reading...
http://www.joelonsoftware.com/articles/Unicode.html
Actually, this has been explicit in Python for some time. In Python 2.x, "string" objects are byte sequences and "unicode" objects are character sequences.
What changes in Python 3.0 is that "unicode" objects have been renamed "string" and "string" objects have been renamed "bytes". So, not only is it explicit, but the naming makes more sense.
The other related change is that string literals in your code are interpreted as Python 3.0 "string" objects ("unicode" in Python 2.x terminology), whereas previously you had to stick a 'u' in front of the string to get that behavior. And you can indeed specify the encoding of your source files, which is nothing new.
All of this to say, you're right on the money and Python is already in the spot you describe as "better off".
I was on your side right up until you said:
Dumbass
This is quite true, but sort of irrelevant. Even the core developers on Python-dev have been seen to state on more than one occasion that they don't expect Python 3.0 to be the "standard" for a period of time that will stretch to years: one? three? The specifics don't exactly matter.
That's why they've done the releasing of Python 2.6 and Python 3.0 in parallel (although 3.0 was recently delayed a little, development of the two has gone hand in hand); they fully expect to maintain the 2.x line for a while, and are already talking of 2.7.
Each new iteration of 2.x will bring it closer to 3.0, and the third party modules will steadily become more and more available. Right now, IMHO, the biggest hurdle in the development of modules for 3.0 is the lack of a serious conversion document from the point of view of the C internals. But they're even working on that.
3.0 seems to be, more than anything else, a work still in progress. Even when it's released, it's not expected that everyone will be converting their code over to 3.0. They don't expect people to *really* start using it in a standard way until 3.1, 3.2 or so -- along with whatever version of 2.x accompanies it that people will be converting from at that time, complete with additional features to help ease the transition.
Personally, I find the strategy for migrating Python to 3.0 ... comforting. I don't necessarily agree with *all* of the changes done to 3.0, but most I quite like. Since I have a massive codebase at work that's currently running on 2.x, a major/incompatible change to "fix" the language is something that alarmed me early on.
However, now that I know 2.x will be supported for quite a while, and that new releases will be made to ease the way, I have a roadmap to follow that makes the burden significantly easier. Once we update our codebase to 2.6, I'll probably start slowly modifying things to activate more optional 3.x-isms, and by that time the myriad third party libraries will probably be supported.
2.6 brings a number of interesting features to us, and allows us to start working slowly towards migrating to the 3.0 world. This is a -very- well thought out migration plan, IMHO.
They're actually hard at work on that problem too. In addition to Python 2.6 being released, the Python documentation is now generated using Sphinx. See for example the new tutorial output. Big WTF the first time I saw it, but it's a decent improvement with more in the pipeline.
Changing my path is not practical. It's too broad. I'd have to write a shell script wrapper for the application which did 'env PATH=new_python:$PATH the_real_application "$@"' or something. And it's not just me; I'd have to communicate this to all other users of the system somehow. And changing one line of a script is not trivial, if I'm not root.
You have a system admin problem not a python problem. If you can't run system installed software and your admin refuses to help, you have an admin problem. Making it a python problem when your admin isn't doing his job, doesn't really make it a python problem.
All this may seem like minor things, but it adds up. And no other good language puts me in situations like that.
You still have multiple ways to address the issue. It is trivial. Even with multiple users.
Or those of us who have been around for a while, and seen innocent backwards-incompatible changes become maintenance nightmares ... Ok, maybe not a nightmare in this case, but an inconvenience and annoyance which will keep being inconvenient and annoying for years, until the last Python 2.x dependency goes away.
Or you can trivially fix it as above and be done with it. You're making it a mountain when it isn't even a mole hill. If you have such problems, stop using that version. It really is that easy.
Here's why your issues simply don't exist. For your situation to have occurred, you must have an admin that installs a new version of python and makes it the default system version. Furthermore, you must have multiple users using scripts installed system wide, which would have been installed by the same admin, which are now broken, and an admin that refuses to help fix these system wide scripts which you can't edit, and can't run using the old version of python. And, that means you refuse to change your user environment. That's nothing but a bad admin and lazy users, pure and simple. Furthermore, it's unlikely that your admin would install a new python version as the default, install the non-default libraries, and decide that the user base doesn't really need the new version and that the users requiring python in the first place don't need to run the scripts which are the entire purpose of having a new python install in the first place. In other words, nothing in your argument makes practical sense.
And yes, those are run on sentences. I used them on purpose to highlight your convoluted argument.