tazzzzz · Slashdot Mirror

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 12:31 · Score: 1

No, think a little harder.

Imagine a file system that names the files with strings of bytes.

It is absolutely vital that if I ask for a list of files and then try to open them, that this all work, no matter what byte sequence has managed to get in there as a filename.

It is also *nice* but nowhere near as vital that I be able to show these names to users and they read them as Unicode strings.

While many file systems do likely represent file names with strings of bytes, odds are that the OS is using some kind of encoding for those filenames. After all, the OS should be able to display filenames to the user, right?

So, it all boils down to what the Python 3.0 os.listdir (and related) routines return. I don't know the answer to that offhand (and I don't feel like building Python 3.0 to confirm). If Python has no idea what encoding the filenames are in, it has no choice but to return bytes objects.

You can only ever get a unicode object if you know what encoding the source is, and that would go for filenames as well.
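For reference, a minimal sketch of how this behaves in later Python 3 releases, as far as I know: os.listdir hands back text names when given a text path and bytes names when given a bytes path. The temporary directory and the file name example.txt are made up for the demo, and os.fsencode assumes Python 3.2 or newer.

```python
import os
import tempfile

def list_both_ways(directory):
    """Return (names as str, names as bytes) for one directory."""
    text_names = os.listdir(directory)               # str path in, str names out
    byte_names = os.listdir(os.fsencode(directory))  # bytes path in, bytes names out
    return sorted(text_names), sorted(byte_names)

with tempfile.TemporaryDirectory() as tmp:
    # "example.txt" is a hypothetical file created only for this demo
    open(os.path.join(tmp, "example.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)
```

Asking with a bytes path is the route that keeps "list the files, then open them" working no matter what byte sequence ended up in a name.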

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-05 12:17 · Score: 1

Maybe I should clear this up a bit more.

If your editor inserted the UTF-8 encoding of two bytes (0xc2,0xa3 I think) the result should be those same two bytes. However I/O routines when told to print the string should then decode the UTF-8 and produce the pound sign. If the compiler is producing something other than UTF-8 (such as current Python does if you put a 'u' before the quote) then the compiler does the conversion, not the I/O routine. My main argument is that I think this is a job for I/O, not the compiler, and I don't like Python changing the default.

The compiler also has to do I/O to read the file, and to do so successfully it needs to know what encoding your source file is in (and Python has had a mechanism for this for quite some time).

What's good about this change, imho, is that people are now *forced* to consider what encoding their I/O is being done in if they want to do string-like things. For too long, too many people have just plain ignored the issue of encodings and run into problems at inopportune times. This change pushes people closer to best practices.

By the way, when you put a "u" in front of the string in current Python versions (which is the default behavior in Python 3.0), it's not a matter of the compiler "producing something other than UTF-8". Rather, the compiler is using either your declared encoding for the file or your system default encoding to decode your source file and turn your literal string into a proper unicode string. If you stick UTF-8 encoded literals in your file and tell Python that it's a UTF-8 file, you will get a proper unicode object and you can convert to whatever encoding you want when you are presenting that literal externally.
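A small sketch of that decode-then-encode flow, written in Python 3 syntax. The pound sign is U+00A3 and its UTF-8 form is the two bytes 0xC2 0xA3; the variable names are just for illustration.

```python
# The two bytes an editor would write for a pound sign in a UTF-8 file.
raw = b"\xc2\xa3"

# Decoding turns the bytes into a one-character text string.
text = raw.decode("utf-8")

# From there you can encode to whatever the outside world wants.
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-le")
back_to_utf8 = text.encode("utf-8")
```

The same character ends up as different byte sequences depending on the target encoding, which is why the decoding step has to know what the source encoding was.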

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 14:27 · Score: 1

I think the lesson is that there is ONLY byte sequences.

The fact that some code can interpret that byte sequence and draw something on the screen that the user thinks of as "text" is completely irrelevant and should not be a fundamental datatype of a programming language.

No, text is important and there certainly are more than byte sequences. Yes, byte sequences are important and they certainly still exist in Python 3.0 (and, in fact, you now get a mutable byte sequence type as a bonus).

Let's say I have a webapp and there's a form with a state/province field. The user selects "California" from the list. The browser converts that set of characters to UTF-8 (because that's what's specified on the page) and then sends those bytes to the server. The web framework on the server properly spots the UTF-8 encoding, decodes it back into a bunch of characters.

This sequence of steps allows me to validate that the characters "California" represent a valid state.

If all I had was a series of bytes and not actual characters, I'd be SOL.

>>> u"California".encode("rot-13")
'Pnyvsbeavn'

Pnyvsbeavn is a perfectly legit series of bytes to represent "California", but I clearly couldn't do any useful validation there unless I decode it.

So, in many instances, the code does care about more than a sequence of bytes and "strings" containing "characters" are a very useful construct.
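Here's a rough sketch of that validation step in Python 3 syntax. VALID_STATES and validate_state are hypothetical names with an abbreviated list, not anything from a real web framework; in Python 3 the rot-13 codec is reached through codecs.encode rather than the string method.

```python
import codecs

# Hypothetical, abbreviated list of acceptable values.
VALID_STATES = {"California", "Oregon", "Washington"}

def validate_state(raw_bytes, encoding="utf-8"):
    """Decode the submitted bytes, then compare characters, not bytes."""
    try:
        text = raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False  # not valid in the declared encoding at all
    return text in VALID_STATES

submitted = "California".encode("utf-8")           # what the browser sends
scrambled = codecs.encode("California", "rot_13")  # same text, rot-13 applied
```

validate_state(submitted) passes, while the rot-13 form fails the check until someone decodes it with the right scheme first.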

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 14:14 · Score: 3, Informative

Actually, this has been explicit in Python for some time. In Python 2.x, "string" objects are byte sequences and "unicode" objects are character sequences.

What changes in Python 3.0 is that "unicode" objects have been renamed "string" and "string" objects have been renamed "bytes". So, not only is it explicit, but the naming makes more sense.

The other related change is that string literals in your code are interpreted as Python 3.0 "string" objects ("unicode" in Python 2.x terminology), whereas previously you had to stick a 'u' in front of the string to get that behavior. And you can indeed specify the encoding of your source files, which is nothing new.

All of this to say, you're right on the money and Python is already in the spot you describe as "better off".
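To make the naming concrete, a tiny Python 3 sketch (the variable names are mine): a plain literal is text, a b-prefixed literal is bytes, and bytearray is the mutable byte type.

```python
text = "foo"                # str: a sequence of characters
data = b"foo"               # bytes: an immutable sequence of byte values
scratch = bytearray(data)   # bytearray: the mutable byte sequence type

scratch[0] = ord("g")       # allowed, because bytearray is mutable
first_byte = data[0]        # indexing bytes gives an integer, not a character
decoded = data.decode("ascii")
```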

Re:String f**k up on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 14:09 · Score: 4, Informative

Reading the release, they have decided to really push 16-bit strings (they call this "Unicode" but it really is what is called UTF-16). I think this is a serious mistake.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.

I'm going to try once more, slightly differently. Two other people apparently have tried and failed.

Python 3.0's handling of strings is basically the same as Java's, because it has proven to work quite well there.

For webapps, and the rules may be a little different on the desktop, "best practices" in Python for some time have been that you use unicode objects everywhere internally when you are representing text. When you hit a boundary (a file on disk, the net), you encode that unicode string into whatever encoding makes sense (often UTF-8). So far, so good, I hope?

Python's internal representation of unicode objects is only relevant in that you need it to support whatever code points you care about. I don't think there are any code points that you can represent in UTF-8 that Python will screw up after decoding/encoding. I'm sure there are many people who would be interested to see such a test case.
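If anyone wants to go hunting for such a test case, a quick round-trip check might look like this in Python 3. The sample code points are arbitrary picks of mine; surrogates (U+D800 to U+DFFF) are left out on purpose because they are not valid in UTF-8 to begin with.

```python
def utf8_round_trips(code_point):
    """True if a code point survives encode-to-UTF-8 and decode-back unchanged."""
    ch = chr(code_point)
    return ch.encode("utf-8").decode("utf-8") == ch

# ASCII, Latin-1, the euro sign, an emoji, and the last valid code point.
samples = [0x41, 0xA3, 0x20AC, 0x1F600, 0x10FFFF]
results = [utf8_round_trips(cp) for cp in samples]
```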

If you have a bunch of bytes that *might* be UTF-8, you're screwed. "process data that is likely to be text but must not be altered"? What do you mean by text? 7-bit ASCII? UTF-8? And where is the text coming from? Unless you tell Python the encoding of the file, you're going to get bytes out, not unicode objects.
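A short Python 3 sketch of the "tell Python the encoding or get bytes" point. The file name sample.txt and its contents are invented for the demo.

```python
import os
import tempfile

payload = "caf\u00e9".encode("utf-8")   # 4 characters, 5 bytes in UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file for the demo
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode: no encoding involved, so you get the bytes untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding: you get characters.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()
```

If the data "must not be altered", binary mode is the safe choice; decode only once you actually know the encoding.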

The whole point is that Python unicode objects know how to represent code points. If you get a set of bytes from somewhere you *have* to know what encoding it is in order to be able to treat it as a bunch of text characters. Python unicode objects will not be "bad UTF-16". How they're stored is not generally important. What's important is that Python internally keeps track of the code points and will either successfully convert to whatever encoded sequence of bytes you want or it will raise an exception because the encoding you've chosen doesn't have one of the characters in your string.
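Both outcomes can be seen in a few lines of Python 3 (the sample string is mine): UTF-8 can represent the pound sign, ASCII cannot, so the second encode raises.

```python
text = "\u00a3100"   # a pound sign followed by "100"

# UTF-8 has a representation for every character in the string.
utf8_bytes = text.encode("utf-8")

# ASCII has no pound sign, so Python refuses rather than guessing.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```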

Python 3.0 makes this all clearer. When you talk about a "string", you're talking about a bunch of unicode characters. Anything else is a collection of bytes.

By the way, you can specify what encoding a Python source file is in so that your string literals are all properly decoded.

For further reading...
http://www.joelonsoftware.com/articles/Unicode.html

Re:Not sure about this one on Python 2.6 to Smooth the Way for 3.0, Coming Next Month · 2008-10-03 12:54 · Score: 5, Informative

...which is why some heavy python users, myself included, aren't going to use 2.6 or 3.0. I have huge amounts of python in operation, and the very last thing I'm going to do is break any of it with an incompatible language that happens to slightly resemble python (no matter who wrote it, and no matter what they call it, it isn't python if it can't run mundane python code.)

"slightly resemble python"? Python 3.0 code looks just like the Python that's been around for years. Maybe there's some handy new syntax (with), but it's still Python.

This is not about fundamentally changing Python. This is about cleaning up warts, some of which have been around since Python 1.x.

If you're going to modify a language, you *must* do it in a compatible manner, otherwise what you're doing is making a new language that will require an entirely new community. Names notwithstanding, and resemblance beyond incompatibilities notwithstanding.

From what I've seen, the Python devs have put together about the best possible migration path while still actually making the changes that need to be made.

Here's the picture, in case it's not clear: Python 2.6 is just as backwards compatible as the other 2.x releases. Which is to say that porting from 2.5 to 2.6 is pretty trivial. I'd expect any actively used and maintained library to be 2.6 compatible within weeks (and a great many probably didn't break at all).

2.6 lets you use many of 3.0's features that don't break compatibility (and there are many). It also has a warnings mode to help you spot 3.0 incompatible code. And it lets you selectively turn on 3.0 features within a module.

Want to start using the new print function?

from __future__ import print_function

Voila! The print keyword goes away and you have the new print function. Certainly bits of new Python 3.0 syntax work now as well:

try:
    1/0
except ZeroDivisionError as e:
    pass

The "as e" bit is new.

Finally, there's actually a "2to3" tool that makes many of the changes in an automated fashion.

The single biggest change from a compatibility standpoint is that "foo" is a unicode object in 3.0 and a string (set of bytes) in 2.x. You can even prepare for that switch:

from __future__ import unicode_literals

foo = "foo" # this will be unicode
bar = b"bar" # this is a set of bytes
unibar = bar.decode("utf-8") # get a unicode from the bytes

They have put *a lot* of thought into how to make this transition. People will gradually shift to 2.6, just as they did with 2.5. And, over time, they will change to using the new features. They'll probably upgrade to 2.7 (yes, there will be one), and use the new features even more. And eventually their code will just be 3.0 code and the switch will be a no-brainer.

Re:Don't forget ownership! on Best Way To Distribute Video Online? · 2008-09-05 07:11 · Score: 1

I'm pretty sure most of them don't actually claim rights to the video. The TOS would generally say "you give us a *non-exclusive* license to redistribute this content ad infinitum".

Re:Video software on TurboGears: Python on Rails? · 2005-10-11 04:42 · Score: 1

Yes, I used Snapz Pro X. I had a couple of people ask for some form of captions for those with sound off or those that are hard of hearing... something like Camtasia on Windows could make that easier, but then I couldn't use TextMate.

Snapz Pro X was nice to use, on the whole.

Re:My Rights Online on HP Discusses Anti-Counterfeiting Measures · 2004-02-07 07:12 · Score: 1

Err... "Congress shall make no law"... Since when has HP become Congress? Companies can make stupid products. Just don't buy em.

Why this costs $15000 on Pluto: Linux-based Do-everything System · 2004-01-03 08:33 · Score: 5, Informative

I should have given some more info knowing that the site would be slashdotted...

For that price, you get the Pluto Core, which is the Linux-based server. You get some number (unclear to me how many) of media distributors (PCs with DVD drives and network interfaces) that hook up to your TV and the Core to show video and play music. You also get "Orbiters", which are hand-held devices to which you can stream video from your security cameras and control the Pluto system.

So, we're not talking one Linux PC. It's a whole system of stuff. I've requested more pricing info, because I'm curious how much you have to pay for the various parts. $15K is a lot of money, but this can give technically unsophisticated folks a usable "home of the future" sort of setup.

Kevin

Re:-1: Slashvertisement on Pluto: Linux-based Do-everything System · 2004-01-03 08:11 · Score: 5, Insightful

As the person who posted this, I can say that I have absolutely no affiliation with the company that makes this. It seemed an appropriate topic for slashdot to me, because here's a product that incorporates doubtless dozens of open source projects into a useful, usable package. (At least, that's the idea... I don't have this system to play with...)

This is, I assure you, not a product placement (unless the /. editors convinced this company to fork over some dough between last night when I submitted this and now when it appeared on the site.)

Kevin

Re:Key Feature: directory awareness on Designing a New Version Control System? · 2002-07-16 04:50 · Score: 2, Informative

Meta-CVS is a wrapper around CVS that adds directory structure versioning.

View Source on Mozilla Tree Closes for 1.0 · 2002-03-28 06:31 · Score: 5, Interesting

Sigh... 1.0 comes along and they still haven't fixed the view source bug. Yep, still can't view the source of a dynamic page. The bug is labeled as "Future".

Is it me or does the ability to view the source of whatever you're looking at seem to be something that even a 1.0 browser should do correctly?

Re:The lesser of two evils on C# From a Java Developer's Perspective · 2001-11-19 09:46 · Score: 2, Informative

For people who haven't yet checked it out, IBM's Eclipse project... IBM has developed a GUI toolkit for Java that uses native widgets.

Technology is enabling diversity of recording on Sheet Music to Napster: Music Distribution Tech · 2001-06-06 01:23 · Score: 1

Not only is technology changing distribution, but it is also changing recording big time. Studio time is quite expensive, unless the studio is in your bedroom.

Today, a moderately zippy PC can replace: multitrack recorder, mixer, effects units and even synthesizers. A modern studio certainly may have better acoustics, equipment with more dynamic range, etc. But, the gap between the quality of what comes out of a commercial studio and what can come out of a bedroom is shrinking significantly.

So, less money in the system doesn't mean that acoustic folk will be the only style left standing.

User: tazzzzz

Comments · 15