Slashdot Mirror


Text Processing in Python

Ursus Maximus writes "If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book." Ursus Maximus's review continues below.

Text Processing in Python
author: David Mertz
pages: 520
publisher: Addison Wesley
rating: 10
reviewer: Ursus Maximus
ISBN: 0321112547
summary: How to use Python to process text.

As you probably know, there are many good introductory texts about Python. This is not one of them, for this is an advanced book, but not an inaccessible one. David Mertz has a unique style and focus that we have become familiar with from his series of articles on the IBM Developer Network. Dr. Mertz is more interested in facilitating our learning process than in lecturing us, and rather than fill his pages with impressive examples designed to illustrate his expertise, he gently guides us by offering subtle yet important examples of code and analysis that make us think for ourselves.

He has a special talent for programming in the functional style, and this is a great introduction to that style of Python programming. Thus, this is also a good guide to using the newer features introduced into Python in the last few revisions, which often facilitate the functional style of programming.

The text includes, in an appendix, a 40 page tutorial covering the basic Python language. This tutorial is, like the book, unique in its approach and is worthwhile even for experienced Pythonistas, as it sheds light on some of the underlying ideas behind the syntax and semantics, and it also illustrates the functional style of programming, which is sometimes quite useful when doing text processing. And, despite its many other virtues, this is a book about text processing.

Chapter 1 covers the Python basics, but with a particular eye towards those features most critical and useful for text processing. Chapter 2 covers the basic string operations as found in the string module and the newer built-in string functions. Chapter 3 is about regular expressions, and, although I am shy about regexes because of their relative complexity, I am very glad to have read this chapter and will no longer be intimidated when regexes are the correct approach to take! Chapter 4 is on parsers and state machines, which are important for processing nested text, as in everyday HTML, XML and the like. This chapter is not as esoteric as its title may sound to relative newbies (like myself), as it does offer useful ideas and principles for dealing with HTML. How much more useful can a topic be than that? It is true that a deep understanding of this subject may be beyond me and other relative duffers, but this chapter has much to offer those like me and I am sure much more to offer professionals.

Chapter 5 is on Internet tools and techniques, and this is a good example of how text processing touches every important area of computer programming. We manipulate text for email, newsgroups, CGI programs, HTML and many other aspects of net programming. A good summary of XML programming is included, as well as useful synopses of other Python internet modules, from a text processing point of view.

Appendix A is the aforementioned selective and short review of Python basics. Appendix B is a ten page Data Compression primer that is quite educational. Appendix C offers the same good service for Unicode, and Appendix D covers the author's own software, a state machine for adding markup to text, which is backed up by his extensive web site that has a lot of free software to support those doing extensive text processing. Lastly, Appendix E is a Glossary for technical terms from the book. This is very much an educational book, and would be suitable for classroom work at the University level, beyond the introductory programming level; in fact, as part of a curriculum to teach programming using Python at the University level, this would be an excellent text for the second course.

One of the highlights of the book is that each chapter is concluded with a problem and discussion section. These are of the highest quality I have encountered in computer texts. Rather than overwhelming the reader with a large number of problems, the author has obviously given a lifetime of thought in coming up with a few key problems that are meant to stimulate thought, creativity, and ultimately understanding and growth in the reader. I will be coming back to the problems often, as they cannot be absorbed quickly anyway; they require thought. These would be most useful in a classroom environment; but as they are accompanied by excellent discussion material, and backed up by the author's web site, the individual reader will be well served also.

The book is more than the sum of its parts. It will be a most useful reference source for when I am doing various text related tasks for some time to come, and it was also a delightful and educational quick read in the here and now. It also amply illustrates the centrality of text processing in all areas of computer science, and I am confident that the book will be useful and educational for all programmers, whatever their area of expertise.

To sum it all up, this book is educational. It is also beautifully bound and printed, and excellently written. I rate it five stars, my highest rating, and heartily recommend its purchase.

You can purchase Text Processing in Python from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

18 of 215 comments

  1. The book in full by TheRoss · · Score: 5, Informative

    is here, as a series of text files. This is official.

  2. Another... by Pinguu · · Score: 4, Informative
    1. Re:Another... by Mister+Furious · · Score: 5, Informative

      yeah, this is a good book. also it's released under the GNU Free Documentation License and is available to download in various formats here.

  3. Re:Great Intro by damiangerous · · Score: 2, Informative

    Novice coders. You should either have some background in Python or have the fundamentals that allow you to treat languages as tools rather than being a "language X programmer."

  4. Re:What do you use python for? by Minam · · Score: 2, Informative

    First, a disclaimer: I haven't used Python for about a year and a half, and so may be out of touch with the most recent developments in the language. I am writing the following NOT to bash Python or to invite flames, merely to explain what I feel to be weaknesses of Python. If someone can counter them rationally, please do so.

    That said, I learned, wrote in, and loved Python for a few months. However, the whole whitespace issue eventually drove me away from Python; some people like it, I didn't.

    Second, I disliked how you had to explicitly pass "this" as a parameter to each method. It seemed very NON-object-oriented, and in a language that claimed to be OO, I found it to be a glaring discrepancy.

    Lastly, I found the regular expression handling in Python to be rather inconvenient. I much prefer the way Perl and Ruby do it (though there isn't much else I prefer about Perl).

    Other than those points, there were many things I liked about Python. Unfortunately, I can't remember enough of the language to say what they were, although it seems that operator overloading was one of them.

    ---------------

  5. Re:What do you use python for? by tuffy · · Score: 5, Informative
    What do Slashdotters use python for?

    I use it for data management, system administration chores and CGI programming.

    What are its strengths

    Python has a nice clean syntax that tends to re-use language constructs, which makes it easy to learn and read. It makes good use of objects and exceptions and it has a solid standard library of goodies. And, it has no shortage of additional modules to use. Plus, the whole of it is highly malleable.

    and its weaknesses?

    It's not the fastest language out there, some don't like its whitespace-based syntax and it doesn't have the breadth of pre-built modules as older languages like Perl have.

    Why is it worth learning another programming language?

    It is if you have problems to solve and don't particularly care for the tools you're using now.

    --

    Ita erat quando hic adveni.

  6. Python is the Lord by ultrabot · · Score: 5, Informative

    I'm not sure if this is maintaining legacy apps, but it certainly scared me!

    Python jobs are hardly for legacy app maintenance. More like rapid development of cutting edge stuff, prototyping, exploring, enterprise application integration... and Agile development in general. I introduced Python to my previous workplace, and after the guys there learned it, they didn't switch back (even though their chief python advocate/fascist, i.e. yours truly, left :-).

    Python can be used for very large problems (hundreds of modules, and many more classes), in addition to trivial scripts (0 functions). It is *fun* as hell. A Python programmer is always an architect; there is very little monkey-level "grunt work", which tends to form most of your day-to-day C++/Java programming.

    You really have no clue about OOP before you have tried one of the dynamic OOP languages: Python, Smalltalk, or Ruby. Smalltalk has fallen to a legacy role these days, while Ruby is much less mature and has a smaller community than Python. Additionally, Ruby is less "tasteful", in that it borrows more heavily from perl, but that is a matter of controversy ;-).

    Additionally, Python is an embodiment of Open Source, because the code is actually readable and concise enough to lower the barrier to reading it. In fact I have taken a look at the source code of several Open Source projects that use Python "just for kicks", while I hardly bother in the case of e.g. C programs. One line of Python is the equivalent of 10-20 lines of C++, so you can digest more with the typical geek attention span (i.e. borderline ADD ;-).

    --
    Save your wrists today - switch to Dvorak
  7. Re:What do you use python for? by mapMonkey · · Score: 3, Informative

    However, the whole whitespace issue eventually drove me away from Python; some people like it, I didn't.

    I think you are the first person I have ever heard to hold this POV. Most people I see seem to hate the whitespace at first, and then grow to love it.

    I disliked how you had to explicitly pass "this" as a parameter to each method.

    You don't. You have to explicitly indicate "this" (or "self" in Python) as an argument in the method definition, but you don't pass it as an argument -- Python passes it for you. That being said, Python now allows you to declare a method as a "classmethod", allowing you to call it without an instance; but you still have to have "self" in the method def. Only now, "self" is a handle on the class instead of on an instance.
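
    A minimal sketch of the point above, assuming a current Python 3 interpreter. The Greeter class and its method names are made up purely for illustration. The first argument of a classmethod can be called anything, but it is conventionally named "cls" rather than "self".

```python
class Greeter:
    """Hypothetical class used only to illustrate the implicit 'self'."""

    greeting = "hello"

    def greet(self, name):
        # 'self' is declared here, but the caller never passes it explicitly.
        return f"{self.greeting}, {name}"

    @classmethod
    def describe(cls):
        # For a classmethod the first argument is the class, not an instance.
        return f"{cls.__name__} says {cls.greeting}"


g = Greeter()
print(g.greet("world"))           # Python supplies g as 'self'
print(Greeter.greet(g, "world"))  # the explicit spelling of the same call
print(Greeter.describe())         # no instance needed
```

    The second print shows what Python does behind the scenes for the first one: the instance becomes the first argument of the function defined in the class.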

    I found the regular expression handling in Python to be rather inconvenient

    The regex module has been replaced by the re module. Regular expressions have changed quite a bit in recent releases. Not sure what your specific gripes are, but things may have changed for the better.

  8. Re:What do you use python for? by Metrol · · Score: 3, Informative

    I've recently started going through O'Reilly's "Learning Python" here myself. I'd spent a healthy bit of time trying to get C++ functionally working in my head, but I just couldn't get it. For someone who wants to code the logic and leave the nit picky stuff to someone else, Python seems to be a better approach.

    Mostly what got me going was an article in Linux Journal recently concerning wxWindows. Just the notion that I could code up a GUI application that is truly cross platform with Python and this windowing kit has got me focused on learning this language. I'm also rather interested in the fact that Python also binds in with KDE's API, as that's my preferred desktop.

    That is what all got me going. What I'm finding interesting as I learn this language is how it approaches various problems. Python is an interpreted language, but upon running a program the program is compiled into bytecode like with Java, except that the compile process is automatic. You can manually compile beforehand as well. Read a blurb in there about being able to convert a standing Python program to C, which then in turn can be compiled into a full executable. Haven't even begun to play with any of this stuff yet, but it is interesting.
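
    For the "manually compile beforehand" part, here is a small sketch using the standard library's py_compile module on a current Python 3. The compile_source helper and the hello.py file name are invented for this example; only py_compile.compile itself is the library call.

```python
import os
import py_compile
import tempfile


def compile_source(source_text):
    """Write source_text to a temporary .py file and byte-compile it.

    Returns the path of the bytecode file that py_compile produced.
    """
    workdir = tempfile.mkdtemp()
    source_path = os.path.join(workdir, "hello.py")
    with open(source_path, "w") as handle:
        handle.write(source_text)
    # doraise=True turns compile errors into py_compile.PyCompileError
    # instead of just printing them.
    return py_compile.compile(source_path, doraise=True)


bytecode_path = compile_source("print('hello')\n")
print(bytecode_path)
```

    Normally none of this is needed: importing a module compiles it and caches the bytecode automatically.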

    I'm personally impressed with the OOP approach that Python takes. I mostly code in PHP these days, and will most likely continue to do so for web stuff. Still, I never did much care for PHP's approach to OOP. C++'s approach just up and lost me. Python's approach seems to make a lot more sense, and even at this early stage of learning it I can see how I would utilize it in the kinds of stuff I'm looking to write.

    It has a module system similar to Perl's, and there's a LOT of them. Pretty much all the stuff I'm looking to do has some kind of module in play to help me along. I've only coded a little bit of Perl, but every time I did I really didn't care for the language. Too many esoteric symbols in place of where commands should be in play for my taste.

    I know that in every Slashdot thread concerning Python there needs to be at least one person bitching about code indenting as a part of the syntax. I personally love this. I imagine that anyone who has had to follow up behind someone who didn't indent code might just appreciate this. Python's indenting schema is pretty much exactly what I've been doing now in PHP and JavaScript for years now anyway. My eyes are still tuned in to looking for that closing brace that isn't there, but my brain is slowly starting to come around.

    At this early stage, about the only thing I'm finding a little confusing is how variables are handled. This is neither good nor bad at this point, just that there's enough concepts I hadn't really dealt with before that there's a learning curve I haven't yet gotten through. From what I can tell, there's an odd mix of C++ style variables that act more like pointers than the scalars that I'm used to working with in PHP.

    This far into it, I'm still having fun going through this beginner's book. Been playing around a bit with the wxPython tutorials, and getting lost in BoaConstructor. I'm still of the opinion that my time being invested here is being well spent. Seems like a pretty cool approach to getting an application slapped together.

    --
    The line must be drawn here. This far. No further.
  9. Re:I can think of one person... by Lulu+of+the+Lotus-Ea · · Score: 4, Informative

    Actually, although this remark lacks modesty, I wrote the book for myself, in a way. That is, whenever I want to remind MYSELF of a particular method in an odd little module I only use occasionally, I turn to my own explication of it. It reminds me of what I found the most important aspect when I investigated that particular feature during writing. So I benefit from having a copy too (or usually the e-copy that you can find on my website).

    Btw. I also have some author copies that I'd like to sell to US buyers who can pay by check. Basically, I get the most money if you do it that way. If that's not convenient, please buy it some other place... but if you want to drop me an email, so much the better.

    David Mertz
    http://gnosis.cx/TPiP/

  10. Re:What do you use python for? by Qbertino · · Score: 5, Informative

    What do Slashdotters use python for?

    Software Agents / Content Syndication 'bots
    Web/Internet Application Server (Zope)
    3D (me: Blender, ILM for Maya and others)

    I've used Python on various things, one of the more ambitious being, well, actually Text Processing :-). In the wider sense, that is. A Software Agent for scanning and retrieving certain information from different Inet Sources - a very serial process that's hard to 'objectivise'. Python did/does a great job at keeping things overseeable.

    Zope is the other area I use Python in. Zope I consider the most sophisticated Application Server available. It's GPLd of course :-) (www.zope.org)

    Just as with me, Python is very popular within the 3D Field. ILM use it as their prime scripting language and I like Blender's built-in Python controlled/based realtime engine.

    What are its strengths and its weaknesses?
    Surely its tab-based delimiting of blocks ('whitespace syntax') is a big feature. I can be sure to be able to read *any* code from anybody who did it in Python instantly. Think of how teamwork improves (especially in extreme programming) when bad indentation means your code is broken!
    Python is completely GPLd, which means a lot to me and to the overall future safety of a PL. That's why I don't feel so good about Java (although I like it too in a way)
    Python is very easy to learn. "Perl is executable line noise, Python is executable pseudocode" actually sums that one up.
    The only *weakness* that comes to mind is that it's a younger language. But it's catching up rapidly in terms of breadth and width of the 'lib' availability - also due to Python being completely GPLd!

    Why is it worth learning another programming language?
    It's actually one of the most modern and sophisticated. It realizes what developers theorized as ideal some 20 years ago.
    The obligatory famous quote:
    "We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger ones, so that devices like indentation, rather than delimiters, might become feasible for expressing local structure in the source language"
    - Donald E. Knuth, 1974

    Oh, and, yet again, it's GPLd all the way through. Want a better PL? Use Python.

    --
    We suffer more in our imagination than in reality. - Seneca
  11. Re:Why use Python? by daveaitel · · Score: 4, Informative
    Well, there are 2 major drawbacks to Python:
    1. No good free runtime debugger
    2. No CPAN

    But the major benefits are that you can, with basically NO Python training, sit down at a random Python program and extend it ten times faster than an expert in C could extend THEIR OWN program.
    It's a combination of a lot of things that makes Python great to use - some of these things Perl has as well, but most of these things are very Python specific - you'll see them as you learn it.

    I recommend Wing IDE, btw, for a commercial Python editor and runtime debugger at a reasonable price.

    For what it's worth, CANVAS (http://www.immunitysec.com/CANVAS/) is written entirely in Python, so I put my money where my mouth is.

    -dave

  12. Re:What do you use python for? by 4of12 · · Score: 4, Informative

    it doesn't have the breadth of pre-built modules as older languages like Perl have.

    Maybe not quite as many modules as Perl, but the standard Python library provides interfaces for a lot of different tasks. It's not skimpy, in case any of you potential Python users were worried.

    There's good reason the motto is "Batteries Included".

    I've found Python useful for all kinds of tasks and love the clean, short syntax devoid of punctuation characters.

    If you need more of a recognized authority to recommend how great and wonderful is Python, then listen to Bruce Eckel or Eric Raymond.

    --
    "Provided by the management for your protection."
  13. Re:What about trusty old C? by k8to · · Score: 2, Informative

    The primary difference between Python and C++ is quite simple. C++ is a low-productivity language. By comparison, Python is a very high-productivity language.

    By this I mean that per line, or per time, you're getting far more done in Python. Your programs are accomplished much more quickly, and you can move on to the next job.

    Like many high-productivity languages, Python is a nicer choice than a language like C++ except where it's inappropriate to use at all. Some examples include: an unusually high speed requirement, a machine-implementation oriented program, a requirement that the language meet some logistical job-world issue like available programmers or ISO spec or some such, a program that requires rigid typing in order to get close to reliability.

    The only common case there is the job-world issues.

    Note that it's quite possible to build a program in both C++ and Python using object hierarchies that span both languages.

    If nothing else, Python is excellent for prototyping C/C++ applications. Find the design errors rapidly, then implement without timewasting. I swear this is faster than writing in C/C++ the first time.

    --
    -josh
  14. Answer; Python at least by axxackall · · Score: 2, Informative
    With regard to text processing, is it in a different category altogether from Java/Javascript/PERL/MUMPS/REALbasic?

    For processing meaningless and arbitrary text (text without syntax or semantics, or with a very primitive syntax and still no semantics, or text you treat as an arbitrary set of strings regardless of any syntax or semantics), none of the imperative languages is good.

    If you want to work with text as a meaningful set of information, where both syntax and semantics should be taken into consideration and processed as well, then you need other languages. Haskell, ML and Lisp are the first that come to mind for semantic text processing. With some limits I can still include Python in the list of recommended languages for text processing, as it has some elements of functional programming, plus it's the most advanced scripting language among the imperative ones; besides, its OOP is good enough for the subject.

    Conclusion: if your mind is corrupted by imperative languages, then choose at least Python for text processing. But if your mind is still flexible, then choose Haskell or Lisp or ML.

    --

    Less is more !
  15. Re:What do you use python for? by msaavedra · · Score: 3, Informative
    ...also due to Python being completely GPLd!

    While I generally agree with your post, you gave this incorrect information several times. Python is not licensed under the GPL. It uses its own unique license that is more similar to the LGPL or BSD than the GPL.

    --
    "Any fool can make a rule, and any fool will mind it."
    --Henry David Thoreau
  16. Re:What do you use python for? by Troll_Kamikaze · · Score: 2, Informative

    At this early stage, about the only thing I'm finding a little confusing is how variables are handled. This is neither good nor bad at this point, just that there's enough concepts I hadn't really dealt with before that there's a learning curve I haven't yet gotten through. From what I can tell, there's an odd mix of C++ style variables that act more like pointers than the scalars that I'm used to working with in PHP.

    Python doesn't have variables in the traditional sense, only references to objects.
    x = 1
    causes x to refer to the immutable integer object 1, while
    x += 2
    would then create an immutable integer object with value 3 and cause x to refer to the new object. This is unlike C-style languages in that in C, x would be an integer-sized slot into which we're pushing different values; in Python, x is just a name that is subsequently used as a label for various integer objects.

    However, because the objects in the example above are immutable, it's easy to mistake x for a C-style "value slot" rather than a "reference". In Python, the code
    x = 1
    y = x
    x += 2
    will not change y to 3, because y initially refers to the immutable integer object 1, as does x.

    x += 2 does not change the object that y refers to (which is 1) into 3, it just causes x to refer to a different object (the immutable integer object 3).

    Contrast the discussion of immutable objects above with the following code, which works with a mutable object:
    x = [1,2]
    y = x
    x.append(3)

    In this case, the first statement creates a list (a *mutable* sequence) containing 1 and 2, then labels the new object with the name 'x'. The second line applies another label ('y') to the same mutable sequence object.

    Since lists are *mutable*, they can be changed. The third statement changes the single mutable sequence object that both x and y refer to, so both references reflect the change. However, if the third statement had been
    x = [10,11]
    , then it would not have changed the object that y refers to; instead, it would have created a new object and applied the label 'x' to it.
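
    The snippets above can be strung together into one runnable sketch. The names a, b, ints_after and shared are introduced here only so the two cases do not overwrite each other; they are not part of the original examples.

```python
# Rebinding a name that refers to an immutable object.
x = 1
y = x
x += 2           # x now refers to a different int object; y is untouched
ints_after = (x, y)

# Mutating an object that two names share.
a = [1, 2]
b = a
a.append(3)      # one list, two labels: both names see the change
shared = a is b  # 'is' checks whether both names label the same object

# Rebinding one name to a new list leaves the other label alone.
a = [10, 11]
print(ints_after, shared, a, b)
```

    After the last assignment, b still labels the original list, which now holds three items, while a labels a brand new one.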

    Think of it this way: If I have a domestic building that I refer to as 'my house' and you live in *the same building* with me, but you refer to it as 'mi casa', then 'my house' is affected if you burn the building that you call 'mi casa', because we're just applying different labels to the same object.

    If you build a new house for yourself and begin to refer to the new house as 'mi casa', then 'my house' (the house in which we previously lived together) would not be affected if you burned 'mi casa'.

  17. Re:Is Python PARTICULARLY good for text processing by Ian+Bicking · · Score: 2, Informative
    No, Python is not particularly good for text processing. Python is very much a general-purpose language, and there's no specific task for which Python was designed.

    Text processing is, after all, only the start of things. Eating and spitting out text gets kind of boring pretty quick (see Awk or XSLT). More often you'll want to do something with that text. You'll process it then present it, email it, perform actions based on it, etc.

    That said, Python is quite good for text processing. For instance, it doesn't have a regex literal, but it does have a special string literal which doesn't parse backslashes. So regexes don't stick out quite as nicely as in Perl, but they aren't painful to write like in PHP (how many backslashes do you need in your string when looking for a backslash?). Python has a few little touches that make it work well, even if there's nothing you can point to and say "that's for text processing."
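
    A quick sketch of the raw string point, using only the standard re module. The Windows-style path is just sample data chosen because it contains backslashes.

```python
import re

# A regular string needs every backslash doubled, so a regex that matches
# one literal backslash takes four backslashes in the source.
plain_pattern = "\\\\"
# A raw string leaves backslashes alone, so two are enough.
raw_pattern = r"\\"

path = "C:\\temp\\notes.txt"   # the text C:\temp\notes.txt

same = plain_pattern == raw_pattern
parts = re.split(raw_pattern, path)
print(same, parts)
```

    Both spellings produce the same two-character pattern; the raw form is simply easier to read and harder to get wrong.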

    Compared to Java, for instance, text processing in Python will be much easier and require much less code. But that holds true for any task. Compared to Perl, code written to do text processing in Python will be much more readable. Like any task. Python is just a good language.