Slashdot Mirror


Text Processing in Python

Ursus Maximus writes "If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book." Ursus Maximus's review continues below.

Text Processing in Python
author: David Mertz
pages: 520
publisher: Addison Wesley
rating: 10
reviewer: Ursus Maximus
ISBN: 0321112547
summary: How to use Python to process text.

As you probably know, there are many good introductory texts about Python. This is not one of them, for this is an advanced book, but not an inaccessible one. David Mertz has a unique style and focus that we have become familiar with from his series of articles on the IBM Developer Network. Dr. Mertz is more interested in facilitating our learning process than in lecturing us, and rather than fill his pages with impressive examples designed to illustrate his expertise, he gently guides us by offering subtle yet important examples of code and analysis that make us think for ourselves.

He has a special talent for programming in the functional style, and this is a great introduction to that style of Python programming. Thus, this is also a good guide to using the newer features introduced into Python in the last few revisions, which often facilitate the functional style of programming.

The text includes, in an appendix, a 40 page tutorial covering the basic Python language. This tutorial is, like the book, unique in its approach and is worthwhile even for experienced Pythonistas, as it sheds light on some of the underlying ideas behind the syntax and semantics, and it also illustrates the functional style of programming, which is sometimes quite useful when doing text processing. And, despite its many other virtues, this is a book about text processing.

Chapter 1 covers the Python basics, but with a particular eye towards those features most critical and useful for text processing. Chapter 2 covers the basic string operations as found in the string module and the newer built-in string functions. Chapter 3 is about Regular Expressions, and, although I am shy about regexes because of their relative complexity, I am very glad to have read this chapter and will no longer be intimidated when regexes are the correct approach to take! Chapter 4 is on Parsers and State Machines, which are important for processing nested text, as in everyday HTML, XML and the like. This chapter is not as esoteric as its title may sound to relative newbies (like myself), as it does offer useful ideas and principles for dealing with HTML. How much more useful can a topic be than that? It is true that a deep understanding of this subject may be beyond me and other relative duffers, but this chapter has much to offer those like me and I am sure much more to offer professionals.

Chapter 5 is on Internet tools and techniques, and this is a good example of how text processing touches every important area of computer programming. We manipulate text for email, newsgroups, CGI programs, HTML and many other aspects of net programming. A good summary of XML programming is included, as well as useful synopses of other Python internet modules, from a text processing point of view.

Appendix A is the aforementioned selective and short review of Python basics. Appendix B is a ten page Data Compression primer that is quite educational. Appendix C offers the same good service for Unicode, and Appendix D covers the author's own software, a state machine for adding markup to text, which is backed up by his extensive web site that has a lot of free software to support those doing extensive text processing. Lastly, Appendix E is a Glossary for technical terms from the book. This is very much an educational book, and would be suitable for classroom work at the University level, beyond the introductory programming level; in fact, as part of a curriculum to teach programming using Python at the University level, this would be an excellent text for the second course.

One of the highlights of the book is that each chapter is concluded with a problem and discussion section. These are of the highest quality I have encountered in computer texts. Rather than overwhelming the reader with a large number of problems, the author has obviously given a lifetime of thought in coming up with a few key problems that are meant to stimulate thought, creativity, and ultimately understanding and growth in the reader. I will be coming back to the problems often, as they cannot be absorbed quickly anyway; they require thought. These would be most useful in a classroom environment; but as they are accompanied by excellent discussion material, and backed up by the author's web site, the individual reader will be well served also.

The book is more than the sum of its parts. It will be a most useful reference source for when I am doing various text related tasks for some time to come, and it was also a delightful and educational quick read in the here and now. It also amply illustrates the centrality of text processing in all areas of computer science, and I am confident that the book will be useful and educational for all programmers, whatever their area of expertise.

To sum it all up, this book is educational. It is also beautifully bound and printed, and excellently written. I rate it five stars, my highest rating, and heartily recommend its purchase.

You can purchase Text Processing in Python from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

26 of 215 comments

  1. Great Intro by GoofyBoy · · Score: 3, Interesting


    Exactly who wouldn't benefit from reading this book?

    --
    The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
    1. Re:Great Intro by odaiwai · · Score: 2, Interesting

      You know, if someone goes to the trouble of reviewing a book, what's wrong with having an affiliate link to purchase the book?

      It doesn't cost you anything extra, and it might make the reviewer a few cents. This seems a reasonable return on the work involved in writing a review.

      One of the most searched items on my site is a picture of a Rolex. I want Rolex to have an affiliate program so I can get some of that hefty green goodness.

      dave

  2. Slashbot book review? by rkz · · Score: 4, Interesting

    This one is a great addition to the book shelf. I know how to do certain things in Python by using the docs, but this book clarifies nicely why you are actually doing it and provides better language-specific ways of doing things that might not occur to you. Also, it introduces nice Python concepts in a clear and easy way which scripters might not have come across before.

  3. What do you use python for? by ACK!! · · Score: 3, Interesting

    I have not really used the language much but I have used a few programs like Redhat config tools that are python driven.

    What do Slashdotters use python for?

    What are its strengths and its weaknesses?

    Why is it worth learning another programming language?

    Just being curious and all that.

    --
    ACK /ak/ interj. 2. [from the comic strip "Bloom County"] An exclamation of surprised disgust, esp. i
    1. Re:What do you use python for? by Pedro_thewondermwnke · · Score: 4, Interesting

      What do Slashdotters use python for?
      "Fire & Forget" scripts, i.e. quickly fixing entries in databases as one-offs. System monitors, checking that the computers on our network are ok. As a calculator ;) And as a tool to decode base64-encoded text. (I want to know that htaccess username & password ;)

      What are its strengths and its weaknesses?
      Quick to code something VERY powerful, but slow to execute.

      Why is it worth learning another programming language?
      It's not, you have already learned python, it's just that you don't know you have!

      Just being curious and all that.
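The base64 one-off mentioned in the comment above could look roughly like the sketch below. It assumes the encoded text is an HTTP Basic auth value of the form `user:password`; the function name and the sample credentials are made up for illustration.

```python
import base64


def decode_basic_credentials(encoded):
    """Decode a base64 'user:password' string into (user, password)."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Split on the first colon only; the password may itself contain colons.
    user, _, password = decoded.partition(":")
    return user, password


# Hypothetical sample value, encoded here so the example is self-contained.
sample = base64.b64encode(b"alice:s3cret").decode("ascii")
print(decode_basic_credentials(sample))
```

Base64 is an encoding, not encryption, which is why a three-line script is enough to read such a value back.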

    2. Re:What do you use python for? by Pheersome · · Score: 3, Interesting

      I use python for everything more complex than a couple lines of shell. ~/proj/assorted_hacks/ contains stuff like a parser for libpcap dumps of AIM sessions, a script that pulls quotes out of a quotefile, a script which (using a module I wrote to parse a certain flavor of XML document) pretty-prints my bookmark URLs... I've also written a converter from the contact list format of my IM client of choice to '.blt', and at work I've written a substantial amount of CGI and some moderately tricky security-related scripts.

      Strengths: Much, much nicer to look at than perl; think "executable pseudocode" as opposed to "executable line noise". Object oriented if you want it to be. Very easy to learn, at least for someone with my background. (It took me one workday to go through the tutorial and play enough to have a decent clue what was going on; at the time I had two semesters of undergrad CS classes under my belt.) Has a good deal of the "do what I mean" quality. Development is typically very fast.

      Weaknesses: The canonical python weakness is speed, or lack thereof. I don't notice. If you're coding up something performance-intensive, don't use python. Some people don't like the indentation-as-syntax thing.

      It's worth learning another language because it'll take you just a few hours, and it's really fun.

      --
      Better to light a candle than to curse the darkness.
    3. Re:What do you use python for? by jodonoghue · · Score: 2, Interesting

      What do I use Python for?
      Pretty much anything which doesn't require real-time performance (which means most things).

      To expand, I work in the Mobile Telecomms realm, so most end-user code is real-time embedded C which tends to be heavily optimised for both speed and size.

      Python is great for writing simulations, tools for processing logfiles, regression test suites (you do test, right!), and GUIs (which almost never need to have very high performance).

      Strengths:

      * I'm surprised that few people have mentioned that Python is much more expressive than C, C++ or Java - you simply get the job done in fewer lines of code, and the code is exceptionally easy to read.

      * There is a rich set of built-in data types, and good support for basic Object Orientation - it's not the most OO implementation out there, but it's more than good enough for most designs. The fact that lists and dictionaries are part of the language means you can concentrate on expressing the problem, rather than implementing yet another linked list class.

      * Very simple, regular syntax. BTW, I'm neither especially for nor against the whitespace thing. It does mean that most Python code is stylistically similar, so it's easy to read other people's code, but it is a pain if you use different editors (or differently configured editors) to work on modules written by others: if I edit code where spaces were used for indenting and I use tabs, the code can behave unexpectedly, because Python's idea of how wide a tab is (it advances to the next multiple of eight columns) may not match what the editor shows. This can occasionally be very annoying, and difficult to track down.

      * The dynamic typing and reflection are a joy to use - simple yet powerful.

      * The ability to use a functional programming style when appropriate, without enforcing it (I personally find using FP all the time a little too rigorous - I just don't always think recursively).

      * The library covers most things - and there are excellent, really easy to use hooks to GUI libraries (I mainly use WxWindows and some Qt).

      * Easy to call C/C++ modules using SWIG - in fact it's almost trivial, so you can prototype in Python and replace the speed bottlenecks with C or C++ code to get good system performance. The profiler is quite helpful in doing this.

      * Code is usually extremely portable between Linux, other Unices and Windows.
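The tab-versus-space trouble described in the strengths list above is easy to check for before it bites. The following is a minimal sketch, not a replacement for the standard library's `tabnanny` module; the function name `indent_styles` and the sample source are invented for the example.

```python
def indent_styles(source):
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    styles = {"spaces": [], "tabs": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if not indent or not stripped:
            continue  # skip unindented lines and whitespace-only lines
        if " " in indent and "\t" in indent:
            styles["mixed"].append(number)
        elif "\t" in indent:
            styles["tabs"].append(number)
        else:
            styles["spaces"].append(number)
    return styles


# Hypothetical source text edited with two differently configured editors.
sample = "if x:\n    a = 1\n\tb = 2\n \tc = 3\n"
print(indent_styles(sample))
```

A file that reports entries under more than one key is a candidate for the kind of surprise the comment describes.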

      Weaknesses:

      * I've yet to find a really good debugger. About the best I've found is the one in Boa Constructor, but it's some way behind using, say, DDD on C++ code.

      * Performance isn't the best for serious number crunching, but it's adequate for most things.

      * It's painful to package up finished code into a 'product'. If I use Python + WxWindows + WxPython to implement a GUI for a performance analysis tool, I'll need to deliver three installers along with my code. Fortunately my end-users for this stuff are also software engineers, so they generally get the install right!

      Why learn another programming language?

      This is more a philosophical question. You can do anything in any Turing complete programming language (that'll be all of them, then), if you must.

      However, different languages tend to engender different ways of thinking about problems, so by learning a new language, you learn new ways of thinking which can often help you in other languages you know.

      I try to learn a new language each year (so far: C, C++, Java, Shell scripting, Python, Erlang, SDL, Lisp, Perl and more assemblers than I care to remember). I've gained the most from learning C, Python and Erlang, as they each represent very different approaches to a problem.

      It's still C which earns my bread and butter - nothing else really comes close for hard real-time work - but some of the techniques I've found natural in Python have proven to translate surprisingly well into C - I'd probably not have thought of doing things that way if I hadn't learned Python.

      I'd recommend any programmer who works primarily in C or C++ to learn a scripting

  4. So how does Python compare to perl? by Cryofan · · Score: 2, Interesting

    for text processing? Does it have the same libraries? I know it is less complicated, or that is what I hear....

    --
    eat shiat and bark at the moon
  5. Python Jobs by Line_Fault · · Score: 5, Interesting

    Strangely enough, there seem to be a lot of jobs, at least where I am, where the only major language requirement is Python.
    I'm not sure if this is maintaining legacy apps, but it certainly scared me!

  6. It makes more sense to review good books. by hding · · Score: 3, Interesting

    Actually I think it's considerably less useful to review a bad book. Why? There are many, many times more books written than I will read. Therefore a bad review is most likely to warn me away from a book that I wasn't going to read anyway. And chances are (given the limited number of reviews) that no review will appear of a bad book that I planned to read.

    However, a good review may point me to a useful or interesting book that I would have otherwise overlooked.

    The obvious exception to this is when one can give a bad review to a book that is expected to have a very wide readership (and thus can warn many people away from a bad book), but how many technical books fall into this category?

  7. Re:PHP vs Python by Anonymous Coward · · Score: 1, Interesting
    in a high-load situation, the db is probably going to be the limiting factor.


    How many times have you seen slashdotted sites where the site is non-functional because of odbc or mysql errors?


    Unless you're using oracle or db2, php/python speed won't be a problem.

  8. Weakness/Strength: dynamic typing by DeadVulcan · · Score: 3, Interesting

    The type of object that an identifier points to cannot be declared; it's established at run-time. This is either a strength or a weakness depending on your philosophical leanings.

    It's a strength in that it makes prototyping very fast. If you want some function to operate on a class that it wasn't originally intended to operate on, then you just have to make the new class interface-compatible and jam it in there. No worrying about subclassing or prototypes or anything.

    It's a weakness for maintenance, because, when you're debugging this function, all you know is something has been passed in, and you're calling GetValue() on it. And cripes, you've got fifty six classes that have a GetValue() method! Which one is it getting? You have to run the program to find out.
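Both sides of this argument show up in a few lines. In the sketch below the two classes and the `report` function are hypothetical; only the `GetValue()` method name comes from the comment. The function accepts anything that has the method, and the only way to learn which class arrived is to ask at run time.

```python
class Thermometer:
    def GetValue(self):
        return 21.5


class Counter:
    def GetValue(self):
        return 42


def report(source):
    # No declared parameter type: any object with a GetValue() method works.
    # type(source).__name__ is a run-time lookup, which is the debugging
    # cost the comment complains about.
    return "%s -> %r" % (type(source).__name__, source.GetValue())


print(report(Thermometer()))
print(report(Counter()))
```

Making a new class "interface-compatible" here means nothing more than giving it a `GetValue()` method; no base class or declaration is involved.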

    If you're doing scripting, then dynamic typing can be a godsend. If you're doing larger scale development, it can be a pain in the butt, because all of your developers need to be very disciplined.

    In general, Python is almost too powerful for its own good. If you have any undisciplined or "cowboy" programmers on your team, Python gives them enough rope to hang themselves... and everyone else... and their managers.

    But I love it. Treat it with respect, and Python will work wonders for you.

    --
    Accountability on the heads of the powerful.
    Power in the hands of the accountable.
    1. Re:Weakness/Strength: dynamic typing by Waffle+Iron · · Score: 2, Interesting

      Has anyone ever done a study to find out if the time saved by not debugging dynamic type problems is greater than the time wasted by developers worrying about compiler typing rules? In my experience, dynamic type issues are somewhat rare in a language like Python, but when programming in a language like C++ it seems that a large fraction of your time can be consumed with trying to get the compiler happy with your type declarations. (Or structuring your code in an unnatural way to match someone else's type declarations. Or writing adapter layers between your type system and some library's type system.)

  9. Is Python PARTICULARLY good for text processing? by dpbsmith · · Score: 4, Interesting

    On taking a lightning-quick skim of the text at gnosis I still don't quite get the point.

    SNOBOL was a mind-opener for me, because it really had a radically different approach to text processing. And it was genuinely useful. I haven't used it recently enough to know how I would feel about it today.

    Many languages now are more convenient for text processing than, say, C++ with STL or MFC. The traditional BASICs at least recognize strings as good citizens and make it easy to do the fundamental operations. MUMPS improves on BASIC incrementally, as do PERL, Java, Javascript, etc., mostly to the degree that their standard libraries provide a useful suite of string functions. More and more languages have a Regex feature (e.g. REALBasic) and this is a really nice thing to have.

    So, I just read the review, and, as I say, took a lightning-quick browse through the online text of the book, and neither of them bothers to tell me how Python fits in.

    Both of them seem to assume from the beginning that I have already decided that Python is the language I want to use.

    Is there anything about Python that renders it especially appropriate for text processing? With regard to text processing, is it in a different category altogether from Java/Javascript/PERL/MUMPS/REALbasic?

    Or is it just a good language with string primitives and a decent string library?

  10. Perl is executable line noise, ... by Qbertino · · Score: 4, Interesting

    ...Python is executable Pseudocode.

    I have had a stack of Perl books for something like 3 years and haven't gotten around to studying them thoroughly.
    Now that I've done some stuff in Python I actually think I never will. Everything that Perl can do Python can do better by now. Unless you're used to Unix CLI and syntax quirks Python will get you farther in a shorter period of time - and you'll be able to read your code a year from now.
    Although the annual Perl obfuscation contest actually can be somewhat funny. :-)

    --
    We suffer more in our imagination than in reality. - Seneca
    1. Re:Perl is executable line noise, ... by Junks+Jerzey · · Score: 2, Interesting

      The "executable line noise" criticism has gotten to be a standard knee-jerk reaction, and as such it has lost all meaning.

      Perl has built-in syntax for various common tasks, such as regular expression matching and common file operations (Does this file exist? What is the size of this file?). This drives the purists crazy. But if you think about it, putting the syntax directly into the language has some benefits. You can check if a file exists with a single operator. In Python, you have to remember the name of the function *and* which module it is located in, then you have to import that module. This adds up to a lot of extra mental noise.

      Or consider regular expressions. In Perl you don't have to precompile regular expressions. The compiler can see that an expression doesn't contain variables and deal with it once up front. Or if you use a variable, you can give the "o" option to an expression, indicating "compile once." In Python, you have to manually compile all expressions and reference them by id, unless you don't mind the overhead of the expression being parsed every time it is used.
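For comparison, here is what the two Python styles referred to above look like side by side. This is a sketch with invented function names and sample lines. One caveat to the comment: as far as I know, the `re` module in current Python versions keeps an internal cache of recently used patterns, so the second form does not reparse the pattern on every call, although it still pays for a cache lookup.

```python
import re

# Compiled once at import time and referenced by name afterwards.
WORD = re.compile(r"\b[A-Za-z]+\b")


def count_words(lines):
    """Count alphabetic words using the precompiled pattern object."""
    return sum(len(WORD.findall(line)) for line in lines)


def count_words_uncompiled(lines):
    """Same count, passing the pattern string to the module-level function."""
    return sum(len(re.findall(r"\b[A-Za-z]+\b", line)) for line in lines)


lines = ["Perl is line noise", "Python is pseudocode"]
print(count_words(lines), count_words_uncompiled(lines))
```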

      To clarify, this is not a knock on Python. Python has many of its own advantages. But simple-minded Perl bashing makes me immediately think the poster is a newbie programmer, or at least a programmer who is not well-rounded.

    2. Re:Perl is executable line noise, ... by Simon · · Score: 2, Interesting
      But if you think about it, putting the syntax directly into the language has some benefits. You can check if a file exists with a single operator. In Python, you have to remember the name of the function *and* which module it is located in, then you have to import that module. This adds up to a lot of extra mental noise.

      Which is a small price to pay for readable code (and that's assuming that -f and -d, etc. are easier to remember than os.path.isfile() and os.path.isdir()). I can't believe that people still think that reducing keystrokes somehow equates to improved programmer effectiveness. It's readability that counts since code is read so much more often than it is written. Hell, even a non-python programmer can read "os.path.isfile()" and guess what it does. I can't say the same about Perl's -f, -d and -e.
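The functions named in this exchange can be tried with a short sketch like the following. The `describe` helper and the file names are made up for the example, and a temporary directory is used so that nothing on the real filesystem is assumed.

```python
import os
import tempfile


def describe(path):
    """Classify a path using the os.path tests discussed above."""
    if os.path.isfile(path):    # rough counterpart of Perl's -f
        return "file"
    if os.path.isdir(path):     # rough counterpart of Perl's -d
        return "directory"
    if os.path.exists(path):    # rough counterpart of Perl's -e
        return "other"
    return "missing"


with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "notes.txt")
    with open(file_path, "w") as handle:
        handle.write("hello\n")
    results = (
        describe(folder),
        describe(file_path),
        describe(os.path.join(folder, "nope")),
    )

print(results)
```

The single `import os` at the top is the whole cost of the "extra mental noise" that the parent comment objects to.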

      --
      Simon

  11. GUI programming!! by Balinares · · Score: 4, Interesting

    Python and Qt are the killer combo. I once coded during a break, just for fun (and as an example for the management, alright), a complex widget that took our head VB programmer *three days*. Only the Python/Qt widget was dynamically resizable (the VB one wasn't) and could hold any subwidget (the VB one could only hold buttons).

    Now I use Python for a variety of tasks ranging from things just a little too complex to be cleanly done in Perl, to large things that usually belong in Java's sphere but are much faster coded in Python. But GUI programming is an area where it particularly shines.

    --

    -- B.
    This sig does in fact not have the property it claims not to have.
  12. Here's what I like by truthsearch · · Score: 4, Interesting

    I just started learning Python a few weeks ago, with my background being C++, Java, and Visual Basic. As a side note I have to point out that Python is an absolutely fantastic option for someone wanting to switch from VB to something more modern, useful, and platform independent.

    These are the benefits of Python (mostly over C++) I personally like:
    - It's a very forgiving language; i.e. you don't need to be overly concerned about string lengths or list bounds, no pointers and simple garbage collection
    - List notations built into the syntax are extremely handy for referring to portions of the list and making changes; far less code needed for working with lists
    - The OO parts are sufficient without being complex; everything is public; multiple inheritance
    - Modules are compiled as needed and compiled version is used when available, so it's pretty quick
    - Lots of runtime information easily available
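The list notation praised in the points above is easiest to appreciate in a short example. The word list is invented; the slicing and list comprehension syntax is standard Python. Note that slices are forgiving about bounds, while indexing a single element past the end still raises `IndexError`.

```python
words = ["text", "processing", "in", "python", "is", "fun"]

first_two = words[:2]          # the first two items
last_two = words[-2:]          # the last two items
beyond_end = words[10:]        # an out-of-range slice is just an empty list

# Assigning to a slice replaces that portion of the list in place.
words[2:4] = ["with", "Python"]

# A list comprehension filters and transforms in one expression.
shouting = [w.upper() for w in words if len(w) > 2]

print(first_two, last_two, beyond_end)
print(words)
print(shouting)
```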

  13. Re:Python is the Lord by Tack · · Score: 2, Interesting
    Additionally, Python is an embodiment of Open Source, because the code is actually readable and concise enough to lower the barrier of reading it.

    At the risk of being redundant, I have to emphatically agree with this. A few years ago I started a project that required me to wrap a C library as a python module. (The project was ORBit-Python.) Having done a lot in perlXS before that, I was quite prepared to struggle with the Python/C API.

    But it wound up being truly a breath of fresh air. There are a few sticky points to get hung up on, like what functions return borrowed or new references, but the syntax is elegant and consistent, and the Python code itself is completely intuitive and a pleasure to read.

    Jason.

  14. Generally, you need some negative reviews... by Jerf · · Score: 4, Interesting

    Why are you so focused on negativity? With the nightly news pushing out stories left and right about what's wrong with the world, can't we at least keep our Slashdot book reviews a good positive example of what's right with the world?

    For a given reviewer, you need both positive and negative reviews so you can get a feel for what the reviewer is looking for, and how closely it matches what you are looking for. In something as subjective as books or video games, this is critical. This allows you to align your views with the reviewer.

    In this environment, where a different reviewer is reviewing each time, it's much less useful. Reviews are really only useful in the context of knowing something about the reviewer. (I just thought of this, and after I post this I intend to shut off reviews from my Slashdot feed, since they are uniformly useless to anybody seriously looking to use them due to this overwhelming flaw in the process.)

    In fact, the bad reviews are typically far more informative than the good ones. Most good reviews can be boiled down to "It's great!" with little loss of content, where the bad reviews have actual criticisms of the reviewed product. What you do then is read the criticisms and see if you might agree with them. If you're reading a video game review (which I use because it has great examples), and it says "Game X has far too many little numbers to keep track of for your characters", and you're old-skool and you like fiddly little numbers, then the negative review may actually boost your opinion. A lot of what appears in reviews is that sort of opinion; relatively little is concerned with universal things like "I couldn't get this game to run stably for more than 5 minutes on any of the four computers I tried it on here."

    For a book review, such negative comments really go a long way towards clarifying what the book is. "This book didn't give any examples on how to process XML" tells you more about the book's focus than "This book is great for anyone who programs and uses text!".

    The point of "The Power of Positive Thinking", IIRC, wasn't to be unremittingly positive in every way; that's actually counterproductive and can take you out of touch with the real world. In fact, IIRC, it can best be summarized as "Don't be negative; that's bad." ;-)

  15. Woe to XHTML by fm6 · · Score: 2, Interesting
    The GTP site naturally links to the Open Books Project site. Here things get sort of depressing. The HTML includes a reference to the XHTML DTD at w3.org. If you try to open this page with Internet Explorer, it tries to download and parse the DTD, with unfortunate results:
    Parameter entity must be defined before it is used. Error processing resource 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'. Line 85, Position 2
    IE behaves correctly if you give it an out-of-band indication that this is HTML (such as copying the text to a file with an .html extension). Netscape seems to ignore the DTD reference, even if you feed it the code in a file with an XML extension.

    This is frustrating. I'm beginning to be a fan of XHTML and CSS. The specifications are much better thought out than they used to be. There's even support for using XHTML for hard copy! But what's the point of creating content in these formats if it's inaccessible to 90% of web users?

  16. As the reviewer of this book, and many on my site by Ursus+Maximus · · Score: 2, Interesting

    I'd like you to know that I am *not* an affiliate of any company, and you cannot link to Amazon or anywhere else from my site giving me a commission. I do it for love, or fun, or whatnot, but the 35 or so book reviews on my site, and the rest of my site, do not earn any money anyway. www.awaretek.com/plf.html

  17. Re:Python is the Lord by King+Babar · · Score: 2, Interesting
    You really have no clue about OOP before you have tried one of the dynamic OOP languages: Python, Smalltalk, or Ruby. Smalltalk has fallen to a legacy role these days, while Ruby is much less mature and has a smaller community than Python. Additionally, Ruby is less "tasteful", in that it borrows more heavily from perl, but that is a matter of controversy ;-).

    I am not sure what you mean by Ruby being "less mature"; as a language in this niche, it appears (to me) to be among the most mature. It does have a smaller user community right now, and there are some library and documentation gaps, but nothing that could not be fixed.

    Back on the topic of text-wrangling, I should point out that Ruby is also *very* well-suited for this. So well-suited I'm not sure you'd ever want or need a big book about the subject. Do check out the Ruby Language Home Page.

    --

    Babar

  18. Python file processing by Ian+Bicking · · Score: 2, Interesting

    Just on the topic of file processing, the path module for Python is really cool. I'd like to see it become a part of the standard library, actually. I think it makes Python code much more on-par with Perl for that task (and I fully admit that Python's os.path functions are not very pretty).
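The third-party path module mentioned above is not shown here. As a stand-in, the sketch below uses `pathlib`, which was added to the standard library in later Python versions and offers a similar object-oriented style; treating it as comparable is my assumption, not something the comment states. The helper name and the file names are invented.

```python
import tempfile
from pathlib import Path


def text_files(folder):
    """Return the names of .txt files directly inside folder, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.txt"))


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    # The / operator joins path components, replacing os.path.join().
    (root / "a.txt").write_text("alpha\n")
    (root / "b.log").write_text("beta\n")
    (root / "c.txt").write_text("gamma\n")
    found = text_files(root)

print(found)
```

Compared with chains of `os.path` calls, the path object carries its own methods, which is roughly the readability gain the comment is after.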