Slashdot Mirror


Text Processing in Python

Ursus Maximus writes "If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book." Ursus Maximus's review continues below.

Text Processing in Python
author: David Mertz
pages: 520
publisher: Addison Wesley
rating: 10
reviewer: Ursus Maximus
ISBN: 0321112547
summary: How to use Python to process text.

As you probably know, there are many good introductory texts about Python. This is not one of them, for this is an advanced book, but not an inaccessible one. David Mertz has a unique style and focus that we have become familiar with from his series of articles on the IBM Developer Network. Dr. Mertz is more interested in facilitating our learning process than in lecturing us, and rather than fill his pages with impressive examples designed to illustrate his expertise, he gently guides us by offering subtle yet important examples of code and analysis that make us think for ourselves.

He has a special talent for programming in the functional style, and this is a great introduction to that style of Python programming. Thus, this is also a good guide to using the newer features introduced into Python in the last few revisions, which often facilitate the functional style of programming.

The text includes, in an appendix, a 40 page tutorial covering the basic Python language. This tutorial is, like the book, unique in its approach and is worthwhile even for experienced Pythonistas, as it sheds light on some of the underlying ideas behind the syntax and semantics, and it also illustrates the functional style of programming, which is sometimes quite useful when doing text processing. And, despite its many other virtues, this is a book about text processing.

Chapter 1 covers the Python basics, but with a particular eye towards those features most critical and useful for text processing. Chapter 2 covers the basic string operations as found in the string module and the newer built-in string functions. Chapter 3 is about regular expressions, and, although I am shy about regexes because of their relative complexity, I am very glad to have read this chapter and will no longer be intimidated when regexes are the correct approach to take! Chapter 4 is on parsers and state machines, which are important for processing nested text, as in everyday HTML, XML and the like. This chapter is not as esoteric as its title may sound to relative newbies (like myself), as it does offer useful ideas and principles for dealing with HTML. How much more useful can a topic be than that? It is true that a deep understanding of this subject may be beyond me and other relative duffers, but this chapter has much to offer those like me and, I am sure, much more to offer professionals.

Chapter 5 is on Internet tools and techniques, and this is a good example of how text processing touches every important area of computer programming. We manipulate text for email, newsgroups, CGI programs, HTML and many other aspects of net programming. A good summary of XML programming is included, as well as useful synopses of other Python internet modules, from a text processing point of view.

Appendix A is the aforementioned selective and short review of Python basics. Appendix B is a ten page Data Compression primer that is quite educational. Appendix C offers the same good service for Unicode, and Appendix D covers the author's own software, a state machine for adding markup to text, which is backed up by his extensive web site that has a lot of free software to support those doing extensive text processing. Lastly, Appendix E is a Glossary for technical terms from the book. This is very much an educational book, and would be suitable for classroom work at the University level, beyond the introductory programming level; in fact, as part of a curriculum to teach programming using Python at the University level, this would be an excellent text for the second course.

One of the highlights of the book is that each chapter is concluded with a problem and discussion section. These are of the highest quality I have encountered in computer texts. Rather than overwhelming the reader with a large number of problems, the author has obviously given a lifetime of thought in coming up with a few key problems that are meant to stimulate thought, creativity, and ultimately understanding and growth in the reader. I will be coming back to the problems often, as they cannot be absorbed quickly anyway; they require thought. These would be most useful in a classroom environment; but as they are accompanied by excellent discussion material, and backed up by the author's web site, the individual reader will be well served also.

The book is more than the sum of its parts. It will be a most useful reference source for when I am doing various text related tasks for some time to come, and it was also a delightful and educational quick read in the here and now. It also amply illustrates the centrality of text processing in all areas of computer science, and I am confident that the book will be useful and educational for all programmers, whatever their area of expertise.

To sum it all up, this book is educational. It is also beautifully bound and printed, and excellently written. I rate it five stars, my highest rating, and heartily recommend its purchase.

You can purchase Text Processing in Python from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

22 of 215 comments

  1. The book in full by TheRoss · · Score: 5, Informative

    is here, as a series of text files. This is official.

    1. Re:The book in full by jellomizer · · Score: 5, Funny

      Why do I have a sense of fear whenever I see a link that starts with g and ends with .cx

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
  2. You will benefit from this book.... by revery · · Score: 5, Funny

    "If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book."

    If you are a practitioner of voodoo and merely handle large pythons, you will benefit from this book.

    If you are an undersea explorer but have heard of pythons....

    --

    Was it the sheep climbing onto the altar, or the cattle lowing to be slain,
    or the Son of God hanging dead and bloodied on a cross that told me this was a world condemned, but loved and bought with blood.

  3. Slashbot book review? by rkz · · Score: 4, Interesting

    This one is a great addition to the bookshelf. I know how to do certain things in Python by using the docs, but this book clarifies nicely why you are actually doing them and provides better language-specific ways of doing things that might not occur to you. Also, it introduces nice Python concepts in a clear and easy way which scripters might not have come across before.

  4. Another... by Pinguu · · Score: 4, Informative
    1. Re:Another... by Mister+Furious · · Score: 5, Informative

      yeah, this is a good book. also it's released under the GNU Free Documentation License and is available to download in various formats here.

  5. benefits by tmark · · Score: 4, Funny

    If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book

    And if you're the website posting this glowing review, and collecting affiliate fees, you will also benefit from this book.

  6. Python Jobs by Line_Fault · · Score: 5, Interesting

    Strangely enough, there seem to be a lot of jobs, at least where I am, where the only major language requirement is Python.
    I'm not sure if this is maintaining legacy apps, but it certainly scared me!

  7. Re:What do you use python for? by Pedro_thewondermwnke · · Score: 4, Interesting

    What do Slashdotters use python for?

    "Fire & forget" scripts, i.e. quick one-off fixes to entries in databases; system monitors that check the computers on our network are OK; as a calculator ;) and as a tool to decode base64-encoded text. (I want to know that htaccess username & password ;)

    What are its strengths and its weaknesses?

    Quick to code something VERY powerful, but slow to execute.

    Why is it worth learning another programming language?

    It's not; you have already learned Python, it's just that you don't know you have!

    Just being curious and all that.
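The base64 use mentioned above can be sketched in a few lines with the standard library's base64 module. This is only an illustration: the function name and the sample credentials are made up, and it assumes the encoded text has the usual "user:password" shape.

```python
import base64


def decode_basic_credentials(encoded):
    """Decode a base64 'user:password' string into (user, password).

    Hypothetical helper for illustration; assumes UTF-8 text and a
    single ':' separating the user name from the password.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    # partition() splits on the first ':' only, so a password that
    # itself contains ':' is kept intact.
    user, _, password = decoded.partition(":")
    return user, password


if __name__ == "__main__":
    # Made-up sample value, encoded here so the example is self-contained.
    sample = base64.b64encode(b"alice:s3cret").decode("ascii")
    print(decode_basic_credentials(sample))
```

Note that base64 is an encoding, not encryption, which is why a one-off script like this is enough to read it back.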

  8. Re:What do you use python for? by tuffy · · Score: 5, Informative
    What do Slashdotters use python for?

    I use it for data management, system administration chores and CGI programming.

    What are its strengths

    Python has a nice clean syntax that tends to re-use language constructs, which makes it easy to learn and read. It makes good use of objects and exceptions and it has a solid standard library of goodies. And, it has no shortage of additional modules to use. Plus, the whole of it is highly malleable.

    and its weaknesses?

    It's not the fastest language out there, some don't like its whitespace-based syntax and it doesn't have the breadth of pre-built modules as older languages like Perl have.

    Why is it worth learning another programming language?

    It is if you have problems to solve and don't particularly care for the tools you're using now.

    --

    Ita erat quando hic adveni.

  9. Re:Great Intro by orthogonal · · Score: 4, Insightful

    You know, if someone goes to the trouble of reviewing a book, what's wrong with having an affiliate link to purchase the book?

    In all seriousness (unlike my original post), it's a conflict of interest: the reviewer who gets compensated when readers of the review purchase the book has a great incentive not to pan the book, even if it deserves panning, because a bad review means fewer buyers means less pay-off to the "affiliate" linker.

    "Affiliate" programs also drive up the cost of the books (or Rolexes), both because the affiliate must be paid off, and to cover the administrative costs of the affiliate program.

    It also means a slightly slower response time when I click the link, as the server, besides displaying the page, has to access a database to credit the affiliate -- and possibly track me all the way to purchase to see if the affiliate is to be compensated. In the case where compensation only comes on purchase, it means another layer of tracking, and probably a web site that wants to send me cookies to identify which affiliate should get paid if I do decide to purchase. Cookies, of course, lead to individualized customer profiles and possibly higher prices when and if the tracking software decides I'll be willing to pay more than the average Joe, based on that customer profile.

    So we have conflict of interest, slightly higher costs, and customer and referrer tracking. None of these things benefits me as a customer, and I prefer to avoid them.

  10. Text processing in Python by jdavidb · · Score: 4, Funny

    A good programmer can write Perl in any language. :)

    (Just kidding. ;) )

  11. Python is the Lord by ultrabot · · Score: 5, Informative

    I'm not sure if this is maintaining legacy apps, but it certainly scared me!

    Python jobs are hardly for legacy app maintenance. More like rapid development of cutting edge stuff, prototyping, exploring, enterprise application integration... and Agile development in general. I introduced Python to my previous workplace, and after the guys there learned it, they didn't switch back (even though their chief python advocate/fascist, i.e. yours truly, left :-).

    Python can be used for very large problems (hundreds of modules, and many more classes), in addition to trivial scripts (0 functions). It is *fun* as hell. A Python programmer is always an architect; there is very little monkey-level "grunt work", which tends to form most of your day-to-day C++/Java programming.

    You really have no clue about OOP before you have tried one of the dynamic OOP languages: Python, Smalltalk, or Ruby. Smalltalk has fallen to a legacy role these days, while Ruby is much less mature and has a smaller community than Python. Additionally, Ruby is less "tasteful", in that it borrows more heavily from perl, but that is a matter of controversy ;-).

    Additionally, Python is an embodiment of Open Source, because the code is actually readable and concise enough to lower the barrier to reading it. In fact I have taken a look at the source code of several Open Source projects that use Python "just for kicks", while I hardly bother in the case of e.g. C programs. One line of Python is the equivalent of 10-20 lines of C++, so you can digest more with the typical geek attention span (i.e. borderline ADD ;-).

    --
    Save your wrists today - switch to Dvorak
  12. Is Python PARTICULARLY good for text processing? by dpbsmith · · Score: 4, Interesting

    After a lightning-quick skim of the text at gnosis, I still don't quite get the point.

    SNOBOL was a mind-opener for me, because it really had a radically different approach to text processing. And it was genuinely useful. I haven't used it recently enough to know how I would feel about it today.

    Many languages now are more convenient for text processing than, say, C++ with STL or MFC. The traditional BASICs at least recognize strings as good citizens and make it easy to do the fundamental operations. MUMPS improves on BASIC incrementally, as do PERL, Java, Javascript, etc., mostly to the degree that their standard libraries provide a useful suite of string functions. More and more languages have a Regex feature (e.g. REALBasic) and this is a really nice thing to have.

    So, I just read the review, and, as I say, took a lightning-quick browse through the online text of the book, and neither of them bothers to tell me how Python fits in.

    Both of them seem to assume from the beginning that I have already decided that Python is the language I want to use.

    Is there anything about Python that renders it especially appropriate for text processing? With regard to text processing, is it in a different category altogether from Java/Javascript/PERL/MUMPS/REALbasic?

    Or is it just a good language with string primitives and a decent string library?

  13. Perl is executable line noise, ... by Qbertino · · Score: 4, Interesting

    ...Python is executable Pseudocode.

    I have had a stack of Perl books for something like 3 years and haven't gotten around to studying them thoroughly.
    Now that I've done some stuff in Python I actually think I never will. Everything that Perl can do Python can do better by now. Unless you're used to Unix CLI and syntax quirks, Python will get you farther in a shorter period of time - and you'll be able to read your code a year from now.
    Although the annual Perl obfuscation contest actually can be somewhat funny. :-)

    --
    We suffer more in our imagination than in reality. - Seneca
  14. GUI programming!! by Balinares · · Score: 4, Interesting

    Python and Qt are the killer combo. I once coded during a break, just for fun (and as an example for the management, alright), a complex widget that took our head VB programmer *three days*. Only the Python/Qt widget was dynamically resizable (the VB one wasn't) and could hold any subwidget (the VB one could only hold buttons).

    Now I use Python for a variety of tasks ranging from things just a little too complex to be cleanly done in Perl, to large things that usually belong in Java's sphere but are much faster coded in Python. But GUI programming is an area where it particularly shines.

    --

    -- B.
    This sig does in fact not have the property it claims not to have.
  15. Re:I can think of one person... by Lulu+of+the+Lotus-Ea · · Score: 4, Informative

    Actually, although this remark lacks modesty, I wrote the book for myself, in a way. That is, whenever I want to remind MYSELF of a particular method in an odd little module I only use occasionally, I turn to my own explication of it. It reminds me of what I found the most important aspect when I investigated that particular feature during writing. So I benefit from having a copy too (or usually the e-copy that you can find on my website).

    Btw. I also have some author copies that I'd like to sell to US buyers who can pay by check. Basically, I get the most money if you do it that way. If that's not convenient, please buy it some other place... but if you want to drop me an email, so much the better.

    David Mertz
    http://gnosis.cx/TPiP/

  16. Here's what I like by truthsearch · · Score: 4, Interesting

    I just started learning Python a few weeks ago, with my background being C++, Java, and Visual Basic. As a side note I have to point out that Python is an absolutely fantastic option for someone wanting to switch from VB to something more modern, useful, and platform independent.

    These are the benefits of Python (mostly over C++) I personally like:
    - It's a very forgiving language; i.e. you don't need to be overly concerned about string lengths or list bounds, no pointers and simple garbage collection
    - List notations built into the syntax are extremely handy for referring to portions of the list and making changes; far less code needed for working with lists
    - The OO parts are sufficient without being complex; everything is public; multiple inheritance
    - Modules are compiled as needed and the compiled version is used when available, so it's pretty quick
    - Lots of runtime information easily available
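A minimal sketch of the built-in list notation described in the list above, using made-up data and a hypothetical function name. One caveat on the "forgiving" point: slices that run past the end of a list simply come back shorter or empty, but indexing a single out-of-range element still raises IndexError.

```python
def demo_slices():
    """Show slice reads and a slice assignment on a small sample list."""
    letters = ["a", "b", "c", "d", "e"]

    middle = letters[1:3]      # elements at positions 1 and 2
    tail = letters[-2:]        # the last two elements
    past_end = letters[10:]    # out-of-range slice: empty list, no error

    # Slice assignment replaces a run of elements in place, and the
    # replacement does not have to be the same length as the run.
    letters[1:3] = ["X"]

    return middle, tail, past_end, letters


if __name__ == "__main__":
    for result in demo_slices():
        print(result)
```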

  17. Re:What do you use python for? by Qbertino · · Score: 5, Informative

    What do Slashdotters use python for?

    Software Agents / Content Syndication 'bots
    Web/Internet Application Server (Zope)
    3D (me: Blender, ILM for Maya and others)

    I've used Python on various things, one of the more ambitious being, well, actually Text Processing :-). In the wider sense of the term, that is. A Software Agent for scanning and retrieving certain information from different Inet Sources - a very serial process that's hard to 'objectivise'. Python did/does a great job at keeping things manageable.

    Zope is the other area I use Python in. I consider Zope the most sophisticated Application Server available. It's open source of course :-) (www.zope.org)

    Just as with me, Python is very popular within the 3D field. ILM use it as their prime scripting language and I like Blender's built-in Python-controlled realtime engine.

    What are its strengths and its weaknesses?
    Surely its indentation-based delimiting of blocks ('whitespace syntax') is a big feature. I can be sure of being able to read *any* Python code from anybody instantly. Think of how teamwork improves (especially in extreme programming) when bad indentation means your code is broken!
    Python is completely open source under a GPL-compatible license, which means a lot to me and to the overall future-safety of a PL. That's why I don't feel so good about Java (although I like it too in a way).
    Python is very easy to learn. "Perl is executable line noise, Python is executable pseudocode" actually sums that one up.
    The only *weakness* that comes to mind is that it's a younger language. But it's catching up rapidly in terms of the breadth of available libraries, also due to Python being completely open source!

    Why is it worth learning another programming language?
    It's actually one of the most modern and sophisticated. It realizes what developers theorized as ideal some 20 years ago.
    The obligatory famous quote:
    "We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger ones, so that devices like indentation, rather than delimiters, might become feasible for expressing local structure in the source language"
    - Donald E. Knuth, 1974

    Oh, and, yet again, it's open source all the way through. Want a better PL? Use Python.

    --
    We suffer more in our imagination than in reality. - Seneca
  18. Re:Why use Python? by daveaitel · · Score: 4, Informative
    Well, there are 2 major drawbacks to Python:
    1. No good free runtime debugger
    2. No CPAN

    But the major benefits are that you can, with basically NO Python training, sit down at a random Python program and extend it ten times faster than an expert in C could extend THEIR OWN program.
    It's a combination of a lot of things that makes Python great to use - some of these things Perl has as well, but most of these things are very Python specific - you'll see them as you learn it.

    I recommend Wing IDE, btw, for a commercial Python editor and runtime debugger at a reasonable price.

    For what it's worth, CANVAS (http://www.immunitysec.com/CANVAS/) is written entirely in Python, so I put my money where my mouth is.

    -dave

  19. Generally, you need some negative reviews... by Jerf · · Score: 4, Interesting

    Why are you so focused on negativity? With the nightly news pushing out stories left and right about what's wrong with the world, can't we at least keep our Slashdot book reviews a good positive example of what's right with the world?

    For a given reviewer, you need both positive and negative reviews so you can get a feel for what the reviewer is looking for, and how closely it matches what you are looking for. In something as subjective as books or video games, this is critical. This allows you to align your views with the reviewer.

    In this environment, where a different reviewer is reviewing each time, it's much less useful. Reviews are really only useful in the context of knowing something about the reviewer. (I just thought of this, and after I post this I intend to shut off reviews from my Slashdot feed, since they are uniformly useless to anybody seriously looking to use them due to this overwhelming flaw in the process.)

    In fact, the bad reviews are typically far more informative than the good ones. Most good reviews can be boiled down to "It's great!" with little loss of content, whereas the bad reviews have actual criticisms of the reviewed product. What you do then is read the criticisms and see if you might agree with them. If you're reading a video game review (which I use because it has great examples), and it says "Game X has far too many little numbers to keep track of for your characters", and you're old-skool and you like fiddly little numbers, then the negative review may actually boost your opinion. A lot of what appears in reviews is that sort of opinion; relatively little is concerned with universal things like "I couldn't get this game to run stably for more than 5 minutes on any of the four computers I tried it on here."

    For a book review, such negative comments really go a long way towards clarifying what the book is. "This book didn't give any examples on how to process XML" tells you more about the book's focus than "This book is great for anyone who programs and uses text!".

    The point of "The Power of Positive Thinking", IIRC, wasn't to be unremittingly positive in every way; that's actually counterproductive and can take you out of touch with the real world. In fact, IIRC, it can best be summarized as "Don't be negative; that's bad." ;-)

  20. Re:What do you use python for? by 4of12 · · Score: 4, Informative

    it doesn't have the breadth of pre-built modules as older languages like Perl have.

    Maybe not quite as many modules as Perl, but the standard Python library provides interfaces for a lot of different tasks. It's not skimpy, in case any of you potential Python users were worried.

    There's good reason the motto is "Batteries Included".

    I've found Python useful for all kinds of tasks and love the clean, short syntax devoid of punctuation characters.

    If you need more of a recognized authority to recommend how great and wonderful Python is, then listen to Bruce Eckel or Eric Raymond.

    --
    "Provided by the management for your protection."