Text Processing in Python
As you probably know, there are many good introductory texts about Python. This is not one of them: it is an advanced book, but not an inaccessible one. David Mertz has a unique style and focus that we have become familiar with from his series of articles on IBM developerWorks. Dr. Mertz is more interested in facilitating our learning process than in lecturing us, and rather than fill his pages with impressive examples designed to illustrate his expertise, he gently guides us by offering subtle yet important examples of code and analysis that make us think for ourselves.
He has a special talent for programming in the functional style, and this is a great introduction to that style of Python programming. Thus, this is also a good guide to using the newer features introduced into Python in the last few revisions, which often facilitate the functional style of programming.
The text includes, in an appendix, a 40-page tutorial covering the basic Python language. This tutorial is, like the book, unique in its approach and is worthwhile even for experienced Pythonistas, as it sheds light on some of the underlying ideas behind the syntax and semantics. It also illustrates the functional style of programming, which is sometimes quite useful when doing text processing. And, despite its many other virtues, this is a book about text processing.
Chapter 1 covers the Python basics, but with a particular eye towards those features most critical and useful for text processing. Chapter 2 covers the basic string operations as found in the string module and the newer built-in string functions. Chapter 3 is about regular expressions, and, although I am shy about regexes because of their relative complexity, I am very glad to have read this chapter and will no longer be intimidated when regexes are the correct approach to take! Chapter 4 is on parsers and state machines, which are important for processing nested text, as in everyday HTML, XML and the like. This chapter is not as esoteric as its title may sound to relative newbies (like myself), as it does offer useful ideas and principles for dealing with HTML. How much more useful can a topic be than that? It is true that a deep understanding of this subject may be beyond me and other relative duffers, but this chapter has much to offer those like me and, I am sure, much more to offer professionals.
Chapter 5 is on Internet tools and techniques, and this is a good example of how text processing touches every important area of computer programming. We manipulate text for email, newsgroups, CGI programs, HTML and many other aspects of net programming. A good summary of XML programming is included, as well as useful synopses of other Python internet modules, from a text processing point of view.
Appendix A is the aforementioned selective and short review of Python basics. Appendix B is a ten-page data compression primer that is quite educational. Appendix C offers the same good service for Unicode, and Appendix D covers the author's own software, a state machine for adding markup to text, which is backed up by his extensive web site that has a lot of free software to support those doing extensive text processing. Lastly, Appendix E is a glossary of technical terms from the book. This is very much an educational book, and would be suitable for classroom work at the university level, beyond the introductory programming level; in fact, as part of a curriculum to teach programming using Python at the university level, this would be an excellent text for the second course.
One of the highlights of the book is that each chapter is concluded with a problem and discussion section. These are of the highest quality I have encountered in computer texts. Rather than overwhelming the reader with a large number of problems, the author has obviously given a lifetime of thought in coming up with a few key problems that are meant to stimulate thought, creativity, and ultimately understanding and growth in the reader. I will be coming back to the problems often, as they cannot be absorbed quickly anyway; they require thought. These would be most useful in a classroom environment; but as they are accompanied by excellent discussion material, and backed up by the author's web site, the individual reader will be well served also.
The book is more than the sum of its parts. It will be a most useful reference source for when I am doing various text related tasks for some time to come, and it was also a delightful and educational quick read in the here and now. It also amply illustrates the centrality of text processing in all areas of computer science, and I am confident that the book will be useful and educational for all programmers, whatever their area of expertise.
To sum it all up, this book is educational. It is also beautifully bound and printed, and excellently written. I rate it five stars, my highest rating, and heartily recommend its purchase.
Exactly who wouldn't benefit from reading this book?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
is here, as a series of text files. This is official.
Ah, I see you reviewed the book that goes BING!
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
The other question is: why use C# or Python for text processing when there is Perl?
"If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book."
If you are a practitioner of voodoo and merely handle large pythons, you will benefit from this book.
If you are an undersea explorer but have heard of pythons....
--
Was it the sheep climbing onto the altar, or the cattle lowing to be slain,
or the Son of God hanging dead and bloodied on a cross that told me this was a world condemned, but loved and bought with blood.
C#? What are you talking about? I'm hard-core Visual Basic 6, my friend. Text process that!
Everyday You see me is the worst day of my life -Office Space
This one is a great addition to the bookshelf. I know how to do certain things in Python by using the docs, but this book clarifies nicely why you are actually doing them and provides better language-specific ways of doing things that might not occur to you. Also, it introduces nice Python concepts in a clear and easy way which scripters might not have come across before.
There is no god
good book
--
Language war! Language war!
I better play the trump card and end this now: Choose the *real* red pill
Java is the blue pill
Choose the red pill
You wouldn't. Unless you live in your parents' basement and can't afford a real development platform.
"BSD: Free as in speech. Linux: Free as in beer. Windows 10: Free as in herpes." --Man On Pink Corner in #52607549.
Maybe it would be useful to review some BAD books. First, it would steer people away from them. Second, it would provide good examples of where a lot of tech writing goes wrong. Finally, it's just fun to read someone bash the sh!t out of something.
Why? Because then one would have to program in and maintain Perl code.
Ita erat quando hic adveni.
"If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book."
And if you're the website posting this glowing review, and collecting affiliate fees, you will also benefit from this book.
I have not really used the language much but I have used a few programs like Redhat config tools that are python driven.
What do Slashdotters use python for?
What are its strengths and its weaknesses?
Why is it worth learning another programming language?
Just being curious and all that.
ACK
for text processing? Does it have the same libraries? I know it is less complicated, or that is what I hear....
eat shiat and bark at the moon
Strangely enough, there seem to be a lot of jobs, at least where I am, where the only major language requirement is Python.
I'm not sure if this is maintaining legacy apps, but it certainly scared me!
I think you missed the word "competent." And "programmer." HTH.
Isn't python 'just another language'? What 'space' does it fill, what purpose does it serve?
For all the articles I've read about Python, it seems that it was written primarily as a replacement for Perl - i.e, more readable, usable, etc.
But, seeing as Perl has been around for so long and has tons of support, online documentation, and available code already written, why would I use Python?
If I'm coding a web-based application, I'll use PHP; if I'm coding Linux/Unix scripts of any degree of complexity, I'll use Perl; if I'm coding a GUI-based app, or server-side application where PHP can't do the job, I'll use Java.
So, where the heck does Python come into play? Does the tech world really need another friggin language that emulates Perl's functionality?
Does anyone know the relative speed of Python vs PHP, in a "manipulation of information in databases under high load", type situation?
Editors: please... if there's not much going on, please don't post not-worthy front page material to shove the decent stories down and out of sight.
QUANTITY is not better than QUALITY.
"If you have read an introductory book or two about Python programming, but you are far from being an expert, then you will benefit a lot from reading this book. If you are a competent programmer in any other language, you will benefit from this book. If you are an expert Python programmer, you will also benefit from this book."
= No matter what, you will benefit from this book.
Do I hear a "best thing since sliced bread" coming?
Must-not-watch TV!
Am i one of the only people that actually DOES use VB6 on a regular basis??! (and, although i do live with my parents and can't afford REAL development tools, i'm only 17 so i think it is reasonable!)
Actually I think it's considerably less useful to review a bad book. Why? There are many, many times more books written than I will read. Therefore a bad review is most likely to warn me away from a book that I wasn't going to read anyway. And chances are (given the limited number of reviews) that no review will appear of a bad book that I planned to read.
However, a good review may point me to a useful or interesting book that I would have otherwise overlooked.
The obvious exception to this is when one can give a bad review to a book that is expected to have a very wide readership (and thus can warn many people away from a bad book), but how many technical books fall into this category?
A good programmer can write Perl in any language. :)
(Just kidding. ;) )
Secession is the right of all sentient beings.
I'm not sure if this is maintaining legacy apps, but it certainly scared me!
:-).
;-).
;-).
Python jobs are hardly for legacy app maintenance. More like rapid development of cutting edge stuff, prototyping, exploring, enterprise application integration... and Agile development in general. I introduced Python to my previous workplace, and after the guys there learned it, they didn't switch back (even though their chief Python advocate/fascist, i.e. yours truly, left).
Python can be used for very large problems (hundreds of modules, and many more classes), in addition to trivial scripts (0 functions). It is *fun* as hell. A Python programmer is always an architect; there is very little monkey-level "grunt work", which tends to form most of your day-to-day C++/Java programming.
You really have no clue about OOP before you have tried one of the dynamic OOP languages: Python, Smalltalk, or Ruby. Smalltalk has fallen to a legacy role these days, while Ruby is much less mature and has a smaller community than Python. Additionally, Ruby is less "tasteful", in that it borrows more heavily from Perl, but that is a matter of controversy.
Additionally, Python is an embodiment of Open Source, because the code is actually readable and concise enough to lower the barrier of reading it. In fact I have taken a look at the source code of several Open Source projects that use Python "just for kicks", while I hardly bother in the case of e.g. C programs. One line of Python is the equivalent of 10-20 lines of C++, so you can digest more with the typical geek attention span (i.e. borderline ADD).
Save your wrists today - switch to Dvorak
You sir are my hero
Not only have you managed to get a positive karma post down to an easy formula but you manage to harvest karma with the SAME post time and time again. I often wonder if you get modded up and I get modded down simply because people don't want to believe they have been fooled so many times. I think the realization would damage their ego.
I've been curious about learning Python for awhile now. But, seriously, what is the great advantage of using Python vs. C++? All I really even know about it is that it is object oriented, just like C++, but that you have to be very particular about your whitespace.
;)
Not sure how significant one could take this to be, but over at meetup.com, the C/C++ group looks to be a dying breed while relatively many are flocking to the Python meetings. Oh well. At least the D&D meeting is still going strong.
Quod scripsi, scripsi.
Always with the negativity.
Woof, woof, woof.
It seems from this review that this book can do everything including curing the common cold. But how does the book taste? Can I eat it?
Hello? Is anybody there? Can the reviewer be bothered to say anything at all about the actual subject of the book?
"Text processing" could mean ANYTHING AT ALL. Consider the humble Turing machine...
That's "Mr. Soulless Automaton" to you, Bub.
Rack him, he's out.
The type of object that an identifier points to cannot be declared; it's established at run-time. This is either a strength or a weakness depending on your philosophical leanings.
It's a strength in that it makes prototyping very fast. If you want some function to operate on a class that it wasn't originally intended to operate on, then you just have to make the new class interface-compatible and jam it in there. No worrying about subclassing or prototypes or anything.
It's a weakness for maintenance, because, when you're debugging this function, all you know is something has been passed in, and you're calling GetValue() on it. And cripes, you've got fifty six classes that have a GetValue() method! Which one is it getting? You have to run the program to find out.
If you're doing scripting, then dynamic typing can be a godsend. If you're doing larger scale development, it can be a pain in the butt, because all of your developers need to be very disciplined.
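The trade-off described above can be sketched in a few lines. This is only an illustration: the class names `Sensor` and `Account` and the function `report` are hypothetical, and only the `GetValue()` method name comes from the comment itself.

```python
# Hypothetical classes: neither inherits from the other, but both
# expose a GetValue() method, which is all the function below needs.
class Sensor:
    def GetValue(self):
        return 21.5


class Account:
    def GetValue(self):
        return 100


def report(thing):
    # No declared parameter type: any object with GetValue() is accepted.
    # Which class is actually passed in is only known at run time.
    return "value=%s (from %s)" % (thing.GetValue(), type(thing).__name__)


print(report(Sensor()))
print(report(Account()))
```

Reading `report` on its own tells you nothing about which `GetValue()` will run, which is the maintenance worry; on the other hand, a new class needs only that one method to be usable here, which is the prototyping benefit.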
In general, Python is almost too powerful for its own good. If you have any undisciplined or "cowboy" programmers on your team, Python gives them enough rope to hang themselves... and everyone else... and their managers.
But I love it. Treat it with respect, and Python will work wonders for you.
Accountability on the heads of the powerful.
Power in the hands of the accountable.
After a lightning-quick skim of the text at gnosis, I still don't quite get the point.
SNOBOL was a mind-opener for me, because it really had a radically different approach to text processing. And it was genuinely useful. I haven't used it recently enough to know how I would feel about it today.
Many languages now are more convenient for text processing than, say, C++ with STL or MFC. The traditional BASICs at least recognize strings as good citizens and make it easy to do the fundamental operations. MUMPS improves on BASIC incrementally, as do PERL, Java, Javascript, etc., mostly to the degree that their standard libraries provide a useful suite of string functions. More and more languages have a Regex feature (e.g. REALBasic) and this is a really nice thing to have.
So, I just read the review, and, as I say, took a lightning-quick browse through the online text of the book, and neither of them bothers to tell me how Python fits in.
Both of them seem to assume from the beginning that I have already decided that Python is the language I want to use.
Is there anything about Python that renders it especially appropriate for text processing? With regard to text processing, is it in a different category altogether from Java/Javascript/PERL/MUMPS/REALbasic?
Or is it just a good language with string primitives and a decent string library?
"How to Do Nothing," kids activities, back in print!
...Python is executable Pseudocode.
:-)
I have had a stack of Perl books for something like 3 years and haven't gotten around to studying them thoroughly.
Now that I've done some stuff in Python I actually think I never will. Everything that Perl can do, Python can do better by now. Unless you're used to the Unix CLI and syntax quirks, Python will get you farther in a shorter period of time, and you'll be able to read your code a year from now.
Although the annual Perl obfuscation contest actually can be somewhat funny.
We suffer more in our imagination than in reality. - Seneca
Python and Qt are the killer combo. I once coded during a break, just for fun (and as an example for the management, alright), a complex widget that took our head VB programmer *three days*. Only the Python/Qt widget was dynamically resizable (the VB one wasn't) and could hold any subwidget (the VB one could only hold buttons).
Now I use Python for a variety of tasks ranging from things just a little too complex to be cleanly done in Perl, to large things that usually belong in Java's sphere but are much faster coded in Python. But GUI programming is an area where it particularly shines.
-- B.
This sig does in fact not have the property it claims not to have.
Great reference
Too bad the slashbots won't get it.
Am i one of the only people that actually DOES use VB6 on a regular basis??! (and, although i do live with my parents and can't afford REAL development tools, i'm only 17 so i think it is reasonable!)
:P
You are in luck, my young friend. You can afford Python because it's free, and you are young enough that any brain damage caused by VB will be entirely reversible!
You need to learn the Slashdot Book Rating System.
Anything above a "9" is a good book.
A "9" is an average book. Read it only if you are particularly interested in the subject.
Anything below a "9" is a bad book. Avoid like the plague.
"Dr. Mertz is more interested in facilitating our learning process ..."
What the hell does that mean?
I just started learning Python a few weeks ago, with my background being C++, Java, and Visual Basic. As a side note I have to point out that Python is an absolutely fantastic option for someone wanting to switch from VB to something more modern, useful, and platform independent.
These are the benefits of Python (mostly over C++) I personally like:
- It's a very forgiving language; i.e. you don't need to be overly concerned about string lengths or list bounds, no pointers and simple garbage collection
- List notations built into the syntax are extremely handy for referring to portions of the list and making changes; far less code needed for working with lists
- The OO parts are sufficient without being complex; everything is public; multiple inheritance
- Modules are compiled as needed and compiled version is used when available, so it's pretty quick
- Lots of runtime information easily available
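As a small sketch of the list notation mentioned in the list above (the values are made up for illustration):

```python
nums = [10, 20, 30, 40, 50]

first_two = nums[:2]       # slice from the start
last_two = nums[-2:]       # negative indices count from the end
every_other = nums[::2]    # a step of 2 skips every second item

# Slice assignment replaces a portion of the list in place,
# and the replacement may be a different length.
nums[1:3] = ["a", "b", "c"]

# Slices are forgiving about bounds: an out-of-range slice is just empty.
# (Plain indexing such as nums[100] would still raise IndexError.)
beyond = nums[100:]

print(first_two, last_two, every_other)
print(nums, beyond)
```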
Developers: We can use your help.
Why are you so focused on negativity? With the nightly news pushing out stories left and right about what's wrong with the world, can't we at least keep our Slashdot book reviews a good positive example of what's right with the world?
;-)
For a given reviewer, you need both positive and negative reviews so you can get a feel for what the reviewer is looking for, and how closely it matches what you are looking for. In something as subjective as books or video games, this is critical. This allows you to align your views with the reviewer.
In this environment, where a different reviewer is reviewing each time, it's much less useful. Reviews are really only useful in the context of knowing something about the reviewer. (I just thought of this, and after I post this I intend to shut off reviews from my Slashdot feed, since they are uniformly useless to anybody seriously looking to use them due to this overwhelming flaw in the process.)
In fact, the bad reviews are typically far more informative than the good ones. Most good reviews can be boiled down to "It's great!" with little loss of content, where the bad reviews have actual criticisms of the reviewed product. What you do then is read the criticisms and see if you might agree with them. If you're reading a video game review (which I use because it has great examples), and it says "Game X has far too many little numbers to keep track of for your characters", and you're old-skool and you like fiddly little numbers, then the negative review may actually boost your opinion. A lot of what appears in reviews is that sort of opinion; relatively little is concerned with universal things like "I couldn't get this game to run stably for more than 5 minutes on any of the four computers I tried it on here."
For a book review, such negative comments really go a long way towards clarifying what the book is. "This book didn't give any examples on how to process XML" tells you more about the book's focus than "This book is great for anyone who programs and uses text!".
The point of "The Power of Positive Thinking", IIRC, wasn't to be unremittingly positive in every way; that's actually counterproductive and can take you out of touch with the real world. In fact, IIRC, it can best be summarized as "Don't be negative; that's bad."
This happened a couple years ago. This is no longer a reason to prefer Perl.
I haven't succumbed to Ruby for the same reason most Java-heads haven't succumbed to Python yet. I am not a Java-head because I like my programming languages free as in liberty.
microsoftword.mp3 - it doesn't care that they're not words...
darn filters!
There is nothing Perl cannot do!! Nothing!!
For processing meaningless, arbitrary text (text with no syntax or semantics, text with only a very primitive syntax and still no semantics, or text treated as an arbitrary set of strings regardless of any syntax or semantics), none of the imperative languages is good.
If you want to work with text as a meaningful set of information, where both syntax and semantics should be taken into consideration and processed as well, then you need other languages. Haskell, ML, and Lisp are the first that come to mind for semantic text processing. With some limits I can still include Python in the list of recommended languages for text processing, as it has some elements of functional programming, plus it's the most advanced scripting language among the imperative ones; besides, its OOP is good enough for the subject.
Conclusion: if your mind is corrupted by imperative languages, then choose at least Python for text processing. But if your mind is still flexible, then choose Haskell or Lisp or ML.
Less is more !
The book in question has a highly positive review because it is really good. I do not understand the talk about the reviewer getting more money for a good review: the book is excellent and far from the average computer book in its quality.
Roman.
Frost Pist
Preface: Why Python? I have no idea. ...
Introduction: After reading the Preface, you'll come to the same conclusion as I have.
Chapter 1: use perl
Chapter 2: use perl
Chapter 3: use perl
Chapter 4: use perl
Conclusion: use perl, if you must use python, write your code in perl, and exec it from python.
Does anyone know what solutions exist for quick-and-dirty embedding of Python inside HTML, à la embPerl?
Tweet, tweet.
wasting so much time posting to slashdot - but worrying about your karma too much is just pathetic.
I'd like you to know that I am *not* an affiliate of any company, and you cannot link to Amazon or anywhere else from my site in a way that gives me a commission. I do it for love, or fun, or whatnot, but the 35 or so book reviews on my site, and the rest of my site, do not earn any money anyway. www.awaretek.com/plf.html
This book, though, is good and shows one of Python's strengths. It is just a pity that Python didn't have such a library.
Gee, um, perhaps the point of the book is to show python programmers how to get the best results when processing text?
I mean, just an idea picked up from the title of the book. I don't see why you want to read it as "Why you should use python for all your text processing".
From Python.org: Python is named after the BBC show 'Monty Python's Flying Circus' and has nothing to do with nasty reptiles. Making references to Monty Python skits in documentation is not only allowed, it is encouraged! Also, many of Google's engineers use Python, and I hear they are constantly looking for more people with skills in this language. From Python.org
What's in a sig?
You are not an affiliate. Slashdot is. I'll give them this, slashdot wears the conflict of interest on its sleeve, as they've stated since they began doing reviews how most reviews were going to be glowing because of the affiliation.
One might imagine that a little integrity would spur more buying of books that were well-reviewed, because the review would mean something, but apparently for now it's worth just getting mentioned on slashdot.
Slashdot used you.
I've finally had it: until slashdot gets article moderation, I am not coming back.
Both are languages which NEED advocacy in order for them to be widely used.
Languages like c, c++, and perl are widely used because they're cool, crude, and powerful, not because they have some huge advocacy group to promote them.
This must be one of the reasons why it is bad for IE to be ignoring mime types while mime types are simultaneously the same reason why some people say they don't like non-IE browsers.
It is bad for IE to ignore mime types. But that's not the problem here. The problem seems to be that IE is trying to validate the markup against its DTD. Which would be a silly thing to do even if it didn't make the page undisplayable.
Text processing is, after all, only the start of things. Eating and spitting out text gets kind of boring pretty quick (see Awk or XSLT). More often you'll want to do something with that text. You'll process it then present it, email it, perform actions based on it, etc.
That said, Python is quite good for text processing. For instance, it doesn't have a regex literal, but it does have a special string literal which doesn't parse backslashes. So regexes don't stick out quite as nicely as in Perl, but they aren't painful to write like in PHP (how many backslashes do you need in your string when looking for a backslash?). Python has a few little touches that make it work well, even if there's nothing you can point to and say "that's for text processing."
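A minimal sketch of the raw string literal mentioned above, using the standard `re` module. The sample path is made up for illustration.

```python
import re

# Sample text containing literal backslashes (itself written as a raw string).
text = r"C:\temp\notes.txt"

# Raw string: the regex for one literal backslash is written with two.
raw_pattern = r"\\"
# Ordinary string: each of those must be doubled again, giving four.
plain_pattern = "\\\\"

# Both spellings produce the same two-character pattern.
same = raw_pattern == plain_pattern

parts = re.split(raw_pattern, text)
print(same, parts)
```

The raw form keeps the pattern looking like the regex you would write elsewhere, which is the "little touch" the comment is pointing at.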
Compared to Java, for instance, text processing in Python will be much easier and require much less code. But that holds true for any task. Compared to Perl, code written to do text processing in Python will be much more readable. Like any task. Python is just a good language.
Just on the topic of file processing, the path module for Python is really cool. I'd like to see it become a part of the standard library, actually. I think it makes Python code much more on-par with Perl for that task (and I fully admit that Python's os.path functions are not very pretty).
I've learned a bit about Python from some of my students who like it. Funny thing is, having worked on AI, I cannot help but feel that Python is Lisp reinvented, but *slow*. Lisp got a lot of bad fame from misinformed people who thought it was interpreted (whereas all modern Lisps have very good compilers, and compile on the fly), that predefined types were essential (as opposed to built-in polymorphism even without OO) and that garbage collection was slow. Things are a-changing, and Python is helping that. Search for Norvig on the web and you'll find Python is about *ten* times slower than any modern Lisp, over a wide range of sophisticated AI applications.
Nowadays I program mostly in C, because it was faster than Lisp (by a factor of about 2) and also because I needed manual memory management for the high performance required by the combinatorial (i.e. exponential complexity) problems I usually deal with. Now that I really need object orientation, I'm struggling with C++ (it really sucks for *lots* of things which are extremely simple in Lisp, e.g. closures or many of the uses of the STL) and I'm reconsidering going back to Lisp, since I cannot afford the performance hit of Python.
The great plus for Python as opposed to Lisp is, I think (as I said, I'm not really a Python programmer), the libraries for web programming and text processing, and the ease of programming GUIs (or so I'm told). Perhaps that might change for Lisp, if the various implementations would address the problem. I started using Lisp again when I found Perl totally unreadable for things like hash tables of hash tables (which I needed for something as simple as parsing some experimental logs). It is amazing how a few well-defined utility functions available on the net (e.g. split and the like) can make Lisp enormously productive, and it *is* fast.
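For comparison, here is a rough Python sketch of the "hash tables of hash tables" log-parsing task mentioned above. The log format (`<run> <metric> <value>` per line) and the sample data are assumptions made purely for illustration; the comment does not describe the actual logs.

```python
from collections import defaultdict

# Hypothetical log lines in an assumed "<run> <metric> <value>" format.
log_lines = [
    "run1 accuracy 0.91",
    "run1 loss 0.30",
    "run2 accuracy 0.88",
]

# Outer dict maps run -> inner dict, which maps metric -> value.
results = defaultdict(dict)
for line in log_lines:
    run, metric, value = line.split()   # split on whitespace
    results[run][metric] = float(value)

print(results["run1"]["loss"])
print(sorted(results))
```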
My recommendation: by all means do learn Python; it will teach you new and very productive ways to program and think about problems. Then, if you need performance while keeping the same style of programming, do take a serious look at Lisp.
Rocks. Muchly rocks.
Government of the people, by corporate executives, for corporate profits.