Slashdot Mirror


Ask Slashdot: Best PDF Handling Library?

New submitter Fotis Georgatos (3006465) writes I recently engaged in a conversation about handling PDF texts for a range of needs, such as creation, manipulation, merging, text extraction and searching, digital signing, etc. A couple of potential picks popped up (PDFBox, iText), given some Java experience of the other fellows. And then comes the reality of choosing software as a long-term knowledge investment! Ideally, we would like to combine these features:
  • open source, with a community following; the kind of stuff Slashdotters would prefer
  • tidy software architecture; simple things should remain simple
  • an open API, allowing usage across many languages (say, Python and Java)
  • clear licensing status, not estranging future commercial use
  • serious multilingual & font support
  • PDF-handling rich features, not limiting usage for invoicing, e-commerce, reports & data mining
  • digital signing should not go against other features

I'd like to poll the collective Slashdot crowd wisdom about which PDF-related libraries, if any, people have written software with that keep them happy for *all* the above reasons. And if not happy on all counts, what do they think is the best bet for learning one piece of software in the area, with great reusability across different circumstances and little need for extra hacks? I'd really like to hear the smoked-out war stories. It is easy to obtain a list of such libraries, yet tricky to understand whether people have obtained success with them!

35 of 132 comments

  1. ReportLab by Anonymous Coward · · Score: 4, Informative

    Python only, but I've used it successfully.

  2. PDFLib by Anonymous Coward · · Score: 2, Insightful

    Well, there's PDFLib, which is hideously expensive and not open source, but if you're after a professional package for serious purposes that just works I can only recommend it.

    1. Re:PDFLib by Anonymous Coward · · Score: 4, Informative

      PDFlib is cheap compared to licensing Adobe's libraries from DataLogics (speaking as one who switched from the latter to the former). A full source license for pdflib and tetlib was much less than the Adobe/DataLogics non-source license: less than 1 FTE. Then again, your mileage may vary.

      PDFLib happens to be the cleanest and best PDF code solution I've ever worked with.

    2. Re:PDFLib by DJ+Jones · · Score: 2

      TCPDF: open-source PDF reader built in PHP
      FPDF: combine with TCPDF above to create a PDF writer using PHP
      Setasign: not open source, but this company offers both free and paid libraries that combine with the libraries above to allow PDF encryption/decryption using PHP. The paid versions support more complex ciphers and I swear by them personally.


      Not sure if you meant desktop software or...

    3. Re:PDFLib by Skuld-Chan · · Score: 2

      You know what's funny? PDFLib is what Adobe calls the set of core tech libraries to generate PDF files inside their own apps. (source: used to work at Adobe - on Acrobat no less)

  3. Re: Why? by Anonymous Coward · · Score: 2, Insightful

    To be fair the OP does say "ideally."

  4. pdf.js by AntiTuX · · Score: 5, Informative

    pdf.js is great for parsing and manipulating pdfs.

    I can't go into great detail as to how I've used it (still under NDA), but its rendering and manipulating of PDFs is pretty darn good.

    As for converting office formats to pdf, your best bet is to use office automation. It can be built to scale up, but it needs a lot of work to do so.

    1. Re:pdf.js by PhrostyMcByte · · Score: 2

      Office Automation is problematic -- because it literally opens up a hidden window of your Office app and simulates clicking around the UI to do what you need, if something unexpected happens it can unhide the window to show the user a message. This might be good enough for a desktop app, but if you're running it on a server it'll just freeze up your process with no one there to click it.

      For Office->PDF conversion of word docs, Aspose.Words has a fairly easy API and generally very accurate rendering. I highly recommend it.

    2. Re:pdf.js by Auction_God · · Score: 3, Informative

      No...it does not simulate clicking. It uses the underlying COM representation to perform its functions. That said, it does not work well in a multi-threaded environment, nor where you can't set up a user (e.g. restricted web server credentials). So you either have to impersonate or use COM+ configuration to run the Office tool under a different user name. So if you're just starting out...do not use Office automation in a server environment unless you're willing to deal with these issues. Try Aspose as suggested instead.

    3. Re:pdf.js by Jaime2 · · Score: 2

      I wouldn't recommend Office Automation on a server if there is any alternative. For beginners, there are too many gotchas, and for advanced users, there are plenty of alternatives that will do what you want without too much difficulty. Office with .Net is especially problematic because the COM components run as out-of-process servers and, due to .Net's garbage collection and COM interoperability, they are difficult to get to shut down properly.

  5. I've found these tools useful by jab · · Score: 5, Informative

    I've found these tools useful, with an honorable mention to gnupdf. I've never used it personally, but the code looks pretty solid. That said, when I really needed to produce great multilingual PDF I pulled out the PDF spec, gritted my teeth, and generated it directly.

    leptonica - turn images into PDF
    tesseract - turn images into searchable PDF
    qpdf - linearize PDF for random access over HTTP
    jhove - basic validation
    jhove-pdf-a - validation with better compatibility guarantees
    pdftk - command line tool for splicing pages together or apart
    ttx/FontTools - tool for modifying custom fonts
    reportlab - python library, easy to use but works best with Latin scripts
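
    For a sense of what "generated it directly" involves, here is a minimal sketch in Python that writes a one-page PDF by hand: five objects, a cross-reference table of byte offsets, and a trailer. The function name build_minimal_pdf, the page size, and the font choice are illustrative assumptions, not anything from the comment above. It handles ASCII text in a built-in font only; the multilingual case the comment describes needs embedded fonts and encodings, which this does not attempt.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF showing `text` in Helvetica.

    Hypothetical helper for illustration; ASCII text only.
    """
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        # Record where each object starts; the xref table needs these offsets.
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello from the PDF spec"))
```

    The fiddly part is the bookkeeping: every offset in the xref table has to match the byte position of its object exactly, which is why the file is assembled as bytes rather than text.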

    1. Re:I've found these tools useful by Anonymous Coward · · Score: 3, Interesting

      sudo apt-get install wkhtmltopdf
      wkhtmltopdf www.google.com google.com.pdf

    2. Re:I've found these tools useful by Qzukk · · Score: 2

      I have no idea if it supports data: URIs but I've used HTMLDOC to turn html tables into PDF (since every PDF library I've ever used is absolutely shit at tables compared to HTML). It supports inline styles and <style type="text/css"> tags. It's not quite dead, but this year's update was the first since 2006.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    3. Re:I've found these tools useful by kbg · · Score: 3, Interesting

      Yes, the Flying Saucer Java library. It is one of the best XHTML-to-PDF conversion tools.

  6. Prince XML by Eponymous+Coward · · Score: 2

    The best PDF software I've ever used is Prince XML.

    For years, we got by with HTMLDoc but finally dumped it because we absolutely needed Unicode support.

    After trying many different packages, we settled on Prince. Our main constraints were performance related which you apparently aren't worried about, so maybe it's overkill for what you need.

  7. "Slashdot Crowd Wisdom" ! by RobotRunAmok · · Score: 5, Funny

    See, now thanks to you I have to clean all this coffee off my monitor...

    1. Re:"Slashdot Crowd Wisdom" ! by jones_supa · · Score: 3, Funny

      I'm sure one of the requirements would be an open source napkin.

    2. Re:"Slashdot Crowd Wisdom" ! by funwithBSD · · Score: 2

      I suggest librag.

      --
      Never answer an anonymous letter. - Yogi Berra
  8. I'm convinced there is no elegant PDF library by PhrostyMcByte · · Score: 4, Informative

    At least on the C# side of things, the three libraries I've used (iTextSharp, PdfSharp, and Aspose.Pdf) are all a bit of an unintuitive mess with inconsistencies all over the place and very little documentation. In the case of iText, their revenue stream is putting all their documentation into a book for people to buy, so it's not uncommon to get an intentionally vague response when asking for help.

    I cycle between each depending on what I need to do, because they all have their own quirks and supported features. I've even piped from one to another to get certain parts of the process working.

    Good luck.

    1. Re:I'm convinced there is no elegant PDF library by codemachine · · Score: 2

      It could be that iText is just what he needs though. iTextSharp is the C# port of the original iText Java library. At times, it is easier to find code examples for iText than iTextSharp. Since the iTextSharp folks did their best to use C# conventions, the Java call names aren't always the same as the C# ones.

  9. try... by Anonymous Coward · · Score: 4, Informative

    sudo apt-get install ghostscript pdftk poppler-utils

    ghostscript: /usr/bin/dvipdf
    ghostscript: /usr/bin/pdf2dsc
    ghostscript: /usr/bin/pdf2ps
    ghostscript: /usr/bin/pdfopt
    ghostscript: /usr/bin/ps2pdf
    pdftk: /usr/bin/pdftk
    poppler-utils: /usr/bin/pdffonts
    poppler-utils: /usr/bin/pdfimages
    poppler-utils: /usr/bin/pdfinfo
    poppler-utils: /usr/bin/pdfseparate
    poppler-utils: /usr/bin/pdftocairo
    poppler-utils: /usr/bin/pdftohtml
    poppler-utils: /usr/bin/pdftoppm
    poppler-utils: /usr/bin/pdftops
    poppler-utils: /usr/bin/pdftotext
    poppler-utils: /usr/bin/pdfunite
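
    If you want to drive these tools from a program rather than a terminal, a thin wrapper is usually enough. The sketch below is in Python and assumes pdftotext and pdfunite from poppler-utils are on the PATH; the helper names are made up for illustration. It relies only on the basic invocations "pdftotext [-layout] in.pdf -" (text to standard output) and "pdfunite in1.pdf in2.pdf out.pdf".

```python
import subprocess


def pdftotext_command(pdf, keep_layout=False):
    """Build the argv for pdftotext, writing the text to standard output."""
    argv = ["pdftotext"]
    if keep_layout:
        argv.append("-layout")  # try to preserve the physical page layout
    argv += [str(pdf), "-"]     # "-" as the output file means stdout
    return argv


def pdfunite_command(inputs, output):
    """Build the argv for pdfunite: all inputs first, the output file last."""
    inputs = [str(path) for path in inputs]
    if not inputs:
        raise ValueError("pdfunite needs at least one input file")
    return ["pdfunite", *inputs, str(output)]


def extract_text(pdf, keep_layout=False):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    result = subprocess.run(
        pdftotext_command(pdf, keep_layout), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def merge(inputs, output):
    """Run pdfunite to concatenate `inputs` into `output`."""
    subprocess.run(pdfunite_command(inputs, output), check=True)
```

    Passing an argument list instead of a shell string means file names with spaces or quotes need no escaping, and check=True turns a non-zero exit status into an exception.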

  10. Re:Alternative to iText by Daniel+Hoffmann · · Score: 2

    iText has some problems with licensing for commercial applications and government projects. I am looking for an alternative.

  11. Re:Why? by kelemvor4 · · Score: 2

    open source, with a community following ;

    Why? So if you find one that fits every other requirement but this one you will refuse to use it?

    Derp,

    Probably because if there is no community following it there is not going to be much in the way of development going on.

  12. Re:to the free as in "all you can steal" crowd by PopeRatzo · · Score: 2

    I use a cordless drill and brush bit to make my PDF files, too. It's slow, and it doesn't really scale well, but at least it's not from Adobe.

    --
    You are welcome on my lawn.
  13. Re:Why? by Wycliffe · · Score: 2

    Indeed. I'm curious why is not "closed source, with a strong industry support" an option?

    Because "open source" and "strong industry support", when put together like that, pretty much mean that they don't want to get stuck holding the bag if the company goes out of business. With "strong industry support" the odds of a company going out of business are minimized, and with "open source", even if it does go out of business, you can still continue to use the software indefinitely while you look for a replacement.

  14. One word: PDFLib by Qbertino · · Score: 5, Informative

    PDFLib GmbH (a German LLC) builds exactly one product: PDFLib. And they've been doing that since 1997. AFAIK the company was run by one guy - the initial developer - alone for most of the time. Now it's probably a shop of 5 or so.

    So it's not FOSS - yeah, that's a real shame. But the devs get to eat, you can demand service and response if you run into a bug and you can expect a good product and with PDFLib you're probably going to get it too.

    I haven't come across a single project doing non-trivial PDF stuff that doesn't use PDFLib. I've used it myself a little, and the cookbook that comes with the product was very good, so it comes recommended.

    My 2 cents.

    --
    We suffer more in our imagination than in reality. - Seneca
  15. BFO by the_mice · · Score: 2

    Java-only. Commercial. Feature-rich. Great company name.

  16. Yeah! Why would anyone want it maintained? by Anonymous Coward · · Score: 5, Insightful

    open source, with a community following

    Why?

    Because maybe it's not his first project? Fine, let me ask you: how many times did you get burned by totally unmaintainable third-party dependencies, before you vowed "NEVER AGAIN will I get so utterly fucked over?"

    Was your fifth project the one where you couldn't ever port to a new architecture or OS, or was it the one where the only company who had the source went into bankruptcy and it took years for the liquidation to happen and you never really figured out where the assets are? No wait, your fifth project was the one where they just withdrew it from the market for "strategic reasons" and you never found out why and there was no replacement. Ah, then there was the race condition that you knew you could find if only you could read through the code, but the sole developer didn't even know what "race condition" means so he ignored your bug report. And the time the DRM server incorrectly said the API key had expired so you didn't get any sales that day. Then there was that time you had the source but weren't allowed to change some parts of it: I loved the comment "by reading this you are violating the License Agreement" followed by the base64 string of dynamically interpreted code. Of course you violated the agreement, and decoded it: finding a bug you weren't allowed to fix. And of course let's not forget the time the developer might have actually hypothetically allowed the code to be maintained or might have even done it himself, but he had lost it, the one and only copy in the entire world, which had been used to compile the code that literally tens of thousands of people were depending on. That one's a classic, almost right up there with the vendor who died, taking all his customers' hopes of maintenance with him to the grave.

    Holy crap. I get why the public doesn't know to demand Free Software. Even smart people can be uninformed or lack expertise outside their areas. But developers, really? You have to be LITERALLY STUPID to not see "open source" as at least a major advantage, if not necessarily always the winner. Maybe it's not always a solid requirement, but if you don't always at least start your searches that way and try to get something that at least can be maintained, then yes, you're a moron.

    "Oh no, I'm not a moron," you explain, "I just happen to think that some large projects aren't ever going to need maintenance, because surely it's simple enough that a good programmer will get everything right the first time." You're right: you're not a moron; you're an imbecile. Sorry about the mistake.

    1. Re:Yeah! Why would anyone want it maintained? by Dragon+Bait · · Score: 2

      Because maybe it's not his first project? Fine, let me ask you: how many times did you get burned by totally unmaintainable third-party dependencies, before you vowed "NEVER AGAIN will I get so utterly fucked over?"

      This. Wish I hadn't run out of mod points -- and frankly I'm tired of some bottom-of-the-barrel programmer whose attitude is "we can just rewrite everything every 5 years" getting promoted into management and then tying our code to whatever proprietary crap the next cute sales person brings.

      Separate. Isolate. Defend. Treat every piece of third-party code that you don't have source for as an enemy whose only goal is to financially rape you. I don't care if that enemy goes by Oracle, Microsoft, or Joe's Discount Software.

  17. Re:Why? by Rob+Y. · · Score: 3, Interesting

    I'm using a non-free, but source-provided library called Clib-PDF. It's a pretty nice library with a pretty easy API, and even has PHP bindings (so it must've been a viable mainstream choice at one point). But somehow the company (or was it just a single guy) disappeared years ago. Luckily, we paid for and got the source, and I've been able to keep using it (and even fixing things in the source) without any ongoing support. So not quite open source, but not quite the disaster of discontinued closed source.

    I suspect that the author of this library sold it to one of the commercial companies who proceeded to shut down a viable competitor. But who knows...

    --
    Posted from my Android phone. Oh, I can change this? There, that's better...
  18. Re:Why? by Chrisq · · Score: 2

    Some people have principles.

    He could be working on an open source project

  19. WeasyPrint by t4k1s · · Score: 2

    I started with ReportLab (the open source parts), found it too low level, so I considered using the commercial edition because it has a templating language. As I was not very fond of investing time in learning yet another templating language, I reconsidered and gave HTML with CSS a try for printing. I used wkhtmltopdf for a while but switched to WeasyPrint in the end: it was created for using HTML with CSS for printing, and seemed to be more actively developed than wkhtmltopdf.

  20. Re:Why? by NJRoadfan · · Score: 2

    Adobe pushes PDF as a method of data collection. People make fancy PDF forms and e-mail them out. Inside of the form is a button that says "when complete, click here to submit form" which attaches the filled out form to an e-mail and sends it back to the publisher. From there, folks somehow extract the fields from the file and dump it into a database, which seems like a messy and complicated process. Honestly a web form would be easier to implement in many cases.

  21. FOP? by sgrover · · Score: 2

    Not sure how current it is, but when I was looking for the same a few years back all that was really available for PHP was HTML->PDF libraries, which were not sufficient for anything but the most basic forms. A decent invoice form was hard to get right with these tools. Then I came across FOP. Or more specifically XML-FOP. Combine that with a little XSL and the output was amazing, and could do more than the HTML converters. The only problem is that the FOP tool was a Java-based program, so PHP would need to execute a shell command to call it. With tight control of what info was passed to that shell command, it seemed an appropriate trade-off for the job at hand. You can still get FOP in the Ubuntu repos - apt-get install fop. The learning curve for FOP is a little steep to begin, but no more than any other XML dialect. And being XML, you have a lot of options in building the required FOP file. I opted to put my data into my own XML file, then utilize an XSL file to convert it if/when needed. More details here: http://xmlgraphics.apache.org/...
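
    The workflow described above (own data as XML, an XSL stylesheet, and a tightly controlled call to the fop command) can be sketched roughly as follows. The comment used PHP; this sketch uses Python, and the element names, helper names, and file names are all hypothetical. It assumes the fop launcher is on the PATH and uses its "-xml", "-xsl" and "-pdf" options.

```python
import subprocess
import xml.etree.ElementTree as ET


def invoice_xml(number, lines):
    """Serialize invoice data to a small XML document (made-up schema)."""
    root = ET.Element("invoice", number=str(number))
    for description, amount in lines:
        item = ET.SubElement(root, "line", amount=f"{amount:.2f}")
        item.text = description  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


def fop_command(xml_path, xsl_path, pdf_path):
    """Build the argv for: fop -xml data.xml -xsl style.xsl -pdf out.pdf"""
    paths = [str(xml_path), str(xsl_path), str(pdf_path)]
    for path in paths:
        # Refuse anything that fop could mistake for an option.
        if not path or path.startswith("-"):
            raise ValueError(f"suspicious path: {path!r}")
    return ["fop", "-xml", paths[0], "-xsl", paths[1], "-pdf", paths[2]]


def render_pdf(xml_path, xsl_path, pdf_path):
    """Run FOP without a shell, so no argument is ever shell-interpreted."""
    subprocess.run(fop_command(xml_path, xsl_path, pdf_path), check=True)
```

    Keeping user-supplied data inside the XML file, and passing only file paths you chose yourself on the command line, is one way to get the "tight control" mentioned above.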

  22. Re:Why? by kefalonia · · Score: 2

    OP here. Somebody is listening ;-) Nope. It's not how you described it. But thanks to your nice language I need not explain it to you either :-P