Ask Slashdot: Best PDF Handling Library?
New submitter Fotis Georgatos (3006465) writes: I recently engaged in a conversation about handling PDF texts for a range of needs, such as creation, manipulation, merging, text extraction and searching, digital signing, etc. A couple of potential picks popped up (PDFBox, iText), given some Java experience of the other fellows. And then comes the reality of choosing software as a long-term knowledge investment! Ideally, we would like to combine these features:
- open source, with a community following; the kind of stuff Slashdotters would prefer
- tidy software architecture; simple things should remain simple
- an open API allowing usage across many languages (say: Python & Java)
- clear licensing status, not estranging future commercial use
- serious multilingual & font support
- PDF-handling rich features, not limiting usage for invoicing, e-commerce, reports & data mining
- digital signing should not go against other features
I'd like to poll the collective Slashdot crowd wisdom about which PDF-related libraries, if any, people have written software with and that keep them happy for *all* the above reasons. And if none satisfies them all, what do they think is the best bet for learning one piece of software in the area, with great reusability across different circumstances and little need for extra hacks? I'd really like to hear the smoked-out war stories. It is easy to obtain a list of such libraries, yet tricky to understand whether people have had success with them!
open source, with a community following ;
Why?
So if you find one that fits every other requirement but this one you will refuse to use it?
Python only, but I've used it successfully.
Well, there's PDFLib, which is hideously expensive and not open source, but if you're after a professional package for serious purposes that just works I can only recommend it.
pdf.js is great for parsing and manipulating pdfs.
I can't go into great detail as to how I've used it (still under NDA), but its rendering and manipulation of PDFs is pretty darn good.
As for converting office formats to pdf, your best bet is to use office automation. It can be built to scale up, but it needs a lot of work to do so.
I've found these tools useful, with an honorable mention to gnupdf. I've never used it personally, but the code looks pretty solid. That said, when I really needed to produce great multilingual PDF I pulled out the PDF spec, gritted my teeth, and generated it directly.
leptonica - turn images into PDF
tesseract - turn images into searchable PDF
qpdf - linearize PDF for random access over HTTP
jhove - basic validation
jhove-pdf-a - validation with better compatibility guarantees
pdftk - command line tool for splicing pages together or apart
ttx/FontTools - tool for modifying custom fonts
reportlab - python library, easy to use but works best with Latin scripts
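For anyone wondering what "generating it directly" from the spec looks like, here is a minimal sketch in Python (standard library only). It is not the code I used; the function name `build_minimal_pdf` is made up for illustration, and it assumes ASCII text in the built-in Helvetica font. Real multilingual output also needs embedded fonts and encoding tables, which is where the teeth-gritting comes in.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF by hand: header, objects, xref table, trailer.

    Illustrative only: ASCII text, standard Helvetica, US Letter page.
    """
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello, PDF"))
```

The fiddly part is that the xref table stores exact byte offsets, so everything has to be assembled as bytes and measured as you go.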
You might try iText:
http://en.wikipedia.org/wiki/IText
pdftk uses it.
The best PDF software I've ever used is Prince XML.
For years, we got by with HTMLDoc but finally dumped it because we absolutely needed Unicode support.
After trying many different packages, we settled on Prince. Our main constraints were performance related, which you apparently aren't worried about, so maybe it's overkill for what you need.
libpoppler works; it just only meets the requirements that libpoppler was designed for. It correctly displays most PDFs, but fails with esoteric features used only in a small subset.
In that sense, libpoppler is like a swiffer mop: it handles most normal dirt, dust, and general cleaning needs for tile and hardwood; but you will need a mop, or potentially nylon or bristle scrubbers and power tools, to clean some deep-set grime from linoleum or porcelain tile. I've had mops fail to clean traffic grime from kitchen linoleum, at all; stuck a drill brush in a 3000RPM 600W output cordless drill and blasted that shit right off.
See, now thanks to you I have to clean all this coffee off my monitor...
It works.
It seems iText is it. Can anyone vouch for a good alternative?
Make this a requirement: support for producing PDF/A.
At least on the C# side of things, the three libraries I've used (iTextSharp, PdfSharp, and Aspose.Pdf) are all a bit of an unintuitive mess with inconsistencies all over the place and very little documentation. In the case of iText, their revenue stream is putting all their documentation into a book for people to buy, so it's not uncommon to get an intentionally vague response when asking for help.
I cycle between each depending on what I need to do, because they all have their own quirks and supported features. I've even piped from one to another to get certain parts of the process working.
Good luck.
Why deal with hard coding libraries for PDF rendering?
With Flying Saucer you can write XHTML and CSS 2.1 (with some CSS 3 features), and then you don't have to deal with hard coding the report.
Plus you can generate the report on the web and send it right to your client.
I have been using Flying Saucer for over a year, and it has really made creating PDF reports a lot easier.
Before finding Flying Saucer I did research on JasperReports and BIRT.
Flying Saucer was the only one that rendered about 98% of TinyMCE HTML tags correctly.
http://code.google.com/p/flying-saucer/
sudo apt-get install ghostscript pdftk poppler-utils
ghostscript: /usr/bin/dvipdf
ghostscript: /usr/bin/pdf2dsc
ghostscript: /usr/bin/pdf2ps
ghostscript: /usr/bin/pdfopt
ghostscript: /usr/bin/ps2pdf
pdftk: /usr/bin/pdftk
poppler-utils: /usr/bin/pdffonts
poppler-utils: /usr/bin/pdfimages
poppler-utils: /usr/bin/pdfinfo
poppler-utils: /usr/bin/pdfseparate
poppler-utils: /usr/bin/pdftocairo
poppler-utils: /usr/bin/pdftohtml
poppler-utils: /usr/bin/pdftoppm
poppler-utils: /usr/bin/pdftops
poppler-utils: /usr/bin/pdftotext
poppler-utils: /usr/bin/pdfunite
My favorite solution to this problem is to let a third-party tool do the heavy lifting of creating the actual PDF document and tie into it with printing code. When you set up the printer device and print mode correctly, you can render your output to a print document just as you would to the screen. The only changes are in the DPI calculations, because the screen and the printer have different dots per inch. I have done this on Windows using two techniques. The first, long ago, was to use an installed PostScript printer driver to "print to a file" and then have Ghostscript convert the PS to PDF. As far as I know this is still a free and open solution, though check the license before commercial use: Ghostscript is offered under the AGPL or a commercial license from Artifex. The second is to have your software look for a virtual PDF printer driver and "print to file". When you install Foxit PDF, it installs a virtual PDF printer driver that can be used by code.
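To illustrate the DPI adjustment, here is a rough sketch in Python. The helper name `scale_for_device` and the 96 DPI screen / 600 DPI printer figures are assumptions for the example, not values from any particular driver. The idea is to convert a length to inches on the source device, then back to pixels on the target device.

```python
def scale_for_device(length_px: float, source_dpi: float, target_dpi: float) -> int:
    """Convert a length in source-device pixels to target-device pixels."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    inches = length_px / source_dpi      # physical size stays the same
    return round(inches * target_dpi)    # same size, expressed in target dots


if __name__ == "__main__":
    # Hypothetical devices: a 96 DPI screen and a 600 DPI printer.
    print(scale_for_device(48, source_dpi=96, target_dpi=600))
```

Apply the same conversion to every coordinate, font size, and line width in the drawing code, and the printed page keeps the proportions you see on screen.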
Build a xelatex file and then compile ...
We tested different things here and we ended up generating xelatex files and compiling them as needed (from Java).
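We did this from Java, but the idea is easy to sketch in Python. The main trap when generating .tex files from data is escaping LaTeX's special characters. The helper names `latex_escape` and `render_document` below are invented for this example and the template is deliberately bare, so treat it as an outline under those assumptions.

```python
# Characters that LaTeX treats specially, mapped to safe replacements.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape one character at a time so replacements are never re-escaped."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_document(title: str, body: str) -> str:
    """Return a small XeLaTeX source file with user data safely escaped."""
    return "\n".join([
        r"\documentclass{article}",
        r"\usepackage{fontspec}",  # font loading for XeLaTeX
        r"\begin{document}",
        r"\section*{" + latex_escape(title) + "}",
        latex_escape(body),
        r"\end{document}",
        "",
    ])


if __name__ == "__main__":
    with open("report.tex", "w", encoding="utf-8") as handle:
        handle.write(render_document("Q3 & Q4", "Margin rose 5% on $1M."))
    # Then compile, for example: xelatex -interaction=nonstopmode report.tex
```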
(Pretty much, with minimal issues) pixel perfect generation of a PDF file from HTML.
IIRC they import and export pdf.
I know this is about as edge case as it gets but non-native PDF handling completely ignores PDF layering.
I use a cordless drill and brush bit to make my PDF files, too. It's slow, and it doesn't really scale well, but at least it's not from Adobe.
You are welcome on my lawn.
Just wanted to mention that the author forgot to evaluate ICEpdf.
ICEpdf can be used as standalone open source Java PDF viewer, or can be easily embedded in any Java application to seamlessly load or capture PDF documents. Beyond PDF document rendering, ICEpdf is extremely versatile, and can be used in a multitude of innovative ways, including: image conversion, annotation editing tools, text/image extraction, search and printing.
http://www.icesoft.org/java/projects/ICEpdf/
iText may be one of those. It comes under the AGPL, or under a commercial license if you buy support from the company.
You can't handle the truth.
PDFLib GmbH (a German LLC) builds exactly one product: PDFLib. And they've been doing that since 1997. AFAIK the company was run by one guy, the initial developer, alone for most of that time. Now it's probably a shop of 5 or so.
So it's not FOSS - yeah, that's a real shame. But the devs get to eat, you can demand service and response if you run into a bug and you can expect a good product and with PDFLib you're probably going to get it too.
I haven't come across a single project doing non-trivial PDF stuff that doesn't use PDFLib. I've used it myself a little, and the cookbook that comes with the product was very good, so it comes recommended.
My 2 cents.
I'd avoid pdftk. It uses a fork of the iText library that is buggy and old and apparently no longer actively maintained. A secondary issue is that it has terrible error handling; when iText reports an error, pdftk just throws away the information about what the error really is. Qpdf can do everything I used to do with pdftk, and is much better engineered and supported.
Have you looked at EssentialObjects? We use it for one of our internal applications and it's been very easy to work with.
Java-only. Commercial. Feature-rich. Great company name.
Because maybe it's not his first project? Fine, let me ask you: how many times did you get burned by totally unmaintainable third-party dependencies, before you vowed "NEVER AGAIN will I get so utterly fucked over?"
Was your fifth project the one where you couldn't ever port to a new architecture or OS, or was it the one where the only company who had the source went into bankruptcy and it took years for the liquidation to happen and you never really figured out where the assets are? No wait, your fifth project was the one where they just withdrew it from the market for "strategic reasons" and you never found out why and there was no replacement. Ah, then there was the race condition that you knew you could find if only you could read through the code, but the sole developer didn't even know what "race condition" means so he ignored your bug report. And the time the DRM server incorrectly said the API key had expired so you didn't get any sales that day. Then there was that time you had the source but weren't allowed to change some parts of it: I loved the comment "by reading this you are violating the License Agreement" followed by the base64 string of dynamically interpreted code. Of course you violated the agreement, and decoded it: finding a bug you weren't allowed to fix. And of course let's not forget the time the developer might have actually hypothetically allowed the code to be maintained or might have even done it himself, but he had lost it, the one and only copy in the entire world, which had been used to compile the code that literally tens of thousands of people were depending on. That one's a classic, almost right up there with the vendor who died, taking all his customers' hopes of maintenance with him to the grave.
Holy crap. I get why the public doesn't know to demand Free Software. Even smart people can be uninformed or lack expertise outside their areas. But developers, really? You have to be LITERALLY STUPID to not see "open source" as at least a major advantage, if not necessarily always the winner. Maybe it's not always a solid requirement, but if you don't always at least start your searches that way and try to get something that at least can be maintained, then yes, you're a moron.
"Oh no, I'm not a moron," you explain, "I just happen to think that some large projects aren't ever going to need maintenance, because surely it's simple enough that a good programmer will get everything right the first time." You're right: you're not a moron; you're an imbecile. Sorry about the mistake.
Big Faceless Org
Not open source. Java only.
Great friendly support
We used PDFClown (open source, Java) for production applications that were generating PDFs on the fly with hundreds of pages, and it performed very well.
iText meets some of your criteria.
* open source, with a community following; the kind of stuff Slashdotters would prefer = yep. Including several commercial books.
* tidy software architecture; simple things should remain simple = It's big and complete.
* allow open API allowing usage across many languages (say: Python & Java) = Native Java. iText# is a port to .NET. Not a good fit for dynamic languages.
* clear licensing status, not estranging future commercial use = AGPL + commercial license. Clear but not free as in beer for commercial.
* serious multilingual & font support = yep.
* PDF-handling rich features, not limiting usage for invoicing, e-commerce, reports & data mining = yep
* digital signing should not go against other features = has several signing modules. Excellent.
I started with ReportLab (the open source parts), found it too low level, and so considered the commercial edition because it has a templating language. As I was not very fond of investing time in learning yet another templating language, I reconsidered and gave HTML with CSS a try for printing. I used wkhtmltopdf for a while but switched to WeasyPrint in the end: it was created for printing HTML with CSS, and it seemed to be more actively developed than wkhtmltopdf.
Does anyone have experience with MuPDF? http://www.mupdf.com/ It's open source, but requires a license for commercial use. It appears to offer the best performance and portability. Its top-level application (MuPDF) is highly rated across most platforms: Google Play, Apple Store and Ubuntu.
Is there anything that can handle the gruesome CT600 forms that the UK Tax authority require us to fill in every year? These have lots of embedded scripting and can only be read with Acrobat Reader. However, this year, Adobe have stopped releasing Acrobat for Linux.
(An added bonus, the internal logic of the CT600 is buggy: for example if a particular tax option does not apply, it is fussy about the distinction of 0 vs empty, and this leads to subsequent validation errors (naturally with confusing messages). It also has about 20 pages of irrelevant data required, in order to reach a single number, which we have already calculated.)
I've seen commercial programs actually do this to support PDF report generation. They just leverage the existing code they have for printing reports and redirect it to a virtual printer. I think it was the Amyuni libraries which are clearly closed source. One thing I can say is that a virtual printer that directly generates PDF files from the GDI output (we're talking Windows here) tends to create cleaner output files (smaller size, less rendering errors) than the Postscript printer output to PDF route.
I needed to lay out a novel in a PDF. I've previously worked with iText and prefer not to construct a PDF one element at a time. I wanted an HTML-to-PDF workflow. I then tried wkhtmltopdf, but it doesn't support most of the hardcore design needs: hyphenation, widow/orphan control, alternating page margins, and page headers and footers based on the section of a document. PrinceXML supports all that. Writing only CSS, and based on HTML content, you'll be able to replicate anything a designer can do in InDesign. It's incredibly powerful and the time it saves in development effort more than pays for the high cost of the software.
Not sure how current it is, but when I was looking for the same a few years back all that was really available for PHP was HTML->PDF libraries which were not sufficient for anything but the most basic forms. A decent invoice form was hard to get right with these tools. Then I came across FOP. Or more specifically XML-FOP. Combine that with a little XSL and the output was amazing, and could do more than the HTML converters. The only problem is that the FOP tool was a Java based program so PHP would need to execute a shell command to call it. With tight control of what info was passed to that shell command, it seemed an appropriate trade-off for the job at hand. You can still get FOP in the ubuntu repos - apt-get install fop. The learning curve for FOP is a little steep to begin, but no more than any other XML dialect. And being XML, you have a lot of options in building the required FOP file. I opted to put my data into my own XML file, then utilize an XSL file to convert it if/when needed. More details here: http://xmlgraphics.apache.org/...
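On the "tight control of what info was passed to that shell command" point, here is a sketch of the idea in Python rather than PHP. The function name `build_fop_command`, the job-id convention, and the file layout are assumptions for illustration; the `-xml`, `-xsl` and `-pdf` options are FOP's usual command-line switches. The two safeguards are a whitelist on the only user-influenced value and an argument list instead of a shell string.

```python
import re
import subprocess
from pathlib import Path

# Only letters, digits, underscore and hyphen may appear in a job id.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def build_fop_command(job_id: str, work_dir: str, stylesheet: str) -> list[str]:
    """Build an argument list for FOP from a validated job id."""
    if not SAFE_NAME.fullmatch(job_id):
        raise ValueError(f"unsafe job id: {job_id!r}")
    base = Path(work_dir)
    return [
        "fop",
        "-xml", str(base / f"{job_id}.xml"),   # data exported by the application
        "-xsl", stylesheet,                    # XSL that turns the data into XSL-FO
        "-pdf", str(base / f"{job_id}.pdf"),   # output file
    ]


def run_fop(job_id: str, work_dir: str, stylesheet: str) -> None:
    """Run FOP without a shell, so arguments are never re-parsed."""
    subprocess.run(build_fop_command(job_id, work_dir, stylesheet), check=True)
```

Because the arguments go to the process as a list, even a filename containing spaces or quotes cannot turn into extra commands.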
We've used this in our application for several years to generate a huge number of PDFs for digital presses. It can also do extraction and rendering as well as generation.
https://www.dynaforms.com/
dpkg -l | grep pdf | grep lib | grep ii
ii libqpdf13:i386 5.1.1-1 i386 runtime library for PDF transformation/inspection software
Both are quite good; as their names suggest, these are Perl modules though.
libHaru is a free C library (install libhpdf-dev in Debian) which supports generation, annotation, compression, encryption. See http://libharu.org/
At least on the rendering side, I have found MuPDF to be really good and stable: http://mupdf.com/
I'll be honest that I don't have a broad range of experience with libraries. I've used a couple of html-to-pdf implementations and PDFSharp. The licensing for PDFSharp is very permissive, support can be paid for if required and the library is quite fast. As an aside, it has a cousin, MigraDoc, which produces abstract documents which you can finalise to Office formats, if you need that too.
IMO, there is no perfect tool, but PDFSharp has served me well.
I like to use PDFCreator for this purpose. It is on sourceforge here: http://sourceforge.net/project...