Tools for Publishing in Multiple Formats?
Truist asks: "What are the best tools (windows or *nix) to use to publish a single source document in multiple formats, specifically plain text, multi-page HTML, and PDF? I'm trying to publish a (60-page+) NetBSD installation guide/documentary online, and I want plain text for easy download and 'less'-ability, HTML for easy browsing and search engine indexing, and PDF or Postscript for easy printing. It's currently a Word document (I know, I know - I'm happy to manually convert it to something else) with multiple styles, including regular text, lists, internal links, external (web url) links, code, and notes, and I'd like to preserve as much as possible of each in the final output. Some additional notes: there are no graphics, and I expect to update this document periodically, or to split it into parts and maintain the parts (think master document / subdocuments). It won't be updated too often, but if re-publishing could be scriptable, that would be fantastic."
I have seen a variation of this question posted here at least twice. The unanimous answer is usually DocBook, and in this case it is even more relevant, since the document is technical in nature.
A good pick is DocBook: The Definitive Guide, written by Norman Walsh (who chairs the OASIS DocBook Technical Committee) and published by O'Reilly. Of course the book is also available in HTML, PDF and plain text.
Come on... this has to be a planted question ;)
TeXinfo
texi2html
The subject says it all. Apparently, it's the standards-based, open-source-conforming way to do it. I've heard paeans sung to FOP but I haven't used it, yet.
Open Office can save to all of those formats. It isn't scriptable, but all those outputs are done easily enough. Has that been taken into consideration or am I off base on what you are looking for?
That's scary.
I actually worked on a ~500pg. documentation project with a couple of other developers a couple of years back, and after about six months of debate, they finally agreed to let me recode the thing from TeX to DocBook XML.
The conversion was a PITA, but once that was finished, we had about 40 source XML files which were independently version-controlled, some minor customizations to the standard DocBook XSLT stylesheets, and slick, easily-updated HTML, plain text, and PDF versions of the document being produced straight out of CVS by a cron job.
A nice benefit of the conversion was that we were actually able to add another few hundred pages of documentation that was automatically generated from grammar definitions and source code to the batch build, and they could be integrated into the style and distribution methods we worked out for the hand-generated docs.
LyX is how I do what you're talking about. It's a WYSIWYM (What You See Is What You Mean) document processor, and it's great. I use it to write term papers, HOWTO documents, and lots of other stuff. You can export your document to many different formats, including HTML, PDF, plain text and PostScript. You should try it out; I really like it.
Specifically latex, and more specifically pdflatex for pdf output and tex2page for html. With some hacking you should be able to script tex2page into outputting text as well.
To some extent the texinfo folks have solved this problem as well. The DocBook stuff mentioned elsewhere might be very nice but I have no experience with that.
I am currently publishing several several-hundred-page technical manuals using the following workflow:
All documentation is edited using an ordinary plaintext editor.
The documents are marked-up using ReStructured Text conventions. This has satisfied 99% of my needs. I've decided the convenience of ReST outweighs the need for the remaining 1% of the frills I want.
I use CVS for revision control. There may be an RCS involved in the backend; I don't operate the server that hosts my repository.
The ReST documents are converted to XML using DocUtils. The project coordinator, by the way, has proven himself a superlative programmer. DocUtils rocks, and will also transform ReST to HTML or Latex.
The XML is converted using XSL templates that I've created. Saxon then transforms the DocUtils XML to XSL-FO, and FOP transforms that into PDF.
Pretty fucking spiffy, if I do say so myself.
I also currently use HT2HTML to transform ReST to HTML. I use it in preference to DocUtils' native HTML transformation because it allows me to do a few nice tricks. In the future I plan to migrate entirely to another set of custom XSL transformations.
This system has proven extremely productive. At any time I could pop a few bucks for a commercial XSL-FO-to-PDF engine and stomp the few gripes I've had with FOP (my number one issue is lack of keep-with-next functionality; however, FOP is under a complete refactoring, and will emerge with full functionality). Saxon has been superb, DocUtils has been wonderful (and I've been able to contribute to the overall design), and ReST is quite pleasant to read and write.
Overall, I highly recommend this workflow.
Your source material becomes extremely reusable, eminently accessible, and free from commercial encumbrances.
(footnote: if you do go this route, please don't flood the DocUtils developers with suggestions and ideas. Work out your idea in detail, consult the developers' mailing list archives, and make full consideration of side-effects. Only then suggest it. They've been at this so long, and had so many discussions, that they've become a little short of patience with loud-mouthed newbies. I suspect most popular open-source projects get that way...)
--
Don't like it? Respond with words, not karma.
I've written some extensive docs in texinfo and moved it rather easily to pdf, html and plain text.
I've tried doing the same for docbook and it plain sucked. While the DocBook format itself is nice, the tools for transforming are too complex (for me?), esp. if you want to customize conversion to HTML or PDF. This definitely goes for DocBook/SGML, and by what I've seen so far DocBook/XML too to some extent.
Thus I'd rather say "texinfo", at least unless someone comes up with a foolproof suite of tools for DocBook->PDF+HTML.
My $0.02.
- Hubert
It's highly likely that OO is scriptable.
I've seen some posts here on XML, but most seem misleading. I've found that the most expressive and most flexible format is manual XML -- as in, your own dialect.
That is, you define your own tags, and define what they mean. Then you create stylesheets to convert them to other things. Because the original XML contains your intention, not the eventual formatting, it makes it easy to convert, or to make broad, sweeping changes to presentation (as presentation is detached from content).
The simplest example is, suppose you started out just using some HTML-like <pre> tag for code. You could easily search and replace that with something else, like say <div id="code"> for an HTML-style CSS implementation. That would actually be a step towards XML. The problem is, if you start out this way, and then you've got your hundreds of pages, you may have occasionally used <pre> for other things, like ASCII art.
The right way to do this is create an XML tag <code> and stylesheets that convert it to, say, a little padded space for text, a <pre> tag for HTML, or something literally to do with fonts for postscript. That way you can change any detail of presentation in any format in one place (the stylesheet).
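To make that concrete, here is a rough sketch in Python using only the standard library. The dialect (<doc>, <para>, <code>) and the function names are made up for illustration, and the two functions stand in for what would more likely be real stylesheets; the sample text is just filler.

```python
import html
import xml.etree.ElementTree as ET

# A made-up dialect for illustration: <doc>, <para> and <code> are
# hypothetical tag names, not part of any standard.
SOURCE = """<doc>
  <para>Boot the installer, then run:</para>
  <code>disklabel -i wd0</code>
</doc>"""


def to_html(root):
    """Render the dialect as HTML: <para> becomes <p>, <code> becomes <pre>."""
    tags = {"para": "p", "code": "pre"}
    parts = []
    for node in root:
        tag = tags[node.tag]
        parts.append("<{0}>{1}</{0}>".format(tag, html.escape(node.text or "")))
    return "\n".join(parts)


def to_text(root):
    """Render the dialect as plain text: <code> is indented four spaces."""
    parts = []
    for node in root:
        text = (node.text or "").strip()
        parts.append("    " + text if node.tag == "code" else text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    doc = ET.fromstring(SOURCE)
    print(to_html(doc))
    print(to_text(doc))
```

Changing how code looks in either output means editing one function, and the source document is never touched.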
There's also the fact that XML, especially encoded in UTF-8, is immortal, whereas say a Word upgrade (or any software upgrade, for that matter) could make your document obsolete. And even if there is never an XML parser left in the world, it is human-readable.
Don't thank God, thank a doctor!
DocBook SUCKS. However, it's probably the best thing out there for the job.
The problem with DocBook might also be considered its strength - basically it was designed by a committee, and evolved several humps. Each influential party behind it pushed the features that they wanted to see into it. Each individual feature set is a pretty good coherent package which will let you create documents just like [insert-project-name-here]'s own documentation - pretty neat! However, the different feature sets clash _horribly_, and if you pick and choose between them you'll end up with an inside-out baboon.
(And to be honest I've not discovered _any_ feature that the various admonitions don't look out of place next to! Most of the list types are pretty, erm, special too.)
I've just taken on the role of producing documentation for a small OpenSource project, and I came _very_ close to regretting my choice of DocBook. However, once you've decided what coherent subset of the features you actually need, you'll probably end up with something that looks OK in all formats.
(I was using the default Debian Jade configuration, perhaps I could tweak some of the stylesheets to look less quirky.)
YAW.
Your head of state is a corrupt weasel, I hope you're happy.
Too bad they are such a large investment just to publish. Applescript allows for doing very complicated workflows using multiple applications. You can even 'compile' it as a droplet for drag and drop functionality.
If I was doing this on my Mac I would create a script to, in order: Save my Word file as Plaintext, Save it as HTML, Print it as PDF (OS X can print to PDF from any and all applications), use the ColorSync Utility to regenerate the PDFs with your desired compression settings, then use an HTML cleaner such as HTML Tidy to eliminate all the crappy MS HTML markup. With Applescript it's a point and click operation to create the script: just hit record and go through the motions described above, hit stop and save as a droplet. You can drag and drop any number of Word docs onto it whenever you need to 'publish'. You could add an FTP action or save to an iDisk as part of the workflow just as easily.
The only thing you have to worry about is some of Word's [table] markup as it seriously blows when you try to convert to normal html.
There are plenty of tools for XML/XSLT transforms that could be scripted as well but it could be overkill... or maybe not.
If you had a Mac it would be easy.
A fool throws a stone into a well and a thousand sages can not remove it.
The other solutions presented so far suffer w.r.t # 4 - document maintenance. After all, if someone created their document in a visually rich editor like Word, it was probably because of ease of use and they will find it easiest to maintain it there. However, two conditions must be met for the system we're discussing:
(A) It should be possible to constrain all aspects of the document, so it has a defined, machine-comprehensible structure
(B) The maintenance application must integrate with a version-control and access-control system (a nice to have - so that document-maintenance is transparent)
Regarding (A) - it seems useful to emulate an ability from the new "Pro" version of Word 2003 - custom XML schemas that constrain document content. From what I've read about it, it's like the old document field macros and templates, but more powerful and using XML Schemas for validation. I'm *guessing* (not sure - can someone with OpenOffice expertise chip in?) OpenOffice could be made to do the same thing. For example - could a 'document schema' be defined so that the document is *forced* to have its title in the middle of the first page, its index auto-generated on the second, text in a particular style, chapter and subchapter headings in other pre-selected styles, a list of figures and list of tables auto-generated as Appendices 1 and 2... ?
Regarding (B) - the scriptability of OpenOffice should support invocation of CVS/Subversion clients.
If you write your own major mode for Emacs, all these issues just go away.
However, if you are lacking lisp-fu and absolutely must have a GUI-based
WYSIWYG editor, OpenOffice may be a possible solution. You'll have to avoid
workflows that result in creating styles with meaningless names. (For example,
you can't just highlight some random text and start formatting it in various
arbitrary ways. Instead, define your styles properly with names using the
style catalog, and then apply your named styles to blocks of text. The styles
you define should represent things that have meaning independent of the format,
so that you can sensibly convert them into the various target formats.)
Getting the XML out of an OO document is a simple matter of renaming it from
foo.sxw to foo.zip and unzipping it. You can then use any XML parser you
like (e.g., one of the XML modules off of CPAN) to transform it into DocBook
or whatever. All of this can be automated.
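As a rough sketch of that last step (in Python rather than Perl, standard library only): the renaming is not strictly needed, since a zip library can open foo.sxw directly. This assumes the body of the document lives in a member called content.xml and that paragraphs and headings are elements whose local names are "p" and "h", carrying a "style-name" attribute; check your own files before relying on it. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(path):
    """Return the parsed content.xml from an OpenOffice document archive.

    Assumes the archive has a member named content.xml.
    """
    with zipfile.ZipFile(path) as archive:
        with archive.open("content.xml") as member:
            return ET.parse(member).getroot()


def local(name):
    """Strip the '{namespace}' prefix ElementTree puts on names."""
    return name.rsplit("}", 1)[-1]


def styled_paragraphs(root):
    """Collect (style name, text) for each heading or paragraph element.

    Matching on local names "p", "h" and "style-name" is an assumption
    made here to avoid hard-coding namespace URIs.
    """
    pairs = []
    for node in root.iter():
        if local(node.tag) in ("p", "h"):
            style = next(
                (v for k, v in node.attrib.items() if local(k) == "style-name"),
                None,
            )
            pairs.append((style, "".join(node.itertext())))
    return pairs
```

Calling styled_paragraphs(read_content_xml("foo.sxw")) gives you a list you can map onto DocBook elements by style name, which is exactly why the styles need meaningful names in the first place.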
Cut that out, or I will ship you to Norilsk in a box.
I use roff. It is a very simple document formatter. The plus is that you automatically get unix style man pages for free. Use it with make to simplify your life as well.
Here is a concrete example. I create a roff file rwlock.man as the source. Say I want a postscript doc, then I add the following to a Makefile.
rwlock.ps : rwlock.man
	groff -man rwlock.man > rwlock.ps
This uses GNU troff; on other systems you might use the troff included with your system and pipe through dpost.
If I need a pdf file, that is easy from the postscript file.
rwlock.pdf : rwlock.ps
	ps2pdf -dCompatibilityLevel=1.1 rwlock.ps rwlock.pdf
You can use all sorts of other options to ps2pdf; just do a 'man ps2pdf' to learn more. You can install ps2pdf in the usual ways for your system; it is a common package. If you are on MacOS X 10.3 you already have pstopdf which is similar in functionality.
Say I want a plain text file of the documentation, then I add something like this to the Makefile.
rwlock.txt : rwlock.man
	nroff -man rwlock.man | col -bx > rwlock.txt
If your system includes GNU nroff, then you can use something like gtroff -man rwlock.man | grotty -buo in the command instead. Some nroff's and their 'an' macro files are a bit old fashioned and do 66 characters per line and 66 lines per page so you may have to experiment a bit on your system. Solaris nroff is a pain with respect to this while FreeBSD gets it right.
Then if you want html, GNU troff is the best way to go.
rwlock.html : rwlock.man
	groff -man -Thtml rwlock.man > rwlock.html
It generates fairly lean html with some comments that is easy to follow if you ever need to look at the source.
If you have a whole bunch of files to process, then you can use suffix rules in make to simplify that job. There are numerous troff, nroff, and man page HOWTOs on the web that you can read that make it a breeze to get started with roff. There should be standards if you want to conform to, say, FreeBSD or Linux man pages.
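If you would rather drive the same conversions from a script than from make, something like the following rough Python sketch could work. The commands are copied from the make rules above; the names RULES, commands_for, is_stale and build are made up for illustration, and build assumes groff, nroff, col and ps2pdf are on your PATH.

```python
import os
import shlex
import subprocess

# Command templates copied from the make rules in this post; {src} and
# {out} are filled in per file. The table itself is illustrative.
RULES = {
    ".ps": "groff -man {src} > {out}",
    ".txt": "nroff -man {src} | col -bx > {out}",
    ".html": "groff -man -Thtml {src} > {out}",
}


def commands_for(src):
    """Return (target, shell command) pairs for one .man source file."""
    stem, _ = os.path.splitext(src)
    pairs = []
    for suffix, template in RULES.items():
        out = stem + suffix
        cmd = template.format(src=shlex.quote(src), out=shlex.quote(out))
        pairs.append((out, cmd))
    # The PDF is derived from the PostScript file, so it goes last.
    ps, pdf = stem + ".ps", stem + ".pdf"
    pairs.append((pdf, "ps2pdf -dCompatibilityLevel=1.1 {} {}".format(
        shlex.quote(ps), shlex.quote(pdf))))
    return pairs


def is_stale(src, out):
    """True when the target is missing or older than its source, like make."""
    return not os.path.exists(out) or os.path.getmtime(out) < os.path.getmtime(src)


def build(src):
    """Run every stale rule for src, stopping at the first failure."""
    for out, cmd in commands_for(src):
        if is_stale(src, out):
            subprocess.run(cmd, shell=True, check=True)
```

Calling build("rwlock.man") would then regenerate whichever of the four outputs are out of date. Make still does this better once you have dependencies between documents; this is only meant to show that the whole thing is a handful of shell commands.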
You can go here to see the results from the 'rwlock' examples from this comment:
http://www-bd.fnal.gov/controls/micro_p/rwlock.
(Slashcode may break-up that url, I did this post in text because I did not want to deal with the lameness filter for all of the make rules.)
You can do a man on any of the commands above to learn more about them. Also roff can do a whole lot more because you can have it run various other processors as it formats. In this way you can get tables, figures, pictures, references, and even primitive equations.
Once you start doing that, though, man pages cannot really look right anymore on a text console, and xman is a kludge so it often gets this wrong. If you start getting into more sophisticated equations, for example, I would recommend latex. It is straightforward to get ps and pdf out of latex and you can try an add-on such as latex2html to get html output.
Hope this helps.