Is Free Software Ready For E-publishing?
johanneswilm writes "Over more than 3 years I have been writing my PhD thesis on the politics of Nicaragua. Being the most professional system for PDF generation, I went with LaTeX, and, to make the text accessible for the editors, I used the LyX editor. Now that the publication date comes near, I found I had to spend considerable time creating a script to convert the manuscript to formats such as Epub as none of the available tools were quite ready to do it automatically. Is LaTeX only good for writers in the natural sciences? Is the open source community boycotting ebook formats, as Richard Stallman has proposed? Are there better tools to do the same?"
Being the most professional system for PDF generation, I went with LaTeX
Now that the publication date comes near, I found I had to spend considerable time creating a script to convert the manuscript to formats such as Epub
It sure sounds like the most professional system!
The truth is, if you want your job done, you look at the merits of every possible program without considering whether it's open source or not. There is good software like Apache that is mostly good for web hosting (unless you have certain requirements). Then there is lots of shit. The same is true for proprietary software, though. But if you want to get something real done, it's just stupid to limit yourself to only open source OR proprietary software. Pick the best tool for the job.
Google+ vs. Facebook, and why Google+ will fail
The best way to go seems LaTeX->HTML->ePUB. I guess many of your problems do not come from LaTeX itself, but from the fact that the LaTeX code that LyX outputs is... well... not meant for human editing and for further work. (haven't worked with LyX in a while, though -- maybe the quality of the TeX it produces has considerably improved in the meantime).
My first program:
Hell Segmentation fault
calibre is a free and open source e-book library management application that can convert to and from most ebook formats, and it does a pretty good job of it.
http://calibre-ebook.com/
Stallman complains about DRM and a lack of anonymity with eBooks. It seems to me that this story relates very closely to legally acquired music. While it is still difficult to legally acquire digital music anonymously, it is easy to get it without DRM. I suspect books will follow this same path if consumers value it as a feature. In practice there is in fact little anonymity in the purchase of real books as everyone wants you to swipe your "club" card and use your debit card to make the purchase but his point is well taken. The option to buy an unpopular book in secret is nice.
With time and interest from consumers we will have DRM free books.
Anonymity is dead and gone and I didn't even get an invitation to the funeral. We should all mourn its passing.
"Is the open source community boycotting ebook formats?"
Hardly. Calibre is an excellent converter and library manager, and it's compatible with most of the readers out there for syncing. You could try converting from PDF to ePub with it, although PDF is a lousy input format.
Going through PDF is horrible. LaTeX contains a lot of semantic markup. ePub is XHTML, which is a form of semantic markup. PDF is a presentation format. So, you start with semantic markup, discard it all, and then try to generate it again by magic.
You end up with something that looks vaguely like the PDF, but loses most of the semantic information (e.g. section / chapter breaks). Worse, you often don't want the ePub version to look like the PDF - they're aimed at different form factors.
I am TheRaven on Soylent News
I've found pandoc (here: http://johnmacfarlane.net/pandoc/) to be very useful for generating PDF/ePub/LaTeX/etc from Markdown formatted text files.
Others have told me that the financial gain of publishing an academic book may be up to 700 USD. In comparison to current Scandinavian wages that really means very little, so I don’t think that earning another 700 USD should be a motive to restrict the access to one’s thoughts.
First of all I would like to commend you and thank you for this sentiment.
Is the open source community boycotting ebook formats, as Richard Stallman has proposed?
I don't understand. Stallman decries e-book formats that aren't open. There are many open e-book formats, including ePub. Granted, there are tools out there (like Calibre) that allow you, with varying degrees of success, to crack and convert to these formats, but why bother? As you can see in that table, most everyone supports PDF. You are misunderstanding Stallman's gripe. It's not that we are boycotting e-books, it's that e-book makers are trying to carve out their own proprietary section of the electronic market, readers and creators included. So let them take their ball and play elsewhere. As you noted in your blog, this isn't the only problem:
Most ebook-readers out there do not implement the Epub-standard perfectly. That means that although one has an Epub that follows all the standards, one can be quite sure that it will not display properly on all the readers. Kovid Goyal, the creator of the Calibre ebook management software has done a good job in creating conversion scripts that create Epubs for all the different readers. Unfortunately they do this by breaking compatibility with the standard, and many distribution sites will only check whether your Epub complies with the standards and not whether the book will actually look good in the reader.
Most readers handle PDF, I would just stick to the output of LaTeX. I might suggest that your expectations are misdirected at the open source community and might be better directed at the makers of readers that apparently force you to break standards. It's the IE6 conundrum all over again.
Stallman didn't suggest boycotting ebook formats, just the DRM associated with them (big surprise there). The problem you are experiencing is that sometimes it's difficult to go from one open standard to another. The tools are lacking in maturity, and I'm guessing that since my Android phone can easily display PDFs for me, there are not a lot of people demanding this ePub support that apparently needs multiple flavors for each device (and Calibre helps you with this). The tools exist but they'll only get you so far, and I think the really special stuff that LaTeX does well is what you'll find yourself needing to fine tune in the end product. Look at how long it's taken LaTeX to get that beautiful and I think you'll discover that making a magical cure-all converter to ${random format} can be a non-trivial task.
If you start a kickstarter and get your university to donate hosting to making an open free market for any academic papers in any open format, I'd definitely throw in $20 (I've spent about $200 on kickstarter in the past two years). Either that or maybe throw your lot in with arxiv and work with them to fund more format support?
My work here is dung.
The solution to your problems is Pandoc which can convert LaTeX to EPUB if you like. Now, it will probably take some fiddling on your part with the output but it very much smooths the process.
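For anyone who wants to script that step, here is a minimal sketch of driving pandoc from Python. It assumes pandoc is installed and on your PATH; the function names and the file names in the usage note are placeholders I made up, not anything from the submitter's setup. The flags used (`--from`, `--to`, `--toc`, `--metadata`, `--output`) are standard pandoc options, but a LyX-exported thesis with custom macros will probably still need hand-tuning afterwards.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source: Path, target: Path, title: str) -> list[str]:
    """Return the argument list for a LaTeX-to-EPUB pandoc run."""
    return [
        "pandoc",
        str(source),
        "--from", "latex",
        "--to", "epub",
        "--toc",  # ask pandoc to generate a table of contents
        "--metadata", f"title={title}",
        "--output", str(target),
    ]


def convert(source: Path, target: Path, title: str) -> None:
    """Run pandoc, failing loudly if it is missing or reports an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source, target, title), check=True)
```

Calling `convert(Path("thesis.tex"), Path("thesis.epub"), "My Thesis")` would then produce the EPUB next to the source, assuming a `thesis.tex` exists. Keeping the command construction separate from the call makes it easy to print the command and run it by hand when the output needs fiddling.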
I've not published anything in a journal for a couple of years, but in computer science every journal worth reading accepts PDF submissions and either provides a LaTeX style, asks for your LaTeX source to edit themselves, or tells you which standard LaTeX style to use. It's a good first check for a journal - if they don't encourage LaTeX submissions, they probably suck. Apparently the same is true in mathematics and physics, but less so in other subjects. In the humanities it's common for journals to require MS Word documents (and place insanely strict requirements on the formatting of the bibliography that are trivial with BibTeX and very hard with MS Word, from what I've heard).
1. Realise no scripts exist for problem
1.1 Realize that someone writing a thesis on Nicaraguan politics may not know how to program
1.2 Begin learning to program
1.3 Spend more time learning to program
2. Write scripts
2.1 Divert time from PhD thesis to write scripts
2.2 Spend more time (diverted from PhD program) learning to program sufficiently to write workable scripts to solve stated issue
3. Release scripts as open source
3.1 Fail to complete PhD thesis in time due to time spent programming
This mind intentionally left blank.
The trouble is, PDF is a pretty rotten format for e-readers, because it's all page-layout oriented and so produces output that doesn't scale well for different screen formats and text sizes. It's the wrong format for the job. And DVI has pretty much the same problems. The problem isn't that free software isn't ready for ePublishing -- Calibre and Sigil do the job well. The problem is that there's a disconnect between the assumptions LaTeX makes about a document and the assumptions that are valid for ePublishing. Sorry if it's restating the blindingly obvious, but you didn't want the best system for PDF generation, you wanted the best system for PDF and EPUB generation, and that probably isn't LaTeX.
Quidnam Latine loqui modo coepi?
He does state, "We must reject e-books until they respect our freedom," but he also outlines seven things Amazon's e-books do that violate this freedom. Fortunately, ePub is the most widely accepted e-book format, and it has none of these seven problems.
RMS isn't against e-books. He's against Amazon's approach to e-books.
But if you want to get something real done, it's just stupid to limit yourself to only open source OR proprietary software. Pick the best tool for the job.
Be careful: sometimes, especially in cases of works under a "copyleft" or "share-alike" license, a work's copyright license limits which tools for the job are lawful. For example, some licenses require works to be made available in an editable format that isn't Java-trapped.* See, for example, sentences containing "Transparent" in the GNU Free Documentation License and sentences containing "technological" in CC BY-SA. You can use proprietary tools yourself, but you also have to make sure that the work can be edited with free tools.
*The term's origin is historical, predating IcedTea.
If graphics and equations were required, I would have moved to a generic HTML + css method.
Most web browsers that I've seen are based on the model of rendering a web page to a scroll that is 960px wide by infinitely tall. But in the real world of print, the codex has replaced the scroll. The paged media module in CSS3 is still only a Working Draft. So which web browser would you recommend that has thorough support for MathML and for CSS paged media?
4. Realize this is exactly what happened to Knuth.
4.1 Take consolation in the fact that at least it's just a thesis, not the next volume of TAOCP.
Short answer: none want ePub formats as submissions. But that doesn't mean there is no desire to produce them from submissions. Lots of scientists and academics want to read articles on the go, without having to carry around lots of paper.
My own experience, however, is that the big move up is from PDF to HTML. This improves the reading experience enormously. EPUB, on the other hand, is limited. Many ebook readers don't work that well for academic content: mathematics is handled badly, with non-scalable fonts; graphs and images are poor; citations are not well supported. I haven't seen a huge use case for EPUB yet.
This is fine. Then if I purchase an e-Book, I only need the PDF version specific to the device I'm currently using (a Nook Classic)... oh, and any device I might ever want to use for the rest of my life. A proper eBook format cannot be tied to a specific page format.
I like PDF for computer use, but the parent is right... it's definitely "rotten" for e-Readers. I've tried converting PDF to ePub to use on my Nook and it's a hit-or-miss proposition, with much more "miss" than "hit".
You are in a maze of twisty little passages, all alike.
Because PDF to ePub conversion generally gives you pretty awful results. Nothing against Calibre. I use it. But most PDFs I've tried to convert for my Nook Classic have had less than stellar results: readable if you're lucky, but not nicely formatted. And if there are embedded images, all bets are off.
Who defines the "common" devices? How do you handle something like an Android or iDevice, where the orientation can be changed? PDFs are not a good format for anything destined for a screen instead of paper. That computer monitors are (mostly) large enough to display most of the common paper size (letter/A4) is fortunate, but should not be relied upon.
No, the results won't look better. I have poor eyesight, so I choose a large text size on my eReader. If the original is EPUB then the text reflows smoothly and it's all nicely readable. If it's PDF it doesn't, and the results look like crap. The solution you propose means anticipating the individual requirements of every potential user, and producing a customised PDF for that user. What's more, if I'm in bright light then I can move to a smaller text size to see more at one time, but doing it your way I'd need two copies of the file (and some way of synchronising the bookmarks and annotations). We've moved beyond the age of one-size-fits-all, but PDF hasn't. LaTeX doesn't seem to have, either. Essentially, you need to separate content from presentation, which neither PDF nor LaTeX does, although there is work on moving LaTeX towards that.
I don't think any of you twits realize how much work goes into any PhD thesis.
A little programming overhead is not going to be that much of a burden really.
This is why most stuff gets invented. It's really not that much of a tragedy when people who don't specialize in selling a particular technology to others have to develop solutions for themselves involving that technology.
If real people thought like you weenies then we never would have had the original killer app for the PC.
A Pirate and a Puritan look the same on a balance sheet.
Congratulations, you've just earned the "I don't know what I'm talking about!" achievement! All modern e-readers, when using their proper formats (generally ePub for pretty much anything worth using) handle line-breaking and hyphenation just fine, and unless you're reading from some badly OCRed plaintext copy, will look as good as the paper version. PDF is a bad format for e-readers, and you're a bad person for suggesting it.
Today is red jello day - all workers must eat all of their red jello. Failure to comply will result in five demerits.
This would only be relevant if we were discussing templates and packages to be embedded as part of the document
In the case of documents under the GFDL, the copyright license requires that those who distribute copies of the document also make copies available in a "Transparent" form, one editable using free software. So those who make derivative works have to make derivative works available in a "Transparent" form. Or perhaps I misunderstood "Transparent" in the GFDL; what am I missing?
Yes, you've pointed out the drawbacks with deploying as pdf, which I agree are real. And I think I've pointed out the advantages of pdf, and the drawbacks of ebook formats as they are currently implemented. If you're an author and you care or want to control what the document actually looks like, then pdf (or a bunch of images) is your only option.
No, it isn't an option. PDF doesn't do that unless you also control the device on which it is displayed. When I view a PDF with large text on my eReader I'm damn sure that what I see isn't what the author intended (not all authors can be that demented, surely). If you are trying to do that then you have overstepped your role as an author. If you think you have succeeded then you should try talking to your users (especially ones with visual impairment). PDF does have its uses, but that isn't one of them.
But one of the main reasons for LaTeX is exactly to separate content from presentation, so I think you're misinformed about that.
I used to use LaTeX a lot, and was a member of TUG. LaTeX is better than raw TeX, in terms of separation of content and presentation, but most raw TeX is still there in LaTeX, and LaTeX commands such as \textwidth, \baselineskip, \raisebox (everything to do with boxes, in fact), \vspace, \textbf and so many other LaTeX constructs are about presentation, not content. You can write LaTeX that separates content from presentation, but tools that claim to process LaTeX can't assume that you have; they need to accept all legal LaTeX, including all the presentation stuff.
and that point doesn't apply to pdf, which is for consumption only, not for writing.
Internally, PDF is quite like DVI in terms of how it structures a page, and the content and presentation have been well and truly merged. PDF puts blocks in defined positions on the page, and the order of the blocks doesn't necessarily match the order of the content. That's why when you select text in a PDF you often get bits you don't want. And it's why it's hard to go from PDF to EPUB; it's not a simple translation, the software needs to understand the significance of relative positions of blocks of text, which is very far from trivial. Yes, it's a presentation format, but that means that you have lost information needed to make a robust EPUB file from it. A far better option is to start with EPUB and generate your PDF from it and a stylesheet. The only downsides are that free EPUB editing tools are not well developed (unless somebody can point me to one that I've missed) and that EPUB enforces a linear reading sequence (but you're going to have to deal with that anyway if you're going to produce EPUB).
The store has 87 CCTV cameras, all linked to face recognition software.
Sent from my ASR33 using ASCII
Or you could build yourself a PHP tool that does it, as I did here. (Disclaimer: intended for my personal use, so basically no user-friendliness at all.)
Creating ePubs is surprisingly easy from a programming perspective.
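To give a sense of how little is involved, here is a rough sketch in Python rather than PHP (it is not the tool linked above). An EPUB is a zip archive whose first entry is an uncompressed `mimetype` file, plus a `META-INF/container.xml` that points at a package document listing the content. The `OEBPS/` layout, the function names, the all-zero identifier and the fixed modification date are placeholders of mine, and I have not run the result through a validator, so treat it as a starting point only.

```python
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Identifier and modification date below are placeholders for illustration.
CONTENT_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2000-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter1"/>
  </spine>
</package>
"""

NAV_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc">
      <ol><li><a href="chapter1.xhtml">{title}</a></li></ol>
    </nav>
  </body>
</html>
"""

CHAPTER_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>
    <h1>{title}</h1>
    <p>{text}</p>
  </body>
</html>
"""


def build_epub_files(title: str, text: str) -> dict[str, str]:
    """Fill the templates, escaping user text so the XML stays well-formed."""
    safe_title, safe_text = escape(title), escape(text)
    return {
        "META-INF/container.xml": CONTAINER_XML,
        "OEBPS/content.opf": CONTENT_OPF.format(title=safe_title),
        "OEBPS/nav.xhtml": NAV_XHTML.format(title=safe_title),
        "OEBPS/chapter1.xhtml": CHAPTER_XHTML.format(title=safe_title, text=safe_text),
    }


def write_epub(path: str, title: str, text: str) -> None:
    """Write a one-chapter EPUB to `path`."""
    with zipfile.ZipFile(path, "w") as epub:
        # The mimetype entry must be first in the archive and stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        for name, content in build_epub_files(title, text).items():
            epub.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
```

Something like `write_epub("demo.epub", "Demo", "Hello, reader.")` gives you a file to try on a device. As others in the thread point out, the hard part is how individual readers render it.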
FanFictionRecs.net
Do you have a pdf reader that tries to increase the font size, or make font substitutions, rather than just zooming the whole page?
Yes, my eReader, because it only has the facility to turn the page, not to move around the page. And as far as I'm concerned, it's not the fault of the reader for trying to do that, it's the fault of publishers supplying content in a format that tries to stop me, which I view in the same light as DVDs that won't let me pause the main feature while I take a comfort break. I don't want to have to read a broadsheet newspaper through a letter box by asking somebody on the other side of the door to move it around.
Some documents are typographically complex, or convey their meaning partly through layout and typography, and these elements will be destroyed by typical e-book software and are not preserved in e-book formats.
And won't be displayed on my eReader, even if you use PDF. Yes, PDF is a good format for such documents, but not for an eReader because it won't work.
No, it doesn't.
If all you want to do is download a file and print it on Letter-sized paper (or A4, assuming the PDF is in A4), then PDF is great.
However, if you want to view it on a screen, especially a screen that's smaller than letter-size, it sucks. Maybe you haven't noticed, but ebook readers are all smaller than letter-size paper, so it's physically impossible to view a PDF page on an ebook reader without either panning, or shrinking it. Panning around to read a page is annoying, and shrinking it will make it difficult or impossible to read (depending on the font size and the ereader's algorithm), plus it's even worse if the viewer has poor eyesight and prefers larger fonts.
This is the entire reason that ebook formats were invented, so that readers could dynamically resize and re-flow text, instead of being stuck with a fixed page size. Of course, with PDF, instead of defaulting to Letter size, you could format your document for a page size equal to the ebook reader's screen size, and make it look great on that ebook reader, but only that one. They don't all have the same size screen, so you'll need different PDFs for every single ebook reader out there, which flies in the face of the "Portable" aspect that PDFs tout. Plus you'd still need one in Letter size for anyone who wants to print out the document.
That depends on how you use it. In the XHTML that I generated for my last book, for example, every code listing was marked up using libclang. Each token was in a span element with the class set to the token type. This meant that the XHTML version contained information about whether something was a macro instantiation, a language keyword, a reference to a variable, and so on. The styling for all of this was then done in CSS. I sent my publisher a rough version and they could then tweak it so that it matched their house style a bit better.
I initially tried using tex4ht, but it generated presentation markup for the code listings. They looked in the output how the listings package made them look in the PDF, but it was impossible to style them properly from the CSS.
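A rough illustration of the span-per-token idea, for anyone curious. This is not the libclang pipeline described above: it swaps in Python's standard-library `tokenize` module, works on Python source, and only recovers lexical classes (keyword, name, number, string, comment, operator), not semantic ones such as macro instantiations or variable references. The class names and the `highlight` function are made up for the example.

```python
import html
import io
import keyword
import tokenize

# Lexical token types mapped to CSS class names (the names are arbitrary).
CLASSES = {
    tokenize.NUMBER: "number",
    tokenize.STRING: "string",
    tokenize.COMMENT: "comment",
    tokenize.OP: "operator",
}


def css_class(tok: tokenize.TokenInfo):
    """Return the CSS class for a token, or None for layout-only tokens."""
    if tok.type == tokenize.NAME:
        return "keyword" if keyword.iskeyword(tok.string) else "name"
    return CLASSES.get(tok.type)


def highlight(source: str) -> str:
    """Wrap each token of Python `source` in a <span class="...">."""
    # Absolute offset of the start of each line, to turn (row, col) into an index.
    offsets = [0]
    for line in source.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))

    out = []
    pos = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        cls = css_class(tok)
        if cls is None:
            continue  # newlines, indents etc. are copied through as plain text
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        out.append(html.escape(source[pos:start]))  # whitespace between tokens
        out.append(f'<span class="{cls}">{html.escape(source[start:end])}</span>')
        pos = end
    out.append(html.escape(source[pos:]))
    return "".join(out)
```

The output carries no colours or fonts at all, only class names, so all the styling can live in a CSS file that a publisher is free to rewrite.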