HTML Tags For Academic Printing?
meketrefi writes "It's been quite a while since I got interested in the idea of using html (instead of .doc. or .odf) as a standard for saving documents — including the more official ones like academic papers. The problem is using HTML to create pages with a stable size that would deal with bibliographical references, page breaks, different printers, etc. Does anyone think it is possible to develop a decent tag like 'div,' but called 'page,' specially for this? Something that would make no use of CSS? Maybe something with attributes as follows: {page size="A4" borders="2.5cm,2.5cm,2cm,2cm" page_numbering="bottomleft,startfrom0"} — You get the idea... { /page} I guess you would not be able to tell when the page would be full, so the browser would have to be in charge of breaking the content into multiple pages when needed. Bibliographical references would probably need a special tag as well, positioned inside the tag ..." Is this such a crazy idea? What would you advise?
You seem to be talking about LaTeX. It already exists. Don't reinvent it.
Congratulations, you're the 5,134,978th person to suggest a change to HTML which will prevent it from being reflowable!
Please step up to the spiked door in front of the acid pit to claim your prize.
As much as I hate Adobe, there's a reason why PDF files dominate academia.
LaTeX: it has everything that you are looking for and can be easily compiled to ps, dvi, pdf, and (I am told but haven't used) html. It even plays nicely with version control, bibliography management (BibTeX), etc.
As a bonus you can run it on Linux via the command line.
I am wondering if the whole concept of CSS modifying a set of stock tags is unwieldy, and if a simpler HTML might be one that allows you to first specify a page schema with custom tags, and then renders those using CSS to define custom tags. So, instead of having pages with div class = "menu", we might have , etc.
This is exactly what CSS is designed for, presentation. The CSS3 Paged Media module already defines a number of the properties and settings you're going for. It even includes positions such as @bottom-center to allow you to position footnotes and references. The only thing missing is a way to mark this up in HTML, which could easily be done with anchors and the longdesc attribute, coupled with the CSS content: property. What you're looking for is a CSS3 enabled browser, not a new specification.
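For what it's worth, here is a rough sketch of how the submitter's {page size="A4" borders="..." page_numbering="bottomleft"} idea might map onto the Paged Media module. Treat it as an illustration under assumptions, not a tested recipe: the margin order is my guess at what the "borders" attribute meant, the "startfrom0" part is left out, and margin boxes such as @bottom-left are generally honored by print formatters like Prince rather than by ordinary browsers.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Paged Media sketch</title>
<style>
/* Page geometry lives in the stylesheet, not in a new <page> tag. */
@page {
  size: A4;
  /* CSS order is top right bottom left; mapping the submitter's
     borders="2.5cm,2.5cm,2cm,2cm" onto it is an assumption. */
  margin: 2.5cm 2.5cm 2cm 2cm;
  /* Margin box from the CSS3 Paged Media draft: page number, bottom left. */
  @bottom-left {
    content: counter(page);
  }
}
</style>
</head>
<body>
<h1>Paper title</h1>
<p>Body text. The formatter, not the author, decides where each page ends.</p>
</body>
</html>
```

The point of doing it this way is that the same HTML can be re-paginated for Letter by changing one line of the stylesheet.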
LaTeX already got mentioned, and probably makes more sense.
If you really want an unreadable super-general XML-based format, use ODF.
I don't seem to understand why you couldn't simply change the properties of standard HTML tags to fit your needs with a simple CSS sheet. HTML, after all, was designed with the explicit purpose of representing a document.
Otherwise, if you want special tags, use LaTeX.
Otherwise, I'm sorry, it's really a crazy idea.
Is there a reason you don't want to use CSS? Because there are already CSS extensions that do exactly what you want. The book Cascading Style Sheets: Designing for the Web was written using only HTML and CSS and prepped for printing using PrinceXML. The PrinceXML web site has a bunch of similar HTML+CSS samples, including academic papers.
Something that would make no use of CSS?
Given that CSS does this already, what's the advantage of adding another way of doing it without CSS?
What you want (being able to define pages) is wrong in many many ways.
You should, in an authoring tool, never define a page or its dimensions, especially for academic works, which will be printed in different formats, on different paper (A4/Letter/Tradeback/etc/etc).
At most, whatever markup you have may define things like page breaks, but even then, they are more a typesetting issue.
What you want is either LaTeX or DocBook.
Static configurations are available already, just not the intelligent ones being requested. These have sufficed for what I needed:
To force a page break when printing, add: <p style="page-break-before: always">
Also, to hide the odd color and underline on links:
<style type="text/css" media="print"> a { text-decoration: none; color: black; } </style>
Yes, they have to be massaged a little.
Here's to losing my Karma Bonus again....
Create yet another little-used and poorly supported document format...
The document size will increase.
Normal text like ("this is my file and my image and my link and my e-mail")
will need more tags and elements so that the browser can interpret it,
e.g. ("this is mymy file") and so on.
This is just an unnecessary waste of bandwidth and time :),
especially when there are alternative solutions, e.g. PDF, DOC, OpenOffice.
cheers
HTML describes a document. "Document" used to imply printed pages, but it doesn't anymore. HTML doesn't have anything to represent the notion of a page because documents don't have pages.
Actually this makes a great deal of sense to me. I'm not sure about this but I think HTML5 contains tags for many of the things needed. I don't think CSS is the answer though as it is for presentation only. HTML is for hierarchy and structure of information as is XML. The part that makes sense with this is that it would be standardized (if you can keep Microsoft out of it) and could easily be transitioned back and forth between the web, ebooks and whatever device came next. PDF is widely used but truly it is a pain to convert into a structured document. Word is a nightmare with all of the jumbled up MS proprietary tags. I've yet to see an online editor that will clean up that mess with a simple copy and paste. The real issue is standardization in the way we store textual information. It's a huge issue and frankly Microsoft needs to be called on the carpet for manipulating and at the very least getting in the way of standards. Its refusal to recognize standards has caused needless expense to anyone that publishes information on the web. Few people realize the damage MS has caused on the web. Everyone bitches and moans about their operating system but only those directly involved in creating content for the web seem to complain about IE and their corruption of a standardized open document format. The damage they have done in this arena will haunt us longer than Windows will, in my humble but sincere opinion.
Seriously. It's pretty bad. You can, however, use DocBook (or your own schema or DocBook extended with your own stuff) and XSLT it into XHTML (or something entirely different) at the end.
Most likely you just want to use LaTeX though.
If you want to save the source form or markup, use a language designed for it: LaTeX. LaTeX lets you represent all the things you would want to represent in an academic paper, it's fairly readable, very widespread, and has tons of tools. And LaTeX converts to both HTML and PDF.
If you want to display on the web, use HTML. It's meant for the web. It's not a good representation for paged media. If you must represent paged media, you need to use CSS or XSL, but you probably don't want to.
If you want archival quality paged representations, PDF is the only game in town really. HTML with CSS doesn't come close. But it doesn't make sense to save your own papers only in PDF because PDF is not really editable and doesn't have the semantic information.
From JavaScript Site Page breaks with css
You wouldn't want to use HTML for something like this, especially with newer versions of HTML. There has been a steady transition in HTML away from specification of the aesthetic appearance of a page. For this reason tags like and are now considered nonstandard, mostly because CSS does a way better (and cleaner) job of it.
I use XML/XSL to render my content as needed - including images and SVG graphics where needed. Then I use the FOP project to convert the generated XML-FO into PDF. Works great and can be scripted easily. But the learning curve is kinda steep. Luckily there are a few tutorials out there.
Yeah, seriously? This is not a valid Slashdot article. PDF and numerous other formats exist for a reason. Why reinvent the wheel? No reasons were stated in this article why any of the other, very popular open standards for documents couldn't be used.
Ugh... who submits these articles?
I used Netscape Communicator to write all my papers for uni, mainly because it was available under Windows and Unix (IRIX in our case) and its output could be read by anyone on any platform.
It was a reasonably easy to use editor, without all the useless crap most others have.
A few lecturers were quite impressed with the idea, the portability and cost were big factors.
...
There's a little-used standard that came out of the W3C along with XSLT called XSL-FO. You write your document in XSL-FO markup, and then use one of any number of processors like XEP to convert it into PDF or what have you.
http://www.w3schools.com/xslfo/default.asp
One of the original purposes of it was so that you could use XSLT to transform the same XML data into either XHTML or XSL-FO for publishing. The standard never took off though. XSL-FO just doesn't have enough options to be typographically interesting, compared to SVG.
Of course, the right answer is LaTeX, but you might want to give XSL-FO a try for familiarity's sake.
I think writing papers using XHTML and CSS 2.1 or 3 is a good idea. Then you can use Prince XML to convert it to PDF. Their site has a nice sample or two of journal articles / conference papers. The quality of the renderer is great. It was even used to create a professional book, Cascading Style Sheets: Designing for the Web.
DocBook has exactly what you need: an HTML-like format, but tagged by meaning, not presentation. The project has tools to convert it to printable formats.
The spec: http://www.docbook.org/
The tools: http://docbook.sourceforge.net/
Someone mentioned XML/XSL/FO. Don't try to write your content in XSL-FO. You'll hate every minute of it.
I'd look into using DITA (Darwin Information Typing Architecture). It's a set of canned XML structures, plus a specification for how to process and customize those structures. It includes tags for stuff like footnotes...I bet it covers a lot of your use cases. There are some good intros to how these XML structures work here: http://dita.xml.org/book/dita-wiki-knowledgebase
As DITA is XML, you can convert it to HTML and whatever else you feel like, pretty easily. There's an open-source implementation of the DITA spec called the DITA Open Toolkit (http://sourceforge.net/projects/dita-ot/). The DITA Open Toolkit includes stylesheets/scripts to publish HTML and PDF, among other things. PDFs are published via XSL-FO. Just like HTML needs a web browser to render something useful, XSL-FO requires a FO processor to create a PDF. So, in the end you write DITA, XSLT and other scripts transform that DITA to XSL-FO, then a FO processor consumes the XSL-FO and spits out a PDF. The DITA Open Toolkit comes with an open-source FO processor (Apache FOP). FOP doesn't fulfill everyone's needs, but it might work very well for you.
Unfortunately, working with the Open Toolkit and customizing its output can be a bit unwieldy. http://groups.yahoo.com/search?query=dita+users is a pretty good place to look for help.
Don't reinvent, as so many have already said. CSS works for print media, LaTeX works wonderfully, PDFs work wonderfully. RDFa lets you really define the semantics of anything (people, businesses, bibliographic data) in a workable way.
LyX + LaTeX ... DUH!
It even makes it easy to take public domain OCR'ed books and reset them into something extremely nice.... *quickly*
Yes, LaTeX is nice, but it would be even better if basic TeX were
understood by browsers. About 10 years ago, IBM had a cool plugin called techexplorer.
The plugin would compile LaTeX on the fly. No need to publish a PDF. It worked
pretty well for basic documents which did not rely on macros.
Still, to address the question of the submitter, it would be nice to have something like
<latex>
$\int_0^1 \frac{\sqrt{\sin(x)}}{1+x^2} \; dx$.
</latex>
It would not have to be the full LaTeX stack, just the ability to place mini LaTeX pages into
HTML documents. It's a pity the techexplorer technology seems to have disappeared. If IBM would
open-source it, it could become an add-on for Firefox.
What is it you're trying to accomplish? Non-standard HTML is certainly not a solution for whatever printing problem you're having, and it eliminates the benefits of HTML. Listen to everyone else that's responded. LaTeX solves most gripes people have with word processors, stick with CSS if you have a compelling reason to use HTML, and look into Docbook XML if you're not happy with the first two options.
If you want to use HTML just to prove it can be done, go for it if you think it sounds fun. But if you're serious about using it for publishing, forget it. No one's going to accept a homegrown HTML file for printing.
What size paper would we all agree upon? You listed "A4", I like "Letter". Close in size, but different. Get the world to agree, and maybe you are one step closer to your wish. I'd not vote for "Business Card" sized.
This is exactly what HTML was *not* intended to be. We're talking about viewing a document with different browsers. No standard display is guaranteed, no matter what you try. For academic documents use software like LaTeX and create a PDF file, or use Microsoft Word and create a .doc file, or whatever. I remember reading a discussion somewhere of why LaTeX cannot be mapped exactly to HTML (maybe it was the TeX FAQ, not sure), and that was pretty much it. Different goals in either case.
As mentioned by everyone else in this thread, LaTeX is exactly what you're looking for. HTML is absolutely not, and should never be made into, a page description language.
The editor of this Slashdot summary should be ashamed for not being familiar with LaTeX, one of the greatest open source projects.
See, as someone has already pointed out, there's at least one such tool that's in wide use already: TeX and LaTeX. If you don't like that one, it turns out that HTML, with CSS and a little bit of Javascript, is perfectly capable of doing all the things you want, too. You just have to learn how. Have a look at Lie's Cascading Style Sheets: Designing for the Web (written and typeset in HTML/CSS) and at Prince XML for detailed examples.
If you want to print HTML, Prince is the way to go. It even makes our end-user-generated TinyMCE documents look good.
I had the same idea as the OP. While looking, I found LaTeX, and I find it nearly perfect for writing pretty much anything. However, there is one point which makes it mostly unusable for normal people: themes.
While writing in LaTeX is easy and powerful, in order to theme (typeset?) a document you have to suffer quite a bit: read docs, learn lots of stuff, etc. I believe what the OP wants is to be able to easily write documents (HTML) but also to easily create a presentation (CSS). Think about it: CSS is easy, simple and clean, and it could be an awesome companion to something like LaTeX or any other markup language. There are a lot of styles for LaTeX that let you create many kinds of documents; however, when you want to customize some part of the presentation (like: add a section with a little image to the right and a yellow border) you are in a world of hurt.
I have yet to find an easy way to create print documents and have a good control over the presentation. So far the closest thing are word processors, but I hate the broken visual editing (I prefer to stick with good old code syntax).
The ability to cite an HTML document is something that would indeed be useful. The ability to hard code page numbers into an HTML document isn't. The reason why academia and the press have been so resistant to HTML, historically, is that you don't get any control over page layout. Which means that you can't refer to things by page number.
The solution isn't to fix HTML so that you can number pages. It is to fix the bibliographic references to not use page numbers. Generally speaking, it's not hard to number documents by section, and you can make the numbering fine-grained enough for bibliographic references. Then refer to the chapter and section, rather than the page number in your bibliography, and you're done. No need to "fix" HTML.
It might make sense to ID paragraphs in HTML, so that you could simply refer to the paragraph ID in your bibliography. If this were simply document metadata, and didn't have anything to do with layout, it would work pretty well. As a bonus, you wouldn't need to renumber, because the ID would just be an arbitrary cookie, and wouldn't need to make sense to a human.
Of course, with hypertext, there's really no need for a bibliography anyway. Just link to the text you're referencing... But I realize that that's impractical in academia at the moment. I'm just saying...
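To make the section-numbering and paragraph-ID idea concrete, here is a minimal sketch. Everything named in it is made up for illustration: the id values, the "sec-"/"p-" prefixes, and the file name paper.html are assumptions, not an existing convention.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Citable sections and paragraphs</title>
</head>
<body>
<!-- Sections carry human-readable numbers; these are what a bibliography cites. -->
<h2 id="sec-2">2. Method</h2>
<h3 id="sec-2-1">2.1 Sampling</h3>

<!-- Paragraph IDs are arbitrary tokens, so inserting a paragraph
     above does not force any renumbering. -->
<p id="p-7f3a">First paragraph of the section.</p>
<p id="p-c91e">Second paragraph, the one being cited below.</p>

<!-- A citing document points at the fragment instead of a page number.
     paper.html is a hypothetical file name. -->
<p>See <a href="paper.html#p-c91e">Author, section 2.1</a>.</p>
</body>
</html>
```

None of this touches layout, so the same reference stays valid whether the paper is printed on A4, on Letter, or never printed at all.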
Since HTML won't add new tags for you, you could write your paper as XML, and use a stylesheet to display it in whatever fashion you want. That way you could have a "one column stylesheet", a "two column stylesheet", etc., formatting the same XML document in your favourite way of presenting it :)
You can embed CSS in HTML pages; that should do what you want, if you have another way of dividing up the right amount of information per page.
Although this is slightly more complicated, I'd look to an XML/XSLT/CSS solution instead. It would enable you to take a source document, split it into pages by paragraph or size, and then format those pages, all while keeping the raw data in XML in the case the user wanted to use another reader.
SGML pre-dated HTML; in fact, HTML was defined as an application of SGML.
I suspect the poster never heard of SGML, or its predecessor GML.
Here's a link to a good book on the subject in Google Books: The SGML Book
There is also DocBook and LaTeX.
Ken
Why are obvious trolls being posted as if they were serious questions?
Some versions of Microsoft Word have a really cool feature called "Save as HTML". Saves simple one-page documents as fantastically redundant HTML in less than a terabyte --and you might even get a cute little paperclip to help you through the process!
HTH.
"Anonymous Cowardon"? What the devil are you talking about?
Obviously the only sensible robust solutions to this problem are either LaTeX or DocBook. The main problem with both of those is they're kind of painful to author. What I've switched to for any quick documents I write is reST. It's easy to learn for quick documents, you can edit with just about anything (its rudimentary tables support is best handled with emacs), it includes features like footnotes, and it is easy to render into HTML and PDF. After a few months of writing docs for some projects I work on in reST, I've found myself even writing all my random notes in that form, so that I can generate nicely printed versions of them at any time.
... you want \LaTeX
Aeroespacio.org
As the previous poster said: in the context of HTML + CSS have a look at 'Prince XML' (http://www.princexml.com/overview//). From the website ('why type if you can copy & paste' ;-) ): "Prince is a computer program that converts XML and HTML into PDF documents. Prince can read many XML formats, including XHTML and SVG. Prince formats documents according to style sheets written in CSS. Prince is available for several platforms and is easy to download and install. We offer a free Personal license for interactive use on a single computer."
I have used it successfully for some personal projects and if you're already somewhat familiar with HTML & CSS it's real easy to get into. Don't forget to check out the examples on the site.
<html>
<head>
<title>Abstract of a usable design</title>
<style type="text/css">
@media print {
body { margin: 2.5cm; }
}
@media screen {
body { margin: 50px; width: 50%; }
}
body { font-family: sans-serif; font-size: 12pt; }
</style>
</head>
<body>
<h1>It's so crazy it just might work</h1>
<h2>and other html inspired musings</h2>
<p>Why not just use css?</p>
<p>Also, don't worry about page numbering. that's the browser's job.</p>
</body>
</html>
To echo some of what's already been said, if you really want a format that will look the same for all clients, HTML is not the answer. The problem is that HTML gives too much formatting control to the VIEWER, allowing one to change the font size, change the screen (or paper!) size (think everyone prints on A4, or on US Letter? Think again!!), or even the entire font. If you really want your report to look the same, export to PDF or use a real typesetting language.
That said, if you really want to use HTML, look more closely at the "orphan control" CSS options. Used properly on your p or div elements, they can help ensure that your paragraphs or sections line up nicely on separate pages, no matter what sizes those pages end up being, or what font they end up being rendered in. If what you really want is to keep your writing from becoming visually fragmented, this may very well do the trick for you.
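A minimal sketch of those "orphan control" options, using properties from the CSS 2.1 paged media section. The thresholds of 3 lines and the div.figure class are arbitrary choices of mine, and how faithfully a given browser or formatter honors them varies.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Orphan and widow control sketch</title>
<style>
@media print {
  p {
    orphans: 3; /* at least 3 lines of a paragraph stay at the bottom of a page */
    widows: 3;  /* at least 3 lines carry over to the top of the next page */
  }
  /* Try not to strand a heading at the bottom of a page. */
  h1, h2, h3 { page-break-after: avoid; }
  /* Try to keep a figure and its caption on one page (hypothetical class name). */
  div.figure { page-break-inside: avoid; }
}
</style>
</head>
<body>
<h2>Results</h2>
<p>A long paragraph that may end up straddling a page boundary.</p>
<div class="figure">
  <p>Figure 3: caption text.</p>
</div>
</body>
</html>
```

These rules only state preferences about where breaks may fall, so they keep working whatever paper size or font the reader ends up printing with.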
Why did the OP specify not using CSS and then suggest an HTML element that looks almost exactly like CSS?
CSS has a method of creating pages, for printing and more. It's no more difficult to learn than HTML is. You could use XML, create all the custom tags you want, and use XSL (oh look stylesheets again) to style the XML however you want.
HTML5 is coming out in the near or distant future; if you have suggestions for tags and functions, you might want to try to get involved with the W3C.
I agree with you that the OP doesn't seem to have thought this through.
I think the ODF way is still the best for archiving, keeping content and markup data separated (separate XML) and together (single ZIP). And it is an open standard with freely available software - not just OpenOffice.org.
There is a Single OpenDocument XML File variant for ODF, where everything is a single XML file (root tag office:document) if it really has to be.
I'm not sure if this is precisely what you are looking for, but you might want to check out the various "structured text" systems. For example, "reStructuredText":
http://docutils.sourceforge.net/rst.html
These different browsers render HTML differently.
That's entirely counter to the philosophy of HTML. It's meant to be independent of presentation so it can be presented in different ways. There's good reason for CSS being separate and for this sort of thing being in there. If you want a way to link to the point where a page break occurs in a particular print copy, I'd suggest adding <a> elements at the locations of the page breaks, like this: <a id="page1" />
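A rough sketch of that page-marker suggestion, assuming plain HTML rather than XHTML (so the anchors are written with an explicit closing tag, since an HTML parser treats a trailing slash on <a> as an ordinary opening tag). The id values and the file name paper.html are made up for illustration.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Print-edition page markers</title>
</head>
<body>
<!-- Each empty anchor records where a page of one particular
     print edition began; it has no effect on layout. -->
<p><a id="page1"></a>Text that opens page 1 of the print edition, running on
until the sentence where that page ended
<a id="page2"></a>and the text of printed page 2 begins.</p>

<!-- Someone citing "p. 2" of the print copy can link straight to it.
     paper.html is a hypothetical file name. -->
<p><a href="paper.html#page2">Jump to printed page 2</a></p>
</body>
</html>
```

The markers describe one specific printing only, so the document itself still reflows freely on screen or on other paper sizes.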
reStructuredText is made for what you're talking about
Documents are written in human-readable text format - good for storing in version control and using for diffs
Python Docutils is used to convert to HTML and/or LaTeX and a few other formats
rst2pdf is a tool that converts to beautiful PDF (easier than using Docutils + LaTeX)
One problem with HTML is that images are kept as separate files. You would also need to have them embedded in the file, I guess.
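For vector figures, at least, one possible way around the separate-files problem is to put SVG markup directly in the page. This is only a sketch, and it assumes a browser with HTML5-style inline SVG support. Raster images would need a different trick, such as base64 data: URIs in the src attribute, which is not shown here.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Self-contained figure</title>
</head>
<body>
<p>Figure 1 travels inside the document file itself:</p>
<!-- Inline SVG: a frame with one diagonal line, no external file needed. -->
<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80" viewBox="0 0 120 80">
  <rect x="10" y="10" width="100" height="60" fill="none" stroke="black"/>
  <line x1="10" y1="70" x2="110" y2="10" stroke="black"/>
</svg>
</body>
</html>
```

The trade-off is the one already raised in this thread: embedding makes the file bigger and harder to edit by hand, in exchange for being a single artifact you can mail around.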
The Opera CTO Håkon Wium Lie had a talk at our university and I think just to prove a point, he wrote one of his books in HTML. Go look into that.
HTML doesn't work well with pages, rather instead use generous anchors with a consistent naming scheme. For example, using headers to separate sections, you could label the anchors by section, subsection, paragraph numbers.
But as others have said, use LaTeX and/or LyX.
Why do you want to force the page breaks? It's stupid. HTML is intended to render correctly independent of the resolution, so independent of number-of-characters-that-fit-onto-a-page. Suppose someone gets your academic paper, but he is a bit blind. So he sets character size to 15pt, and prints it (for reading during a hypothetical train ride).
Someone else is concerned with the environment, has good eyes, and prints double sided with a 9 pt font.
Generating documents that handle this well, means you have to take care that you refer to "fig 3" and not to "the figure on page 2", things like that.
You can write your document in plain text then transform it into HTML or PDF etc using docutils.
http://docutils.sourceforge.net/docs/user/rst/quickstart.html
The entire Python documentation is written in it.
span.by:before {
content: ' ';
}
It is ironic that HTML, originally developed precisely to make it easy to mark up academic and technical information for publication, has never moved beyond the extremely bare-bones specification of heading, list, term, and paragraph tags. I would have expected some elaboration over time, but HTML seems frozen in time.
What happened, I think, is that people basically ignored HTML and went straight for word processing, a far more complex beast from a specifications point of view. For the past 20 years, we have been letting HTML languish while we attempt to come up with a document specification. ODF is just the most recent cycle of this effort.
It is unfair of the parent to pin the blame for this on Microsoft, however. Word processing was one of the "killer apps" at the time of the birth of the web, and Word was just a niche player at the time. No, we went straight for the jugular of word processing because we all wanted to print to paper.
The OP is absolutely correct to think about revisiting HTML as a specification. What I hate about all reader-dependent formats (DOC, ODF, PDF, ...) is they force the user to completely leave the context of the web page just to view some data. The browser is the only "reader" I should need. If you can't at least embed on the page, fuggetaboutit. The gold standard is compound content with full document flow. Why oh why can't we come up with a simple way to blend content without drawing frames and putting scrollbars within scrollbars!?
Personally, I'd love to see some formality and general adoption of richer semantic markups such as the microformats hCard, hCalendar, etc. I'd also love to see some richer hierarchical markup; simple lists only take you so far! I'm imagining something with the hierarchy of XML but without the complexity of full extensibility, and all the definitional parts of a specification needed to support that (schemas).
The Atom Publishing Protocol is the perfect example of what I'm talking about: extensible, but easy to use because it comes with a well-chosen set of standard elements and attributes.
"We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
XML gives you very, very long-term support. The only problem: you have to create the reader... but the reader can be extended without limit, and humans in the future will have an easy way to use the document!
Use LaTeX!
It's a much more productive way to write a document.
XML is like violence: if it doesn't solve the problem, you're not using enough.
Meta will eat itself
HTML, especially XHTML, can already do what the OP describes, but browsers don't support all the bells and whistles needed for paper-like paged rendering. CSS goes some way towards meeting the deficiencies, but the end user still retains sufficient control to (perhaps unwittingly) defeat almost any attempt to force pagination and placement. It is tedious, but by no means impossible, to write documents of considerable complexity in HTML, as I pointed out long ago, but page support requires browser cooperation.
The only reliable answer at the moment is to provide multiple formats generated from a single source. An XML master (DocBook, TEI, whatever) can be used with XSLT to generate LaTeX source code for making a PDF, and the pagination data can be re-used in a subsequent XSLT script to generate paged HTML. The problem is the XML and LaTeX editors, which are unsuited for writing unless you learn about XML or LaTeX markup, and even the relatively smart ones don't implement a lot of the features needed for complex structured writing (<plug>come to Balisage to find out why</plug>).
LyX and similar editors (Scientific Word, Textures) provide synchronous typographic interfaces to LaTeX, and TeX4ht provides excellent conversion to web pages and other formats. Even Word and OpenOffice, when used with named styles (with utter rigour) can be converted reliably to HTML, LaTeX and other outputs.
The last thing on earth we need is to increase the size of the HTML tagset: HTML5 is already suffering from bloat.
You are trying to reinvent DocBook. Not only is everything you want done, it is implemented in several tools (XMLMind and oXygen are two I know of), has a standard method of converting it to any form you want (XSL, XSLT, XSL-FO), and there are tools that are already written to take advantage of those standards (Apache FOP being a FLOSS one). The latest version of DocBook uses XML namespaces, so you can mix in other markup languages as well; the canonical example is DocBook + MathML + SVG, which covers 99.9% of the math/science based literature out there. BTW, if you DO plan on going down this path, I suggest picking up a copy of XSLT, 2nd edition by Doug Tidwell. The latest version of the DocBook book is supposed to be out in August; don't buy the version currently on sale, it is 10 years old, and does NOT cover the current version of DocBook.
Hacked up HTML as our universal word processing format? That has to be the most naive -- no, the dumbest fucking idea I have ever heard in my entire life. For a full explanation of why, take 50 cents and go buy an education. It should be obvious to anybody that HTML was a complete failure in interoperability and it is one of the clumsiest protocols to try and use when it comes to content presentation. As others have already pointed out repeatedly, there already exist better mousetraps for this problem anyway.
On a very practical note, you'd literally be laughed out of any print shop where you showed up with your earnest little smile and a USB key with an HTML file, and an expectation of getting an accurate print-out.
If http://en.wikipedia.org/wiki/Rich_Text_Format is to be believed, RTF owes more to TeX than to SGML, and it doesn't look like SGML to me at all.
RTF is a pure Microsoft "standard", and its versions reflect the respective capabilities of the current version of MS Word.
That notwithstanding, RTF is implemented relatively well in most word processors. If you restrict yourself to relatively simple formatting, there shouldn't be a lot of problems.
http://www.gnu.org/software/groff/
http://en.wikipedia.org/wiki/Groff_(software)
The reason why academia and the press have been so resistant to HTML, historically, is that you don't get any control over page layout. Which means that you can't refer to things by page number.
LaTeX is the same, more-or-less. You can refer to page numbers in the PDF though, which is the "final product" you actually exchange. Also, academic works tend to be very structured into sections and subsections (something LaTeX does nicely for you) which in some ways eliminates the need to refer to page numbers.
At university in the 1990s, I found that most of my professors and TAs preferred typewritten manuscript or a good approximation. So I edited non-paginated, single-spaced ASCII text in Emacs, using the Emacs text justification mode to reflow paragraphs with line breaks. I used enscript to print to school printers while inserting header/footer data and page numbering. I used the enscript options to also spread the text out into a double-spaced print format with a fixed-pitch courier font. The teaching staff liked it, and I didn't waste time on silly presentation issues.
For cross-references, I found most professors accepted any reasonably standard reference variant, so I chose a liberal arts format with end notes and "(AuthXY)" references to author name (abbreviated) and year of publication. I opened the same text file in Emacs with two Emacs windows so I could keep one editing cursor where I was in the main text, and easily insert new references and see both the entry and the reference key in the main text at the same time. Since they were not numbered sequentially, there was no issue of relabeling them every time I inserted more references. This left me with a nice editable "source file" that looked good on screen, and totally functional print formatting. Once in a while, I inserted a literal ^L character to force pagination boundaries.
I started using LaTeX when I needed to work with lots of quasi-mathematical notation in both Computer Science and Philosophy, including many subscripts and superscripts on deeply nested bracket forms. I later learned to insert vector graphics, and applied my computer knowledge to have a nice Makefile to rebuild my full LaTeX document from source files. You haven't lived until you do this under a version-control system and then start collaborating with multiple authors checking out and incrementally modifying the source files.
I found it liberating to work with textual source formats and ignore formatting for the most part. And this is after growing up with WordStar on CP/M and then learning proper use of paragraph styles with AmiPro (?) on Windows 3.0/3.1. I am aware that my education and career have changed my whole way of thinking about information workflows. Unfortunately, I have yet to see how to bring the fruits of my labors to the masses, as it seems you have to undergo this wholesale cognitive conversion before you can start to appreciate better ways of managing your information. It is too easy to take an "easy" way out and then find yourself trapped with poor quality tools and methods.
What you want to do already exists. The W3C released media specific stylesheets, which allow you to create an HTML page with CSS optimized for the specific media you're using. It's most frequently used to create "printer friendly" versions of webpages without having to maintain two separate files. There's even an author who used HTML/CSS to create a book.
Practical information about using media specific stylesheets can be found at these articles:
Printing a Book with CSS: Boom! by Håkon Wium Lie, Bert Bos
ALA's New Print Styles by Eric Meyer
W3C information:
CSS Print Profile
CSS3 Module: Paged Media
XHTML-Print
The purpose of HTML is to display adequately (optimally?) across different display sizes (and resolutions). If you want the opposite (fixed size), there are other formats better suited, like PostScript, PDF and LaTeX, among others. Do not reinvent the wheel.
The post above about LaTeX is right, but if you don't want to learn a whole new standard, you could read this: http://www.alistapart.com/articles/boom (it's all about printer style sheets). One cool thing about HTML that LaTeX does not have is automatic media-type style sheets: you can redefine the look of your content with a media-specific style sheet. While I can do all that with LaTeX, I have to run it through a processor first. But that multimedia merit aside, if you want to get into really complex typesetting issues, LaTeX is the way to go. If you just want to use it as a word processor, I would suggest using some sort of WYSIWYG editor for it and a PDF printer.
Our technical order system uses chapter/section/paragraph numbers in addition to page numbers; the page numbers are completely useless as they can change with new updates but the content always stays under the same paragraph.
So you have chapter 7, section 2, paragraph 5: 7.2.5, How to do this task, and then 7.2.6, How to inspect this task, etc. If an update comes up for the task, the page is replaced and the new paragraph is added inline: 7.2.5.1, Something we forgot about this task.
With electronic documents becoming more and more widespread, I would assume that this numbering would become more popular. It's easy to use, fine-grained (there are multiple paragraphs per page=more precise notations), and you can expand on it in place. No more, "Ok class turn to page 119, except for those of you who have the textbook with the lion on the cover, you'll need to read the first paragraph on page 117 and the middle paragraph on page 120."
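For an HTML document, the display side of this scheme could be approximated with CSS counters. This is only a rough sketch, assuming a flat markup where chapters are h1 elements, sections are h2 elements, and the numbered paragraphs are p elements that follow them as siblings; none of that markup comes from the technical order system itself.

```css
/* Sketch: assumes chapters are <h1>, sections are <h2>, and numbered
   paragraphs are <p> elements that follow them as siblings. */
body { counter-reset: chapter; }
h1   { counter-increment: chapter; counter-reset: section para; }
h2   { counter-increment: section; counter-reset: para; }
p    { counter-increment: para; }

/* Prefix each paragraph with its chapter.section.paragraph number, e.g. 7.2.5 */
p::before {
  content: counter(chapter) "." counter(section) "." counter(para) " ";
}
```

These numbers are recomputed on every render, so inserting a paragraph renumbers everything after it. To keep 7.2.5 stable across updates the way the technical orders do, the number would probably have to be stored in the markup itself (as an id, say) rather than generated.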
My 2 cents.
-b
No offense, but I've stopped responding to AC's.
I have a sneaking suspicion that when the OP is saying things like "no CSS" and doesn't mention LaTeX, s/he is actually giving specifications in a very obfuscated way -- specifications that need to be deduced. What I take from the post is that the OP wants
Write everything in DocBook XML, and add an XSLT to have it automatically transformed into valid (X)HTML.
There are programs to convert it into things like PDF and RTF, but you can keep things in straight DocBook for the canonical version.
The trend (and not an unjustified one) is to separate structure from presentation in HTML so that the web can be browsed and searched in a more meaningful way: CSS is used for display, and tags are used to mark up structure.
Since what you are suggesting is a mix of structure and display needs, it doesn't make a lot of sense to introduce it all in HTML; it would be like going back to the '90s. Nevertheless, it can already be achieved by separating the components you suggest into structure and display, and surely you will find solutions for what you intend using a combination of existing HTML and CSS techniques.
Also, HTML was not created for and is not primarily oriented toward printed documents; its native purpose is on-screen display, so it's not surprising that these features are not naturally supported in the core of the language.
while you can mark up your HTML with CSS for print media, why bother? when i send documents around i almost always send PDF's since they'll look the same in just about every reader. if it's something somebody else needs to edit, then i usually go with an MS Word document, which is a very portable format these days.
CSS3 does all you want: http://www.w3.org/TR/css3-page/
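As a rough illustration, the attributes the original poster wanted on a "page" tag map onto that module something like this. It is a sketch, not a tested stylesheet: it assumes a formatter that implements the @page margin boxes (support in ordinary browsers is patchy), the margin order is a guess at what the "borders" attribute meant, and using h1 as the chapter heading is an assumption too.

```css
/* Sketch only: the margin order (top, right, bottom, left) is an assumption
   about what the original poster's "borders" attribute meant. */
@page {
  size: A4;
  margin: 2.5cm 2.5cm 2cm 2cm;

  /* Page number in the bottom-left margin box */
  @bottom-left {
    content: counter(page);
  }
}

/* Start each chapter on a new page (h1 as chapter heading is an assumption) */
h1 {
  page-break-before: always;
}
```

The formatter still decides where each page fills up and breaks, which is what the original poster expected the browser to do.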
In pmwiki, you can include LaTeX math with this:
http://www.pmwiki.org/wiki/Cookbook/JsMath
I've used it for some time and highly recommend it
HTML is about semantic content, not presentation. It is joined at the hip with CSS for presentation through the browser. Print is a form of presentation.
If you want your structure and presentation intertwined then use ODF.
If you want them separated:
For structure use the book inspired DocBook, or the journal inspired (and generally more flexible) DITA.
To format either of these for presentation (either on screen or in print) you can either use an adaptive layout with HTML+CSS or a predetermined layout with XSL:FO.
Can't think of any way of avoiding CSS as all three solutions use it.
> The ability to hard code page numbers into an HTML document isn't [useful]
BS. As always, a slashdotter recommends changing the way the world works to fit the available tools, instead of wondering how the tools could be fixed to suit the world, like the original poster.
Getting an HTML rendering engine to yield a page number, given an arbitrary reference inside a document is a SMOP. And waaaaay easier than convincing every author in the world to "number documents by section, and you can make the numbering fine-grained enough for bibliographic references".
By the way, I agree with others, LaTeX is some really wonderful technology. '80's technology.
Parent's paper size is for US letter (8.5 x 11 inches). For A4 (210 x 297 mm) with 25mm margins use:
\geometry{papersize={210mm,297mm},total={185mm,272mm}}
\pdfpagewidth 210mm
\pdfpageheight 297mm
Crap, should have doublechecked my own work! Margins in the above are wrong (unless you like 12.5mm margins).
Parent's paper size is for US letter (8.5 x 11 inches). For A4 (210 x 297 mm) with 25mm margins use:
\geometry{papersize={210mm,297mm},total={160mm,247mm}}
\pdfpagewidth 210mm
\pdfpageheight 297mm
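For anyone checking the numbers: the text block is the paper size minus the margin on both sides. The first attempt subtracted 25mm only once (210 - 25 = 185, 297 - 25 = 272), which is where the 12.5mm margins came from.

```latex
\begin{align*}
\text{width}  &= 210\,\text{mm} - 2 \times 25\,\text{mm} = 160\,\text{mm} \\
\text{height} &= 297\,\text{mm} - 2 \times 25\,\text{mm} = 247\,\text{mm}
\end{align*}
```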
LOL
Er, no. HTML is meant to render at arbitrary sizes. The term "page" doesn't mean anything in that context. You don't need to convince "every author in the world" of anything - just convince the ones who are using HTML as their publication medium. Or use my other suggestion - paragraph identification metadata.
Yes, this is such a crazy idea.
Isn't that what XML is for? Document is the data. I'm just saying....
"Any sufficiently advanced technology is indistinguishable from magic." - Arthur C. Clarke
Will XHTML run on the old Commodore 64 web server running Contiki (http://www.c64web.com/)?
I think contiki only does pure html ?
Ala:
<!-- rel="stylesheet" is required; two different title attributes would make these alternate style sheet sets, so they are left off -->
<link rel="stylesheet" href="http://www.michaeljacksonsmissingnose.com/screen.css" type="text/css" media="screen">
<link rel="stylesheet" href="http://www.michaeljacksonsmissingnose.com/print.css" type="text/css" media="print">
Then in your print stylesheet, for print specific crap, you just do:
@media print {
a {
text-decoration:none;
color: black;
}
}
Or whatever it is you wanna do. I mean, it's up to you as to how you want to define the size of your page, and they COULD add a feature to define the size of the page, but that's better handled in CSS than HTML. The entire point of CSS is to change the way the document looks or is formatted without having to create a separate document for each way you want the document to be viewed. Adding HTML tags is the exact WRONG way to go about this.
With paper publications, the publisher decided on a page size, and Vol 1, Issue 27, page 341 was the same for everyone.
The web has to be readable on everything from an iphone to a 2000 x 3000 pixel display.
Page is not relevant to web browsers.
As far as I can tell, the OP's big problem is the issue of bibliographic citation. How do you cite a particular point in the text, ideally in a way that can be done both automatically on computer, and by people reading the paper copy.
Number the paragraphs.
Just as the various flavours of TeX have prescribed macro packages, an academic journal could have a prescribed CSS style sheet.
A citation then is in the form of author, article, paragraph number instead of page number.
There are details to hammer out: Are tables given their own numbering or are they considered a paragraph. (Can be a real problem with floats.) Illustrations/figures, section and subheads? Stuff that has z-levels?
Third Career: Tree Farmer Second Career: Computer Geek First Career: Teacher, Outdoor Instructor, Photographer.
We searched for ages for a tool to produce high-quality print output from HTML for exactly the same reasons before stumbling on Prince (http://www.princexml.com), and we haven't regretted adopting it. We use it from wiki pages, for technical and sales documents, and for theses. It is CSS3-aware, but the underlying documents still work in most browsers.
Apply grey superscript paragraph numbers, and use those to refer to the text, instead of page numbers. This resolves the problem of varying output devices which is the absolute show-stopper for incorporating pagination into html.
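Something along these lines would probably do it with plain CSS counters. It is a minimal sketch: the "article" selector and the counter name "para" are assumptions about the markup, not anything a journal has prescribed.

```css
/* Number every paragraph inside the article body.
   The "article" selector is an assumption about the document's markup. */
article {
  counter-reset: para;
}

article p {
  counter-increment: para;
}

/* Show the number as a small grey superscript before the paragraph text */
article p::before {
  content: counter(para) " ";
  font-size: 0.7em;
  vertical-align: super;
  color: gray;
}
```

The numbers are generated at display time and are not part of the markup, so software cannot link to them. For machine-readable citations each paragraph would also need an id of its own.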
-I like my women like I like my tea: green-
Get used to disappointments.