Domain: tei-c.org
Stories and comments across the archive that link to tei-c.org.
Comments · 26
-
Re:Master format for e-books?
-
Digital storage
I believe digitization of our entire literature is the goal. Think big.
But please don't use MS Word or something like that in the process. Whatever gets digitized should stay readable for many years, no matter which hardware platform or software is used.
Using formats like DocBook or TEI saves a lot of future work when the book has to be converted to whatever data format is currently in fashion.
-
TEI
It's funny -- just as this /. story came up in my newsreader, I was reading through the Text Encoding Initiative's introduction to TEI, trying to learn about it for implementation in the Virginia Quarterly Review's archives.
TEI is rather an old standard, as I understand it, that serves as a markup standard for the sort of documents that you might find in a library -- books, articles, letters, etc. Rather than using full-on SGML, or an invented XML standard, TEI exists for marking up documents and describing how a document's different parts relate to each other and how different documents relate to one another. It's delightfully simple, and very much like HTML, only richer. I gather it's the primary standard for such things.
Those who wish to find out more can visit the Text Encoding Initiative Consortium's website.
-Waldo Jaquith -
Re:law of averages?
I agree, PG text format is not good for reproducing the features of a printed book. Much better to use something like TEI. However, marking up in a semantic format raises a hairy issue: the proofreader needs to interpret the meaning of textual elements (such as italics which are used for a foreign language term - that's different from italics used as emphasis). That requires more training than simple PG markup. And of course there is the issue of a decent user interface...
Having said this, maybe these problems can be overcome. Any suggestions?
-
Re:formatting
The problem you raise is not so easy to solve. While it sounds nice to separate content from presentation, in many cases the presentation is part of the content. Take the indentation of poetry for example, or for a more specific example, e. e. cummings. Once you wade into these areas you start talking about marking up the text, which is a tricky issue. The Text Encoding Initiative has been hammering out a solution for a decade, but the learning curve is steep.
As much as I think the project is digging itself into a hole with hard formatting, I can understand why they do it. The alternative is a nasty can of worms. -
Two things I'd need before I'd buy a reader
While it would be really nice to have an e-book reader with a really good display, great battery life, and the rest, I couldn't really see spending money on one unless it had a couple of critical features (one technical, one partly technical and partly social).
The first one is the ability to load any content that I already have in electronic form. Most particularly, I'd use this for reading Gutenberg public domain e-texts, and for storing and reading webpages offline (e.g., the Python Library Reference). And to do this, I'd need either for the reader to handle several common formats (text, html, and perhaps PDF), or a Free utility to convert those formats into the reader's native format.
The second is the ability to pay for all-you-can-read access to the catalogs of various publishers, in an open and non-discriminatory fashion. That is, I don't want to be stuck buying new (copyrighted) books only through the manufacturers of the e-book reader, and only from the publishers they happen to have a deal with. And I don't want to pay full hardback price for single books. I want to be able to go directly to the publisher (or to the author for self-published books), and pay a yearly fee for access to everything in their catalog, or possibly a very low price for a single book ($50 a year, say, or $5 for one book; right now one hardback costs on average $25, and that's about all any one publisher will see from me in a given year). This will obviously require an open standard for e-book publishing. There seems to be work towards such a thing already in progress.
As for DRM, I'm not sure what I'd accept; certainly not much. Ideally there would be no reading or copying restrictions, but rather some sort of watermarking that would make it easy to track down redistributors, but I'm not sure how feasible that would be. I think with a subscription-based model, the convenience of having full access to a catalog would be enough to make illegal copying not worth the trouble. Publishers might lose a few sales of popular books to copying, but they'd make up for it with a guaranteed revenue stream.
-
How about the TEI XML format?
> However, it insists on at least a plain vanilla version of a text, as that format has proven to be the most durable and accessible.
Sometimes the illustrations that accompany a text are crucial for its understanding.
How about using the Text Encoding Initiative's TEI XML format instead? Graphics can be included using its figure tag. Combine the TEI XML markup with Dublin Core metadata and people could search PG's library by author, publication date, publisher, etc.
The markup can be stored as ASCII text and edited with a simple text editor. This format can also be rendered to ASCII for legacy purposes...
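To make the suggestion above concrete, here is a rough sketch of what a TEI file with a figure might look like and how header fields could be mapped onto Dublin Core names. It assumes the current TEI P5 namespace and the P5 style of figure with a nested graphic element; the sample book, the file name route-map.png, and the particular header-to-Dublin-Core mapping are all invented for illustration, not anything PG or the TEI Consortium prescribes.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace (assumption: the text is encoded as P5 XML).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, stored as ordinary text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Book</title>
        <author>A. N. Author</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Example Press</publisher>
        <date>1900</date>
      </publicationStmt>
      <sourceDesc><p>Transcribed from a hypothetical print edition.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The map below shows the route.</p>
      <figure>
        <graphic url="route-map.png"/>
        <head>Map of the route</head>
      </figure>
    </body>
  </text>
</TEI>"""


def dublin_core(root):
    """Map a few header fields to Dublin Core element names (illustrative mapping)."""
    paths = {
        "title": ".//tei:titleStmt/tei:title",
        "creator": ".//tei:titleStmt/tei:author",
        "publisher": ".//tei:publicationStmt/tei:publisher",
        "date": ".//tei:publicationStmt/tei:date",
    }
    return {
        name: root.findtext(path, default="", namespaces=TEI_NS).strip()
        for name, path in paths.items()
    }


def figures(root):
    """Return (image URL, caption) pairs for every figure in the document."""
    return [
        (graphic.get("url"), fig.findtext("tei:head", default="", namespaces=TEI_NS))
        for fig in root.iterfind(".//tei:figure", TEI_NS)
        for graphic in fig.iterfind("tei:graphic", TEI_NS)
    ]


root = ET.fromstring(SAMPLE)
record = dublin_core(root)
images = figures(root)
print(record)
print(images)
```

A catalog could index the record dictionary for author or date searches, while a plain-text renderer could simply replace each figure with its caption.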
-
Re:still free
I participated in this when it started up. It's dead in the water, becalmed, caught in the horse latitudes, so far as I can tell.
For example, take a look at the dates attached to the marked-up texts in this list. A shame--folks were mighty excited.
The Project Gutenberg XML mentioned earlier here was also exciting, but I've been off the mailing list a few years, and am having trouble finding its archives now. Anybody have more luck than me? As I recall, one of the unanswered threads that ran through it was what to do in the TEI headers, since TEI was an attractive choice for a mark-up vocabulary. It is not that obvious how to accommodate the Gutenberg boilerplate and metadata appropriately in the header. -
DTDs for the humanities
Which DTDs have you looked at already and what do you plan to use them for?
Just off the top of my head, I recall TEI and TEI-lite being in widespread use. There are quite a few subsets of both. In general it's often easier to strip an existing DTD down to what you need than to try to make a new one from scratch.
Additional options are DocBook, which as others have mentioned is good for simple documents, and ISO-12083 for more complex ones.
-
SGML and XML editors show years of prior art
Honestly, HTML is a decent precedent for XML. Sure the structure is less ordered, and not so clearly delineated between logical/structural and layout/presentation halves. But the idea of using containing tags to structure text has been around since at least SGML in 1986.
Actually, SGML was accepted as an international standard in 1986. SGML has its origins in the 1960s, but then so does object oriented programming. GML started then and over time was modified to what became SGML which became a standard in 1986. Then concessions were made to simplify it and most importantly, IMHO, make it easier to parse by requiring documents to be "well-formed". So, editors which handle structural markup, including some web editors (e.g. Hotmetal), have actually been around since the 1960s, even if we restrict the scope to SGML/XML.
If you want commercial, yet high quality examples, look at some of the tools from ArborText, Softquad, or even Altova. If you want something from the GNU project, then look at the PSGML mode for Emacs, which I recall using already in 1995. I'm sure I'm missing many examples from the 70s and 80s.
To take other recent examples, the versions of HTML prior to XHTML are in SGML. SGML and XML are the rules for defining sets of rules (aka DTDs) like HTML. You have many choices:
I expect that some TeX users could speak up as well. -
The right format: TEI [lite]
Instead of re-inventing the wheel, people should just pick the TEI (or TEI Lite) SGML/XML DTD of the Text Encoding Initiative.
For those who haven't heard of it yet: TEI is an open SGML/XML format created for electronic editions of literary texts. It is as comprehensive and well-designed for text philology as DocBook is for technical documentation. The only drawback is that it is, like DocBook, very comprehensive and accurate in its markup tags (fulfilling all needs of academic editions of historical texts), so that for average readers, the trimmed-down TEI Lite DTD should do the job.
For e-literature collections created by professional philologists, such as the Victorian Women Writers Project, TEI already is the standard text format. Thanks to the SGML/XML toolchain, TEI source code can, like DocBook, of course be painlessly transformed into HTML, txt, RTF, PDF etc. (TEI is, btw., also mentioned in Eric S. Raymond's quite useful DocBook Demystification HOWTO.)
Florian
(philologist by profession)
-
Re:DocBook and SVG
For PDF output, there's FOP. Again, you'll need a JVM.
FOP can include SVG, via the Apache Batik renderer. I find FOP a bit lacking however, especially font handling seems to be a problem. Free alternatives include PassiveTeX (built upon TeX as the name suggests, so quite good-looking output, but a pig to install and lacking lots of formatting objects) and xmlroff (Sun sponsored, built on Gtk2's Pango renderer, still in its infancy).
For structured graphics, I recommend SVG. However, I'm not sure what tools are good for generating it. And I'm not sure if FOP can handle it.
Basically, XSL FO rendering with free software isn't where it should be. One could of course use a proprietary renderer, but they tend to be pricey, and not everybody wants to go that way. Personally, I have decided to stay with the DSSSL toolchain (i.e. Jade) for PDF output for a while, but using XSLT for HTML. Jade works with XML documents just as well as with SGML, but of course, this is a maintenance nightmare.
-
On Beyond ASCII
I understand the support in a lot of the comments here for the plain-vanilla ASCII Project Gutenberg approach to ebooks. Paradoxically, however, a simple ASCII conversion from print to digital form provides less assurance of future survivability and usability of your book than rendering it with the structured XML markup specified by the Open eBook standard (where well-formed XHTML is the least common denominator).
Why? Well, an ASCII text version of a printed book is really more like an analog facsimile than is a version in XML that has been tagged for structural features. Leaving aside issues of non-English characters, illustrations, and unusual typography, ASCII does a relatively poor job of capturing all of the structural conventions that exist in printed books. Books have copyright pages, tables of contents, chapter titles, subtitles, bylines, epigraphs, block quotations, footnotes, running headers and footers, citation lists, etc. ASCII can provide rough format equivalents of some of these, very poor equivalents of others. With an appropriate XML tagset, however, it's a relatively simple matter to tag most of the structural features of a book and then use stylesheets for presentational rendering. That's the whole assumption of the Open eBook specification.
Suppose you're in a world where all printed copies of Huckleberry Finn have been lost. You have two CD-ROMs that somehow you've managed to decode so that you can read the files and interpret their character sets. One of them contains the Project Gutenberg etext of the novel, an ASCII transcription. The other contains an XML encoding tagged according to a DTD from the Text Encoding Initiative, the current best standard for encoding literary (and many other) texts. It has all of the textual content of the PG version, as well as some that's missing (like the table of contents and the copyright page from the transcribed edition, which the PG version unaccountably omits). XML tags mark all the line and page breaks of the original. In addition, there are tags to mark quoted speech, unusual typography, words in foreign languages, and other significant features of the original. The CD-ROM contains the DTD used along with documentation on the tagset.
In this imaginary scenario, even if all of the XML documentation were missing it would be pretty straightforward for 31st-century programmers to strip out the tags and recreate the ASCII transcription. But with the documentation, it's possible to reconstruct something much closer to the original than the plain-vanilla PG version allows. And suppose your 31st-century archaeologist found a trove of TEI-tagged books on CD: with all of the structural tagging and metadata about authorship, publication dates, etc., a 31st-century librarian will be able to plug all of the books into a cataloging system that allows sophisticated searching. If instead you had a trove of plain-ASCII books, the best you could do with the collection would be simple full-text searches.
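The "strip out the tags" step really is about this small. The sketch below is only an illustration under some assumptions: the fragment is invented, it uses the TEI P5 namespace, and it treats the lb element as a line break and said/foreign as the tags for quoted speech and foreign-language words. A real edition would use a much larger tagset.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the TEI P5 namespace (assumed encoding).
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment with quoted speech, a line break, and a foreign phrase.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Novel</title><author>Sample Author</author></titleStmt>
      <publicationStmt><p>Hypothetical edition.</p></publicationStmt>
      <sourceDesc><p>Invented for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body>
    <p><said>You don't know about me</said>, he began,<lb/>without you have read a book.</p>
    <p>He called it <foreign xml:lang="fr">un roman</foreign>.</p>
  </body></text>
</TEI>"""


def plain_text(elem):
    """Return the text under elem with all tags removed; <lb/> becomes a newline."""
    parts = []

    def walk(node):
        if node.tag == TEI + "lb":
            parts.append("\n")
        elif node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(elem)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
body = root.find(f".//{TEI}body")

# Plain transcription: one block per paragraph, original line breaks kept.
ascii_text = "\n\n".join(plain_text(p) for p in body.iter(TEI + "p"))

# Something a plain transcription cannot answer: which words are foreign?
foreign_words = [f.text for f in body.iter(TEI + "foreign")]

print(ascii_text)
print(foreign_words)
```

Going the other way, from the plain transcription back to the tagged version, is the part that cannot be automated, which is the whole argument for tagging up front.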
Leaving aside the sci-fi scenario, the reality is that our documents, over the next few decades, will move from format to format and be used for purposes that we can only guess at right now. Of course plain ASCII, or even proprietary formats, will be better than no documents at all. But the work involved in converting them will be a lot higher than if they are tagged in a well-documented, structured markup language.
Incidentally, there's already at least one project underway to take Project Gutenberg texts and add minimal XHTML or XML markup to capture structure and make them more readable via stylesheets. The Open eBook specification is just a more sophisticated way of doing the same thing.
-
Re:similar problem with MathML
I note that the LaTeX syntax is the default way of representing mathematical formulae in TEI Lite (xml-ish e-text encoding specification which is probably going to be adopted by Project Gutenberg in the near future).
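For anyone curious what that looks like in practice, here is a small sketch. It assumes the formula element with a notation attribute set to "TeX", and the TEI P5 namespace; the sample paragraph is made up. Since the body of the element is ordinary TeX source, pulling the formulae out for rendering is straightforward.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the TEI P5 namespace (assumed encoding).
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical paragraph; the formula bodies are plain TeX source.
SAMPLE = r"""<p xmlns="http://www.tei-c.org/ns/1.0">Einstein's relation
<formula notation="TeX">E = mc^{2}</formula> and the identity
<formula notation="TeX">e^{i\pi} + 1 = 0</formula>.</p>"""


def tex_formulas(root):
    """Collect the TeX source of every formula marked with notation="TeX"."""
    return [
        f.text.strip()
        for f in root.iter(TEI + "formula")
        if (f.get("notation") or "").lower() == "tex" and f.text
    ]


formulas = tex_formulas(ET.fromstring(SAMPLE))
print(formulas)
```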
-
Re:That's part of what DP does
Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
Three Little Letters:
TEI is to literature as DocBook is to documentation.
-
XML
What we really need is some universally acceptable method to store digital data that isn't likely to decay or fall out of favor in the next ten years. That, I'm afraid, is a difficult proposition.
XML, especially stuff like the Text Encoding Initiative (TEI) and the new MPEG-7 format, seems like it would be a good partial response. -
Dude, this article is more than 2 months old.
It's a very interesting article, but it came out in February. That aside, it's good that some of these are getting mainstream press.
Protocols to mention besides OpenLDAP and OAI are Whois++ and Z39.50. OAI actually is transported over HTTP. You could do the same with EAD or others.
Projects which implemented Z39.50 for the purposes of interoperability are ONE and ONE-2, EUROPAGATE, Desire and Desire II, DECOMATE and DECOMATE II, and Renardus just to touch the surface. Don't forget OHIOLINK...
Other older, but interesting, metadata activities have been SGML MARC and the corresponding XML MARC.
Those that are interested in more detailed reading can check out the Nordic Metadata Project and Nordic Metadata Project II, which studied the practical implications of cross browsing multiple databases and especially the use of Dublin Core. Even if you get agreement on the protocol and data standard, cross searching's not as easy as it sounds. One of the tools is the Dublin Core Metadata Template (get it while you still can).
The BYTE article was exciting to see again and could have benefited further from pointing out the relative ease of use of Dublin Core. OAI uses unqualified Dublin Core, SAFARI uses qualified Dublin Core to create an up to date index over academic research in Sweden. Shoot, since it already uses some META tags, you could even tweak htdig to use Dublin Core on your own site for those high precision searches.
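To show how little is involved, here is a rough sketch of both ideas: an OAI harvesting request, which is just an HTTP GET with a verb and a metadataPrefix parameter (oai_dc being the unqualified Dublin Core format), and Dublin Core written as HTML META tags using the common "DC." name prefix. The repository URL, the sample record, and the function names are made up for the example.

```python
from html import escape
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request; oai_dc is unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def dc_meta_tags(record):
    """Render a dict of Dublin Core fields as HTML META tags (DC.* naming)."""
    return "\n".join(
        f'<meta name="DC.{escape(name, quote=True)}" content="{escape(value, quote=True)}">'
        for name, value in record.items()
    )


# Hypothetical repository endpoint and record.
url = oai_list_records_url("https://repository.example.org/oai")
tags = dc_meta_tags({"title": "Cross-searching in practice", "creator": "Doe, Jane"})
print(url)
print(tags)
```

An indexer that already reads META tags only needs to be told which names to treat as fields, which is why the tweak mentioned above is plausible.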
With the interest in structured data (XML?) maybe we'll see some sites serving up not just HTML with Dublin Core, but maybe even DocBook or even TEI / TEI Lite. There are great tools for converting from DocBook to HTML, PDF, RTF, etc. and AbiWord and KWord already have partial support for DocBook. If there were more, then we could see some real changes on searching the web. Coding for SGML is more difficult, so the obvious choice would be to start from DocBook XML.
-
Text Encoding Initiative
Check out the TEI web site at www.tei-c.org. Their DTD includes tags for marking up scripts.
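A quick sketch of what that could look like, assuming "scripts" here means dramatic scripts and using the TEI drama elements sp (a speech), speaker, and stage (a stage direction) in the P5 namespace. The scene itself is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Clark-notation prefix for the TEI P5 namespace (assumed encoding).
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical scene: one stage direction and three speeches.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="scene">
  <stage>Enter two clerks.</stage>
  <sp><speaker>First Clerk</speaker><p>Is the ledger ready?</p></sp>
  <sp><speaker>Second Clerk</speaker><p>It is.</p></sp>
  <sp><speaker>First Clerk</speaker><p>Then we begin.</p></sp>
</div>"""


def speeches(root):
    """Return (speaker, spoken text) pairs in document order."""
    out = []
    for sp in root.iter(TEI + "sp"):
        speaker = sp.findtext(TEI + "speaker", default="").strip()
        spoken = " ".join("".join(p.itertext()).strip() for p in sp.iter(TEI + "p"))
        out.append((speaker, spoken))
    return out


parsed = speeches(ET.fromstring(SAMPLE))
counts = Counter(speaker for speaker, _ in parsed)
print(parsed)
print(counts)
```

Because speakers and stage directions are tagged separately from the dialogue, things like per-character line counts fall out of the markup for free.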
-
Re:Missing marks
The marks need not necessarily be lost in transcription if they're using the Text Encoding Initiative guidelines. TEI allows extra-linguistic marks to be captured alongside the text.
-
Re: TEI DTD for SGML
okay, this is a quick thought, but the Text Encoding Initiative has a DTD for encoding works of literature. i've seen it used for articles, books, and stories, and it might work for an encyclopedia. there are existing engines that can bring these SGML documents to the web (in HTML). it would definitely be more flexible and easier to search for things than HTML. of course, XML is here, and with the power of Schemas over DTDs, is probably a better choice.
i'm not sure if it would be better to get a headstart in getting information into the system, and probably have to encode it (in XML or whatever), or to wait until the encoding standard is available. considering that many entries will be updated, i think the headstart is the better option. but it's going to be hella mundane encoding things after the fact.