Slashdot Mirror


Project Gutenberg Made Accessible

scishop writes "Mazarin is an open-source interface to Project Gutenberg's library. Mazarin increases the accessibility of Gutenberg's 10,000+ books as it formats the books for HTML display -- providing paginations in addition to generating table of contents and other advanced markup features -- along with enabling users to carry out full-text searches on the entire library."

19 of 214 comments

  1. Tested by mpost4 · · Score: 3, Interesting

    I can not test the claim of all 10k works, but I tested what I thought would be most likely to be left out, and I found that they were there.

    I Tested Martin Luther.

    (If it were not for the printing press, the Reformation would not have been as successful as it was.)

  2. Looks nice and dandy by tfbastard · · Score: 3, Interesting

    But did they have to make the tutorial presentation a fullscreen flash file?

  3. P2P / Library by Anonymous Coward · · Score: 5, Interesting

    Interesting idea, I can't get to the website but a feature I'd want is the content shared P2P so you don't have to rely on a central server for the content.

    A central webpage index could just have ed2k links to the files: sharereactor for books. When they update the book they release a new hash-link and the file onto the network.

    Being P2P, it could open things up to more than just public domain books too ;).

  4. Slashdotted - but nice error messages by twoshortplanks · · Score: 4, Interesting

    Hmm, nicely formatted error messages. Does anyone know what this is? I'm assuming it's a mod_perl handler of some sort.

    --
    -- Sorry, I can't think of anything funny to say here.
    1. Re:Slashdotted - but nice error messages by WWWWolf · · Score: 2, Interesting

      As others noted, this is definitely Perl HTML::Mason, which is one of the best web scripting environments I've ever worked in. An adequate comparison would be something like this: PHP's down-to-earth approach of mixing code and HTML, JSP's and Website Meta Language's ideas on how to separate them again if they need to be (code componentization and tag libraries), Perl as the scripting language, and Apache mod_perl to give it some speed (also works as CGI).

      I'm just wishing to know how to turn the cool-looking error dumps off when they're viewed outside localhost =)

  5. Gutenberg is totally inaccessible by Anonymous Coward · · Score: 5, Interesting

    This sounds like it just adds complexity and does not make gutenberg's data accessible.

    There were several research projects for which I used pg as a corpus. However, pg's a terrible hassle for the first-time researcher, since the format of the introductory text ("we're gutenberg, here's the copyright, blah blah") is inconsistent.

    You have to remove the introductory text to avoid bias in the corpus, however there are so many pathological special cases (different formats, spelling, languages, words used, punctuation, case) that it requires several hours of Perl coding to successfully strip the header text from 75% of the documents with >99% accuracy. Yuk.

    If gutenberg is serious about making their work more accessible, they should think about the simple concern of ensuring consistency in the header text format.
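
    To illustrate the kind of header stripping described above, here is a minimal sketch in Python (not the Perl the poster used). The marker patterns and the name strip_boilerplate are assumptions made for illustration. Real Gutenberg files vary far more than this, so the function reports when it finds no marker instead of guessing.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# Assumed start-of-text markers, tried in order. These cover only two
# common header styles; many files will match neither.
START_MARKERS = [
    re.compile(r"^\*{3}\s*START OF (?:THE|THIS) PROJECT GUTENBERG.*$", FLAGS),
    re.compile(r"^\*END\*\s*THE SMALL PRINT.*$", FLAGS),
]

# Assumed end-of-text marker preceding the trailing license text.
END_MARKER = re.compile(r"^\*{3}\s*END OF (?:THE|THIS) PROJECT GUTENBERG.*$", FLAGS)


def strip_boilerplate(text):
    """Return (body, matched).

    matched is False when no start marker was found; in that case the
    text is returned unchanged so the file can be reviewed by hand.
    """
    start = None
    for pattern in START_MARKERS:
        match = pattern.search(text)
        if match:
            start = match.end()
            break
    if start is None:
        return text, False

    # Cut at the end marker if one follows the header; otherwise keep
    # everything up to the end of the file.
    end_match = END_MARKER.search(text, start)
    end = end_match.start() if end_match else len(text)
    return text[start:end].strip(), True
```

    Returning the matched flag keeps the unmatched files separate, so they can be checked manually rather than silently polluting the corpus.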

  6. Best way to read online texts? by GGardner · · Score: 4, Interesting

    What's the best way to read online texts? There are a bunch of PG texts I might like to read, but reading them in a web browser, as a big text file gets tiring after ten minutes or so. I'm not sure why I can read a book for hours, but the screen for minutes, but there you have it. I don't think that HTML will help this problem -- does anyone have recommendations for better ways to read these files?

    1. Re:Best way to read online texts? by CGP314 · · Score: 2, Interesting

      does anyone have recommendations for better ways to read these files?

      On an old palm pilot or in the notes folder on an ipod. I found that it's the backlight of a computer screen (and on the new palms) that is what hurts my eyes when trying to read.


      -Colin

    2. Re:Best way to read online texts? by bw5353 · · Score: 2, Interesting
      What works best for me is any text-editor/word processor. I delete line by line or paragraph by paragraph as I have read them. Don't know why I feel that is comfortable, but it is.

      (Keep a backup of the original in case you want to check again what the name of the butler's niece was.)

  7. Straight HTML = archaic by Leobinus · · Score: 5, Interesting

    Bah. Posting HTML is so 1996. You can do so much more with these texts. One example is Open Source Shakespeare, which takes all of Shakespeare's texts, indexes them, presents them in an attractive manner, creates a concordance, provides a full-text search engine, organizes the lines by character, etc.

    All of the texts are open source, and you can download the database and source code from the site, too. Check it out.

  8. Re:and then just think by mangu · · Score: 4, Interesting
    While, at first, one would classify your post as either "offtopic" or "flamebait", I think an interesting point can be raised here: the Lutheran reformation was an early consequence of the maxim "information wants to be free".


    It was very convenient for the Roman Church to have a practical monopoly on what was widely acknowledged at the time to be the main source of information, the Holy Bible. When the printing press was invented, this diluted that monopoly, since ordinary people could then afford their own copies of the Bible and became independent of the Church for information. Luther was one of the first to realize that, when he urged people to read the Bible. A consequence of that was that people learned to read. Until early in the 20th century, literacy rates in countries which are mostly Lutheran, e.g. Scandinavian countries and parts of Germany, were much higher than in southern Europe, where people were mostly Catholic.


    A modern analogy:

    Catholic Church --> RIAA

    Lutheranism --> P2P

  9. Re:and then just think by bsDaemon · · Score: 2, Interesting

    Information doesn't want anything. It merely is.
    and what is wrong with monopoly? Uniformity breeds community.

  10. Gutenberg archive and access by rjs.org · · Score: 2, Interesting
    Too bad the site couldn't hold up, I really wanted to see my contribution
    http://www.gutenberg.net/etext04/awbv110.txt
    there in HTML.

    The first volume was converted to HTML by hand by someone else and to pdf, by machine, I think, whereas my site simply has the e-text:
    http://rjs.org/gutenberg/Stevens_Thomas/
    So an automated process would be a boon. What I'd really like to see is an OS text-to-voice reader program. I wrote a wxPython program to assist conversion from scanned text to PG format: http://rjs.org/gutenberg/OCR2Gutenberg/, but I have never been able to find a free set of spoken word wave files or speech library.

    Ray

    --
    http://rjs.org/ - biking, astronomy, photography
  11. Gutenberg, Google by Anonymous Coward · · Score: 2, Interesting

    Wouldn't it be great if Google were involved in Gutenberg in a major way?

  12. Gutenberg Disclaimer by Twinky · · Score: 5, Interesting
    What always struck me as odd is the enormous length of the disclaimer that Project Gutenberg attaches to every text. To me it seems to be the most obvious sign of a legal system that is ridiculously screwed. No book I ever read had a legal statement like this.

    Quote:

    LIMITED WARRANTY; DISCLAIMER OF DAMAGES But for the "Right of Replacement or Refund" described below, [1] the Project (and any other party you may receive this etext from as a PROJECT GUTENBERG-tm etext) disclaims all liability to you for damages, costs and expenses, including legal fees, and [2] YOU HAVE NO REMEDIES FOR NEGLIGENCE OR UNDER STRICT LIABILITY, OR FOR BREACH OF WARRANTY OR CONTRACT, INCLUDING BUT NOT LIMITED TO INDIRECT, CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES, EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGES. If you discover a Defect in this etext within 90 days of receiving it, you can receive a refund of the money (if any) you paid for it by sending an explanatory note within that time to the person you received it from. If you received it on a physical medium, you must return it with your note, and such person may choose to alternatively give you a replacement copy. If you received it electronically, such person may choose to alternatively give you a second opportunity to receive it electronically. THIS ETEXT IS OTHERWISE PROVIDED TO YOU "AS-IS". NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, ARE MADE TO YOU AS TO THE ETEXT OR ANY MEDIUM IT MAY BE ON, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimers of implied warranties or the exclusion or limitation of consequential damages, so the above disclaimers and exclusions may not apply to you, and you may have other legal rights. INDEMNITY You will indemnify and hold the Project, its directors, officers, members and agents harmless from all liability, cost and expense, including legal fees, that arise directly or indirectly from any of the following that you do or cause: [1] distribution of this etext, [2] alteration, modification, or addition to the etext, or [3] any Defect.
  13. Already very accessible... by ricky-road-flats · · Score: 4, Interesting
    I only last week downloaded Project Gutenberg as an ISO - it has 9,500 books on it and weighs in at about 3.85 GB. All the books are as plain text within a ZIP file, accessed through a set of basic web pages also on the disc.

    It's great - I now have that on my laptop hard drive, mountable by Alcohol, so I'll never be short of anything to read, especially when the web's not available...

    I can't find the torrent file I got it through, but if it helps the filename is pgdvd.iso and the size is 4,139,646,976 bytes.

  14. Re:Funny definition of "accessible..." by pedantic+bore · · Score: 2, Interesting
    The problem with "plain vanilla ASCII" is that it expresses less information than is present in the original, and it is a PITA to recover this information. This is especially true for texts that use non-ASCII characters (or illustrations of any kind).

    I agree that flavor-of-the-month representations are bad, but markup languages have been around for a long time and it wouldn't have been hard to use something (like a small subset of SGML) to add a bit more formatting info. Then when people want to look at the text in the flavor-of-the-month format, it's just a matter of writing a translator for that format. (Illustrations are another matter, but I suspect that things have converged enough in this area so that something could be done from this point forward.)

    Don't get me wrong -- I have a huge respect for PG and what they're doing is a benefit to humanity. I just wish that they would aim a little higher than the lowest common denominator for representation, and support other character sets in a simpler manner.

    --
    Am I part of the core demographic for Swedish Fish?
  15. Re:and then just think by mangu · · Score: 2, Interesting
    Information doesn't want anything. It merely is.


    How do you know that? Apart from the religious dogma that postulates the existence of a homunculus called the "soul", we do not know much about how consciousness arises. What we do know is that information doesn't exist in a vacuum. Information needs a physical medium to exist. Check "An Introduction to Information Theory", by John R. Pierce, Dover Publications, ISBN 0-486-24061-4, chapter 10 - "Information Theory and Physics" for a basic explanation why. Now, assuming a certain body of information and a system to handle that information, we have no idea if a sufficiently large amount of information with the right manipulation system will have consciousness. Sometime in the next few decades we will have machines with the same complexity and information-handling power as a human brain, then perhaps we will be able to create a conscious machine with free-will.


    Anyhow, that's not the point. "Information wants to be free" is just an easier way to say that human beings have an urge to share whatever information they have with other humans. History has shown that, given efficient communication media, it's very difficult to keep information secret.


    and what is wrong with monopoly?


    Intrinsically, nothing. Some public utilities are natural monopolies; it wouldn't be practical to run several different water, gas, and electricity supplies to each house, for instance. Sometimes a monopoly is useful in developing a new technology. The Bell Telephone Co., in the first half of the 20th century, did create a relatively cheap and efficient phone system using a monopoly. Microsoft created a widely used personal computer standard using a monopoly. There are some circumstances under which a new technology spreads faster if a monopoly exists. But a monopoly also induces slackness. Monopoly holders will not be eager to try harder. When growth starts levelling off, a monopoly usually stagnates. That was bad for Christianity, it was bad for the telephone system, it was bad for personal computers... may I generalize?

  16. The Project Gutenberg Index as RSS by grrussel · · Score: 2, Interesting

    I've created an RSS feed from the Project Gutenberg list of etexts. The RSS feed contains titles, authors, descriptions and links to the relevant page or file on http://www.gutenberg.net/

    PGDB.rss PGDB.rss.gz
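
    For anyone wanting to build a similar feed, here is a rough sketch in Python of turning a list of etext records into RSS 2.0. It is not the code behind PGDB.rss; the build_rss function, the record field names, and the use of a Dublin Core dc:creator element for the author are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, used here (as an assumption) to carry the author,
# since the plain RSS 2.0 <author> element is meant for an email address.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_rss(entries, feed_title="Project Gutenberg etexts",
              feed_link="http://www.gutenberg.net/"):
    """Build an RSS 2.0 document from a list of dicts.

    Each dict is assumed to have the hypothetical keys
    "title", "author", "description" and "link".
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Index of etexts"

    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "{%s}creator" % DC_NS).text = entry["author"]
        ET.SubElement(item, "description").text = entry["description"]
        ET.SubElement(item, "link").text = entry["link"]

    # Return the feed as a Unicode string; ElementTree escapes &, < and >.
    return ET.tostring(rss, encoding="unicode")
```

    Parsing the index of etexts into those records is the hard part and is left out here, since its format is not described in this thread.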