Slashdot Mirror


Linux and OSS to Aid the Library of Congress

flakeman2 writes with a link to a Linux.com article about Linux's new role at the Library of Congress. The national archive of books is looking to begin an ambitious digitization project, aimed at getting some rare and crumbling documents into the public record online. These will include "Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin. According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an 'absolutely critical' role in getting the job done. The main component is Scribe, a combination of hardware and free software. 'Scribe is a book-scanning system that takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable,' says Kahle." Linux.com and Slashdot.org are both owned by OSTG.

63 comments

  1. Help the Library of Congress save American History by JanusFury · · Score: 5, Funny

    For the past few months Microsoft has been dispatching crack teams of special operatives into the past to alter the course of American History for their benefit, in hopes of eventually transforming the United States into the New Microsoft Empire. But little do they know, a world-weary Librarian and Ex-Marine at the Library of Congress won't stand for that shit. He's put together a team of agents in hopes of reversing the damage to the timestream before it becomes irreversible. Together with Agents Linus Torvalds (Technology Specialist - Special power: x-ray glasses), Donald Trump (Logistics Specialist - Special power: nuclear fusion comb-over) and Stephen Hawking (Quantum Physics Specialist - Special power: medusa glare), he just may be the only hope for American History's future.

    --
    using namespace slashdot;
    troll::post();
  2. They tried with Windows first...;-) by jkrise · · Score: 0, Troll

    and got "Dear Blair, let's set so double the killer delete select all ..."

    Suffice to say, they settled with Linux. The Microsoft version had psychic powers, apparently!

    --
    If you keep throwing chairs, one day you'll break windows....
  3. Re:Hmm... by rednuhter · · Score: 4, Informative

    RaTFA (note the lowercase "a" for "all")
    "the Internet Archive has migrated Scribe entirely to Linux, and Windows support has been dropped."
    Seems focused on Linux to me.

    --
    ERR 411[Max number of witty sigs reached]
  4. It's only natural by donglekey · · Score: 2, Funny

    So many technologies have been made specifically to hold libraries of congress.

  5. Yay Linux! by Anonymous Coward · · Score: 0

    Hooray! Linux does something useful! Let's write an article about it! Hooray! Electricity did something useful too! Why isn't there an article about how great electricity is at all this too? Insert your favorite item to rave for the next story. I realize this is a pro-Linux site and all, but come on guys, it gets unnecessary praise and accolades every time it does something. It's like a bunch of Linux geeks sitting around and mutually masturbating each other, slapping each other on the backs, and then telling each other how cool they are. Step back a moment and look at how ridiculous this all is...

    1. Re:Yay Linux! by dhasenan · · Score: 1

      An article on electricity, unless it's some novel use or implementation, will not gain interest here because people here take electricity for granted. An article on Linux will generate interest and therefore ad revenue because there are a lot of Linux fans here, or Windows fans who are wondering why Linux is doing well, or trolls who want to complain about this whole situation.

    2. Re:Yay Linux! by burbankmarc · · Score: 1

      You must be new here...

    3. Re:Yay Linux! by fd0man · · Score: 1

      The reason that "wins" for GNU/Linux systems are something that makes news--and not just here, but in many places--is because it is a good thing. It is a sign that people are beginning to see the alternatives, assess them, and use them on merit more than on "that's all we know". Of course, there are Windows trolls that will argue the points of merit until the cows come home, and then continue even after that; for fun on that one, why not head over to comp.os.linux.advocacy and see what trolls really do. You'll also see lots more in the way of what Linux does, that is not posted on /..

  6. No matter how you look at it by zappepcs · · Score: 2, Interesting

    this has got to cause some flying chairs in Redmond.

    Arguably one of the most important repositories of information in the U.S. is about to be available via OSS software and not MS products. For all the efforts that MS put out in Mass. this has got to be a kick in the face! Just wow!

  7. All copyrighted works should be held by rtb61 · · Score: 3, Interesting
    They should expand the role so that all copyrighted works are held at the Library of Congress. It would certainly save the confusion over who holds what rights to what content. At the same time, unless congress wants to hold and distribute material of questionable moral quality, the copyright law could be amended to limit the protections of copyright to those works that do actually further the arts and the sciences as defined in the constitution.

    The revisions to the law would not be infringing freedom of speech, in fact by allowing the free copying of works that did not further the arts or the sciences it would be limiting copyright's impact upon the freedom of speech. If people are really concerned about the quality of content, they should remember that eliminating the profit motive will have a substantial impact upon the amount of questionable content that is out there including movies, music, pictures and literature. Most of the members of the RIAA and the MPAA have a total disregard for the harm their content causes to society, let them feel some of the pain, wipe out the copyright protections on some of their more divisive content ;).

    --
    Chaos - everything, everywhere, everywhen
    1. Re:All copyrighted works should be held by OverlordQ · · Score: 1

      in fact by allowing the free copying of works that did not further the arts or the sciences

      Way to kill the entire non-fiction genre.

      --
      Your hair look like poop, Bob! - Wanker.
    2. Re:All copyrighted works should be held by forkazoo · · Score: 2, Funny

      At the same time, unless congress wants to hold and distribute material of questionable moral quality, the copyright law could be amended to limit the protections of copyright to those works that do actually further the arts and the sciences as defined in the constitution.


      Sir, you pre-suppose that morally questionable articles do not serve to further the arts or the sciences. I protest most heartily. Every modern technology is served by serving pornography, as we all know. Let us firstly ponder the case of VHS and BetaMax to settle that matter in a stroke. As for the arts, I think it is clear to say that such works as the Seduction of Misty Mundae served to elevate the erotic art form in any manner of ways, at least so much as any diserotic work served to also further the diserotic arts. Or, should you feel a need for some further example, I might offer up the Lord of the G Strings as an additional point. Rather than endeavouring to hold the plot to a dramatic study of youthful exploration, as in the case of Seduction, Lord served to offer up comedic commentary and parody against the institutions of our society. In so doing it clearly also furthered the arts.

      Fie upon you, sir, that you would so readily dismiss Misty Mundae's films on mere moral grounds. Fie, indeed.

      Now, I just need to torrent me a copy of Spiderbabe. I heard that shit is hot.
    3. Re:All copyrighted works should be held by vivaoporto · · Score: 2, Insightful

      "At the same time, unless congress wants to hold and distribute material of questionable moral quality, the copyright law could be amended to limit the protections of copyright to those works that do actually further the arts and the sciences as defined in the constitution."

      Uh-uh. Let's repeat the same errors from the past, keeping what the current generation deems "of excellent moral quality", and censoring everything else, just like some works of Michelangelo were. People must remember, what is of questionable moral quality for some is perfectly acceptable (and even desirable) for others, especially when the benefit of time is given, and that's the idea of archiving for posterity.

    4. Re:All copyrighted works should be held by Aladrin · · Score: 1

      He got modded interesting, and you got modded funny... It should be the opposite. What you say is absolutely correct. Maybe just not exactly how you said it.

      When Shakespeare was writing his work, do you think he thought 'I'll improve the arts and be known throughout the ages as a great writer' or do you think he merely enjoyed his work and liked the money? At the time, I'm sure nobody thought his work even a fraction as important as we now think it is.

      So how are we to judge works of today? We obviously can't know the effect on the future until we get there, so we could be snatching the copyrights from those who would be most worthy of them.

      And let's not forget what the point of copyright was: To encourage. If a writer knows he won't be allowed copyright on his work before he even starts it, it's likely he'll find something else to do with his time. It's twice as bad if the system is unfair about it, granting copyrights to some and not others.

      No, you can't selectively grant copyrights to those you think are worthy. It's an all-or-none situation.

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    5. Re:All copyrighted works should be held by rasilon · · Score: 1

      I would suggest that holding material of questionable moral content is an important function, even if only to further the historical record so that future scholars can see where we drew the line. Leaving an accurate record of our times does future generations more good than trying to erase those parts of our culture that some find objectionable.

    6. Re:All copyrighted works should be held by Anonymous Coward · · Score: 0
      I'm with you all the way, Dude!

      Send me more pr0n! Our historical identitty is at stake!

    7. Re:All copyrighted works should be held by rtb61 · · Score: 1
      Of course you can selectively apply copyright; it says so in the Constitution. There is absolutely nothing to stop you from expressing yourself, but should other people be spending their tax money to protect your works, especially when your work specifically attacks their values? Should law enforcement or the courts be protecting work that attacks the society that allows it to be produced?

      Should work that attacks family values be protected by the tax dollars that are taken from families? Copyright protection comes at a price; it is only fitting that the people that pay the price decide what work will or will not be protected. The volume of content that will be available in fifty years' time will simply be staggering, especially taking into account automatic translation of works.

      A central registry of protected works is required to avoid locking up the courts with millions of copyright infringement suits from all over the world. Based upon that, you can't very well have the government of the day storing and distributing content that the majority deem unsuitable. Not my choice, but the choice of the majority or whomever they appoint to oversee the copyright registry.

      If the RIAA and the MPAA want to play, then they can pay ;).

      --
      Chaos - everything, everywhere, everywhen
    8. Re:All copyrighted works should be held by jimicus · · Score: 4, Insightful

      At the same time, unless congress wants to hold and distribute material of questionable moral quality,

      Stop right there.

      When the purpose of your organisation is, to put it in very simple terms, "catalogue everything", you can't start making exceptions on moral grounds on the simple basis that what constitutes "questionable moral quality" today may be totally different tomorrow. Furthermore, who gets to define "questionable moral quality"? The closest anyone's ever come to creating such a definition is to say "Well, I can't actually come up with a concrete definition but I knows it when I sees it".

    9. Re:All copyrighted works should be held by houghi · · Score: 1

      All works are copyrighted by default. That would mean that each and every letter I write to my mistress is copyrighted and if it would go your way, I MUST send it to the Library of Congress.

      Considering that copyright is an international law, I don't see how this is possible. I have no interest in sending anything to the US. Copyright is turned on by default, so unless you specify that something is NOT copyrighted, it is.

      --
      Don't fight for your country, if your country does not fight for you.
    10. Re:All copyrighted works should be held by dhasenan · · Score: 1

      Perhaps a minor copyright tax could be levied to offset the cost, as well as to limit the duration of copyright for works that do not need it and stop orphaning works so they cannot be licensed.

    11. Re:All copyrighted works should be held by tsalaroth · · Score: 1

      IANAL, but this doesn't sound correct. Don't you have to provide proof of creation in order to hold the copyright on something? If you don't claim it as yours, for example, by putting "Copyright 2007 Some Dude" or something like that, someone can copy it and use it until you ask them to stop. And if I'm not mistaken, just SAYING "Copyright" doesn't mean it is - there's more to do.

    12. Re:All copyrighted works should be held by Scaba · · Score: 1

      You don't have to be a lawyer, and you are mistaken - just look at the Copyright Office website. Simply creating a work in fixed form copyrights it. If you want to be able to prove in court later that you are the creator of said work, however, it's best to register your copyrights with the Library of Congress. It used to be that you were required to put a copyright notice on your works or risk losing the copyright, but that's no longer true.

    13. Re:All copyrighted works should be held by houghi · · Score: 1

      The proof part can be done in several ways. One of the cheapest ways is to send it in a sealed envelope to yourself with proof of sending (registered?). Then when proof is needed, you can just let the envelope be opened with the appropriate witnesses. Together with the proof of sending this will show who has the first rights.

      So unless the other party has something that clearly dates earlier, you will have the copyright rights.

      However most of the time the question is whether a derived work is a new thing or if it is a copy. To be copied used to be a compliment. Pity that now people are mostly not interested in flattery, but in money.

      --
      Don't fight for your country, if your country does not fight for you.
    14. Re:All copyrighted works should be held by Scaba · · Score: 1

      There's nothing in copyright law that provides for the "poor man's copyright registration" you're talking about, and I don't think there are any cases in which it proved someone the true owner of a copyright. All it proves is that the work was made into a fixed form at some time. It's an easily forged method - you can just mail yourself an unsealed empty envelope, fill it with whatever you want when you receive it, then seal it - and any attorney you're up against will certainly bring that up in court. If your work is really worth protecting, then filling out the form and paying the $45 (here in the US) is more than worth the cost.

    15. Re:All copyrighted works should be held by rtb61 · · Score: 1
      Same as always, the majority, whatever it may be, after all it is not about blocking it, it is all about society as a whole not providing an opportunity for questionable product to profit at society's expense. Why should anyone be fined or imprisoned, why should children be threatened by the courts, either civil or criminal, for content that the majority would deem not worth protecting?

      I support free speech, 'FREE' as in 'FREE'. If you want society to allow you to generate a profit at society's expense, then you are going to have to suffer society's mores as a curb on your creative content, or in reality the greed that motivates your content. Produce all the greed driven, destructive content you want, but expect no one to come to your defence when it is copied or to persecute the people who copy it.

      --
      Chaos - everything, everywhere, everywhen
    16. Re:All copyrighted works should be held by houghi · · Score: 1

      It is good enough in most of the cases. I do not want to spend 45USD on a copyright, yet I don't want to lose that copyright. Or perhaps I don't want anybody to read it, yet I want to prove my copyright.

      Also what you talk about would only be good if you already knew what the other person would be copyrighting.

      Assume I made something on the 1st of January and I send it to myself. You then find out about it and intend to harm me. How are you going to antedate it?

      Obviously the sealing I was talking about needs to be done not by just one person. Indeed, if you are talking about something that will make you filthy-rich, then 45USD is nothing. Yet asking 45USD for something that is already yours is a bit like theft.

      --
      Don't fight for your country, if your country does not fight for you.
    17. Re:All copyrighted works should be held by jimicus · · Score: 1

      Same as always, the majority, whatever it may be, after all it is not about blocking it, it is all about society as a whole not providing an opportunity for questionable product to profit at society's expense.

      The one copy that goes to a nation's library hardly constitutes a great profit at society's expense.

      And the whole point of my post was that society changes. What may be considered perfectly acceptable today may not have been 100 years ago. Pre-marital sex immediately springs to mind, but I'm sure there are plenty of other examples.

      Or, taking another angle, what about Vladimir Nabokov's "Lolita"? It's considered a modern literary classic, but the subject matter could hardly be described as morally "acceptable". Does that mean that it should never be archived? (By the way, if you don't know about the novel "Lolita", look in Wikipedia. Don't punch it into Google, especially if you're at work. I'm not sure what Google will return but I've got a pretty good idea).

    18. Re:All copyrighted works should be held by rtb61 · · Score: 1

      So what about the 'immorality' of copying copyrighted immoral content? Your same argument defeats itself. Copyright infringement is a morality and greed question; as such, all the conditions of copyright should be subject to a morality review, and greed should not be the basis for acceptance of any particular condition of law or content.

      --
      Chaos - everything, everywhere, everywhen
  8. As a Christian, I oppose this by Anonymous Coward · · Score: 0, Funny

    I am a Christian, as were the Founding Fathers, who established this country as a nation under GOD. They would be, as am I, deeply offended and disgusted to see the homosexual communist software known as Linux used in the hallowed halls of the Library of Congress.

    1. Re:As a Christian, I oppose this by koxkoxkox · · Score: 1

      Come on, why isn't this one modded funny? It's still April 1st in Honolulu, maybe he got confused ...

    2. Re:As a Christian, I oppose this by FMota91 · · Score: 1

      Well, it IS offensive.

      That's why it's modded as flamebait.

      I mean, honestly, putting communism and homosexuality together? Hah.

      --
      09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C1 bottles of beer on the wall. Take one down, pass it round... Oh, umm...
  9. The most important part is not free software by Ed+Avis · · Score: 4, Interesting

    As the article says, the OCR itself is still done with proprietary software. I wonder if Google is using Tesseract for their digitization efforts. It would be cool if the original raw scanned images could also be archived and available for download - then you could print your own copy of the book, check the OCR for errors, or even do some weird genetic algorithm thing to make a LaTeX style that typesets the text in the same format as the original book.

    --
    -- Ed Avis ed@membled.com
    1. Re:The most important part is not free software by Millenniumman · · Score: 1

      Huh. I was going to look into it, as I didn't know there was any good open source OCR software.

      Good thing I decided to comment before reading the article.

      --
      Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
    2. Re:The most important part is not free software by Anonymous Coward · · Score: 0

      The Internet Archive makes the original images as well as OCR output and bibliographic metadata available for download. Their approach truly is open -- much more so than Google's.

  10. The question is... by Anonymous Coward · · Score: 0

    How many libraries of congress is that?

  11. Scribe? by Ceriel+Nosforit · · Score: 1

    Does anyone have more information on this 'Scribe' system? Google fails me.

    --
    All rites reversed 2010
    1. Re:Scribe? by rs232 · · Score: 1
      --
      davecb5620@gmail.com
    2. Re:Scribe? by rs232 · · Score: 2, Informative
      --
      davecb5620@gmail.com
    3. Re:Scribe? by Ceriel+Nosforit · · Score: 2, Funny

      Ah, thank you. Clearly your google-fu is superior to mine. ;)

      --
      All rites reversed 2010
  12. Re:Help the Library of Congress save American Hist by Anonymous Coward · · Score: 1, Funny
  13. Excellent project!!! by 2Bits · · Score: 2, Interesting

    This is absolutely cool. There are a lot more places where "brittle books" are lying around, waiting to be digitized and distributed to the whole world. And as the technologies used in this project are going to be refined and improved, and eventually released, everyone will benefit.

    The question now is: would they accept technical contributions from the public (I mean, OS geek communities), just like other open source projects? I know a lot of people would be eager to join. How about a SETI-like system to harness the power of desktop computers around the world to help with image processing and OCR? Hey, I got 4 decent desktop computers that can contribute at least 8 hours/day each.
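    A rough sketch of how the SETI-like idea could carve a scanned book into work units and hand them to volunteer machines. Nothing in the article says Scribe or the Internet Archive works this way; the function names, the page identifiers and the machine names below are all made up for illustration.

```python
from itertools import islice


def work_units(page_ids, unit_size):
    """Yield consecutive batches of page ids, each at most unit_size long."""
    it = iter(page_ids)
    while True:
        batch = list(islice(it, unit_size))
        if not batch:
            return
        yield batch


def assign(units, volunteers):
    """Round-robin assignment: unit i goes to volunteer i mod len(volunteers)."""
    plan = {v: [] for v in volunteers}
    for i, unit in enumerate(units):
        plan[volunteers[i % len(volunteers)]].append(unit)
    return plan


if __name__ == "__main__":
    # Hypothetical page identifiers and volunteer machine names.
    pages = [f"page-{n:04d}" for n in range(1, 11)]
    plan = assign(work_units(pages, 3), ["desktop-a", "desktop-b"])
    for volunteer, units in plan.items():
        print(volunteer, units)
```

    A real system would also need result verification (for example, sending each unit to two machines and comparing the output), which this sketch leaves out.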

  14. How much data? by zaguar · · Score: 1

    How much data is needed? Of course, it would be necessary to have this number in a useful units system. Perhaps, the number of Libraries of Congresses of Data?

    --
    "Sure there's porn and piracy on the Web but there's probably a downside too."
  15. Full spectrum scanning by marcsiry · · Score: 1

    It would be neat if they did some other type of scanning, such as laser scanning the exterior, so that the book's heft and presence can be reproduced in the future. I've printed out crisp, high-resolution PDFs of user manuals, and having three hundred pages of printer paper binder clipped together just isn't the same as a nice, perfect-bound manual with a glossy cover.

    Sure, I imagine most of the consumption in the future will be done in a digital environment, but it would be nice if future generations had the option of popping the file into whatever will pass for a replicator and getting a decent representation of a long-vanished physical object- especially since the technology exists and the incremental memory needed is fairly trivial.

    --
    Marc Siry || interactive media professional, motorcycle enthusiast ||
  16. The sad part of digitization. by Lethyos · · Score: 4, Interesting

    Eventually we will have no physical record of these writings and may someday learn from the digital copies that Benjamin Franklin, George Washington, and others had offered enthusiastic support for wiretapping and other forms of electronic surveillance.

    --
    Why bother.
  17. Re:Help the Library of Congress save American Hist by stunt_penguin · · Score: 1

    But... where in the world is she?!

    --
    When the posters fear their moderators, there is tyranny; when the moderators fears the posters, there is liberty.
  18. Google by Anonymous Coward · · Score: 0

    Not sure of the details, but wasn't Google trying to scan books into online searchable format and got sued by publishers? Maybe Google should look at making a donation in time, effort and money to the Library of Congress and maybe get a huge tax writeoff. If it's in the possession of the Library of Congress in a searchable format online, what are the publishers going to say then? Obviously Google can't donate unauthorized copies, but they can donate software engineering, scanning services, hosting services, bandwidth and money or can they? Would be amusing to watch how government officials line up on such a discussion.

  19. OCR software is still closed source by Rick+Richardson · · Score: 0, Redundant


    The OCR software from Scribe is still closed source.

  20. Ooops by Anonymous Coward · · Score: 0

    Posted the above without having read the New York Times article that previous poster linked which indicates Google has made some donations to such. Would still be interesting to see them come out stronger on it though.

  21. What OCR-Engine do they use? by testerus · · Score: 1

    What OCR-Engine do they use?

  22. Re:Help the Library of Congress save American Hist by TinCanFury · · Score: 1

    I'd totally buy that game for my Xbox360 or "Windows Vista" based home gaming computer.

  23. How much? by muffel · · Score: 1

    Yeah, but how many Libraries of Congress will... eh... never mind.

    --

    bla
  24. Replacing paper documents with digital documents.. by syrion · · Score: 1

    ...has been done before. The Domesday Book was digitized by the BBC Domesday Project. Unfortunately, it ended up being a comic rather than a technical triumph. People must remember that, no matter how low-tech a method of physical data storage seems, it's more reliable in the long run than data storage relying on complex technology. I'm not against digitization by any means, of course: it could be useful as a research tool, and as an alternate method of access. It shouldn't be viewed as a long-term archival project, however. If the documents are really "disintegrating," take high-quality images of the pages and print them in good ink on acid-free archival paper (or vellum).

  25. Oh, it will be. by Grendel+Drago · · Score: 2, Interesting

    Project Gutenberg uses plenty of scans from American Memory to make their etexts--they do pretty much what you describe. At the lowest level, they make a plaintext copy, but they also do formatting and in-text hyperlinking: for instance, linking footnotes to their references, or index page numbers to anchors in the text. (See the HTML version of this etext to see what I mean.) Browse to a random book from this random collection, and you'll see what the LoC provides for their collections currently. As Brewster Kahle will be involved, you might want to see what projects he's done and how they're provided: a random book from the Million Book Project is available as a DjVu document, as well as (badly) OCR'd text.

    --
    Laws do not persuade just because they threaten. --Seneca
  26. Quality as well as quantity, please by Ankh · · Score: 3, Informative

    The books I've looked at have been scanned at a resolution that's more or less adequate for OCR, but isn't really adequate for reproducing fine woodcuts, and is hopeless at metal engravings. I've found from my work on fromoldbooks.org that anything less than 1200 dpi generally produces pretty poor results for images, so that, for example, you can't read the signatures of the artist and engraver, still less compare engraving styles. It would be sort of like having a paraphrase of the text instead of the actual words.

    It does, of course, vary a lot depending on the style of image. Bold illustrations for children's books, for example, do better at, say, 800dpi greyscale or colour. Fine steel engravings with lines at, say, less than a tenth of a degree from horizontal (they were done by hand after all) and that come out only a couple of pixels wide even at 1200dpi just turn into gray mush with weird banding artefacts until you go to a higher resolution (I use 2400dpi). There's a widely-cited study indicating that an "ultra-high" scan resolution of 400dpi is more than sufficient, based on an extremely small sample of images.
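    To put rough numbers on that, here is a small back-of-the-envelope calculation. The 0.04 mm line width and the 6 x 9 inch page are assumptions picked for illustration, not measurements from any particular book.

```python
MM_PER_INCH = 25.4


def pixels_across(line_width_mm: float, dpi: int) -> float:
    """Number of pixels spanning a line of the given physical width."""
    return line_width_mm / MM_PER_INCH * dpi


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bytes_per_pixel: int = 1) -> int:
    """Raw size of one scanned page (1 byte per pixel = 8-bit greyscale)."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


if __name__ == "__main__":
    line_mm = 0.04              # assumed width of a fine engraved line
    page_w, page_h = 6.0, 9.0   # assumed page size in inches
    for dpi in (400, 800, 1200, 2400):
        px = pixels_across(line_mm, dpi)
        mb = uncompressed_bytes(page_w, page_h, dpi) / 1e6
        print(f"{dpi:5d} dpi: line is {px:.2f} px wide, page is {mb:.0f} MB raw")
```

    Under those assumptions the line is less than one pixel wide at 400dpi and just under two pixels at 1200dpi, while the raw page size grows with the square of the resolution.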

    The damage that's done by poor quality digitization is that it makes it harder to justify doing a better job in the future.

    --
    Live barefoot!
    free engravings/woodcuts
  27. Re:Help the Library of Congress save American Hist by Xichekolas · · Score: 1

    If I had mod points, I'd use them all to mark this +5 hilarious.

    --

    Self-referential Sigs are cool on /. these days...

    54

  28. Oh NOESZZZ by imikem · · Score: 1

    Poor Benjamin Franklin is about to be deprived of the legitimate compensation owed him. How is the fellow supposed to make a living? He's only been dead 217 years. Surely copyright on his works ought to be retained for awhile yet. Now those communist pinko linux f@9$ have really gone over the line.

      Just in case you're immeasurably thick.

    --
    Perscriptio in manibus tabellariorum est.
  29. Re:Replacing paper documents with digital document by Vombatus · · Score: 1

    Ahh, but if they had saved the digitised images using openly specified formats, rather than some obscure format, they would not have had too much of a problem reading the images.

    --
    This sig is intentionally blank
  30. Re:Replacing paper documents with digital document by Anonymous Coward · · Score: 0

    like this?

    http://sourceforge.net/projects/xena/

  31. What they don't mention... by redbaritone · · Score: 1

    Is that these documents will be made available on iTunes for $0.99 at roughly 80% digital accuracy and for $1.29 at around 92% digital accuracy. If you want 100% digital accuracy, just get a Library of Congress card, check it out, and copy it yourself.

  32. Re:Replacing paper documents with digital document by syrion · · Score: 1

    That's true, but the media would have needed to be updated to CD-ROM (and DVD?) anyway. Remember, at the time, truecolor displays were very uncommon, and JPEG wasn't yet specified.