Slashdot Mirror


How To Build a TimesMachine (nytimes.com)

necro81 writes: The NY Times has an archive, the TimesMachine, that allows users to find any article from any issue from 1851 to the present day. Most of it is shown in the original typeset context of where an article appeared on a given page, like sifting through a microfiche archive. But when original newspaper scans are 100-MB TIFF files, how can this information be conveyed in an efficient manner to the end user? These and other computational challenges are described in this blog post on how the TimesMachine was realized.

41 comments

  1. OCR? by The-Ixian · · Score: 2

    Seems like this is the obvious choice.

    Maybe just the headlines or the first paragraph and then link to a compressed version of the image file or PDF (not the TIFF itself for Jebussake).

    --
    My eyes reflect the stars and a smile lights up my face.
    1. Re: OCR? by Anonymous Coward · · Score: 0

      What about a Hot Tub Time Machine?

    2. Re: OCR? by The-Ixian · · Score: 1

      Sounds like a good enough concept to make 2 movies...

      --
      My eyes reflect the stars and a smile lights up my face.
    3. Re:OCR? by Anonymous Coward · · Score: 0

      Sigh, another huge company that has no clue what is already out there and what they did isn't anything special.

      https://www.fold3.com/

      Those guys have done this successfully with millions of documents. Maybe, just maybe they should have sought some expertise before reinventing work that has already been done.

      And just how many books has Google scanned? Come on NYTimes, get with the times... As usual newspapers lag behind the curve and this is the #1 reason you're dying off.

    4. Re:OCR? by amicusNYCL · · Score: 1

      It's obviously more than just OCR, because when you mouse over the page it will highlight each article you're looking at, however oddly that article was fit into the page (including highlighting associated pictures, etc). That's actually a pretty cool site; in fact that archive might even be a better reason to get a subscription than the actual current edition.

      The free preview doesn't give you much, but it does show the issue from February 1, 1979. In the top right is a story about Ayatollah Khomeini arriving back in Iran from exile, which is interesting to see the contemporary story. On the bottom-left of the front page is a smaller article, without pictures, that continues on page 21. Here's the headline:

      Security Agency Holds A Quiet, Crucial Power Over Communications

      The article begins "For the last quarter century, one of the Government's most secret agencies has played an important, largely undisclosed role in shaping the nation's privately owned ..." (that's your free preview). If I zoom in on the paper I think I can make out the rest of that sentence as "... communications network of [broadcast towers?], underground cables, satellites, and computers." The list of subjects on that article lists astronautics, communications, internal security, internal communications, National Security Agency, telephones, and United States. 37 years ago today the NY Times was reporting on the NSA holding power over the communications infrastructure of the US.

      Page 4 has an ad for the latest TI calculators, available at Bloomingdale's.

      This story is on page 7:

      SOLAR ENERGY HELD STILL DECADES AWAY
      Panel Does Not Expect Major Shift Until Technology Is Ready for Conversion in Electricity
      A panel of leading specialists, convened a year ago at the request of the White House to assess prospects for generating electric power from sunlight, has concluded that the ultimate prospects are "bright" but that for at least a decade the technology will not be sufficiently advanced to initiate a major conversion effort.

      --
      "Our two-party system is like a bowl of shit looking at itself in a mirror." - Lewis Black
    5. Re:OCR? by dj245 · · Score: 1

      It's obviously more than just OCR, because when you mouse over the page it will highlight each article you're looking at, however oddly that article was fit into the page (including highlighting associated pictures, etc). That's actually a pretty cool site; in fact that archive might even be a better reason to get a subscription than the actual current edition.

      The free preview doesn't give you much, but it does show the issue from February 1, 1979. In the top right is a story about Ayatollah Khomeini arriving back in Iran from exile, which is interesting to see the contemporary story. On the bottom-left of the front page is a smaller article, without pictures, that continues on page 21. Here's the headline:

      Security Agency Holds A Quiet, Crucial Power Over Communications

      The article begins "For the last quarter century, one of the Government's most secret agencies has played an important, largely undisclosed role in shaping the nation's privately owned ..." (that's your free preview). If I zoom in on the paper I think I can make out the rest of that sentence as "... communications network of [broadcast towers?], underground cables, satellites, and computers." The list of subjects on that article lists astronautics, communications, internal security, internal communications, National Security Agency, telephones, and United States. 37 years ago today the NY Times was reporting on the NSA holding power over the communications infrastructure of the US.

      Page 4 has an ad for the latest TI calculators, available at Bloomingdale's.

      This story is on page 7:

      SOLAR ENERGY HELD STILL DECADES AWAY
      Panel Does Not Expect Major Shift Until Technology Is Ready for Conversion in Electricity
      A panel of leading specialists, convened a year ago at the request of the White House to assess prospects for generating electric power from sunlight, has concluded that the ultimate prospects are "bright" but that for at least a decade the technology will not be sufficiently advanced to initiate a major conversion effort.

      Old newspapers are very interesting. I thought I knew all about World War II. Then I started reading the newspaper, starting in October 1938. I am reading the Canberra Times, since it is freely available on the internet. And Australia has such a great website to read it on! It's a bit sad that the Library of Congress doesn't seem to have this kind of system for American newspapers. I guess that's what living in a society that believes in perpetual copyright gets you.

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
  2. Huh? by Anonymous Coward · · Score: 0

    What the hell is this gobbly gook

  3. TIFF? by U2xhc2hkb3QgU3Vja3M · · Score: 1

    Isn't there another format that would be better than TIFF? How about TGA? :p

    1. Re:TIFF? by Anonymous Coward · · Score: 2, Informative

      Isn't there another format that would be better than TIFF? How about TGA? :p

      TIFF can use Group IV compression on monochrome images, whereas TGA is limited to RLE.

    2. Re:TIFF? by omnichad · · Score: 1

      TIFF is ideal for black and white scans. There are multiple compression options and it contains DPI/dimension information.

      What's your argument for TGA?

    3. Re:TIFF? by Anonymous Coward · · Score: 0

      How About Png Its The Best Format Ever When You Use Png Crush Png Is The Future

    4. Re:TIFF? by U2xhc2hkb3QgU3Vja3M · · Score: 1

      Easier-to-decode, fixed-length header? :p

    5. Re:TIFF? by omnichad · · Score: 1

      There are a lot more considerations. For one, you can start with an existing library for TIFF. Overall file size is smaller as you can use JPEG compression for photos and LZW, RLE or CCITT for B&W text (not that they had mixed compression - nobody seems to use that).

      Also, TIFF is already the standard storage format for document imaging. This is likely the format their archives were already in.

    6. Re:TIFF? by U2xhc2hkb3QgU3Vja3M · · Score: 1

      The one thing we miss for the web is a sort of JPEG-compressed image data but with a PNG-lossless alpha channel.

    7. Re:TIFF? by omnichad · · Score: 1

      Apparently, you can use a bit of SVG magic to use a PNG for an alpha mask on a JPEG: http://peterhrynkow.com/how-to...

    8. Re: TIFF? by Anonymous Coward · · Score: 0

      FLIF might be the future.

    9. Re:TIFF? by SeaFox · · Score: 1

      4-bit PNG would be a better choice for black and white pages as it will retain detail on text and can handle the greyscale photos okay.
      But it would need to be something else for spreads with color.

  4. Show the text of the article. by Anonymous Coward · · Score: 0

    What the fuck is the problem here?!

    Show the headline as plain text.

    Show the article body as plain text.

    Link it to the image, in case there's some formatting or context the user may wish to view.

    Done.

    1. Re:Show the text of the article. by Locke2005 · · Score: 1

      OCR it into plain text to build indexes that point back into the TIFF files. You want the context, which in many cases includes (badly halftoned) pictures.

      --
      I've abandoned my search for truth; now I'm just looking for some useful delusions.
    2. Re:Show the text of the article. by Anonymous Coward · · Score: 0

      Why are you repeating exactly what the GP comment said to do? Did you even bother to read that comment before you replied to it? It seems not!

    3. Re:Show the text of the article. by Anonymous Coward · · Score: 0
    4. Re:Show the text of the article. by Impy+the+Impiuos+Imp · · Score: 1

      Also interesting are the ads.

      Will they be censoring old writings that might, umm, trigger the modern sensibility?

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    5. Re:Show the text of the article. by twosat · · Score: 1

      Something like this has already been done in New Zealand. I don't see what the big problem is. As previously suggested, just have a low-resolution scan of the original text displayed next to a plain-text version.

      Here's an example of the original text when I searched for a street address near to me:

      http://paperspast.natlib.govt....

      And here's the plain-text version, complete with OCR errors:

      http://paperspast.natlib.govt....

      As a bonus, they both have the searched-for text highlighted as well.
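
A minimal sketch of the idea several comments in this thread land on: OCR the page, then build a word index whose entries point back to regions of the page image so a search hit can be highlighted in context. The tuple shape for OCR output, the page IDs, and the pixel boxes below are all invented for illustration; a real OCR engine will emit its own format.

```python
import re
from collections import defaultdict


def normalise(word):
    """Lowercase and strip punctuation so 'Agency,' and 'agency' share a key."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def build_index(ocr_words):
    """Map each normalised word to the (page_id, box) places it appears.

    ocr_words: iterable of (page_id, text, (left, top, right, bottom)),
    a hypothetical shape for whatever the OCR step produces.
    """
    index = defaultdict(list)
    for page_id, text, box in ocr_words:
        key = normalise(text)
        if key:  # skip tokens that were pure punctuation
            index[key].append((page_id, box))
    return index


def search(index, word):
    """Return every (page_id, box) where the word was recognised."""
    return index.get(normalise(word), [])


# Made-up OCR output for two pages of one issue.
sample = [
    ("1979-02-01/p1", "Security", (120, 2210, 260, 2240)),
    ("1979-02-01/p1", "Agency", (270, 2210, 370, 2240)),
    ("1979-02-01/p21", "Agency,", (80, 400, 175, 428)),
]
index = build_index(sample)
hits = search(index, "agency")
```

Each hit carries enough to open the right page image and draw a highlight box, which is roughly what the Papers Past links above appear to do. OCR errors simply become misspelled keys, so a real system would likely want fuzzy matching on top.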

  5. nobody cares, google did it first by Thud457 · · Score: 5, Funny

    sticking it behind a paywall will cut down on bandwidth usage dramatically.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

    1. Re: nobody cares, google did it first by gstoddart · · Score: 1

      Yeah, no shit ... my browser just blocks NYT entirely ... no, I'm not turning on cookies and javascript so I can see your paywall.

      I've just cut out the crap and blocked it entirely. NYT has ceased to exist as far as I'm concerned.

      --
      Lost at C:>. Found at C.
  6. Colossus must be fed by Thud457 · · Score: 1

    sorry, dropped the link.

    Although, sadly, they apparently never came to an agreement with The New Zork Times.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

  7. How? by Anonymous Coward · · Score: 0

    Filter it through imagemagick to downsize it drastically, cache the results.

    That's how I would do it.
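
A rough sketch of the downsize-and-cache approach, assuming ImageMagick 7 is installed and its `magick` binary is on the PATH (older installs use `convert` instead). The function names, the cache directory, and the 25% default are made up for illustration, and nothing here is claimed to be how the Times does it.

```python
import hashlib
import subprocess
from pathlib import Path


def cache_path(src, percent, cache_dir="cache"):
    """Derive a stable cache filename from the source path and scale factor."""
    digest = hashlib.sha256(f"{src}|{percent}".encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.png"


def resize_command(src, dst, percent):
    """Build the ImageMagick call; -resize N% scales both dimensions."""
    return ["magick", str(src), "-resize", f"{percent}%", str(dst)]


def get_preview(src, percent=25, cache_dir="cache"):
    """Return a downsized copy of src, running ImageMagick only on a cache miss."""
    dst = cache_path(src, percent, cache_dir)
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(resize_command(src, dst, percent), check=True)
    return dst
```

Calling `get_preview("page.tif")` twice would shell out once and serve the second request from disk. A single downsized image still forces a choice between legibility and size, which is where tiling comes in.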

  8. Common tech by Anonymous Coward · · Score: 0

    Besides mapping, several sites use this kind of tiling technology to show parts of a large image.
    Example: http://imagealgorithmics.com/bayer.html allows you to drag a ROI selector box in a low resolution guide image and shows that ROI in high resolution. Takes a few kB of javascript.
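
For anyone who hasn't seen tiling up close, here is a small sketch of the bookkeeping behind it: cut the big image into fixed-size tiles at several zoom levels, then fetch only the tiles that overlap what the user is looking at. The 256-pixel tile size and the function names are assumptions for illustration, not taken from any particular viewer.

```python
import math


def pyramid_levels(width, height, tile_size=256):
    """Count zoom levels, halving the image until it fits in a single tile."""
    levels = 1
    while width > tile_size or height > tile_size:
        width = math.ceil(width / 2)
        height = math.ceil(height / 2)
        levels += 1
    return levels


def tiles_for_region(left, top, width, height, tile_size=256):
    """List the (col, row) tiles that overlap a pixel region at one zoom level."""
    first_col = left // tile_size
    first_row = top // tile_size
    last_col = (left + width - 1) // tile_size
    last_row = (top + height - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 400x200 pixel viewport whose top-left corner sits at (300, 100).
needed = tiles_for_region(300, 100, 400, 200)
```

Here the viewport touches only four tiles, so the browser downloads four small images instead of the whole page, however large the full scan is.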

  9. Google news archive by Anonymous Coward · · Score: 1

    I browse the Google newspaper archive a lot.
    It changes the perspective on why we are the way we are now and how we have changed.

    Newspapers used to list all admissions and discharges from the local hospital, and guests at hotels/motels in the area. John Reese from Cleveland checked into the corner motel; he is in town discussing a business deal with Jim Smith.

    Job postings from the '40s to the late '50s included a female and a male section.

    And the biggest one was the false advertising and ads made to look like articles and people selling snake oil as the cure for everything. Papers as far back as I've read until around the 1910's are full of them.

    None of this would be acceptable today.

             

    1. Re: Google news archive by Anonymous Coward · · Score: 0

      Tabloids continue several "old-fashioned" ideas. There is a market for every kind of information.

    2. Re:Google news archive by Anonymous Coward · · Score: 0

      ads made to look like articles and people selling snake oil as the cure for everything. ... None of this would be acceptable today.

      Have you opened a paper recently? My city's paper is full of "Paid Advertisements" that look just like articles (except with "Paid Advertisement" centered above them in a small font). Full of ads for magnetic wonder bracelets and copper back braces and if your zip code begins with 1 2 3 4 5 6 7 8 9 or 0 then you qualify for this limited time bullshit hand crafted by Jehovah's Witness orphans imported from China!

      I subscribe for the coupons on Sunday and the grocery ads on Wednesday and they won't let me get a subscription with just those two days, so I get the whole week and toss most of it out.

  10. Unlike those pesky old Microfiche and Paper copies by pecosdave · · Score: 1

    these newer electronic versions of the archives are much simpler for the Ministry of Truth to manage. A Minitrue employee, Mr. Winston Smith, has stated that though his job maintaining the integrity of the archives has changed little, it is much more satisfying to be able to verify that his corrections are reaching the masses more quickly and thoroughly than in the past, when it could take months for corrections to propagate, and there are fewer loose ends than when no one was really sure if a copy went down a memory hole or not. Yes, he assured us, these changes are double plus good and will do much to strengthen the party.

    --
    The preceding post was not a Slashvertisement.
  11. How about good old-fashioned version control by Anonymous Coward · · Score: 0

    The Times is notorious for its front-page edits without retraction notices. Try checking out some of your favorite articles in the wayback machine.

  12. ridiculous scan density by ihtoit · · Score: 1

    who needs an archival scan of NEWSPRINT TEXT at 600+DPI? Most paper photographs (remember those?) only have about 300DPI worth of information on them *if you are lucky*, you'd be wasting your time scanning any higher for those. Making legal digital duplicates of typeset documents only requires 150DPI (which is the same as standard Fax which also happens to be a legal service method).

    (source: several years experience dealing with document archiving and photo digitising on a commercial as well as a personal basis)

    --
    Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    1. Re:ridiculous scan density by dj245 · · Score: 1

      who needs an archival scan of NEWSPRINT TEXT at 600+DPI? Most paper photographs (remember those?) only have about 300DPI worth of information on them *if you are lucky*, you'd be wasting your time scanning any higher for those. Making legal digital duplicates of typeset documents only requires 150DPI (which is the same as standard Fax which also happens to be a legal service method).

      (source: several years experience dealing with document archiving and photo digitising on a commercial as well as a personal basis)

      Legal requirements on DPI are probably intended to make something legible. Making something pleasantly readable is a higher bar. Newspaper printing was an analog process, especially for old newspapers. 300DPI is readable and probably fine, for today. 20 years from now it is going to seem like the equivalent of a wax cylinder recording.

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    2. Re:ridiculous scan density by Anonymous Coward · · Score: 0

      The tiff images were monochrome (1-bit) scans at 300 DPI.
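
To put numbers on the DPI argument, here is a back-of-the-envelope calculation of uncompressed scan size. The 13.5 x 22 inch page is an assumed broadsheet size chosen for illustration, not a figure from the article, and the function name is made up.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scan: pixels across x pixels down x bit depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size; 1-bit monochrome versus 8-bit greyscale.
mono_300 = raw_scan_bytes(13.5, 22, 300, 1)
grey_600 = raw_scan_bytes(13.5, 22, 600, 8)
```

With those assumed dimensions, the 1-bit 300 DPI page comes to about 3.3 MB before compression, and the 8-bit 600 DPI page to about 107 MB. Doubling the DPI quadruples the pixel count, and going from 1-bit to 8-bit multiplies it by eight again, so scan settings matter far more than the container format.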

  13. PNG? by shaitand · · Score: 1

    Really, this isn't that complicated a problem. It's a newspaper: run the things through a compression algorithm such as PNG and drop the colorspace down to something reasonable for a newspaper. Hell, black and white is actually just fine, but you can keep 256 colors if it makes you feel warm and fuzzy and still drop that 100MB to 10-100k tops. Then OCR it all for search. Tada.
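
A quick way to get a feel for that claim without any image library: PNG compresses its pixel data with DEFLATE, so Python's `zlib` on a packed 1-bit bitmap gives a rough stand-in. The page below is synthetic (bands of mostly white "text" rows separated by blank rows, at an assumed 4050 x 6600 pixels), PNG's row filtering is skipped, and a real scan with noise and halftones will compress differently, so treat the ratio as illustrative only.

```python
import random
import zlib


def synthetic_page(width_px=4050, height_px=6600, seed=0):
    """Build a fake 1-bit page: white background with bands of speckled rows."""
    rng = random.Random(seed)
    row_bytes = (width_px + 7) // 8  # 8 pixels per byte
    blank = b"\xff" * row_bytes      # all-white row
    rows = []
    for y in range(height_px):
        if y % 40 < 12:
            # "Text" band: start white, then scatter some dark pixels.
            row = bytearray(blank)
            for _ in range(row_bytes // 4):
                row[rng.randrange(row_bytes)] = rng.randrange(256)
            rows.append(bytes(row))
        else:
            rows.append(blank)
    return b"".join(rows)


raw = synthetic_page()
packed = zlib.compress(raw, 9)
print(len(raw), len(packed))
```

Running it prints the raw and compressed byte counts side by side. Whether a real page gets down to 10-100k depends heavily on how clean the scan is.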

  14. God Dammit Slashdot by Anonymous Coward · · Score: 0

    The article leads me to believe that they built the archiving system out of Python, which blows my freaking mind. I wish it had exposed more, but still fascinating.

    That's the second really interesting article today. What the fuck is happening?

    1. Re:God Dammit Slashdot by twistedcubic · · Score: 1

      You forgot to ask how much NYTimes paid Slashdot to do this article. But it is interesting, nonetheless.

  15. Meh. Still want a Phillips Coffee Machine. by RogueWarrior65 · · Score: 1

    Coffee? Coffee. Coffee! COFFEE!!