How To Build a TimesMachine (nytimes.com)
necro81 writes: The NY Times has an archive, the TimesMachine, that allows users to find any article from any issue from 1851 to the present day. Most of it is shown in the original typeset context of where an article appeared on a given page, like sifting through a microfiche archive. But when original newspaper scans are 100-MB TIFF files, how can this information be conveyed efficiently to the end user? These and other computational challenges are described in this blog post on how the TimesMachine was realized.
Seems like this is the obvious choice.
Maybe just the headlines or the first paragraph and then link to a compressed version of the image file or PDF (not the TIFF itself for Jebussake).
My eyes reflect the stars and a smile lights up my face.
What the hell is this gobbledygook?
Isn't there another format that would be better than TIFF? How about TGA? :p
What the fuck is the problem here?!
Show the headline as plain text.
Show the article body as plain text.
Link it to the image, in case there's some formatting or context the user may wish to view.
Done.
sticking it behind a paywall will cut down on bandwidth usage dramatically.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
sorry, dropped the link.
Although, sadly, they apparently never came to an agreement with The New Zork Times.
Filter it through imagemagick to downsize it drastically, cache the results.
That's how I would do it.
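A minimal sketch of that downsize-and-cache idea, assuming ImageMagick's `convert` command is on the PATH (ImageMagick 7 installs it as `magick` instead). The `cache` directory, the 1024-pixel default width, and the function names are all hypothetical choices for illustration, not anything the Times is known to use.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import List

CACHE_DIR = Path("cache")  # hypothetical cache location


def cache_path(source: Path, width: int) -> Path:
    """Derive a stable cache filename from the source path and target width."""
    key = hashlib.sha256(f"{source}|{width}".encode("utf-8")).hexdigest()[:16]
    return CACHE_DIR / f"{key}_{width}.png"


def resize_command(source: Path, dest: Path, width: int) -> List[str]:
    """Build the ImageMagick call; a geometry of 'WIDTHx' keeps the aspect ratio."""
    return ["convert", str(source), "-resize", f"{width}x", str(dest)]


def downsized(source: Path, width: int = 1024) -> Path:
    """Return the cached downsized copy, creating it on first request only."""
    dest = cache_path(source, width)
    if not dest.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        subprocess.run(resize_command(source, dest, width), check=True)
    return dest
```

Only the first request for a given scan and width pays for the conversion; later requests are served straight from the cached file.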
Besides mapping, several sites use this kind of tiling technology to show parts of a large image.
Example: http://imagealgorithmics.com/bayer.html allows you to drag a ROI selector box in a low resolution guide image and shows that ROI in high resolution. Takes a few kB of javascript.
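The core of that tiling approach is working out which pre-cut tiles overlap the selected ROI, so only those are fetched. A rough sketch, assuming square 256-pixel tiles indexed by (column, row) from the top-left corner; the tile size and the function name are assumptions, not taken from the linked page.

```python
from typing import List, Tuple

TILE = 256  # tile edge in pixels; an assumed value


def tiles_for_roi(x: int, y: int, w: int, h: int, tile: int = TILE) -> List[Tuple[int, int]]:
    """Return (column, row) indices of every tile overlapping the ROI.

    The ROI is in full-resolution pixel coordinates: top-left (x, y), size w x h.
    """
    if w <= 0 or h <= 0:
        return []
    first_col, last_col = x // tile, (x + w - 1) // tile
    first_row, last_row = y // tile, (y + h - 1) // tile
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]
```

A 100x100 ROI whose corner sits at (200, 200) straddles a tile boundary in both directions, so it needs four tiles rather than the whole image.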
I browse the newspaper archives in Google a lot.
It changes the perspective on why we are the way we are now and how we have changed.
Newspapers used to print all admissions and discharges from the local hospital, and guests at the hotels/motels in the area: "John Reese from Cleveland checked into the corner motel; he is in town discussing a business deal with Jim Smith."
Job postings from the 1940s to the late 1950s included separate female and male sections.
And the biggest one was the false advertising: ads made to look like articles, and people selling snake oil as the cure for everything. Papers from as far back as I've read until around the 1910s are full of them.
None of this would be acceptable today.
These newer electronic versions of the archives are much simpler for the Ministry of Truth to manage. A Minitrue employee, Mr. Winston Smith, has stated that though his job maintaining the integrity of the archives has changed little, it is much more satisfying to be able to verify that his corrections are reaching the masses more quickly and thoroughly than in the past, when it could take months for corrections to propagate. There are also fewer loose ends now than when no one was really sure whether a copy had gone down a memory hole or not. Yes, he assured us, these changes are doubleplusgood and will do much to strengthen the Party.
The preceding post was not a Slashvertisement.
The Times is notorious for its front-page edits without retraction notices. Try checking out some of your favorite articles in the wayback machine.
who needs an archival scan of NEWSPRINT TEXT at 600+DPI? Most paper photographs (remember those?) only have about 300DPI worth of information on them *if you are lucky*, you'd be wasting your time scanning any higher for those. Making legal digital duplicates of typeset documents only requires 150DPI (which is the same as standard Fax which also happens to be a legal service method).
(source: several years experience dealing with document archiving and photo digitising on a commercial as well as a personal basis)
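To put rough numbers on the DPI argument: pixel count grows with the square of the resolution, so a 600 DPI scan carries 16 times the pixels of a 150 DPI scan of the same sheet. A small sketch of that arithmetic; the function names are made up for illustration, and any page dimensions passed in are the caller's assumption.

```python
def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Total pixels in a scan of a sheet of the given size (inches) at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def relative_size(dpi: int, reference_dpi: int) -> float:
    """How many times more pixels a scan at `dpi` has than one at `reference_dpi`."""
    return (dpi / reference_dpi) ** 2


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        print(dpi, "DPI:", relative_size(dpi, 150), "x the pixels of 150 DPI")
```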
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
Really, this isn't that complicated a problem. It's a newspaper: run the scans through a compression format such as PNG and drop the colorspace down to something reasonable for a newspaper. Hell, black and white is actually just fine, but you can keep 256 colors if it makes you feel warm and fuzzy and still drop that 100MB to 10-100k tops. Then OCR it all for search. Tada.
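A rough illustration of the "drop to black and white, then compress" step, using only the standard library. It thresholds 8-bit grayscale pixels to 1 bit each and runs the packed result through DEFLATE, the same compression PNG uses internally; it is not a PNG encoder, it skips the OCR step, and the threshold of 128 and the function names are assumptions. Real scans with noise and halftones will compress far less well than a clean synthetic page.

```python
import zlib
from typing import List


def to_bilevel(gray_rows: List[List[int]], threshold: int = 128) -> bytes:
    """Threshold 8-bit grayscale rows to 1 bit per pixel, 8 pixels per byte (MSB first).

    Each row is padded with zero bits up to a whole byte.
    """
    out = bytearray()
    for row in gray_rows:
        for start in range(0, len(row), 8):
            byte = 0
            for i, value in enumerate(row[start:start + 8]):
                if value >= threshold:  # light pixel -> bit set
                    byte |= 0x80 >> i
            out.append(byte)
    return bytes(out)


def compressed_size(data: bytes) -> int:
    """Size in bytes after DEFLATE at the highest compression level."""
    return len(zlib.compress(data, 9))
```

Going from 8 bits per pixel to 1 already cuts the raw data by a factor of eight before any compression is applied; DEFLATE then takes care of the long runs of blank paper.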
The article leads me to believe that they built the archiving system out of Python, which blows my freaking mind. I wish it had exposed more, but still fascinating.
That's the second really interesting article today. What the fuck is happening?
Coffee? Coffee. Coffee! COFFEE!!