Google To Digitize Millions of Old Newspaper Pages
hhavensteincw writes "On Monday Google detailed new plans to digitize millions of newspaper pages with articles, photographs, and headlines intact so they can be accessed and searched online. 'Around the globe, we estimate that there are billions of news pages containing every story ever written,' Google said in a blog post. 'It's our goal to help readers find all of them, from the smallest local weekly paper up to the largest national daily.' For example, Google noted the availability of an original article from the Pittsburgh Post-Gazette from 1969 about the landing on the moon." When you search the news archive for, e.g., "Chicago fire" or "Rosenberg trial," a significant fraction of the result pages cost money to view.
http://news.google.com/archivesearch?q=%22armadillo+aerospace%22&scoring=t
Fuck I wish Carmack would stop using his Time Machine to get 1957 publicity.
How we know is more important than what we know.
From the billions of dollars of public good that is Google Maps to their true lack of evil, from their successful attempts to make the world a better place to the way they treat their employees, Google is truly great.
ALL HAIL GOOGLE. ALL HAIL GOOGLE. ALL HAIL GOOGLE.
If video games influenced behavior the Pac Man generation would be eating pills and running away from their problems.
Now, all those guys/girls who streaked during Woodstock are going to repent (more).
But seriously...
1. Guy/girl does something goofy in 70s as a teenager.
2. Gets covered by local news (at that time).
3. Google digitises that news.
4. Now CEO (then guy/girl) is suddenly let go.
Who hasn't done something goofy and, in retrospect, wished they hadn't (not necessarily something criminal)? Google might make their "second chance" disappear.
ps. Carly F. might have seen this coming ;-)
I welcome this news. For too long, research on the Internet has been a frustrating task. For any events after about 1997, there's oodles of information. However there's a giant hole in the amount of information available for events before then. Google Books went some way towards addressing this, but it was still an intense task because a lot of the time, you still have to find and buy the books (or find them in a Library).
I really hope they plan to go as far as putting local, regional newspapers online as well.
At last, something that looks really GOOD, from Google! With free access, this will really change the world, even more.
History revisionists will find it even more difficult to dupe.
Maybe there are serious drawbacks, but, for the time I cannot see anything but the positive aspects.
I hope they aren't restricting it to just newspapers. I've saved tons of interesting web articles from official news websites that have mysteriously disappeared over the years. They're not even in the Google cache. Hopefully, most of them will be in the Google News archive.
Bravo! And, as a Pittsburgher, I was elated to see the Kennywood ad, back before they made the "new" new Noah's ark....
Monstar L
Guy/girl does something goofy in 70s as a teenager. Gets covered by local news (at that time).
I've seen that already. I looked up an executive, and Google returned a hit from a student newspaper from the 1960s that they'd digitized from microfilm. The story mentioned the guy being a member of the Socialist Workers Alliance.
I hope to god that they edit out the advertising otherwise all us consumers will be frantic with longing for products that are no longer available, what with advertising not being a huge sham and all!
They whose government reduces their essential liberties for temporary security, receive neither liberty nor security.
Most amazing thing to me is on the next page is a story of the fucking asshole kennedy and his murder of Mary Jo Kopechne at Chappaquiddick...nice one Google.
I recently did some research that had me looking in the NYT's article archive. Thankfully, it was in the 1900-1920 period, so all the articles I wanted to access were free.
However, if I had been doing research in a later time period, say 1930-1940, I would have had to pay for access to the articles (well, probably not me - I'm sure we have institutional access, but someone would have had to pay).
Google seems to be offering this free of charge to viewers, at least initially. It sounds like they've obtained the rights, or are working in partnerships with publishers. I wonder if NYT will continue to require payment for access to some of their archives?
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
My thirty-year, $50-billion plan to consolidate the microfiche market may well be in the shitter.
Why doesn't Google just purchase some of the better newspaper archive databases, such as NewsBank, and simply release all the stories for free? It'd likely be a lot cheaper than duplicating effort, and would help information be released more quickly.
Incidentally, if you're close to a university or a good library, many of these places already hold subscriptions to such services and offer the use of them for free. I'd love to see Google expand upon this already-good base rather than duplicating effort.
I wonder how the news cartels will react to their copyrighted works being copied and put online... they've tried to sue google just for displaying content available on their sites and referenced from their sites with links...
-- Sex is the antonym of pringles. Once you pop it's time to stop.
You can already access the archives of The Times online :
http://archive.timesonline.co.uk/
It's quite interesting to read about Marie-Antoinette's execution or Jack the Ripper's crimes, I especially like the writing style :)
It dropped like a rock after news (from 2002) ended up online. Google and the Tribune Co. say each other is at fault. http://www.latimes.com/business/la-fi-moneyblog9-2008sep09,0,1609687.story What next? A news story about Pearl Harbor being attacked?
I've latterly been thinking about the googlization of everything digital. I've latterly also been thinking about the spread of botnets (Storm, Kraken and the like). This has led me to conclude there is a Google Black Ops department intent on replacing Google's vast server farms with users' own PCs - i.e., Google aims covertly to use our computers as its hardware!
From Google's perspective it makes perfect sense to use idle cycles on Aunt Harriet's aging Dell to serve googlicious applications to an eager populace. Why shouldn't she host your gmail account?
The whole concept can even be justified from an environmental point of view: scaling is naturally proportional to demand and load-spreading is extremely efficient. In the long term, Google won't need any of its own hardware other than expensive corporate buildings equipped with limitless executive toys and a few dumb terminals. Hell, we're beginning to see that already. Everyone benefits.
As for the spam emanating from botnets, this is a mere smoke-screen (or should I say cloud-screen?) designed to keep us off the scent.
I, for one, salute our new Gotnet overlord.
Now I can find out everyone I knew who's died with Google archiving the obituaries.
Well, if you won't switch, then just enjoy your never-ending loading screen with frames and ads.
Your choice, your cross to bear.
"The Google makes work for idle scanners."
Genesis 1:32 And God typed
Google news, brought to you by the department of truth! :)
Let's hope they'll not be too selective in which articles they publish.
I don't care if they take over the world, just so long as i don't have to scroll through years of microfilmed newspapers ever again - it makes me feel seasick!
At last, I can finally go back and tell my 3rd grade teacher THIS is why I didn't need to learn how to use a flippin microfiche!!
So... just like the London Gazette has already been digitized. The difference is, the Gazette began publishing in 1665. Sod the moon landings! You can read the front-line reports about the American Revolution.
http://paperspast.natlib.govt.nz
(already being done in New Zealand for some years thanks to the work of the National Library of New Zealand) papers available back to 1839. With text search too! Cool!
... would allow google to do the same thing. There's been so many times what was interesting came up in a book google searched only to have pages blanked out. Sometimes I wonder if they should just put advertising on the book itself and pay the owners/authors directly (for the hits/adclicks/being read, etc).
...the first known publication of Duke Nukem Forever is dated November 1997. http://en.wikipedia.org/wiki/Image:Dnf-lol.jpg
So when is google going to start scanning The National Enquirer or other tabloid newspapers, so the slashdotters can look up Natalie Portman's news with grits handy!
The purpose of writing is to inflate weak ideas, obscure poor reasoning, and inhibit clarity....Calvin
Or it might finally make people realize that we are all human, and a stupid act at 18 doesn't equate to judgment post 30. Naaahhh...
You must be new here. Welcome to Earth. We're a little strange here, but you will find that some of us can be relaxed and groovy. Enjoy your stay.
P.S. Please take me with you when you leave the planet
"Empathise with stupidity, and you're halfway to thinking like an idiot." - Iain M. Banks
...they're tracking news back to 200BC:
http://news.google.com/archivesearch?q=apollo&btnG=Search+Archives&scoring=t
A lot of interest in Apollo back then, and not a Cylon in sight.
holy cow batman!
google's archived 6000 years of newspaper clippings: http://news.google.com/archivesearch?q=egypt&ie=UTF-8&oe=UTF-8&btnGt=Show+Timeline
_ In Egypt Networks: Network Solutions with a Twist
and we are all going to regret it. Remember the public library system? Or the archival organizations? A bunch of highly trained people with literally centuries of experience in classifying and cataloging information, preserving the originals and investing heavily in digitization to help with that task and to make them more accessible? Most of their services are free or at a minimal cost, especially for students and researchers. And completely ad-free (at least here in Europe). Sure, their marketing sucks, and they do not have the latest Web x.0 gimmicks. They tend to be a bit stuffier, old-fashioned and not as flashy as our bubble heroes of the "do no evil" (but don't do anything good either) kind, but then they on average tend to think in decades and not in quarterly results. Data (even massive amounts of it) is not information and Google is not a research tool. Google will always tweak search results towards higher advertising revenues. It is at best a brute force instrument with a very low signal to noise ratio. It is a pest because it leads people to believe that keyword search is a solid method for research, and it adds to the funding problems for libraries, because who needs a library when you can "google" everything? Google sucks up all it can get and leaves behind a desert without structure, significance or context. Support and use your local (national) library, while you still have it.
No, I'm New Here
Compared to the summer of '69, this is a slow news year. Yes, I'm old enough to remember all that stuff. I don't remember it happening all in the same day, but it sure is interesting.
Research shows that 67% of those who use the term "research shows", are just making shit up.
Perhaps Google could just send some money directly to me.
Don't get me wrong, I would love to see this happen, but I'm not sure google would conclude that there's a lot in it for them to do this.
citizen, show me your identification papers.
And on the next page, the headline reads "Kennedy Faces Charge in Fatal Crash." Hehe, funny how things work out.
...now I know what Google really is!
This comment is for entertainment purposes only. Any similarity to real insight or information is purely coincidental.
----------------------------------- My Other Sig Is Hilarious -----------------------------------
Folks should accept that everything in their past is necessary to get them to being who they are today. Beyond that, follow up with "besides, I was trying to impress a girl," and your harshest critics should start mumbling and looking at their shoes (cuz they've done it too.) Hell, I'd be more concerned if you didn't have youthful indiscretions, because that indicates you're more likely to do stupid things as an adult ... you didn't get it out of your system as a kid, and haven't learned your lesson yet.
Do you know anybody who works in the news media? I do, several guys both in TV and paper news who have been placed all over the spectrum from editing room floors to the administrative level and even teaching positions at media and public relations colleges. They ALL report (privately) that the whole game is a giant crock of malarkey. The most interesting aspect is when the news teams don't even realize they're doing it, but simply re-broadcast biases and falsehoods because they are part of a form of non-deliberate groupthink. But it's worst when suggested stories are simply struck from the record because they don't match up with whatever political beliefs the owner happens to hold.
One of the big problems is the AP Newswire, to which so many large journals subscribe and pull feeds from word for word. --One thin little bottleneck through which major breaking news passes, meaning entire nations uniformly learn about events which are filtered by only a very small number of people.
The intriguing thing about bloggers is that they don't do this; they represent a broad and varied non-uniform message. This does not mean all bloggers are accurate or that there isn't the internet 'echo chamber' effect going on, but it does mean that there is actually a higher probability of actual news coming through the system. Have you ever clicked into democracynow.com? Some of the more prolific blogger sites have their own journalists covering stories and you generally get broader coverage, and people being interviewed in a non-soundbite kind of way.
-FL
It would be nice to know the titles and dates of which newspapers will be included in this "archive." If this follows the same pattern as the GoogleScholar and GoogleBooks we won't know which publishers are participating, how many titles are included, or how far back it goes. Knowing the scope of what's being searched usually helps in determining if it should be used in the first place.
Great, now UAL will crash every time Google scans a newspaper with an article about their bankruptcy. I imagine Google news will start showing headlines like "Man Lands on Moon" next.
So what is the best OCR package that runs on Linux?
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Having to pay to view these old articles is irritating.
I realize it costs money to scan and archive them, but perhaps these costs can be covered by putting Google Adwords on the sides and using advertising?
This sort of resource is invaluable. I can go to the library right now and go through newspaper archives on microfilm; Google should find a way to offer the same online without charging.
What a beautiful way to look into history, by reading the news articles of the day.
I hope they can make this happen for free or ad-supported somehow.
Who controls the past controls the future. Who controls the present controls the past.
- George Orwell
Google are going to patent /dev/memoryhole next.
At the (UK) university where I work we subscribe to Lexis Nexis (http://www.lexisnexis.com/). This gives full text from loads of newspapers around the world - there are no images (i think) and you can't see the contemporary ephemera such as adverts, but it's great for stories.
A search for '9/11' (as an example of a massively covered event worldwide) gives thousands of results, and within the first thousand English-language hits there are newspapers such as: Cobourg Daily Star (Ontario); The Independent (London); The Sydney Morning Herald (Australia); The Scotsman & Scotland on Sunday; Seattle Post-Intelligencer; South Bend Tribune; New Straits Times (Malaysia); The Japan Times.
Unless google is planning on doing like-for-like digitisations and/or giving free access to everyone I don't see that they are offering anything that doesn't already exist (admittedly as a [probably] expensive research tool).
It's interesting how in the example article from 1969 they use the real quote (in which Armstrong flubbed the line). I wonder when the revisionism started?
Karma: pi (Mostly due to circular reasoning in posts).
Check the other page of the newspaper, at the left: go for the movie ads. *That* is interesting. :)
AT &F1DT0,T0800665544 - Real men, real help desk support.
All I need now is an access point in the toilet and it will be great!
"Don't be evil" is just an advertising slogan, like "At Pontiac we build excitement" (bad brakes, crappy handling), "Chevy - Like A Rock" (damned thing won't start), "At Ford, Quality is job 1" (Got their work cut out for them).
Pontiac's handling has gotten a lot better. The GTO was a bit squishy but the new G8 is said to be a worthy challenger to the M5. If that's not good brakes and good handling, then I do not know what is.
Similarly, Ford is now routinely winning various quality rankings in its car offerings... but Ford's problem is that it has too much debt and can't build enough of the cars it is selling all too well, while at the same time it has a lot of people building big trucks that no one wants.
This is my sig.
The problem is that you are going to end up with a single source for information and that in and of itself is a bad thing.
Undetectable Steganography? Yep, there's an app fo
If the public library system allows itself to be superseded by Google, then it must be full of people who aren't nearly as insightful and wise as I was always led to believe!
Honestly, some physical content just isn't worth the space it occupies, to keep it around. We have entire periods of history that are completely *gone*, all because of fires that destroyed the documents in libraries.
Certainly, there is a place for "vintage books and magazines", but that place is probably a museum, not a library. Most content turns out to be far more useful after it's digitized into a fully text-searchable format. It's great that libraries are staffed with people very knowledgeable in helping you find content you're seeking. But in modern times, they need to expand their skill-set to include becoming expert searchers of digitized content too.
I view web services like Google as "DIY research tools", ultimately. There's absolutely nothing wrong with people trying to learn to do things like car repair or home improvement on their own. It saves them money, helps them learn new skills, and odds are, it gets their problem(s) solved. On the other hand, there's no substitute for professionals in any of those areas, either - and any good "do it yourselfer" knows when it's time to call in a pro. The library is the "professional" version of these research tools.
You can start here: http://news.google.com/archivesearch
To get past the ones with a cost (esp. from New York Times $3.95) and get free sources, click on 'Advanced archive search' next to the search button, and choose only articles with 'no price'.
Here's an example: http://news.google.com/archivesearch?q=rosenbergs&num=10&as_price=p1&sa=N&sugg=d&as_user_ldate=1950&as_user_hdate=1959&lnav=d3&hdrange=1980,2008
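If you'd rather build those "no price" links yourself than click through the advanced search form, here's a rough Python sketch. It only reuses the query parameters visible in the example URL above (q, as_price, as_user_ldate, as_user_hdate). That as_price=p1 means "no price" and that the two date parameters bound the year range are my guesses from that URL, not anything documented, and the function name is made up for illustration. The endpoint may also change or stop answering, so treat this as string-building only.

```python
from urllib.parse import urlencode

# Base address taken from the example link in the comment above.
BASE_URL = "http://news.google.com/archivesearch"


def build_archive_search_url(query, start_year=None, end_year=None, free_only=True):
    """Build an archive search URL (hypothetical helper, not a Google API).

    Parameter names are copied from the example URL; their meanings are
    assumptions inferred from that one example.
    """
    params = {"q": query}
    if free_only:
        # Assumption: "p1" is the value the 'no price' option produces.
        params["as_price"] = "p1"
    if start_year is not None:
        # Assumption: lower bound of the date range.
        params["as_user_ldate"] = str(start_year)
    if end_year is not None:
        # Assumption: upper bound of the date range.
        params["as_user_hdate"] = str(end_year)
    # urlencode percent-escapes quotes and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_archive_search_url("rosenbergs", 1950, 1959))
```

Pass free_only=False if you want the paid results mixed back in; leave the years out to search the whole archive.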
While I think this will be a great resource, as long as it's free, I'm afraid it will lead to further thinking like this:
"Burn down the library. C'mon, all the books in the world are already digitized. Burn the thing down. Change it into a gathering place, a digital commons. Stop air conditioning the books. Enough already. None of us has the Alexandria Library. Michigan, Stanford, Oxford, Indiana. Those guys have digitized their collections. What have you got that they haven't got? Why are you buying a new book? Buy digital. Enough."
--Adrian Sannier, chief technology officer at Arizona State University, in his keynote speech, "A New American University for Next-Gen Learners," Campus Technology 2008 conference, July 29, Boston.
When it's not true. Not everything has been digitized, and with thinking like this, valuable research information will be lost.
The AP picks up the investigation into the alleged manslaughter of former RFK aide Mary Jo Kopechne.
Israel and Egypt duel over the Suez.
Sammy Davis Jr. visits the "wailing wall"
The $84.00 mohair suit still isn't worth 84 bucks in 2008 but is quite retro.
Gather enough newspapers from all around the country and pretty much anything you find will be almost as reliable as finding something written by a random blogger on the web.
Historically newspapers were like blogs: they didn't have journalists as paid employees in the modern sense, they had correspondents (http://en.wikipedia.org/wiki/Correspondent) that would send letters, telegrams and military dispatches for publication.
The oldest newspapers are more like one long letters page.
"By the way I was a member of the Socialist Worker Student Society when I was a student because I was trying to impress a girl."
Boys do stupid things to impress girls sometimes. Considering who you were hanging out with, perhaps a simple compliment would have won her heart... something like "Wow, your legs are hairier than mine!"
Life is hard, and the world is cruel
"I welcome this news. For too long, research on the Internet has been a frustrating task."
Right, but even though Google Books is doing the public a service, to me, doing the newspaper archives is a much bigger service in terms of research, because most people don't like reading entire books on a computer screen... they like to relax and kick back with a physical, paper book, that gives a sense of tactile satisfaction as well as mental stimulation. Reading books is a totally different sensory experience than short reading.
But... newspapers are better for research uses than books in many cases, and this is where Google's scanning project is truly helpful. Newspaper story research is all about information in short bursts... perfect for the Internet. There are untold millions of pages of newspaper stories out there, and not only are they great sources of information, they're a visit back in time... look at the ads, the language, the tone of conversation. Reading a newspage from 1948 is instantly different than reading one from 1978, even if there's no date on the page. A hobby of mine is buying old books (pre-1950), especially textbooks, and one I recently acquired had an article clipping from a Minnesota newspaper from 1944 in it... the article was on the new modern miracle of ambulance airplanes, and how they were drastically cutting down on the casualty rate in the War. On the other side were ads for local entertainment venues and auto service. Fun and fascinating stuff.
The biggest benefit here? The research and historical value. I hope the newspaper companies go along with this. The addition to knowledge and history from this is incalculable.
Life is hard, and the world is cruel
I hope the rights to digitization are non-exclusive.
For one thing, Google Books is so appallingly badly done that it can't be used even for OCR: it's rare to have a whole book without a few missing pages, folded pages, badly under- or over-exposed pages etc.
For another, the resolution is too low. For the posted newspaper spread, look at the ad near the bottom left, Today at your neighborhood theater, and notice that you can't really read it. Given that this was a famous story chosen for a major press release, you'd think they'd take care. If this is their best, we can expect most other issues to get much worse treatment. Pages 18 and 19 are sideways, was this a mistake? Seems so.
If they are doing this for posterity they need to do it well. It might never be done again.
What thought have they given to copyrights for adverts, and to privacy... if you ever posted a classified ad your 'phone number is about to be made public; if you were ever wrongly accused of a crime, it may re-surface... this is different than news posted on the Web today, because no-one thought of newspapers as easily-accessible public archives in the same way as they do for Web pages. Old newspapers were often archived at libraries, sometimes microfilmed, but not readily available.
There should also be a mechanism, where possible, for people to make transcriptions (distributed proofreaders comes to mind as a possible model) so that the newspapers can be indexed and made accessible to people who are not sighted, or who can't read 6-point type :D or have low bandwidth.
So it's an interesting start, but no, please don't do this without some hard thinking by someone who isn't just a marketing executive. There's a whole lot of things that don't appear to have been considered and should be considered on such a project.
Once they are considered, there's a lot of fabulous stuff to be uncovered (which is why I have a Web site for scanned images and texts, of course, but at least I scan at as close to archival standards as possible, as such standards evolve).
Live barefoot!
free engravings/woodcuts
Try this one: http://olddisasters.blogspot.com/2007/01/indiana-train-wrecks.html/
Impetuous! Homeric!
More germane would be President Bush and Vice President Cheney's drunk driving convictions. I'd say something that could result in people getting killed is a lot more serious than streaking.
Well, Ted Kennedy actually killed somebody and it doesn't seem to have kept him out of politics.
The electorate doesn't seem to care unless you're buggering somebody (aside from simply "America" generally); that they'll break out the pitchforks for.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
When I was about thirteen I went on a two-day peace march. Something like ten miles a day along hilly Vermont roads. At the start of the second day John Kenneth Galbraith gave a speech before we set out. I, being me, didn't really care who he was nor did I like what he was saying, so I just sat where I was, right under the podium zoning out. When we got back to camp after the march I was handed the paper by various people who insisted that the kid vaguely visible in the picture of his speech was me.
Well, now, twenty-eight years later or so, I would love to have that pic if it is what I remember. Would I be willing to go to Vermont and dig through archives for hours finding it? No. But, hell, if it's on Google, I'm up for looking now and again until it turns up. I suspect that there are millions of us with such things that we will now look for, now that the barriers to entry have so decreased.
The more important issue, afaic, is what will happen with all of the major political events that have been "disappeared" from our collective memory with disinformation now that original accounts will need to either selectively not be available or far more expensively be suppressed? My guess is that, for example, stories about General Motors streetcar fraud will slip through and that within a year or so any number of big political issues will start to be seen differently by the chattering classes.
It's all about the information. And what we do with it.
Indiana law forbids release of adoption records, even that old.
Ask to have the law changed. Seriously. Legislators live for this sort of thing.
I meant to write "NewsBank" where I wrote "NewsWire". I don't know where "NewsWire" came from ...
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Meta-moderated censorship elections?
Neither the article nor the blog post contains much information about what exactly they're doing. Does "digitize" just mean "produce digital images", or are they going to OCR the images so that the text will be searchable and copyable? Obviously they're indexing them somehow, but whether it is full text indexing or not they don't say.
This must be made illegal at once. It could destroy the Palin candidacy!
it says "Sen. Kennedy to face charges in death" - see, you can run but you cannot hide...
Ask Me About... The 80's!
I'm thinking of all those movies that show someone staring at a computer screen which has a picture of an old newspaper on it... That looked so out-of-date every time I saw it. And now it appears that was actually the future. Sci-fi coolness. Oh, well. :)
Leonid Mamtchenkov