I think the real question is whether Google could single-handedly eliminate Netscape 4.7 by switching. They'd be the savior of webmasters... er... more so.
Er... obviously I didn't read your comment in full. Apologies.
I disagree that it's just "a throw away title." Titles set your reading mindset. They have a big influence on how you read the article. If you read the article at all.
I think you're right on the rest of it, though. The Post's writer does an unusually good job of pointing out there is no causative explanation. Though, still, the quotes he chooses use scientific authority to speak about possible causative agents, which, I think, is a bit suspect.
If I could tattoo one thing on everyone's head, it'd be: "Correlation does not equal causation!"
This study does not mean that if you sleep less, you will live longer. A correlation has been found, that's all. Maybe people who sleep less have better circulation, also linked with long life. Or maybe the space aliens who shorten life spend two hours a night doing it.
Point is, we don't know what's causing this effect, at least not from this article.
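For anyone who wants to see how that can happen, here's a toy simulation. Everything in it is made up for illustration: the "circulation" variable, the coefficients, and the noise levels are assumptions, not numbers from the study or the article. Sleep has no effect on lifespan in this model, yet the two still come out correlated because circulation drives both.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, seed=0):
    """Synthetic people whose sleep and lifespan both depend on circulation.

    All coefficients are invented; sleep never feeds into lifespan.
    Returns (raw correlation, correlation with circulation's effect removed).
    """
    rng = random.Random(seed)
    sleep, life, sleep_rest, life_rest = [], [], [], []
    for _ in range(n):
        circulation = rng.gauss(0, 1)      # hidden common cause
        sleep_noise = rng.gauss(0, 0.5)    # everything else affecting sleep
        life_noise = rng.gauss(0, 3)       # everything else affecting lifespan
        sleep.append(8 - 0.5 * circulation + sleep_noise)
        life.append(75 + 3 * circulation + life_noise)
        sleep_rest.append(sleep_noise)
        life_rest.append(life_noise)
    return pearson(sleep, life), pearson(sleep_rest, life_rest)


if __name__ == "__main__":
    raw, adjusted = simulate()
    print(f"sleep vs lifespan, raw: {raw:.2f}")
    print(f"sleep vs lifespan, circulation accounted for: {adjusted:.2f}")
```

The raw correlation should come out clearly negative (less sleep, longer life), and it should all but vanish once circulation is accounted for. It proves nothing about the real study, of course; it only shows that a correlation like this can exist with zero causation behind it.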
No one has to "steal" anything to be sued for patent infringement.
In the US, patents grant a 20-year monopoly to the person who can prove they thought of something first*. Say you and I don't know each other, I invent something, and you invent it the next day. I then get a patent. By law, I can sue you for patent infringement if you're still using your invention. I can sue you, even though you were just as inventive as me. All that matters is that I was a day earlier.
There's usually no pejorative undercurrent to patent infringement cases, actually. They're usually just cold-blooded strategic marketing in action.
----
* Yeah, yeah, this is simplified -- in the US, there are also tons of rules about whether you kept it s3kr1t enough, whether someone else can prove they did it first, etc etc.
Save money on OCR by sacrificing quality, on "From Paper To PDF?" · Score: 5
Ahh... mass-OCR cost-effectiveness... it takes me back...
I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.
On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.
The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.
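To put rough numbers on that, here's a back-of-the-envelope sketch. The fixed cost and the hourly wage are pure assumptions picked for illustration; the 60 seconds a page is the full-correction figure I give below. The function name and parameters are hypothetical too.

```python
def cost_per_page(pages, fixed_cost, seconds_per_page, hourly_wage):
    """Split the per-page cost into an amortized fixed share and operator labor.

    fixed_cost covers engine licenses plus integration work (assumed figure).
    Returns (fixed share per page, labor per page) in the same currency.
    """
    fixed_share = fixed_cost / pages
    labor = seconds_per_page / 3600 * hourly_wage
    return fixed_share, labor


if __name__ == "__main__":
    # Assumed: 5000 in fixed costs, operators paid 15 an hour, 60 s per page.
    for pages in (1_000, 100_000, 1_000_000):
        fixed_share, labor = cost_per_page(pages, 5000.0, 60, 15.0)
        print(f"{pages:>9} pages: fixed {fixed_share:.4f}/page, labor {labor:.4f}/page")
```

The fixed share shrinks as the page count grows, while the labor share per page stays flat, which is why the paychecks end up dominating on big jobs.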
Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engine-human systems:
no correction: just let 'er run. You can get it fully automated this way, but the quality is crap.
zoning only: The OCR engines just suck at text with multiple columns, inserts, and tables. You can get people to correct the engine's zoning at a clip of around 5 seconds a page, 10 seconds if you require them to put in tokens representing the excised images.
spelling correction: Typically, most people object to the spelling mistakes OCR introduces. With good quality text an operator can correct them at around 20-30 seconds a page.
formatting correction: OCR engines can really mess up indentation and text flow. Unfortunately this is the most time-consuming problem to fix, anywhere from 30 seconds to a couple of minutes per page.
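Here's a quick sketch of what those per-page times add up to over a whole job. The seconds come from the levels above, with "a couple of minutes" read as 120 seconds (my assumption); the 100,000-page job size and the function names are made up for illustration.

```python
# Per-page operator time in seconds (low, high) for each correction level.
# The 120 s upper bound for formatting is an assumed reading of
# "a couple of minutes".
LEVELS = {
    "no correction": (0, 0),
    "zoning only": (5, 10),
    "spelling correction": (20, 30),
    "formatting correction": (30, 120),
}


def operator_hours(pages, low_seconds, high_seconds):
    """Return (low, high) operator-hours needed for the given page count."""
    return pages * low_seconds / 3600, pages * high_seconds / 3600


def report(pages):
    """One summary line per correction level, in the order listed above."""
    lines = []
    for name, (low, high) in LEVELS.items():
        low_hours, high_hours = operator_hours(pages, low, high)
        lines.append(f"{name}: {low_hours:.0f} to {high_hours:.0f} operator-hours")
    return lines


if __name__ == "__main__":
    for line in report(100_000):  # hypothetical job size
        print(line)
```

Multiply the hours by whatever you pay your operators and you can see what each step up in quality costs.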
Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.
So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.
You seem to have three hands.
It's called Netscape 4.7.
I'm responding to timothy, who titled the article "Sleep Less, Live Longer."
I believe Yahoo Mail also uses Python. I wouldn't be surprised to hear that other Yahoo sites use it.
Same basic issue: someone insisting upon being paid for using their intellectual property, and a bunch of people who weren't paying for it.