Xerox Photocopiers Randomly Alter Numbers, Says German Researcher
First time accepted submitter sal_park writes "According to a report from German computer scientist D. Kriesel, some Xerox WorkCentre copiers and scanners may alter numbers that appear in scanned documents. Having analyzed the output of two such devices, the Xerox WorkCentre 7535 and 7556, Kriesel found that "patches of the pixel data are randomly replaced in a very subtle and dangerous way": in particular, some numbers appearing in a document may be replaced by other numbers when it is scanned."
So, it has come to this.
You're a temporary arrangement of matter sliding towards oblivion in a cold, uncaring universe
Kriesel found that "patches of the pixel data are randomly replaced in a very subtle and dangerous way"
Slashdot users are advised not to use Xerox copiers for submissions.
Some of these machines have been used for digitizing documents whose originals were later shredded, so some people now have subtly wrong "original" digital copies. It's particularly problematic because of the nature of the degradation: the usual lossy degradation of images is non-semantic and just produces blurring, blocking, or other kinds of artifacts, not OCR-error-style mistakes.
The issue here seems to be the lossy mode of JBIG2, which tries to find patches of the image that approximately match, and consolidates them. The idea seems to be that if the letter "e" appears 5000 times in a document in the same typeface, you just store some version of it once, and then reference it everywhere it appears. But now you get OCR-style errors, if you end up matching some patches to incorrect partners. You have your lightly printed "8" replaced by the "0" patch now and then, that kind of thing. And unlike people doing OCR, who know they need to take this into account, the operators of these machines likely had no idea this was even a possible failure mode to watch for, so who knows how many numbers are wrong in miscellaneous documents (letters are a little less problematic, because most random letter mutations don't destroy meaning).
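To make the failure mode concrete, here is a toy sketch of the "store each patch once, reference it everywhere" idea described above. It is not the real JBIG2 codec and not what Xerox ships; the 5x5 glyphs, the names `hamming`, `encode` and `decode`, and the first-match-under-a-threshold rule are all assumptions made up for illustration.

```python
# Toy model of lossy pattern matching. Hypothetical, NOT real JBIG2.
ZERO = (" ### ", "#   #", "#   #", "#   #", " ### ")
EIGHT = (" ### ", "#   #", " ### ", "#   #", " ### ")
# A lightly printed "8": the middle bar has lost two pixels.
FAINT_EIGHT = (" ### ", "#   #", "  #  ", "#   #", " ### ")


def hamming(a, b):
    """Count the pixels that differ between two equal-sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(patches, max_diff):
    """Store each patch once; reuse the first stored symbol within max_diff pixels."""
    dictionary, refs = [], []
    for patch in patches:
        for idx, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= max_diff:
                refs.append(idx)  # "close enough": reference the stored symbol
                break
        else:
            dictionary.append(patch)  # no match: store as a new symbol
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decode(dictionary, refs):
    """Rebuild the page by pasting the referenced symbols back in."""
    return [dictionary[i] for i in refs]


page = [ZERO, EIGHT, FAINT_EIGHT]

# Exact matching only: every distinct patch is kept, nothing changes.
lossless = decode(*encode(page, max_diff=0))

# Loose threshold: the faint "8" is 3 pixels away from the stored "0",
# so it gets replaced by a perfectly crisp "0".
lossy = decode(*encode(page, max_diff=3))

print("\n".join(lossy[2]))
```

With the loose threshold the third glyph comes back as a clean "0", which is exactly why the result looks trustworthy: the substituted patch is as sharp as everything else on the page.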
Blargh.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Caused by misconfigured JBIG2 compression. When pixel error rate is low enough, similar looking features get printed with the same subimage.
That's Xenu, not Xerox.
This is not smoothing, distortion or individual bad pixels. Entire image patches are copied incorrectly, essentially repeating a scanned section containing one number over another part of the image containing a different number.
Maybe you should read the article.
Liberty in your lifetime
Scanning an article without comprehension and you're complaining about your misinterpretation. Really?
Before anyone spreads wrong information: The problem is with the JBIG2 image compression algorithm used when scanning to PDF format. OCR has nothing to do with this. Also, TIFF format images are not affected as they don't use JBIG2.
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
These 'errors' are substantially worse than ordinary scanner suckitude or lossy-compression legovision: JBIG2's pixel-block matching creates the potential for a block containing one character to be mis-identified and replaced with a block containing a different character.
The replaced character will be exactly as legible as text elsewhere on the page, just entirely incorrect.
If it were just the scan quality being lousy, or somebody turning, say, JPEG compression up to the point of pain, mangled characters would be obviously mangled. Not as good as being legible; but the issue is obvious. In this case, the errors will look as good as the rest of the document.
Quote: "Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals"
Source: http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf
Hey, even photo copiers and faxes need freedom of speech.
Good is never enough, when you dream of being great!
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
Consumer level? This isn't a home, or even home-office, machine. It's sold on the website under the office section.
If you read the documentation from XEROX... it claims that on scanning it is a known problem that "Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals." page 107 from http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf
also on page 129 we have the following: "Quality / File Size
The Quality / File Size settings allow you to choose between scan image quality and file size. These settings allow you to deliver the highest quality or make smaller files. A small file size delivers slightly reduced image quality but is better when sharing the file over a network. A larger file size delivers improved image quality but requires more time when transmitting over the network. The options are:
Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals."
Huh?
I'm sorry. I understand those 6 words individually. But when you put them in that order, they don't make any sense.
Read? The? Article? You are not making any sense, man!
I lack the proper attention span to read the article. Let's make a deal: I quickly skim through it, and soon return here with another completely wrong conclusion. Be back in 30 seconds.
OMG, my Canon ImageRunners are doing the same thing! It must be a virus!
I'd better write up a research document on this and request some grant money.
The things you learn. I never knew before about JBIG2 and how scanners use it to repeat pieces of image. Seems to me that the JBIG2 parameters are tuned incorrectly in these scanners.
This was a decision by Xerox to get around ever being sued for copyright violations...
Seven puppies were harmed during the making of this post.
If you read the article you would see it's not a simple case of scan error where a "13" appears blurry and looks like "B". Whole numbers are changed: 21.11 --> 17.42. This would be a major issue on a construction drawing, for example. A beam 4m too short would be a problem. Even if it were caught, the engineer signing off might have to go through a whole audit process.
Well, there's spam egg sausage and spam, that's not got much spam in it.
Why do we need such aggressive compression algorithms, algorithms that can make the data WRONG, in this day and age when storage and memory are so incredibly cheap?
This is not 1987 when every byte was precious and 1MB of RAM cost a hundred bucks. There is NO EXCUSE for this these days; just use PNG or JPG compression; at least those don't freaking CHANGE THE DATA!!
I printed out the article in order to hang it on the wall above my office's Workcentre as a warning to coworkers. But apparently printing it fixed the problem, because the article headline became:
"Xerox scanners/photocopiers Scan Documents Flawlessly and are the Best in the Industry"
That's all I did, and I learned what they were talking about pretty quickly.
It's actually pretty insane. They had architectural diagrams that had the square meters for the rooms copy/pasted by the scanner into other rooms. For instance, here were the room sizes for the three rooms on the diagram as reported on the original diagram and various scans of it (I've marked incorrect values with an asterisk):
Original Diagram: 14.13m^2, 21.11m^2, 17.42m^2
Xerox WorkCentre 7535 scan: 14.13m^2, 14.13m^2*, 14.13m^2*
Xerox WorkCentre 7556 scan 1: 14.13m^2, 14.13m^2*, 14.13m^2*
Xerox WorkCentre 7556 scan 2: 17.42m^2*, 21.11m^2, 17.42m^2
Xerox WorkCentre 7556 scan 3: 14.13m^2, 14.13m^2*, 17.42m^2
They have images of this happening. It's just outright substituting blocks of text from one part of a scanned image into an entirely separate part. Not just mangling pixels or uniformly displacing each by a few mm, but outright moving them into a different part of the image that was similar, yet slightly different. Maybe it's some sort of optimization or compression gone wrong? I.e. They detected a block that appeared to be the same as a previous one, so assumed they were the same and only kept one copy of that data?
It's bizarre.
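A quick way to see that this looks like patch copying rather than ordinary noise: every wrong value in the scans listed above is itself one of the values printed elsewhere on the original. A small sketch that checks this, with the numbers copied from the list above (the `classify` helper and its labels are made up for illustration):

```python
# Room sizes from the original diagram and the four scans, in the order listed.
ORIGINAL = ("14.13", "21.11", "17.42")
SCANS = [
    ("14.13", "14.13", "14.13"),
    ("14.13", "14.13", "14.13"),
    ("17.42", "21.11", "17.42"),
    ("14.13", "14.13", "17.42"),
]


def classify(original, scan):
    """Label each scanned value as 'ok', 'copied' (wrong, but present
    elsewhere on the original) or 'garbled' (matches nothing on the original)."""
    labels = []
    for want, got in zip(original, scan):
        if got == want:
            labels.append("ok")
        elif got in original:
            labels.append("copied")
        else:
            labels.append("garbled")
    return labels


for scan in SCANS:
    print(scan, classify(ORIGINAL, scan))
```

None of the wrong values come out as "garbled"; each one is a clean copy of a number from another room, which fits the idea of one block being substituted for another.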
This is how people get shot, because the police are given the wrong address to raid a house. This is how people get foreclosed on because a few account numbers are switched.
Holy crap. That makes me never want to go near a copier again.
If telephones are outlawed, then only outlaws will have telephones.
You came up with the exact same conclusion as the author of the article you just read:
Hey now, there's no need to accuse me of reading the article just because I looked at the pictures.
I expect the bug is because it is trying to clean up the scanned image, trying to account for what it thinks is missing data.
14.13m^2, 21.11m^2, 17.42m^2
It sees 3 blocks of information that probably look roughly the same to the software once it accounts for errors. The number of pixels used in each is fairly close. I expect the scanner sees the three blocks, thinks they are the same, and tries to find the block that seems the sharpest and reproduces it over the other spots.
Scanning isn't pixel perfect, so you get a different match each time. So the image cleaning processor will probably try to clean the numbers differently on each scan.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
I work for Xerox. I specifically support these machines in a tier 3 capacity. I have not seen or heard a single case of this.
So does Francis Tse, and he's apparently heard of it.
My group handles calls from all of North America, and some South.
You might want to talk to somebody who handles calls from Western Europe - Germany, in particular.