Xerox Photocopiers Randomly Alter Numbers, Says German Researcher
First time accepted submitter sal_park writes "According to a report from German computer scientist D. Kriesel, some Xerox WorkCentre copiers and scanners may alter numbers that appear in scanned documents. Having analyzed the output of two such devices, the Xerox WorkCentre 7535 and 7556, Kriesel found that "patches of the pixel data are randomly replaced in a very subtle and dangerous way": in particular, some numbers appearing in a document may be replaced by other numbers when it is scanned."
So, it has come to this.
You're a temporary arrangement of matter sliding towards oblivion in a cold, uncaring universe
OOPS
Now, in a more subtle way.
Kriesel found that “patches of the pixel data are randomly replaced in a very subtle and dangerous way”
Slashdot users are advised not to use Xerox copiers for submissions.
Some of these machines have been used for digitizing documents whose originals were later shredded, so some people now have subtly wrong "original" digitals. It's particularly problematic because of the nature of the degradation: usual lossy image compression degrades in a non-semantic way, producing blurring, blocking, or other visible artifacts, not OCR-error-style mistakes.
The issue here seems to be the lossy mode of JBIG2, which tries to find patches of the image that approximately match, and consolidates them. The idea seems to be that if the letter "e" appears 5000 times in a document in the same typeface, you just store some version of it once, and then reference it everywhere it appears. But now you get OCR-style errors, if you end up matching some patches to incorrect partners. You have your lightly printed "8" replaced by the "0" patch now and then, that kind of thing. And unlike people doing OCR, who know they need to take this into account, the operators of these machines likely had no idea this was even a possible failure mode to watch for, so who knows how many numbers are wrong in miscellaneous documents (letters are a little less problematic, because most random letter mutations don't destroy meaning).
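To make the failure mode concrete, here is a toy sketch of the "store each patch once and reference it" idea. It is not the real JBIG2 codec: the 5x7 bitmaps, the Hamming-distance comparison, and the `max_diff` threshold are all assumptions made up for illustration.

```python
# Toy model of lossy patch matching. NOT the real JBIG2 algorithm:
# the glyph bitmaps, distance measure, and threshold are illustrative assumptions.
GLYPHS = {
    "6": [".###.", "#....", "#....", "####.", "#...#", "#...#", ".###."],
    "8": [".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###."],
    "0": [".###.", "#...#", "#...#", "#...#", "#...#", "#...#", ".###."],
}


def to_bits(art):
    """Flatten ASCII art into a tuple of booleans (True = black pixel)."""
    return tuple(ch == "#" for row in art for ch in row)


def hamming(a, b):
    """Count the pixels that differ between two equally sized patches."""
    return sum(x != y for x, y in zip(a, b))


def compress(patches, max_diff):
    """Store a patch only if no stored patch is within max_diff pixels of it."""
    dictionary, refs = [], []
    for patch in patches:
        for idx, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= max_diff:
                refs.append(idx)  # reuse the earlier, merely similar patch
                break
        else:
            dictionary.append(patch)
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decompress(dictionary, refs):
    """Rebuild the page from dictionary references."""
    return [dictionary[i] for i in refs]


def read_back(text, max_diff):
    """Compress and decompress a string of digits, then read the result."""
    patches = [to_bits(GLYPHS[ch]) for ch in text]
    restored = decompress(*compress(patches, max_diff))
    lookup = {to_bits(art): ch for ch, art in GLYPHS.items()}
    return "".join(lookup[p] for p in restored)


print(read_back("6860", max_diff=0))  # exact matching only
print(read_back("6860", max_diff=3))  # tolerant matching merges "8" into "6"
```

In this toy the "6" and "8" bitmaps differ by only three pixels, so a tolerance of three silently turns the 8 into a perfectly crisp 6, while the "0" is far enough away to survive.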
Blargh.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Caused by misconfigured JBIG2 compression. When pixel error rate is low enough, similar looking features get printed with the same subimage.
which kicks in when saving to PDF, and doesn't handle low image resolution very well?
"I don't know, therefore Aliens" Wafflebox1
That's Xenu, not Xerox.
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
when other brands don't do that, yes, yes we are.
Maybe you should read the article.
Liberty in your lifetime
Scanning an article without comprehension and you're complaining about your misinterpretation. Really?
Did you even read the blog post?
Before anyone spreads wrong information: The problem is with the JBIG2 image compression algorithm used when scanning to PDF format. OCR has nothing to do with this. Also, TIFF format images are not affected as they don't use JBIG2.
Tom Cruise is NOT a role model! Shame on you, Xerox!
you actually watch that shit? hahahahaha. no wonder you posted ac.
It's the first subtle warning of the machine awakening. ...It's coming...
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
These 'errors' are substantially worse than ordinary scanner suckitude or lossy-compression legovision: JBIG2's pixel-block matching creates the potential for a block containing one character to be mis-identified and replaced with a block containing a different character.
The replaced character will be exactly as legible as text elsewhere on the page, just entirely incorrect.
If it were just the scan quality being lousy, or somebody turning, say, JPEG compression up to the point of pain, mangled characters would be obviously mangled. Not as good as being legible; but the issue is obvious. In this case, the errors will look as good as the rest of the document.
Quote: "Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals"
Source: http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf
How could Xerox make copiers for this length of time and not have a proofreading algorithm that works with a super-resolution scan & no interpolation to "machine check" the final commercial copier as a way of quickly finding errors?
Internally, Xerox engineering had to know they were "correcting" pixels, rather than just "copying" them, so how did they verify their software?
NSA strikes again.
My guess is that since digitized copies of documents that were later destroyed are treated as originals, this will be used to cast uncertainty and doubt on information that would otherwise be essential to bringing accountability to large organizations that used these machines.
People: 0 , Big brother: 999999999999999999999999
Hey, even photo copiers and faxes need freedom of speech.
Good is never enough, when you dream of being great!
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
Consumer level? This isn't a home, or even home-office, machine. It's sold on the website under the office section.
This Xerox product was popular on Wall Street a few years ago, especially those dealing in mortgage-backed securities.
If you read the documentation from XEROX... it claims that on scanning it is a known problem that "Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals." page 107 from http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf
also on page 129 we have the following: "Quality / File Size
The Quality / File Size settings allow you to choose between scan image quality and file size. These settings allow you to deliver the highest quality or make smaller files. A small file size delivers slightly reduced image quality but is better when sharing the file over a network. A larger file size delivers improved image quality but requires more time when transmitting over the network. The options are:
Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals."
Huh?
I'm sorry. I understand those 6 words individually. But when you put them in that order, they don't make any sense.
Read? The? Article? You are not making any sense, man!
I lack the proper attention span to read the article. Let's make a deal: I quickly skim through it, and soon return here with another completely wrong conclusion. Be back in 30 seconds.
They probably have some parts made of wub fur. Those machines are more advanced than I thought!
Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
to RTFM
A $12,000 scanner/printer is "consumer level?"
"Ignorance more frequently begets confidence than does knowledge"
- Charles Darwin
I just spent ten minutes describing exactly how JBIG works here before noticing someone already realised what is happening and put it up on the page.
OMG, my Canon ImageRunners are doing the same thing! It must be a virus!
I'd better write up a research document on this and request some grant money.
This problem showed up a while ago, when Obama's birth certificate was released. Some doofus scanned it using some overblown Adobe product, which probably without asking, did OCR on it and added layers of gray OCR'ed text.
That set off a spitstorm of wingnuts posting smarmy YouTube videos where they showed how "intelligent" they were at "detecting" that the image was so, so, so "manipulated".
Quote: "Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals"
Source: http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf
Page 129 for those incapable of searching a PDF.
But, seriously dude, this is scientific research! You can't seriously expect the man to RTFM.
The things you learn. I never knew before about JBIG2 and how scanners use it to repeat pieces of image. Seems to me that the JBIG2 parameters are tuned incorrectly in these scanners.
This was a decision by Xerox to get around ever being sued for copyright violations...
Seven puppies were harmed during the making of this post.
It's just a bug in the NSA eavesdropping algorithm.
how a compression that may lead to documents altered in such a way (numbers replaced by other numbers) can be considered fit for use in a photocopier. This can lead to very real, expensive and even dangerous problems down the line.
Scanning 7pt text at 200dpi with consumer level scanner technology and you're complaining about scan errors. Really?
These 'errors' are substantially worse than ordinary scanner suckitude or lossy-compression legovision: JBIG2's pixel-block matching creates the potential for a block containing one character to be mis-identified and replaced with a block containing a different character.
The replaced character will be exactly as legible as text elsewhere on the page, just entirely incorrect.
If it were just the scan quality being lousy, or somebody turning, say, JPEG compression up to the point of pain, mangled characters would be obviously mangled. Not as good as being legible; but the issue is obvious. In this case, the errors will look as good as the rest of the document.
After actually looking at the images in TFA, it does seem like there is a problem with the way 6/8 and 4/7 are interpreted. However, you can't say that the results aren't quite noisy; I would look at a scan like that with a squinty eye and be super annoyed at the jerk who couldn't just procure the *original* electronic format. Just because the scanner "seems to do ok" on other equally tiny numbers doesn't make it right. Get the goddamn original file.
If you read the article you would see it's not a simple case of scan error where a "13" appears blurry and looks like "B". Whole numbers are changed: 21.11 --> 17.43. This would be a major issue on a construction drawing, for example. A beam 4m too short would be a problem. Even if it were caught, the engineer signing off might have to go through a whole audit process.
Well, there's spam egg sausage and spam, that's not got much spam in it.
A $12,000 scanner/printer is "consumer level?"
What are we? Farmers?
That is asking too much of him - maybe if he just looked at the pictures in the article?
You can never know everything, and part of what you do know will always be wrong. Perhaps even the most important part.
duh..
but it's really the embedded serial numbers in scanned and printed documents that are getting in the way.
That's Xenu, not Xerox.
Xenu... Xerox... Xenu-Rox?
An enigma, wrapped in a riddle, shrouded in bacon and cheese
There was a trend a while back for copiers and printers to put fingerprints on output to make police investigations easier and to prevent effective counterfeiting. I think it may still be happening industry wide.
JJ
Except Xenu doesn't. Crazy lunatics!
Why do we need such aggressive compression algorithms, algorithms that can make the data WRONG, in this day and age when storage and memory are so incredibly cheap?
This is not 1987 when every byte was precious and 1MB of RAM cost a hundred bucks. There is NO EXCUSE for this these days; just use PNG or JPG compression; at least those don't freaking CHANGE THE DATA!!
I printed out the article in order to hang it on the wall above my office's Workcentre as a warning to coworkers. But apparently printing it fixed the problem, because the article headline became:
"Xerox scanners/photocopiers Scan Documents Flawlessly and are the Best in the Industry"
Windows-1251 character codes do not belong on the internet. Many people nowadays don't use Windows.
For one thing, using Windows code pages does not require Windows. They are well-defined encodings of a subset of Unicode. If I were to apply the same etymological fallacy to your suggestion to stick to the American Standard Code for Information Interchange, it might look like this: "Many people nowadays don't live in America." A lot of languages don't easily map to just the Basic Latin block (U+0020 through U+007E). For example, in Spanish, "esta" means "this" while "está" means "is currently" or "is located".
I've explained this several times. Slashdot introduced a code point whitelist after past abuses of bidirectional override characters.
That's what he said before he scanned it on his WorkCentre.
AJ Henderson
You must be new here...
In theory, TIFF is a container format for any image codec that has a TIFF embedding defined. In practice, TIFF is a container format only for those codecs supported by common TIFF viewers. To use your analogy to AVI, when people see "AVI", they think of the codecs commonly used with an AVI container, such as MPEG-4 ASP video and MPEG-1 Layer III audio back in the DivX era. I could wrap the obscure codec of PlayStation 1 or Game Boy Advance FMV in an AVI or MKV container, but there'd be no use because next to nothing that supports such a container also supports those codecs. WAV is also a container, but over 9 times out of 10, the compression features aren't used.
Why would a copier do OCR + compression? To store and to transmit.
We knew already that copiers save all material to hard disk. We now know they OCR+compress. What better way to maximize disk usage and save transmission time to the Mother Ship (NSA Utah).
....AND that's why you don't use a lossy compression for your important text documents.
That's all I did, and I learned what they were talking about pretty quickly.
It's actually pretty insane. They had architectural diagrams that had the square meters for the rooms copy/pasted by the scanner into other rooms. For instance, here were the room sizes for the three rooms on the diagram as reported on the original diagram and various scans of it (incorrect values are marked with asterisks):
Original Diagram: 14.13m^2, 21.11m^2, 17.42m^2
Xerox WorkCentre 7535 scan: 14.13m^2, *14.13m^2*, *14.13m^2*
Xerox WorkCentre 7556 scan 1: 14.13m^2, *14.13m^2*, *14.13m^2*
Xerox WorkCentre 7556 scan 2: *17.42m^2*, 21.11m^2, 17.42m^2
Xerox WorkCentre 7556 scan 3: 14.13m^2, *14.13m^2*, 17.42m^2
They have images of this happening. It's just outright substituting blocks of text from one part of a scanned image into an entirely separate part. Not just mangling pixels or uniformly displacing each by a few mm, but outright moving them into a different part of the image that was similar, yet slightly different. Maybe it's some sort of optimization or compression gone wrong? I.e. They detected a block that appeared to be the same as a previous one, so assumed they were the same and only kept one copy of that data?
It's bizarre.
Especially for the checksum, which is required to be printed in bold 18 point text, and corresponds to the provided electronic format of the document.
What you see here is that the copiers have achieved the singularity and are now posting defensively. Guytoronto is a network-connected WorkCentre copier that is using JBIG in everything including its thought processes and is thus misinterpreting everything.
This is how people get shot, because the police are given the wrong address to raid a house. This is how people get foreclosed on because a few account numbers are switched.
Holy crap. That makes me never want to go near a copier again.
If telephones are outlawed, then only outlaws will have telephones.
They have images of this happening. It's just outright substituting blocks of text from one part of a scanned image into an entirely separate part. Not just mangling pixels or uniformly displacing each by a few mm, but outright moving them into a different part of the image that was similar, yet slightly different. Maybe it's some sort of optimization or compression gone wrong? I.e. They detected a block that appeared to be the same as a previous one, so assumed they were the same and only kept one copy of that data?
It's bizarre.
You came up with the exact same conclusion as the author of the article you just read:
Edit: It seems that the above thought was not that wrong at all. Several mails I got suggest that the xerox machines use JBIG2 for compression. This algorithm creates a dictionary of image patches it finds “similar”. Those patches then get reused instead of the original image data, as long as the error generated by them is not “too high”. Makes sense.
You came up with the exact same conclusion as the author of the article you just read:
Hey now, there's no need to accuse me of reading the article just because I looked at the pictures.
What are we? Farmers?
Bum, badum, bum bum bum bum.
I expect the bug is because it is trying to clean up the scanned image, trying to account for what it thinks is missing data.
14.13m^2, 21.11m^2, 17.42m^2
It sees three blocks of information that probably look roughly the same to the software once it allows for errors. The number of pixels used in each is fairly close. I expect the scanner sees the three blocks, thinks they are the same, finds the block that seems sharpest, and reproduces it over the other spots.
Scanning isn't pixel-perfect, so you get a different match on each scan, and the image cleaning processor will probably clean the numbers differently each time.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Perhaps, their traffic cam software should be checked for "substitution errors."
We just completed some testing using a 7535 and got the same 8/6 mixing issue.
We scanned in the originals and then shredded them. This is now the official copy!
-- ssoorrrryy,, dduupplleexx sswwiittcchh oonn.. -Quote found on actual fortune cookie.
Anyone who hasn't RTFA really ought to at least look at the example. This is not only a case of a blurry 6 being replaced with a blurry 8, which would be bad enough. If the surrounding context matches, it will replace numbers with completely different text! In the first example given, the number 14.13 is scanned in one place, and used to replace the numbers 21.11 and 17.42 in two other places. In all cases, the numbers are perfectly legible.
In what world is this acceptable? To actually document this on page 129 of the handbook (that almost no one will ever read) and deliver the product - insane!
Enjoy life! This is not a dress rehearsal.
Then again, "better" is a subjective statement based on the perception and criteria of the observer.
However YOU propose it as some sort of Universal Truth.
Either you're a frigging moron or you think that you are God.
http://realbusinessatxerox.blogs.xerox.com/2013/08/06/always-listening-to-our-customers-clarification-on-scanning-issue/?CMP=SMO-EXT#.UgEhdRgk98F
By Francis Tse, principal engineer, Xerox
Recently there have been articles about Xerox devices randomly altering numbers in scanned documents. We take this issue very seriously.
The problem stems from a combination of compression level and resolution setting. The devices mentioned are shipped from the factory with a compression level and resolution that produces scanned files which are optimized for viewing or printing while maintaining a reasonable file size. We do not normally see a character substitution issue with the factory default settings however, the defect may be seen at lower quality and resolution settings.
The Xerox design utilizes the recognized industry standard JBIG2 compressor which creates extremely small file sizes with good image quality, but with inherent tradeoffs under low resolution and quality settings.
For data integrity purposes, we recommend the use of the factory defaults with a quality level set to “higher.” In cases where lower quality/higher compression is desired for smaller file sizes, we provide the following message to our customers next to the quality settings within the device web user interface: “The normal quality option produces small file sizes by using advanced compression techniques. Image quality is generally acceptable, however, text quality degradation and character substitution errors may occur with some originals.”
Xerox is totally committed to customer satisfaction and with this feedback we will look for ways to help our customers better manage their scanning application needs.
For more information, contact Xerox Support at http://www.xerox.com/perl-bin/world_contact.pl#0.
That's exactly what it did: that's the JBIG2 compression algorithm. Why on earth would a scanner be using a compression algorithm? Memory is cheap, do a pixel scan and send me that, let me deal with compressing it, if I want to.
The cesspool just got a check and balance.
If you break apart data into chunks, hash each chunk to a hash smaller than a chunk, make an incorrect assumption about lack of collisions and then try to reconstruct, this is what you'd come across.
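The analogy can be sketched in a few lines. This is only an illustration of chunk deduplication under a "no collisions" assumption, not what the copier firmware does; the byte-sum checksum and the chunk values are assumptions chosen so that a collision actually occurs.

```python
import hashlib

# Illustrative chunks; "14.13" and "13.14" contain the same bytes in a
# different order, so a byte-sum checksum cannot tell them apart.
CHUNKS = [b"14.13", b"21.11", b"13.14"]


def weak_hash(chunk):
    """Deliberately weak 8-bit checksum (assumption for the demo)."""
    return sum(chunk) % 256


def strong_hash(chunk):
    """SHA-256 digest, for comparison."""
    return hashlib.sha256(chunk).hexdigest()


def dedupe_roundtrip(chunks, hash_fn):
    """Store one chunk per hash value, assuming no collisions, then rebuild."""
    store, refs = {}, []
    for chunk in chunks:
        key = hash_fn(chunk)
        store.setdefault(key, chunk)  # the first chunk seen with this key wins
        refs.append(key)
    return [store[key] for key in refs]


print(dedupe_roundtrip(CHUNKS, weak_hash))    # third chunk comes back as b"14.13"
print(dedupe_roundtrip(CHUNKS, strong_hash))  # round-trips unchanged
```

With the weak checksum the third chunk is silently replaced by the first, which is the same shape of error as the scans: valid-looking data in the wrong place.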
John_Chalisque
I work for Xerox. I specifically support these machines in a tier 3 capacity. I have not seen or heard a single case of this. My group handles calls from all of North America, and some South.
How do you think I feel?
I work for an engineering company and we've got Xerox workstations, so this basically has "long day" written all over it.
It would be great if they had a few test sheets we could run.
---
ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
See They're photocopies! You don't need to proofread each one!
Unbelievable software incompetence. Not only did they do this, but they knew about it and documented it!
Are all wrong. What a convenient excuse for the liars in government to put out ridiculously wrong numbers. "Who could have known?" There's no inflation, right? Unemployment (if you count part time jobs designed to eliminate the need for obamacare)...and so on. In the examples this seems likely to print lower numbers...How handy for the liars to have an excuse for it.
Why guess when you can know? Measure!
I have some test sheets that should do the job for you.
I'll scan them and send you the images.
-- I have monkeys in my pants.
In particular, there's no excuse for using < 300 dpi when using bilevel. Bilevel documents at 300dpi are 100kb or less when using reasonable lossless compression (lossless jb2/ jbig2, CCITT Group 4 TIFF, or even just PNG).
Using lossy jb2/jbig2 like these copiers were doing is at most going to save you a couple dozen kb per page. Not worth the problems in many cases.
(doing this reply again since I forgot slashdot eats less than signs unless you use html entities. Man, what an anachronism.)
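For scale, the uncompressed size of a bilevel page is simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) at 1 bit per pixel; the page size is an assumption, and the compressed figures quoted above depend on page content and are not computed here.

```python
import math


def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px / 8)


# US Letter is an assumed page size, used only for illustration.
for dpi in (200, 300):
    size = raw_bilevel_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size} bytes ({size / 1024:.0f} KiB) uncompressed")
```

A 300 dpi Letter page is about 1 MiB raw, so a lossless result near 100 KB is already roughly a tenfold reduction before any lossy matching is involved.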
Zalgo and page-widening trolls, beeotches.
Because Unicode validation & sanity checking is harrrrdddddd...
It's probably a setting. Scan to TIFF is generally turned off because TIFFs are big. The work around he found was to change the coarse/fine setting.
Given the included images, nobody should be using the copies for official work. The ones that weren't changed were mostly unreadable (or at least partially ambiguous). Just print 3 originals, rather than one original and making three copies of it.
And despite the use of the word "random" it didn't appear random at all, and the author even said so himself.
Learn to love Alaska
Well, that's the thing with the "all rolled up into one" solutions. Those scanners scan/mail/archive/etc... all by themselves, without further user intervention and without the need for an additional computer attached.
The big foul-up is that they use JBIG2, not a more sensible compression algorithm like LZW or JPEG, where "too small to read" stuff really stays "too small to read" in the scan instead of being "improved" to something else. The foul-up would have been exactly the same if someone later in the tool chain had used JBIG2.
They probably had a test run by a marketing drone that found that JBIG2 "looks so much clearer" ;-P
Something like this shouldn't have passed QA.. did we outsource or what?
Have you fscked your local propeller head today?
Not only that, this is probably a 6-10 thousand dollar machine (Depending on options).
It isn't some home office multifunction.
I once worked at a place that had a printer that intermittently flipped characters. It was difficult to solve because it was so intermittent. It couldn't be recreated in the lab to test without blowing thru thousands of sheets of paper, and that wasn't enough to isolate the problem or prove to the supplier. It drove everybody crazy.
Rumor has it that a technician secretly sabotaged it by juicing it with too much voltage so that the whole thing had to be replaced after a "mysterious failure". Sometimes you welcome dishonesty.
Table-ized A.I.
I haven't even read this article and I know the culprit exactly: JBIG2.
The compression algorithm operates on binary (2 color) images and has two modes, a lossless mode which is sort of like the love child of RLE and JPEG and a higher compression mode which operates by running the lossless blocks through a comparison routine and discarding and replacing any blocks that are sufficiently similar with references to the first copy. It's actually a good algorithm, but you have to understand how it works to implement it properly. When you have a perfect storm of certain fonts (especially small ones where a glyph can fit perfectly inside a block), have some noise in the bitonal images and have the compression threshold too high you can get some real zingers.. 9, 6, 0, 3, and 8 can all easily get muddled up, not to mention what happens to letters like e o c etc. The key to the whole thing is having good algorithms that can produce quality bitonal images from poor originals and scanning at sufficient resolution (or lowering the compression threshold enough) that blocks cannot hold an entire glyph.
Why the copier is using the lossy mode of JBIG2 internally is a mystery, especially in the "copy" pipeline. I can think of no good reason that it should use anything other than the lossless mode or uncompressed data.
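The point about resolution can be put in numbers. A typographic point is 1/72 inch, so the sketch below converts a font size to its em height in scan pixels. Actual digits are shorter than the full em, and the block size the copier uses is not stated in this thread, so treat this only as a rough sense of how few pixels a small glyph gets.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def em_height_px(point_size, dpi):
    """Em height of a font in scanned pixels at the given resolution."""
    return point_size / POINTS_PER_INCH * dpi


for dpi in (200, 300, 600):
    print(f"7pt at {dpi} dpi: about {em_height_px(7, dpi):.1f} px tall")
```

Going from 200 to 300 dpi gives each glyph 2.25 times as many pixels, which leaves more room for a 6 and an 8 to differ by more than the matching tolerance.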
You came up with the exact same conclusion as the author of the article you just read:
Actually, his conclusion *was* very different, but Slashdot's malfunctioning compression algorithm didn't realise this and inadvertently replaced it with a duplicate of the original instead.
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
Any chance of a public link? Xerox here too, and I'd put short odds that more than a few slashdotters are affected.
*facepalm* Aargh, I just saw what you did there. +1 Internets to you, sir/madam/other.
There's a pre-error PNG of the drawing sample and a pre-error TIF of the number table sample they used in the original article. Perhaps try scanning printouts of them, it appears to be how some of the readers are reproducing the error?
The blog's also had a few updates, indicating affected models known so far and a possible workaround.
Well I'm glad my business uses another printer brand. I don't entirely understand the cause of the problem that's described in the original article but I can't believe it took this long for this bug to be found.
Now the question becomes: what moron made this setting the default? Maybe a setting that can undetectably corrupt your data can be provided if appropriate warnings are given, but it sure as hell should never be the default. I would've thought that was obvious.
The guy who came up with this posted several updates to his blog.
1. The setting is not the default.
2. There is a warning when you change the settings in the web frontend.
3. Xerox's support staff was not aware of this problem and could not come up with a solution.
OS Reviews: Free and Open Source Software
This is EXACTLY why I take the disk space premium and only archive stuff in lossless formats (unless the original was encoded in a lossy format: I'm looking at you, DV/h264 camcorder formats!)
I made the mistake years ago of ogg-encoding a bunch of audio files instead of flac. While I didn't shred the originals, as soon as I started using quality speakers I noticed the terrible quality compared to the source CDs. Needless to say I went to the trouble of re-ripping them once I had enough hard disk space (Thus allowing me to avoid further wear to the discs due to placing them into cd players.)
well that engineer needs to compare original and copy
that's part of his job
And how far back are we talking ?
Somebody would have noticed, surely...
Unless, of course, the resulting errors were errors people accepted/wanted.
Is the system using JPG compression ? We have all seen boxes of pixels move around when the DVD player gets confused.
"There is no god but allah" - well, they got it half right.
I would mod this "Funny" but my auto-correct substituted "F***ing genius"
There is no use case for compression gain over semantic fidelity.
Period.
Please make sure you've understood my comment before attempting a rebuttal. This is very much a case of "just because you can, doesn't mean you should".
Does no one remember the previous revelation that Xerox color printers were printing "serial number" coded "not visible to the naked eye" codes on all color prints? This seems to have been secretly installed in the equipment for the convenience of some government agency.. (think Treasury, i.e. copying currency images)
How could a .jpg algorithm only substitute numbers for numbers - vs. random alphanumeric characters? unless the machine was converting the contents via OCR.. possibly to forward to? Most of those machines do now have internet connections..