Choosing Better-Quality JPEG Images With Software?
kpoole55 writes "I've been googling for an answer to a question and I'm not making much progress. The problem is image collections, and finding the better of near-duplicate images. There are many programs, free and costly, CLI or GUI oriented, for finding visually similar images — but I'm looking for a next step in the process. It's known that saving the same source image in JPEG format at different quality levels produces different images, the one at the lower quality having more JPEG artifacts. I've been trying to find a method to compare two visually similar JPEG images and select the one with the fewest JPEG artifacts (or the one with the most JPEG artifacts, either will serve.) I also suspect that this is going to be one of those 'Well, of course, how else would you do it? It's so simple.' moments."
Paste both images into your image editor of choice, one layer on top of the other, and apply a difference/subtraction filter.
it is lossy compression, after all . . .
I suppose you could recompress both images as JPEG with various quality settings, then do a pixel-by-pixel comparison computing a difference measure between each of the two source images and its recompressed version. Presumably, the one with more JPEG artefacts to start with will be more similar to its compressed version, at a certain key level of compression. This relies on your compression program generating the same kind of artefacts as the one used to make the images, but I suppose that cjpeg with the default settings has a good chance of working.
Failing that, just take the larger (in bytes) of the two JPEG files...
-- Ed Avis ed@membled.com
How about Amazon's Mechanical Turk service?
https://www.mturk.com/
The ImageMagick package includes a command called identify, which can report an estimated JPEG quality for a file. You can use it like this:
identify -verbose creek.jpg | grep Quality
In my example, it gave " Quality: 94".
As far as I can tell, ImageMagick estimates this number from the quantization tables stored in the JPEG rather than reading it from EXIF data, so it should also work on files that carry no EXIF metadata (such as those from very old cameras, ca. 2002 or earlier). This is different info than you'd get by just comparing file sizes. The JPEG quality setting is not the only factor that can influence file size. File size can depend on resolution, JPEG quality, and other manipulations such as blurring or sharpening, adjusting brightness levels, etc.
Find free books.
To make a JPEG, you cut it into blocks, run the DCT on each block and mess with the 4:2:2 color formula and pkzip the pieces... That said, I would think measuring the number of blocks would be related to number of artifacts... In my barbaric approach to engineering, (assuming there is no other suggested way on slashdot), I would get the source code to the JPEG encoder/decoder and print out statistics (number of blocks, block size) of each image...
Run the DCT and check how much it's been quantized. The higher the greatest common factor, the more it has been compressed.
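A minimal sketch of the greatest-common-factor idea, assuming you already have the integer (dequantized) DCT coefficients for one frequency position gathered from many blocks. How you get those coefficients out of a decoder is left out, and `estimate_quant_step` is a hypothetical helper name, not part of any library.

```python
from functools import reduce
from math import gcd

def estimate_quant_step(coefficients):
    """Estimate the quantization step as the GCD of the nonzero coefficients.

    `coefficients` holds the dequantized integer values of ONE frequency
    position collected across many 8x8 blocks. Returns 0 if all are zero.
    """
    nonzero = [abs(int(c)) for c in coefficients if c != 0]
    return reduce(gcd, nonzero, 0)

# Coefficients that are all multiples of 12 suggest a step of 12.
print(estimate_quant_step([24, -36, 0, 12, 60]))  # 12
```

A larger estimated step for the same frequency position would suggest the heavier compression. A single stray coefficient can drag the GCD down to 1, so treat the result as a hint only.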
Alternatively, check the raw data file size.
Others have mentioned file size, but another good approach is to look at the quantization tables in the image as an overall quality factor. E.g., JPEG over RTP (RFC 2435) uses a quantization factor to represent the actual tables, and the value of 'Q' generally maps to quality of the image. Wikipedia's doc on JPEG has a less technical discussion of the topic, although the Q it uses is probably different from the example RFC.
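For anyone who wants to look at the quantization tables directly, here is a rough sketch that pulls them out of the raw file bytes using only the standard library. It assumes a well-formed JPEG; the function names are made up for illustration, and averaging the table entries is just one crude way to boil them down to a single number.

```python
import struct

# Markers that stand alone and carry no length field (TEM, SOI, EOI, RST0-7).
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))

def read_quant_tables(data):
    """Return {table_id: [64 values in zigzag order]} from JPEG bytes."""
    tables = {}
    pos = 0
    while pos + 4 <= len(data):
        if data[pos] != 0xFF or data[pos + 1] == 0xFF:
            pos += 1                      # not at a marker yet, or a fill byte
            continue
        marker = data[pos + 1]
        if marker in STANDALONE:
            pos += 2
            continue
        if marker == 0xDA:                # start of scan: headers are over
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xDB:                # DQT segment: one or more tables
            i = 0
            while i < len(segment):
                precision, table_id = segment[i] >> 4, segment[i] & 0x0F
                i += 1
                if precision == 0:        # 8-bit entries
                    values = list(segment[i:i + 64])
                    i += 64
                else:                     # 16-bit big-endian entries
                    if i + 128 > len(segment):
                        break
                    values = list(struct.unpack(">64H", segment[i:i + 128]))
                    i += 128
                tables[table_id] = values
        pos += 2 + length
    return tables

def mean_quant_step(tables):
    """Average quantizer step over all tables; larger means coarser."""
    values = [v for table in tables.values() for v in table]
    return sum(values) / len(values) if values else 0.0
```

You would feed it something like `open(path, "rb").read()` and compare `mean_quant_step` for the two files. Keep in mind the tables only describe the last save, so a low-quality image re-saved at high quality will still look good by this measure.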
Compute the root-mean-square difference between the original image and a gaussian-blurred version?
JPEG tends to soften details and reduce areas of sharp contrast, so the sharper result will probably be better quality. This is similar to the PSNR metric for image quality.
Bonus: very fast, and can be done by convolution, which optimizes very efficiently.
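A small sketch of the blur-and-compare idea, assuming the image is already available as a 2D list of grayscale values. A separable [1, 2, 1]/4 kernel stands in for a real Gaussian here, and both function names are hypothetical.

```python
import math

def blur_121(img):
    """Blur a 2D list of gray values with a separable [1, 2, 1]/4 kernel."""
    h, w = len(img), len(img[0])
    # Horizontal pass, clamping indices at the borders.
    tmp = [[(row[max(x - 1, 0)] + 2 * row[x] + row[min(x + 1, w - 1)]) / 4
            for x in range(w)] for row in img]
    # Vertical pass over the horizontally blurred rows.
    return [[(tmp[max(y - 1, 0)][x] + 2 * tmp[y][x]
              + tmp[min(y + 1, h - 1)][x]) / 4
             for x in range(w)] for y in range(h)]

def sharpness_rms(img):
    """RMS difference between an image and its blurred copy."""
    blurred = blur_121(img)
    h, w = len(img), len(img[0])
    total = sum((img[y][x] - blurred[y][x]) ** 2
                for y in range(h) for x in range(w))
    return math.sqrt(total / (h * w))
```

The copy with the higher `sharpness_rms` would be the presumed better one. Blocking artifacts are themselves sharp edges, so a heavily compressed image can fool this score.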
Just look at the manner in which JPEGs are encoded for your answer!
Take the DCT (discrete cosine transform) of blocks of pixels throughout the image. Examine the frequency content of each of these blocks and determine the amount of spatial frequency suppression. This will correlate with the quality factor used during compression!
Load up both images in Adobe After Effects or some other image compositing program and apply a "difference matte".
Any differences in pixel values between the two images will show up as black on a white background or vice versa...
adam
BOXXlabs
ThumbsPlus is an image management tool. It has a feature called "find similar" that should do what you want as far as identifying two pictures that are the same except for the compression level. Once the similar picture is found, you can use ThumbsPlus to look at the file sizes and see which one is bigger.
Oh sure, it starts out innocently enough - pick the better image. Next thing you know Skynet's decided that it's the better LIFE-FORM.
AI - JUST SAY NO!
Brought to you by the Coalition for Human Survival (C) Aug. 29, 1997
I mean, you don't want second rate pictures in your pr0n stash?
I had problems building it back then, let alone writing the scripts for it and the hassle of figuring out which images were duplicates, but this utility seems to fit the bill.
HOSAKA K., A new picture quality evaluation method.
Proc. International Picture Coding Symposium, Tokyo, Japan, 1986, 17-18.
More Noise = Less Compression
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
I wonder if out-of-focus or blur detection methods will give you a metric which varies with the level of JPEG artifacts. After all, the JPEG artifacts should make it more difficult to do things like edge detection, which are the same things that are made more difficult by blurry and out-of-focus images.
A Google search for blur detection should bring up things that you can try. Here is a series of posts that do a good job of explaining some of the work involved.
Assuming the only quality loss is due to JPEG compression, I guess a fourier transform should give you a hint: I think the worse quality image should have lower amplitude of high frequencies.
Of course, that criterion may be misleading if the image was otherwise modified. For example, noise filters will typically reduce high frequencies as well, but you'd generally consider the result superior (otherwise you wouldn't have applied the filter).
The Tao of math: The numbers you can count are not the real numbers.
Well, of course, how else would you do it? It's so simple.
You're right, it needs to be done by humans to be sure.
Amazon's Mechanical Turk should do the trick.
https://www.mturk.com/mturk/welcome
First, make a bumpmap of each image. Then, render them onto quads with a light at a 45 degree angle to the surface normal. Run a gaussian blur on each resulting image. Then run a quantize filter, followed by lens flare, solarize, and edge-detect. At this point, the answer will be clear: both images look horrible.
There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
I don't know about "quality", but frankly it shouldn't be too hard to compare similar images just by doing simple mathematical analysis on the results. I'm only vaguely familiar with image compression, but if a "worse" JPEG image is more blocky, would it be possible to run edge detection to find the most clearly defined blocks, which would indicate that a particular picture has "worse" results? That's just one idea; I'm sure people who know the compression better can name many other properties that could easily be measured automatically.
What a computer can't do is tell you if the image is subjectively worse, unless the same metric that the human uses to subjectively judge a picture happens to match the algorithm the computer is using, and even then it could vary from picture to picture. For example, a highly colorful picture might hide the artifacting much better than a picture that features lots of text. While the "blockiness" would be the same mathematically, the human viewing it will notice the artifacts in the text much more.
AntiFA: An abbreviation for Anti First Amendment.
For what it's worth: I remember using Paint Shop Pro 9 a few years ago. It has a function called "Removal of JPEG artifacts" (or similar). I remember being surprised how well it worked. I also remember that PSP has quite good functionality for batch processing. So what you could do is use the "remove artifact" function and look at the difference before/after this function. The image with the bigger difference has to be the one of lower quality.
I am not sure if there is a tool that automatically calculates the difference between two images, but this is a task simple enough to be coded in a few lines (given the right libraries are at hand). For each color channel (RGB) of each pixel, you basically just calculate the square of the difference between the two images. Then you add all these numbers up (all pixels, all color channels). The bigger this number is, the bigger the difference between the images.
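The sum-of-squares calculation described above really is only a few lines. A minimal sketch, assuming both images have the same dimensions and have already been loaded as sequences of (R, G, B) tuples in the same pixel order; the loading step and the function name are not from any particular tool.

```python
def sum_squared_difference(pixels_a, pixels_b):
    """Sum of squared per-channel differences between two equal-size images.

    Each argument is a sequence of (R, G, B) tuples in the same pixel order.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    return sum((ca - cb) ** 2
               for pa, pb in zip(pixels_a, pixels_b)
               for ca, cb in zip(pa, pb))
```

For the artifact-removal trick you would call this once per candidate, comparing each image against its own cleaned-up version, and treat the larger total as the lower-quality file.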
Maybe not your push-one-button solution, but should be doable. Just my $0.02.
And to reply to myself.. several other posters have noted that taking the DCT of the compression blocks in the image will give information on how highly compressed the image is... there's one example.
I would very much like to do the same with audio. I have so many duplicate tracks in my music collection in different formats and bitrates.
Warning: The Surgeon General Has Determined that Sigs are Dangerous to Your Health
JPEG works by breaking the image into 8x8 blocks and doing a two dimensional discrete cosine transform on each of the color planes for each block. At this point, no information is lost (except possibly by some slight inaccuracies converting from RGB to YUV as is used in JPEG). The step where the artifacts are introduced is in quantizing the coefficients. High frequency coefficients are considered less important and are quantized more than low frequency coefficients. The level of quantization is raised across the board to increase the level of compression.
Now, how is this useful? The reason heavily quantizing results in higher compression is because the coefficients get smaller. In fact, many become zero, which is particularly good for compression - and the high frequency coefficients in particular tend towards zero. So partially decode the images and look at the DCT coefficients. The image with more high frequency coefficients which are zero is likely the lower quality one.
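Here is a rough approximation of that check. Instead of partially decoding the file to read the stored coefficients, this sketch recomputes the DCT from the decoded pixels of one 8x8 block, which is not quite the same thing. The `cutoff` and `eps` values are arbitrary assumptions, and the function names are hypothetical.

```python
import math

N = 8

def dct2_8x8(block):
    """2-D DCT-II of an 8x8 block (list of 8 rows of 8 gray values)."""
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def zero_high_freq_fraction(block, cutoff=6, eps=0.5):
    """Fraction of coefficients with u + v >= cutoff that are nearly zero."""
    coeffs = dct2_8x8(block)
    high = [coeffs[u][v] for u in range(N) for v in range(N)
            if u + v >= cutoff]
    return sum(1 for c in high if abs(c) < eps) / len(high)
```

Averaged over all blocks aligned to the 8x8 grid, the image with the higher fraction would be the likely lower-quality one. Naturally smooth pictures score high too, so this only makes sense when comparing copies of the same picture.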
http://www.cs.dartmouth.edu/farid/research/tampering.html
http://www.cs.dartmouth.edu/farid/publications/tr06a.html
I've been lax, in a way, in my pruning of late so the findimagedupes program found about 28000 groups of near duplicate images. Finding that many was a surprise and that's why I started looking to see if a program had been written yet for the next step, finding the better image. I wrote a little script that prunes the identical files but now run into the problem of non-identical files that contain the same or nearly the same image.
Even simpler mathematical analysis would include such techniques as seeing which one takes up more disk space. Last I checked, that was very highly correlated with compression level.
Exploit JPEG's weakness.
JPEG encodes pixels by using a cosine transform on 8x8 pixel blocks. The most perceptually visible artifacts (and the artifacts most likely to cause trouble for machine vision algorithms) appear on block boundaries.
Short answer:
a. 2D-FFT your image
b. Use the value of the 8-pixel period response in X and Y direction as your quality metric. The higher, the worse the quality.
This is a crude 1st approximation but works.
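A sketch of a cheaper variant of steps a and b: rather than a full 2D FFT, it evaluates a single DFT bin (period 8) of the column-to-column difference profile. This is a swapped-in simplification, not the exact recipe above; the image is assumed to be a 2D list of gray values whose block grid starts at x = 0, and the function names are made up.

```python
import cmath
import math

def column_difference_profile(img):
    """Mean absolute difference between horizontally adjacent pixels."""
    h, w = len(img), len(img[0])
    return [sum(abs(img[y][x + 1] - img[y][x]) for y in range(h)) / h
            for x in range(w - 1)]

def period8_response(profile):
    """Magnitude of the DFT bin at 1/8 cycles per sample, relative to the mean."""
    n = len(profile)
    mean = sum(profile) / n
    if mean == 0:
        return 0.0
    # Single-frequency DFT of the mean-removed profile.
    bin_value = sum((p - mean) * cmath.exp(-2j * math.pi * k / 8)
                    for k, p in enumerate(profile))
    return abs(bin_value) / (n * mean)
```

Run it on the transposed image as well to get the vertical direction. The higher the response, the stronger the 8-pixel block grid, and presumably the worse the quality.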
Here's a simple but expensive formula:
1. Get the image
2. Compress it severely.
3. Compare the difference between original and the compressed.
The lower the difference, the lower the image quality.
4. Profit!
Or you could just measure the amount of data in the DCT space. Duh.
AI or small utility... You never know with computers ;)
Analogies don't equal equalities, they are merely somewhat analogous.
Thou shalt not make a machine in the likeness of a human mind.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Or you could just measure the amount of data in the DCT space. Duh.
That'd be a Discrete Cosine Transform
(for the confused like me. Crazy what they can do with math these days)
Since the mods haven't noticed, and I don't have mod points, let me point out that THIS POST HAS THE ANSWER. A real program that will do what the asker wants. The source is available, but I can't seem to find its license (it includes some of the Independent JPEG Group's code). Also, doesn't a JPEG's EXIF data or some other tag in the file tell you what quality it was saved at?
The government can't save you.
Things such as thin wires, multi-colored ribbon cable, close-ups of a circuit board, and other images with lots of similar details seem to benefit most from this kind of tweaking, mainly thanks to the placement and qualities of the artifacts, rather than their mere existence or apparent severity.
I've had this happen many times - set an icon for, say, 35% quality and it will probably look kinda grungy, but step it down by just one or two percent and suddenly the artifacts shift around or change their appearance, sometimes in a manner that better suits the image - almost like constructive interference.
Thanks to the many who took this as a serious question and didn't turn this into a "It's just pr0n so who cares." Some is pr0n, some isn't, the most consistent thing is humor.
Many ideas needed the original image to find the better quality of the copy and some asked where I get these images from. These are linked in that I get the images from the USENET, from forums and from artists' galleries. This means that there's only a small set, from the artists' galleries, that I know are original. Others may be original but it may not be the original that comes to me first. On occasion, an artist may even publish the same image in different forms depending on the limitations of the different forums he frequents.
Some ideas were so nicely different from the directions I was following that they'll give me more to think about.
I'll also acknowledge those who said that how the image is represented is less important than what the image represents. That's quite true but if I have a machine that can find the best representation of something I enjoy then why not use it.
It almost does what he wants. He doesn't spell it out, but it seems strongly implied that he also wants a system capable of automatically finding these duplicates by itself, and then automatically determining which image is "best."
Which seems obvious, to me: If he's got enough photos of sufficient disorganization that he can't tell automatically which duplicate is best, then there probably isn't any straight-forward way (with filenames or directory trees or whatever) to find out which ones are dupes to begin with.
Judge, the afore-linked program, only does the job of finding the best image out of a set of duplicates.
What tool can be used to find the (near) duplicates to begin with?
Kid-proof tablet..
http://www.jhnc.org/findimagedupes/
There's a bunch, but I know you can construct command line operations with this one. I imagine you could construct a system from this and the parent program that will find dupes, then nuke the poorer quality of each, or whatever.
In general your best bet would be to use an image quality metric that takes into account how the human visual system works. The 2D frequency response of the human eye looks something like a diamond, which means that we see vertical and horizontal frequencies better than diagonal ones.
In fact, most image compression techniques (including JPEG) take this into account, however, conventional ways of determining the noise in images (minimum mean squared error, peak signal to noise, root mean squares) don't factor in the human visual system.
Your best bet is to use something like the structural similarity method (SSIM) by Prof. Al Bovik of UT Austin and his student Prof. Zhou Wang (now at the University of Waterloo).
You can read all about SSIM and get example code here: http://www.ece.uwaterloo.ca/~z70wang/research/ssim/
Or read more about image quality assessment at Prof. Bovik's website: http://live.ece.utexas.edu/research/Quality/index.htm
If you don't care about how it works, and just want to use it, you can get example code for ssim in matlab at that website and C floating around the net. The method is easy to use; essentially the ssim function takes two images and returns a number between 0 and 1 that describes how similar the images are. Given two compressed images and the original image, take the SSIM between each and the original. The compressed image with the higher SSIM value is the "best".
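To give a feel for what the SSIM function computes, here is a stripped-down sketch that evaluates the SSIM formula once over the whole image. The reference code uses a sliding window and averages the local values, so this single-window version is a simplification and will not reproduce its numbers. Inputs are assumed to be flat sequences of 8-bit gray values, and `global_ssim` is a hypothetical name.

```python
def global_ssim(a, b, dynamic_range=255):
    """Single-window SSIM between two equal-length sequences of gray values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    # Stabilising constants with the usual K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

With the original in hand you would compute `global_ssim(original, copy)` for each copy and keep the one with the higher value.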
It sounds like for your problem you might NOT have the original uncompressed image. In that case you might try checking for minimal entropy or maximum contrast in your images.
Essentially entropy would be calculated as:
h = histogram(Image);
p = h./(number of pixels in image);
p = p(p > 0);   % drop empty bins so log2(0) is never evaluated
entropy = -sum(p.*log2(p));
You will need to make sure you scale the image appropriately and don't take the log of zero! Or better yet, you should be able to find code for image entropy and contrast on the web. Just try searching for entropy.m for a matlab version.
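If you would rather not use matlab, the same entropy calculation is short in Python too. A minimal sketch, assuming the image has been flattened to a sequence of gray values; `image_entropy` is just an illustrative name.

```python
import math
from collections import Counter

def image_entropy(pixels):
    """Shannon entropy in bits of a flat sequence of gray values."""
    counts = Counter(pixels)
    total = len(pixels)
    # Only observed values appear in the Counter, so log2(0) never occurs.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely gray levels carry exactly one bit per pixel.
print(image_entropy([0, 0, 255, 255]))  # 1.0
```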
Good luck!
That's only a reasonable indicator if the two copies of the same image you are comparing are also the same resolution. It's not hard to have a higher resolution image consume less disk space if the compression level has been bumped up. Also, different programs usually produce different JFIF streams even when set to the same compression level and using the same *uncompressed* source image, making the DCT size approach even less reliable.
Unfortunately, it's not all that easy to compare. In general, the file with the higher byte count will be the better image, BUT ... The problem is that there are different ways to compress the same picture. There are several "controls", even in baseline JPEG: where the "quantisation" steps occur, and where the high frequency cutoff for each macroblock occurs. Then there are different ways for the JPEG engine to entropy encode the bitstream. (I.e., arithmetic coding is allowed by the JPEG standard; however, due to patent issues, most implementations use Huffman coding, which is slightly less efficient.) It should be remembered that the JPEG standard is just a baseline. Any implementer is free to improve upon the baseline coding, as long as it still decodes correctly. There used to be JPEG viewing software that decompressed and cleaned up images that looked terrible using "standard" JPEG decoding software. (I am not sure, but I suspect the blockiness and quantisation errors were smoothed out, improving the displayed image immensely.)
Of course, what you really need is the NCIS image enhancement package.
This just about gets to the heart of it. "Better" is a subjective term, so choosing better quality images is not going to be something everyone can agree on. Your example nails it. If you have two copies of the same image, one is higher resolution than the other, but saved with a higher compression rate, which is better? The answer is going to be "it depends on if the noise introduced by the higher compression annoys me more than the reduced information in the lower resolution image."
If the compression on the high resolution image is high enough, you might still have better detail in the lower resolution image. If the higher resolution image isn't actually higher resolution, just higher dimensions (it's the smaller image scaled up), this is automatically a lower quality image (you can always recreate the higher resolution image from the lower resolution image, but not vice versa as rounding errors cause information loss whenever you scale an image).
There may also be subjective differences like brightness/contrast/tone mapping differences.
Given that the question being asked is a subjective one, the correlation of file size to subjective image quality should be so high that you may gain only a few percent better predictability with an extremely complex algorithm.
Slay a dragon... over lunch!
You probably don't necessarily want to find the "best quality" image, but rather the image that was closest to the original.
I take it you're either trying to eliminate the low-quality duplicates or thumbnails from a really large collection of pr0n, or trying to write an image search engine that tries to present the "best" rendition of a particular image first.
For the second pass, you'd likely want to scan through the metadata first, especially stuff exposed by EXIF. So you'd want to give higher scores to EXIF data that makes it sound like it came directly off a digital camera or scanner, and bump down the desirability of pictures that appeared to have been edited by any sort of photo editing software.
Then maybe you want to look at something that would rank down watermarks or other modifications.
Another step would be to compare compression quality, but I think that's what most of the other posts are concentrating on. But this is a difficult step because it can be easily fooled, since idiots can re-save a low quality image with the compression quality cranked all the way up so the file size becomes high even though the actual image quality is worse than the original. You probably need to run it through one of those "photoshop detectors" that could tell you whether the image has been through smoothing or other filters in a photo editor. The originals (especially in raw format and maybe high quality JPEG) will have a certain type of CCD noise signature that your software might be able to detect. In the same vein, a poorly-compressed JPEG will have lots of JPEG quantization artifacts that your software might be able to detect as well. Otherwise, you're kinda left with zooming in on pics and eyeballing it.
Finally you might be left with a group of images that are exactly the same but have different file names... you probably want some way to store some of the more useful bits of descriptive text as search/tag metadata, but then choose the most consistent file naming convention or slap on your own based on your own metadata.
Hopefully this gives you a start to important parts of the process that you might have overlooked...