Slashdot Mirror


GZipping Life Forms: Deflate Reveals Bare-Bones

An anonymous reader writes "To distinguish images derived from living vs. non-living sources, USC and NASA JPL researchers report today using the standard gzip compression utility. As a measure of overall pattern complexity, they find that the inherent pixel content of biologically generated fossils produces higher image compression ratios [more data redundancy], compared to their non-biological counterparts. The more the file shrinks, the more likely it is that a living process was involved. A test is live online here. This extends the simple, but powerful, uses of gzip to biogenic fossil detectors, in addition to spam cop filters, DNA sequence comparisons, digital camera image crunchers, etc. In nine months, the two Mars rovers will send back the first microscopic-scale images of Mars rocks, which should be amenable to some of these same techniques: thus gzipping is apparently pretty zippy."
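
A minimal sketch of the measurement described above, assuming the raw pixel bytes are simply run through DEFLATE (the algorithm inside gzip) and the percentage saved is compared. The function name `percent_saved` and the toy buffers are illustrative assumptions, not the researchers' actual procedure.

```python
import random
import zlib


def percent_saved(data: bytes, level: int = 9) -> float:
    """Percentage by which DEFLATE (the algorithm inside gzip) shrinks data."""
    compressed = zlib.compress(data, level)
    return 100.0 * (1.0 - len(compressed) / len(data))


# Toy stand-ins for pixel data: a strongly patterned buffer and pure noise.
rng = random.Random(0)
patterned = bytes(range(16)) * 256
noise = bytes(rng.getrandbits(8) for _ in range(len(patterned)))

print(f"patterned: {percent_saved(patterned):.1f}% saved")
print(f"noise:     {percent_saved(noise):.1f}% saved")
```

Under this reading, the image with the larger percentage would be called "more probably biogenic"; whether that inference is sound is exactly what the comments below argue about.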

55 of 243 comments

  1. Makes sense... by Anonymous Coward · · Score: 4, Insightful

    Lifeforms seem to be built on patterns after all. Patterns are easily compressible.

    1. Re:Makes sense... by jolyonr · · Score: 5, Interesting

      Unfortunately it's not that simple: inorganic systems can have as much visual complexity as organic things. For example.. um.. (looks out of window here in Toronto).. a snowflake! Fractal complexity, such as that seen in the branches of a tree, is frequently mirrored in the inorganic world. The snowflake is one example; another, less well-known example is manganese dendrites, which look just like fossil plants but are totally inorganic, such as these [Victoria Museum]. The patterns of frost on a frozen windscreen are another example. I can't see how a computer program can distinguish whether such complex patterns are signs of life or not. Still, if it helps NASA get more funding, then who am I to argue! Jolyon

      --


      Please read my Canon EOS tech blog at http://www.everyothershot.com
    2. Re:Makes sense... by Ted_Green · · Score: 2, Insightful

      Of course, so do a lot of crystallized structures. Lots of things are built on patterns.

      Anyway, as far as this technique is concerned, this (organic images being more compressible) only holds true for organically created stromatolite structures vs. chemically created stromatolite-like structures.

      They've only done 20 images or so; I'd like to know the comparative compression ratios.

  2. I compress.. by mr.+methane · · Score: 4, Funny

    ... therefore I am.

    I'm not sure I should be flattered that the best way to tell a picture of me from a picture of a rock is that I have more redundant image data. :-)

    1. Re:I compress.. by DShard · · Score: 5, Funny

      That actually should flatter you. You have less entropy so you are of a higher order than the rock. You can brag to all your non-rock friends that those stupid rocks have high entropy.

    2. Re:I compress.. by 4of12 · · Score: 2, Interesting

      Not only are you, but you are uniquely Mr Methane, because each individual author has unique and identifying characteristics that can be measured using - guess what - compression algorithms.

      Given enough samples, individual authors can be identified, and graphs of language relationships can be drawn, too.

      I think it's interesting because it raises the bar on preserving anonymity if you publish widely.

      Add some entropy to your life; write drunk.

      --
      "Provided by the management for your protection."
  3. A-ha! by grub · · Score: 4, Funny


    So when we compress the ultimate, super-duper intelligent life form we get a two byte file containing "42"

    --
    Trolling is a art,
  4. I'd assume by Omkar · · Score: 2, Interesting

    that this has something to do with patterns and image continuity. If so (enlighten me!), then it would be a decent filtering tool, but reliability would be a major problem. Geological (or whatever) patterns could fool the algorithm. Finally, the most compressible image consists of monochrome - is it alive?

    (Mods: the last line was a joke, intended to point out a particularly simple example of a problem - not a troll)

  5. Excellent... by Anonymous Coward · · Score: 5, Funny

    No more sniffing when i'm checking items in the refrigerator - is it 'alive' ? gzip is the answer!

  6. uhhh.. huh? by SamBeckett · · Score: 2, Interesting

    Doesn't gzip only look for patterns in one dimension? Assuming they are using these for pictures, they are missing the boat on at least one more area of complexity!

  7. gzip gates by greenalbatros · · Score: 2, Funny

    then we will find out if he truly is the borg!

    --
    this sig steers like a cow. and i can prove it
  8. Be Humble by hugesmile · · Score: 4, Funny

    OK, so if I have this right: Life is less random, and more predictable (more compressible) than non-life.

    So that tells me that life contains less data than non-life.

    Perhaps sophisticated life (human life?) contains even less data than non-sophisticated life. So the smarter we get, the more predictable we get, and the less data we contain.

    Perhaps we will someday get smart enough to be totally compressed to one bit. In the time I thought about this concept, I think my gzip file got even more compressed. Hmm....

    1. Re:Be Humble by javatips · · Score: 5, Insightful

      > So that tells me that life contains less data then non-life.

      No, it means that life contains less noise than non-life.

    2. Re:Be Humble by p3d0 · · Score: 2

      No, you still have it wrong. Information is entropy. More information is more entropy. However, imagine the amount of information in a JPEG of your face, compared with a JPEG of bits from /dev/random. The latter will have more information and thus more entropy. That shouldn't give you an inferiority complex. :-)

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  9. I told you so! by twoslice · · Score: 2, Funny

    The Magic School Bus is true!

    --

    From excellent karma to terrible karma with a single +5 funny post...
  10. I was wondering... by mingthemerciless · · Score: 2, Funny

    ... if it could find life forms in my Doom WADs?

  11. bzip2? by maxwell+demon · · Score: 2, Interesting

    Has anyone checked if bzip2 is better or worse in detecting biological products?

    After all, they have quite different compression characteristics (on one hand, compression of a megabyte of zeroes is much better in bzip2, OTOH adding the same file on top of itself and then compressing gives much less additional compressed size with gzip than with bzip2 - tested with /usr/src/linux/kernel/sys.c, 24957 bytes uncompressed).
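
    A rough way to repeat this comparison, assuming Python's `zlib` and `bz2` modules are acceptable stand-ins for the gzip and bzip2 command-line tools (the container headers differ slightly). The helper name `compressed_sizes` is made up for illustration, and the doubled-file part is left commented out because the kernel source file is not at hand.

```python
import bz2
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under DEFLATE (zlib) and bzip2, both at level 9."""
    return {"zlib": len(zlib.compress(data, 9)), "bz2": len(bz2.compress(data, 9))}


zeros = bytes(1024 * 1024)  # a megabyte of zeroes
print("zeroes:", compressed_sizes(zeros))

# Any ~25 KB text file will do here; the path is a placeholder.
# text = open("sys.c", "rb").read()
# once, twice = compressed_sizes(text), compressed_sizes(text + text)
# print({k: twice[k] - once[k] for k in once})  # extra bytes for the second copy
```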

    --
    The Tao of math: The numbers you can count are not the real numbers.
  12. The fractal geometry of nature? by RNG · · Score: 4, Interesting

    Although I'm certainly no compression expert, I think this makes sense. Many (most?) natural systems have fractal structures on some level so it only makes sense for them to compress better (ie: have more self-similar features) than systems which don't have this feature.

    Then again, what do I know? Maybe someone more immersed in this field can tell us whether there's a seed of truth to my ramblings ...

    Greetings
    --> R

    1. Re:The fractal geometry of nature? by jeff_bond · · Score: 2, Insightful
      Could you give some examples of fractal structures in a human?

      For starters, how about the branching structure of the airways in your lungs?

      Jeff

      --
      stty erase ^H
  13. Thought this would be somewhat obvious... by ignoramus · · Score: 2, Insightful

    Every one of us is incredibly redundant, and I don't just mean in our posts on Slashdot!

    Simply consider that you can have a reasonably good duplicate of yourself, with only the DNA contained in a single cell!

    You may need most of your parts to be functional but, information-wise, it all comes down to 1 germ cell (say, a spermatozoid) and the apparatus needed to move it into proximity of another compatible germ cell ;)

    1. Re:Thought this would be somewhat obvious... by AugustMoon · · Score: 2, Interesting

      Your DNA is only sufficient to create another state machine with the same rules you had at birth.

      It will not re-create your complexity because our dna-state machines are designed to create brains which are 'genetically-memoryless', capable of self modification, and have incredible data collection and storage capacity.

      Think of your DNA as the graphics engine for Quake. It is relatively small (space-wise) compared to the textures and levels. Add different data, and you still have a first-person game, but a completely different one.

  14. this might have a few glitches by jj_johny · · Score: 4, Funny
    When I compressed the transcript of the Osbournes, it got incredibly high compression but I don't think they are intelligent life forms. Or maybe I am really wrong.

    This post can't be compressed.

  15. The Mars fossil IS made by life; my wife is not. by Saint+Aardvark · · Score: 5, Funny
    In a true first for extraterrestrial biotic research, I decided to compare two pictures (a Mars image and a picture of my wife) at the comparison page attached to the article, which lets you run the same test on images that the researchers tried. In a startling discovery that is sure to earn me a Nobel Prize for Physics, Chemistry, Biology and Marital Relations, I was told the following:

    "Answer: Image 1 [the Mars image](1.43702451394759 % compression) has a higher complexity measure than image 2[the image of my wife] (0.773501341151519 % compression), and thus image 1 is more probably biogenic."

    Not only does this prove that there was once life on Mars, but it also proves that my wife is some sort of robot. Further research will be undertaken pending receipt of my prize money.

  16. The.. by saqmaster · · Score: 2, Funny

    .. thought of being gzipped is quite disturbing.

    Mad Scientist: "Fire up the GZip Continuum Transfunctioner!"
    Operator: "Okay, Boss"

    *Bizzzttt*

    --
    "Never let the truth get in the way of a good story..."
  17. Information vs. Meaning by 16977 · · Score: 2, Interesting

    One of the posters brings up an interesting point. Although meaningful data has more information than pure noise, it also has less than a blank signal. When you download pictures, regardless of the "meaning" they have to you, their compression can vary a considerable amount. And you've probably heard the statistic that the English language is 50 percent redundant. That figure may vary a bit too, but the point is that English's meaning to us is independent of its information content. And the probability that an image of a life form with more information will also have more "meaning" is probably just as uncertain.

  18. Kolmogorov Complexity by MarkWatson · · Score: 4, Interesting
    This seems like a "sort of" restatement of Kolmogorov Complexity.

    Roughly, Kolmogorov Complexity is a measure of randomness - the measure is how long a computer program needs to be to reproduce data (pardon an oversimplification).

    -Mark

  19. Re:The Mars fossil IS made by life; my wife is not by Anonymous Coward · · Score: 5, Funny


    The problem here is that your wife is wearing clothes. Clothes are man made.

    If you send me a picture of your unclothed wife, I'll be happy to, uhm, test this theory.

  20. Filtering Images by CommieBozo · · Score: 2
    While slightly different, this reminds me of the way I filtered a bunch of images from a video camera. I was taking many frames per second of a thunderstorm and I wanted to find which frames out of thousands contained lightning strikes.

    It was pretty simple... Images over a certain size contained lightning, the others were mostly black, therefore smaller. Once I filtered it that way, manually filtering out the better images was easy.
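
    The filter described here can be sketched as a plain size threshold, assuming the frames are stored as individually compressed files such as JPEGs. The function names, the `*.jpg` pattern and the sizes in the example are all made up for illustration.

```python
from pathlib import Path


def frame_sizes(directory: str, pattern: str = "*.jpg") -> dict:
    """Map each frame's file name to its size on disk in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(directory).glob(pattern))}


def likely_lightning(sizes: dict, threshold: int) -> list:
    """Names of frames whose compressed size exceeds the threshold."""
    return [name for name, size in sizes.items() if size > threshold]


# Example with made-up sizes: dark frames compress small, bright ones do not.
sizes = {"f0001.jpg": 9_800, "f0002.jpg": 41_200, "f0003.jpg": 10_100}
print(likely_lightning(sizes, threshold=20_000))  # prints ['f0002.jpg']
```

    In practice the threshold would have to be picked by looking at a few known dark frames first.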

  21. Re:The same image... by maxwell+demon · · Score: 2, Informative
    Hmmm.... really as "image1" and "image2", and not as "img1" and "image_2_with_an_incredibly_long_file_name"?

    BTW, if you want to be file name independent, you can use
    cat file | gzip -c9 | wc -c
    This way, gzip doesn't see the file name, and therefore doesn't include it into the .gz file.
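
    To see the file-name effect in isolation, here is a small sketch using Python's `gzip` module, assuming it writes the same optional name field into the gzip header that the command-line tool does. The helper `gz_size` is a hypothetical name; `mtime=0` only pins the timestamp so that runs are comparable.

```python
import gzip
import io


def gz_size(data: bytes, filename: str = "") -> int:
    """Size of a gzip stream for data, optionally recording a file name in the header."""
    buf = io.BytesIO()
    # mtime=0 keeps the header timestamp fixed so runs are comparable.
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf,
                       compresslevel=9, mtime=0) as gz:
        gz.write(data)
    return len(buf.getvalue())


data = b"same image bytes" * 100
short = gz_size(data, "img1")
long_ = gz_size(data, "image_2_with_an_incredibly_long_file_name")
print(long_ - short)   # the extra bytes come only from the longer name
print(gz_size(data))   # no name stored, comparable to piping through gzip -c
```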
    --
    The Tao of math: The numbers you can count are not the real numbers.
  22. Operating Principle? Kolmogorov Complexity by fygment · · Score: 3, Informative

    Read about it in _the_ book (http://www.cwi.nl/~paulv/kolmogorov.html) or check out the web site here (http://www.hutter1.de/kolmo.htm). For a more succinct idea of the approach, see these articles by one of the gurus on the topic (http://www.cs.ucsb.edu/~mli/focs.ps and http://www.cwi.nl/~paulv/papers/ecml97.ps).

    --
    "Consensus" in science is _always_ a political construct.
  23. Re:The Mars fossil IS made by life; my wife is not by (startx) · · Score: 3, Interesting

    ahh, but the picture of your wife contains a lot of inanimate objects. I'm sure if you cropped the picture down to just her (or reasonably close) she would fare better in this comparison.

  24. I am not by Karpe · · Score: 4, Funny

    I compress to binary 0, therefore I am not.. :(

  25. Biological clocks in unicorns... by dpbsmith · · Score: 4, Interesting

    zip is a fine thing, but it's not a pattern-recognition program!

    This is the loopiest thing I've heard of since Rosenblatt reported that his Perceptrons could distinguish between music composed by Bach and music composed in imitation of Bach.

    Good heavens, any picture that's slightly out of focus will now be declared to be evidence of "biological processes."

    I'm guessing that the researchers are not as nutty as they sound and that they've done more than is being reported, but still...

    Reminds me of the researchers in the sixties who were publishing analyses of data that supposedly showed "biological clocks." It turned out that they were using smoothing algorithms that, basically, were filters that had a 24-hour peak in the frequency domain--so their analysis was creating the patterns they claimed to be detecting. A debunking article was published in Science in which another researcher used data from a random number table (the "unicorn" data) and showed that the same analysis techniques showed that the unicorn had a biological clock.

    1. Re:Biological clocks in unicorns... by archeopterix · · Score: 2, Insightful
      Similar thoughts here. From the article:
      So how does one separate the wheat from the chaff, the true stromatolites from the fakes?
      One method is to examine the suspect rock with a microscope, looking for visual evidence of microorganisms. But as researchers who study ancient terrestrial rocks - and one notorious Martian meteorite - have discovered, it isn't all that easy to tell, just by looking at shapes, whether or not a microscopic blob in a rock was once alive.
      So, what do they verify the gzip method against? Their guesses about the image origins? Does not look great from the methodology standpoint, eh?
  26. gzip - the swiss army knife utility by kinnell · · Score: 5, Funny

    I myself have successfully used gzip for factoring large prime numbers, sorting the men from the boys, unblocking the kitchen sink and cracking safes. I'm currently trying to locate Osama Bin Laden by compressing Al Jazeera footage, but all I come up with are reports of Elvis sightings.

    --
    If I seem short sighted, it is because I stand on the shoulders of midgets
  27. Slightly Dodgy by jolyonr · · Score: 5, Interesting

    This whole thing is slightly dodgy, and I begin to wonder whether it was released a day early by mistake.

    The big problem is the use of JPEG source images. Unless the quality setting was at its maximum, the JPEG artifacting (which in effect repeats blocks of image data after transitions) will probably mask any hidden level of complexity in the images. The human brain is a much better tool for pattern recognition than most computer algorithms (especially algorithms not designed for the task!).

    Throw high-resolution bitmap files at it, and I'd be more persuaded that there is a genuine effect. Until then, I suspect it's more of a happy coincidence that the files they've thrown at it give results they are excited about.

    Jolyon

    --


    Please read my Canon EOS tech blog at http://www.everyothershot.com
    1. Re:Slightly Dodgy by kris_lang · · Score: 2, Interesting
      I've seen similar errors made by vision science (note that I did not say "image processing") researchers trying to analyze natural scene statistics and come up with interesting patterns. They created "basis functions" and did principal component analysis on sets of images and came up with a basis set that looks curiously like the base images of the DCT (discrete cosine transform), the underlying calculations of the JPEG image format. This is to be expected when you start with a set of images that are JPEG compressed.

      This was actually published in a (barely) peer-reviewed journal, Vision Research. I didn't say "image processing" above because a lot of these vision scientists seem to be psychologists doing visual psychophysics without having a strong background in math, or optics, or (it seems at times) the fundamentals of science.

      The other thing to take into consideration is that gzip is "pseudolinear". It does not take into account the 2-dimensional correlations that exist in image data. Even fax compression takes advantage of it. (and yes, I do realize that gzip can account for runs from previous regions regardless of length or location, but I am trying to point out that there is a specific 2-dimensional set of correlations extant in 2-d image data).

      In these cases being cited that use GZIP, the major function of GZIP seems to be as an indicator of the presence or absence of high-frequency components in the signal stream. Lots of irregular high frequency -> low compressibility; very little irregular high frequency -> high compressibility.
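
      A toy illustration of that last point, assuming a one-dimensional stream of 8-bit samples stands in for the image data. The signals and the helper `deflate_size` are invented for the example; the only claim is that irregular sample-to-sample jitter makes DEFLATE do worse than it does on a smooth signal.

```python
import math
import random
import zlib


def deflate_size(samples) -> int:
    """Bytes needed by DEFLATE for a sequence of 8-bit samples."""
    return len(zlib.compress(bytes(samples), 9))


rng = random.Random(0)
n = 4096
# Smooth, low-frequency signal: one sine cycle every 256 samples.
smooth = [round(128 + 100 * math.sin(2 * math.pi * i / 256)) for i in range(n)]
# Same signal with irregular high-frequency jitter added to every sample.
jittery = [s + rng.randint(-20, 20) for s in smooth]

print("smooth: ", deflate_size(smooth))
print("jittery:", deflate_size(jittery))
```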

  28. Re:why no bzip2 ? by bill_mcgonigle · · Score: 5, Interesting

    doesn't bzip2 outperform gzip?

    gzip might be preferable because it works more locally. It only keeps track of the last n bytes of data and does substitutions based on patterns seen in those n bytes.

    bzip2 uses a block-sorting (Burrows-Wheeler) transform over blocks of up to 900 KB, far more context than gzip looks at, so the compression is less local. That's great if you're going for compression but for this work, it might be misleading.

    That said, gzip doesn't know about image formats, so I wonder if these guys are getting some false positives on scanline wraps and other non-image data.
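
    The locality is easy to see in a sketch: DEFLATE can only point back 32 KiB, so a repeat that starts further away than that is invisible to it. The block sizes and the helper name `doubled_size` below are arbitrary choices for the demonstration.

```python
import random
import zlib

WINDOW = 32 * 1024  # DEFLATE back-references reach at most 32 KiB into the past


def doubled_size(block: bytes) -> int:
    """DEFLATE size of the block followed by an exact copy of itself."""
    return len(zlib.compress(block + block, 9))


rng = random.Random(0)
near = bytes(rng.getrandbits(8) for _ in range(WINDOW // 2))  # copy starts 16 KiB back
far = bytes(rng.getrandbits(8) for _ in range(WINDOW * 2))    # copy starts 64 KiB back

print(len(near), doubled_size(near))  # second copy costs very little
print(len(far), doubled_size(far))    # second copy is as expensive as the first
```

    For an image stored row by row, this means rows that sit more than 32 KiB apart in the file cannot be matched against each other at all.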

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  29. Re:The Mars fossil IS made by life; my wife is not by gatesh8r · · Score: 2, Funny

    Wait wait wait you have a wife? Dude, this is Slashdot; are you sure you're not a misdirected user???

    --
    Karma whorin' since 1999
  30. 42 by snarkh · · Score: 2, Insightful


    42 is one byte.

  31. gzip == measure of information content by firecode · · Score: 2, Informative

    This is not surprising at all really. Gzip and other compression utilities can be used to get upper bound for real/nonredundant information content.

    I'm not sure if the above is public knowledge, but I have used it as one additional feature for certain pattern recognition tasks for a while.
  32. Compression to measure semantic content by KingRamsis · · Score: 3, Interesting

    This came up in an interesting coffee-break discussion with one of my professors. We were arguing about whether there is a neat way to estimate the semantic content of a neural network after training it. I recall suggesting that we compress the weight values of all layers: the less compressible they are, the more the network has been trained.

  33. Re:and language detection. by spot35 · · Score: 3, Informative

    Could this be what you're after?

  34. Pattern Recognition by cyber_rigger · · Score: 2, Interesting


    I envision a whole array of compression algorithms.

    Each algorithm could be fine tuned for a particular type of pattern.

    Is that an elephant or a giraffe?
    Does it compress better with the elephant algorithm or the giraffe algorithm?

  35. Not particularly ironic by delphi125 · · Score: 2, Funny

    It is a well-known fact that any compression algorithm will cause some files to increase in size when 'packed'. If this were not the case, then '42' would be the compressed version of some other file, say 'Wdugiu*6x9', which in turn would be the compressed version of DNA's DNA, which in turn might be the compressed version of the answer to life, the universe, and everything. Furthermore, everybody's DNA would compress down to the same file '42' (since we all contain the answer within ourselves; presumably mice would compress down to something else), which would mean we were all clones, which means that I am the Pope and you are CowboyNeal (and vice versa). QED.
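
    The serious half of this is a counting argument, sketched below with DEFLATE as the example compressor. The choice of `n = 16` and of 4096 noise bytes is arbitrary.

```python
import random
import zlib

# Counting argument: there are 2**n bit strings of length n, but only
# 2**n - 1 strings that are strictly shorter (lengths 0 to n-1 combined).
n = 16
shorter = sum(2**k for k in range(n))
print(2**n, shorter)  # one more input than available shorter outputs

# In practice: incompressible input comes out of DEFLATE larger than it went in.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))
print(len(noise), len(zlib.compress(noise, 9)))
```

    So a lossless compressor that shrinks some inputs has to leave others the same size or make them longer.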

  36. Separate the chaff by Anonymous+Struct · · Score: 2, Interesting

    I doubt this is very accurate for marking photos as hits or misses directly. This kind of thing may be useful more for detecting the lack of life rather than the presence of it. If compression rates are low, maybe you don't have to look at this photo so much. If they're high, maybe you want to examine it more closely. If you're dealing with truck loads of data and you're looking for a needle in a haystack, a mechanism for ruling out uninteresting data is invaluable.

    That having been said, it sounds good in theory that 'organisms are highly patterned and therefore compress better', but then why would you use gzip? Why not take that theory and build something a little more adept at locating particular types of patterns you're interested in, or ruling out the ones you know are going to create false positives?

    So, THAT having been said, I'm forced to wonder if somebody forgot that March has 31 days. Lord knows I can never keep track.

  37. hidden markov models by nounderscores · · Score: 2, Interesting

    Interesting. For genome analysis Hidden Markov Models have been used in a lot of software.

    Maybe if you could have an image recognition system do the Hard Machine Vision problem of generating a schematic of the picture, and then fed the "leg bone is connected to the hip bone" kinda data into a HMM you could work out which fossils are ancient Cambrian crustaceans and which ones are Trogdor the Burninator.

  38. viruses? by Mentally_Overclocked · · Score: 2, Interesting

    I wonder if viruses (sorry - didn't RTFA) would compress like living life forms or if they would be more similar to nonliving.

    Just a thought.

    --

    Mathematician, n.:
    Someone who believes imaginary things appear right before your i's.
  39. Pretty sloppy, you mean... by TheSHAD0W · · Score: 2, Insightful

    There are other techniques for measuring the level of chaos in a set of data, and they'd probably yield more consistent results than running the data through an algorithm meant for an entirely different purpose.

  40. Bzip2? Bah, newfangled rubbish! by Viol8 · · Score: 3, Funny

    What about compress? Or even good old "compact". Ah I remember the days when we had 20% compression
    and were glad of it and some of the old timers could have been confused with non living processes
    even without the help of gzip anyway!

  41. this can also detect PHB's by IDigUNIX · · Score: 4, Funny
    As an alternative to this hypothesis, consider:
    feed a business technology proposal through gzip
    • A very high compression ratio indicates that the proposal was likely written by consultants, as supported by the fact that they usually re-use the same buzz phrases over and over.
    • A moderate compression ratio indicates that the proposal was written by engineers. Typically they use large words and unique phrases that are already compressed, e.g. SNMP, J2EE, WWW, and so on.
    • A zero to negative compression ratio indicates that the proposal was likely written by a PHB, and is hence void of all indications of intelligent life, as evidenced by most PHBs having a hard time using buzz phrases and keywords in context, so they won't recycle enough words to form a good compression dictionary.
  42. Re:Cool by tijnbraun · · Score: 3, Interesting

    A similar technique has been used by Italian mathematicians to tell pages by various authors apart using zip. A Nature article can be found here. After a request from a Dutch newspaper, they were able to identify one author (Marek van der Jagt, who was making his debut) as being the same as an already well-known author (Arnon Grunberg).
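
    One way such an attribution could work, as a sketch only: append the unknown text to a reference text from each candidate author and see after which one it costs the fewest extra compressed bytes. This is an assumption about the general idea, not a description of the published method; the names `extra_bytes` and `closest_author` and the toy corpora are invented.

```python
import zlib


def extra_bytes(reference: str, sample: str) -> int:
    """How many more compressed bytes the sample costs when appended to reference."""
    ref = reference.encode("utf-8")
    both = ref + sample.encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))


def closest_author(corpora: dict, sample: str) -> str:
    """Name of the reference corpus after which the sample compresses best."""
    return min(corpora, key=lambda name: extra_bytes(corpora[name], sample))


# Toy corpora, far too small for real attribution.
corpora = {
    "author_a": "to be or not to be that is the question " * 20,
    "author_b": "it was the best of times it was the worst of times " * 20,
}
print(closest_author(corpora, "that is the question to be or not "))
```

    With a DEFLATE-based compressor the reference text has to stay within the 32 KiB window, or the compressor will have forgotten the beginning of it by the time the sample arrives.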

  43. Did something like this years ago by rasper99 · · Score: 2, Informative
    I used a technique like this to do a web cam way back in 1997 before web cams were an easy thing to do. I was supporting Silicon Graphics workstations at the time. One of the models came with a digital camera. The cameras did not have automatic exposure.

    Using CGI as the user hit the web page it took pictures at different shutter speeds. Working up from the slowest shutter speed the first JPG over 20K bytes was the right exposure and was shown on the page.

  44. Gzip doesn't preserve well... by Anonvmous+Coward · · Score: 2, Funny

    "This extends the simple, but powerful, uses of gzip to biogenic fossil detectors..."

    The problem with gzip is that it doesn't preserve data very well. Now tar, it preserves fossil data quite well.

  45. Windows XP is alive! by laard · · Score: 2, Funny

    Did a test run with some default images in Windows XP. Windows XP's "Purple Flower.jpg" is apparently more "alive" than Windows XP's "Tulips.jpg" but "Windows XP.jpg" is more alive than both of them!

    --
    --- If we knew half the things we shouldn't we'd stop wishing we knew it all