Slashdot Mirror


Naming All Lifeforms On Earth With Hash Functions

First time accepted submitter ssasa writes "A Virginia Tech researcher is proposing a new naming system for all life on earth [based on each organism's] genetic fingerprint — basically something like a hash function of an organism. Hash functions are in common use in software development. Hopefully it will pass some time before we see a hash collision between a cat and some dinosaur."

24 of 97 comments

  1. The actual journal article by Anonymous Coward · · Score: 4, Informative

    For those that want to read the actual journal article
    http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0089142

    The word hash is never mentioned either :)

    1. Re:The actual journal article by Fwipp · · Score: 2

      An important limitation of his approach is that it only works for "all organisms whose genomes can be aligned to each other." (With no mention of how "good" the alignment has to be, nor the fact that alignment is not objective.)

      So, you'd have multiple schemas for each "group" of organisms. I think his idea is possibly applicable to, say, describing multiple samples within a species. It's clearly ill-suited for a universal naming strategy like the article proposes, though.

    2. Re:The actual journal article by davester666 · · Score: 2

      I'd totally hash that!

      --
      Sleep your way to a whiter smile...date a dentist!
    3. Re:The actual journal article by Anonymous Coward · · Score: 2, Insightful

      So are every two people who aren't twins going to have a completely different hash function?

      Perhaps a better scheme would be to assign a function that describes the genetic similarity between two organisms. Well, we kinda already have that. We can use a percentage, and if all organisms are 90 percent similar and only vary by ten percent, for instance, we can narrow our function to those ten percent. Create a new scale from 1 to 100 where the genetically most similar organisms would be grouped next to each other (a 1 would be genetically very similar to a 2, varying by a percentage of a percentage or whatever it scales to) and the least similar organisms would be grouped far apart (a 100 would be genetically least similar to a 1, varying by ten percent). Wait ... we kinda already do that.

      So what are the advantages of this guy's ideas?
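The "percentage" similarity this comment leans on can be made concrete with a toy calculation. The sketch below is illustrative only: it assumes the two sequences have already been aligned to the same length, ignores gaps entirely, and uses made-up sequences. The function name `percent_identity` is hypothetical and is not taken from the paper.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Made-up sequences that differ at exactly one of ten positions.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAC"
print(percent_identity(seq_a, seq_b))
```

Real genome comparisons have to deal with alignment, gaps, and rearrangements, which is exactly the limitation raised elsewhere in this thread.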

    4. Re:The actual journal article by GoodNewsJimDotCom · · Score: 2

      (With no mention of how "good" the alignment has to be...

      Chaotic good suffices.

    5. Re:The actual journal article by Frosty+Piss · · Score: 3, Informative

      An important limitation of his approach is that it only works for...

      Those who pay the licence (since it's being patented)?

      --
      If you want news from today, you have to come back tomorrow.
    6. Re:The actual journal article by Anonymous Coward · · Score: 2, Insightful

      and the idea is nothing new except he is adding more digits and making it more confusing for us by removing the intuitive base ten that has been the scientific standard since the metric system and replacing it with something worse (kinda like how the 'standard' system is worse than the metric system).

    7. Re:The actual journal article by Anonymous Coward · · Score: 2, Informative

      (same author)

      I just read the article (I know, I should have read it before posting anything earlier instead of relying on the often misleading Slashdot summary) and while I don't really understand what he's doing it does seem to make more sense than what the Slashdot summary shows.

      and, no, my original idea does not seem to be what the article proposes (my original idea is an obvious improvement over what the misleading Slashdot summary proposes but next time I should read the article before posting).

      I agree that the current way organisms are named and classified can be inconsistent and confusing. A lot of the time the usefulness is based on whether a particular bacterium produces a specific enzyme (enzymatic tests can be done to determine this, and then substances that inhibit the enzyme can be used to stifle the bacteria). But what I found interesting about the article is that

      "

      With the naming scheme developed by Vinatzer, the name of every single anthrax strain would contain the information of how similar it is to other strains. Using Vinatzer's genome sequence, the Ames strain used in the bioterrorist attack would, for example, be known as lvlw0x and the ancestor of this strain stored at the U.S. Army Medical Research Institute for Infectious Diseases would be known as lvlwlx.

      Vinatzer's naming convention would also give researchers the ability to name new pathogens in a matter of days—not months or years—based on their similarities to known pathogens.

      The proposed naming process begins by sampling and sequencing an organism's DNA. The sequence is then used to generate a code unique to that individual organism based on its similarity to all previously sequenced organisms."

      The article is kinda vague and I would like to see more detail on exactly how it works but it does seem like it has potential.

  2. Names are for communication by Chemisor · · Score: 4, Funny

    I think I'll go hunt some af7caaf1e73a2d24924371a370b4ef9b so I can feed my 362842c5bb3847ec3fbdecb7a84a8692 and have a nice quiet evening with my 34b46c8cf192431e84ea81109660367b, chatting about the difficulty of talking about a474fb23f886eeaa16223eba872e53b1 that some socially inept scientist decided to name with a hash function.

    1. Re:Names are for communication by physicsphairy · · Score: 2

      Names are indeed for communication, but 'name' here is mostly bad terminology, or at least The Fine Article leads me to believe these are meant more as serial numbers to supplement the existing system of nomenclature than anything else.

      Which is actually somewhat useful. Any research project starts by looking at what other research has already been done. It's no good if your search terms don't bring up the relevant papers. I suppose this might be somewhat like the nomenclature system for chemistry, in which the IUPAC standard for naming molecules has replaced common names. Frequently used chemicals still are referred to by common names, but mostly even if a molecule you encounter has a common name you're not likely to know it off the top of your head. It's pretty handy to be able to figure out the standard name by its structure, so you can then search for it or look up its properties in the CRC.

    2. Re:Names are for communication by Darinbob · · Score: 3, Funny

      It all tastes like 518bf09f107329cef14fd9c9dbddab3c anyway.

    3. Re:Names are for communication by subreality · · Score: 2

      ["dog\n", "deer\n", "wife\n", "animals\n"] ... People would find these names easier to understand if you used "echo -n".
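The point about `echo -n` is that a plain `echo` appends a newline, so the digest covers the word plus "\n" rather than the word alone. A minimal sketch of the difference, using Python's standard `hashlib` (the helper name `md5_hex` is made up for illustration, and the claim that the parent's hashes are MD5 digests of these words is the commenter's, not verified here):

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of the UTF-8 encoding of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


with_newline = md5_hex("dog\n")   # the bytes that `echo dog` would emit
without_newline = md5_hex("dog")  # the bytes that `echo -n dog` would emit

print(with_newline)
print(without_newline)
```

The two digests have nothing visibly in common even though the inputs differ by a single byte.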

    4. Re:Names are for communication by Hognoxious · · Score: 2

      I don't think they're like serial numbers. A serial number should be just that - assigned serially to whatever is produced/discovered next. These will have meaning embedded in them.

      I generally don't like identifiers with meaning. Yeah, let's give the females odd employee numbers and the males even ones. And while we're at it, make the 3rd digit indicate their grade and the 5th their education level...

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  3. No success by Sven-Erik · · Score: 4, Informative

    Not so sure this will take off since they have applied for a patent and want users to pay a license fee to use it.

    --
    - "Every demand is a prison, and wisdom is only free when it asks nothing." Sir Bertrand Russell
    1. Re:No success by Frosty+Piss · · Score: 2

      For those that did not RTFA:

      Virginia Tech is submitting a patent describing the naming scheme. Vinatzer and his collaborator Lenwood Heath, a professor in the Department of Computer Science in the College of Engineering, founded This Genomic Life Inc., which will license the invention to develop it further.

      --
      If you want news from today, you have to come back tomorrow.
  4. Biology and Computer Science Two Way Street by utkonos · · Score: 5, Insightful

    Last month at ShmooCon, a talk was given about spatial analysis of malware samples. The technique is borrowed directly from bioinformatics. This is a great example of techniques from Biology being used effectively in the IT security realm.

    I hope that the researcher involved in naming organisms based on hash algorithms chooses context triggered piecewise hashes (CTPH), AKA fuzzy hashing, or a similarity hash algorithm rather than an algorithm like SHA512. Google's simhash, or at least the ideas behind this type of algorithm, would lend themselves much better to the naming of organisms.

    FYI: a FOSS implementation of fuzzy hashing is called ssdeep. The project site is here. This is an implementation that is widely used in open source malware analysis tools like Cuckoo Sandbox.
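To illustrate what a similarity hash does, here is a rough simhash-style sketch. It is not ssdeep, not CTPH, and not Google's implementation: it assumes overlapping 4-character substrings as features, derives a 64-bit feature hash from MD5 purely for convenience, and uses made-up sequences. All function names are hypothetical.

```python
import hashlib


def kmers(seq: str, k: int = 4):
    """Yield the overlapping length-k substrings used as features."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def feature_hash(feature: str) -> int:
    """64-bit integer taken from MD5; any well-mixed hash would do."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(seq: str, k: int = 4) -> int:
    """64-bit fingerprint: each bit is a majority vote over all feature hashes."""
    counts = [0] * 64
    for feature in kmers(seq, k):
        h = feature_hash(feature)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Made-up sequences differing in a single character near the end.
original = "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCCA"
mutated = "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCGA"
print(hamming(simhash(original), simhash(mutated)))
```

Because most features are shared between the two inputs, most bit votes tend to come out the same way, so similar inputs usually end up a small Hamming distance apart. With SHA512 the two digests would be unrelated.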

    1. Re:Biology and Computer Science Two Way Street by dirt · · Score: 2

      Thanks for those links. Comments like yours are why I continue to read /.

      --

      ---
      You are not what you own -- Fugazi, "Merchandise"
  5. Individuals of the same species have by Anonymous Coward · · Score: 3, Insightful

    differing genetic code.

  6. The most obvious problem with this approach by Anonymous Coward · · Score: 4, Informative

    This kind of thinking has a tremendous problem with it. Presently, an organism takes the name of a previously described species if and only if it is a member of the same species as a particular type specimen from which the species is described. This holotype serves as the reference specimen for each species. This system has worked extraordinarily well for more than 200 years and has promoted nomenclatural stability.

    The biggest problem with attempting to identify species on the basis of their genetic "fingerprint" or bar code is that unless you have some other means to establish that the specimen from which the genetic material was taken is in fact from the same species as the holotype, the genetic fingerprint will simply misidentify the specimen. This is a major problem for much of the genetic data in GENBANK, for which, more often than not, there is no longer a means of associating the source of the genetic material with a specimen (whose identity can be established independently), because the original specimens are seldom vouchered or saved. Consequently, the actual identity of the species that has been sequenced remains uncertain, even if alignments of the code are "perfect". As for the patent, the rules of Zoological Nomenclature forbid the commercialization of names used in science. These guys can make up their own naming scheme, but scientists, who must rely on having their work at least in principle repeatable and refutable, will be unable to use it for the purposes of science.

    1. Re:The most obvious problem with this approach by utkonos · · Score: 2

      I think you're mostly correct, except for the case of organisms with horizontal gene transfer such as bacteria and archaea. The current naming convention breaks down when it is applied to this type of organism.

  7. Not sure how similar this is to hashing by fozzy1015 · · Score: 3, Informative

    I first thought the genetic sequence of an organism would be the input to a hash function, but reading further that doesn't seem to be the case.

    "Using Vinatzer's genome sequence, the Ames strain used in the bioterrorist attack would, for example, be known as lvlw0x and the ancestor of this strain stored at the U.S. Army Medical Research Institute for Infectious Diseases would be known as lvlwlx."

    The output name would still show ancestry using identical values, when one of the key properties of a hash function is that small changes in the input result in a completely changed output.
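The property described here, where a small change in the input scrambles the output, is easy to see with a cryptographic hash. A minimal sketch using SHA-256 from Python's standard library; the two sequences are made up and stand in for a strain and its near-identical ancestor, and the helper names are hypothetical:

```python
import hashlib


def sha256_int(text: str) -> int:
    """SHA-256 digest of text, as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Number of digest bits that differ between the hashes of a and b."""
    return bin(sha256_int(a) ^ sha256_int(b)).count("1")


# Made-up sequences that differ only in the final character.
strain = "ACGTACGTTAGCCGATAGCT"
ancestor = "ACGTACGTTAGCCGATAGCA"

print(hashlib.sha256(strain.encode("utf-8")).hexdigest())
print(hashlib.sha256(ancestor.encode("utf-8")).hexdigest())
print(differing_bits(strain, ancestor), "of 256 bits differ")
```

For a hash of this kind, roughly half the bits are expected to flip, so the two digests share no readable prefix. That is the opposite of the lvlw0x / lvlwlx pair quoted above, where the shared characters are the whole point.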

    1. Re:Not sure how similar this is to hashing by martin-boundary · · Score: 2
      No, cryptographic hash functions have certain strong guarantees, but all(*) hash functions are supposed to mimic independent, uniformly random, behaviour of inputs. Since in the physical world, inputs often come from processes, and processes tend to evolve continuously, the inputs to be hashed by a computer system often have some amount of similarity if they occur close together in time. Thus to transform consecutive inputs into a pair of independent uniformly random hashes, it is desirable that small changes in the input result in completely changed output.

      (*) There are exceptions, such as when devising algorithms for locality sensitive hashing, but they are few.

  8. Re:Doesn't want a hash by VortexCortex · · Score: 2

    You don't want a hash function for this, where the hash is effectively random. You need a function that derives a unique value for each input, but retains the relative distance of the original value. i.e. two values that are very similar yield an index that is similarly close.

    Certain hash tables for search functions are built around hashes exhibiting the very type of behavior you describe -- Not to mention current 'reverse' image search technologies. A "hash" function is not required to have a seemingly random output. Cryptographic hashes try to produce high entropy deterministic output, but other types of hashing can and do have different goals, namely with far less entropic outputs.

    I would ask you to turn in your geek card, but the standard for issuance is far lower nowadays...

  9. Hippo's tasty, but I couldn't eat a whole one by Hognoxious · · Score: 2

    shooting things, having the servants stuff them [...] there just weren't that many species on the table.

    Ah. When you said "stuff them" I thought you were referring to taxidermy, and not the herbs and breadcrumbs kind.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."