Slashdot Mirror


Dumb Things With Bioinformatics

PrvtBurrito writes: "About 3% of the human genome is "coded" as genes. The proteins those genes encode can be represented as long sequences of amino acids, a twenty-letter alphabet. In an attempt to perhaps prove that nothing is sacred, someone has cataloged all of the English words found in known annotated protein sequences from many organisms. It looks like, after cataloging over 37,000,000 characters, the longest word is "chapstick" and the most common word is "kilter"."
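For anyone curious how such a catalog might be built, here is a minimal sketch in Python. It is not the method the original cataloger used; the function name `find_words`, the toy sequence, and the tiny word list are all made up for illustration. It simply scans a protein sequence (one-letter codes) for every occurrence of each dictionary word.

```python
def find_words(sequence, words, min_len=3):
    """Return sorted (position, word) pairs for dictionary words found in sequence.

    sequence: protein sequence in one-letter amino acid codes (any case).
    words:    iterable of lowercase candidate words (hypothetical word list).
    min_len:  ignore words shorter than this, since short words match by chance.
    """
    seq = sequence.lower()
    hits = []
    for word in words:
        if len(word) < min_len:
            continue
        # str.find returns -1 when there are no further occurrences.
        start = seq.find(word)
        while start != -1:
            hits.append((start, word))
            start = seq.find(word, start + 1)
    return sorted(hits)


# Toy data, invented for this example (not a real protein).
toy_sequence = "MKILTERAAGCHAPSTICKWW"
toy_words = {"kilter", "chapstick", "tic", "boat"}

for position, word in find_words(toy_sequence, toy_words):
    print(position, word)
```

A naive scan like this is fine for a handful of sequences; for tens of millions of characters and a full dictionary, a multi-pattern matcher would probably be a better fit.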

2 of 30 comments

  1. Amino Acids by oregon · · Score: 4, Informative

    The 20 letters are

    a Alanine
    r Arginine
    n Asparagine
    d Aspartic acid
    c Cysteine
    q Glutamine
    e Glutamic acid
    g Glycine
    h Histidine
    i Isoleucine
    l Leucine
    k Lysine
    m Methionine
    f Phenylalanine
    p Proline
    s Serine
    t Threonine
    w Tryptophan
    y Tyrosine
    v Valine
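A quick way to see which English words can show up at all: a word is only spellable if every letter is one of the 20 codes listed above. The sketch below is just an illustration (the names `AMINO_ACID_LETTERS` and `spellable` are made up here), and it shows that six letters of the alphabet are never available.

```python
import string

# The 20 one-letter amino acid codes from the list above.
AMINO_ACID_LETTERS = set("arndcqeghilkmfpstwyv")


def spellable(word):
    """True if every letter of word is one of the 20 amino acid codes."""
    return word != "" and set(word.lower()) <= AMINO_ACID_LETTERS


# Letters that have no standard amino acid code.
missing = sorted(set(string.ascii_lowercase) - AMINO_ACID_LETTERS)
print("unavailable letters:", "".join(missing))

for candidate in ["chapstick", "kilter", "genome"]:
    print(candidate, spellable(candidate))
```

So any word containing b, j, o, u, x or z (like "genome") cannot appear in a sequence written with only these 20 letters.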

    --
    Oregon
  2. Do it yourself by meiocyte · · Score: 4, Informative

    Here's a link to check whatever protein sequence you want against the human genome. Make sure to select "blastp" (for protein sequences) in the pulldown menu. Use the alphabet provided above. It will find near matches too. Enjoy.

    --
    The thing in the box has no place in the language-game at all; not even as a something; for the box might even be empty.