Slashdot Mirror


Tracing the Limits of Computation

An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.

Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.

7 of 82 comments (clear)

  1. The algorithm isn't clever, but scales well. by Ihlosi · · Score: 3, Insightful
    The algorithm may not be very clever, but it scales almost perfectly with the number of cores (as long as you don't have more cores than characters in the string).

    Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.

    1. Re:The algorithm isn't clever, but scales well. by Required+Snark · · Score: 3, Informative
      The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.

      Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

      Deletion of a single symbol changes uxv to uv (x->e).

      Substitution of a single symbol x for a symbol y x changes uxv to uyv (x->y).

      It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

      It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

      --
      Why is Snark Required?
    2. Re:The algorithm isn't clever, but scales well. by Sique · · Score: 2

      Especially for DNA comparisation, this shortcut won't work. Moreso, it is antithetic to the task at hand. We want to know the edit distance, or in DNA speak, we want to know how many mutation we need to get from DNA1 to DNA2, for instance to determine relation. The shortcut that we just look if all letters are present and begin and end are ok, and don't care about their sequence in the middle of the word doesn't give us the edit distance, as it hides exactly the answer to the question we asked to begin with. It's ok for determining which words are there, and it speeds up reading, but only if all sequences of letters are actually words we know perfectly well. It would slow down the reading to a crawl if our task was for instance to read a text in a foreign language and translate it with the help of a dictionary. If we then stumble upon scrambled words, it will take ages until we figure out the right sequence of letters to look the word up in the dictionary.

      --
      .sig: Sique *sigh*
  2. This reminds me... by Dutch+Gun · · Score: 5, Interesting

    This article reminded me of learning about string-matching algorithms as a student. When you first naively implement a string-matching algorithm as a student yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a simple search with each new character in the source text, backtracking if the match doesn't occur, and checking the next potential match.

    When you try to get a bit more clever, you realize there's probably a way to ensure that you can get examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.

    However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite part of the Computer Science curriculum when I was in school.

    It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.

    --
    Irony: Agile development has too much intertia to be abandoned now.
  3. Inaccurate Summary by Anonymous Coward · · Score: 3, Interesting

    First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.

    Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.

  4. SETH is a pretty big assumption by JoshuaZ · · Score: 2

    SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves instances of 3-SAT https://en.wikipedia.org/wiki/3-SAT will have worst case running time that is exponential. However, it is conceivable that one could have a whole series of algorithms which each solves 3-SAT better and better. The idea is that there's an algorithm a_i which solves 3-SAT in time O(x_i^n) and as i goes to infinity, x_i goes to 1. SETH is the hypothesis that says essentially that that doesn't happen. Many people are not convinced that ETH is true although it certainly looks plausible. I think most people who have thought about this consider the possibility that ETH holds and SETH doesn't hold to be a deeply weird and generally unlikely option.

  5. Here is an idea to play with: Gödel notation by vikingpower · · Score: 2

    Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols.
    Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all" ;-)  ]
    String T has length M.
    Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations with 1, not with 0, in which case you'll lose the information about T's first symbol... ]

    Do the same thing for P, yielding a number sigP.

    if sigT mod sigP yields zero, T contains P.

    Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
    so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225

    P = "BR". B->5, R->67
    so sigP = 5^1 * 67^2 = 22445

    644256903225 mod 22445 = 0, so T contains P.

    Computational complexity: depends on complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Sch&#246;nhage-Strassen with complexity O(log n log( log n ) ).

    Just slurps enormous amounts of memory :-)

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace