Tracing the Limits of Computation
An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.
Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.
This article reminded me of learning about string-matching algorithms as a student. When you first naively implement a string-matching algorithm as a student yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a simple search with each new character in the source text, backtracking if the match doesn't occur, and checking the next potential match.
When you try to get a bit more clever, you realize there's probably a way to ensure that you can get examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.
However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite part of the Computer Science curriculum when I was in school.
It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.
Irony: Agile development has too much intertia to be abandoned now.
First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.
Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.
SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves instances of 3-SAT https://en.wikipedia.org/wiki/3-SAT will have worst case running time that is exponential. However, it is conceivable that one could have a whole series of algorithms which each solves 3-SAT better and better. The idea is that there's an algorithm a_i which solves 3-SAT in time O(x_i^n) and as i goes to infinity, x_i goes to 1. SETH is the hypothesis that says essentially that that doesn't happen. Many people are not convinced that ETH is true although it certainly looks plausible. I think most people who have thought about this consider the possibility that ETH holds and SETH doesn't hold to be a deeply weird and generally unlikely option.
Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols. ;-) ]
:-)
Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all"
String T has length M.
Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations with 1, not with 0, in which case you'll lose the information about T's first symbol... ]
Do the same thing for P, yielding a number sigP.
if sigT mod sigP yields zero, T contains P.
Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225
P = "BR". B->5, R->67
so sigP = 5^1 * 67^2 = 22445
644256903225 mod 22445 = 0, so T contains P.
Computational complexity: depends on complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Schönhage-Strassen with complexity O(log n log( log n ) ).
Just slurps enormous amounts of memory
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace