Tracing the Limits of Computation
An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.
Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.
This article reminded me of learning about string-matching algorithms as a student. When you first naively implement a string-matching algorithm as a student yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a simple search with each new character in the source text, backtracking if the match doesn't occur, and checking the next potential match.
When you try to get a bit more clever, you realize there's probably a way to ensure that you can get examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.
However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite part of the Computer Science curriculum when I was in school.
It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.
Irony: Agile development has too much intertia to be abandoned now.
First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.
Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.
If you can use a Trie, you'd optimize your comparison space.
Another thing is how we compare Strings is the same problem of compressing Substrings, that is often implemented in Lempel/Ziv and other String compression algorithms.
The paper gives a "Proof By Contradiction" which would violate "The Strong Exponential time Hypothesis". However this Hypothesis itself is still unproven. Not that I hold any authority as to the validity of it all, though :)
The result is truly about limits of computations and not about practical limits. For example, I will happily accept an algorithm that runs in c*2^(a*N) time when a is 1e-100 and c is not big as for practical purposes it runs in constant time. It is fascinating topic on its own why we do not have such algorithms for practical tasks.
Perhaps the only way to avoid the mathematically imposed restrictions on conventional computing machinery would be to (literally) use a different mechanism. One that is inspired by the nature of the problem.
For example, consider the problem of ordering a collection of random-length pieces of uncooked spaghetti. They can be ordered in O(1) time by merely grasping them in your hand and striking them against a table top. Ordering 500 strands takes no more time than 50 (ignoring preparation time). Scaling up is merely a matter of finding a bigger hand or table top.
For comparing strands of DNA consider using a 'chemical computer' based on the transcription of DNA in to messenger RNA, which is the 'algorithm' employed by Nature to encode the DNA sequences. Such a machine, in solution, equipped with customized sequences of interest, could theoretically recognize any desired DNA sequence in O(1) time (up to chemical reaction times).
So a simple solution to a complex problem. I'll leave it to you students to work out the details. ;-)
SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves instances of 3-SAT https://en.wikipedia.org/wiki/3-SAT will have worst case running time that is exponential. However, it is conceivable that one could have a whole series of algorithms which each solves 3-SAT better and better. The idea is that there's an algorithm a_i which solves 3-SAT in time O(x_i^n) and as i goes to infinity, x_i goes to 1. SETH is the hypothesis that says essentially that that doesn't happen. Many people are not convinced that ETH is true although it certainly looks plausible. I think most people who have thought about this consider the possibility that ETH holds and SETH doesn't hold to be a deeply weird and generally unlikely option.
Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols. ;-) ]
:-)
Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all"
String T has length M.
Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations with 1, not with 0, in which case you'll lose the information about T's first symbol... ]
Do the same thing for P, yielding a number sigP.
if sigT mod sigP yields zero, T contains P.
Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225
P = "BR". B->5, R->67
so sigP = 5^1 * 67^2 = 22445
644256903225 mod 22445 = 0, so T contains P.
Computational complexity: depends on complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Schönhage-Strassen with complexity O(log n log( log n ) ).
Just slurps enormous amounts of memory
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
Things like this are always found by people who did not know it was impossible.
If we define the genome being searched as the pool, and the string being searched for as the key; The paper could be completely correct when there is no reuse of the pool or keys. If you search the same pool for multiple different keys I can think of ways to pre-process the pool such that the first search + pre-process time still obeys the performance constraints their paper outlines, but subsequent searches of the same pool for a different key could happen in O log n time, or even constant time.
Ian Ameline
The edit distance between book and back is 2 ... In latin characters. It's 1 with kanji. The edit distance is probably several hundred if you're comparing the two on a per pixel basis instead of per letter. Edit distance varies by scope.
Surely I'm misunderstanding something. It can't be that simple.
Maybe I'm missing something.
ABRE -> A^1 * B^2 * R^3 * E^4
ABER -> A^1 * B^2 * E^3 * R^4 = A^1 * B * B * E^3 * R^2 * R^2 = A^1 * B * E^3 * R^2 * (B^1 * R^2)
So your calculation says that ABER contains BR
And assuming I've done the calculation correctly using your numbers for A B R E, I get ABER = 3320400962775, BR=22445.
3320400962775/22445 = 147934995 = BRE*A
God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
You are absolutely right. I just coded the algorithm in a programming language and noticed the same thing. It needs some extra encoding for symbol successor / predecessor relationships. I'll think about that while walking the dog post an update here ASAP.
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
So RB is also in ABRE?
Most comments here seem to comment on string comparison, which edit distance isn't.
Not even substring search, or full text search .. each problem requires a different approach in the root. For large text searches you can use PATRICIA tree, for comparison of two strings you cannot do better than O(n) since you have to at least examine each character.
Precomputation is the correct answer to large problems with constants on one side.
Edit distance has its own path of optimization depending on use case.
Usually the 'library' algorithms are optimized like hell, but then sub-optimally scaled to larger problems that can be approached completely differently.
Your numbers will contain much more digits than the original strings, so time to process will go UP. Even computing the product for the P string takes a huge amount of time.
Look at pat tree for a much better solution.
Yes, a nice reminder of string processing problems. The problem they worked on isn't exactly string matching. They are trying to find the minimum "edit distance" between 2 strings. There are a lot of very similar seeming problems in this area.
If all one is trying to do is test whether 2 strings match exactly, first check whether the lengths are equal. If they are, then it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning. It's not an algorithmic savings, still have to compare each character until a mismatch is found or all characters have been checked. All that idea does is try what might be a more likely location for a mismatch sooner, but that depends entirely upon the data.
Trying to find where a shorter string might occur in a longer is a different problem. As you noted, Boyer-Moore is good. Then, the problem of finding matches for n short strings for n>1 within a long string is solvable in yet other ways, faster than simply applying Boyer-Moore n times.
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Agreed on the Patricia trie. Tried that out, coded one myself, the things are indeed lightning fast. Using primes is just a nicety, really...
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
Often one finds there are domain-specific shortcuts. For example, if we know that "useful" DNA sequences tends to X letters long, then an algorithm that is optimized for matching sequences of X length perhaps can be devised rather than try to match everything to everything.
Sometimes it's better for the org to search say 10,000 objects with an imperfect search tool rather than 500 with a perfect one. The most promising ones can then later be gone over with the fuller algorithm.
I'm not a biology expert, but I find there are shortcuts in other domains once you get to know the domain well enough. Focusing too much on being "generic" up front sometimes blinds one to the possibilities.
Table-ized A.I.
should be "...tend to be X letters long".
Table-ized A.I.
To someone who hasn't taken Calculus yet it may seem like a brute force problem trying to find inflection points and the zero points of a curve. Even the area of an oddly shaped object would be an enormously repetitive process of slicing and adding.
Yet once you know the tricks, it becomes trivial.
It may be impossible but maybe not.
Cwm, fjord-bank glyphs vext quiz
The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.
It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.
It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.
That does make the simple hardware a little trickier, although for specific purposes there are some tricks you may be able to design for practically that become more efficient as the size of the insert grows. (An entire gene, for example, would be easier per gene to change than a single base pair).
Basically you can make a list of the changes. If you know you inserted a big object at place X, you can dynamically generate any requested index without having to rewrite the whole really long string, for example.
Similarly, you can design your machine to encode the object in a way which makes insertion and deletion easier. For example, allocate an array of blocks of memory to store the object and insert or delete blocks in the middle as it changes. We could call this a "linked list." Or an array with list-like elements.
If you want, you can set it up so that re-writing to an array occurs in spare cycles to eliminate the dynamic overhead created by the linked list or change history.
you can serialize a computer to hash (map-reduce) then the second computer (or thread) to do what you have to do. as long the alphabet is a closed set and smaller than current int(32/64), it's a good strategy for full match.
SETH is true. I know two of them.
So, they are looking to compare two arbitrary strings, right? Since strings are stored as ones and zeroes, I like to think about them as little diodes. Light is on = 1, light is off = 0. This helps me visualize the problem. This article says that the algorithm that is most efficient is a bit-by-bit comparison. In other words, if the first string is a bank of lights on the left, and the second string is a bank of lights on the right, this article says that the simplest process is to do two read operations. That is to say, the computer's "eye" looks left then looks right and determines whether it got the same signal twice (two on's or two off's). This makes sense given the way electricity works. The left lightswitch functions independently of the right lightswitch. You get four cases (left state, right state)
1) On On
2) On Off
3) Off On
4) Off Off
However, I can envision a photonic system with only a single read per bit pair. Again, this comes from imagining bits as lights. One of the really cool things about light is that, as a wave, two identical photons can cancel each other out through destructive interference. If the left bank and the right bank are aligned such that a left light destructively interferes with the matching right light, the computer's "eye" only needs to perform one read function, the combined output of both lights.
1) Off (either because both lights are off, or because both lights being on causes destructive interference)
2) On (because only one light is one, and it experiences no destructive interference)