Slashdot Mirror


Tracing the Limits of Computation

An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.

Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.

51 of 82 comments (clear)

  1. The algorithm isn't clever, but scales well. by Ihlosi · · Score: 3, Insightful
    The algorithm may not be very clever, but it scales almost perfectly with the number of cores (as long as you don't have more cores than characters in the string).

    Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.

    1. Re:The algorithm isn't clever, but scales well. by Required+Snark · · Score: 3, Informative
      The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.

      Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

      Deletion of a single symbol changes uxv to uv (x->e).

      Substitution of a single symbol x for a symbol y x changes uxv to uyv (x->y).

      It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

      It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

      --
      Why is Snark Required?
    2. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 1

      Read and understand the problem and your answer will become trivial. The problem you present in your post has nothing to do with the string comparison algorithm in general (or the sequence alignment problem in particular).

    3. Re:The algorithm isn't clever, but scales well. by Sique · · Score: 2

      Especially for DNA comparisation, this shortcut won't work. Moreso, it is antithetic to the task at hand. We want to know the edit distance, or in DNA speak, we want to know how many mutation we need to get from DNA1 to DNA2, for instance to determine relation. The shortcut that we just look if all letters are present and begin and end are ok, and don't care about their sequence in the middle of the word doesn't give us the edit distance, as it hides exactly the answer to the question we asked to begin with. It's ok for determining which words are there, and it speeds up reading, but only if all sequences of letters are actually words we know perfectly well. It would slow down the reading to a crawl if our task was for instance to read a text in a foreign language and translate it with the help of a dictionary. If we then stumble upon scrambled words, it will take ages until we figure out the right sequence of letters to look the word up in the dictionary.

      --
      .sig: Sique *sigh*
    4. Re:The algorithm isn't clever, but scales well. by durrr · · Score: 1

      DNA mutations are more common in certain places(don't ask why, probably related to the geometry of packaged DNA and some other factors). An algo could spot check these places first to give preliminary answers but it would still take until it's finished to get a true answer.

    5. Re:The algorithm isn't clever, but scales well. by complete+loony · · Score: 1

      So don't compare only 2 DNA strings at once, or you can compromise and get good imperfect results. Every time you discover a matching sequence, you can leverage the information you have gained from earlier comparisons. DASH: Searching Compressed DNA (disclosure; A friend's PHD thesis).

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    6. Re:The algorithm isn't clever, but scales well. by owl57 · · Score: 1

      Definition from TFA: For any two sequences x, y, EDIT(x, y) is equal to the minimum, over all sequences z, of the number of deletions and substitutions needed to transform x into z and y into z. This is equivalent to the one you cited: insertion into x is the same thing as deletion from y.

    7. Re: The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 1

      Thanks for those hot assurances without having actually run the experiments yourself.

    8. Re:The algorithm isn't clever, but scales well. by mcswell · · Score: 1

      This is one of many diffs (ahem) between DNA sequencing and natural language processing. Another is the alphabet size: DNA has an alphabet size of 4, while the alphabets (number of phonemes or graphemes) of natural languages range from a low of 11 (Rotokas and Mura) to a high of perhaps 140-something (!Xu, although the exact number depends on how you count things). Of course written Chinese and Japanese have much higher numbers of graphemes.

      It's also the case that some writing systems don't mark word boundaries (Chinese, Thai for example), in which case the begin/end shortcut won't work at all. Which makes machine processing of such languages quite a bit harder. And of course word boundaries aren't usually indicated in fluent spoken language, except at phrasal pauses or the beginning/ end of utterances.

      And finally, languages have a finite number of wordforms, although that number can be very high in languages that have lots of inflectional morphology.

      That said, variants of the algorithms used for DNA sequencing are also used in computational linguistics.

    9. Re:The algorithm isn't clever, but scales well. by JMJimmy · · Score: 1

      When I was thinking of how it could apply the spacing is the "on/off switch" between genes, the individual genes are the "words". It wasn't overly well thought out, just thinking of how one could use known patterns to shorten processing time.

  2. This reminds me... by Dutch+Gun · · Score: 5, Interesting

    This article reminded me of learning about string-matching algorithms as a student. When you first naively implement a string-matching algorithm as a student yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a simple search with each new character in the source text, backtracking if the match doesn't occur, and checking the next potential match.

    When you try to get a bit more clever, you realize there's probably a way to ensure that you can get examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.

    However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite part of the Computer Science curriculum when I was in school.

    It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.

    --
    Irony: Agile development has too much intertia to be abandoned now.
    1. Re:This reminds me... by Anonymous Coward · · Score: 1

      However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm

      Then you would think I'm a fucking "genius". When I was 10 years old I implemented Jump Lists for my string search algorithm. Not because my 33Mhz system was too slow for the naive implementation, but because I was tired of waiting for QBASIC to stop thrashing memory. I was implementing my own native x86 compiler using the only interface I had available (QBASIC). A C compiler cost 1/3rd of what a PC did back then. So, without prior training in CS I imagined how one might transform a text program into machine code and implemented it with the tools I had. All because I was tired of creating .exe files via hex editor, and BASIC interpretors were too slow (at the time) for graphically intensive games.

      Why string search? Isn't a tree better for tokenizing code? Why not then recursive descent? Well, because of stack overflow. My compiler was iterative and worked by doing string searches on a normalized token stream ("code string"). This "code string" was a linked list of "symbols" represented by start and end offsets into the source text. The parser performed a string search across these "tokens" and replaced runs of the symbols with list items containing a "native string" (machine code) corresponding to the "code string". Adjacent blocks of "native" symbol "strings" were joined. The process repeated until the entire "code string" had been converted into a single block of native code, and that was my EXE file.

      To add compiler optimisations I just added more pattern matching to this mixed list of "code string" symbols and "native string". Optimization was then performed as a normal function of compilation. The iterations then continued until no more pattern matches could be made AND there were no more "code string" symbols (only a "native string"). If no matches could be found and some portion of code string remained, that was your syntax or compiler error -- Which you could sometimes correct and resume compilation (to keep compile time from being wasted -- a feature still not present in most any "modern" compiler). A special case was when "external" symbols were needed. Then the mixed "code" and "native" string would be saved to a file, and the next source file processed. When all files were either native code or mixed with external symbols, then the program was complete and required one more "assembly" pass to generate the EXE. I didn't know it but I had created a linker and intermediary "object code" -- And to this day it doesn't matter that I didn't use these terms.

      Rather than piss away valuable computer time generating tokens (made more valuable by slow ass QBASIC), I merely removed formatting white-space and comments from the program, and the result was my normalized "code string" to begin compilation of via "string" search (and replace).

      It would be almost a decade before I would be allowed to take a course on compiler theory in school. This also marked the point at which I dropped out of Academia altogether. I was paying to be taught things I already knew, and that's fucking retarding. Their graduate-level algorithms were not even as optimal for parallelism as my unlearned ignorant childhood compiler implementations. They taught next to nothing of recursive pattern matching (this being the true foundation of compilation in all systems which do so, be they digital or DNA). The time spent praising the achievements of dead men which a child can intuit in an afternoon made me want to puke. I felt the whole system was designed to demoralize the thinker and make them feel small and useless and obedient to those who dolled out knowledge, rather than to place the thinker on the same footing as those great minds that came before.

      Point being: Just because YOU were told that you are not smart enough to simply ponder at a problem and solve it (even optimally) without formal education, doesn't mean

    2. Re:This reminds me... by Zeroko · · Score: 1

      So I was not the only one to do that repeated-string-mapping method of compilation (although I did mine in high school (using C for bootstrapping, rather than QBASIC, IIRC)...earlier I was too busy playing with Visual Basic, which had its own compiler)...interesting.

      I never tried to make my compiler optimizing, despite having a plan in my head to do so, because the binary for the compiler itself already fit in 64KB that a .COM file required, & I never did anything interesting enough with it for time to be an issue...& then other projects called.

  3. Inaccurate Summary by Anonymous Coward · · Score: 3, Interesting

    First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.

    Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.

  4. not a practical limit by fpoling · · Score: 1

    The result is truly about limits of computations and not about practical limits. For example, I will happily accept an algorithm that runs in c*2^(a*N) time when a is 1e-100 and c is not big as for practical purposes it runs in constant time. It is fascinating topic on its own why we do not have such algorithms for practical tasks.

  5. Spaghetti sort by jdagius · · Score: 1

    Perhaps the only way to avoid the mathematically imposed restrictions on conventional computing machinery would be to (literally) use a different mechanism. One that is inspired by the nature of the problem.

    For example, consider the problem of ordering a collection of random-length pieces of uncooked spaghetti. They can be ordered in O(1) time by merely grasping them in your hand and striking them against a table top. Ordering 500 strands takes no more time than 50 (ignoring preparation time). Scaling up is merely a matter of finding a bigger hand or table top.

    For comparing strands of DNA consider using a 'chemical computer' based on the transcription of DNA in to messenger RNA, which is the 'algorithm' employed by Nature to encode the DNA sequences. Such a machine, in solution, equipped with customized sequences of interest, could theoretically recognize any desired DNA sequence in O(1) time (up to chemical reaction times).

    So a simple solution to a complex problem. I'll leave it to you students to work out the details. ;-)

    1. Re:Spaghetti sort by jdagius · · Score: 1

      I'm also pretty sure that designing and building any kind of computer can't be done in O(1) time. I'm calling that "preparation time" (which you were instructed to ignore).

    2. Re:Spaghetti sort by serviscope_minor · · Score: 1

      The mechanism isn't different and your spaghetti based computer is still stuck with the same limitations as anything else.

      What you're describing is a special purpose instruction for ordering a bunch of random length bits of spaghetti in parallel, and then throwing more and more hardware at the problem. In fact, the amount of hardware scales as O(N) with the amount of spaghetti.

      This isn't actually a new mechanism at all: complexity classes also deal with parallelising decision problems. For example NC complexity ones are "easy" to parallelise in the sense that polynomial complexity is "easy".

      There are important algorithms in P which aren't in NC, such as linear programming. That means there's no "easy" way to parallelise it for the definitions of NC. That means for example that if you throw a polynomial amount of hardware at a linear programming problem (i.e. n^k CPUs for size n input and an arbitrary k) you will still not be able to solve the problem in log(n) time.

      Based on your description, spaghetti-sorting is in NC.

      Likewise with the DNA chemical hardware. All that's got access to is some methods for very efficiently doing certain things in sequence and other methods for very efficiently doing things massively in parallel.

      While those methods can no doubt solve certain problems much more quickly than conventional serial or even conventional parallel computers, there are certain problems that they simply can't help with.

      --
      SJW n. One who posts facts.
    3. Re:Spaghetti sort by vikingpower · · Score: 1

      Your method doesn't "sort" the spaghetti. It aligns their ends against the same reference ( the table top ), which has nothing to do with sorting. They will still be, in your hand, in haphazard "order", i.e. there will be no order at all, as the shortest may very well end up immediately next to the longest.

      --
      Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    4. Re:Spaghetti sort by jdagius · · Score: 1

      > While those methods can no doubt solve certain problems much more quickly than
      > conventional serial or even conventional parallel computers, there are certain problems that
      > they simply can't help with.

      The price paid for surpassing "optimal" general solutions of a problem is that they won't work on all instances of the problem. Quite often that is an acceptable price. It's a trade-off.

    5. Re:Spaghetti sort by jdagius · · Score: 1

      > It aligns their ends against the same reference ( the table top ),
      > which has nothing to do with sorting.

      But in the process of alignment, a linear ordering is imposed in one dimension, along the axis of the spaghetti rods, such that if you traverse this axis towards the table top, you are guaranteed access to the members of an ordered collection, tallest member first. Quite often that is an acceptable solution, especially if you are only interested in the tallest strands. Perhaps not a general solution to the sorting problem.

      Mathematicians insist on perfect, general solutions. Scientists and engineers are often quite satisfied with "trade-off" solutions that work for the problem at hand.

    6. Re:Spaghetti sort by vikingpower · · Score: 1

      I AM an engineer, thank you. A linear ordering along one dimension is called a "partial ordering", which is often quite useless, even for an engineer :-/

      --
      Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    7. Re:Spaghetti sort by tepples · · Score: 1

      Alignment is O(n) in the length of rods. Querying by eye is O(n^.5) in the number of rods, and querying by touch is O(n) in the length of rods.

    8. Re:Spaghetti sort by PlusFiveTroll · · Score: 1

      To sort spaghetti by size, doing it by hand sounds very inefficient.

      A sorting method where it slides down a plane with slats that increase in size until the entire length can fall into a slot, much like a method for sorting coins, sounds a way to solve both the length and order problem. Of course that will never scale 'faster' than the number of sorting elements you use.

    9. Re:Spaghetti sort by jdagius · · Score: 1

      No. It is a _total_ ordering over the relation "=" in the sense that all pair of elements in the linear are comparable, i.e. elements a and b must satisfy "a=b" or "b=a", which is the requirement for "totality".

      In a _partial_ ordering (such as a collection of tree-structured elements) not all elements are necessarily comparable using "=", e.g. elements in different paths.

      In plainer English: the strands of a spaghetti-sorted collection are totally ordered by length when traversing along the strand axis towards the table. I.e. each element traversed is greater or equal in length to _all_ ('totality') of the remaining elements yet to be visited.

    10. Re:Spaghetti sort by jdagius · · Score: 1

      ...oops forgot to escape my LT brackets. Substitute "<=" wherever you see "="

    11. Re:Spaghetti sort by blue+trane · · Score: 1

      "In fact, the amount of hardware scales as O(N) with the amount of spaghetti."

      No, because if you double each side in a cube, you get an eight-fold increase in volume. 2 x the material to extend the cube gives you 8 x the spaghetti you can fit in it.

    12. Re:Spaghetti sort by serviscope_minor · · Score: 1

      No, because if you double each side in a cube, you get an eight-fold increase in volume.

      The area of the table you need is roughly equal to the area of the stacked bunch of spaghetti. Since each strand is an approximately fixed diameter, they're about the same area each.

      If you double the number of strands, then you need a flat surface of twice the area. So, the hardware requirements are O(N).

      If you want to include the *length* of spaghetti instead of just the number of strands then it gets more complex. However, eventually there's a significant mass to support which means you need a significant support structure. Once the mass increases far enough things get very tricky indeed and you need a vast amount of stuff to support it.

      --
      SJW n. One who posts facts.
    13. Re:Spaghetti sort by blue+trane · · Score: 1

      Why wouldn't you need length? You're sorting on length. You need something to hold the spaghetti strands in an upright position, so you need volume. Volume increases faster than the material needed to bound it does.

      http://www.tiem.utk.edu/~gross...

      Surface area = 6 * length^2

      Volume = length^3

      The more hardware you add to make the cube bigger, the much more spaghetti you can sort. The hardware increases less than O(n).

      Even if other hardware components increase at O(n) (not sure of that even), because your container hardware increases at less than O(n) your total hardware increases less than O(n). You get an advantage because nature increases volume faster than its bounding lengths.

      Related: Ultimate Free Lunch

    14. Re:Spaghetti sort by serviscope_minor · · Score: 1

      In your O(n) which n are you talking about?

      The GP seemed to be referring to the number of strands. The minimum increase is O(n) with the number of strands.

      If you're talking about volume, it gets rather trickier. A 5 mile high stack of spaghetti won't stand up under it's own weight. The trouble is the mass increases with the cube too.

      --
      SJW n. One who posts facts.
    15. Re:Spaghetti sort by mcswell · · Score: 1

      Wait: finding the single longest strand is O(1). Sorting them is O(n), because you have to keep picking out the longest strand.

  6. SETH is a pretty big assumption by JoshuaZ · · Score: 2

    SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves instances of 3-SAT https://en.wikipedia.org/wiki/3-SAT will have worst case running time that is exponential. However, it is conceivable that one could have a whole series of algorithms which each solves 3-SAT better and better. The idea is that there's an algorithm a_i which solves 3-SAT in time O(x_i^n) and as i goes to infinity, x_i goes to 1. SETH is the hypothesis that says essentially that that doesn't happen. Many people are not convinced that ETH is true although it certainly looks plausible. I think most people who have thought about this consider the possibility that ETH holds and SETH doesn't hold to be a deeply weird and generally unlikely option.

    1. Re:SETH is a pretty big assumption by mcswell · · Score: 1

      Prove it and you'll be famous.

      (Or you can write a note in the margin of a book on algorithmic complexity saying that the margin is too small to contain your proof.)

  7. Here is an idea to play with: Gödel notation by vikingpower · · Score: 2

    Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols.
    Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all" ;-)  ]
    String T has length M.
    Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations with 1, not with 0, in which case you'll lose the information about T's first symbol... ]

    Do the same thing for P, yielding a number sigP.

    if sigT mod sigP yields zero, T contains P.

    Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
    so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225

    P = "BR". B->5, R->67
    so sigP = 5^1 * 67^2 = 22445

    644256903225 mod 22445 = 0, so T contains P.

    Computational complexity: depends on complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Sch&#246;nhage-Strassen with complexity O(log n log( log n ) ).

    Just slurps enormous amounts of memory :-)

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  8. How many times are you searching? by ameline · · Score: 1

    If we define the genome being searched as the pool, and the string being searched for as the key; The paper could be completely correct when there is no reuse of the pool or keys. If you search the same pool for multiple different keys I can think of ways to pre-process the pool such that the first search + pre-process time still obeys the performance constraints their paper outlines, but subsequent searches of the same pool for a different key could happen in O log n time, or even constant time.

    --
    Ian Ameline
  9. I wonder... by dmgxmichael · · Score: 1
    Fascinating article. This stuck out:

    If Williams or anyone else can prove the existence of an edit-distance algorithm that runs even moderately faster than normal, SETH is history.

    The edit distance between book and back is 2 ... In latin characters. It's 1 with kanji. The edit distance is probably several hundred if you're comparing the two on a per pixel basis instead of per letter. Edit distance varies by scope.

    Surely I'm misunderstanding something. It can't be that simple.

    1. Re:I wonder... by locofungus · · Score: 1

      For a given algorithm, what is it's worst case running time across all possible inputs for that algorithm (of which there must be an infinite number)

      The answer the algorithm gives doesn't matter. It's how long you might have to wait for that answer that is important.

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
    2. Re:I wonder... by locofungus · · Score: 1

      (of which there must be an infinite number)

      I didn't put this very well. What I mean is that there must be larger and larger cases.

      For the question "what is the edit distance between two n character strings taken from alphabet L" n can grow without bound. If there is a limit to n then, mathematically at least, the problem can be solved in O(1) via lookup.

      We often need the general case algorithm even when our inputs are finite. For example, there may be a fundamental limit to how many "characters" there are in DNA (I don't know if there is or isn't) but even assuming there is, the real size of DNA precludes the practicality of building lookup tables to calculate edit distance in O(1)

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
  10. Re:Here is an idea to play with: Gödel notati by locofungus · · Score: 1

    Maybe I'm missing something.

    ABRE -> A^1 * B^2 * R^3 * E^4
    ABER -> A^1 * B^2 * E^3 * R^4 = A^1 * B * B * E^3 * R^2 * R^2 = A^1 * B * E^3 * R^2 * (B^1 * R^2)

    So your calculation says that ABER contains BR

    And assuming I've done the calculation correctly using your numbers for A B R E, I get ABER = 3320400962775, BR=22445.

    3320400962775/22445 = 147934995 = BRE*A

    --
    God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
  11. Re:Here is an idea to play with: Gödel notati by vikingpower · · Score: 1

    You are absolutely right. I just coded the algorithm in a programming language and noticed the same thing. It needs some extra encoding for symbol successor / predecessor relationships. I'll think about that while walking the dog post an update here ASAP.

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  12. edit distance, not just matching by bzipitidoo · · Score: 1

    Yes, a nice reminder of string processing problems. The problem they worked on isn't exactly string matching. They are trying to find the minimum "edit distance" between 2 strings. There are a lot of very similar seeming problems in this area.

    If all one is trying to do is test whether 2 strings match exactly, first check whether the lengths are equal. If they are, then it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning. It's not an algorithmic savings, still have to compare each character until a mismatch is found or all characters have been checked. All that idea does is try what might be a more likely location for a mismatch sooner, but that depends entirely upon the data.

    Trying to find where a shorter string might occur in a longer is a different problem. As you noted, Boyer-Moore is good. Then, the problem of finding matches for n short strings for n>1 within a long string is solvable in yet other ways, faster than simply applying Boyer-Moore n times.

    --
    Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    1. Re:edit distance, not just matching by Dutch+Gun · · Score: 1

      Yeah, I realize they're not the same problem... I was just reminiscing a bit about a slightly different problem that I spent some time thinking about as a student.

      Another string-related problem I had fun figuring out (essentially re-inventing a variation of the Levenshtein distance formula) was a closest-match algorithm, say, for common search strings that are slightly misspelled, or as a spell-checker finding a closest potential match. A decent solution to this problem isn't terribly hard, but does require a bit of ingenuity. I'd imagine the folks at Google have a pretty impressive algorithm for doing this.

      For whatever reason, these sorts of problems feel like CS-oriented puzzles to me, and I enjoy working on them. Often, if I see an algorithm which intrigues me, I'll expend some brain power theorizing how they work before I research how other people have approached or solved the problem so far. Reverse image searching and closest image matching were one example of this. I puzzled for quite a while about the best way to boil an image down to simple database-searchable numerical values that can be quickly queried and checked for ranges.

      --
      Irony: Agile development has too much intertia to be abandoned now.
    2. Re:edit distance, not just matching by mcswell · · Score: 1

      "first check whether the lengths are equal": I suppose if the string length is marked at the beginning of the string, that makes sense. But if not (e.g. the end of the string is marked by a null byte), doesn't that just slow things down? Because you have to traverse the strings twice: once to measure their length, and once to check for identity. I suppose it depends on the constant factor.

      "it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning": I think that depends entirely on the nature of the strings. If they're from a language that uses suffixes, maybe; if they're from a language that uses prefixes (Bantu, Athabaskan), probably not. It also presumes that you have a pointer to the end of the strings, which depends on your data representation. For DNA, I would guess it makes no difference at all which end you start at.

    3. Re:edit distance, not just matching by bzipitidoo · · Score: 1

      A terminating null byte is a horrible way to denote the end of a string, as C/C++ designers eventually realized. The cost and space to maintain a separate value for the length of a string is O(1) no matter what operation is done. Not only does it take O(n) to find the terminating null, there's the additional complication of how to embed a null in the middle of a string, without terminating it. Use an escape sequence? Just don't allow the terminating character in the string? No modern string manipulation library uses termination characters. So, yeah, checking for equal lengths takes O(1), unless you're using the old string.h C library.

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
  13. Re:Here is an idea to play with: Gödel notati by vikingpower · · Score: 1

    Agreed on the Patricia trie. Tried that out, coded one myself, the things are indeed lightning fast. Using primes is just a nicety, really...

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  14. Domain-specific shortcuts by Tablizer · · Score: 1

    Often one finds there are domain-specific shortcuts. For example, if we know that "useful" DNA sequences tends to X letters long, then an algorithm that is optimized for matching sequences of X length perhaps can be devised rather than try to match everything to everything.

    Sometimes it's better for the org to search say 10,000 objects with an imperfect search tool rather than 500 with a perfect one. The most promising ones can then later be gone over with the fuller algorithm.

    I'm not a biology expert, but I find there are shortcuts in other domains once you get to know the domain well enough. Focusing too much on being "generic" up front sometimes blinds one to the possibilities.

  15. [Correction] Re:Domain-specific shortcuts by Tablizer · · Score: 1

    if we know that "useful" DNA sequences tends to X letters long,

    should be "...tend to be X letters long".

  16. Maybe by CODiNE · · Score: 1

    To someone who hasn't taken Calculus yet it may seem like a brute force problem trying to find inflection points and the zero points of a curve. Even the area of an oddly shaped object would be an enormously repetitive process of slicing and adding.

    Yet once you know the tricks, it becomes trivial.

    It may be impossible but maybe not.

    --
    Cwm, fjord-bank glyphs vext quiz
  17. Abstract it away with change tracking by Etherwalk · · Score: 1

    The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.

    Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

    Deletion of a single symbol changes uxv to uv (x->e).

    Substitution of a single symbol x for a symbol y x changes uxv to uyv (x->y).

    It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

    It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

    That does make the simple hardware a little trickier, although for specific purposes there are some tricks you may be able to design for practically that become more efficient as the size of the insert grows. (An entire gene, for example, would be easier per gene to change than a single base pair).

    Basically you can make a list of the changes. If you know you inserted a big object at place X, you can dynamically generate any requested index without having to rewrite the whole really long string, for example.

    Similarly, you can design your machine to encode the object in a way which makes insertion and deletion easier. For example, allocate an array of blocks of memory to store the object and insert or delete blocks in the middle as it changes. We could call this a "linked list." Or an array with list-like elements.

    If you want, you can set it up so that re-writing to an array occurs in spare cycles to eliminate the dynamic overhead created by the linked list or change history.

  18. anything complex than O(n) by fkodama · · Score: 1

    you can serialize a computer to hash (map-reduce) then the second computer (or thread) to do what you have to do. as long the alphabet is a closed set and smaller than current int(32/64), it's a good strategy for full match.

  19. SETH exists by kwbauer · · Score: 1

    SETH is true. I know two of them.