Slashdot Mirror


Tracing the Limits of Computation

An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.

Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.

82 comments

  1. The algorithm isn't clever, but scales well. by Ihlosi · · Score: 3, Insightful
    The algorithm may not be very clever, but it scales almost perfectly with the number of cores (as long as you don't have more cores than characters in the string).

    Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.

    1. Re:The algorithm isn't clever, but scales well. by Required+Snark · · Score: 3, Informative
      The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.

      Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

      Deletion of a single symbol changes uxv to uv (x->e).

      Substitution of a single symbol x for a symbol y x changes uxv to uyv (x->y).

      It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

      It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

      --
      Why is Snark Required?
    2. Re:The algorithm isn't clever, but scales well. by jellomizer · · Score: 0

      No not really. It is easily parallelized yes. But you really don't improve big O speed like you do with sort algorithms.
      String compare will always be O(n)
      Now based of of the data you may be able to speed it up say your string of data contains particular patterns where you can group as a one letter such as compressing the string and compare the two compressed data, or say with DNA you just compress the data to 4 bit for value so you are searching more optimally. However you are still going in linear speed.

      That said linear speed isn't that bad, but compared to log speed it gets poky

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    3. Re:The algorithm isn't clever, but scales well. by JMJimmy · · Score: 0

      The human brain is known to take shortcuts to do string comparisons so why can't a computer to a certain degree? If memory serves, and I apologize if I'm off it's the end of a long day, when reading English at least the brain will read the first letter and the last letter and approximate the length of word to make the connection to the word stored in the brain. This apparently works up to a character limit then the brain chunks the words so compliment and complement might get confused as cplt+length for both or split into cit+length vs cet+length depending on the person. While this doesn't have any effect on unknown strings, it could shorten known string comparisons. ie: English has 1m+ words but fewer than 300 start with an X, only 2 that end in X so in length+3 character comparisons I've either correctly identified "Xerox" or "Xanax" or I have determined it's an unknown string and needs a full "best algorithm" comparison which would have been just as slow. Obviously that's a simplified example but it took 40% of the comparison away (and added index lookups)

    4. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 1

      Read and understand the problem and your answer will become trivial. The problem you present in your post has nothing to do with the string comparison algorithm in general (or the sequence alignment problem in particular).

    5. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      Take sorting

      No matter how well Bubblesort can scale, it is still quite damn s-l-o-wwww ...

    6. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      it scales almost perfectly with the number of cores

      Such an algorithm does not exist. Even the best data-parallel algorithms will scale following roughly a logarithmic curve. Sure, you might get a 1.9x with 2 cores and 3.8 for 4 cores, but even without having run the experiments myself, I can assure you that the speedup with 16 cores will be closer to 10x than 16x, and 32 cores will be closer to 16x than 32x.

    7. Re:The algorithm isn't clever, but scales well. by Sique · · Score: 2

      Especially for DNA comparisation, this shortcut won't work. Moreso, it is antithetic to the task at hand. We want to know the edit distance, or in DNA speak, we want to know how many mutation we need to get from DNA1 to DNA2, for instance to determine relation. The shortcut that we just look if all letters are present and begin and end are ok, and don't care about their sequence in the middle of the word doesn't give us the edit distance, as it hides exactly the answer to the question we asked to begin with. It's ok for determining which words are there, and it speeds up reading, but only if all sequences of letters are actually words we know perfectly well. It would slow down the reading to a crawl if our task was for instance to read a text in a foreign language and translate it with the help of a dictionary. If we then stumble upon scrambled words, it will take ages until we figure out the right sequence of letters to look the word up in the dictionary.

      --
      .sig: Sique *sigh*
    8. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      No matter how well Bubblesort can scale,

      And, well, it doesn't scale. No matter how many cores you throw at it, the required number of passes is still dictated by the input.

    9. Re:The algorithm isn't clever, but scales well. by durrr · · Score: 1

      DNA mutations are more common in certain places(don't ask why, probably related to the geometry of packaged DNA and some other factors). An algo could spot check these places first to give preliminary answers but it would still take until it's finished to get a true answer.

    10. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      The human brain is known to take shortcuts to do string comparisons so why can't a computer to a certain degree?

      Because we rely on computers to get perfect answers and shortcuts don't do that. You might find a shortcut to compare strings based on some sort of pattern but you will exclude all non-matching patterns as a result and end up stifling research because we rely on those comparisons to find things we miss or cannot find, not to find specific things.

    11. Re:The algorithm isn't clever, but scales well. by complete+loony · · Score: 1

      So don't compare only 2 DNA strings at once, or you can compromise and get good imperfect results. Every time you discover a matching sequence, you can leverage the information you have gained from earlier comparisons. DASH: Searching Compressed DNA (disclosure; A friend's PHD thesis).

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    12. Re:The algorithm isn't clever, but scales well. by owl57 · · Score: 1

      Definition from TFA: For any two sequences x, y, EDIT(x, y) is equal to the minimum, over all sequences z, of the number of deletions and substitutions needed to transform x into z and y into z. This is equivalent to the one you cited: insertion into x is the same thing as deletion from y.

    13. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      The human brain is known to take shortcuts to do string comparisons so why can't a computer to a certain degree?

      People do it, and it's called reduction.

    14. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      You are correct. Your intuition of using the structure of vocabulary to reduce comparisons is indeed very clever and useful. However, If you don't know a foreign language, and you were asked to compare edit distances for strings of characters from such a language, your shortcuts wont work. You will need knowledge about the inherent structure in its vocabulary. For a random set of strings, you cannot assume that they share such structural similarities.

    15. Re: The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 1

      Thanks for those hot assurances without having actually run the experiments yourself.

    16. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      Unless your container tends to almost always be already sorted. Which is a really bad case for some other popular sorts, and is also a common case.

    17. Re:The algorithm isn't clever, but scales well. by Anonymous Coward · · Score: 0

      What popular sort has a bad case for almost-sorted input?

      Quicksort has variants that fail on this case - but those variants aren't popular for exactly that reason. There are also quicksort variants that has (almost) sorted as a good case. For example, by something as simple as picking the pivot in the center of your (sub)array.

      Merge sort, heap sort insertion sort and counting sort has no problem with almost sorted data.

    18. Re:The algorithm isn't clever, but scales well. by mcswell · · Score: 1

      This is one of many diffs (ahem) between DNA sequencing and natural language processing. Another is the alphabet size: DNA has an alphabet size of 4, while the alphabets (number of phonemes or graphemes) of natural languages range from a low of 11 (Rotokas and Mura) to a high of perhaps 140-something (!Xu, although the exact number depends on how you count things). Of course written Chinese and Japanese have much higher numbers of graphemes.

      It's also the case that some writing systems don't mark word boundaries (Chinese, Thai for example), in which case the begin/end shortcut won't work at all. Which makes machine processing of such languages quite a bit harder. And of course word boundaries aren't usually indicated in fluent spoken language, except at phrasal pauses or the beginning/ end of utterances.

      And finally, languages have a finite number of wordforms, although that number can be very high in languages that have lots of inflectional morphology.

      That said, variants of the algorithms used for DNA sequencing are also used in computational linguistics.

    19. Re:The algorithm isn't clever, but scales well. by JMJimmy · · Score: 1

      When I was thinking of how it could apply the spacing is the "on/off switch" between genes, the individual genes are the "words". It wasn't overly well thought out, just thinking of how one could use known patterns to shorten processing time.

  2. This reminds me... by Dutch+Gun · · Score: 5, Interesting

    This article reminded me of learning about string-matching algorithms as a student. When you first naively implement a string-matching algorithm as a student yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a simple search with each new character in the source text, backtracking if the match doesn't occur, and checking the next potential match.

    When you try to get a bit more clever, you realize there's probably a way to ensure that you can get examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.

    However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite part of the Computer Science curriculum when I was in school.

    It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.

    --
    Irony: Agile development has too much intertia to be abandoned now.
    1. Re:This reminds me... by Anonymous Coward · · Score: 0

      However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, which I would have intuitively thought impossible without making that mental leap.

      The next leap is suffix trees, which effectively allow you to consider all positions of the string at the same time. Suffix trees require only O(n) preprocessing time for the text and none for the search patterns.

    2. Re:This reminds me... by Anonymous Coward · · Score: 1

      However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm

      Then you would think I'm a fucking "genius". When I was 10 years old I implemented Jump Lists for my string search algorithm. Not because my 33Mhz system was too slow for the naive implementation, but because I was tired of waiting for QBASIC to stop thrashing memory. I was implementing my own native x86 compiler using the only interface I had available (QBASIC). A C compiler cost 1/3rd of what a PC did back then. So, without prior training in CS I imagined how one might transform a text program into machine code and implemented it with the tools I had. All because I was tired of creating .exe files via hex editor, and BASIC interpretors were too slow (at the time) for graphically intensive games.

      Why string search? Isn't a tree better for tokenizing code? Why not then recursive descent? Well, because of stack overflow. My compiler was iterative and worked by doing string searches on a normalized token stream ("code string"). This "code string" was a linked list of "symbols" represented by start and end offsets into the source text. The parser performed a string search across these "tokens" and replaced runs of the symbols with list items containing a "native string" (machine code) corresponding to the "code string". Adjacent blocks of "native" symbol "strings" were joined. The process repeated until the entire "code string" had been converted into a single block of native code, and that was my EXE file.

      To add compiler optimisations I just added more pattern matching to this mixed list of "code string" symbols and "native string". Optimization was then performed as a normal function of compilation. The iterations then continued until no more pattern matches could be made AND there were no more "code string" symbols (only a "native string"). If no matches could be found and some portion of code string remained, that was your syntax or compiler error -- Which you could sometimes correct and resume compilation (to keep compile time from being wasted -- a feature still not present in most any "modern" compiler). A special case was when "external" symbols were needed. Then the mixed "code" and "native" string would be saved to a file, and the next source file processed. When all files were either native code or mixed with external symbols, then the program was complete and required one more "assembly" pass to generate the EXE. I didn't know it but I had created a linker and intermediary "object code" -- And to this day it doesn't matter that I didn't use these terms.

      Rather than piss away valuable computer time generating tokens (made more valuable by slow ass QBASIC), I merely removed formatting white-space and comments from the program, and the result was my normalized "code string" to begin compilation of via "string" search (and replace).

      It would be almost a decade before I would be allowed to take a course on compiler theory in school. This also marked the point at which I dropped out of Academia altogether. I was paying to be taught things I already knew, and that's fucking retarding. Their graduate-level algorithms were not even as optimal for parallelism as my unlearned ignorant childhood compiler implementations. They taught next to nothing of recursive pattern matching (this being the true foundation of compilation in all systems which do so, be they digital or DNA). The time spent praising the achievements of dead men which a child can intuit in an afternoon made me want to puke. I felt the whole system was designed to demoralize the thinker and make them feel small and useless and obedient to those who dolled out knowledge, rather than to place the thinker on the same footing as those great minds that came before.

      Point being: Just because YOU were told that you are not smart enough to simply ponder at a problem and solve it (even optimally) without formal education, doesn't mean

    3. Re:This reminds me... by Anonymous Coward · · Score: 0

      I do think you're a fucking penis. Ohhhh, sorry. You said "genius".

    4. Re: This reminds me... by Anonymous Coward · · Score: 0

      U mad bro-skee? Don't hate on other peoples achievements. You mad because your resume is a quarter of a page long.

    5. Re: This reminds me... by Anonymous Coward · · Score: 0

      Length doesn't matter. It'a all about the girth^H^H^H^H^H quality of what you've done.

    6. Re:This reminds me... by Anonymous Coward · · Score: 0

      ...Known as "the discovery method." It has the small problem that people not quite as smart as the people who found the answers in the mists of history are going to take longer to find the answers, unless someone SHOWS them the answer.

    7. Re:This reminds me... by Zeroko · · Score: 1

      So I was not the only one to do that repeated-string-mapping method of compilation (although I did mine in high school (using C for bootstrapping, rather than QBASIC, IIRC)...earlier I was too busy playing with Visual Basic, which had its own compiler)...interesting.

      I never tried to make my compiler optimizing, despite having a plan in my head to do so, because the binary for the compiler itself already fit in 64KB that a .COM file required, & I never did anything interesting enough with it for time to be an issue...& then other projects called.

  3. Inaccurate Summary by Anonymous Coward · · Score: 3, Interesting

    First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.

    Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.

    1. Re:Inaccurate Summary by Anonymous Coward · · Score: 0

      Anything above quasi-linear (specially if it's in both, time AND space) is not clever. O(nlogn) is somewhat clever, O(n) is definitely clever, O(logn) is very clever.

    2. Re:Inaccurate Summary by Anonymous Coward · · Score: 0

      You mean taking a problem that should be NP, and pulling it down into P space using dynamic programming isn't clever? I think you have an unreasonable definition of clever.

    3. Re:Inaccurate Summary by Anonymous Coward · · Score: 0

      You mean taking a problem that should be NP, and pulling it down into P space using dynamic programming isn't clever? I think you have an unreasonable definition of clever.

      Err, if the problem can be "pulled down" into P-space, then the problem was never NP to begin with.

  4. Tries by Anonymous Coward · · Score: 0

    If you can use a Trie, you'd optimize your comparison space.

    Another thing is how we compare Strings is the same problem of compressing Substrings, that is often implemented in Lempel/Ziv and other String compression algorithms.

  5. Strong Exponential Time Hypothesis is unproven by tommeke100 · · Score: 0

    The paper gives a "Proof By Contradiction" which would violate "The Strong Exponential time Hypothesis". However this Hypothesis itself is still unproven. Not that I hold any authority as to the validity of it all, though :)

  6. not a practical limit by fpoling · · Score: 1

    The result is truly about limits of computations and not about practical limits. For example, I will happily accept an algorithm that runs in c*2^(a*N) time when a is 1e-100 and c is not big as for practical purposes it runs in constant time. It is fascinating topic on its own why we do not have such algorithms for practical tasks.

  7. Spaghetti sort by jdagius · · Score: 1

    Perhaps the only way to avoid the mathematically imposed restrictions on conventional computing machinery would be to (literally) use a different mechanism. One that is inspired by the nature of the problem.

    For example, consider the problem of ordering a collection of random-length pieces of uncooked spaghetti. They can be ordered in O(1) time by merely grasping them in your hand and striking them against a table top. Ordering 500 strands takes no more time than 50 (ignoring preparation time). Scaling up is merely a matter of finding a bigger hand or table top.

    For comparing strands of DNA consider using a 'chemical computer' based on the transcription of DNA in to messenger RNA, which is the 'algorithm' employed by Nature to encode the DNA sequences. Such a machine, in solution, equipped with customized sequences of interest, could theoretically recognize any desired DNA sequence in O(1) time (up to chemical reaction times).

    So a simple solution to a complex problem. I'll leave it to you students to work out the details. ;-)

    1. Re:Spaghetti sort by Anonymous Coward · · Score: 0

      Scaling up is merely a matter of finding a bigger hand or table top.

      I'm pretty sure this does not have O(1) complexity.

    2. Re:Spaghetti sort by jdagius · · Score: 1

      I'm also pretty sure that designing and building any kind of computer can't be done in O(1) time. I'm calling that "preparation time" (which you were instructed to ignore).

    3. Re:Spaghetti sort by serviscope_minor · · Score: 1

      The mechanism isn't different and your spaghetti based computer is still stuck with the same limitations as anything else.

      What you're describing is a special purpose instruction for ordering a bunch of random length bits of spaghetti in parallel, and then throwing more and more hardware at the problem. In fact, the amount of hardware scales as O(N) with the amount of spaghetti.

      This isn't actually a new mechanism at all: complexity classes also deal with parallelising decision problems. For example NC complexity ones are "easy" to parallelise in the sense that polynomial complexity is "easy".

      There are important algorithms in P which aren't in NC, such as linear programming. That means there's no "easy" way to parallelise it for the definitions of NC. That means for example that if you throw a polynomial amount of hardware at a linear programming problem (i.e. n^k CPUs for size n input and an arbitrary k) you will still not be able to solve the problem in log(n) time.

      Based on your description, spaghetti-sorting is in NC.

      Likewise with the DNA chemical hardware. All that's got access to is some methods for very efficiently doing certain things in sequence and other methods for very efficiently doing things massively in parallel.

      While those methods can no doubt solve certain problems much more quickly than conventional serial or even conventional parallel computers, there are certain problems that they simply can't help with.

      --
      SJW n. One who posts facts.
    4. Re:Spaghetti sort by vikingpower · · Score: 1

      Your method doesn't "sort" the spaghetti. It aligns their ends against the same reference ( the table top ), which has nothing to do with sorting. They will still be, in your hand, in haphazard "order", i.e. there will be no order at all, as the shortest may very well end up immediately next to the longest.

      --
      Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    5. Re:Spaghetti sort by jdagius · · Score: 1

      > While those methods can no doubt solve certain problems much more quickly than
      > conventional serial or even conventional parallel computers, there are certain problems that
      > they simply can't help with.

      The price paid for surpassing "optimal" general solutions of a problem is that they won't work on all instances of the problem. Quite often that is an acceptable price. It's a trade-off.

    6. Re:Spaghetti sort by jdagius · · Score: 1

      > It aligns their ends against the same reference ( the table top ),
      > which has nothing to do with sorting.

      But in the process of alignment, a linear ordering is imposed in one dimension, along the axis of the spaghetti rods, such that if you traverse this axis towards the table top, you are guaranteed access to the members of an ordered collection, tallest member first. Quite often that is an acceptable solution, especially if you are only interested in the tallest strands. Perhaps not a general solution to the sorting problem.

      Mathematicians insist on perfect, general solutions. Scientists and engineers are often quite satisfied with "trade-off" solutions that work for the problem at hand.

    7. Re:Spaghetti sort by vikingpower · · Score: 1

      I AM an engineer, thank you. A linear ordering along one dimension is called a "partial ordering", which is often quite useless, even for an engineer :-/

      --
      Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
    8. Re:Spaghetti sort by tepples · · Score: 1

      Alignment is O(n) in the length of rods. Querying by eye is O(n^.5) in the number of rods, and querying by touch is O(n) in the length of rods.

    9. Re:Spaghetti sort by PlusFiveTroll · · Score: 1

      To sort spaghetti by size, doing it by hand sounds very inefficient.

      A sorting method where it slides down a plane with slats that increase in size until the entire length can fall into a slot, much like a method for sorting coins, sounds a way to solve both the length and order problem. Of course that will never scale 'faster' than the number of sorting elements you use.

    10. Re:Spaghetti sort by jdagius · · Score: 1

      No. It is a _total_ ordering over the relation "=" in the sense that all pair of elements in the linear are comparable, i.e. elements a and b must satisfy "a=b" or "b=a", which is the requirement for "totality".

      In a _partial_ ordering (such as a collection of tree-structured elements) not all elements are necessarily comparable using "=", e.g. elements in different paths.

      In plainer English: the strands of a spaghetti-sorted collection are totally ordered by length when traversing along the strand axis towards the table. I.e. each element traversed is greater or equal in length to _all_ ('totality') of the remaining elements yet to be visited.

    11. Re:Spaghetti sort by jdagius · · Score: 1

      ...oops forgot to escape my LT brackets. Substitute "<=" wherever you see "="

    12. Re:Spaghetti sort by blue+trane · · Score: 1

      "In fact, the amount of hardware scales as O(N) with the amount of spaghetti."

      No, because if you double each side in a cube, you get an eight-fold increase in volume. 2 x the material to extend the cube gives you 8 x the spaghetti you can fit in it.

    13. Re:Spaghetti sort by serviscope_minor · · Score: 1

      No, because if you double each side in a cube, you get an eight-fold increase in volume.

      The area of the table you need is roughly equal to the area of the stacked bunch of spaghetti. Since each strand is an approximately fixed diameter, they're about the same area each.

      If you double the number of strands, then you need a flat surface of twice the area. So, the hardware requirements are O(N).

      If you want to include the *length* of spaghetti instead of just the number of strands then it gets more complex. However, eventually there's a significant mass to support which means you need a significant support structure. Once the mass increases far enough things get very tricky indeed and you need a vast amount of stuff to support it.

      --
      SJW n. One who posts facts.
    14. Re:Spaghetti sort by blue+trane · · Score: 1

      Why wouldn't you need length? You're sorting on length. You need something to hold the spaghetti strands in an upright position, so you need volume. Volume increases faster than the material needed to bound it does.

      http://www.tiem.utk.edu/~gross...

      Surface area = 6 * length^2

      Volume = length^3

      The more hardware you add to make the cube bigger, the much more spaghetti you can sort. The hardware increases less than O(n).

      Even if other hardware components increase at O(n) (not sure of that even), because your container hardware increases at less than O(n) your total hardware increases less than O(n). You get an advantage because nature increases volume faster than its bounding lengths.

      Related: Ultimate Free Lunch

    15. Re:Spaghetti sort by serviscope_minor · · Score: 1

      In your O(n) which n are you talking about?

      The GP seemed to be referring to the number of strands. The minimum increase is O(n) with the number of strands.

      If you're talking about volume, it gets rather trickier. A 5 mile high stack of spaghetti won't stand up under it's own weight. The trouble is the mass increases with the cube too.

      --
      SJW n. One who posts facts.
    16. Re:Spaghetti sort by mcswell · · Score: 1

      Wait: finding the single longest strand is O(1). Sorting them is O(n), because you have to keep picking out the longest strand.

  8. SETH is a pretty big assumption by JoshuaZ · · Score: 2

    SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves instances of 3-SAT https://en.wikipedia.org/wiki/3-SAT will have worst case running time that is exponential. However, it is conceivable that one could have a whole series of algorithms which each solves 3-SAT better and better. The idea is that there's an algorithm a_i which solves 3-SAT in time O(x_i^n) and as i goes to infinity, x_i goes to 1. SETH is the hypothesis that says essentially that that doesn't happen. Many people are not convinced that ETH is true although it certainly looks plausible. I think most people who have thought about this consider the possibility that ETH holds and SETH doesn't hold to be a deeply weird and generally unlikely option.

    1. Re:SETH is a pretty big assumption by Anonymous Coward · · Score: 0

      NP-complete problems would be trivially solved if you could manipulate time.

      Some interesting reading

    2. Re:SETH is a pretty big assumption by Anonymous Coward · · Score: 0

      If you could manipulate time, yes a bunch of current problems would be trivial. But you would create a new class of difficult problems.

    3. Re:SETH is a pretty big assumption by Anonymous Coward · · Score: 0

      My unproven theory: upper bound of n^O(log(n)) for NP.

      (Think about ways to generate such a bound, and you'll probably guess what algorithm I have in mind.)

    4. Re:SETH is a pretty big assumption by mcswell · · Score: 1

      Prove it and you'll be famous.

      (Or you can write a note in the margin of a book on algorithmic complexity saying that the margin is too small to contain your proof.)

  9. Here is an idea to play with: Gödel notation by vikingpower · · Score: 2

    Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols.
    Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all" ;-)  ]
    String T has length M.
    Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations with 1, not with 0, in which case you'll lose the information about T's first symbol... ]

    Do the same thing for P, yielding a number sigP.

    if sigT mod sigP yields zero, T contains P.

    Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
    so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225

    P = "BR". B->5, R->67
    so sigP = 5^1 * 67^2 = 22445

    644256903225 mod 22445 = 0, so T contains P.

    Computational complexity: depends on complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Sch&#246;nhage-Strassen with complexity O(log n log( log n ) ).

    Just slurps enormous amounts of memory :-)

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  10. compare by Anonymous Coward · · Score: 0

    Things like this are always found by people who did not know it was impossible.

  11. How many times are you searching? by ameline · · Score: 1

    If we define the genome being searched as the pool, and the string being searched for as the key; The paper could be completely correct when there is no reuse of the pool or keys. If you search the same pool for multiple different keys I can think of ways to pre-process the pool such that the first search + pre-process time still obeys the performance constraints their paper outlines, but subsequent searches of the same pool for a different key could happen in O log n time, or even constant time.

    --
    Ian Ameline
  12. I wonder... by dmgxmichael · · Score: 1
    Fascinating article. This stuck out:

    If Williams or anyone else can prove the existence of an edit-distance algorithm that runs even moderately faster than normal, SETH is history.

    The edit distance between book and back is 2 ... In latin characters. It's 1 with kanji. The edit distance is probably several hundred if you're comparing the two on a per pixel basis instead of per letter. Edit distance varies by scope.

    Surely I'm misunderstanding something. It can't be that simple.

    1. Re:I wonder... by locofungus · · Score: 1

      For a given algorithm, what is it's worst case running time across all possible inputs for that algorithm (of which there must be an infinite number)

      The answer the algorithm gives doesn't matter. It's how long you might have to wait for that answer that is important.

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
    2. Re:I wonder... by locofungus · · Score: 1

      (of which there must be an infinite number)

      I didn't put this very well. What I mean is that there must be larger and larger cases.

      For the question "what is the edit distance between two n character strings taken from alphabet L" n can grow without bound. If there is a limit to n then, mathematically at least, the problem can be solved in O(1) via lookup.

      We often need the general case algorithm even when our inputs are finite. For example, there may be a fundamental limit to how many "characters" there are in DNA (I don't know if there is or isn't) but even assuming there is, the real size of DNA precludes the practicality of building lookup tables to calculate edit distance in O(1)

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
  13. Re:Here is an idea to play with: Gödel notati by locofungus · · Score: 1

    Maybe I'm missing something.

    ABRE -> A^1 * B^2 * R^3 * E^4
    ABER -> A^1 * B^2 * E^3 * R^4 = A^1 * B * B * E^3 * R^2 * R^2 = A^1 * B * E^3 * R^2 * (B^1 * R^2)

    So your calculation says that ABER contains BR

    And assuming I've done the calculation correctly using your numbers for A B R E, I get ABER = 3320400962775, BR=22445.

    3320400962775/22445 = 147934995 = BRE*A

    --
    God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
  14. Re:Here is an idea to play with: Gödel notati by vikingpower · · Score: 1

    You are absolutely right. I just coded the algorithm in a programming language and noticed the same thing. It needs some extra encoding for symbol successor / predecessor relationships. I'll think about that while walking the dog post an update here ASAP.

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  15. Re:Here is an idea to play with: Gödel notati by Anonymous Coward · · Score: 0

    So RB is also in ABRE?

  16. NOT string comparison .. by Anonymous Coward · · Score: 0

    Most comments here seem to comment on string comparison, which edit distance isn't.

    Not even substring search, or full text search .. each problem requires a different approach in the root. For large text searches you can use PATRICIA tree, for comparison of two strings you cannot do better than O(n) since you have to at least examine each character.
    Precomputation is the correct answer to large problems with constants on one side.
    Edit distance has its own path of optimization depending on use case.

    Usually the 'library' algorithms are optimized like hell, but then sub-optimally scaled to larger problems that can be approached completely differently.

  17. Re:Here is an idea to play with: Gödel notati by Anonymous Coward · · Score: 0

    Your numbers will contain much more digits than the original strings, so time to process will go UP. Even computing the product for the P string takes a huge amount of time.
    Look at pat tree for a much better solution.

  18. edit distance, not just matching by bzipitidoo · · Score: 1

    Yes, a nice reminder of string processing problems. The problem they worked on isn't exactly string matching. They are trying to find the minimum "edit distance" between 2 strings. There are a lot of very similar seeming problems in this area.

    If all one is trying to do is test whether 2 strings match exactly, first check whether the lengths are equal. If they are, then it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning. It's not an algorithmic savings, still have to compare each character until a mismatch is found or all characters have been checked. All that idea does is try what might be a more likely location for a mismatch sooner, but that depends entirely upon the data.

    Trying to find where a shorter string might occur in a longer is a different problem. As you noted, Boyer-Moore is good. Then, the problem of finding matches for n short strings for n>1 within a long string is solvable in yet other ways, faster than simply applying Boyer-Moore n times.

    --
    Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    1. Re:edit distance, not just matching by Dutch+Gun · · Score: 1

      Yeah, I realize they're not the same problem... I was just reminiscing a bit about a slightly different problem that I spent some time thinking about as a student.

      Another string-related problem I had fun figuring out (essentially re-inventing a variation of the Levenshtein distance formula) was a closest-match algorithm, say, for common search strings that are slightly misspelled, or as a spell-checker finding a closest potential match. A decent solution to this problem isn't terribly hard, but does require a bit of ingenuity. I'd imagine the folks at Google have a pretty impressive algorithm for doing this.

      For whatever reason, these sorts of problems feel like CS-oriented puzzles to me, and I enjoy working on them. Often, if I see an algorithm which intrigues me, I'll expend some brain power theorizing how they work before I research how other people have approached or solved the problem so far. Reverse image searching and closest image matching were one example of this. I puzzled for quite a while about the best way to boil an image down to simple database-searchable numerical values that can be quickly queried and checked for ranges.

      --
      Irony: Agile development has too much intertia to be abandoned now.
    2. Re:edit distance, not just matching by mcswell · · Score: 1

      "first check whether the lengths are equal": I suppose if the string length is marked at the beginning of the string, that makes sense. But if not (e.g. the end of the string is marked by a null byte), doesn't that just slow things down? Because you have to traverse the strings twice: once to measure their length, and once to check for identity. I suppose it depends on the constant factor.

      "it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning": I think that depends entirely on the nature of the strings. If they're from a language that uses suffixes, maybe; if they're from a language that uses prefixes (Bantu, Athabaskan), probably not. It also presumes that you have a pointer to the end of the strings, which depends on your data representation. For DNA, I would guess it makes no difference at all which end you start at.

    3. Re:edit distance, not just matching by bzipitidoo · · Score: 1

      A terminating null byte is a horrible way to denote the end of a string, as C/C++ designers eventually realized. The cost and space to maintain a separate value for the length of a string is O(1) no matter what operation is done. Not only does it take O(n) to find the terminating null, there's the additional complication of how to embed a null in the middle of a string, without terminating it. Use an escape sequence? Just don't allow the terminating character in the string? No modern string manipulation library uses termination characters. So, yeah, checking for equal lengths takes O(1), unless you're using the old string.h C library.

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
  19. Re:Here is an idea to play with: Gödel notati by vikingpower · · Score: 1

    Agreed on the Patricia trie. Tried that out, coded one myself, the things are indeed lightning fast. Using primes is just a nicety, really...

    --
    Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
  20. Domain-specific shortcuts by Tablizer · · Score: 1

    Often one finds there are domain-specific shortcuts. For example, if we know that "useful" DNA sequences tends to X letters long, then an algorithm that is optimized for matching sequences of X length perhaps can be devised rather than try to match everything to everything.

    Sometimes it's better for the org to search say 10,000 objects with an imperfect search tool rather than 500 with a perfect one. The most promising ones can then later be gone over with the fuller algorithm.

    I'm not a biology expert, but I find there are shortcuts in other domains once you get to know the domain well enough. Focusing too much on being "generic" up front sometimes blinds one to the possibilities.

  21. [Correction] Re:Domain-specific shortcuts by Tablizer · · Score: 1

    if we know that "useful" DNA sequences tends to X letters long,

    should be "...tend to be X letters long".

  22. Maybe by CODiNE · · Score: 1

    To someone who hasn't taken Calculus yet it may seem like a brute force problem trying to find inflection points and the zero points of a curve. Even the area of an oddly shaped object would be an enormously repetitive process of slicing and adding.

    Yet once you know the tricks, it becomes trivial.

    It may be impossible but maybe not.

    --
    Cwm, fjord-bank glyphs vext quiz
  23. Abstract it away with change tracking by Etherwalk · · Score: 1

    The formal description of the edit distance is different from that described in the article. This is the one used in some cases of DNA or text comparison.

    Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

    Deletion of a single symbol changes uxv to uv (x->e).

    Substitution of a single symbol x for a symbol y x changes uxv to uyv (x->y).

    It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

    It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

    That does make the simple hardware a little trickier, although for specific purposes there are some tricks you may be able to design for practically that become more efficient as the size of the insert grows. (An entire gene, for example, would be easier per gene to change than a single base pair).

    Basically you can make a list of the changes. If you know you inserted a big object at place X, you can dynamically generate any requested index without having to rewrite the whole really long string, for example.

    Similarly, you can design your machine to encode the object in a way which makes insertion and deletion easier. For example, allocate an array of blocks of memory to store the object and insert or delete blocks in the middle as it changes. We could call this a "linked list." Or an array with list-like elements.

    If you want, you can set it up so that re-writing to an array occurs in spare cycles to eliminate the dynamic overhead created by the linked list or change history.

  24. anything complex than O(n) by fkodama · · Score: 1

    you can serialize a computer to hash (map-reduce) then the second computer (or thread) to do what you have to do. as long the alphabet is a closed set and smaller than current int(32/64), it's a good strategy for full match.

  25. SETH exists by kwbauer · · Score: 1

    SETH is true. I know two of them.

  26. Photonic Computing Electronic Computing? by Anonymous Coward · · Score: 0

    So, they are looking to compare two arbitrary strings, right? Since strings are stored as ones and zeroes, I like to think about them as little diodes. Light is on = 1, light is off = 0. This helps me visualize the problem. This article says that the algorithm that is most efficient is a bit-by-bit comparison. In other words, if the first string is a bank of lights on the left, and the second string is a bank of lights on the right, this article says that the simplest process is to do two read operations. That is to say, the computer's "eye" looks left then looks right and determines whether it got the same signal twice (two on's or two off's). This makes sense given the way electricity works. The left lightswitch functions independently of the right lightswitch. You get four cases (left state, right state)

    1) On On
    2) On Off
    3) Off On
    4) Off Off

    However, I can envision a photonic system with only a single read per bit pair. Again, this comes from imagining bits as lights. One of the really cool things about light is that, as a wave, two identical photons can cancel each other out through destructive interference. If the left bank and the right bank are aligned such that a left light destructively interferes with the matching right light, the computer's "eye" only needs to perform one read function, the combined output of both lights.

    1) Off (either because both lights are off, or because both lights being on causes destructive interference)
    2) On (because only one light is one, and it experiences no destructive interference)