Genetic Algorithm Improves Shellsort
gstover writes "As a personal programming project recently, I used a genetic algorithm to find a better sequence of increments for Shellsort. It found at least one sequence that out-performs the best previously known by about 5 percent. Here
is an article about it (also here).
I believe this is the first time a
genetic algorithm has actually improved Shellsort."
You'll be included in THE book & be famous
Mark
Personally, I am quite impressed with the concept of Genetic Algorithms.
I could imagine your shell sort could be applied to database technologies by creating a separate low priority thread that would generate the "gene pool" to improve the indexing of the tables in the background.
On a side note I am learning about genetic algorithms with hopes that I may work on it when I take my Masters in Computer Science.
Archie - CIO-for-hire
This may be the first time something potentially important hit Slashdot before being reported elsewhere! :-)
May we never see th
quicksort doesn't preserve the order of duplicate keys, IIRC
Votez ecolo : Chiez dans l'urne !
There may be some cases where shellsort is more desirable for the exact data being sorted, I don't really know for sure. The importance of this is that he has used a GA to better the optimization work of humans on shellsort. He has laid the groundwork and circumstantial proof out for others to do the same with other algorithms. Of course he evolved a set of constants more than an algorithm itself.
The next logical place to go with this work, IMHO:
1) Invent a concise fake machine language for sorting algorithms (a convenience to make the rest of this easier). It should have basic instructions used in manipulating and sorting arrays (move, compare, copy, branching, etc...).
2) Write a "sort engine" that takes as input algorithms written in your fake language and uses them to sort things (outputting some performance numbers).
3) Implement every known array sorting algorithm you can find in your little fake machine sort language.
4) Let a GA evolve the whole algorithm by arbitrarily replacing bytes with other valid bytes from your specialized assembler language, starting with all the known sort algs as parents. Let it run until it stabilizes, using a relatively high mutation rate.
Of course, the big problem is that if your language implements any kind of looping construct, or any other way that code can be re-executed (and it will almost have to), then you face a "halting problem" when testing each new child. The pratical workaround is of course to know that any reasonable algorithm must finish the sort in a certain bounded amount of cpu cycles, and terminate any children who take longer.
5) Translate the winning candidate(s) custom machien source back into a generic language like C, and puzzle over exactly why it works so damn well.
11*43+456^2
Quicksort is slow if the data it is sorting are already almost sorted, a situation which (IIRC) shellshort is particularly efficient at dealing with.
Bottom-up mergesort does.
Theta(n log(n)) instead of Theta(n^2) like the (standard) Quicksort.
(Shellsort has Theta(n log(n)^2), IIRC)
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
"Notice that the mating process has nothing to do with the constraints placed on valid organisms."
I wish more women would read Slashdot. Chalk one up for the "Size Doesn't Matter" team.
"I chose single-point crossover as the method of mating because it is simple & generic."
I choose doggie style for the same reasons.
GA is good for a lot of things. For instance, it was used to redesign diesel engines to be more efficient.
The big problem with GA though, IIRC, is that the resulting solution is often incomprehensible to a human. I believe Bill Joy did some work with GAs and had comments along those lines (sorry, I couldn't find the quote). Consider for a moment though trying to troubleshoot code generated by a computer. Bad variable names would be just the start of your problems. The logic patterns employed would be essentially random to a human. Many of the patterns would be vestigial and wouldn't even be relevant, but you wouldn't even know that. Identifying the primary execution paths would be a huge chore... never mind actually understanding the basis for why the generated solution actually works.
How comfortable would you be deploying a solution (hardware or software) where the fundamental design isn't even understood? How the heck do you fix such a thing once it's deployed?
Please mod this post only if you think others should/n't read this. I have enough ego^H^H^Hkarma. Thanks!
One of my professors at Midwestern State University (texas) has done this before. Although it seems you may have improved on his solution. (but I couldnt find a copy of his paper)
"Faster Shellsort Sequences: A Genetic Algorithm Application", Proceedings of the International Society for Computers and Their Applications(ISCA), April 7-9, 1999. (With Shashidhar Yachavaram) Presented at the conference, Cancun Mexico.
In Melanie Mitchell's wonderful book, "An Introduction to Genetic Algorithms", page 21-27, The author describes using genetic algorithms to find optimal patterns for "sorting networks". They are used for sorting chips (hardware) to sort in chunks of say 16 or 64 units. The order of comparisons is important. Using genetic algorithms, they were able to come very close to the best sequences found by humans. (I don't know if hardware sorting is still used. I had the impression that it is a mainframe thing.)
Thus, optimizing sorting algorithms with GA is not new in itself.
Table-ized A.I.
You're generalising, which is an easy mistake. Quicksort is good but for many data sets, it is very likely suboptimal and another alogorithm would perform better. Shellsort is good for small data sets. In fact, a good quicksort implementation may use shellsort once the divisions get small enough.
Another example:, quicksort is poor at sorting strings. This is because you have to perform a full strcmp() (or equivalent) at each comparison. For this reason, a modified radix sort is often used.
Samplesort vs Timsort.
Intro
-----
"This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it ). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays."
From a GA fan, and seeing all the articles on distributed computing...
You may also be interested in Tierra. Open-ended networked evolution - it's pretty nifty.
Got to love watching the computer solve problems for you. Ahh...
Keep your packets off my GNU/Girlfriend!
The comb sort is a variation of the bubble sort algorithm that achieves N log(N) execution times, similar to heapsort. The algorithm is simple and memory efficient, compared to Quicksort.
1 /c ombsort.htm
http://cs.clackamas.cc.or.us/molatore/cs260Spr0
http://world.std.com/~jdveale/combsort.htm
A paper about genetic algorithms, written by a guy named Gene.
How comfortable would you be deploying a solution (hardware or software) where the fundamental design isn't even understood? How the heck do you fix such a thing once it's deployed?
:)
I'm just happy my parents didn't have the same concerns when they "deployed" me!
People couldn't type. We realized: Death would eventually take care of this.
Genetic algorithms are interesting in that they can become more efficient and modify themselves. I don't mean to sound like a ludite but is this a good way to develop algorithms?
/evolves you don't get the same output from the same input.. I guess thats the point, but I don't want my programs running like that.
Is the efficiency gained worth the difficulty in proving the new genetical enhanced algorithm works?
Its hard enough with vigorous testing to prove simple programs work without fail without introducing some random "Genetic" changes that may break the algorithm for some input cases.
One of the problems with AI and self modifing algorithms is that as it learns
But you'd better make damn sure that your GA doesn't improve the algorithm specifically for your "virtual machine". I.e. if pointer arithmetics were very slow (pulling that example out of my ass, btw.), your GA might tend to penalize algorithms using them more often then others.
True true. Probably the answer would be instead measuring the real execution time in your engine, meaure it in number of various operations, and weight them by how expensive those operations are on a typical modern 32 bit processor.
11*43+456^2
Shell Sort works by roughly sorting, then sorting a little finer, and so on, until you finely sort every piece of data, and hopefully at this final and potentially most time-consuming step, there isn't a lot of work to be done.
The thing is, there's no platonic God-like perfect mathmatically-proven best choice of how fine or rough the early iterations should be. No matter what increments you use (within certain constraints) Shell Sort will still work every single time (the last step at increment 1 guarantees it) -- it's just a matter of performance. Nobody knows what increments work best, and you can't just mathmatically figure it out.
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
I was extrapolating to the "AI Future" of much more complex algorithms being evaluated (evolved) in more fundamental ways then just changing parameters.
My frame of mind is has changed. Right now is I'm working on very time constrained real time stuff, and when I have to do it in X seconds I like to know what the worse case is, not best average case.
The genetic algorithm sounds like and iterative test to try and find the optimum increments for shell sort. Really nothing revolutionary.
I couldn't recall from the test description, but it doesn't seem that different arrays to sort were used to evolve the parameters of the sort (increment list). If this is the case the solution is just a sort optimized for a particular dataset and of little use.
I would be interested to see how well the algorithm would work elvolved against different random data sets (almost in order, always reverse order, random).
And then compared with others Standard increments.
They could have a competion like they have to come up with good prisoners delima algorithms, but instead come up with the best shell sort increments...
I guess it's a nice thing that shellsort is getting improved, and I do realize that this is great for genetic research, but, pardon me for being an uneducated ignorant bastard, does it have any practical value, seeing as quicksort is much faster than shellsort? Or am I wrong?
Shell sort has better worst-case running-time than quick sort does.
Most programming these days is all about the common case, but in real-time programming it's all about predictability. Thus there the mergesort is the standard choice over quicksort.
1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
I guess it's a nice thing that shellsort is getting improved, and I do realize that this is great for genetic research, but, pardon me for being an uneducated ignorant bastard, does it have any practical value, seeing as quicksort is much faster than shellsort? Or am I wrong?
Quicksort is good for a lot of cases, but not for all. Yet one more good example (not mentioned here yet) would be the presense of knowledge or how unsorted things really are. Meaning that if you know something extra about the "unsorted" state of the data quicksort can be outperformed easily.
Classic example is to arrange bank payments based on the actual pruchase date using the credict card. As each store sends the bill with some variation the bills arrive in time unsorted. On the average they're still almost sorted, but not quite. For such a case quicksort would perform quite horribly in respect to what could be done.
True randomness is far more seldom encountered, thus other means for sorting can also be implemented more often that common people think.
1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
You might want to check out Genetic Programming instead. It is the method of evolving programs instead of evolving solutions like Genetic Algorithm described here.
There are some inherit disadvantages of GP, while being much more powerful than GA. One of them being the problem of code growth, usually after some generations, the organisms gets to be too large and starting to cosume too much memory and cpu resources to be feasible. One of the papers I wrote while taking Evolutionary Computation class while in college was the investigation and a 'solution/improvement' to the code growth problem.
geek page at KY speaks
It depends on what you're measuring.
While Quicksort is O(nlogn) in typical use, its worst case performance is O(n), although most serious implementations would use a variation such as Introsort to limit the worst case behaviour.
Mergesort is O(nlogn) in all cases. However, the constant factor is usually significantly greater than Quicksort, making the latter preferable for most uses.
Shell sort is significantly slower for most situations, though IIRC it's pretty good with data that's already almost sorted.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Trust me, you do not want to suddenly unleash an efficient algorithm for factorising into large primes on the world. For a start, security as we know it would more-or-less disappear overnight, if only because so much important data has been encrypted using algorithms that assume such factorisation to be hard.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
- A lot of simplification of a program can be automated. Some of that simplification work will require significant work to implement, but it's doable. Other parts of the simplification can be done by changing the fitness function once you have an efficient algorithm: In addition to performance or whatever criteria used, you add measurements for program size and complexity as secondary influences on the fitness.
- Robustness of the language the program is generated in can be strengthened. Typically GP uses custom domain specific languages. This both simplifies understanding, and prevent crashes.
- The experiments can be rerun with additions to the language used that group constructs that are complex but understood in order to attempt to simplify the solutions. The population can be steered towards using these constructs by rewarding them in the fitness function.
All in all, GP is very promising for certain types of problems, but it shouldn't be understood at "press this button and out comes a polished working program", rather as a tool to approach problems that are hard to analyse (especially where there is no 100% "correct" solution, or a correct solution is extremely hard to find, but gradual approximation has the potential of working well enough). GA and GP also has great potential wherever requirements change slowly over time and the software needs to adapt, as a way of preventing manual tweaking.
Imagine a lift control GA, for instance (actually used as an example in Dr Dobbs some time ago), where a near optimal solution could be used as a starting point for the lifts behavior. However "traffic patterns" in a building change - new tenants move in, old ones move out, departments move and create changes in which floors have the heaviest traffic, and which floors the traffic is between, and a GA could quite possibly be used to automatically adapt to such changes.
On a sidenote, before anyone raise concerns over letting a "random" program run a lift, the whole point here is that a GA isn't random. You use a hard coded algorithm that enforce rules for what is acceptable and what not. For a lift, of course, you'd have low level requirements such as breaking at a specific speed, and high level requirements such as newer changing direction before all the floors requested in the current direction have been visited.
The GA will only adapt the parameters you choose to surrender control of, such as whether or not the lifts should move when not requested (some of them moving to a specific floor for lunch time, for instance), and how long to wait before doing so.
I truly and sincerely believe that you should think about that claim, and then think a whole lot harder about what you'd like to see replace it. There are plenty of places in the world, still, where the protections that we take for granted don't exist. I'm guessing you wouldn't like to be there a whole lot, and releasing all the stuff that's currently protected is a one-way trip.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
You've seen and heard a lot about terrorism lately, I assume. When the intelligence agencies screw up, and something like Sept 11 happens, you hear all about it. How many terrorist incidents do you think such agencies prevent for every one that happens? I'm betting it's quite a few.
I realise that a lot of intelligence officers just read the local papers and use their brains, but I'm guessing that on-the-ground in-with-the-bad-guys people are still pretty important as well. If you blow the cover of every single intelligence officer in the world, who do you think it's going to hurt more, us or the bad guys?
Next up... You have a bank account, right? How would you feel about not only letting anyone who wanted to see the details, but letting any script kiddie with half a brain play with the money itself? If you remove the encryption that's used when things like transfers are done electronically, what exactly is going to stop them?
Now, I don't know whether either of these specific things is a valid example. I kinda assume that military and serious finance types have more than just a couple of prime numbers to protect their data. Then again, screw ups do happen, so maybe it's not so much more. It doesn't really matter anyway. Even if these exact things wouldn't crumble if you released an efficient prime factorising algorithm, there are numerous other things along the same lines that are bound to. RSA is a well-known and widely used algorithm that would be cracked instantly, for a start.
Now, I don't know about you, but personally I rate my personal safety, the security of my finances, the confidentiality of my health records and such pretty highly. Right about now, they're reasonably protected as far as I know. If you release something that cracks such a popular and fundamental encryption technique overnight, that all disappears. Me, I reckon I'd prefer living in the West to living in a third world country where the only people who have any control are the guys with the biggest guns, but maybe I'm just dumb like that...
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Unfortunately, the sort of society you're describing couldn't come about overnight. Those sorts of changes would take many decades to get right even starting from an organised beginning. In the period that would follow the breakdown of encryption, I suspect what you'd actually see would be a period of panic and relative anarchy, a clampdown into police states to control it, and subsequently many years of corruption in governments desperate to retain control until they could reestablish the mechanisms that were there before using alternative means. The entire game doesn't change at all -- breaking one crypto algorithm isn't anything like enough to catalyse that -- it just gets really ****ty for a while.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.