Advanced Data Structures?
mdf356 asks: "It's been 5 years since I left graduate school and started designing and writing software for a living. After 5 years of writing operating systems code, I feel like I've forgotten some of the more advanced data structures I used to know. The next time an interesting problem arises, I'd like to have more in my toolbox than hashes, linked lists, heaps and various binary and n-way trees. I'd like something short and sweet, more in the line of the standard C book. Algorithm Design by Kleinberg and Tardos looks likely to be too basic, but I haven't read it (I'd like to avoid paying $90 for something that won't meet my needs). CLR is far too large and almost exclusively covers basic territory. Tarjan's Data Structures book looks like it has potential, but seems focused on network algorithms, which are unlikely to be applicable to the kernel programming I do. What are some good reference books on more advanced data structures and algorithms, particularly ones with potential applicability to an operating systems kernel?"
Well, let's be fair, when you get to advanced data structures, it's kind of up to you to really do all the imagining that makes them great. Don't dismiss CLR too quickly, as the book has a lot of great concepts that, if you're serious about algorithms and data structures, are quite useful as a starting toolbox.
Honestly though, when I went through college, the very last thing they tried to shove down my throat wasn't advanced data structures but instead Design Patterns. I'm not talking about algorithm design but instead just designs in general. This book might not do it in C but the ideas are pretty language independent. It's your 'cookbook' of designs laid out in class diagrams and sometimes more documentation is provided. Anyways, what I would actually suggest is that you combine a book that covers C structures with book and what you would have is a way for you to make a set of complex classes that emulate a proxy. Or you have an adapter class that is a "advanced data structure." I guess what I'm trying to say is that I don't see the point of paying attention to advanced data structures when it's really the way these data structures interact that is complex and important.
The big problem is that, in order to make design patterns work for you, you have to go through a lot of documentation and diagrams. Take this for example, you're writing an OS code and there's a part of the operating system that's going to be changing frequently depending on processor (I don't know a lot about OS development), let's say the scheduler. So what do you do? Do you just willy-nilly throw the scheduler and start going? Well, what would be best is that if you encapsulated the scheduler inside a package or library so that it could easily be exchanged for another one. You try to keep it small and modularize the classes and data structures that change from those that don't. Then, a new processor comes out that might demand a new scheduler and you open up this mini-project, edit/debug it and roll it out as something that just replaces the classes and libraries on all your deployed versions. Sounds like common sense, right? Well, for some it is and for some it isn't but to me this interaction is a complication of data structures and I think that the answer you're looking for isn't necessarily a book called "advanced data structures" just a few that explain how they work.
My work here is dung.
Algorithms in C, Parts 1-5 (Bundle): Fundamentals, Data Structures, Sorting, Searching, and Graph Algorithms (3rd Edition) by Robert Sedgewick
The Art of Computer Programming, by Donald E. Knuth
Two seminal works on the subject. It really depends on your needs however. Nearly anything you would need can be built off of the subjects in those books, but for more specialized datastructures you will probably have to move to specialized books.
You use linked lists in your kernel?!?
"No matter where you go, there you are." -- Buckaroo Banzai
WTF? People who try to move cleverness in low-level critical code should be shot. Look, there's a time and a place for everything, but this is the wrong approach for the problemspace you're working in. You didn't say which kernel you're working with, but odds are extremely good that it already comes with a set of pre-defined macros or functions for handling data structures. Use them. They're much better tested and understood than the stuff you want to play with, and much less likely to make your codevelopers hate you.
I don't want to be a stick in the mud, but seriously, this just isn't the place to start experimenting.
Dewey, what part of this looks like authorities should be involved?
I was taught Computer Algorithms / C++ by Horowitz, Sahni and Rajasekaran( I believe they have a pseudocode version) and it was great. The seemed to be a perfect balance between the implementation and the ideas behind them.
Data Structures and Algorithms (ISBN 0201000237) by Aho, Ullman and Hopcroft is a classic that belongs on every programmer's bookshelf (hell, keep it on your desk!).
And most programmers do. Read some code out in the industry and you'd think that 90% of the programmers out there never took a data structures class. Even if you're using a language like Java or C++ you really should know the implementation details of each so you can make informed decisions about the trade-offs in performance and maintainability. If you can't visualize the layout of your data and how you'll be accessing it, you really won't be able to pick the best structure for the job (Or roll your own if none of the usual ones fit.)
I can't think of a C project I've worked on where I didn't have to write a basic data structure like a stack or a linked list. On a couple of them the programmer passed data around as char data and used offsets to get at the data in them. On one, the programmer didn't even know about structs (Or null string termination.)
Having a "Convenient" language really isn't much of an excuse for not knowing this stuff. My data structures prof deliberatly chose an inconvenient language (BASIC) to show data structure implmentations. You can implement any structure in terms of array if you have to.
It'd be nice if more programmers would actually put some thought into their code before they go and start writing stuff. I wouldn't seek to discourage that...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Network algorithms are useful in an astoundishing number of places, especially in those cases where all flows are either 0 or 1. You just need to expand you idea about what the networks and flows represent.
E.g., my professor used network algorithms when writing the software that matches medical schools and students. Each school prioritizes it's candidate students, each student prioritizes his schools. The program finds the best matches. He said it had taken many hours (12 hours?) to do originally, but with the network flow it was a few minutes on an old kaypro.
Another example was determining when teams are mathematically eliminated from playoffs. It sounds like a simple problem, but it isn't since a team could lose a game yet remain in the running because a different team lost its own game.
A decade ago these algorithms would be overkill for a kernel. Today... I can definitely see how these algorithms could make a difference in some places.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
"Lisp uses only nested linked lists, that can be interpreted as binary trees."
As much as people keep saying it, this is totally false... Common Lisp does not even have lists as an intrinsic type. What it does have are cons cells, several kinds of numbers, symbols, characters, strings, several kinds of vectors and arrays, hash tables, structures, instances, and lexical closures among other types native to the language.
Lists are to Lisp what Strings are to C: a convention built upon more basic structures. In the case of C, a string is a convention built up on null terminated arrays of characters. In the case of Lisp, a list is a series of chained together cons cells.
Yeah, you pretty much eliminate all tree structures, and hence all graphs structures, once you demand better than O(lg n) performance.
The only way to improve on this in slower data structures is to use radix-style keying to cheat, so that your lookup keys tell you exactly how to find the value you're looking for. But this isn't really an "advanced" data structure, just an optimization of a basic one. (In fact, a lot of "advanced" data structures are just such application-specific optimizations; radix keying is obviously dependent on domain knowledge of your data set.)
Almost all interesting data structures (in terms of implementation; lists are good enough for 90% of cases) are trees, or sometimes graphs, and even those are rarely needed in the majority of applications. The exceptions are probably so specialized that you'll never find a single book to cover even an interesting subset. Your best bet at that point is a focused literature search (if you have access to article databases), or perhaps a Google.
You don't really need advanced data structures so much as advanced algorithms for using those data structures. The basic mechanisms for assembling and accessing a list, tree, graph, or hash table are basically the same for all algorithms. There's some clever work on, for example, lock-free data structures for concurrent programming that would probably be useful for a kernel programmer, but that's still not a matter of new data structures, but rather having full knowledge of the concurrency model of the multiprocessor you're working with, and optimizing for it, which is why it's mostly of interest to kernel programmers, and other people coding to the metal.
A few years ago, I took Chris Okasaki's "Advanced Data Structures" course at Columbia. Looking at the Wikipedia Data Structures article others have mentioned, it's pretty good, but doesn't include, e.g. Pairing Heaps, which have excellent real-world performance. You might want to check out Chris's book Purely Functional Data Structures (ISBN 0521663504). Even though you're not working with a functional language, it's a good catalog of some recent data structures, and from what I saw in the class, he understands them at a very deep level.
I find it fascinating that a kernel developer would be concerned with O() rather than the actual runtime. I thought kernel developers in general just put an arbitrary upper bound so their "O(n^2) && n -lt 1024" was faster than the "O(log n)" for any n.
Now if what you are actually looking for is some fun how about embedding neko in the kernel, because while you can add some cool things like say tries (prefix trees) the real performance bottleneck is between apps and the kernel. Neko could safely call getpid() in the kernel 1000x more times than any user app could. This opens a lot of possibilites such as user code safely reading kernel structures directly with nearly zero overhead, functioning as event callbacks, etc.
As an example, take something like beagle where it ideally gets notified of every file change and then, at some later time re-index the files. Do you really want every file modification doing a kernel-user-kernel switch? Or some hideous complicated caching API to specifically send these events all in a batch? (inotify I am looking at you...). Think a generic BPF.
Purely Functional Datastructures by Okasaki.
Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
> As much as people keep saying it, this is totally false... Common Lisp does not even have lists as an intrinsic type.
Funny how LISP stands for LISt Processing. And this thread is about data structures, not data types. The cons cell might be the central data type in LISP, but the nested list is the central data structure.
Just think about it. '(a b c) will be recognised as a 3 element list by any LISP programmer, and not as 2 cons cells plus 3 atoms (which would be equally correct). Now you can also construct binary trees with cons cells, but that is more complicated and therefore a bit rare.
Thomas
"Just think about it. '(a b c) will be recognised as a 3 element list by any LISP programmer, and not as 2 cons cells plus 3 atoms (which would be equally correct). "
Actually no... it's 3 cons cells. The syntax (a b c) is shorthand for (a . (b . ( c . ()))). This kind of misunderstanding is precisely why any competent (key word) LISP programmer understands the difference between lists and cons cells I was referring to in my original post. The closest structure with only two cons cells is this (a b . c).
Now, As long as we're being pedantic, what you actually wrote, '(a b c), is shorthand for (quote (a b c)). The numbers of cons cells and atoms in that case is a 'exercise for the reader', but suffice it to say that both your numbers are wrong.
Although mentioned in a negative light,
This technique is important when you have to move data between big-endian and little-endian processors. It lets you write portable I/O routines.
"you just made yourself look like an ass."
Dude, I was arguing on Slashdot about the number and proper usage of data types in Lisp, of all things... I think looking like an ass was pretty much a part of the deal from the get go.
"But you wear it so well! "
:-)
...... Nothing, they are semantically equivalent to *(a + b) and *(b + a).
:-)
Well, we're all 'special' in our own unique ways.
To be frank, I wouldn't have been quite so harsh (ass-like, if you wish) if he hadn't just happened to make one of mistakes that proved exactly my point about the difference between cons cells and lists.
"Based on your argument C doesn't have arrays, just pointers with some syntactic sugar."
I'd actually agree with that.... A bit of trivia: in C, what's the difference between a[b] and b[a]?
Keep in mind that I'm not saying any of this is a bad thing: in the case of Lisp lists, the added flexibilty that comes from their representation can be tremendously useful. However, it does complicate things. Just as an example, consider what happens when you define a function that traverses a list. Since Lisp lists can't be identified as being finite until you traverse them, you need to either be willing to tolerate functions that don't end when passed certain kinds of 'lists' or include extra baggage to detect when they're in an infinite list. Lisp also doesn't make the guarantee that the last 'next list' in a list is even a list at all: another edge case that's introduced by the fact that Lisp doesn't have first class lists.
Something else to consider is that Lisp doesn't even have to have cons cells at all. There's nothing to keep them from being represented internally as structures, arrays, or even closures. In other words, not only does the LISt Processing language not have lists as a first class type, it doesn't even have to have their component cons cells as a first class type. Lists can entirely be implemented using types the original poster seems to believe Lisp doesn't even have at all. Which brings me to the second reason I was an ass in my response... my experience has been that blanket statements tend to oversimplify things and can ultimately lead to bad decisions. If this guy has really reduced Lisp to nested lists, Perl to hash tables, Embedded programming to arrays, and Databases to B-trees, I question his ability to think critically in any amount of depth about software engineering problems. Of course, he could also be the best developer on the planet... After all, this is Slashdot, so you do have allow for the possibility of incompletely thought through 'pissing contest' posts. Which is the third and final reason I was somewhat an ass.
At the very least the original poster has shown the presence of mind not to keep posting in this thread..