Slashdot Mirror


Probing Hash Tables?

David Rusenko asks: "I've been taking a datastructures class at CMU as part of a summer CS program. One of these structures we have gone over is hash tables. After going through many different probing methods (linear, quadratic), multiple hash functions, and double hashing, I was all too curious to know if these are the best methods currently known. Some other interesting ideas came up, such as using the Fibonacci numbers for probing, but I haven't had time to test them yet. Any comments?"

3 of 48 comments (clear)

  1. Re:Read the Bible by Tom7 · · Score: 4, Interesting

    Yes, it certainly has changed... real world considerations are much different from the memory-poor days when knuth invented mix and all of his insane bit-twiddling techniques. The theoretical analysis of basically any sort of sensible hashing will wind up resulting in constant time access... so it's really a question of addressing the constant factors on modern machines (as well as the ease in implementing these different schemes!). Knuth is quite outdated in that respect.

  2. Who cares? by Screaming+Lunatic · · Score: 5, Insightful
    It's a table. Which means it'll be faster than a list. That's all you have to worry about.

    What's that about a little bit of knowledge and it being dangerous. In school you'll beat every algorithm under the sun to death. In the real world you'll link to the STL and use a hash map.

    Write well-designed, clean, maintainable code. Then profile it. If your table lookup blips on the profiler, then *THINK* about optimizing it. After you've *THOUGHT* about optimizing it, then decide if it makes sense to squeeze the time and effort into the schedule.

    Random quotes:

    Premature optimization is the root of all evil. -- Knuth

    There is never a best solution, only tradeoffs to consider. --Eberly

    The best optimizer is between your ears. --Abrash

    The Ferrai's are only gravy, honest! --Carmack

    Alright I threw the last one in for shits and giggles. Don't blame me, I can't get through a Sunday night without drinking a poop load of beer. Anyway, the point is that there is a difference between *FAST* and *FAST ENOUGH*. If it's fast enough, who cares.

    [CYNICISM]

    Unless you're an academic and you're looking for funding.

    [/CYNICISM]

  3. shining all the light onto just one bucket by epine · · Score: 4, Interesting

    I hate to be churlish about this, but none of the early posts have addressed the core issues. Knuth's treatment is rather narrow. Buckets have very little to do with it.

    The question to address is your key structure. Ideally you have the notion of your keys in cannonical representation. In object oriented contexts, the byte representation of your objects is not necessarily cannonical.

    Next step is to analyze the size and distribution of your key space in cannonical representation, using as a function of some N which represents the scale of the problem instance.

    At this point, if your the size of your key space is a weak function of N life is easy. Weak functions are where the average bit size of your cannonical keys is logarithmic in N or at worst sqrt(N). This represents the order of key entropy extraction.

    A best case scenario is where all your keys are eight byte fully randomized GUIDs. Entropy extraction from your keys can be handled in just a few machine instructions. Worst case: you have to gather your key entropy by traversing deeply linked object hierarchies, in which case the efficiency of bucket access is swamped by the cost of constructing the bucket index.

    When life is ugly, the games begin. There are no end of variations on how you can arrange to collect enough entropy (most of the time) for each key (most of the time) quickly enough (most of the time).

    An extreme measure involves memoizing your key entropy with all the hassles that entails of making sure every modification to the key structure maintains key hash correspondence. If key modification is rare, you can really brute force this. If all your keys are always in the table, then keys can't change without rehashing (so one term starts to absorb another). On the other hand, if you have many keys to check, but few keys to check against, you might get sucked into finding clever and/or complicated methods for maintaining your key entropy memos.

    If you have concocted a model with known degeneracies, who pays on the degenerate case? This is a game of hot potato, which again presents endless options, most of which are difficult to formally analyze (supposing you even have a sufficient model of your key space).

    There are human hazards when you enter into this terrain not to be taken lightly. For some unknown reason a rather large slice of the population cannot comprehend any event more certain than 99% or less certain that 1%. pow(2,-6) is rounded up by these people to 1%, which in Murphy's calculas is as good as sunk. Nothing you can do about it. Sooner or later the pointy haired what-ifers will grind you down. Especially if they studied arithmetic for a couple of years as an undergraduate thinking it was mathematics.

    Let's suppose now that you've done the dirty deed and concocted some method to extract at least log K bits of uniform key entropy (most of the time) where K somewhere between the total number of keys you might need to check and the total number of keys you might need to store. Only now does bucket management begin to matter.

    Probably the best thing to do at this point is to slice the world into rough orders of magnitude: less than 10 CPU clock cycles to on-chip L1/L2 caches, 10-100 CPU clock cycles to on chip L3/off chip L3 cache, 100-1000 CPU cycles to external memory, 1000-10,000 cycles via interprocess communication/message passing, 10,000+ cycles for data structures not necessarily memory resident.

    With a GHz class CPU, 1000 cycles still allows you to check one million keys per second. Do you really need to hone this down another 2.5 orders of magnitude? Sometimes you can if you really want to.

    I could go on for a long time yet, but I've covered most of the ground that the rest of this thread has largely ignored.

    On a more theoretical level, it is even possible to construct perfect hashing schemes with space efficiency near the Shannon limit for data sets where the average key entropy is less than one bit. Did someone mention Markov models for speech recognition? Forget Knuth volume 3. It has a lot more to do with Knuth volume 4.