English 中文(简体)
Why isn t randomized probing more popular in hash table implementations?
原标题:

According to various sources, such as Wikipedia and various .edu websites found by Google, the most common ways for a hash table to resolve collisions are linear or quadratic probing and chaining. Randomized probing is briefly mentioned but not given much attention. I ve implemented a hash table that uses randomized probing to resolve collisions. Assuming there is a collision, resolution works as follows:

  1. The full (32-bit) hash of an object is used to seed a linear congruential random number generator.
  2. The generator generates 32-bit numbers and the modulus is taken to determine where in the hash table to probe next.

This has the very nice property that, regardless of how many hash collisions there are in modulus space, lookup and insertion times are expected to be O(1) as long as there are few collisions in full 32-bit hash space. Because the probe sequence is pseudo-random, no clustering behavior results from modulus space collisions, unlike with linear probing. Because the entire system is open-addressed and doesn t use linked lists anywhere, you don t need to perform a memory allocation on each insertion, unlike chaining.

Furthermore, because the size of the hash is usually the size of the address space (32 bits on 32-bit machines), it is simply impossible to fit enough items in address space to cause large numbers of hash collisions in full 32-bit hash space under a good hashing scheme.

Why, then, is randomized probing such an unpopular collision resolution strategy?

最佳回答

Python s dictionary implementation does this. A very nice comment in dictobject.c says:

...
The first half of collision resolution is to visit table indices via this
recurrence:

    j = ((5*j) + 1) mod 2**i

For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...

Sure looks like a linear congruential RNG to me!

Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can t meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.

This is some of the finest code I ve ever read. It s featured in chapter 18 of Beautiful Code. So I d say you re on to something!

问题回答

One of the reasons for using linear lookup (such as double hasing) is cache locality. By making the second (rehash) function to be an addition of a small integer, most chances are that you ll hit the same cache line. It is very significant for large hashes.

Chain hashing is probably used due to its simplicity.

Possible reasons are that linear or quadratic probing

  • have the same worst-case time complexity (O(size of the table))
  • have the same best-case time complexity (O(1))
  • are easier to implement
  • are faster than a good RNG (since speed is a major selling point for hashtables)

But I m not sure. Did you implement your own hashtable with another collision resolution and compare the two under different circumstances? That would be very enlightening.

Wouldn t you have the problem that for insertions into a non-sparsely populated table there s no guarantee that you ll hit all the elements of the hash table before starting to iterate over duplicate elements?

As a result insertion time wouldn t be well defined.

I think the reason random hashing isn t used much is that hash collisions when a small hash value is computed from a 32-bit hash are apt to be rare unless there s something "wrong" with the hash function, and in that case there s a fair likelihood that all 32 bits of the hash function will match (e.g. because only part of the key was used in computing the hash). If hash functions are decent, and load factors are reasonably low, linear and quadratic probing offer good cache locality (remember that the majority of hash collisions will be resolved by looking at only one extra item, which will with both linear and quadratic probes be the one that follows the first guess). Linear probe offers somewhat better performance in the case where all keys map to the same value, and sometimes even if they map to a small number of values. Chain-bucket hashing allows easy item removal.





相关问题
XML-RPC Standard and XML Data Type

I was looking at XML-RPC for a project. And correct me if I m wrong, but it seems like XML-RPC has no XML datatype. Are you supposed to pass as a string? or something else? Am I missing something? ...

Is it exists any "rss hosting" with API for creating feeds

I am creating a desktop app that will create some reports. I want to export these reports as RSS or ATOM feeds. I can easily create feeds with Rome lib for Java. But I have no idea how to spread them. ...

Improving Q-Learning

I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start in any place in the room and get to the goal state(this might be, to the tile ...

High-traffic, Highly-secure web API, what language? [closed]

If you were planning on building a high-traffic, very secure site what language would you use? For example, if you were planning on say building an authorize.net-scale site, that had to handle tons ...

Def, Void, Function?

Recently, I ve been learning different programming langages, and come across many different names to initalize a function construct. For instance, ruby and python use the def keyword, and php and ...

热门标签