English 中文(简体)
Making a perfect hash (all consecutive buckets full), gperf or alternatives?
原标题:

Let s say I want to build a perfect hash table for looking up an array where the predefined keys are 12 Months, thus I would want

hash("January")==0
hash("December")==11

I run my Month names through gperf and got a nice hash function, but it appears to give out 16 buckets(or rather the range is 16)!

#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 18
/* maximum key range = 16, duplicates = 0 */

Looking at the generated gperf code, its hash function code does a simple return of len plus char value lookup from a 256 size table. Somehow, in my head I imagined a fancy looking function... :)

What if I want exactly 12 buckets(that is I do not want to skip over unused buckets)? For small sets as this, it really doesn t matter, but when I have 1000 predefined keys and want exactly 1000 buckets in a row?

Can one find a deterministic way to do this?

最佳回答

The only alternative to gperf I know is cmph : http://cmph.sourceforge.net/ but, as Jerome said in the comment, having 16 buckets provides you some speed benefit.

When I first looked at minimal perfect hasihing I found very interesting readings on CiteseerX but I resisted the temptation to try coding one of those solutions myself. I know I would end up with an inferior solution respect to gperf or cmph or, even assuming the solution was comparable, I would have to spend a lot of time on it.

问题回答

I was interested in the answer to this question & came to it via a search for gperf. I tried gperf, but it was very slow on a large input file and thus did not seem suitable. I tried cmph but I wasn t happy with it. It requires building a file which is then loaded into the C program at run time. Further, the program is so fragile (crashes with "segmentation fault" on any kind of mistaken input) that I did not trust it. A further Google search led me to this page, and onward to mph. I downloaded mph and found it is very nice. It has an optional program to generate a C file, called "emitc", and using it like

 mph < systemdictionaryfile | emitc > output.c

worked almost instantly (a few seconds with a dictionary of about 200,000 words) and created a working C file which compiles with no problems. My tests also indicate that it works. I haven t tested the performance of the hashing algorithm yet though.

There are many MPH solutions and algorithms, gerf doesn t yet do MPH s, but I m working on it. Esp. for large sets. See https://gitlab.com/rurban/gperf/-/tree/hashfuncs

The classic cmph has a lot of constant overhead and is only recommended for huge key sets.

There s the NetBSD nbperf and my improved variant: https://github.com/rurban/nbperf which does CHM, CHM3 and BZD, with integer key support, optimizations for smaller key sets and alternate hash functions.

There s Bob Jenkin s generator, and Taj Khattra s mph-1.2.

There are also two perl libraries to generate C lookups, one in PostgresQL (PerfectHash.pm) and one for late perl5 unicode lookups (regen/mph.pl), and a tool to compare various generators: https://github.com/rurban/Perfect-Hash





相关问题
Fastest method for running a binary search on a file in C?

For example, let s say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Tips for debugging a made-for-linux application on windows?

I m trying to find the source of a bug I have found in an open-source application. I have managed to get a build up and running on my Windows machine, but I m having trouble finding the spot in the ...

Trying to split by two delimiters and it doesn t work - C

I wrote below code to readin line by line from stdin ex. city=Boston;city=New York;city=Chicago and then split each line by ; delimiter and print each record. Then in yet another loop I try to ...

Good, free, easy-to-use C graphics libraries? [closed]

I was wondering if there were any good free graphics libraries for C that are easy to use? It s for plotting 2d and 3d graphs and then saving to a file. It s on a Linux system and there s no gnuplot ...

Encoding, decoding an integer to a char array

Please note that this is not homework and i did search before starting this new thread. I got Store an int in a char array? I was looking for an answer but didn t get any satisfactory answer in the ...

热门标签