I m working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I ve been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I m all ears. All right. So the problem is that I can t come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution, or is takes too long. Here is a list of the situation I m working with:
- All byte strings will be at least two bytes in length.
- The hash table will have a maximum size of 3839, and it is very likely it will fill.
- Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
- Otherwise, bytes in the string can be any value from 0 - 255 (I m working with raw byte-data of any format).
- I m working with the C language in a UNIX environment. I d prefer to stick with standard libraries, but it doesn t need to be portable to other OSs. (I.E. unistd.h is fine).
- Security is of NO concern.
- Speed is of a HIGH concern.
- The size isn t of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.