Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings in a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose, so my question pertains to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.

All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution or takes too long. Here is a list of the constraints I'm working with:

  1. All byte strings will be at least two bytes in length.
  2. The hash table will have a maximum size of 3839, and it is very likely it will fill.
  3. Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
  4. Otherwise, bytes in the string can be any value from 0 to 255 (I'm working with raw byte data of any format).
  5. I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs (i.e., unistd.h is fine).
  6. Security is of NO concern.
  7. Speed is of a HIGH concern.
  8. The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
Best answer

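A trie is better suited to this kind of thing, because it lets you store your symbols as a tree and quickly walk it to match a value (or reject it).

As a bonus, you don't need to hash at all. You store, retrieve, and compare the entire sequence in one pass, while keeping only a small memory footprint.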
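To make the idea concrete, here is a minimal sketch of a byte-string trie in C. It is not the answerer's implementation. The names (`trie_init`, `trie_insert`, `trie_find`, `trie_step`), the first-child/next-sibling node layout, and the fixed-size node pool are all assumptions chosen to keep memory use low.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TRIE_NO_NODE UINT32_MAX  /* sentinel: "no such node" */
#define TRIE_NO_CODE (-1)        /* sentinel: no stored string ends here */

/* Hypothetical layout: one node per byte, siblings chained in a list. */
struct trie_node {
    uint8_t  byte;     /* byte value on the edge leading to this node */
    int      code;     /* symbol code, or TRIE_NO_CODE */
    uint32_t child;    /* index of first child, or TRIE_NO_NODE */
    uint32_t sibling;  /* index of next sibling, or TRIE_NO_NODE */
};

struct trie {
    struct trie_node *nodes;  /* nodes[0] is the root */
    uint32_t used, cap;       /* nodes in use, pool capacity */
};

/* Allocate a pool of `cap` nodes; returns 0 on success, -1 on failure. */
int trie_init(struct trie *t, uint32_t cap)
{
    assert(t != NULL && cap > 0);
    t->nodes = malloc((size_t)cap * sizeof *t->nodes);
    if (t->nodes == NULL)
        return -1;
    t->cap = cap;
    t->used = 1;
    t->nodes[0] = (struct trie_node){ .byte = 0, .code = TRIE_NO_CODE,
                                      .child = TRIE_NO_NODE,
                                      .sibling = TRIE_NO_NODE };
    return 0;
}

void trie_free(struct trie *t)
{
    free(t->nodes);
    t->nodes = NULL;
    t->used = t->cap = 0;
}

/* Return the child of `parent` reached by byte `b`, or TRIE_NO_NODE. */
uint32_t trie_step(const struct trie *t, uint32_t parent, uint8_t b)
{
    uint32_t n = t->nodes[parent].child;
    while (n != TRIE_NO_NODE && t->nodes[n].byte != b)
        n = t->nodes[n].sibling;
    return n;
}

/* Look up a byte string; returns its code, or TRIE_NO_CODE if absent. */
int trie_find(const struct trie *t, const uint8_t *s, size_t len)
{
    uint32_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n = trie_step(t, n, s[i]);
        if (n == TRIE_NO_NODE)
            return TRIE_NO_CODE;
    }
    return t->nodes[n].code;
}

/* Store `code` for a byte string; returns 0, or -1 if the pool is full.
 * On failure, already-created prefix nodes remain but carry no code. */
int trie_insert(struct trie *t, const uint8_t *s, size_t len, int code)
{
    uint32_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t next = trie_step(t, n, s[i]);
        if (next == TRIE_NO_NODE) {
            if (t->used == t->cap)
                return -1;
            next = t->used++;
            /* Push the new node onto the front of the parent's child list. */
            t->nodes[next] = (struct trie_node){ .byte = s[i],
                .code = TRIE_NO_CODE, .child = TRIE_NO_NODE,
                .sibling = t->nodes[n].child };
            t->nodes[n].child = next;
        }
        n = next;
    }
    t->nodes[n].code = code;
    return 0;
}
```

A caller would `trie_init` once, `trie_insert` each new byte string with its dictionary code, and call `trie_find` to test for membership. The sibling list trades some lookup speed for memory; if speed matters more, each node could instead hold a 256-entry child table, at a much higher per-node cost.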

Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:

  1. A smaller dictionary means smaller files, since you have to write the dictionary to your file.
  2. Fewer items can free up space for other, rarer sequences if you add a population cap and hit it with a large file.