I need to be able to search for text in a large number of zipped .txt files. The compression may be changed to something else, or may even become proprietary. I want to avoid unpacking all the files; instead I would like to compress (encode) the search string and search for it in the compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to reinvent the wheel, so does anyone know of a library that does something like this, a Huffman implementation that is already tested, or maybe a better idea?



Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

One compression technique I've seen sacrifices some space in order to make searching fast:

  • Maintain a dictionary with 2^16 entries of every word in the text. Reserve the first 256 entries for literal bytes, in case you come upon a word that isn't in the dictionary -- even though many large texts have fewer than 32,000 unique words, so they never need to use those literal bytes.
  • Compress the original text by substituting the 16-bit dictionary index for each word.
  • (optional) In the normal case that two words are separated by a single space character, discard that space character; otherwise put all the bytes in the string between words into the dictionary as a special "word" (for example, ". " and ", " and " ") tagged with the "no default spaces" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
  • Search for words or phrases by compressing the phrase in the same way, and searching for the compressed string of bytes in the compressed text in exactly the same way you would search for the original string in the original text.


Searching the compressed text this way can be faster than searching the original text, because

  • each comparison requires comparing fewer bytes -- 2, rather than however many bytes were in that word, and
  • we're doing fewer comparisons, because the compressed file is shorter.
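
The dictionary scheme described above can be sketched roughly as follows. This is only a minimal illustration, not a tested library: the names `build_dictionary`, `compress` and `find_all` are made up for this example, the tokenizer (runs of `\w` versus everything else) is an assumption, and no decoder is shown. One detail worth noting is that a plain byte search can also hit positions that straddle two 16-bit indexes, so the sketch keeps only matches at even byte offsets.

```python
import re
import struct

# Assumption: a token is a run of word characters or a run of anything else.
TOKEN_RE = re.compile(r"\w+|\W+")
FIRST_WORD_INDEX = 256  # indexes 0..255 are reserved for literal bytes


def build_dictionary(texts):
    """Give every distinct token (word or separator string) a 16-bit index."""
    dictionary = {}
    for text in texts:
        for token in TOKEN_RE.findall(text):
            if token not in dictionary:
                dictionary[token] = FIRST_WORD_INDEX + len(dictionary)
    if FIRST_WORD_INDEX + len(dictionary) > 1 << 16:
        raise ValueError("more tokens than 16-bit indexes")
    return dictionary


def compress(text, dictionary):
    """Replace each token by its big-endian 16-bit dictionary index."""
    tokens = TOKEN_RE.findall(text)
    indexes = []
    for i, token in enumerate(tokens):
        # Optional step: a single space between two words is implied, drop it.
        if token == " " and 0 < i < len(tokens) - 1:
            continue
        if token in dictionary:
            indexes.append(dictionary[token])
        else:
            # Unknown token: fall back to one literal-byte entry per byte.
            indexes.extend(token.encode("utf-8"))
    return struct.pack(f">{len(indexes)}H", *indexes)


def find_all(compressed_text, phrase, dictionary):
    """Return token positions where the compressed phrase occurs."""
    needle = compress(phrase, dictionary)
    hits, start = [], 0
    if not needle:
        return hits
    while True:
        pos = compressed_text.find(needle, start)
        if pos < 0:
            return hits
        if pos % 2 == 0:  # ignore matches that straddle two indexes
            hits.append(pos // 2)
        start = pos + 1


text = "the quick brown fox jumps over the lazy dog"
dictionary = build_dictionary([text])
compressed = compress(text, dictionary)
print(len(text), len(compressed))                   # 43 18
print(find_all(compressed, "brown fox", dictionary))  # [2]
print(find_all(compressed, "the", dictionary))        # [0, 6]
```

Note that this matches whole tokens only: searching for "the" will not find "there", which may or may not be what you want.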

Some kinds of regular expressions can be translated into another regular expression that directly finds items in the compressed file (and perhaps also finds a few false positives). Such a search also does fewer comparisons than running the original regular expression on the original text file, because the compressed file is shorter, but each comparison typically requires more work, so it may or may not be faster than the original regex operating on the original text.
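
As a rough illustration of one such translation, here is a sketch for the simple case where the regex describes a single word. It assumes the text was compressed to big-endian 16-bit dictionary indexes; the function names and the tiny hand-made dictionary are hypothetical. The word-level regex is expanded against the dictionary into an alternation of 2-byte codes, and the false positives are the matches that land on odd byte offsets.

```python
import re
import struct


def compile_for_compressed(word_pattern, dictionary):
    """Turn a regex over single words into a bytes regex over 16-bit codes."""
    word_re = re.compile(word_pattern)
    codes = [struct.pack(">H", index)
             for word, index in dictionary.items()
             if word_re.fullmatch(word)]
    if not codes:
        return None
    alternatives = b"|".join(re.escape(code) for code in codes)
    # A lookahead is used so that overlapping candidates are all reported.
    return re.compile(b"(?=(?:" + alternatives + b"))", re.DOTALL)


def search_compressed(compressed, word_pattern, dictionary):
    """Return (raw byte offsets, token positions after dropping false positives)."""
    byte_re = compile_for_compressed(word_pattern, dictionary)
    if byte_re is None:
        return [], []
    raw = [m.start() for m in byte_re.finditer(compressed)]
    tokens = [pos // 2 for pos in raw if pos % 2 == 0]
    return raw, tokens


# Hypothetical dictionary and the compressed form of
# "the colour of the color sky" (single spaces dropped).
dictionary = {"color": 256, "colour": 257, "the": 258, "of": 259, "sky": 260}
compressed = struct.pack(">6H", 258, 257, 259, 258, 256, 260)

raw, tokens = search_compressed(compressed, r"colou?r", dictionary)
print(raw)     # [2, 3, 8]  (offset 3 straddles two codes: a false positive)
print(tokens)  # [1, 4]
```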



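It is unlikely that you will be able to search for uncompressed strings in a compressed file. I think your best option is to index the files somehow. Perhaps using Lucene?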
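
If indexing is an option, the idea can be sketched with a toy inverted index built with Python's standard `zipfile` module. This is not Lucene and not a replacement for it; it is only meant to show the shape of the approach: each file is unpacked once at indexing time, and later searches touch only the index. The names `build_index` and `search` and the in-memory demo archive are made up for this example.

```python
import io
import re
import zipfile
from collections import defaultdict


def build_index(archive):
    """Map each lowercase word to the set of .txt members that contain it."""
    index = defaultdict(set)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".txt"):
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            for word in re.findall(r"\w+", text.lower()):
                index[word].add(name)
    return index


def search(index, query):
    """Return the members that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Demo archive built in memory; a path to a real .zip file works the same way.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "Huffman coding is a prefix code")
    zf.writestr("b.txt", "Lucene builds an inverted index")
    zf.writestr("notes.md", "Huffman notes that are not indexed")
demo.seek(0)

index = build_index(demo)
print(search(index, "huffman code"))  # {'a.txt'}
print(search(index, "index"))         # {'b.txt'}
```

The index only says which files contain all the words, not whether they appear as a phrase; a real search library stores positions as well.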

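I may be completely wrong here, but I don't think there is a reliable way to search for a given string without decoding the files. My understanding of compression algorithms is that the bit stream corresponding to a given string depends heavily on what comes before it in the uncompressed file. You may be able to find the encoding of a particular string in one particular file, but I am fairly sure it would not be consistent across files.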

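This is possible, and it can be done quite efficiently. There is a lot of interesting research on this problem, which is more formally known as succinct data structures. Some topics I would recommend looking into: wavelet trees, FM-index/RRR, and succinct suffix arrays. You can also search Huffman-encoded strings efficiently, as a number of publications have shown.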
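
To make the Huffman case concrete, here is a naive sketch of searching with one shared codebook. It is not one of the published algorithms and makes several simplifying assumptions: the code is character-based, the bit stream is kept as a Python string of "0" and "1" rather than packed bytes, and the function names are invented for the example. The pattern is encoded with the same codebook and searched for in the bit stream; because a bit-level match can start in the middle of a codeword, each candidate is checked against the codeword boundaries.

```python
import heapq
from collections import Counter
from itertools import count


def build_codebook(corpus):
    """Build a Huffman code (symbol -> bit string) from sample text."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), {sym: ""})
            for sym, n in sorted(Counter(corpus).items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(text, codebook):
    return "".join(codebook[ch] for ch in text)


def codeword_starts(bits, codebook):
    """Walk the bit stream and return the bit offset where each codeword starts."""
    codes = set(codebook.values())
    starts, current, begin = [], "", 0
    for i, bit in enumerate(bits):
        current += bit
        if current in codes:
            starts.append(begin)
            begin, current = i + 1, ""
    return starts


def search(bits, pattern, codebook):
    """Return symbol positions where pattern occurs in the encoded text."""
    if not pattern or any(ch not in codebook for ch in pattern):
        return []
    needle = encode(pattern, codebook)
    symbol_at = {off: i for i, off in enumerate(codeword_starts(bits, codebook))}
    hits, pos = [], bits.find(needle)
    while pos >= 0:
        if pos in symbol_at:  # drop matches that start inside a codeword
            hits.append(symbol_at[pos])
        pos = bits.find(needle, pos + 1)
    return hits


codebook = build_codebook("abracadabra alakazam")  # shared by all files
bits = encode("abracadabra", codebook)
print(search(bits, "abra", codebook))  # [0, 7]
```

Walking the codeword boundaries is itself a lightweight decode of the stream; a practical implementation would avoid that, for example by storing periodic synchronization points.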

