Fast search in compressed text files

I need to be able to search for text in a large number of zipped text files (.txt). The compression may be changed to something else, or even become proprietary. I want to avoid unpacking all the files; instead I would like to compress (encode) the search string and search for it in the compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to reinvent the wheel, so does anyone know of a library that does something like this, or a Huffman implementation that is tested, or maybe have a better idea?

Thanks in advance.

Answers

Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

One compression technique I've seen sacrifices some space in order to allow fast searches:

  • Maintain a dictionary with 2^16 entries of every word in the text. Reserve the first 256 entries for literal bytes, in case you come upon a word that isn't in the dictionary -- even though many large texts have fewer than 32,000 unique words, so they never need to use those literal bytes.
  • Compress the original text by substituting the 16-bit dictionary index for each word.
  • (optional) In the normal case that two words are separated by a single space character, discard that space character; otherwise put all the bytes in the string between words into the dictionary as a special "word" (for example, ". " and ", " and " ") tagged with the "no default spaces" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
  • Search for words or phrases by compressing the phrase in the same way, and searching for the compressed string of bytes in the compressed text in exactly the same way you would search for the original string in the original text.
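
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a tested library: the names `tokenize`, `build_dictionary`, `encode` and `find_all` are hypothetical, ids are stored as big-endian 16-bit integers, and the dictionary is assumed to be built from the same corpus that is later searched. Only encoding and searching are shown; decoding, and the "no default spaces" attribute it would need, are left out.

```python
import re
import struct

TOKEN_RE = re.compile(r"\w+|\W+")  # alternating runs of word / non-word characters
FIRST_WORD_ID = 256                # ids 0..255 are reserved for literal bytes


def tokenize(text):
    """Split text into words and separators, dropping a lone space between two words."""
    tokens = TOKEN_RE.findall(text)
    kept = []
    for i, tok in enumerate(tokens):
        # Tokens alternate word/separator, so an interior " " sits between two words.
        if tok == " " and 0 < i < len(tokens) - 1:
            continue
        kept.append(tok)
    return kept


def build_dictionary(texts):
    """Map every distinct token to a 16-bit id, starting after the literal bytes."""
    ids = {}
    for text in texts:
        for tok in tokenize(text):
            if tok not in ids:
                if FIRST_WORD_ID + len(ids) >= 1 << 16:
                    raise ValueError("too many distinct tokens for 16-bit ids")
                ids[tok] = FIRST_WORD_ID + len(ids)
    return ids


def encode(text, ids):
    """Replace each token by its id; unknown tokens fall back to literal bytes."""
    out = []
    for tok in tokenize(text):
        if tok in ids:
            out.append(ids[tok])
        else:
            out.extend(tok.encode("utf-8"))  # each byte value is an id below 256
    return struct.pack(f">{len(out)}H", *out)


def find_all(compressed, phrase, ids):
    """Return token offsets where the encoded phrase occurs in the compressed text."""
    needle = encode(phrase.strip(), ids)
    if not needle:
        return []
    hits, start = [], 0
    while True:
        pos = compressed.find(needle, start)
        if pos < 0:
            return hits
        if pos % 2 == 0:  # ignore matches that straddle two 16-bit ids
            hits.append(pos // 2)
        start = pos + 1
```

The search itself is an ordinary byte-string search (`bytes.find`); the only extra step is rejecting matches at odd byte offsets, which would straddle two ids.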

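In particular, searching for a single word usually reduces to comparing 16-bit indexes in the compressed text, which is faster than searching for that word in the original text, because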

  • each comparison requires comparing fewer bytes -- 2, rather than however many bytes were in that word, and
  • we're doing fewer comparisons, because the compressed file is shorter.

Some kinds of regular expressions can be translated to another regular expression that directly finds items in the compressed file (and perhaps also finds a few false positives). Such a search also does fewer comparisons than using the original regular expression on the original text file, because the compressed file is shorter, but typically each regular expression comparison requires more work, so it may or may not be faster than the original regex operating on the original text.
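
As one hedged illustration of such a translation, the sketch below handles only the simplest case: a regular expression that matches a single word. It is expanded into an alternation over the 16-bit ids of all dictionary words it matches, and that byte-level pattern is run directly on the compressed data. The function names and the big-endian id layout are assumptions made for this example, and separators are ignored to keep it short. Matches at odd byte offsets are the false positives mentioned above and are filtered out.

```python
import re
import struct

FIRST_WORD_ID = 256  # assumption: ids 0..255 stay reserved for literal bytes


def build_ids(words):
    """Assign a 16-bit id to each distinct word, in order of first appearance."""
    return {w: FIRST_WORD_ID + i for i, w in enumerate(dict.fromkeys(words))}


def compress(words, ids):
    """Pack the word ids as big-endian 16-bit integers."""
    return struct.pack(f">{len(words)}H", *(ids[w] for w in words))


def translate_word_regex(word_regex, ids):
    """Turn a regex on one word into a bytes regex over the ids of matching words."""
    word_re = re.compile(word_regex)
    alternatives = [
        b"\\x%02x\\x%02x" % (i >> 8, i & 0xFF)
        for w, i in ids.items()
        if word_re.fullmatch(w)
    ]
    if not alternatives:
        return None  # no dictionary word matches, so nothing can be found
    # The lookahead makes matches zero-width, so overlapping candidates are all seen.
    return re.compile(b"(?=(?:" + b"|".join(alternatives) + b"))")


def search(compressed, id_pattern):
    """Return token offsets of matches, discarding candidates that straddle two ids."""
    if id_pattern is None:
        return []
    return [
        m.start() // 2
        for m in id_pattern.finditer(compressed)
        if m.start() % 2 == 0
    ]
```

For example, with the words `the color and the colour are colorful`, the pattern `colou?r` expands to the ids of `color` and `colour`, and `search` reports token offsets 1 and 4. Patterns that span several words or separators need a more careful translation than this.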

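(In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as mentioned above; the compressed file would be smaller, but the software that processes those files would be somewhat slower and more complicated.)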
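
A rough sketch of that variant, assuming a word-level Huffman code shared by all files and, purely for readability, a bit stream represented as a Python string of "0" and "1" characters. The function names are hypothetical. It also shows where the extra complexity comes from: a bit-level match only counts if it starts on a codeword boundary, and this simplified version finds the boundaries by walking the whole stream, which costs about as much as decoding it.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a non-empty frequency table."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(words, codes):
    """Concatenate the codewords of the given words."""
    return "".join(codes[w] for w in words)


def codeword_starts(bits, codes):
    """Bit offsets at which a codeword begins, found by walking the prefix code."""
    known = set(codes.values())
    starts, start = [], 0
    for end in range(1, len(bits) + 1):
        if bits[start:end] in known:
            starts.append(start)
            start = end
    return starts


def find_phrase(bits, phrase_words, codes):
    """Return token offsets where the phrase occurs, checking codeword alignment."""
    if not phrase_words or any(w not in codes for w in phrase_words):
        return []
    needle = encode(phrase_words, codes)
    token_at = {bit: tok for tok, bit in enumerate(codeword_starts(bits, codes))}
    hits, pos = [], bits.find(needle)
    while pos >= 0:
        if pos in token_at:  # reject matches that start inside a codeword
            hits.append(token_at[pos])
        pos = bits.find(needle, pos + 1)
    return hits
```

Because the code is prefix-free, a match that starts on a codeword boundary necessarily decodes to the phrase, so the alignment check is sufficient in this sketch.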

For more advanced techniques, you could look into

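It is unlikely that you will be able to search for uncompressed strings in a compressed file. I think your best option is to index the files somehow. Perhaps using Lucene?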
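
To make the indexing idea concrete, here is a minimal sketch of an inverted index built once from the `.txt` members of a zip archive. It uses only the Python standard library and is not Lucene; the function names are made up for this example, and the text is assumed to be UTF-8. Each file is decompressed once at indexing time, and later queries only consult the index.

```python
import re
import zipfile
from collections import defaultdict

WORD_RE = re.compile(r"\w+")


def build_index(archive):
    """Map each lower-cased word to the set of .txt members that contain it.

    `archive` is a path or a file-like object accepted by zipfile.ZipFile.
    """
    index = defaultdict(set)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".txt"):
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            for word in WORD_RE.findall(text.lower()):
                index[word].add(name)
    return index


def files_containing(index, query):
    """Return the members that contain every word of the query (not necessarily adjacent)."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

This only answers "which files contain all of these words"; exact phrase matching would need word positions in the index or a verification pass over the candidate files.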

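I may be completely wrong here, but I don't think there is a reliable way to search for a given string without decoding the files. My understanding of compression algorithms is that the bit stream corresponding to a given string depends heavily on where and how that string appears in the uncompressed file. You might be able to find the encoding of a particular string in one particular file, but I'm fairly sure it would not be consistent across files.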
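
If decoding really is required, it does not have to mean unpacking the archive to disk. The sketch below, which assumes ordinary zip files and UTF-8 text, streams each member through the standard `zipfile` module and searches the decompressed bytes chunk by chunk in memory. The function name and the chunk size are arbitrary choices for this example.

```python
import zipfile


def members_containing(archive, needle, chunk_size=1 << 16):
    """Return names of zip members whose decompressed bytes contain `needle`.

    `archive` is a path or a file-like object accepted by zipfile.ZipFile.
    """
    pattern = needle.encode("utf-8")
    if not pattern:
        raise ValueError("needle must not be empty")
    overlap = len(pattern) - 1  # bytes to carry over so matches can span chunks
    found = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as stream:
                tail = b""
                while True:
                    chunk = stream.read(chunk_size)
                    if not chunk:
                        break
                    window = tail + chunk
                    if pattern in window:
                        found.append(info.filename)
                        break
                    tail = window[-overlap:] if overlap else b""
    return found
```

Only one chunk plus a short overlap is held in memory per file, so large archives can be scanned without temporary files.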

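This is possible, and it can be done quite efficiently. There is a lot of exciting research on this problem, more formally known as succinct data structures. Some topics I would recommend looking into: wavelet trees, FM-index/RRR, succinct suffix arrays. You can also search Huffman-encoded text efficiently, as a number of publications have shown.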
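
To give a flavour of one of these structures, here is a toy FM-index with backward search. It is a teaching sketch only: the suffix array is built naively, and the occurrence tables and suffix array are stored uncompressed, whereas practical implementations keep them in compact form (for example with wavelet trees or RRR bit vectors). The function names and the choice of `"\0"` as end marker are assumptions for this example.

```python
from collections import Counter


def build_fm_index(text, sentinel="\0"):
    """Build (suffix array, first-column offsets, occurrence tables) for `text`."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    text += sentinel
    # Naive suffix array construction; fine for a demonstration, too slow for big inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in the text that sort before c.
    counts = Counter(text)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch, table in occ.items():
            table[i + 1] = table[i] + (b == ch)
    return sa, first, occ


def locate(index, pattern):
    """Return the sorted start positions of `pattern`, using backward search."""
    sa, first, occ = index
    if not pattern:
        return []
    lo, hi = 0, len(sa)  # half-open range of suffixes matching the processed part
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

Each pattern character narrows the suffix range with two table lookups, so counting matches takes time proportional to the pattern length rather than the text length.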




