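I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in an array varies widely, from all clear to all set. Currently I use a straightforward bit array (<code>java.util.BitSet</code>), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then decide which data structure to use for the remainder. Clearly, some data structures are better for very sparse bit arrays, and others for when roughly half the bits are set. (When most bits are set, I can use negation to treat the array as a sparse set of zeros.)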
- What structures might be good at each extreme?
- Are there any in the middle?
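To make the plan concrete, here is a minimal sketch of the "sample the first N bits, then choose" step. The class name `DensityProbe`, the `Representation` enum, and the 1/32 threshold are illustrative assumptions, not part of any existing library. The threshold comes from simple arithmetic: a list of 32-bit indexes costs 32 bits per set bit, while a plain bit array costs 1 bit per position, so the list only wins when fewer than 1 in 32 bits are set.

```java
import java.util.BitSet;

// Hypothetical helper: picks a representation from the density of a prefix.
final class DensityProbe {

    // Hypothetical labels for the three cases described in the question.
    enum Representation { INDEX_LIST, PLAIN_BITS, NEGATED_INDEX_LIST }

    // Assumption: indexes are stored as 32-bit ints, so an index list is
    // smaller than a plain bit array only below a density of 1/32.
    static final double SPARSE_THRESHOLD = 1.0 / 32;

    static Representation choose(BitSet bits, int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        // Cardinality of the first sampleSize bits (indexes 0..sampleSize-1).
        int set = bits.get(0, sampleSize).cardinality();
        double density = (double) set / sampleSize;

        if (density < SPARSE_THRESHOLD) {
            return Representation.INDEX_LIST;          // few ones: store their indexes
        }
        if (density > 1.0 - SPARSE_THRESHOLD) {
            return Representation.NEGATED_INDEX_LIST;  // few zeros: store the zeros' indexes
        }
        return Representation.PLAIN_BITS;              // in between: keep the bit array
    }
}
```

This only works as a predictor if the prefix is representative of the whole array, which is plausible when the bits are randomly distributed.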
Here are some constraints or hints:
- The bits are set only once, and in index order.
- I need 100% accuracy, so something like a Bloom filter isn't good enough.
- After the set is built, I need to be able to efficiently iterate over the "set" bits.
- The bits are randomly distributed, so run-length encoding algorithms aren't likely to be much better than a simple list of bit indexes.
- I'm trying to optimize memory utilization, but speed still carries some weight.
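For reference, the "simple list of bit indexes" baseline mentioned in the hints could look roughly like the sketch below. The class `SortedIndexList` and its methods are hypothetical names chosen for illustration. It relies on the bits being set once and in index order, so appends are the only mutation, and it costs 4 bytes per set bit (plus slack from array doubling).

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical append-only list of set-bit indexes, kept in increasing order.
final class SortedIndexList {
    private int[] indexes = new int[16];
    private int size = 0;

    // Bits arrive once and in index order, so each new index must exceed the last.
    void append(int bitIndex) {
        if (bitIndex < 0 || (size > 0 && bitIndex <= indexes[size - 1])) {
            throw new IllegalArgumentException("indexes must be non-negative and strictly increasing");
        }
        if (size == indexes.length) {
            indexes = Arrays.copyOf(indexes, size * 2); // grow by doubling
        }
        indexes[size++] = bitIndex;
    }

    int cardinality() {
        return size;
    }

    // Returns the set-bit indexes in increasing order; iterating this array
    // visits exactly the "set" bits.
    int[] toArray() {
        return Arrays.copyOf(indexes, size);
    }

    // Converts an existing BitSet using the standard nextSetBit iteration idiom.
    static SortedIndexList fromBitSet(BitSet bits) {
        SortedIndexList list = new SortedIndexList();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            list.append(i);
            if (i == Integer.MAX_VALUE) {
                break; // avoid overflow in i + 1
            }
        }
        return list;
    }
}
```

Iterating the "set" bits is then a plain loop over `toArray()`, whereas the dense `BitSet` version uses the `nextSetBit` loop shown in `fromBitSet`.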
An open-source Java implementation would be helpful, but isn't strictly necessary. I'm more interested in the fundamentals.