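If you are only interested in the N most frequent words, and you don't need the result to be exact, then you can use a rough, approximate structure. I heard about this by way of Udi Manber.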
You create an array of N elements, where each element tracks a value and a count. You also keep a counter that indexes into this array, and a map from value to index into that array.
Every time you update the structure with a value (such as a word from a stream of text), you first check the map to see whether that value is already in the array. If it is, you increment the count for that value. If it is not, you look at the element the counter is pointing at: if its count is zero, the new value takes over that slot with a count of one; otherwise you decrement its count. You then increment the counter, wrapping around at the end of the array (the implementation below does this on every update).
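This is simple, comes with no algorithmic guarantees, and seems like it would not do any good, but on typical real-world data it tends to do very well. Normally, if you want to keep track of the top N things, you might want to create this structure with a capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):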
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the actual ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
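As you can see, it got 9 of the top 10 words right, and it did so using space for only 50 elements. Depending on your use case, the space savings here may be very useful. It is also very fast.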
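As a quick check of the "9 of the top 10" figure, the two lists above can be intersected. This is only a sketch: the word lists are copied from the output above, and the helper name `overlap` is made up for illustration.

```go
package main

import "fmt"

// overlap counts how many entries of approx also appear in exact.
// The name is hypothetical; it is not part of the TopN code.
func overlap(approx, exact []string) int {
	seen := make(map[string]bool, len(exact))
	for _, w := range exact {
		seen[w] = true
	}
	n := 0
	for _, w := range approx {
		if seen[w] {
			n++
		}
	}
	return n
}

func main() {
	// Words reported by the approximate structure (no particular order).
	approx := []string{"in", "And", "shall", "of", "that", "to", "he", "and", "the", "I"}
	// The true ten most frequent words, in order.
	exact := []string{"the", "and", "of", "to", "And", "in", "that", "he", "shall", "unto"}
	fmt.Printf("%d of %d correct\n", overlap(approx, exact), len(exact))
}
```

The one miss is "I", which the structure reported in place of "unto".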
Here is the implementation of TopN that I used, written in Go:
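// Event is a value being counted, for example a word.
type Event string

// TopN approximately tracks the most frequent events in a fixed number of slots.
type TopN struct {
	events  []Event       // value stored in each slot
	counts  []int         // count for each slot; 0 means the slot is free
	current int           // rotating index of the slot to claim or decrement
	mapped  map[Event]int // value -> slot index, for values currently tracked
}

func makeTopN(N int) *TopN {
	return &TopN{
		counts:  make([]int, N),
		events:  make([]Event, N),
		current: 0,
		mapped:  make(map[Event]int, N),
	}
}

// RegisterEvent records one occurrence of e.
func (t *TopN) RegisterEvent(e Event) {
	if index, ok := t.mapped[e]; ok {
		// Already tracked: bump its count.
		t.counts[index]++
	} else {
		if t.counts[t.current] == 0 {
			// Free slot: start tracking e here.
			t.counts[t.current] = 1
			t.events[t.current] = e
			t.mapped[e] = t.current
		} else {
			// Occupied slot: decay it, and forget its value once it reaches zero.
			t.counts[t.current]--
			if t.counts[t.current] == 0 {
				delete(t.mapped, t.events[t.current])
			}
		}
	}
	// Advance the rotating index on every update, wrapping at the end.
	t.current = (t.current + 1) % len(t.counts)
}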