Algorithm: A Better Way To Calculate Frequencies of a list of words

This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with one word per line, calculate the n most frequent words.

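The first and, unfortunately, only thing that comes to mind is to use a std::map. I know fellow C++ programmers will say that unordered_map would be much more reasonable.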

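I would like to know whether anything can be added on the algorithm side, or whether this basically comes down to choosing the best data structure. I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n) running time. I assume it would be complex to implement.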

Any ideas?

Answers

The best data structure for this task is a trie:

http://en.wikipedia.org/wiki/Trie

It will outperform a hash table.
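
As a rough illustration of the trie idea, here is a minimal sketch in Go. The names trieNode, Add and Count are hypothetical, and the children are kept in a map per node purely for brevity. Whether a trie really beats a hash table depends on the data and on the implementation, so it is worth measuring.

```go
package main

import "fmt"

// trieNode is one node of a hypothetical counting trie: one child per
// next byte, plus the number of inserted words that end at this node.
type trieNode struct {
	children map[byte]*trieNode
	count    int
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[byte]*trieNode)}
}

// Add walks (and extends) the path for word, then bumps its count.
func (t *trieNode) Add(word string) {
	node := t
	for i := 0; i < len(word); i++ {
		next, ok := node.children[word[i]]
		if !ok {
			next = newTrieNode()
			node.children[word[i]] = next
		}
		node = next
	}
	node.count++
}

// Count returns how many times word was added (0 if it was never added).
func (t *trieNode) Count(word string) int {
	node := t
	for i := 0; i < len(word); i++ {
		next, ok := node.children[word[i]]
		if !ok {
			return 0
		}
		node = next
	}
	return node.count
}

func main() {
	root := newTrieNode()
	for _, w := range []string{"the", "then", "the", "and", "the"} {
		root.Add(w)
	}
	// "th" is only a prefix of inserted words, so its count is 0.
	fmt.Println(root.Count("the"), root.Count("then"), root.Count("th")) // 3 1 0
}
```

Words that share a prefix share nodes, which is where any memory saving would come from.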

There are many different approaches to this question. It ultimately depends on the scenario and on other factors such as the size of the file: if the file has a billion lines, then a HashMap would not be an efficient way to do it. Here are some things you can do, depending on your problem:

  1. If you know that the number of unique words is very limited, you can use a TreeMap or, in your case, std::map.
  2. If the number of words is very large, you can build a trie and keep the counts of the various words in another data structure. This could be a heap (min or max, depending on what you want to do) of size n. That way you don't need to store all the words, just the necessary ones.
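
The size-n heap from point 2 might look roughly like the sketch below, written in Go with the standard container/heap package. It assumes the counts are already available in a plain map (used here only to keep the example short; the answer suggests a trie for that part), and the names wordCount, minHeap and topN are made up for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

type wordCount struct {
	word  string
	count int
}

// minHeap orders entries by ascending count, so the root is always the
// weakest of the current top-n candidates.
type minHeap []wordCount

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(wordCount)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// topN returns up to n entries with the highest counts, most frequent
// first. Ties are broken arbitrarily because map iteration is unordered.
func topN(counts map[string]int, n int) []wordCount {
	h := &minHeap{}
	for w, c := range counts {
		if h.Len() < n {
			heap.Push(h, wordCount{w, c})
		} else if n > 0 && c > (*h)[0].count {
			// Better than the weakest candidate: replace the root.
			(*h)[0] = wordCount{w, c}
			heap.Fix(h, 0)
		}
	}
	// Popping yields ascending counts, so fill the result back to front.
	result := make([]wordCount, h.Len())
	for i := len(result) - 1; i >= 0; i-- {
		result[i] = heap.Pop(h).(wordCount)
	}
	return result
}

func main() {
	counts := map[string]int{}
	for _, w := range []string{"a", "b", "a", "c", "a", "b"} {
		counts[w]++
	}
	for _, e := range topN(counts, 2) {
		fmt.Println(e.word, e.count)
	}
}
```

The heap never holds more than n entries, so selecting the top n from u unique words costs O(u log n) on top of the counting pass.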

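If you are only interested in the N most frequent words, and you don't need the result to be exact, then there is a very clever structure you can use. I heard about it by way of Udi Manber.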

You create an array of N elements, where each element tracks a value and a count, and you keep a counter that indexes into this array. You also keep a map from value to index in that array. Every time you update the structure with a value (such as a word from a stream of text), you first check the map to see whether that value is already in the array. If it is, you increment the count for that value. If it is not, you look at the element the counter is pointing at: if its count is zero, the new value takes over that slot with a count of 1; otherwise you decrement that count, and drop the old value from the map once the count reaches zero. The counter then advances to the next element, wrapping around at the end of the array.

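This is simple, and nothing about the algorithm makes it look as if it would yield anything useful, but on typical real data it tends to do very well. Usually, if you want to track the top N things, you may want to build this structure with a capacity of 10*N, since there will be many empty values in it. Using the King James Bible as input, this structure lists the following as the most frequent words (in no particular order):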

0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I

And here are the ten most frequently used words (in order):

0 : the ,  62600
1 : and ,  37820
2 : of ,  34513
3 : to ,  13497
4 : And ,  12703
5 : in ,  12216
6 : that ,  11699
7 : he ,  9447
8 : shall ,  9335
9 : unto ,  8912

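As you can see, it got 9 of the top 10 words right, and it did so using space for only 50 elements. Depending on your use case, the space savings here may be very useful. It is also very fast.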

Here is the implementation of top N that I used, written in Go:

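type Event string

// TopN approximates the most frequent events in a stream using a fixed
// number of slots.
type TopN struct {
  events  []Event       // event stored in each slot
  counts  []int         // count for each slot; 0 means the slot is free
  current int           // slot the rotating counter points at
  mapped  map[Event]int // event -> index of its slot
}

// makeTopN allocates a TopN with N slots.
func makeTopN(N int) *TopN {
  return &TopN{
    counts:  make([]int, N),
    events:  make([]Event, N),
    current: 0,
    mapped:  make(map[Event]int, N),
  }
}

// RegisterEvent records one occurrence of e.
func (t *TopN) RegisterEvent(e Event) {
  if index, ok := t.mapped[e]; ok {
    // Already tracked: bump its count.
    t.counts[index]++
  } else {
    if t.counts[t.current] == 0 {
      // Free slot: e takes it over.
      t.counts[t.current] = 1
      t.events[t.current] = e
      t.mapped[e] = t.current
    } else {
      // Occupied slot: decay it, and evict its event once it hits zero.
      t.counts[t.current]--
      if t.counts[t.current] == 0 {
        delete(t.mapped, t.events[t.current])
      }
    }
  }
  // Advance the rotating counter on every call, wrapping around.
  t.current = (t.current + 1) % len(t.counts)
}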

Given a file with a word in each line, calculating most n frequent numbers. ... I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n)

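If you mean the same n in both places (the n of "n most frequent" and the n of O(n)), then it is not possible. However, if you just mean time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.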

There might be probabilistic approximate algorithms with sublinear memory.
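
The answer does not name a specific algorithm. One well-known structure of this kind is the Count-Min sketch, which estimates per-word counts in a fixed amount of memory regardless of how many distinct words appear. The Go sketch below is only an illustration under that assumption: the type countMin, its methods, and the depth and width values are all made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// countMin is a small Count-Min sketch: a fixed grid of counters with
// one row per hash function.
type countMin struct {
	rows [][]uint32
}

func newCountMin(depth, width int) *countMin {
	rows := make([][]uint32, depth)
	for i := range rows {
		rows[i] = make([]uint32, width)
	}
	return &countMin{rows: rows}
}

// index picks the counter for word in the given row by hashing the row
// number together with the word (FNV-1a).
func (s *countMin) index(row int, word string) int {
	h := fnv.New64a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(word))
	return int(h.Sum64() % uint64(len(s.rows[row])))
}

// Add increments one counter per row for word.
func (s *countMin) Add(word string) {
	for r := range s.rows {
		s.rows[r][s.index(r, word)]++
	}
}

// Estimate returns the smallest of the word's counters. It never
// undercounts; hash collisions can only inflate the result.
func (s *countMin) Estimate(word string) uint32 {
	best := s.rows[0][s.index(0, word)]
	for r := 1; r < len(s.rows); r++ {
		if c := s.rows[r][s.index(r, word)]; c < best {
			best = c
		}
	}
	return best
}

func main() {
	sketch := newCountMin(4, 1024)
	for _, w := range []string{"the", "and", "the", "of", "the"} {
		sketch.Add(w)
	}
	fmt.Println(sketch.Estimate("the"), sketch.Estimate("of"))
}
```

To get the n most frequent words out of it, the estimates would typically be fed into a size-n heap as the words stream past.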




