Question

This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with a word on each line, calculate the n most frequent words.

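The first, and unfortunately the only, thing that comes to mind is to use a std::map. I know fellow C++ programmers will say unordered_map, which would be very reasonable.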

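I would like to know whether anything can be added on the algorithmic side, or whether this is basically just a matter of picking the best data structure. I searched the internet and read that a hash table and a priority queue might provide an algorithm that runs in O(n). I assume it would be complex to implement.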

Any ideas?

Answer 1

The best data structure for this task is a trie:

http://en.wikipedia.org/wiki/Trie

It will outperform a hash table.
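To make the idea concrete, here is a minimal sketch of counting words with a trie. The names TrieNode, addWord and countOf are made up for this illustration and are not taken from any library; the per-node std::map is just one simple way to store children, and no performance claim is implied by the sketch itself.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical node type for illustration: one node per character of a word.
struct TrieNode {
    unsigned count = 0;                                  // how many times a word ended here
    std::map<char, std::unique_ptr<TrieNode>> children;  // next character -> child node
};

// Walk the trie along `word`, creating missing nodes, then bump the counter
// on the final node. Returns the updated count for that word.
unsigned addWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    return ++node->count;
}

// Return how often `word` was added, or 0 if it was never seen.
unsigned countOf(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return 0;
        node = it->second.get();
    }
    return node->count;
}
```

Words that share a prefix share nodes, so "the" and "then" reuse the same path for their first three characters. Extracting the n most frequent words still needs a second step, such as a traversal that feeds a bounded heap.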

Answer 2

There are many different approaches to this question. The choice ultimately depends on the scenario and on other factors such as the size of the file: if the file has a billion lines, then a HashMap would not be an efficient way to do it. Here are some things you can do depending on your problem:

If you know that the number of unique words is very limited, you can use a TreeMap or, in your case, std::map.
If the number of words is very large, you can build a trie and keep the counts of the words in another data structure. This could be a heap (min or max, depending on what you want to do) of size n. That way you don't need to store all the words, just the necessary ones.
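A rough sketch of the "heap of size n" part, assuming the counts have already been gathered. For brevity it counts with std::map (the first option above) rather than a trie; topN and WordCount are names invented for this example.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<unsigned, std::string>;  // (count, word), hypothetical alias

// Hypothetical helper: returns the n most frequent words, most frequent first.
std::vector<WordCount> topN(const std::vector<std::string>& words, std::size_t n) {
    // Count every word; std::map stands in for whatever counting structure is used.
    std::map<std::string, unsigned> counts;
    for (const std::string& w : words) ++counts[w];

    // Min-heap: top() is the least frequent of the candidates kept so far.
    std::priority_queue<WordCount, std::vector<WordCount>, std::greater<WordCount>> heap;
    for (const auto& entry : counts) {
        heap.emplace(entry.second, entry.first);
        if (heap.size() > n) heap.pop();  // never hold more than n candidates
    }

    // Popping yields ascending counts, so fill the result from the back.
    std::vector<WordCount> result(heap.size());
    for (std::size_t i = result.size(); i-- > 0;) {
        result[i] = heap.top();
        heap.pop();
    }
    return result;
}
```

The heap never holds more than n + 1 entries, so selecting the winners from u unique words costs O(u log n) on top of the counting pass.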

Answer 3

I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).

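You have two data items here, and you use one as the key part of the time and the other as the key the rest of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.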

Here is the general idea using Bimap:

#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>

#define elements(array) ((sizeof(array)/sizeof(array[0])))

class uint_proxy {
    unsigned value;
public:
    uint_proxy() : value(0) {}
    uint_proxy& operator++() { ++value; return *this; }
    unsigned operator++(int) { return value++; }
    operator unsigned() const { return value; }
};

int main() {    
    int b[]={2,4,3,5,2,6,6,3,6,4};

    boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;

    // walk through array, counting how often each number occurs:
    for (unsigned i=0; i<elements(b); i++) 
        ++a.left[b[i]];

    // print out the most frequent:
    std::cout << a.right.rbegin()->second;
}

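For the moment I have only printed the most frequent number, but iterating N times to print the N most frequent is a simple extension.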

Answer 4

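If you are only interested in the N most frequent words, and you don't need the result to be exact, then there is a very clever approximate structure you can use. I heard about it by way of Udi Manber.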

You create an array of N elements, where each element tracks a value and a count, and you keep a counter that indexes into this array. You also keep a map from value to index in that array. Every time you update the structure with a value (such as a word from a stream of text), you first check the map to see whether that value is already in the array; if it is, you increment the count for that value. If it is not, you look at the element the counter is pointing at: if its count is zero, the new value takes over that slot with a count of one; otherwise you decrement its count, and once the count reaches zero its value is dropped from the map. Then you advance the counter.

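It is simple, and nothing about the algorithm makes it look as though it will yield anything useful, but on typical real data it tends to do very well. Normally, if you wish to keep track of the top N things, you will want to build this structure with a capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):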

0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I

And here are the actual ten most frequent words (in order):

0 : the ,  62600
1 : and ,  37820
2 : of ,  34513
3 : to ,  13497
4 : And ,  12703
5 : in ,  12216
6 : that ,  11699
7 : he ,  9447
8 : shall ,  9335
9 : unto ,  8912

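As you can see, it got 9 of the top 10 words right, and it did so using space for only 50 elements. Depending on your use case, the space savings here may be very useful. It is also very fast.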

Here is the implementation of TopN that I used, written in Go:

type Event string

type TopN struct {
  events  []Event
  counts  []int
  current int
  mapped  map[Event]int
}

func makeTopN(N int) *TopN {
  return &TopN{
    counts: make([]int, N),
    events: make([]Event, N),
    current: 0,
    mapped: make(map[Event]int, N),
  }
}

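// RegisterEvent records one occurrence of e.
func (t *TopN) RegisterEvent(e Event) {
  if index, ok := t.mapped[e]; ok {
    // Already tracked: bump its count.
    t.counts[index]++
  } else {
    if t.counts[t.current] == 0 {
      // The current slot is free: claim it for e.
      t.counts[t.current] = 1
      t.events[t.current] = e
      t.mapped[e] = t.current
    } else {
      // The slot is occupied: decay its count and evict the occupant at zero.
      t.counts[t.current]--
      if t.counts[t.current] == 0 {
        delete(t.mapped, t.events[t.current])
      }
    }
  }
  // Advance the cursor on every event, wrapping around the array.
  t.current = (t.current + 1) % len(t.counts)
}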

Answer 5

Given a file with a word on each line, calculate the n most frequent words. ... I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n)

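If you mean the same n, then no, that is not possible. However, if you just mean time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.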

There might be probabilistic approximate algorithms with sublinear memory.
