Frequency counts for unique values in a NumPy array

How do I efficiently obtain the frequency count for each unique value in a NumPy array?

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
Best answer

Take a look at np.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]

Then:

list(zip(ii, y[ii]))
# [(1, 5), (2, 3), (5, 1), (25, 1)]

or:

np.vstack((ii,y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])

or however you want to combine the counts and the unique values.
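
The bincount steps above can be wrapped into the freq_count function the question asks for. This is a minimal sketch, assuming a 1-D array of non-negative integers (np.bincount rejects anything else); the function body is an illustration, not the answerer's own code.

```python
import numpy as np

def freq_count(values):
    """Return (value, count) pairs for a 1-D array of non-negative integers."""
    counts = np.bincount(values)        # counts[k] = number of times k occurs
    present = np.nonzero(counts)[0]     # keep only the values that actually occur
    return [(int(v), int(c)) for v, c in zip(present, counts[present])]

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
print(freq_count(x))  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```

Note that np.bincount allocates max(values) + 1 slots, so this suits small, dense integers best.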

Answers

Use the pandas module:

>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1     5
2     3
25    1
5     1
dtype: int64

This is by far the most general and performant solution; I'm surprised it hasn't been posted yet.

import numpy as np

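def unique_count(a):
    # inverse[i] is the index in `unique` of the value a[i]
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int was removed from NumPy
    # unbuffered add, so repeated indices are all accumulated
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T

print(unique_count(np.random.randint(-10, 10, 100)))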

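Unlike the currently accepted answer, it works on any sortable datatype (not just non-negative integers) and has optimal performance; the only significant expense is the sorting done by np.unique.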
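
To illustrate the claim about sortable datatypes, here is a small sketch that applies the same inverse-index technique to negative floats, which np.bincount would reject. The helper name counts_via_inverse and the sample values are made up for this example.

```python
import numpy as np

def counts_via_inverse(a):  # hypothetical helper name
    unique, inverse = np.unique(a, return_inverse=True)
    counts = np.zeros(len(unique), dtype=int)
    # np.add.at accumulates correctly even when indices repeat
    np.add.at(counts, inverse.ravel(), 1)
    return unique, counts

values = np.array([-3.5, -3.5, 0.0, 7.25, 7.25, 7.25])
unique, counts = counts_via_inverse(values)
print(unique, counts)

# Cross-check against NumPy's built-in counting
_, expected = np.unique(values, return_counts=True)
assert np.array_equal(counts, expected)
```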

numpy.bincount is probably the best choice. If your array contains anything besides small, dense integers, it might be useful to wrap it like this:

def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)

For example:

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))
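
As a sketch of the "anything besides small, dense integers" case, the same wrapper can be run on strings. The function is repeated so the snippet is self-contained; the sample words are invented for illustration.

```python
import numpy as np

def count_unique(keys):
    uniq_keys = np.unique(keys)               # sorted distinct keys
    bins = uniq_keys.searchsorted(keys)       # map each key to its rank
    return uniq_keys, np.bincount(bins)       # ranks are small dense ints

words = np.array(["pear", "apple", "pear", "fig", "pear"])
uniq, counts = count_unique(words)
print(uniq, counts)
```

The ranks produced by searchsorted always lie in 0..len(uniq_keys)-1, which is why bincount is safe here regardless of the original dtype.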

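Even though this has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped in bins.

Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely.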

>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1]),
 array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.]))
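
For integer data, one way to make the binning explicit is to pass unit-width bin edges instead of a bin count. This is a sketch under the assumption that the input holds integers only; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])

# One unit-width bin per integer: edges x.min(), x.min()+1, ..., x.max()+1
edges = np.arange(x.min(), x.max() + 2)
counts, _ = np.histogram(x, bins=edges)

# Drop the empty bins and pair each remaining left edge with its count
values = edges[:-1][counts > 0]
pairs = list(zip(values.tolist(), counts[counts > 0].tolist()))
print(pairs)  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```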

Old question, but I'd like to provide my own solution, which turned out to be the fastest in my bench test: use a normal list instead of an np.array as input (or convert to a list first).

Check it out if you run into this as well.

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

For example,

>>>timeit count([1,1,1,2,2,2,5,25,1,1]) would return:

100000 loops, best of 3: 2.26 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))

100000 loops, best of 3: 8.8 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())

100000 loops, best of 3: 5.85 µs per loop

The accepted answer would be slower, and the scipy.stats.itemfreq solution is even worse.
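
The comparison can be reproduced outside IPython with the standard timeit module. This is a rough sketch: the loop count and labels are arbitrary choices, and absolute numbers will differ by machine and by input size.

```python
import timeit
import numpy as np

def count(a):
    # Plain-Python dictionary counter, equivalent to the loop in this answer
    results = {}
    for x in a:
        results[x] = results.get(x, 0) + 1
    return results

data = [1, 1, 1, 2, 2, 2, 5, 25, 1, 1]
arr = np.array(data)

candidates = [
    ("list", lambda: count(data)),
    ("array", lambda: count(arr)),
    ("array.tolist()", lambda: count(arr.tolist())),
    ("np.bincount", lambda: np.bincount(arr)),
]

loops = 10_000
for label, stmt in candidates:
    seconds = timeit.timeit(stmt, number=loops)
    print(f"{label:15s} {seconds / loops * 1e6:.2f} µs per call")
```

With only ten elements the measurement is dominated by call overhead, so it is worth repeating with an input of realistic size before drawing conclusions.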


More in-depth testing did not confirm the expectation stated above.

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( int )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

See the comments below on caching and other in-RAM side effects that distort the results of massively repeated tests on a small dataset.

import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))

This gives you: {1: 5, 2: 3, 5: 1, 25: 1}

To count unique non-integers, I used weave.inline to combine numpy.unique with a bit of C code:

import numpy as np
from scipy import weave

def count_unique(datain):
  """
  Similar to numpy.unique function for returning unique members of
  data, but also returns their counts
  """
  data = np.sort(datain)
  uniq = np.unique(data)
  nums = np.zeros(uniq.shape, dtype= int )

  code="""
  int i,count,j;
  j=0;
  count=0;
  for(i=1; i<Ndata[0]; i++){
      count++;
      if(data(i) > data(i-1)){
          nums(j) = count;
          count = 0;
          j++;
      }
  }
  // Handle last value
  nums(j) = count+1;
  """
  weave.inline(code,
      ['data', 'nums'],
      extra_compile_args=['-O2'],
      type_converters=weave.converters.blitz)
  return uniq, nums

Profile info

> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop

Eelco's pure numpy version:

> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop

Note

There's redundancy here (unique also performs a sort), meaning that the code could probably be further optimized by putting the unique functionality inside the C-code loop.
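
The redundancy can also be avoided in pure NumPy by sorting once and reading both the unique values and their counts off the run boundaries. This is a sketch of that idea, not the answerer's C code; the function name count_unique_sorted is hypothetical.

```python
import numpy as np

def count_unique_sorted(datain):  # hypothetical name
    data = np.sort(np.asarray(datain).ravel())   # the only sort performed
    if data.size == 0:
        return data, np.zeros(0, dtype=int)
    # True wherever a new run of equal values begins
    starts = np.concatenate(([True], data[1:] != data[:-1]))
    idx = np.flatnonzero(starts)
    # Run lengths are the gaps between consecutive run starts
    nums = np.diff(np.append(idx, data.size))
    return data[idx], nums

uniq, nums = count_unique_sorted([0.5, 2.0, 0.5, 0.5, 2.0, 9.0])
print(uniq, nums)
```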

Multi-dimensional frequency count, i.e. counting arrays.

>>> print(color_array    )
  array([[255, 128, 128],
   [255, 128, 128],
   [255, 128, 128],
   ...,
   [255, 128, 128],
   [255, 128, 128],
   [255, 128, 128]], dtype=uint8)


>>> np.unique(color_array,return_counts=True,axis=0)
  (array([[ 60, 151, 161],
    [ 60, 155, 162],
    [ 60, 159, 163],
    [ 61, 143, 162],
    [ 61, 147, 162],
    [ 61, 162, 163],
    [ 62, 166, 164],
    [ 63, 137, 162],
    [ 63, 169, 164],
   array([     1,      2,      2,      1,      4,      1,      1,      2,
         3,      1,      1,      1,      2,      5,      2,      2,
       898,      1,      1,  
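
Since the output above is cut off, here is a small self-contained sketch of the same call. The pixels array is a made-up stand-in for color_array, with six RGB rows and three distinct colors.

```python
import numpy as np

# Hypothetical stand-in for color_array: six RGB pixels, three distinct colors
pixels = np.array([
    [255, 128, 128],
    [ 60, 151, 161],
    [255, 128, 128],
    [ 61, 143, 162],
    [ 60, 151, 161],
    [255, 128, 128],
], dtype=np.uint8)

# axis=0 treats each row as one item; rows come back in sorted order
colors, counts = np.unique(pixels, axis=0, return_counts=True)
for color, n in zip(colors.tolist(), counts.tolist()):
    print(color, n)
```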

import pandas as pd
import numpy as np

print(pd.Series(name_of_array).value_counts())

from collections import Counter
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x)  # build the counter that the next line relies on
mode = counter.most_common(1)[0][0]

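Most simple problems get complicated because simple functionality like order() in R, which gives a statistical result in both ascending and descending order, is missing from various Python libraries. But if we accept that all such statistical ordering and parameters in Python are easily found in pandas, we can get a result sooner than by looking in 100 different places. Also, the development of R and pandas goes hand in hand, because they were created for the same purpose. To solve this problem I use the following code, which gets me by anywhere: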

unique, counts = np.unique(x, return_counts=True)
d = {'unique': unique, 'counts': counts}  # pass the arrays to a dictionary
df = pd.DataFrame(d)  # a dictionary can easily be passed to make a DataFrame
df.sort_values(by='counts', ascending=False, inplace=True)
df = df.reset_index(drop=True)  # optional, only if you want to use it further

You can write freq_count like this:

def freq_count(data):
    mp = dict();
    for i in data:
        if i in mp:
            mp[i] = mp[i]+1
        else:
            mp[i] = 1
    return mp
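
The same dictionary-based counting is available in the standard library as collections.Counter. Below is a minimal sketch that returns the sorted (value, count) pairs shown in the question; it assumes the input is a plain iterable of hashable, mutually comparable values.

```python
from collections import Counter

def freq_count(data):
    # Counter does the dictionary bookkeeping; sorting orders pairs by value
    return sorted(Counter(data).items())

print(freq_count([1, 1, 1, 2, 2, 2, 5, 25, 1, 1]))
# [(1, 5), (2, 3), (5, 1), (25, 1)]
```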



