How can I efficiently get the frequency count of each unique value in a NumPy array?
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
Take a look at np.bincount:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]
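Then:
# list() is needed on Python 3, where zip returns an iterator
list(zip(ii, y[ii]))
# [(1, 5), (2, 3), (5, 1), (25, 1)]
or:
np.vstack((ii, y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])
or however you want to combine the counts and the unique values.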
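One caveat: np.bincount only accepts non-negative integers, so the snippet above raises an error on arrays with negative values. A minimal sketch of one workaround, assuming integer input: shift everything by the minimum before counting, then shift back. The helper name `bincount_shifted` is made up for this example.

```python
import numpy as np

def bincount_shifted(x):
    """Count occurrences in an integer array that may contain negative values."""
    x = np.asarray(x)
    if x.size == 0:
        return np.empty((0, 2), dtype=np.intp)
    offset = x.min()                 # smallest value maps to bin 0
    y = np.bincount(x - offset)      # counts for every integer in [min, max]
    ii = np.nonzero(y)[0]            # keep only the values that occur
    return np.vstack((ii + offset, y[ii])).T

pairs = bincount_shifted([-3, -3, 0, 2, 2, 2])
print(pairs)
```

Note that memory use is proportional to the range max - min, so this only makes sense when the values are reasonably dense.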
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
print(np.asarray((unique, counts)).T)
# [[ 1  5]
#  [ 2  3]
#  [ 5  1]
#  [25  1]]
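If you want exactly the list-of-tuples output from the question, the same call can be wrapped in a small function. This is just a sketch; `freq_count` is the name used in the question, not an existing NumPy function.

```python
import numpy as np

def freq_count(x):
    """Return [(value, count), ...] sorted by value."""
    unique, counts = np.unique(x, return_counts=True)
    # tolist() converts NumPy scalars to plain Python numbers
    return list(zip(unique.tolist(), counts.tolist()))

result = freq_count(np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1]))
print(result)  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```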
In [4]: x = np.random.randint(0, 101, 10**6)
In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop
In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Use this:
>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
array([[ 1, 5],
[ 2, 3],
[ 5, 1],
[25, 1]])
Original answer:
Use scipy.stats.itemfreq (warning: deprecated):
>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[ 1., 5.],
[ 2., 3.],
[ 5., 1.],
[ 25., 1.]])
I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:
y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T
is by far the fastest. (Note the log-scaling.)
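The plot is not reproduced here. If you want to repeat the comparison on your own machine, a rough sketch with the standard-library timeit (not perfplot) could look like the following. The array size and value range are arbitrary assumptions, and no particular timings are claimed.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=100_000)  # small non-negative ints, as bincount requires

def with_bincount(a):
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T

def with_unique(a):
    return np.asarray(np.unique(a, return_counts=True)).T

# Sanity check: both approaches must produce the same table
same = np.array_equal(with_bincount(a), with_unique(a))

for fn in (with_bincount, with_unique):
    seconds = timeit.timeit(lambda: fn(a), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```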
Using the pandas module:
>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1 5
2 3
25 1
5 1
dtype: int64
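This is by far the most general and performant solution; surprised it hasn't been posted yet.
import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int was removed in NumPy 1.24
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T

print(unique_count(np.random.randint(-10, 10, 100)))
Unlike the currently accepted answer, it works on any data type that is sortable (not just positive ints), and it has optimal performance; the only significant expense is the sorting done by np.unique.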
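To illustrate the point about sortable data types, here is a small sketch applying the same return_inverse idea to strings and to negative floats. It swaps np.add.at for np.bincount on the inverse indices, which yields the same counts; the helper name `unique_count_any` is made up, and it returns two arrays instead of stacking them so that string values are not mixed with integer counts.

```python
import numpy as np

def unique_count_any(a):
    """Unique values and their counts for any sortable dtype."""
    unique, inverse = np.unique(a, return_inverse=True)
    # inverse holds, for each element, the index of its value in `unique`
    counts = np.bincount(inverse.ravel(), minlength=len(unique))
    return unique, counts

words, word_counts = unique_count_any(np.array(["b", "a", "b", "c", "b"]))
vals, val_counts = unique_count_any(np.array([-1.5, 2.0, -1.5]))
print(words, word_counts)
print(vals, val_counts)
```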
numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it something like this:
def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)
For example:
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1, 2, 5, 25]), array([5, 3, 1, 1]))
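Even though it has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped in bins.
Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely.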
>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1]),
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22.,
23., 24., 25.]))
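As a variation on the above (my own sketch, not part of the original answer), you can pass explicit integer bin edges so that every integer gets its own unit-width bin, and then drop the empty bins. This still assumes integer data.

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])

# Edges min, min+1, ..., max+1: one unit-width bin per integer value
edges = np.arange(x.min(), x.max() + 2)
counts, _ = np.histogram(x, bins=edges)

nz = np.nonzero(counts)[0]  # indices of the non-empty bins
pairs = list(zip(edges[nz].tolist(), counts[nz].tolist()))
print(pairs)  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```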
Old question, but I'd like to provide my own solution, which turned out to be the fastest in my bench test: use a normal list instead of an np.array as input (or convert to a list first).
Check it out if you run into this as well.
def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results
For example,
>>> timeit count([1,1,1,2,2,2,5,25,1,1])
would return:
100000 loops, best of 3: 2.26 µs per loop
>>> timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))
100000 loops, best of 3: 8.8 µs per loop
>>> timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())
100000 loops, best of 3: 5.85 µs per loop
The accepted answer would be slower, and the scipy.stats.itemfreq solution is even worse.
More in-depth testing did not confirm the formulated expectation.
import numpy as np
from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()
# np.int was removed in NumPy 1.24; the builtin int is equivalent here
aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( int )
aDataSETasLIST = aDataSETasARRAY.tolist()
import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount( anObject )
    return
aZmqSTOPWATCH.start();np.bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L
aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L
aZmqSTOPWATCH.start();count( aDataSETasLIST );aZmqSTOPWATCH.stop()
148609L
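See the comments below on cache and other in-RAM side effects, which strongly influence the results of massively repeated tests on a small dataset.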
import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))
This gives you: {1: 5, 2: 3, 5: 1, 25: 1}
To count unique non-integers, similar to Eelco Hoogendoorn's answer but considerably faster (a factor of 5 on my machine), I used weave.inline to combine numpy.unique with a bit of C code:
import numpy as np
from scipy import weave

def count_unique(datain):
    """
    Similar to numpy.unique function for returning unique members of
    data, but also returns their counts
    """
    data = np.sort(datain)
    uniq = np.unique(data)
    nums = np.zeros(uniq.shape, dtype='int')

    code = """
    int i,count,j;
    j=0;
    count=0;
    for(i=1; i<Ndata[0]; i++){
        count++;
        if(data(i) > data(i-1)){
            nums(j) = count;
            count = 0;
            j++;
        }
    }
    // Handle last value
    nums(j) = count+1;
    """
    weave.inline(code,
        ['data', 'nums'],
        extra_compile_args=['-O2'],
        type_converters=weave.converters.blitz)
    return uniq, nums
Profile info
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco's pure numpy version:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
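Note
There's redundancy here (unique also performs a sort), meaning that the code could probably be further optimized by putting the unique functionality inside the C-code loop.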
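For what it's worth, the double sort can also be avoided without any C code. The following is a sketch in plain NumPy (not the author's code, and the function name is made up): sort once, then locate the positions where the sorted value changes and take differences between those positions. No timings are claimed for it.

```python
import numpy as np

def count_unique_sorted(datain):
    """Unique values and counts from a single sort, no second pass by np.unique."""
    data = np.sort(np.asarray(datain).ravel())
    if data.size == 0:
        return data, np.zeros(0, dtype=np.intp)
    # Index of the first element of every run after the first one
    boundaries = np.flatnonzero(data[1:] != data[:-1]) + 1
    starts = np.concatenate(([0], boundaries))
    # Run length = distance between consecutive run starts (and the array end)
    counts = np.diff(np.concatenate((starts, [data.size])))
    return data[starts], counts

uniq, nums = count_unique_sorted([0.5, 0.5, 1.5, -2.0])
print(uniq, nums)
```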
Multi-dimensional frequency count, i.e. counting whole arrays (rows).
>>> print(color_array )
array([[255, 128, 128],
[255, 128, 128],
[255, 128, 128],
...,
[255, 128, 128],
[255, 128, 128],
[255, 128, 128]], dtype=uint8)
>>> np.unique(color_array,return_counts=True,axis=0)
(array([[ 60, 151, 161],
[ 60, 155, 162],
[ 60, 159, 163],
[ 61, 143, 162],
[ 61, 147, 162],
[ 61, 162, 163],
[ 62, 166, 164],
[ 63, 137, 162],
[ 63, 169, 164],
array([ 1, 2, 2, 1, 4, 1, 1, 2,
3, 1, 1, 1, 2, 5, 2, 2,
898, 1, 1,
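Since the output above is cut off, here is a tiny self-contained sketch of the same call. The `pixels` array is made-up data, not the `color_array` from this answer.

```python
import numpy as np

# Four RGB "pixels"; one colour appears twice
pixels = np.array([[255, 128, 128],
                   [  0,   0,   0],
                   [255, 128, 128],
                   [ 60, 151, 161]], dtype=np.uint8)

# axis=0 treats each row as one item; unique rows come back in sorted order
rows, counts = np.unique(pixels, axis=0, return_counts=True)
print(rows)
print(counts)
```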
import pandas as pd
import numpy as np
print(pd.Series(name_of_array).value_counts())
from collections import Counter
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x)  # maps each value to its count
mode = counter.most_common(1)[0][0]
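Most simple problems get complicated because simple functionality like order() in R, which gives a statistical result in both ascending and descending order, is missing in the various Python libraries. But if we get used to the idea that all such statistical ordering and parameters in Python are easily found in pandas, we can get to a result sooner than by looking in 100 different places. Also, the development of R and pandas goes hand in hand because they were created for the same purpose. To solve this problem I use the following code, which gets me by anywhere: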
import numpy as np
import pandas as pd
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
d = {'unique': unique, 'counts': counts}  # pass the arrays to a dictionary
df = pd.DataFrame(d)  # a dictionary can be passed directly to make a DataFrame
df.sort_values(by='counts', ascending=False, inplace=True)
df = df.reset_index(drop=True)  # optional, only if you want to use it further
You can write freq_count like this:
def freq_count(data):
    mp = dict()
    for i in data:
        if i in mp:
            mp[i] = mp[i] + 1
        else:
            mp[i] = 1
    return mp
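A short usage sketch for this dictionary approach, assuming you want the list-of-tuples format from the question. The function is repeated here (slightly condensed with dict.get) so the snippet runs on its own. Converting the array with tolist() first gives plain Python ints as keys.

```python
import numpy as np

def freq_count(data):
    mp = {}
    for i in data:
        mp[i] = mp.get(i, 0) + 1  # start at 0 for unseen values
    return mp

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
counts = freq_count(x.tolist())
pairs = sorted(counts.items())  # sort by value
print(pairs)  # [(1, 5), (2, 3), (5, 1), (25, 1)]
```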