Fast counting matches between large number of integer arrays

I am wondering whether there is an efficient algorithm to count the number of matching integers between a large number of integer arrays and a reference array. The Cython implementation:

match_ints.pyx

cimport cython
from libc.stdlib cimport calloc, free

import numpy as np
cimport numpy as np

np.import_array()


@cython.wraparound(False)
@cython.boundscheck(False)
@cython.initializedcheck(False)
cdef void count_matches(int[:, ::1] target_arrays, int[::1] ref_array, int[::1] num_matches):

    cdef:
        Py_ssize_t i, j
        Py_ssize_t n = target_arrays.shape[0]
        Py_ssize_t c = target_arrays.shape[1]
        Py_ssize_t nf = ref_array.shape[0]
        # ref_array is sorted ascending, so its last element is the maximum;
        # the + 5 adds a little slack for the indicator table
        Py_ssize_t m = ref_array[nf - 1] + 5
        int * ind = <int *> calloc(m, sizeof(int))
        int k, g

    # mark every reference value in the indicator table
    for i in range(nf):
        ind[ref_array[i]] = 1

    # for each target row, count the entries flagged in the indicator table
    # (values are assumed non-negative; g < m guards only the upper bound)
    for i in range(n):
        k = 0
        for j in range(c):
            g = target_arrays[i, j]
            if g < m and ind[g] == 1:
                k += 1
        num_matches[i] = k

    free(ind)


cpdef count_num_matches(int[:, ::1] target_arrays, int[::1] ref_array):

    cdef:
        Py_ssize_t n = target_arrays.shape[0]
        int[::1] num_matches = np.zeros(n, dtype=np.int32)

    count_matches(target_arrays, ref_array, num_matches)

    return np.asarray(num_matches)

The idea here is quite simple. The reference integer array is sorted in ascending order (via the sort method). An indicator array ind is allocated whose length is the maximum value of the reference array (+ 5, to avoid indexing out of range), which is affordable because the integers involved are not large. Each integer in the reference array is treated as an index, and the corresponding position in ind is set to 1.

Then each target array is traversed: every integer in it is likewise treated as an index, and if ind at that index is 1, it is counted as a match.
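
For comparison, here is a minimal NumPy sketch of the same lookup-table idea (the function name count_num_matches_np is mine, and it assumes non-negative integers, as the Cython version does):

import numpy as np

def count_num_matches_np(target_arrays, ref_array):
    # Boolean lookup table, one slot longer than the largest reference
    # value, so index m is a guaranteed no-match slot.
    m = int(ref_array.max()) + 1
    ind = np.zeros(m + 1, dtype=bool)
    ind[ref_array] = True
    # Values >= m cannot match; clip them onto the extra False slot.
    clipped = np.minimum(target_arrays, m)
    return ind[clipped].sum(axis=1)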

The test function is in test_main_counts.py.

# test_main_counts.py
from match_ints import count_num_matches
import numpy as np


def count_num_matches_main():
    x = np.random.randint(50, 6000, size=(1000000, 40), dtype=np.int32)
    ref_x = np.random.randint(100, 2500, size=800, dtype=np.int32)

    ref_x.sort()

    return count_num_matches(x, ref_x)


if __name__ == "__main__":
    nums = count_num_matches_main()
    print(nums[:10])

The setup file.

from setuptools import setup
from Cython.Build import cythonize
import numpy as np


setup(
    ext_modules=cythonize(
        "match_ints.pyx",
        compiler_directives={
            "language_level": "3",
        }
    ),
    include_dirs=[
        np.get_include()
    ]
)
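
To build the extension in place, the standard Cython command line is:

python setup.py build_ext --inplace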

Since none of the integers are large, and there are many duplicates (in my real application, millions of arrays contain only a few thousand unique integers), is there a more suitable algorithm for this kind of problem that exploits the small number of unique integers?

Answers

In this case you can preprocess the target arrays and build an "inverted index": for every possible value, an array holding the indices of the target arrays that contain that value:

val    inverted[val]  (indices of target arrays containing val)
1      2, 5, 100, 999999
2      7, 13, 3141592
...
6000   3, 111, 222, 444, 555, 888

Now, for each value in ref_arr, increment the counters of the corresponding target arrays:

for x in ref_arr:
    for a in inverted[x]:
        num_matches[a] += 1

This should pay off for short target arrays (e.g. 40 elements each).
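
A minimal plain-Python sketch of this approach (the helper name count_matches_inverted is mine; it lists an array index once per occurrence of a value, mirroring the per-element counting of the Cython version, and deduplicates ref_arr because the indicator there was only ever 0 or 1):

from collections import defaultdict
import numpy as np

def count_matches_inverted(target_arrays, ref_arr):
    # Build the inverted index once: value -> target-array indices,
    # one entry per occurrence so duplicates in a target still count.
    inverted = defaultdict(list)
    for a, arr in enumerate(target_arrays):
        for val in arr:
            inverted[val].append(a)

    # Scan the short, deduplicated reference array and bump counters.
    num_matches = np.zeros(len(target_arrays), dtype=np.int64)
    for x in set(ref_arr):
        for a in inverted.get(x, ()):
            num_matches[a] += 1
    return num_matches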

As suggested in the comments: count occurrences with dicts, using set operations on the dict keys to find the keys common to all lists together with their minimal occurrence counts.

You can avoid some counting while building the dicts: by using a set of the keys seen so far, you can ignore values in later lists that have not been seen before:

This is plain Python, working code as a guideline; you would have to translate it to Cython/NumPy yourself.
# millions of arrays with small amount of unique ints in it
l1 = [1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, 4,4,4,4,4,4, 99] 
l2 = [1,1,1,1,1,   2,2,2,2,2,   3,3,3,3,3,   4,4,4,4,4,   98]
l3 = [1,           2,2,         3,3,3]

dicts = []
seen_keys = set(l1)  # initial keys from one list, doesn't matter which

# m lists: go through each list once -> m times O(n)
for l in [l1, l2, l3]:   # m lists result in m dicts
    curr_d = {}
    dicts.append(curr_d)

    # O(n) with n = len(list), instead of sorting with O(n*log(n))
    for i in l:
        if i not in seen_keys: continue  # skippable -> missing in earlier list
        # in plain python you can use defaultdict(int) or Counter for speedups
        curr_d.setdefault(i, 0)
        curr_d[i] += 1

# resulting dict for minimal counts of keys that are in ALL lists
total_d = {}
for d in dicts:
    # initial values 
    if not total_d:
        total_d = dict(d.items())
        continue

    # remove all things from total_d that are not in new dict
    # this will reduce runtimes the further you go as the next step has fewer updates
    diffr = total_d.keys() - d.keys()
    for remove in diffr:
        del total_d[remove]

    # reduce count to minimal for any key that is in total_d and new dict
    commn = total_d.keys() & d.keys()
    for c in commn:
        total_d[c] = min(total_d[c], d[c])  # ternary maybe faster

print(total_d)

Output:

{1: 1, 2: 2, 3: 3}
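
For example, 3 occurs six times in l1, five times in l2 and three times in l3, so its minimal count is 3; 4 is dropped because it never appears in l3, and 98 and 99 each occur in only one list.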



