Question

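This may be a rather vague question, but I am trying to find similar values in a list of lists. More specifically, I would like to see whether I can score these lists by how similar they are.

I know I can take one list and check whether it is identical to another, but what if the lists are not exactly the same and only have somewhat similar values (or not)?

For example: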

#Batch one 
[1, 10, 20]
[5, 15, 10]
[70, 19, 15]
[50, 40, 20]


#Batch two 
[46, 19, 8]
[6, 14, 8]
[2, 11, 44]

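I would like to rank the lists in the two batches by how similar they are to each other. I thought I could just add up all the numbers in each list and compare the totals, but I don't think that works, because [5, 6, 1000] and [600, 200, 211] would then appear similar (both sum to 1011). For example, [5, 15, 10] and [6, 14, 8] should get the highest score.

I also thought about comparing each value and looking at the percentage difference, but that seems really expensive if the lists have many variables (I may end up with thousands of lists, each with more than 800 variables).

Any suggestions?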

Answer 1

How about using Euclidean distance?

As a generator expression:

def distance(lista, listb):
    return sum((b - a) ** 2 for a, b in zip(lista, listb)) ** 0.5

Or, more explicitly:

def distance(lista, listb):
    runsum = 0.0
    for a, b in zip(lista, listb):
        # square the distance of each
        #  then add them back into the sum
        runsum += (b - a) ** 2  

    # square root it
    return runsum **.5
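
As a quick check, here is a minimal sketch that applies the same `distance` function to lists from the question. The function is restated so the snippet runs on its own, and the variable names `close` and `far` are only illustrative.

```python
def distance(lista, listb):
    # Euclidean distance: square root of the summed squared differences
    return sum((b - a) ** 2 for a, b in zip(lista, listb)) ** 0.5

# The pair the question expects to score as most similar
close = distance([5, 15, 10], [6, 14, 8])

# Two lists with identical sums (both 1011) but very different values
far = distance([5, 6, 1000], [600, 200, 211])

print(close)  # roughly 2.45
print(far)    # a bit over 1000
```

Unlike comparing totals, the distance keeps the two same-sum lists far apart.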

Answer 2

a = [1, 10, 20]
b = [5, 15, 10]
c = [70, 19, 15]
d = [50, 40, 20]

def sim(seqA, seqB):
    return sum([abs(a - b) for (a, b) in zip(seqA, seqB)])


print(sim(a, a))  # => 0
print(sim(a, b))  # => 19
print(sim(a, c))  # => 83
print(sim(a, d))  # => 79

Lower number means more similar. 0 means identical.
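
To turn this score into a ranking, one option is to sort the candidate lists by their `sim` value against a chosen list. The sketch below restates `sim` so it runs on its own; `query`, `batch_two`, and `ranked` are illustrative names, not part of the original answer.

```python
def sim(seqA, seqB):
    # Sum of absolute differences; lower means more similar
    return sum(abs(a - b) for a, b in zip(seqA, seqB))

batch_two = [[46, 19, 8], [6, 14, 8], [2, 11, 44]]
query = [5, 15, 10]

# Sort batch two from most to least similar to the query list
ranked = sorted(batch_two, key=lambda other: sim(query, other))

for other in ranked:
    print(other, sim(query, other))
# [6, 14, 8] 4
# [2, 11, 44] 41
# [46, 19, 8] 47
```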

Answer 3

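If I understand you correctly, you basically want to see which of your groups is the most tightly clustered?

So, if you think of your data as a set of points in 3D, you are trying to find the spread of each group?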

(In other words you want to compare how internally similar the two batches are?)

In that case, consider something like the following (using numpy to speed things up):

import numpy as np

def spread(group):
    return group.var(axis=0).sum()

# np.float was removed in NumPy 1.24; the builtin float is equivalent here
group1 = np.array([[1, 10, 20],
                   [5, 15, 10],
                   [70, 19, 15],
                   [50, 40, 20]], dtype=float)
group2 = np.array([[46, 19, 8],
                   [6, 14, 8],
                   [2, 11, 44]], dtype=float)

print(spread(group1), spread(group2))

So in this case, group 2 is the most tightly clustered (it has the smaller spread).

If, instead, you're interested in finding how "close" the two groups are to each other, then you could compare the distance between their centers:

legs = group1.mean(axis=0) - group2.mean(axis=0)
distance = np.sqrt(np.sum(legs**2))

Or do you want to find the two "points", one from each group, that are the closest? (In which case you'd use a distance matrix, or a more efficient algorithm for more points...)
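
For the distance-matrix case, a minimal numpy sketch could look like the following. It uses broadcasting to compute every pairwise Euclidean distance between the two groups; the names `diffs`, `dist`, `i`, and `j` are illustrative.

```python
import numpy as np

group1 = np.array([[1, 10, 20],
                   [5, 15, 10],
                   [70, 19, 15],
                   [50, 40, 20]], dtype=float)
group2 = np.array([[46, 19, 8],
                   [6, 14, 8],
                   [2, 11, 44]], dtype=float)

# Broadcasting: diffs[i, j] is the vector group1[i] - group2[j]
diffs = group1[:, np.newaxis, :] - group2[np.newaxis, :, :]

# dist[i, j] is the Euclidean distance between group1[i] and group2[j]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Row and column index of the smallest entry in the matrix
i, j = np.unravel_index(dist.argmin(), dist.shape)
print(group1[i], group2[j], dist[i, j])
```

Note that the intermediate `diffs` array has one entry per pair per variable, so with thousands of lists of 800 values each it may not fit in memory; processing the rows in chunks, or using something like `scipy.spatial.distance.cdist`, is probably a better fit at that scale.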

Answer 4

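The obvious solutions have already been posted here. Basically, they amount to computing the variance.

Since you mention percentages, though: given [1, 2, 3] and [101, 103, 105], which one would you want to score as more similar? If the first, then never mind. If the second, you will have to normalize the variance by the mean.

The solution is (SquareMean - Mean^2)/Mean^2, where SquareMean = (a^2+b^2+c^2)/3 and Mean = (a+b+c)/3.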
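
A small sketch of that formula, generalized from three values to a list of any length. The function name `normalized_variance` is made up for illustration, and the code assumes the mean is not zero.

```python
def normalized_variance(values):
    n = len(values)
    mean = sum(values) / n                         # Mean
    square_mean = sum(v ** 2 for v in values) / n  # SquareMean
    # Variance (SquareMean - Mean^2) divided by Mean^2
    return (square_mean - mean ** 2) / mean ** 2

print(normalized_variance([1, 2, 3]))        # about 0.167
print(normalized_variance([101, 103, 105]))  # about 0.00025
```

The raw variance of [101, 103, 105] is larger than that of [1, 2, 3], but after dividing by the squared mean it comes out far smaller, which is the "percentage" view of similarity.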

Answer 5

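I'm not sure exactly how, but what comes to mind is trying the standard deviation, since (in theory) similar lists would have similar deviations?

In this case [5, 15, 10] gets a (sample) standard deviation of 5 and [6, 14, 18] gets 6.1101.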
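
These figures can be reproduced with the standard library's `statistics.stdev`, which computes the sample standard deviation. The sketch below also illustrates one caveat worth keeping in mind; the variable names are illustrative.

```python
from statistics import stdev

s1 = stdev([5, 15, 10])
s2 = stdev([6, 14, 18])
print(s1)  # 5.0
print(s2)  # about 6.1101

# Caveat: shifting every value by the same amount leaves the
# standard deviation unchanged, so very different lists can tie.
s_low = stdev([1, 2, 3])
s_high = stdev([101, 102, 103])
print(s_low, s_high)
```

So a matching standard deviation says the lists are spread out similarly, not that their values are close.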
