Question

I have a file from which I am reading the data. I need advice on how to design the data structure which does the following: So, the data is of form

id_1::id_2::similiarity_score

现在,虽然数据形式如此,但也意味着:

id_2::id_1::same_similiarity_Score

So, what I want is a datastructure which when I use in program. So lets say I want to use this data in order to find which two items are similar

object.maxSimiliarity(object_id_1)
returns object_id_2 # has max score

但是,该物体_id_1也可在数据库中产品_id_2栏......

在数据库中,可以有以下两种形式:

 object_id_1:: object_id_2::score
 or object_id2::object_id_1::score

因此,我不想以这样的方式设计这种数据结构。

k_1, k_2:: value <--> k_2,k_1::value

Answer 1

It seems to me that you could use the scores to build lists of best to worst matches:

d = {
      id1 :  [id_best_match_to_id1, id_next_best_match_to_id1, ..., id_worst_match_to_id1],
      id2 :  [id_best_match_to_id2, id_next_best_match_to_id2, ..., id_worst_match_to_id2],
     ...
}

如果需要保留相似的分数,则采用<代码>(id_best_match_to_id1, 相似性_score_to_id1)的标记清单。

我看不出利用这种相似性的方法,即:<条码>模拟(x,y)=模拟(y,x)<代码>。

Answer 2

这类事情的一般trick计是找到一种通俗化,这一功能将特定阶层的所有成员描绘成同一个目标。在此情况下,您可以通过对头两个部分进行分类来实现:B:A:核心到A:B:核心,而A:B:核心。

Answer 3

The data look very much like nodes and edges of a weighted graph. If a is similar to b with a score 5.0, and similar to c with a score 1.0, you might visualise it thus:

                                    a
                                   / 
                                  /   
                                5.0   1.0
                                /       
                               b         c

Networkx is a python lib that provides ready-made graph objects and algorithms. Loading up your data into a weighted multigraph (that is, it supports multiple connections between nodes A--B and B--A is trivial. After that, getting the most similar object given an object id is a case of finding the node, finding it s most weighted edge and returning the node at the end of it.

import networkx as nx

## Test data
data = """
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split( :: ) for row in data.split())


class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn the row into data suitable for networkx graph
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get edges from obj_id node
        edges = self.g.edges_iter(obj_id, data=True)
        ## Sort by weight, get first, get joined node
        return sorted([(i[0], i[1], i[2].get( weight , 0)) for i in edges])[-1][1]


sc = Similarity(rows)
sc.most_similar( a ) ##  c 
## Add some more data linking a --> f with a high score
sc.load([( a ,  f , 10)])
sc.most_similar( a ) ##  f

友情链接