The data look very much like the nodes and edges of a weighted graph. If a is similar to b with a score of 5.0, and similar to c with a score of 1.0, you might visualise it thus:
        a
       / \
      /   \
    5.0   1.0
    /       \
   b         c
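As a rough sketch of the same picture in code, the diagram can be written as a nested dict that maps each node to its neighbours and their scores. The names `similar` and `most_similar` are made up for this illustration and are not part of any library.

```python
# Weighted adjacency for the diagram: node -> {neighbour: score}.
# (Illustrative names only; the scores are the ones from the picture.)
similar = {
    "a": {"b": 5.0, "c": 1.0},
    "b": {"a": 5.0},
    "c": {"a": 1.0},
}


def most_similar(graph, node):
    """Return the neighbour of `node` with the highest score."""
    neighbours = graph[node]
    return max(neighbours, key=neighbours.get)


print(most_similar(similar, "a"))  # b
```

This is fine for a toy example, but it leaves loading, updating and duplicate links for you to handle by hand.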
NetworkX is a Python library that provides ready-made graph objects and algorithms. Loading your data into a weighted multigraph (that is, one that supports multiple connections between the same pair of nodes, such as A--B and B--A) is trivial. After that, getting the most similar object for a given object id is a case of finding the node, finding its most heavily weighted edge, and returning the node at the other end of it.
import networkx as nx

## Test data
data = """
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split('::') for row in data.split())

class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn each row into a (node, node, weight) tuple suitable for networkx
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get edges from obj_id node as (obj_id, neighbour, attrs) tuples
        ## (edges() on networkx 2.x; older 1.x releases called this edges_iter())
        edges = self.g.edges(obj_id, data=True)
        ## Pick the edge with the highest weight, return the node at its far end
        return max(edges, key=lambda e: e[2].get('weight', 0))[1]

sc = Similarity(rows)
sc.most_similar('a') ## c

## Add some more data linking a --> f with a high score
sc.load([('a', 'f', 10)])
sc.most_similar('a') ## f
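
If you want a ranked list rather than a single winner, the same idea extends naturally. The following is only a sketch: `top_similar` is a hypothetical helper, not something networkx provides, and it assumes that when two nodes are linked more than once (a::b and b::a above) the highest of their scores is the one that counts.

```python
import networkx as nx


def top_similar(graph, obj_id, n=3):
    """Return up to n (neighbour, weight) pairs for obj_id, best first."""
    best = {}
    # On an undirected MultiGraph, edges(obj_id, data=True) yields one
    # (obj_id, neighbour, attrs) tuple per parallel edge.
    for _, neighbour, attrs in graph.edges(obj_id, data=True):
        weight = attrs.get("weight", 0)
        # Assumption: keep only the strongest of any parallel edges.
        if weight > best.get(neighbour, float("-inf")):
            best[neighbour] = weight
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:n]


g = nx.MultiGraph()
g.add_weighted_edges_from([
    ("a", "b", 2.0), ("b", "a", 3.0), ("a", "c", 5.0), ("b", "e", 1.0),
])
print(top_similar(g, "a"))  # [('c', 5.0), ('b', 3.0)]
```

Summing or averaging the parallel edges instead of taking the maximum would be just as easy; which is right depends on what the scores mean in your data.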