Question

This question already has answers here:

Find the similarity metric between two strings (16 answers)

Closed 8 months ago.

I want to measure the similarity between two strings. en.wikipedia has examples of some similarity metrics, and code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm (and hopefully a Python library) under these constraints:

I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
False negatives are acceptable; false positives, except in extremely rare cases, are not.
This is done in a non-realtime setting, so speed is not (much) of a concern.
[Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

Answer 1

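The University of Sheffield has a great resource on string similarity metrics. It lists a variety of metrics (beyond just Levenshtein) and provides open-source implementations of them. Many of them look like they should be easy to adapt to Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmets.html

Here is part of the list: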

Hamming distance
Levenshtein distance
Needleman-Wunsch distance or Sellers Algorithm
and many more...
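
To make the first two entries concrete, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. It is not taken from the Sheffield page; the function names `hamming` and `levenshtein` are made up for illustration, and a C-backed library would be the better choice for large inputs.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Hamming distance only applies to strings of equal length, so for the multi-word strings in the question Levenshtein is the more natural of the two.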

Answer 2

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You could wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

Answer 3

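This snippet calculates the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I iterate over a TSV file in which the strings of interest occupy columns [3] and [4]. (Install the dependencies with pip install python-Levenshtein and pip install distance.)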

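import difflib

import distance      # pip install distance
import Levenshtein   # pip install python-Levenshtein

# Read the whole file and drop the empty string after the final newline.
with open("titles.tsv", "r", encoding="utf-8") as f:
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    # Columns are tab-separated; the two strings to compare are in columns 3 and 4.
    sr = row.lower().split("\t")

    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], sr[4])
    jac = 1 - distance.jaccard(sr[3], sr[4])

    print(diffl, lev, sor, jac)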
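
As a rough illustration of what the last two values measure, here is a standard-library sketch of the Sørensen and Jaccard coefficients. It assumes each string is treated as a set of characters; that is an assumption about how the distance package behaves, and the function names are hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """|A & B| / |A | B| over the sets of characters in each string."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sorensen_similarity(a: str, b: str) -> float:
    """2 * |A & B| / (|A| + |B|) over the sets of characters in each string."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


print(jaccard_similarity("night", "nacht"))    # 3 shared characters out of 7 distinct
print(sorensen_similarity("night", "nacht"))   # 0.6
print(jaccard_similarity("listen", "silent"))  # 1.0, because order is ignored
```

Because character order is ignored, anagrams score 1.0 under this reading, which matters if false positives are expensive.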

Answer 4

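I'd use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib approach, for two reasons: (1) both "fast enough" (dynamic programming) and "whoosh" (bit-bashing) C code is available, and (2) the behaviour is well understood, e.g. Levenshtein satisfies the triangle inequality and so can be used in e.g. a Burkhard-Keller tree.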
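
To show why the triangle inequality matters, here is a minimal sketch of a Burkhard-Keller tree. The `BKTree` class and the small pure-Python `levenshtein` helper are illustrative names, not part of any library mentioned in this thread.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, children), where children maps an edge distance to a node."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: a match can only sit in a subtree whose
            # edge label lies within max_distance of d, so skip the rest.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = BKTree(levenshtein)
for w in words:
    tree.add(w)
print(tree.search("book", 1))
```

The pruning step is only valid for a true metric, which is the practical benefit of point (2).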

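Threshold: treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X is to collect a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, mark the correct result for each, and calculate a cost-of-mistakes measure for various levels of X.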
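
The threshold rule can be sketched as follows. The names `is_match` and `similarity` are hypothetical, the default X of 0.9 is only a placeholder to be calibrated on your own data, and the pure-Python `levenshtein` stands in for a faster C implementation.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """The X value of a pair: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - levenshtein(s1, s2) / longest


def is_match(s1, s2, x=0.9):
    """Positive only when distance < (1 - x) * max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return True  # two empty strings are trivially equal
    return levenshtein(s1, s2) < (1 - x) * longest


a = "Hello, All you people".lower()
b = "hello, all You peopl".lower()
print(is_match(a, b))                          # True: distance 1, limit about 2.1
print(is_match("hello, world", "hi, world"))   # False: distance 4, limit about 1.2
```

Sorting a labelled sample by `similarity` is one way to carry out the calibration described above.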

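Note: I would only set the threshold as low as 0.75 if I were desperately looking for something and there were a high penalty for false negatives.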

Answer 5

Is this what you mean?

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

See the difflib documentation for get_close_matches.
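
For the multi-word, case-insensitive setting in the question, a sketch along these lines may help. The `best_match` helper and the 0.9 cutoff are illustrative choices; `n` and `cutoff` are real parameters of `difflib.get_close_matches`.

```python
from difflib import get_close_matches


def best_match(query, candidates, cutoff=0.9):
    """Return the candidate closest to query, or None if none reaches the cutoff."""
    # Compare lower-cased text so that case differences do not count as edits,
    # but keep a mapping back to the original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


candidates = ["Hello, all you people", "Goodbye, all you people", "Hello, world"]
print(best_match("hello, all You peopl", candidates))  # Hello, all you people
print(best_match("Hi, world", candidates))             # None
```

Raising `cutoff` trades more false negatives for fewer false positives, which matches the stated preference.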

Answer 6

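I know this isn't the same thing, but you can adjust the ratio to filter out strings that are not similar enough and return the closest match to the string you are looking for.

Perhaps you would be more interested in semantic similarity metrics.

I realize you said speed is not an issue, but if you are processing a lot of strings with your algorithm, the following is very helpful.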

def spellcheck(self, sentence):
    # Slower difflib equivalent:
    # return " ".join([difflib.get_close_matches(word, wordlist, 1, 0)[0] for word in sentence.split()])
    # Replace each word with the wordlist entry that has the highest Levenshtein ratio.
    return " ".join([sorted({Levenshtein.ratio(x, word): x for x in wordlist}.items(), reverse=True)[0][1] for word in sentence.split()])

It's about 20 times faster than difflib.

https://pypi.python.org/pypi/python-Levenshtein/

import Levenshtein

Answer 7

To avoid false positives, take a look at the ngramratio library.

$ pip install ngramratio

>>> from ngramratio import ngramratio
>>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended

>>> a = 'Hi there'
>>> b = 'Hit here'

>>> seq = SequenceMatcherExtended(a=a.lower(), b=b.lower())

>>> seq.ratio()
0.875
>>> seq.nratio(1)  # this replicates `seq.ratio`
0.875

>>> seq.nratio(2)
0.75

>>> seq.nratio(3)
0.5

nratio(n) only matches n-grams of length >= n.

You can pick a value of n, for example n=2, and create a function similar to the one Nadia wrote in an earlier answer.

def similar(seq1, seq2):
    return SequenceMatcherExtended(a=seq1.lower(), b=seq2.lower()).nratio(2) > 0.8

>>> similar(a, b)
False
>>> similar('Hi there', 'Hi ther')
True
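
If installing another package is not an option, the idea can be approximated with difflib alone. This is a sketch of the concept, not the ngramratio implementation: `nratio_sketch` is a hypothetical name, and it assumes that "n-grams of length >= n" means matching blocks of at least n characters.

```python
from difflib import SequenceMatcher


def nratio_sketch(a, b, n):
    """Like SequenceMatcher.ratio(), but ignore matching blocks shorter than n."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() yields (a_start, b_start, size) triples; the final
    # dummy block has size 0 and is filtered out for any n >= 1.
    matched = sum(block.size for block in matcher.get_matching_blocks()
                  if block.size >= n)
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0


a, b = "hi there", "hit here"
print(nratio_sketch(a, b, 1))  # 0.875
print(nratio_sketch(a, b, 2))  # 0.75
print(nratio_sketch(a, b, 3))  # 0.5
```

On this pair the sketch gives the same numbers as the session above, since the matching blocks are "hi", " " and "here".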
