String similarity metrics in Python [duplicate]
  • Date: 2009-09-24 11:43:00

I want to measure the similarity between two strings. en.wikipedia has examples of several string metrics, and code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm (and hopefully a Python library) under these constraints:

  1. I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
  2. False negatives are acceptable; false positives, except in extremely rare cases, are not.
  3. This is done in a non-realtime setting, so speed is not (much) of a concern.
  4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

Best answer

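There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and open-source implementations of them. Many of them look like they should be easy to adapt to Python.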

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmets.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers Algorithm
  • and many more...
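
As a rough illustration of the first two metrics in that list, here is a minimal pure-Python sketch. The function names `hamming` and `levenshtein` are chosen for this example and are not taken from the Sheffield page or from any library; the Levenshtein version is the standard dynamic-programming formulation that keeps only two rows of the table.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
```

For real workloads a C-backed library is likely to be much faster; this sketch is only meant to show what the numbers mean.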
Other answers

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

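This snippet computes the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I iterate over a TSV file in which the strings of interest are in columns [3] and [4] (pip install python-Levenshtein and pip install distance):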

import difflib

import Levenshtein
import distance

# Each row of titles.tsv is tab-separated; the two strings to compare
# are at indices 3 and 4 of the row.
with open("titles.tsv", encoding="utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4])
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)
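
To make the last two numbers less opaque, here is a small sketch of the textbook set-based definitions of the Jaccard and Sørensen-Dice coefficients, applied to the sets of characters in each string. The function names are made up for this example, and treating each string as a set of single characters is an assumption; the `distance` package may tokenize or normalize differently.

```python
def jaccard_similarity(s1, s2):
    """|A ∩ B| / |A ∪ B| over the sets of characters in each string."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


def sorensen_similarity(s1, s2):
    """2 * |A ∩ B| / (|A| + |B|) over the sets of characters."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# "night" and "nacht" share the characters n, h, t.
print(jaccard_similarity("night", "nacht"))
print(sorensen_similarity("night", "nacht"))
```

Both measures ignore character order and repetition, which is one reason they can report a high similarity for strings that an edit-distance measure would consider far apart.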

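I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) "fast enough" (dynamic programming algorithm) and "whoooosh" (bit-bashing) C code is available, and (2) well-understood behaviour, e.g. Levenshtein satisfies the triangle inequality and so can be used in e.g. a Burkhard-Keller tree.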
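
To show what "takes transpositions into account" means in practice, here is a minimal sketch of the optimal string alignment variant of Damerau's distance (sometimes called restricted Damerau-Levenshtein). The name `osa_distance` is made up for this example, and this is a plain-Python illustration rather than the fast C code the answer has in mind. Note that this restricted variant is not guaranteed to satisfy the triangle inequality.

```python
def osa_distance(s1, s2):
    """Edit distance where swapping two adjacent characters costs 1.

    Optimal string alignment variant: each substring may be edited
    at most once, so it can differ from the unrestricted Damerau distance.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" <-> "ba"
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("appel", "apple"))
```

Plain Levenshtein distance would count the swapped "el"/"le" as two edits, whereas this variant counts it as one.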

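Threshold: treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X is to collect a sample of matched pairs, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, fill in the correct result for each, and calculate some cost-of-mistakes measure for various levels of X.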
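
The threshold rule can be sketched as follows. The helper names `is_positive` and `implied_similarity` are hypothetical, and the edit distance is passed in as a plain integer so that any distance function (Levenshtein, Damerau, and so on) can be plugged in.

```python
def is_positive(dist, s1, s2, x):
    """Apply the rule: distance < (1 - X) * max(len(s1), len(s2))."""
    return dist < (1 - x) * max(len(s1), len(s2))


def implied_similarity(dist, s1, s2):
    """The X value a pair scores, i.e. 1 - distance / longest length.

    Useful for the calibration step: compute X for a sample of pairs,
    sort by X, and inspect where false positives start to appear.
    """
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - dist / longest


# "ape" vs "apple" has Levenshtein distance 2 (two insertions).
print(is_positive(2, "ape", "apple", 0.75))   # False: 2 is not < 1.25
print(is_positive(2, "ape", "apple", 0.5))    # True: 2 < 2.5
print(implied_similarity(2, "ape", "apple"))
```

Because the bound scales with the longer string, the same X tolerates more edits in long multi-word strings than in short ones.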

Note: I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high penalty for false negatives.

Is this what you mean?

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
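
Since the question wants very few false positives, it may help to know that `get_close_matches` accepts `n` (maximum number of results) and `cutoff` (minimum ratio, default 0.6). The wrapper below is a sketch; `best_match` is a name made up for this example, and the right cutoff depends on your data.

```python
from difflib import get_close_matches


def best_match(word, candidates, cutoff=0.6):
    """Return the single closest candidate, or None if nothing
    reaches the cutoff ratio."""
    matches = get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


words = ["ape", "apple", "peach", "puppy"]
print(best_match("appel", words))              # default cutoff of 0.6
print(best_match("appel", words, cutoff=0.9))  # stricter: no match survives
```

Raising the cutoff trades false positives for false negatives, which fits the constraints stated in the question.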

To avoid false positives, you can use the ngramratio library.

$ pip install ngramratio

>>> from ngramratio import ngramratio
>>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended

>>> a = 'Hi there'
>>> b = 'Hit here'

>>> seq = SequenceMatcherExtended(a=a.lower(), b=b.lower())

>>> seq.ratio()
0.875
>>> seq.nratio(1)  # this replicates `seq.ratio`
0.875

>>> seq.nratio(2)
0.75

>>> seq.nratio(3)
0.5

nratio(n) only counts matching n-grams of length >= n.
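
The idea can be approximated with the standard library alone. The sketch below is an assumption about how such a ratio could be computed, not the ngramratio implementation: it takes difflib's matching blocks, keeps only those of length >= n, and applies the usual 2 * M / T formula. The name `nratio_sketch` is made up for this example.

```python
from difflib import SequenceMatcher


def nratio_sketch(a, b, n):
    """Like SequenceMatcher.ratio(), but only matching blocks of
    at least n characters count towards the score."""
    matcher = SequenceMatcher(a=a, b=b)
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= n
    )
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0


a, b = "hi there", "hit here"
for n in (1, 2, 3):
    print(n, nratio_sketch(a, b, n))
```

For this pair the matching blocks are "hi", " " and "here", so dropping the one-character block and then the two-character block lowers the score step by step, consistent with the values shown above.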

You can pick a value for n, e.g. n=2, and create a function similar to the one Nadia wrote in a previous answer.

def similar(seq1, seq2):
    return SequenceMatcherExtended(a=seq1.lower(), b=seq2.lower()).nratio(2) > 0.8

>>> similar(a, b)
False
>>> similar('Hi there', 'Hi ther')
True
