Similar String Algorithm
  • Posted: 2009-01-16 20:34:27

I'm looking for an algorithm, or at least an operating theory, for how to find similar text in two or more different strings...

Much like the question posed here: Algorithm to find text similarity, except that my text strings will only ever be a handful of words.

Like say I have a string: "Into the clear blue sky" and I'm doing a compare with the following two strings: "The color is sky blue" and "In the blue clear sky"

I'm looking for an algorithm that can be used to match the two texts and determine how closely they match. In my case, spelling and punctuation are going to matter, but I don't want them to ruin the ability to discover the real text. In the example above, if the color reference is stored as "sky-blue", I want it to still be able to match. However, the third string listed should be a better match than the second.

I'm sure that places like Google probably use something similar for the "Did you mean:" feature...

* EDIT *
In talking with a friend, he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...

Here is the link to his paper; I hope it is helpful to those reading this question, and on the topic of similar string algorithms in general.

Best Answer

I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the right method in most cases. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), which the Levenshtein distance algorithm handles. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operations in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone trying to solve a similar string ordeal like the one I had here. Thank you all for your great answers, and I hope they can serve others well too.

Answers

Levenshtein distance will not work entirely, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.

Finding the cost of rearrangement is somewhat like the pancake sorting problem. So you can permute every combination of the words (filtering out exact matches), against every combination of the other string, and try to minimize a combination of the permutation distance and the Levenshtein distance on each word pair.

edit: Now that I have a second, I can post a quick example (all of the "best" guesses are from inspection, not from actually running the algorithms):

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |    Into the c_lear blue sky 
The color is sky blue        |    is__ the colo_r blue sky

R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3  
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertion: 1)  

Note that all of the flips include all of the elements in the range, and that I am using ranges where Xi - Xj = +/- 1.

Another example

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |   Into the clear blue sky 
In the blue clear sky        |   In__ the clear blue sky

R_dist = dist( 1 2 4 3 5 ) -->  1 2 *3 4* 5  = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertion: 0)

And, to show all possible combinations of the three...

The color is sky blue         |    The colo_r is sky blue
In the blue clear sky         |    the c_lear in sky blue

R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Substitutions: 2, deletions: 1, insertion: 1)

No matter how you compose the cost function, the second choice will come out as the lowest cost, which is what you were hoping for!

One way to determine a measure of "overall similarity without respect to ordering" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for segments that have appeared earlier; any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.

Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:

len(comp(s)) + len(comp(t)) - len(comp(s . t))

where . is taken to mean concatenation. The idea is to measure how much further you can compress t by looking at s first. If s == t, then len(comp(s.t)) will be barely bigger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s.t)) will be very close to len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.

In fact, the following formula is even better, as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):

2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))

This technique derives from information theory.

Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear (or near-linear) time, so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case, since you say there will only be a handful of words).
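To make the idea concrete, here is a toy sketch. In real use, comp() would be gzip or LZMA; the token count of a naive greedy LZ77-style parse below is purely an assumption for illustration, but it shows the same effect:

```cpp
#include <string>
#include <algorithm>

// Toy stand-in for len(comp(s)): the number of tokens emitted by a greedy
// LZ77-style parse, where each token is either a literal character or a
// back-reference to the longest match (of length >= 3) found earlier in
// the string. Not a real compressor, just an illustration.
int compLen(const std::string& s) {
    int tokens = 0;
    size_t i = 0;
    while (i < s.size()) {
        size_t best = 0;
        for (size_t j = 0; j < i; ++j) {   // search the already-seen prefix
            size_t len = 0;
            while (i + len < s.size() && j + len < i && s[j + len] == s[i + len])
                ++len;
            best = std::max(best, len);
        }
        i += (best >= 3 ? best : 1);       // back-reference or single literal
        ++tokens;
    }
    return tokens;
}

// The score from the answer: len(comp(s)) + len(comp(t)) - len(comp(s . t)).
// High when t compresses well in light of s, near zero when unrelated.
int similarity(const std::string& s, const std::string& t) {
    return compLen(s) + compLen(t) - compLen(s + t);
}
```

The symmetric variant is the same idea with both concatenation orders, s.t and t.s, folded in.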

One way (although this is maybe better suited to a spell-check-type algorithm) is "edit distance", i.e. calculating how many edits it takes to transform one string into another. A common technique can be found here:

Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
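The Wikipedia article gives the standard dynamic-programming recurrence; a minimal, self-contained version of it might look like:

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Classic DP Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;  // delete all of a
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;  // insert all of b
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i-1][j] + 1,                         // deletion
                                 d[i][j-1] + 1,                         // insertion
                                 d[i-1][j-1] + (a[i-1] != b[j-1]) });   // substitution
    return d[a.size()][b.size()];
}
```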

You might want to look into the algorithms that biologists use to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or inserted, or just moved to a different position in the string).

The Smith-Waterman algorithm would be one example that would probably work fairly well, although it might be too slow for your uses. It might give you a starting point, though.
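As a starting point, a bare-bones Smith-Waterman scorer with a linear gap penalty might look like the following (the +2 match / -1 mismatch / -1 gap parameters are illustrative assumptions, not anything from the answer):

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Minimal Smith-Waterman local alignment score. Cells are clamped at zero,
// so the best-scoring local alignment can start and end anywhere, which is
// what lets it find a shared chunk buried inside two otherwise different
// strings.
int smithWaterman(const std::string& a, const std::string& b) {
    const int MATCH = 2, MISMATCH = -1, GAP = -1;  // illustrative parameters
    std::vector<std::vector<int>> h(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j) {
            int diag = h[i-1][j-1] + (a[i-1] == b[j-1] ? MATCH : MISMATCH);
            h[i][j] = std::max({ 0, diag, h[i-1][j] + GAP, h[i][j-1] + GAP });
            best = std::max(best, h[i][j]);
        }
    return best;  // highest-scoring local alignment anywhere in the matrix
}
```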

I had a similar problem, where I needed to get the percentage of characters in a string that were similar. It needed the exact sequence, so for example comparing "hello sir" and "sir hello" had to give me five identical characters, which in this case would be the two "hello"s. It would then take the length of the longer of the two strings and give a percentage of how similar they were. This is the code that I came up with.

#include <string>
using std::string;

int bigger(string a, string b);  // forward declaration

int compare(string a, string b){
   // hand the longer string to bigger() first, so the percentage is
   // taken against the longer of the two
   return (a.size() > b.size() ? bigger(a,b) : bigger(b,a));
}

int bigger(string a, string b){
   int maxcount = 0;  // longest run of consecutive matching characters seen

   for(int i = 0; i < (int)a.size(); ++i){
       int currentcount = 0;  // reset for each new alignment of b against a

       // stop before a[i+j] runs off the end of a
       for(int j = 0; j < (int)b.size() && i + j < (int)a.size(); ++j){
           if(a[i+j] == b[j]){
               ++currentcount;
               if(currentcount > maxcount)
                   maxcount = currentcount;  // update inside the run, so a
                                             // run ending at the string's
                                             // end still counts
           }
           else{
               currentcount = 0;
           }
       }//end inner for loop
   }//end outer for loop

   return (int)(((float)maxcount / (float)a.size()) * 100);
}

There's another way: pattern recognition using convolution. Image A is run through a Fourier transform. Image B also. Now superimposing F(A) over F(B) and then transforming that back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. The total sum of the spots would indicate an overall similarity. I'm not sure how you'd run an FFT on strings, but I'm pretty sure it would work.

The difficulty would be to match the strings semantically.

You could generate some kind of value based on the lexical properties of the string. E.g. they both have "blue", and "sky", and they're in the same sentence, etc etc... But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...

To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that solves your problem.

Possible approach:

Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.

When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.

For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.

If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.

When you are finished, divide the sum by the square of the number of words in the comparison string.

This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.

Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this in both directions (using one string as the reference, then the other) and average the results.

I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
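For what it's worth, the approach described above might be sketched roughly like this (whitespace tokenization and the penalty for an absent pair are my own assumptions, and the opposite-direction down-weighting is left out for brevity):

```cpp
#include <string>
#include <vector>
#include <map>
#include <sstream>
#include <cstdlib>
#include <limits>
#include <algorithm>
using namespace std;

// Split on whitespace (an assumption; the answer doesn't specify tokenization).
vector<string> tokenize(const string& s){
    vector<string> w; string tok; istringstream in(s);
    while (in >> tok) w.push_back(tok);
    return w;
}

// Build the "word1|word2" -> list-of-distances dictionary for the reference
// string. Storing both orders, with the distance negated for the reverse
// key, gives the duplication the answer describes.
map<string, vector<int>> pairDistances(const vector<string>& w){
    map<string, vector<int>> d;
    for (size_t i = 0; i < w.size(); ++i)
        for (size_t j = 0; j < w.size(); ++j)
            if (i != j)
                d[w[i] + "|" + w[j]].push_back((int)j - (int)i);
    return d;
}

// Lower score = more similar word spacing; identical strings score 0.
double pairDistanceScore(const string& ref, const string& cmp){
    vector<string> r = tokenize(ref), c = tokenize(cmp);
    map<string, vector<int>> d = pairDistances(r);
    double sum = 0;
    for (size_t i = 0; i < c.size(); ++i)
        for (size_t j = i + 1; j < c.size(); ++j) {
            int dist = (int)j - (int)i;
            auto it = d.find(c[i] + "|" + c[j]);
            if (it == d.end()) { sum += dist; continue; }  // pair absent: penalize (assumption)
            int best = numeric_limits<int>::max();
            for (int v : it->second)                       // closest reference distance
                best = min(best, abs(v - dist));
            sum += best;
        }
    // divide by the square of the comparison string's word count
    return sum / (double)(c.size() * c.size());
}
```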




