  • Time: 2009-01-16 20:34:27



Say I have a string, "Into the clear blue sky", and I'm comparing it with the following two strings: "The color is sky blue" and "In the blue clear sky".



* EDIT *
While talking with a friend, I learned that he had worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...







edit: Now that I have a second, I can post a quick example (all the best guesses were made by inspection, not by actually running the algorithms):

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |    Into the c_lear blue sky 
The color is sky blue        |    is__ the colo_r blue sky

R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3  
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertion: 1)

Note that every flip includes all the elements in its range, and I use ranges where Xi - Xj = +/- 1.


original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |   Into the clear blue sky 
In the blue clear sky        |   In__ the clear blue sky

R_dist = dist( 1 2 4 3 5 ) -->  1 2 *3 4* 5  = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertion: 0)


The color is sky blue         |    The colo_r is sky blue
In the blue clear sky         |    the c_lear in sky blue

R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Substitutions: 2, deletions: 1, insertion: 1)
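
The per-word L_dist terms in the examples above are ordinary Levenshtein distances (insertions, deletions and substitutions each cost 1). Here is a minimal C++ sketch of the standard dynamic program, in case it helps to check the hand-made guesses. The names `toLower` and `levenshtein` are made up for illustration, and comparing case-insensitively is an assumption inferred from the examples, where "The"/"the" is not charged as an edit.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Lower-case copy, so that "The" and "the" compare equal (an assumption).
std::string toLower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Levenshtein distance between two words, keeping only two rows of the table.
std::size_t levenshtein(const std::string& left, const std::string& right) {
    const std::string a = toLower(left), b = toLower(right);
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j): j insertions
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;                                           // a[0..i) -> "": i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1,      // delete a[i-1]
                               cur[j - 1] + 1,   // insert b[j-1]
                               substitute});     // match or substitute
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

With this sketch, `levenshtein("clear", "color")` is 3 (the I+D+S term) and `levenshtein("Into", "is")` is 3 (the 2D+S term), which agrees with the first example.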


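One way to get an "overall similarity without regard to order" measure is to use some kind of compression-based distance. Basically, most compression algorithms (e.g. gzip) work by scanning along a string, looking for segments that have appeared earlier. Each time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.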

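Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t: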

len(comp(s)) + len(comp(t)) - len(comp(s . t))

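where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very close to len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.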

Actually, the following formula is better because it is symmetric (i.e. the score does not change depending on which string is s and which is t):

2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
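
To make the two formulas concrete, here is a rough C++ sketch. It does not use gzip: `compressedCost` is a hypothetical toy stand-in for len(comp(s)) that does a greedy LZ77-style parse and charges 1 unit per literal character and 2 units per (offset, length) pair. Those costs, the minimum match length of 3, and all the function names are assumptions made for illustration. A real compressor such as zlib would give different numbers, though the scores should behave similarly.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Toy stand-in for len(comp(s)): greedy LZ77-style parse of s.
// Assumed costs: a literal character costs 1, an (offset, length) pair costs 2.
std::size_t compressedCost(const std::string& s) {
    const std::size_t kMinMatch = 3;  // shorter repeats are cheaper as literals
    std::size_t cost = 0, pos = 0;
    while (pos < s.size()) {
        // Find the longest segment starting at pos that also starts earlier.
        std::size_t best = 0;
        for (std::size_t start = 0; start < pos; ++start) {
            std::size_t len = 0;
            while (pos + len < s.size() && s[start + len] == s[pos + len]) ++len;
            best = std::max(best, len);
        }
        if (best >= kMinMatch) { cost += 2; pos += best; }  // emit (offset, length)
        else                   { cost += 1; pos += 1; }     // emit a literal
    }
    return cost;
}

// len(comp(s)) + len(comp(t)) - len(comp(s . t))
long long similarity(const std::string& s, const std::string& t) {
    return static_cast<long long>(compressedCost(s)) +
           static_cast<long long>(compressedCost(t)) -
           static_cast<long long>(compressedCost(s + t));
}

// 2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
long long symmetricSimilarity(const std::string& s, const std::string& t) {
    const long long separate = static_cast<long long>(compressedCost(s)) +
                               static_cast<long long>(compressedCost(t));
    return 2 * separate - static_cast<long long>(compressedCost(s + t)) -
           static_cast<long long>(compressedCost(t + s));
}
```

Under these toy costs, `similarity("abcdefgh", "abcdefgh")` is 6 (the second copy collapses into a single pair), while `similarity("abcdefgh", "ijklmnop")` is 0 because nothing in t repeats anything in s.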







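I had a similar problem: I needed to get the percentage of similar characters in a string. It needed exact sequences, so for example when comparing "hello sir" and "sir hello" it needed to give me five characters that are the same, which in this case would be the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they are. This is the code I came up with.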

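#include <algorithm>
#include <string>
#include <vector>
using std::string;

// Longest run of consecutive characters shared by a and b,
// as a percentage of a's length (a is expected to be the longer string).
int bigger(const string& a, const string& b){
    if(a.empty()) return 0; // nothing to compare, and avoids dividing by zero

    int maxcount = 0; // used to see which set of consecutive characters was biggest

    // run[j+1] = length of the common run ending at a[i] and b[j]
    std::vector<int> run(b.size() + 1, 0);

    for(size_t i = 0; i < a.size(); ++i){
        // walk j backwards so run[j] still holds the value for a[i-1]
        for(size_t j = b.size(); j-- > 0; ){
            if(a[i] == b[j]){
                run[j+1] = run[j] + 1; // extend the run that ended one character earlier
                maxcount = std::max(maxcount, run[j+1]);
            }//end if
            else{
                run[j+1] = 0; // the run is broken here
            }//end else
        }//end inner for loop
    }//end outer for loop

    return (int)(((float)maxcount / (float)a.size()) * 100);
}

// Always pass the longer string first, so the percentage is relative to it.
int compare(const string& a, const string& b){
    return (a.size() > b.size() ? bigger(a,b) : bigger(b,a));
}
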
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform, and so is image B. Superimposing F(A) over F(B) (that is, multiplying F(A) by the complex conjugate of F(B)) and then transforming the result back gives you a black image with a few white spots. Those spots indicate where A matches B strongly, and the total sum of the spots would indicate an overall similarity. I'm not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
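
For strings, the quantity that the Fourier trick computes can also be written down directly: slide one string across the other and, for every offset, count the positions where the characters coincide. The C++ sketch below does exactly that without any FFT, so it is only an illustration of the idea, not the FFT method itself. The function name `matchCountsPerShift` and the output layout are assumptions. An FFT over per-character indicator vectors would produce the same counts faster for long inputs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// counts[k] is the number of positions where a and b agree when b is shifted
// by (k - (b.size() - 1)) characters relative to a. A tall peak plays the role
// of a bright spot: a strong match at that offset.
std::vector<int> matchCountsPerShift(const std::string& a, const std::string& b) {
    if (a.empty() || b.empty()) return {};
    std::vector<int> counts(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            // a[i] lines up with b[j] at shift i - j; add b.size() - 1 so the
            // index never goes negative.
            if (a[i] == b[j]) ++counts[i + (b.size() - 1) - j];
        }
    }
    return counts;
}
```

For "hello sir" against "sir hello", the tallest peak is 5 (the two "hello"s lining up) and there is a smaller peak of 3 where the two "sir"s line up.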

The difficulty would be to match the strings semantically.

You could generate some kind of value based on the lexical properties of the string, e.g. they both have "blue" and "sky", and they're in the same sentence, etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...

To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that would solve your problem.

Possible approach:

Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.

When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.

For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.

If the closest reference distance between the words is in the opposite direction (word2|word1) from the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.

When you are finished, divide the sum by the square of the number of words in the comparison string.

This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.

Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this in both directions (using one as the reference, then the other) and average them.

I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
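
For anyone who wants to experiment, the steps above could be sketched roughly like this in C++. Everything here is an interpretation rather than a tested design: the names `splitWords`, `buildPairDistances` and `pairDistanceScore` are made up, the opposite-direction weighting is modelled as a flat `reversedPenalty` added to the cost, and a word pair that never occurs in the reference is charged the reference's word count (the description does not say what to do in that case). Lower scores mean a closer match, and 0 means every word pair sits at the same distance in both strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a string into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w; ) words.push_back(w);
    return words;
}

// "word1|word2" -> every signed distance (in words) from word1 to word2.
// Both orders are stored, so "word2|word1" holds the negated values.
using PairDistances = std::map<std::string, std::vector<int>>;

PairDistances buildPairDistances(const std::vector<std::string>& words) {
    PairDistances table;
    for (std::size_t i = 0; i < words.size(); ++i)
        for (std::size_t j = 0; j < words.size(); ++j)
            if (i != j)
                table[words[i] + "|" + words[j]].push_back(
                    static_cast<int>(j) - static_cast<int>(i));
    return table;
}

double pairDistanceScore(const std::string& reference, const std::string& comparison,
                         double reversedPenalty = 1.0) {  // assumed weighting
    const std::vector<std::string> ref = splitWords(reference);
    const std::vector<std::string> cmp = splitWords(comparison);
    if (cmp.empty()) return 0.0;
    const PairDistances table = buildPairDistances(ref);
    const double missingPenalty = static_cast<double>(ref.size());  // assumption

    double total = 0.0;
    for (std::size_t i = 0; i < cmp.size(); ++i) {
        for (std::size_t j = i + 1; j < cmp.size(); ++j) {
            const int d = static_cast<int>(j - i);  // distance in the comparison string
            const auto it = table.find(cmp[i] + "|" + cmp[j]);
            if (it == table.end()) { total += missingPenalty; continue; }
            // Because both orders are stored, one lookup covers both keys:
            // a negative value means the pair is reversed in the reference.
            double best = std::numeric_limits<double>::max();
            for (int v : it->second) {
                double cost = std::abs(d - std::abs(v));
                if (v < 0) cost += reversedPenalty;
                best = std::min(best, cost);
            }
            total += best;
        }
    }
    // Divide by the square of the number of words in the comparison string.
    return total / static_cast<double>(cmp.size() * cmp.size());
}
```

Since the score is not symmetric, averaging `pairDistanceScore(a, b)` and `pairDistanceScore(b, a)` would cover the longer-original case mentioned above.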
