Question

Say I have a set of words:

constitution
abracadabra
refrigerator
stackoverflow

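I also have a "corrupted" sentence in which significant substrings of these words can be found, in no particular order or number. The words are not necessarily clearly separated either.

What algorithm can help me find which words from the set are most likely to occur in the corrupted sentence?

An example of this: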

xbracadabrqbonstitution ibracadabrefrigeratos obracadabri xtackoverflotefrigeratos

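From this input, I would expect to be able to reconstruct these known words:

[abracadabra, constitution, abracadabra, refrigerator, abracadabra, stackoverflow, refrigerator]

The sentences are very short (usually 5-6 words), so I can afford memory- and compute-intensive algorithms. Also, the corruption is always limited to the first and last few characters of each word; the middle is always intact (which is why I am looking at large substrings).

Any ideas? Since the words are not clearly separated, applying Levenshtein distance is not straightforward.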

Answer 1

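Since your set of words is small and the words themselves are short, I would simply try every possible substring of each dictionary word. Of course, looking for substrings of size 0 or 1 is pointless, so you will probably want to set a lower bound on the substring length.

For each substring, you simply search for it in the sentence, and if it occurs you treat the word as a possible match. To do each search in O(n) in the length of the sentence, use a linear-time string matching algorithm (for example, Rabin-Karp, which is linear on average).

Here is a simple Python sketch of the idea (using brute-force string matching):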

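d = ["constitution", "abracadabra", "refrigerator", "stackoverflow"]

def substring_match(word, sentence, min_length):
    # True if some substring of `word` with at least `min_length`
    # characters occurs in `sentence` (brute force).
    for start in range(len(word)):
        # `end` is exclusive, so the shortest slice has exactly min_length characters.
        for end in range(start + min_length, len(word) + 1):
            if word[start:end] in sentence:
                return True
    return False

def look_for_words(word_dict, sent_word):
    # Dictionary words that share a substring of 5+ characters with one token.
    return [word for word in word_dict if substring_match(word, sent_word, 5)]

def look(word_dict, sentence):
    # Check each whitespace-separated token against the whole dictionary.
    ret = []
    for word in sentence.split():
        ret.extend(look_for_words(word_dict, word))
    return ret

if __name__ == "__main__":
    print("\n".join(look(d, "xbracadabrqbonstitution ibracadabrefrigeratos obracadabri xtackoverflotefrigeratos")))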

Answer 2

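Given the size of the problem you describe, I would not worry about optimizing the solution, because anything that is not exponential will run instantly. I will just give you an algorithm that I am confident gives the right answer for a semi-hard problem like this one. Then we can work on optimizing it.

First, what you need is some fuzzy matching function.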

Then you just generate the set of all possible substrings w of your string. In the worst case, that means taking all substrings of length 1, then of length 2, then of length 3, and so on up to the length of your string. The total number of w's generated this way is n(n + 1) / 2 for a string of length n.

If you're worried about speed, you can set a maximum word length, and the cost of generating the w's drops back down to linear in the length of your string.
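As a small illustration of the generation step (the helper name `substrings` and its parameters are hypothetical, chosen for this sketch), the candidates can be enumerated with an optional length cap:

```python
def substrings(text, min_len=1, max_len=None):
    """Yield (start, substring) for every substring whose length lies
    between min_len and max_len (max_len defaults to len(text))."""
    n = len(text)
    if max_len is None:
        max_len = n
    for start in range(n):
        # Longest slice allowed at this start: capped by max_len and by the string end.
        for length in range(min_len, min(max_len, n - start) + 1):
            yield start, text[start:start + length]
```

Without a cap, a string of length 4 yields 4 * 5 / 2 = 10 substrings; with `max_len=2` it yields 7, and in general the count grows only linearly once the cap is fixed.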

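Take your set of w's and run each one through the fuzzy function. You can then use whatever rule you like to decide which dictionary words count as real matches, and what to do when the words you select overlap. A simple implementation could sort all matches by the index of their first letter and, whenever a match is selected, skip ahead over the covered letters to the end of the selected word.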
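A minimal sketch of this pipeline follows. It makes two assumptions that are mine rather than the answer's: the fuzzy function (called `fuzzy` here) is a deliberately simple stand-in that accepts a substring if it appears intact inside a dictionary word, which leans on the question's guarantee that the middle of each word is never damaged; and overlaps are resolved greedily, earliest and longest match first.

```python
DICTIONARY = ["constitution", "abracadabra", "refrigerator", "stackoverflow"]

def fuzzy(w, dictionary, min_len=5):
    """Hypothetical fuzzy function: the dictionary word that contains w
    intact, or None if w is too short or matches nothing."""
    if len(w) < min_len:
        return None
    for word in dictionary:
        if w in word:
            return word
    return None

def recover(sentence, dictionary, min_len=5):
    """Return the dictionary words found in sentence, in order of appearance."""
    max_len = max(len(word) for word in dictionary)
    candidates = []
    for start in range(len(sentence)):
        # Only substrings between min_len and max_len characters are tried.
        for end in range(start + min_len, min(start + max_len, len(sentence)) + 1):
            word = fuzzy(sentence[start:end], dictionary, min_len)
            if word is not None:
                candidates.append((start, end, word))
    # Earliest start first; for equal starts, the longest match first.
    candidates.sort(key=lambda c: (c[0], -c[1]))
    result, cursor = [], 0
    for start, end, word in candidates:
        if start >= cursor:   # does not overlap the previously selected match
            result.append(word)
            cursor = end      # skip ahead to the end of the selected match
    return result
```

On the sentence from the question, `recover(sentence, DICTIONARY)` gives abracadabra, constitution, abracadabra, refrigerator, abracadabra, stackoverflow, refrigerator. A more tolerant fuzzy function (one that also forgives damaged characters) could be swapped in without changing the selection step.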

Answer 3

You can try the Levenshtein distance algorithm to find the words with minimal distance to the words in your dictionary (you define the tolerance).
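For illustration, here is a plain dynamic-programming Levenshtein distance with a tolerance check. The names `levenshtein` and `closest_word` are hypothetical, and the sketch assumes each token holds a single word:

```python
def levenshtein(a, b):
    """Edit distance between a and b; insertions, deletions and
    substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def closest_word(token, dictionary, tolerance):
    """Return the dictionary word nearest to token, or None when even
    the nearest word is farther away than tolerance."""
    best = min(dictionary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= tolerance else None
```

With a tolerance of 3, `closest_word("obracadabri", ...)` finds abracadabra (distance 2), while a fused token such as "xbracadabrqbonstitution" returns None, which matches the question's remark that the distance is hard to apply when words are not separated.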
