English 中文(简体)
如何在两种不同的语言中找到所有具有相同外观的单词?
原标题:How to find all words that have the same appearence in two different languages?

俄语字母表包括许多与英语字母表相同的字母。以下是常见字母列表:L=acekopuxy

现在,给定两个巨大的列表RE,每个列表的形式都是[word_A,word_B,…],其中每个word_N都是小写单词,我想创建一个列表C。它应该只包含E中和R拼写相同的单词。例如,单词cop必须位于C中,因为它位于列表R以及E

有多项式的方法吗?

附言:一个重要的注意事项是:由于字符编码不同,有两个L列表,LE表示英文字母,LR表示俄文字母,但它们的字母外观相同:

LE= acekopuxy 
LR= асекориху 
最佳回答

您可以为此使用集合:

english_set = set(E)
russian_set = set(R)
common_words = english_set.intersection(russian_set)

不过,我不确定我的编码部分是否正确,但如果这意味着看起来相似的字母实际上是不同的字节,例如,你可以在进行交集之前用英文字母替换这些字母来准备俄语列表。

问题回答

你需要自己告诉程序,哪些字符是相似的。由于它们都是不同的Unicode代码点,因此您必须具有这样的映射:

var RE_map = (
  (u c , u u0441 ),
  # ...and so on
)

然后,将所有单词从R翻译成它们的E表示:

for ec, rc in RE_map:
    string = string.replace(rc, ec)

最后检查,如果字符串现在位于E中:

if string in E:
    print "The word exists of characters similar in Latin and Cyrillic."

您可以为此使用正则表达式:

^[acekopuxy]+$

将匹配仅包含这些字符的单词。

import re
regex = re.compile(r"^[acekopuxy]+$", re.I)
output = []
for word in mylist:
    if regex.match(word):
        output.append(word)

您需要使用正确的编码对这两个列表执行此操作。这意味着对于俄语列表,您需要使用等效的字符,如^[u0441u1234…]$

然后,如果你想找到“看起来一样”的单词,你可以使用翻译表将其中一个列表中的单词转换为另一个列表的格式,然后将列表转换为集合,并检查它们的交集。

Eset = set(E)
C = [w for w in R if w.replace(LR,LE) in Eset]

不确定我是否正确理解了这个问题,但假设哈希处理良好,这将在O(n)中运行。





相关问题
How to add/merge several Big O s into one

If I have an algorithm which is comprised of (let s say) three sub-algorithms, all with different O() characteristics, e.g.: algorithm A: O(n) algorithm B: O(log(n)) algorithm C: O(n log(n)) How do ...

Grokking Timsort

There s a (relatively) new sort on the block called Timsort. It s been used as Python s list.sort, and is now going to be the new Array.sort in Java 7. There s some documentation and a tiny Wikipedia ...

Manually implementing high performance algorithms in .NET

As a learning experience I recently tried implementing Quicksort with 3 way partitioning in C#. Apart from needing to add an extra range check on the left/right variables before the recursive call, ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Enumerating All Minimal Directed Cycles Of A Directed Graph

I have a directed graph and my problem is to enumerate all the minimal (cycles that cannot be constructed as the union of other cycles) directed cycles of this graph. This is different from what the ...

Quick padding of a string in Delphi

I was trying to speed up a certain routine in an application, and my profiler, AQTime, identified one method in particular as a bottleneck. The method has been with us for years, and is part of a "...

热门标签