Efficient way to search for a list of terms in a document

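I have a list of several thousand synonyms. I also have tens of thousands of documents that I want to search for these terms. What is an efficient way to do this in Python (or pseudocode)?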

# this would work for single word synonyms, but there are multiple word synonyms too
synonymSet = set([...])
wordsInDocument = set([...])
synonymsInDocument = synonymSet.intersection(wordsInDocument)

# this would work, but sounds slow
matches = []
for document in documents:
    for synonym in synonymSet:
        if synonym in document:
            matches.append(synonym)

Is there a good solution to this problem, or will it just take a while? Thank you in advance.

Answers

How about building a regular expression from your list of synonyms:

import re

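# escape each synonym so that regex metacharacters in it are matched literally
pattern = "|".join(re.escape(synonym) for synonym in synonymList)
regex = re.compile(pattern)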

matches = regex.findall(document) # get a list of the matched synonyms
matchedSynonyms = set(matches)    # eliminate duplicates using a set
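
The regex idea can be pushed a bit further to cover multi-word synonyms and many documents. The following is only a sketch under some assumptions: the helper names `build_synonym_regex` and `find_synonyms` and the sample data are made up for illustration, every synonym is assumed to start and end with a word character (otherwise `\b` will not behave as intended), and matching is case-sensitive. Longer synonyms are placed first in the alternation because Python's `re` takes the first alternative that matches, not the longest one.

```python
import re


def build_synonym_regex(synonyms):
    """Compile one alternation regex covering all synonyms (hypothetical helper)."""
    # Longest first, so "motor vehicle" is tried before "motor" at the same position.
    ordered = sorted(set(synonyms), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # \b on both sides so that "car" does not match inside "carpet".
    return re.compile(r"\b(?:" + alternation + r")\b")


def find_synonyms(regex, documents):
    """Map each document index to the set of synonyms found in that document."""
    return {i: set(regex.findall(doc)) for i, doc in enumerate(documents)}


# Made-up sample data, for illustration only.
synonyms = ["car", "motor vehicle", "automobile", "motor"]
documents = [
    "A motor vehicle is not always a car.",
    "Nothing relevant here, just a carpet.",
]

regex = build_synonym_regex(synonyms)
result = find_synonyms(regex, documents)
print(result)
```

The pattern is compiled once and reused, so each document is scanned in a single pass instead of once per synonym. Whether this is fast enough for thousands of synonyms depends on the data, so it is worth timing against the nested-loop version.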



