English 中文(简体)
模糊匹配Python中大量文本中的字符串(url)
原标题:Fuzzy matching a string within a large body of text in Python (url)

我有一个公司名称列表,还有一个提到公司名称的url列表。

最终目标是查看url,并找出url上有多少公司在我的列表中。

示例URL:http://www.dmx.com/about/our-clients

每个URL的结构都不同,所以我没有一个好的方法来进行正则表达式搜索并为每个公司名称创建单独的字符串。

我想构建一个for循环,从URL的整个内容列表中搜索每个公司。但与短字符串和大量文本相比,Levenstein似乎更适合两个较小的字符串。

这个初学者应该去哪里看?

最佳回答

在我看来,你不需要任何“模糊”匹配。我假设,当你说“url”时,你的意思是“url指向的地址处的网页”。只需使用Python内置的子字符串搜索功能:

>>> import urllib2
>>> webpage = urllib2.urlopen( http://www.dmx.com/about/our-clients )
>>> webpage_text = webpage.read()
>>> webpage.close()
>>> for name in [ Caribou Coffee ,  Express ,  Sears ]:
...     if name in webpage_text:
...         print name, "found!"
... 
Caribou Coffee found!
Express found!
>>> 

如果您担心字符串大小写不匹配,只需将其全部转换为大写即可。

>>> webpage_text = webpage_text.upper()
>>> for name in [ CARIBOU COFFEE ,  EXPRESS ,  SEARS ]:
...     if name in webpage_text:
...         print name,  found! 
... 
CARIBOU COFFEE found!
EXPRESS found!
问题回答

我想在senderle的回答中补充一点,以某种方式规范你的名字可能是有意义的(例如,删除所有特殊字符,然后将其应用于webpage_text和字符串列表。

def normalize_str(some_str):
    some_str = some_str.lower()
    for c in """-? "/{}[]()&!,.`""":
        some_str = some_str.replace(c,"")
    return some_str

如果这还不够好,您可以转到difflib并执行以下操作:

for client in normalized_client_names:
    closest_client = difflib.get_closest_match(client_name, webpage_text,1,0.8)
    if len(closest_client) > 0:
         print client_name, "found as", closest_client[0]

我选择的0.8的任意截止值(Ratcliff/Obershelp)可能过于宽松或强硬;玩一玩。





相关问题
How to add/merge several Big O s into one

If I have an algorithm which is comprised of (let s say) three sub-algorithms, all with different O() characteristics, e.g.: algorithm A: O(n) algorithm B: O(log(n)) algorithm C: O(n log(n)) How do ...

Grokking Timsort

There s a (relatively) new sort on the block called Timsort. It s been used as Python s list.sort, and is now going to be the new Array.sort in Java 7. There s some documentation and a tiny Wikipedia ...

Manually implementing high performance algorithms in .NET

As a learning experience I recently tried implementing Quicksort with 3 way partitioning in C#. Apart from needing to add an extra range check on the left/right variables before the recursive call, ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Enumerating All Minimal Directed Cycles Of A Directed Graph

I have a directed graph and my problem is to enumerate all the minimal (cycles that cannot be constructed as the union of other cycles) directed cycles of this graph. This is different from what the ...

Quick padding of a string in Delphi

I was trying to speed up a certain routine in an application, and my profiler, AQTime, identified one method in particular as a bottleneck. The method has been with us for years, and is part of a "...

热门标签