Question


Closed 11 years ago.

I want to be able to run large-scale search and replace over documents in order to normalize their text.

For example:

Find all uses of U.S.A, USA and replace with United States Of America
Find all ampersands (&) and replace with the word and

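I also want to be able to add new rules to the system without changing any code. The search/replace pairs are therefore stored in a database, which means anyone can add, update, or delete rules.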
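As a rough illustration of rules living in a database, the sketch below loads search/replace pairs from SQLite and applies them in order. The table name rules, its columns pattern and replacement, and the helper load_rules are hypothetical, chosen only for this example; the question does not say which database or schema is in use.

```python
import sqlite3

def load_rules(conn):
    """Return (search, replacement) pairs from the hypothetical `rules` table."""
    cur = conn.execute("SELECT pattern, replacement FROM rules ORDER BY id")
    return cur.fetchall()

# In-memory database standing in for the real rule store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (id INTEGER PRIMARY KEY, pattern TEXT, replacement TEXT)"
)
conn.executemany(
    "INSERT INTO rules (pattern, replacement) VALUES (?, ?)",
    [("U.S.A", "United States Of America"),
     ("USA", "United States Of America"),
     ("&", "and")],
)

document = "Made in the USA & shipped from the U.S.A"
# Plain string replacement, one rule at a time, in insertion order.
for search, replacement in load_rules(conn):
    document = document.replace(search, replacement)
print(document)
```

Because the rules are read at run time, adding a row to the table changes the behaviour without touching the code.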

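I have been working with the Python re module, which is quite nice, and ideally I would like to pass a list of tuples to the sub command and have it go through each one and do the replacement. Is there a better way to do this than looping over the list of tuples and creating a regular expression for each one? It is very slow and inefficient, especially with large files: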

replacements = [
  (r'USA', 'United States Of America'),
  (r'U.S.A', 'United States Of America'),
  (r'US of A', 'United States of America')]

for replacement in replacements:
  document = re.sub(replacement[0],replacement[1],document

Answer 1

None of your examples require regular expressions. Why not try good ol' string replacement?

replacements = [
    ('USA', 'United States Of America'),
    ('U.S.A', 'United States Of America'),
    ('US of A', 'United States of America')]

for replacement in replacements:
    document = document.replace(replacement[0], replacement[1])

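This may look like it would be slow, but you should benchmark it before ruling the approach out. Python is pretty good at this kind of thing, and the results may surprise you.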
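One way to run such a benchmark is with the standard timeit module. The sketch below compares the string-replacement loop with an equivalent re.sub loop; the sample document is synthetic and the function names are made up for illustration, so substitute a real file before drawing conclusions.

```python
import re
import timeit

replacements = [
    ("USA", "United States Of America"),
    ("U.S.A", "United States Of America"),
    ("US of A", "United States of America"),
]

# Synthetic sample text (assumption); use a real document when measuring.
document = "The USA, the U.S.A and the US of A. " * 10000

def with_str_replace(text):
    for old, new in replacements:
        text = text.replace(old, new)
    return text

def with_re_sub(text):
    for old, new in replacements:
        # re.escape makes the search string literal, matching str.replace
        text = re.sub(re.escape(old), new, text)
    return text

for func in (with_str_replace, with_re_sub):
    seconds = timeit.timeit(lambda: func(document), number=5)
    print("{}: {:.4f} s for 5 runs".format(func.__name__, seconds))
```

The timings depend on the machine and the input, so no expected numbers are given here.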

If you really do need regular expressions, you will probably see a big boost from compiling them:

import re

replacements = [
    (re.compile('USA'), 'United States Of America'),
    (re.compile('U.S.A'), 'United States Of America'),
    (re.compile('US of A'), 'United States of America')]

for pattern, replacement in replacements:
    document = pattern.sub(replacement, document)

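This saves Python the effort of recompiling these regular expressions every time they are used. (The re module does cache recently compiled patterns, so measure the gain rather than assume it.)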

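If you only need regular expressions some of the time, consider making two passes over the document: one with regular expressions and one with string replacement. Or, if the replacements have to happen in a specific order, you could have something like: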

replacements = [
    (re.compile('USA'), 'United States Of America'),
    ('foo', 'bar'),
    (re.compile('U.S.A'), 'United States Of America'),
    ('spam', 'eggs'),
    (re.compile('US of A'), 'United States of America')]

for pattern, replacement in replacements:
    try:
        # compiled patterns have a sub() method
        document = pattern.sub(replacement, document)
    except AttributeError:
        # plain strings do not, so fall back to str.replace
        document = document.replace(pattern, replacement)
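For purely literal rules there is another technique, which is not part of the answer above: join all search strings into one alternation and look each match up in a dictionary, so the document is scanned once instead of once per rule. This is a minimal sketch, assuming no rule needs regex syntax; build_replacer is a hypothetical helper name. Unlike the sequential loops, text produced by one rule is never rescanned by another.

```python
import re

def build_replacer(rules):
    """Return a function that applies all literal rules in a single pass."""
    # Longest search strings first, so a longer alternative wins over a
    # shorter one that shares its prefix.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A callable replacement avoids backslash processing in the new text.
    return lambda text: pattern.sub(lambda m: rules[m.group(0)], text)

rules = {
    "USA": "United States Of America",
    "U.S.A": "United States Of America",
    "US of A": "United States of America",
    "&": "and",
}
normalize = build_replacer(rules)
print(normalize("USA & U.S.A"))
```

The pattern is compiled once per rule set, so it only needs rebuilding when the rules change.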

Answer 2

Take a look at Google Refine.

Google Refine, a tool for working with messy data

http://code.google.com/p/google-refine/

Answer 3

I have a Big Ass file (http://norvig.com/big.txt), 6 MB of text. It is a compilation of several Project Gutenberg files.

Try this:

import re
import time

reps = [
  (r'thousand', '>>>>>1,000<<<<<'),
  (r'million', ">>>>>1e6<<<<<"),
  (r'[Hh]undreds', ">>>>>100's<<<<<"),
  (r'Sherlock', ">>>> SHERLOCK <<<<")
  ]

t1=time.time()
out=[]
rsMade=0
textLength=0
NewTextLen=0
with open('big.txt') as BigAssF:
    for line in BigAssF:
        textLength+=len(line)
        for pat, rep in reps:
            # every call starts from the original line, so only the
            # result of the last pattern is kept and counted below
            NewLine=re.subn(pat,rep,line)

        out.append(NewLine[0])
        NewTextLen+=len(NewLine[0])
        rsMade+=NewLine[1]

print('Text Length: {:,} characters'.format(textLength))
print('New Text Length: {:,} characters'.format(NewTextLen))
print('Replacements Made: {}'.format(rsMade))
print('took {:.4} seconds'.format(time.time()-t1))

It prints:

Text Length: 6,488,666 characters
New Text Length: 6,489,626 characters
Replacements Made: 96
took 2.22 seconds

That seems fast enough to me.
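Note that in the loop above every re.subn call runs on the original line, so only the last pattern's result is kept and counted. A sketch of a variant that feeds each result into the next pattern is shown here; it uses a short made-up sample line instead of big.txt, and apply_all is a hypothetical name.

```python
import re

reps = [
    (re.compile(r'thousand'), '>>>>>1,000<<<<<'),
    (re.compile(r'million'), '>>>>>1e6<<<<<'),
    (re.compile(r'[Hh]undreds'), ">>>>>100's<<<<<"),
    (re.compile(r'Sherlock'), '>>>> SHERLOCK <<<<'),
]

def apply_all(line):
    """Apply every pattern in turn; return the new line and the total count."""
    total = 0
    for pat, rep in reps:
        line, n = pat.subn(rep, line)  # feed each result into the next pattern
        total += n
    return line, total

# Made-up sample line (assumption), standing in for one line of big.txt.
sample = "Sherlock saw hundreds of clues, a thousand perhaps.\n"
new_line, count = apply_all(sample)
print(new_line, count)
```

Running this over the whole file would do more work than the original loop, so its timing would not be comparable with the figures reported above.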

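There may be something wrong with this line of your code:

document = re.sub(replacement[0],replacement[1],document

unless that is just a typo.

How fast do you expect it to be?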
