Regex replace for large number of search replace pairs [closed]

I want to be able to do large-scale search and replace on documents for the purpose of text normalization.

For example:

  1. Find all uses of U.S.A, USA and replace with United States Of America
  2. Find all ampersands (&) and replace with the word and

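I also want to be able to add new rules to the system without changing any code. So the search/replace pairs are stored in a database, which means anyone can add, update, or delete rules.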
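For illustration, the rule storage described above could look like the following minimal sketch. It assumes SQLite and a hypothetical `rules` table with `search` and `replacement` columns; the question does not say which database or schema is actually used.

```python
import sqlite3

# Hypothetical schema: one row per rule. The table and column names are
# illustrative and do not come from the original question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (search TEXT NOT NULL, replacement TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO rules (search, replacement) VALUES (?, ?)",
    [
        ("USA", "United States Of America"),
        ("U.S.A", "United States Of America"),
        ("&", "and"),
    ],
)

def load_rules(connection):
    # Ordering by rowid returns the rules in the order they were inserted.
    cursor = connection.execute(
        "SELECT search, replacement FROM rules ORDER BY rowid"
    )
    return cursor.fetchall()

replacements = load_rules(conn)
print(replacements)
```

Adding, updating, or deleting a rule is then a plain SQL statement against that table, and the replacement code only ever sees the list of pairs.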

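I've been working with Python's re module, which is very nice, and ideally I would like to pass a list of tuples to the sub command and have it go through each one and do the replacement. Is there a better way to do this than looping over the list of tuples and creating a regular expression for each one? It is very slow and inefficient, especially with large files: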

replacements = [
  (r'USA', 'United States Of America'),
  (r'U.S.A', 'United States Of America'),
  (r'US of A', 'United States of America')]

for replacement in replacements:
  document = re.sub(replacement[0],replacement[1],document
Answers

None of your examples require regular expressions. Why not try good ol' string replacement?

replacements = [
    ('USA', 'United States Of America'),
    ('U.S.A', 'United States Of America'),
    ('US of A', 'United States of America')]

for replacement in replacements:
    document = document.replace(replacement[0], replacement[1])

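This looks like it would be slow, but you should benchmark it before ruling the approach out. Python is pretty good at this kind of thing, and the results may surprise you.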
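A minimal sketch of such a benchmark, using the standard `timeit` module. The sample document and the helper names `with_str_replace` and `with_re_sub` are made up for illustration; timings will vary with your data and machine, so no numbers are claimed here.

```python
import re
import timeit

# Hypothetical sample data: a short sentence repeated to get a larger document.
document = "The USA and the U.S.A are the same place as the US of A. " * 2000

replacements = [
    ("USA", "United States Of America"),
    ("U.S.A", "United States Of America"),
    ("US of A", "United States of America"),
]

def with_str_replace(text):
    for old, new in replacements:
        text = text.replace(old, new)
    return text

def with_re_sub(text):
    for old, new in replacements:
        # re.escape makes the pattern match the literal text, like str.replace
        text = re.sub(re.escape(old), new, text)
    return text

# Make sure both approaches agree before comparing their speed.
assert with_str_replace(document) == with_re_sub(document)

for func in (with_str_replace, with_re_sub):
    seconds = timeit.timeit(lambda: func(document), number=5)
    print("{}: {:.3f} s for 5 runs".format(func.__name__, seconds))
```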

If you really do need regular expressions, you will probably see a big boost from compiling them:

replacements = [
    (re.compile('USA'), 'United States Of America'),
    (re.compile('U.S.A'), 'United States Of America'),
    (re.compile('US of A'), 'United States of America')]

for pattern, replacement in replacements:
    document = pattern.sub(replacement, document)

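This saves Python the effort of recompiling these regular expressions every time you use them.

If you only need regular expressions some of the time, consider making two passes through the document: one with regular expressions and one with string replacement. Or, if you need the replacements to happen in a specific order, you could have something like: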

replacements = [
    (re.compile('USA'), 'United States Of America'),
    ('foo', 'bar'),
    (re.compile('U.S.A'), 'United States Of America'),
    ('spam', 'eggs'),
    (re.compile('US of A'), 'United States of America')]

for pattern, replacement in replacements:
    try:
        document = pattern.sub(replacement, document)
    except AttributeError:
        document = document.replace(pattern, replacement)
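A different technique from the loops above, and not part of the original answer: when every rule is a literal string, all keys can be joined into one alternation so the document is scanned only once. This is a sketch under that assumption; the `rules` dictionary and the `normalize` helper are hypothetical names.

```python
import re

# Hypothetical literal rules; in practice these would be loaded from the database.
rules = {
    "USA": "United States Of America",
    "U.S.A": "United States Of America",
    "US of A": "United States of America",
    "&": "and",
}

# Longest keys first, so a longer phrase wins over a shorter one that starts
# at the same position. re.escape keeps '.' and '&' literal.
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(rules, key=len, reverse=True))
)

def normalize(text):
    # The callback looks up the replacement for whichever key matched.
    return pattern.sub(lambda match: rules[match.group(0)], text)

print(normalize("USA & U.S.A & US of A"))
```

Note that the behaviour differs from the sequential loop: replaced text is not rescanned, so one rule cannot act on the output of another, and rules that are themselves regular expressions would need separate handling.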

Take a look at Google Refine.

Google Refine, a tool for working with messy data

http://code.google.com/p/google-refine/

I have a Big Ass file (http://norvig.com/big.txt), 6MB of text. It is a compilation of several Project Gutenberg files.

Try this:

import re
import time

reps = [
  (r'thousand', '>>>>>1,000<<<<<'),
  (r'million', '>>>>>1e6<<<<<'),
  (r'[Hh]undreds', ">>>>>100's<<<<<"),
  (r'Sherlock', '>>>> SHERLOCK <<<<')
  ]

t1=time.time()
out=[]
rsMade=0
textLength=0
NewTextLen=0
with open('big.txt') as BigAssF:
    for line in BigAssF:
        textLength+=len(line)
        for pat, rep in reps:
            # each pattern runs against the original line, so only the
            # result of the last pattern is kept after this loop
            NewLine=re.subn(pat,rep,line)

        out.append(NewLine[0])
        NewTextLen+=len(NewLine[0])
        rsMade+=NewLine[1]

print('Text Length: {:,} characters'.format(textLength))
print('New Text Length: {:,} characters'.format(NewTextLen))
print('Replacements Made: {}'.format(rsMade))
print('took {:.4} seconds'.format(time.time()-t1))

It prints:

Text Length: 6,488,666 characters
New Text Length: 6,489,626 characters
Replacements Made: 96
took 2.22 seconds

That seems fast enough to me.
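One caveat about the timing loop as posted: every pattern is applied to the original `line`, so only the last pattern's result is kept. The reported numbers are consistent with that, since 960 extra characters for 96 replacements is exactly the 10 characters gained per "Sherlock". A sketch of a chained version follows; `apply_rules` and the sample lines are hypothetical, and it has not been timed against big.txt.

```python
import re

def apply_rules(lines, reps):
    """Apply every (pattern, replacement) pair to each line, in order."""
    compiled = [(re.compile(pat), rep) for pat, rep in reps]
    out = []
    total = 0
    for line in lines:
        for pattern, rep in compiled:
            # Feed the result of one substitution into the next one.
            line, count = pattern.subn(rep, line)
            total += count
        out.append(line)
    return out, total

# Small made-up sample standing in for big.txt
sample = ["Sherlock saw hundreds of people.\n", "A thousand or a million?\n"]
reps = [
    (r"thousand", ">>>>>1,000<<<<<"),
    (r"million", ">>>>>1e6<<<<<"),
    (r"[Hh]undreds", ">>>>>100's<<<<<"),
    (r"Sherlock", ">>>> SHERLOCK <<<<"),
]

new_lines, replacements_made = apply_rules(sample, reps)
print("Replacements Made: {}".format(replacements_made))
```

Passing an open file object as `lines` works the same way, because the function only iterates over it.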

Your line of code may have a problem:

document = re.sub(replacement[0],replacement[1],document

if that is not a typo.

How fast do you expect it to be?




