python regex: capture parts of multiple strings that contain spaces

I am trying to capture substrings from a string that looks similar to

'some string, another string, '

I want the resulting match groups to be

('some string', 'another string')

My current solution

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

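That works, but what I am showing here is of course massively reduced in complexity compared with what I am doing in the real project; I want to use only one plain (non-computed) regex pattern. Unfortunately, my attempts so far have failed:

This does not match (hence the None result), because {2} applies only to the space, not to the whole expression:

>>> match('.*?, {2}', 'some string, another string, ')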

Adding parentheses around the repeated part leaves the comma and the space in the result:

>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)

Adding another set of parentheses does fix that, but gets me too much:

>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')

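Adding a non-capturing modifier improves the result, but it still misses the first string:

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

I feel like I am close, but I can't seem to find the proper way.

Can anyone help me? Are there any other approaches I am not seeing?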


Update after the first few responses:

First up, thank you very much everyone, your help is greatly appreciated! :-)

As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing a large number of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There is also XML, JSON, binary and some other data file formats, but let's stay focused.

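To cope with the multitude of file formats, and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after another, applies a regex to every line and returns a large data structure with the matches. This module is a prototype; the production version will have to be written in C++ for performance reasons, will be connected via Boost::Python, and will probably add the subject of regex dialects to the list of complexities.

Also, there are not 2 repetitions, but a number currently varying between zero and 70 (or so), the comma is not always a comma, and, despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let's just say I have reason to try to reduce the "dynamic" amount and keep as much of the pattern "fixed" as possible.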

So, in short: I have to use regular expressions.


Attempt to rephrase: I think the core of the problem boils down to: is there a Python regex notation that, for example, involves curly-brace repetition and allows me to capture

'some string, another string, '

into

('some string', 'another string')

?

Hmmm, that probably narrows it down too far, but then, any way you do it is wrong :-D


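Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there must be 2 of something), but return only 1 string (the second one)?

The problem remains the same even if I use a non-numeric repetition, i.e. + instead of {2}:

>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)

Also, it is not the second string that gets returned, but the last one:

>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)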

Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...

Best answer

To sum this up, it seems I am already using the best solution by constructing the regex pattern in a "dynamic" manner:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

The

2 * '(.*?)'

is what I mean by dynamic. The alternative approach

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

fails to return the desired result due to the fact that (as Glenn and Alan kindly explained)

with match, the captured content gets overwritten with each repetition of the capturing group
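
The overwriting behaviour quoted above can be checked directly. The following is only a minimal sketch and not part of the original answer: `build_pattern` is a hypothetical helper name, and it assumes the fields are separated by a comma followed by a single space.

```python
import re

text = 'some string, another string, third string, '

# A repeated capturing group keeps only the text of its last iteration.
repeated = re.match(r'(?:(.*?), )+', text)
print(repeated.groups())


def build_pattern(count, delimiter=', '):
    """Hypothetical helper: one capturing group per expected field."""
    return count * ('(.*?)' + re.escape(delimiter))


# The computed pattern contains three separate groups, so nothing is overwritten.
computed = re.match(build_pattern(3), text)
print(computed.groups())
```

The computed pattern is the same idea as 2 * '(.*?), ', generalised to a field count and delimiter that are only known at runtime.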

Thanks for your help, everyone!

Answers

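Unless there is much more to this problem than you have explained, I don't see the point in using regexes. This is very simple to deal with using basic string methods:

[s.strip() for s in mys.split(',') if s.strip()]

Or, if it has to be a tuple:

tuple(s.strip() for s in mys.split(',') if s.strip())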

The code is more readable, too. Please tell me if this does not apply.
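
As a quick check of the string-method approach, here is a small sketch applied to the sample input from the question. The name `mys` comes from the snippet above; the value assigned to it here is an assumption.

```python
# Sample input from the question, including the trailing ", ".
mys = 'some string, another string, '

# Split on commas, strip whitespace, and drop the empty trailing piece.
parts = [s.strip() for s in mys.split(',') if s.strip()]
print(parts)

# The same pipeline, collected into a tuple instead of a list.
parts_tuple = tuple(s.strip() for s in mys.split(',') if s.strip())
print(parts_tuple)
```

The `if s.strip()` filter is what removes the whitespace-only piece left after the final comma.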


EDIT: OK, there is indeed more to this problem than it initially seemed. I am leaving this here for historical purposes, though. (Guess I'm not "disciplined" :)

As described, I think this regex works fine:

import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match 
thepattern.findall("a, b, asdf, d")     # until comma or end of line
# Result:
Out[19]: ['a', ' b', ' asdf', ' d']

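The key here is to use findall rather than match. The phrasing of your question suggests that you prefer match, but it is not the right tool for the job here: it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your number of strings is variable, the right approach is to use either findall or split.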
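
For the split alternative mentioned above, a minimal sketch might look as follows. The delimiter pattern (a comma with optional whitespace around it) is an assumption about the input, not something stated in the question.

```python
import re

text = 'a, b, asdf, d'

# Split on a comma together with any whitespace around it.
fields = re.split(r'\s*,\s*', text)
print(fields)

# With a trailing delimiter, split leaves an empty last item; filter it out.
trailing = 'some string, another string, '
cleaned = [f for f in re.split(r'\s*,\s*', trailing) if f]
print(cleaned)
```

Unlike the findall call, this variant also removes the whitespace after each comma.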

If this is not what you need, then please make the question more specific.

Edit: If you must use tuples rather than lists:

tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')

import re

regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"

print(re.match(regex, 'some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string , another string, ').groups())
# ('some string', 'another string')

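No offense, but you obviously have a lot to learn about regexes, and what you are going to learn, ultimately, is that regexes can't handle this job. I am sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.

Do yourself a favor: forget about regexes and learn pyparsing. Or skip that and use a standalone parser generator such as ANTLR. In either case, you will probably find that grammars for most of your file formats have already been written.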

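I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, ' ?

I don't think there is such a notation.

But regexes are not only a matter of notation, that is, of the pattern string used to define a regex. They are also a matter of tools, that is, of functions.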

Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.

You should have provided more information right away: we could have understood the constraints more quickly. In my opinion, for the problem as you have stated it, findall() is indeed fine:

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print(re.findall('(.+?), *', line))

Result

['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']

Now, since you "have omitted a lot of complexity" in your question, findall() may turn out to be insufficient to handle that complexity. In that case, use finditer(), because it allows more flexibility in selecting the groups of each match.

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print([mat.group(1) for mat in re.finditer('(.+?), *', line)])

This gives the same result, and it can be made more elaborate by writing another expression in place of mat.group(1).
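
As one possible illustration of replacing mat.group(1) with a richer expression, the sketch below also records where each captured field starts and ends. The sample line is a shortened version of the input above, and the tuple layout is an arbitrary choice made for this example.

```python
import re

line = 'Topaz, Turquoise, Moss Agate, '

# Each match object exposes the captured text and its position in the line.
items = [(mat.group(1), mat.start(1), mat.end(1))
         for mat in re.finditer(r'(.+?), *', line)]
print(items)
```

Positions like these can help when the list is only part of a longer line and the remainder has to be parsed separately.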




