Question

I am trying to capture the sub-parts of a string that looks similar to

'some string, another string, '

I want the resulting match groups to be

('some string', 'another string')

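My current solution

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

What I am showing here is, of course, massively reduced in complexity compared to what I am doing in the real project; I want to use only a single "straight" (non-computed) regex pattern. Unfortunately, my attempts so far have failed: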

This does not match (hence None), because {2} applies only to the space and not to the whole pattern:

>>> match('.*?, {2}', 'some string, another string, ')

Adding parentheses around the repeated part leaves the comma and the space in the result:

>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)

Adding another set of parentheses does fix that, but returns too much:

>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')

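Adding a non-capturing group improves the result, but it still misses the first string:

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

I have the feeling I am close, but I really cannot seem to find the proper way.

Can anyone help me? Are there any other approaches I am not seeing?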

Update after the first few responses:

First up, thank you very much everyone, your help is greatly appreciated! :-)

As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large numbers of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There are also XML, JSON, binary and some other data file formats, but let's stay focussed.

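In order to cope with the multitude of file formats, and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after another, applies a regex to every line, and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ version for performance reasons, which will be connected via Boost::Python and will probably add the subject of regex dialects to the list of complexities.

Also, there are not 2 repetitions but a number currently varying between zero and 70 (or so), the comma is not always a comma, and despite what I said originally, some parts of the regex pattern will have to be computed at runtime. Let's just say I have reason to try to reduce the "dynamic" amount and keep as much of the pattern "fixed" as possible.

So, in a word: I must use regular expressions.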

Attempt at rephrasing: I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture

'some string, another string, '

into

('some string', 'another string')

?

Hmmm, that probably narrows it down too far, but then, any way you do it is wrong :D

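Second attempt at rephrasing: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating that there have to be 2 of something), but return only 1 string (the second one)?

The problem remains the same even if I use a non-numeric repetition, i.e. + instead of {2}:

>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)

Also, it is not the second string that is returned, it is the last one:

>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)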

Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...

Answer 1

To sum this up, it seems I am already using the best solution by constructing the regex pattern in a "dynamic" manner:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

The expression

2 * '(.*?), '

is what I mean by dynamic. The alternative approach

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

fails to return the desired result due to the fact that (as Glenn and Alan kindly explained)

with match, the captured content gets overwritten with each repetition of the capturing group
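
A small sketch of both behaviours, for illustration only. The helper name `repeated_pattern` is hypothetical (it is not part of the `re` module) and simply wraps the string-multiplication trick shown above.

```python
import re

text = 'some string, another string, third string, '

# A repeated capturing group keeps only the text of its last repetition.
overwritten = re.match(r'(?:(.*?), )+', text).groups()
print(overwritten)

def repeated_pattern(unit, count):
    """Hypothetical helper: repeat a sub-pattern so each copy is its own group."""
    return unit * count

# Three copies of the sub-pattern give three separate groups.
separate = re.match(repeated_pattern(r'(.*?), ', 3), text).groups()
print(separate)
```

The drawback is the one the question already names: the repetition count has to be known before the pattern is built.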

Thanks for your help, everyone!

Answer 2

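Unless there is much more to this problem than you have explained, I do not see the point in using regexes. This is very simple to deal with using basic string methods:

[s.strip() for s in mys.split(',') if s.strip()]

Or, if it has to be a tuple:

tuple(s.strip() for s in mys.split(',') if s.strip())

The code is more readable, too. Please tell me if this fails to apply.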
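
As a runnable illustration, assuming `mys` holds the sample string from the question:

```python
# Assumption: mys is the sample input from the question.
mys = 'some string, another string, '

# Split on commas, trim whitespace, and drop the empty trailing piece.
parts = tuple(s.strip() for s in mys.split(',') if s.strip())
print(parts)  # → ('some string', 'another string')
```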

EDIT: OK, there is indeed more to this problem than it initially seemed. Leaving this here for historical purposes, though. (Guess I'm not "disciplined" :) )

Answer 3

As described, I think this regex works fine:

import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match 
thepattern.findall("a, b, asdf, d")     # until comma or end of line
# Result:
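Out[19]: ['a', ' b', ' asdf', ' d']

The key here is to use findall rather than match. The phrasing of your question suggests you prefer match, but it is not the right tool for the job here: it is designed to return exactly one string for each corresponding group ( ) in the pattern. Since your number of groups is variable, the right approach is to use either findall or split.

If this is not what you are looking for, please make the question more specific.

Edit: And if you must use tuples rather than lists:

tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')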
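
For the split alternative mentioned in this answer, a minimal sketch using `re.split` might look like this. The separator pattern `\s*,\s*` is an assumption chosen so that the pieces come back already trimmed.

```python
import re

line = 'a, b, asdf, d'

# Split on a comma plus any surrounding whitespace.
pieces = re.split(r'\s*,\s*', line)
print(pieces)

# With a trailing separator, the last piece is empty and can be filtered out.
trailing = [p for p in re.split(r'\s*,\s*', 'some string, another string, ') if p]
print(trailing)
```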

Answer 4

import re

regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"

print(re.match(regex, 'some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string , another string, ').groups())
# ('some string', 'another string')
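
The answer gives no explanation, so here is one reading of the pattern, rebuilt from a named piece. The name `FIELD` is introduced purely for illustration, and `[ ]` is just an explicit way of writing a literal space.

```python
import re

# One field: characters that are neither comma nor space, or a run of
# spaces followed by such a character. Inner spaces are kept, while
# spaces directly before a comma are left outside the group.
FIELD = r'((?:[^, ]|[ ]+[^, ])+)'

# Two fields, each followed by a comma, with optional spaces around them.
pattern = re.compile(r'[ ]*' + FIELD + r'[ ]*,[ ]*' + FIELD + r'[ ]*,[ ]*')

samples = ('some string, another string, ',
           ' some string , another string, ')
results = [pattern.match(sample).groups() for sample in samples]
print(results)
```

Like the computed pattern in the question, this still hard-codes the number of fields.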

Answer 5

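No offense, but you obviously have a lot to learn about regexes, and what you are going to learn, ultimately, is that regexes cannot handle this job. I am sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.

Do yourself a favor: forget about regexes and learn pyparsing instead. Or skip Python entirely and use a standalone parser generator such as ANTLR. In either case, you will probably find that grammars for most of your file formats have already been written.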

Answer 6

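I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, '?

I do not think such a notation exists.

But regexes are not only a matter of NOTATION, that is to say the RE string used to define a regex. They are also a matter of TOOLS, that is to say the functions.

Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.

You should have given more information without delay: we could have understood the constraints more quickly. Because in my opinion, for the problem as you stated it, findall() is indeed fine: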

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print(re.findall('(.+?), *', line))

Result

['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']
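
If the list is only one part of a longer line, one option is two passes: isolate the repeated section with match, then run findall on that section alone. The sketch below is only an assumption about what such a line might look like; the `id=...; items: ...; end` layout is made up for the example.

```python
import re

# Hypothetical longer line: a made-up prefix, the list, then a made-up suffix.
line = 'id=42; items: some string, another string, third string, ; end'

# Pass 1: capture the whole repeated section as a single group.
outer = re.match(r'id=(\d+); items: ((?:.+?, )*); end', line)

# Pass 2: pull the individual items out of that section.
items = re.findall(r'(.+?), ', outer.group(2))
print(outer.group(1), items)
```

This costs two regex calls per line, which is the trade-off the quoted objection describes.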

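Now, since you "omitted a lot of complexity" in your question, findall() may turn out to be insufficient to handle that complexity. In that case finditer() can be used, because it allows more flexibility in selecting the groups of each match.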

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print([mat.group(1) for mat in re.finditer('(.+?), *', line)])

This gives the same result, and it can be made more elaborate by writing other expressions in place of mat.group(1).
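
For instance, a sketch that replaces mat.group(1) with a tuple of the item and its offset in the line. The group name `item` is an arbitrary choice for the example.

```python
import re

line = 'Topaz, Turquoise, Moss Agate, '

# Each match object exposes more than the captured text, e.g. where it starts.
items = [(mat.group('item'), mat.start('item'))
         for mat in re.finditer(r'(?P<item>.+?), *', line)]
print(items)
```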
