regex to parse tables wrapped into xml

Suppose we have a table:

Key|Val|Flag
01 |AAA| Y
02 |BBB| N
...

It is wrapped into XML like this:

<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
  ...
</Data>

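Obviously, there can be more columns and rows.

Now I would like to parse the XML back into a table using a single regular expression.

I can find all the fields with <F>([\w\d]*)</F>, but I need to group them by row somehow.

I thought of <R>(<F>([\w\d]*)</F>)*</R>, but the Python implementation returns nothing useful.

Can anyone help me write the regular expression?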
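One likely reason the row-level pattern disappoints: in Python's re module, a capturing group inside a repetition keeps only its last iteration, so a single pattern cannot hand back every field of a row. A minimal sketch of that behavior, assuming the XML is held in a string (the name xml and the sample data are taken from the question; the names last_only and rows are just for illustration):

```python
import re

xml = """<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
</Data>"""

# A repeated group remembers only its final iteration, so each row
# contributes just its last field.
last_only = re.findall(r"<R>(?:<F>(\w*)</F>)*</R>", xml)

# Two passes avoid the problem: isolate each row body first,
# then collect every field inside it.
rows = [re.findall(r"<F>(\w*)</F>", body)
        for body in re.findall(r"<R>(.*?)</R>", xml)]
```

Note that \w already matches digits, so [\w\d] can be shortened to \w.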

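UPDATE: Some context for the question.

I know there are plenty of XML parsing libraries, but unfortunately my environment is limited to the standard library. In any case, thanks to everyone who warned against parsing XML with regular expressions.

I need a quick and dirty solution, so I decided to start with regular expressions and switch to a real parser later.

So far, I have this code: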

...
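# Non-greedy patterns: one for a whole row, one for a single field
row_p = r'<R>(.*?)</R>'
field_p = r'<F>(.*?)</F>'
table = ''

# Join the fields of each row with '|' and end the row with a newline
for row in re.finditer(row_p, xml):
    table += '|'.join(re.findall(field_p, row.group(1))) + '\n'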

...

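It works for small datasets (about 10000 rows), but not for tables with more than 500000 rows.

I may investigate why it fails, but my next step is to switch to a standard XML parser. ElementTree is the first candidate.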
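For a table of that size, a streaming parse with the standard library might look like the sketch below. It uses xml.etree.ElementTree.iterparse and clears each row after reading it, so row contents are not all held in memory at once. The helper name iter_rows and the in-memory sample are assumptions for illustration, and whether this fixes the failure on the 500000-row table is untested.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each <R> element as a list of its <F> texts (hypothetical helper)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "R":
            yield [f.text or "" for f in elem.findall("F")]
            elem.clear()  # drop the row's children once they have been read


# Stand-in for a real file; iterparse also accepts a filename.
sample = io.BytesIO(b"""<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
</Data>""")

table = "\n".join("|".join(row) for row in iter_rows(sample))
```

Building the output with a single join, or writing each row straight to a file, avoids growing one large string row by row.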

Best answer
import libxml2

txt = """
<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
</Data>
"""

rows = []
for elem in libxml2.parseDoc(txt):
    if elem.name == "R":
        curRow = []
        rows.append(curRow)
    elif elem.name == "F":
        curRow.append(elem.get_content())

Returns:

rows = [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']]
Other answers

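Obligatory link:

Use an XML parser. lxml is really nice and even provides XPath (and other XML-related things). If you have a fetish for one-liners, I'm sure there is an XPath one-liner that extracts those elements ;)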
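The exact lxml one-liner is not given here, so the following is only a sketch using the standard library's xml.etree.ElementTree instead, whose findall accepts a limited subset of XPath. The sample string txt is copied from the question; a real lxml version may look different.

```python
import xml.etree.ElementTree as ET

txt = """<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
</Data>"""

root = ET.fromstring(txt)

# One list per <R> row, holding the text of each <F> child in order.
rows = [[f.text for f in r.findall("F")] for r in root.findall("R")]
```

This loads the whole document into memory, which is fine for small inputs but may not suit very large tables.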

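If this question were tagged Perl, I could post a solution with code for you, but since this is Python...

Anyway, I suggest you load the XML file and read it line by line. Loop over each line until the end of the file and find all the fields in that line. As far as I know, matches in Python are stored in an array. There you have it. I wish I could show you real code, but this is just the main idea: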

load file
foreach line in <file>
    if regex.match('<F>([\w\d]*)</F>', line)
        print matches[1] . '|' . matches[2] . '|' . matches[3] . "\n"
end loop

Disclaimer: the code above is only a rough sketch.
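In Python, the same line-by-line idea could be written roughly as follows. This is a sketch that assumes each <R> row sits on its own line, as in the sample data; the helper name rows_from_lines and the in-memory sample are made up for illustration.

```python
import io
import re

# \w already covers digits, so [\w\d] reduces to \w
FIELD = re.compile(r"<F>(\w*)</F>")


def rows_from_lines(lines):
    """Yield one '|'-joined string per line that contains fields."""
    for line in lines:
        fields = FIELD.findall(line)  # every field on this line, in order
        if fields:
            yield "|".join(fields)


# Stand-in for an open file; any iterable of lines works.
sample = io.StringIO("""<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
</Data>""")

output = list(rows_from_lines(sample))
```

Because findall returns all matches on the line, the number of columns does not need to be known in advance.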

Oh, and by the way, use an XML parser if possible.

lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
