Improving text extraction routine from XML

I have an XML file which contains a number of <TEXT> </TEXT> tags enclosing text.

<TEXT>

<!-- PJG STAG 4703 -->

<!-- PJG ITAG l=94 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=69 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=50 g=1 f=1 -->


<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>

<!-- PJG /ITAG -->

<!-- PJG ITAG l=18 g=1 f=1 -->

<USBUREAU>Packers and Stockyards Administration</USBUREAU>
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=55 g=1 f=1 -->
Amendment to Certification of Central Filing System_Oklahoma
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=11 g=1 f=1 -->
The Statewide central filing system of Oklahoma has been previously certified, pursuant to section 1324 of the Food
Security Act of 1985, on the basis of information submitted by Hannah D. Atkins, Secretary of State, for farm products
produced in that State (52 FR 49056, December 29, 1987).
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
The certification is hereby amended on the basis of information submitted by John Kennedy, Secretary of State, for
additional farm products produced in that State as follows: Cattle semen, cattle embryos, milo.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
This is issued pursuant to authority delegated by the Secretary of Agriculture.
<!-- PJG /ITAG -->

<!-- PJG QTAG 04 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=4 -->
Authority:
<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=1 -->
 Sec. 1324(c)(2), Pub. L. 99-198, 99 Stat. 1535, 7 U.S.C. 1631(c)(2); 7 CFR 2.18(e)(3), 2.56(a)(3), 55 FR 22795.
<!-- PJG /ITAG -->

<!-- PJG QTAG 02 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->
Dated: January 21, 1994
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<SIGNER>
<!-- PJG ITAG l=06 g=1 f=1 -->
Calvin W. Watkins, Acting Administrator,
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNER>
<SIGNJOB>
<!-- PJG ITAG l=04 g=1 f=1 -->
Packers and Stockyards Administration.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNJOB>
<FRFILING>
<!-- PJG ITAG l=40 g=1 f=1 -->
[FR Doc. 94-1847 Filed 1-27-94; 8:45 am]
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</FRFILING>
<BILLING>
<!-- PJG ITAG l=68 g=1 f=1 -->
BILLING CODE 3410-KD-P
<!-- PJG /ITAG -->
</BILLING>

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /STAG -->
</TEXT>

My task is to extract the text from these tags. This is what I did:

def getTextFromXML():
    global Text, xmlDoc
    TextNodes = xmlDoc.getElementsByTagName("TEXT")
    docstr = ''
    #Text = [TextFromNode(textNode) for textNode in TextNodes]
    for textNode in TextNodes:
        for cNode in textNode.childNodes:
            if cNode.nodeType == Node.TEXT_NODE:
                docstr+=cNode.data
            else:
                for ccNode in cNode.childNodes:
                    if ccNode.nodeType == Node.TEXT_NODE:
                        docstr+=ccNode.data                
        Text.append(docstr)

The problem is that it takes a lot of time. I guess my function is not efficient. Can anyone suggest how to improve it?

The file I'm processing contains around 6000+ <TEXT> elements.

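Best answer

lxml is easier to work with than the xml library included in the standard Python library. It is a binding to the C library libxml2, so I'm assuming it is also faster.

I'd do this (using your variable names):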

from lxml import etree
with open('some-file.xml') as f:
    xmlDoc = etree.parse(f)
    root = xmlDoc.getroot()

    Text = []
    for textNode in root.xpath('TEXT'):
        # Collect direct text and text of direct children, skipping blank pieces
        docstr = '\n'.join(text.strip() for text in textNode.xpath('*/text() | text()') if text.strip())
        Text.append(docstr)
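
If you would rather stay with xml.dom.minidom, note that the function in the question never resets docstr inside the outer loop, so every appended entry also carries the text of all earlier <TEXT> elements, and the strings keep growing. The following is only a sketch of one way to avoid that: it collects the pieces in a list that is reset per element and joins them once. The function name get_text_from_xml, the passing of the document as a parameter, and the tiny sample document are assumptions made for illustration, not part of the original code.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def get_text_from_xml(xml_doc):
    """Return one string per <TEXT> element (child and grandchild text only)."""
    texts = []
    for text_node in xml_doc.getElementsByTagName("TEXT"):
        parts = []  # reset for every <TEXT> element
        for child in text_node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                parts.append(child.data)
            else:
                # Comment nodes have no children, so they contribute nothing here.
                for grandchild in child.childNodes:
                    if grandchild.nodeType == Node.TEXT_NODE:
                        parts.append(grandchild.data)
        texts.append("".join(parts))  # join once instead of repeated +=
    return texts


# Hypothetical miniature document, used only to exercise the function.
sample = "<DOC><TEXT>a<!-- c --><B>b</B></TEXT><TEXT>c</TEXT></DOC>"
result = get_text_from_xml(minidom.parseString(sample))
```

Whether this alone is fast enough for 6000+ elements is something you would have to measure; it only removes the accumulation across elements.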

Other answers

If you use lxml (or xml.etree in Python 2.7), you can use the .itertext() method, for example:

s = ''.join(elem.itertext())

With lxml, you could also use the string() XPath function (probably faster, since all the work is done by libxml2 itself rather than in Python):

s = elem.xpath('string()')
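
The .itertext() approach mentioned above can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree on a shortened, made-up excerpt of the document from the question; the <DOC> wrapper element and the variable names are assumptions. The default ElementTree parser discards comments, so the PJG markers never reach the output.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt; the real file has 6000+ <TEXT> elements.
sample = """<DOC>
<TEXT>
<!-- PJG ITAG l=50 g=1 f=1 -->
<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>
<!-- PJG /ITAG -->
Amendment to Certification
</TEXT>
</DOC>"""

root = ET.fromstring(sample)

texts = []
for elem in root.iter("TEXT"):
    # itertext() yields the text and tail strings of elem and all of its
    # descendants, in document order.
    pieces = (piece.strip() for piece in elem.itertext())
    texts.append("\n".join(p for p in pieces if p))
```

Unlike the XPath expression '*/text() | text()', which stops at direct children, itertext() descends to any depth, so the two can give different results on more deeply nested input.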



