Extract strings in Python

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file...

...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....

I want something like if I do:-

data = foo("file.txt")

I get:-

data = ['AAA', 'BBB', 'CCC', 'DDD']

What is the best possible way? My file is not big...

Basically, I want to extract "remaining upload data transfer" from this file which in HTML looks like THIS

Best answer

You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that HTML is a mess. It's rarely perfect, and this causes problems when you rely on it for data.
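To make the fragility concrete, here is a minimal sketch of the regex route. The pattern, the function name extract_with_regex, and the sample strings are illustrative assumptions (the sample assumes the class attribute is written as class='textfont'). It works on the exact markup from the question and silently returns nothing once the tag gains an extra attribute.

```python
import re

# Assumed markup: the question's sample, with the class attribute quoted.
SAMPLE = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
some unrelated text
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""

# Matches only a <font> tag whose first and only attribute is class=textfont
# (quoted or unquoted, optional whitespace) and captures the text inside it.
FONT_TEXT = re.compile(
    r"""<font\s+class\s*=\s*['"]?\s*textfont\s*['"]?\s*>(.*?)</font>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_with_regex(text):
    """Return the stripped text of every tag the pattern recognises."""
    return [match.strip() for match in FONT_TEXT.findall(text)]


print(extract_with_regex(SAMPLE))   # ['AAA', 'BBB']

# The same data with one extra attribute is perfectly valid HTML,
# but the pattern no longer matches, and nothing warns you about it.
VARIANT = '<font color="red" class="textfont">CCC</font>'
print(extract_with_regex(VARIANT))  # []
```

The pattern could be widened to cope with extra attributes, but each new variation needs another patch, which is the "parsing to some extent" problem described above.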

I would personally use BeautifulSoup. It does more than you're asking for, but at a fraction of the effort.

Answers

You want BeautifulSoup:

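from bs4 import BeautifulSoup  # BeautifulSoup 4; older releases used "from BeautifulSoup import BeautifulSoup"

with open("file.txt") as your_file:
    soup = BeautifulSoup(your_file, "html.parser")

# Text of every <font class="textfont"> element
data = [font.get_text(strip=True) for font in soup.find_all("font", "textfont")]

def foo():
    # Read the whole file into one string
    with open("myfile.txt", "r") as input_file:
        text = input_file.read()

    looking_for = ["AAA", "BBB", "CCC", "DDD"]
    have = []

    # Keep each known string that occurs somewhere in the file
    for thing in looking_for:
        if thing in text:
            have.append(thing)
    return have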

In a case like this, your options are: attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with f = open(), f.read() and your own parser.
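As a rough sketch of the "do it yourself" option, the helper below scans a string for a fixed opening marker and the next closing marker. The name extract_between and the exact marker strings are assumptions for illustration. It only works when the opening tag is spelled exactly as given, so it is no more tolerant of messy HTML than a regex.

```python
def extract_between(text, start_marker, end_marker):
    """Return the stripped text found between each start/end marker pair."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no more opening markers
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # opening marker without a closing one: stop
        results.append(text[start:end].strip())
        pos = end + len(end_marker)
    return results


# Assumed spelling of the opening tag; adjust it to match the real file.
OPEN_TAG = "<font class='textfont'>"
CLOSE_TAG = "</font>"

sample = (
    "junk <font class='textfont'>AAA</font> more junk "
    "<font class='textfont'> BBB </font> tail"
)
print(extract_between(sample, OPEN_TAG, CLOSE_TAG))  # ['AAA', 'BBB']
```

To use it on a file, read the contents with open(path).read() and pass the resulting string as text.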

If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class DataOnlyParser(HTMLParser):
    def parse(self, text):
        # Feed the whole document and collect every non-empty text node
        self.result = []
        self.feed(text)
        self.close()
        return self.result

    def handle_data(self, data):
        # Called by the parser for each run of text between tags
        data = data.strip()
        if data:
            self.result.append(data)

p = DataOnlyParser()

data = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""

print(p.parse(data))
# ['AAA', 'BBB', 'CCC', 'DDD']

If your selection criteria are more complex though, and/or if the input is malformed, you'd probably be better off with a library like lxml.
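For a modest step up in selectivity that stays within the standard library, the parser can track which tag it is inside and keep only the text of font elements whose class is textfont. This is a sketch under assumptions: the names FontTextParser and extract_textfont are made up for illustration, and it uses Python 3's html.parser rather than lxml.

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collect text found inside <font class="textfont"> elements only."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._depth = 0  # > 0 while inside a matching <font> element

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested <font> tags too, so the matching close is found.
            if self._depth or "textfont" in classes:
                self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if self._depth and data:
            self.result.append(data)


def extract_textfont(html_text):
    """Return the text of every <font class="textfont"> element in html_text."""
    parser = FontTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.result


sample = """
<p>other text</p>
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<font>not wanted</font>
<TD align="left" class=texttd><font class="textfont"> BBB </font></TD>
"""
print(extract_textfont(sample))  # ['AAA', 'BBB']
```

The foo("file.txt") from the question would then amount to reading the file and returning extract_textfont(contents). Beyond this kind of single-tag filter, a library such as lxml or BeautifulSoup is likely the more maintainable choice.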

You do NOT want to use regular expressions to "parse" html. See here.




