Question

I parsed an HTML page with BeautifulSoup's .findAll and stored the output in a variable named result. If I type result in the Python shell and press Enter, I see normal text as expected. But when I tried to post-process the result as a string object, I noticed that str(result) returns garbage, like this sample:

\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />
<hr />
</div>

The HTML page source is UTF-8 encoded.

How can I handle this?

The code is basically this, in case it matters:

import urllib
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)


Answer 1

Python 2.6.7, BeautifulSoup.__version__ 3.2.0

This worked for me:

unicode.join(u'\n', map(unicode, result))

I'm pretty sure result is a BeautifulSoup.ResultSet object, which seems to be an extension of the standard Python list.
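To illustrate why converting the whole list looks different from converting each element, here is a minimal sketch. It is written in Python 3 (an assumption; the thread itself uses Python 2), and FakeTag plus the sample strings are hypothetical stand-ins, not BeautifulSoup classes. It only mirrors the general behaviour that formatting a list uses the escaped repr() of each element.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; not part of BeautifulSoup."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # Debug-oriented form: non-ASCII characters are shown as escapes.
        return ascii(self.text)

    def __str__(self):
        # Readable form: the text itself.
        return self.text


# A plain list plays the role of the result set here (assumption).
result = [FakeTag("страница"), FakeTag("текст")]

# Formatting the list as a whole uses repr() of every element,
# so the escapes show up in the output.
as_list = str(result)

# Converting element by element and joining keeps the text readable.
joined = "\n".join(map(str, result))

print(as_list)
print(joined)
```

The second form is the Python 3 counterpart of the one-liner above: convert each element on its own, then join.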

Answer 2

import urllib
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# then iterate over the results
for line in result:
    # get the str value of each item; replace 'utf-8' with whatever charset you need
    print line.__str__('utf-8')

BTW: the Beautiful Soup version is BeautifulSoup-3.2.1.

Answer 3

That's not garbage, that's UTF-8-encoded text. Use unicode instead.
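As a small sketch of that point, the bytes from the question's sample (with the backslash escapes written out) decode cleanly as UTF-8. This uses Python 3 syntax, which is an assumption since the thread is Python 2; there, bytes and str play the roles of Python 2's str and unicode.

```python
# The byte sequence from the question's sample, written as a bytes literal.
raw = b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0"

# Decoding as UTF-8 turns the 14 bytes into 7 Cyrillic characters.
text = raw.decode("utf-8")

# Encoding again gives back exactly the original bytes.
roundtrip = text.encode("utf-8")

print(repr(raw))   # escaped byte view, similar to what str(result) showed
print(text)        # readable text
```

The escaped view and the readable view are the same data; only the representation differs.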

Answer 4

Use:

unicodedata.normalize('NFKC', p.decode('utf-8')).encode('ascii', 'ignore')

Unicode has multiple normalization forms. That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
For details on Python's Unicode implementation, refer to this document (it also covers normalization).
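A short sketch of what normalization followed by encode('ascii', 'ignore') does in practice may help. It is written in Python 3 (an assumption; the thread is Python 2), and the sample strings are made up for illustration. Note that characters with no ASCII equivalent are silently dropped, so this approach may not suit Cyrillic text like the question's sample.

```python
import unicodedata

ligature = "\ufb01le"   # "file" spelled with the single-character fi ligature
accented = "caf\u00e9"  # "cafe" with a precomposed e-acute
cyrillic = "\u0447\u0438\u043b\u043d\u0438\u0446\u0430"  # text from the question's sample

# NFKC replaces compatibility characters, so the ligature becomes "fi".
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

# NFKC keeps the accent composed, so 'ignore' drops the whole character.
nfkc_ascii = unicodedata.normalize("NFKC", accented).encode("ascii", "ignore")

# NFKD splits the accent off, so only the combining mark is dropped.
nfkd_ascii = unicodedata.normalize("NFKD", accented).encode("ascii", "ignore")

# Cyrillic letters have no ASCII form in either case: everything is dropped.
cyrillic_ascii = unicodedata.normalize("NFKC", cyrillic).encode("ascii", "ignore")

print(nfkc_ligature, nfkc_ascii, nfkd_ascii, cyrillic_ascii)
```

If the goal is readable non-ASCII output rather than an ASCII approximation, decoding to text and leaving it as text is probably the safer choice.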
