Extracting links from HTML table using BeautifulSoup with unclean source code

I am trying to scrape articles from a Chinese newspaper database. Here is an excerpt of the page source:

<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web" /><html>
<! -- <%@ page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>  
</html>

When I try to do some straightforward scraping of the links in the table:

import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.set_handle_robots(False)

url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a')  # links is empty =(

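BeautifulSoup does not find anything in the HTML, i.e. it returns an empty list. I think this is because the source starts with the base href tag, and the parser only recognizes two tags in the document: base href and html.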

How can I scrape the links in this case? Thank you!

Best answer

BS isn't really developed any longer; I would have a look at lxml.

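I don't have access to that particular URL, but I was able to get this working using the HTML fragment from the question (to which I added an <a> tag).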

>>> import lxml.html
>>> soup = lxml.html.document_fromstring(u)  # u holds the HTML fragment as a string
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example
Other answers

After removing line 2 of the source, BeautifulSoup finds all the tags. I did not find a better way to handle this.

page = br.open(url)
page = page.read().replace('<! -- <%@ page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)

When your HTML is very messed up, it's better to clean it up a little first. In this case, for instance, remove everything before <html> and everything after the first </html>. Download one page, edit it by hand to see what BeautifulSoup will accept, and then write some regexes to do the same preprocessing automatically.
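The preprocessing idea can be sketched roughly as follows. This is only an illustration under some assumptions: it is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, the names clean_html, LinkCollector and extract_links are made up for this example, and the sample markup (including the example.com URL and the link target) is invented to resemble the excerpt in the question.

```python
import re
from html.parser import HTMLParser


def clean_html(raw):
    """Keep only <html> ... first </html>, then drop the broken JSP remnant."""
    # Lazy match: stops at the first closing </html>, discarding the duplicate.
    match = re.search(r"<html\b.*?</html>", raw, flags=re.DOTALL | re.IGNORECASE)
    if match:
        raw = match.group(0)
    # Remove the malformed, never-closed "<! -- <%@ ... %>" directive.
    return re.sub(r"<!\s*--\s*<%@.*?%>", "", raw, flags=re.DOTALL)


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(raw):
    parser = LinkCollector()
    parser.feed(clean_html(raw))
    parser.close()
    return parser.links


# Invented sample that mimics the structure of the page in the question.
sample = (
    '<base href="http://example.com/web" /><html>\n'
    '<! -- <%@ page contentType="text/html;charset=GBK" %>\n'
    '<head><title>overview</title></head>\n'
    '<body><table><tr><td><a href="detail?record=1">one</a></td></tr></table></body>\n'
    '</html>\n'
    '</html>\n'
)

print(extract_links(sample))
```

Note that trimming everything before <html> also discards the base tag, so if the links are relative you may want to read the base href separately before cleaning and join it with each link afterwards.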




