I am trying to scrape articles from a Chinese newspaper database. Here is some of the page source (abridged, key parts only):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web" /><html>
<!-- <%@ page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>
</html>
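One thing to note: the page declares gb2312/GBK, and the percent-encoded parameters in the search URL below use the same encoding. A quick sanity check (the decode step is just my own test, not part of the scraper; the literal is taken from the URL) shows what the search term is:

# -*- coding: utf-8 -*-
import urllib

term = urllib.unquote('%B9%FA%BC%CA%B9%D8%CF%B5')  # raw GB2312 bytes from the query string
print term.decode('gb2312')                        # -> 国际关系 ("international relations")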
But when I try to do some straightforward scraping of the links in the table:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
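To narrow down what BeautifulSoup actually built, a rough check like this (reading the response into a string first so it can be inspected and reused; the variable names are just mine) is what I would try next:

html = br.open(url).read()                           # raw markup as a string
print html[:300]                                     # does it really begin with <base href=...>?
soup = BeautifulSoup(html)
print [tag.name for tag in soup.findAll(True)][:20]  # which tags did the parser keep?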
It doesn't find anything even when I just search for html, and links is still an empty list. I think this is because the source code starts with the base href tag and the soup only acknowledges two tags in the document: base href and html.
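Since the Browser is already built with RobustFactory, one workaround I'm considering (just a sketch, not verified against this site) is to let mechanize pull the links itself instead of going through BeautifulSoup:

for link in br.links():        # mechanize's own link extraction from the last response
    print link.url, link.text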
How can I scrape the links in this case? Thanks!