Question

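I am trying to parse a list of video game titles on a shopping site. However, the whole item list is stored inside a tag.

This section of the documentation is supposed to explain how to parse only part of a document, but I can't work it out. My code: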

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib
import re

url = "Some Shopping Site"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Match every <a> tag whose title attribute is non-empty
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string

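At the moment it prints the string inside every tag that has a non-empty title attribute. But it is also printing the "special offer" items in the sidebar. If I could take only the product list div, I would kill two birds with one stone.

Many thanks.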

Answer 1

Oh dear, silly me. I was searching for tags with the attribute id=products, but it should have been product_list.

In case anyone comes searching, here is the final code.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
import re
import time


start = time.clock()  # start time, kept from the original for timing the run
url = "http://someplace.com"
html = urllib.urlopen(url).read()
# Parse only the div that holds the product list
product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, parseOnlyThese=product)
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string
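
For readers on Python 3, the same idea can be sketched with the bs4 package, where the keyword is parse_only rather than parseOnlyThese. This is only an illustration under assumptions: the sample HTML, the product_titles helper and the product_list id are made up here and do not come from the real site.

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: a sidebar with a "special offer" plus the product list div.
SAMPLE_HTML = """
<html><body>
  <div id="sidebar"><a href="/deal" title="Special offer">Cheap Game</a></div>
  <div id="product_list">
    <a href="/g/1" title="Game One">Game One</a>
    <a href="/g/2" title="Game Two">Game Two</a>
    <a href="/help">Help</a>
  </div>
</body></html>
"""


def product_titles(html, list_id="product_list"):
    """Return the text of every <a> with a non-empty title inside the given div."""
    # The strainer makes the parser keep only the matching div and its contents.
    only_products = SoupStrainer("div", id=list_id)
    soup = BeautifulSoup(html, "html.parser", parse_only=only_products)
    links = soup.find_all("a", title=re.compile(".+"))
    return [a.get_text(strip=True) for a in links]


titles = product_titles(SAMPLE_HTML)
print(titles)
```

Because the sidebar is never parsed, its link cannot show up in the results. Note that parse_only is ignored by the html5lib parser, so this sketch uses the built-in html.parser.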

Answer 2

Try finding the product list div first, then search inside it for a tags that have a title:

product = soup.find('div', {'id': 'products'})
for a in product.findAll('a', {'title': re.compile('.+')}):
    print a.string
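
One caveat with this approach is that find returns None when no tag matches, so a wrong id makes the following findAll call fail. A minimal Python 3 sketch with the bs4 package that guards against this could look as follows; the sample HTML, the titles_in_div helper and both ids are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page used only for this sketch.
SAMPLE_HTML = """
<div id="sidebar"><a href="/deal" title="Special offer">Cheap Game</a></div>
<div id="product_list">
  <a href="/g/1" title="Game One">Game One</a>
  <a href="/g/2" title="Game Two">Game Two</a>
</div>
"""


def titles_in_div(html, div_id):
    """Return link texts inside the div with this id, or None if no such div exists."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=div_id)
    if container is None:
        # No div carries this id, so there is nothing to search in.
        return None
    links = container.find_all("a", title=re.compile(".+"))
    return [a.get_text(strip=True) for a in links]


found = titles_in_div(SAMPLE_HTML, "product_list")
missing = titles_in_div(SAMPLE_HTML, "products")
print(found, missing)
```

Returning None for a missing div makes a mistyped id visible immediately instead of surfacing later as an attribute error.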
