Question

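I am trying to extract certain links from a web page (http://mp3skull.com/mp3/linkin_park_faint.html) with a simple Python script. I can extract the links successfully, but now I want to get some more information from the page, such as bitrate, size, and duration.

I am using the XPath below to extract that information: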

>>> import lxml.html
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

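What I need now is, for a particular link, the information as a tuple such as (bitrate, size, duration).

The XPath above yields the required information, but it is so ill-formatted that I cannot find any logic that turns it into the format I need, at least none that I can come up with.

So, is there any way to get the output in my format?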

Answer 1

I think Beautiful Soup can do the job; it copes even with very badly formed HTML:

http://www.crummy.com/software/BeautifulSoup/

Parsing is fairly easy with Beautiful Soup, for example:

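import bs4
from urllib.request import urlopen

# Download the page and parse it with the standard-library HTML parser
html = urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read()
soup = bs4.BeautifulSoup(html, 'html.parser')

# Print every <a> element found on the page
print(soup.find_all('a'))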

It also has quite good docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Answer 2

You can strip out all the whitespace with XPath:

translate(.//*[@id='song_html']/div[1]/text(), "\n\t, ", ' ')

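So, for your follow-up question, either:

info[0:len(info)]

for the whole string, or:

info.rfind(" ")

since translate leaves a space character, but you can replace it with anything you want.

Additional info can be found here: http://www.astro.ufl.edu/~warner/prog/python.html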
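Note that in XPath 1.0, translate() works on a single string, so a node-set argument is reduced to the string value of its first node. If every fragment needs cleaning, one alternative is to do the whitespace cleanup in Python after the query instead of inside the XPath expression. The following is only a rough sketch of that idea, not the approach from this answer; the helper name clean is made up for illustration, and the input list is typed in by hand to look like the output shown in the question.

```python
def clean(fragments):
    """Collapse whitespace in each fragment and drop fragments that end up empty."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces with one space normalizes tabs and newlines.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return [text for text in cleaned if text]


# Hand-typed stand-in for the list returned by doc.xpath(...) in the question.
info = [
    "\n\t\t\t",
    "\n\t\t\t\t3.71 mb\t\t\t",
    "\n\t\t\t",
    "\n\t\t\t\t3.49 mb\t\t\t",
    "\n\t\t\t",
    "\n\t\t\t\t192 kbps",
    "2:41",
]

print(clean(info))  # ['3.71 mb', '3.49 mb', '192 kbps', '2:41']
```

This keeps the space inside values such as "3.71 mb" while removing the padding around them.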

Answer 3

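How about using regular expressions and Python's re module?

http://docs.python.org/library/re.html may be what you need.

As for getting the data out of the array, re.match(regex, info[n]) should suffice, and as for the triple, Python's tuple syntax takes care of it. Just match the members of the info array with re.match.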

import re

matching_re = r'.*'  # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
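To make the regex idea more concrete, here is a minimal sketch that pulls bitrate, size, and duration out of the text fragments belonging to one link. The three patterns and the helper name song_tuple are assumptions based only on the sample values in the question ("192 kbps", "3.71 mb", "2:41"); the real page may use other units or layouts, and grouping the fragments per link is left out.

```python
import re

# Assumed patterns, derived from the sample values shown in the question.
BITRATE_RE = re.compile(r"\d+\s*kbps", re.IGNORECASE)
SIZE_RE = re.compile(r"\d+(?:\.\d+)?\s*mb", re.IGNORECASE)
DURATION_RE = re.compile(r"\b\d+:\d{2}(?::\d{2})?\b")


def song_tuple(fragments):
    """Return (bitrate, size, duration) for one link; missing fields are None."""
    # Join the fragments so each pattern can be searched once over all of them.
    text = " ".join(fragments)

    def first(pattern):
        match = pattern.search(text)
        return match.group(0) if match else None

    return (first(BITRATE_RE), first(SIZE_RE), first(DURATION_RE))


# Fragments copied from the sample output in the question.
print(song_tuple(["\n\t\t\t\t192 kbps", "2:41"]))   # ('192 kbps', None, '2:41')
print(song_tuple(["\n\t\t\t\t3.71 mb\t\t\t"]))      # (None, '3.71 mb', None)
```

Using search rather than match avoids having to strip the leading tabs and newlines first, since match only succeeds at the start of the string.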
