Extract information from a webpage in a particular format

I am trying to extract certain links from a webpage (http://mp3skull.com/mp3/linkin_park_faint.html) with a simple Python script. I can extract the links successfully, but now I want to get some more information from the page, such as bitrate, size, and duration.

I am using the XPath below to extract that information:

>>> import lxml.html
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

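What I need now is, for a particular link, the required information in the form of a tuple such as (bitrate, size, duration).

The XPath above does return the required information, but it is so ill-formatted that I cannot find any logic that turns it into the format I need.

So, is there any way to get the output in my format?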

Answers

I think Beautiful Soup will do the job; it parses even very badly formatted HTML:

http://www.crummy.com/software/BeautifulSoup/

With Beautiful Soup, parsing is quite easy, for example:

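import bs4
from urllib.request import urlopen  # Python 3; the original answer used Python 2's urllib.urlopen

# Download the page and parse it with the standard-library HTML parser.
html = urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read()
soup = bs4.BeautifulSoup(html, 'html.parser')

# Print every <a> element found on the page.
print(soup.find_all('a'))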

It also has quite good docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

You can strip all of that out with XPath:

translate(.//*[@id='song_html']/div[1]/text(), "
	, ", ' ')

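So, for your follow-up question, either:

info[0:len(info)]

for the whole string, or:

info.rfind(" ")

since translate leaves a space character behind, though you can replace it with anything you want.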
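If doing the cleanup in Python rather than in XPath is acceptable, a minimal sketch could look like the following. It assumes `info` is the list of text nodes shown in the question (the sample values are copied from that output, with the whitespace written as escapes). `clean_text_nodes` is a hypothetical helper name, not part of lxml.

```python
def clean_text_nodes(nodes):
    """Strip surrounding whitespace and drop nodes that are whitespace only."""
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# Sample text nodes modelled on the output shown in the question.
info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t',
        '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

cleaned = clean_text_nodes(info)
print(cleaned)  # ['3.71 mb', '3.49 mb', '192 kbps', '2:41']
```

Note that the cleaned list is still flat: it does not say which values belong to which link, so grouping into tuples would need to happen per song entry rather than over the whole list.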

Additional info can be found here: http://www.astro.ufl.edu/~warner/prog/python.html

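How about using regular expressions and Python's re module?

http://docs.python.org/library/re.html may be essential.

As for getting the data out of the array, re.match(regex, info[n]) should suffice, and as for the triple, Python's tuple syntax takes care of it. Simply match the members of the info array with re.match.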

import re


matching_re = '.*'    # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
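To make the regex idea more concrete, here is one possible sketch that pulls a (bitrate, size, duration) tuple out of the text of a single song entry. The three patterns are assumptions based only on the sample values in the question ("192 kbps", "3.71 mb", "2:41"). `parse_song_info` is a hypothetical helper, and the first sample string is made up for illustration. It uses `search` rather than `match` because the values are surrounded by whitespace.

```python
import re

# Patterns are assumptions derived from the sample values in the question;
# adjust them to whatever the real page text looks like.
BITRATE_RE = re.compile(r'(\d+)\s*kbps', re.IGNORECASE)
SIZE_RE = re.compile(r'(\d+(?:\.\d+)?)\s*mb', re.IGNORECASE)
DURATION_RE = re.compile(r'\b(\d+:\d{2}(?::\d{2})?)\b')


def parse_song_info(text):
    """Return (bitrate, size, duration) found in text; missing fields are None."""
    def first_group(pattern):
        match = pattern.search(text)
        return match.group(1) if match else None

    return (first_group(BITRATE_RE), first_group(SIZE_RE), first_group(DURATION_RE))


# Hypothetical text of one song entry, e.g. the text nodes of one div joined together.
full = parse_song_info('\n\t\t\t\t192 kbps 2:41 \n\t\t\t\t5.2 mb\t\t\t')
# An entry that only lists a size, as in the question's output.
partial = parse_song_info('\n\t\t\t\t3.71 mb\t\t\t')

print(full)     # ('192', '5.2', '2:41')
print(partial)  # (None, '3.71', None)
```

The values come back as strings without their units. Returning None for a missing field keeps every tuple the same shape, which matters here because some entries in the question's output list only a size.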



