Question

So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/ and create one HTML page that I can either print or send to my Kindle.

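I am thinking of using Hpricot, but I am not sure how to proceed.

How do I get it to recursively follow each link, fetch the HTML, either store it in a variable or dump it into the main HTML page, and then go back to the table of contents and keep doing that?

You don't have to tell me exactly how to do it, just the theory of how I might approach it.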

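Do I really have to look at the source of one of the articles, for example view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html, and manually program the script to extract the text between certain tags (e.g. h3, p, etc.)?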

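If I take that approach, I will have to look at the source of every chapter/article and do the same thing. Doesn't that defeat the purpose of writing a script?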

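Ideally, I would like a script that can tell the difference between JavaScript and other code on the one hand and the actual text on the other, and dump just the text (with the proper headings and such).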

I would really appreciate some guidance.

Thanks.

Answer 1

I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.

I did some extensive scraping for work at one time, and had to switch to Nokogiri because Hpricot would crash inexplicably on some pages.

Check out this Railscast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

and:

http://nokogiri.org/

Also:

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
