Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)

I've come across an issue that I can't seem to get past. I'm also new to Ruby on Rails, hence the number of questions.

I am attempting to scrape a webpage such as the following:

http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx

I would like to scrape the addresses, the phone numbers, and the URL of the next page, which in this case is

http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx

I've been trying just about anything I could think of, but nothing seems to work, apparently because the elements are set to invisible.

The address is inside an h3 tag, but it does not appear to be scrapable. I've also been looking into ScRUBYt, described at http://www.rubyrailways.com/ajax-scraping-with-scrubyt-linkedin-google-analytics-yahoo-suggestions/, but I can't make heads or tails of how to apply it in this case.

I would really appreciate any pointers, as this is an obstacle I need to get past in order to move forward on my assignment. Thanks in advance for any help.

Answers

In the particular example you have given, the elements are not hidden; they are loaded via Ajax after the page loads. So what you need is an HTTP client that can run JavaScript (in practice, a web browser) in order to see the addresses and the other content.

If you really want to automate the process and scrape data that is loaded through Ajax or JavaScript, you can try Selenium. Although it was built for browser test automation rather than scraping, it serves this need.
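As a rough illustration of that approach, here is a minimal sketch that loads the page in a real browser through the selenium-webdriver gem, waits for the Ajax content, and then reads the h3 elements from the rendered HTML. The method names `extract_h3_texts` and `rendered_html` are made up for this example, and the idea that an h3 appearing means the listings have loaded is an assumption based on the question, not something checked against the live page. The regex extraction is only there to keep the sketch free of extra dependencies; a real scraper would pass the rendered HTML to a parser such as Hpricot.

```ruby
# Sketch only: helper names are hypothetical, and using "an <h3> exists"
# as the signal that the Ajax content has arrived is an assumption.

# Return the visible text of every <h3> element found in an HTML string.
# Inner tags are stripped and runs of whitespace are collapsed.
def extract_h3_texts(html)
  html.scan(%r{<h3\b[^>]*>(.*?)</h3>}mi).flatten.map do |inner|
    inner.gsub(/<[^>]+>/, "").gsub(/\s+/, " ").strip
  end
end

# Open the URL in a real browser so the page's JavaScript runs,
# wait until at least one <h3> is present, and return the rendered HTML.
def rendered_html(url, wait_seconds: 10)
  # Loaded lazily so the parsing helper above works without the gem.
  # Requires `gem install selenium-webdriver` plus a local Firefox and its driver.
  require "selenium-webdriver"

  driver = Selenium::WebDriver.for(:firefox)
  begin
    driver.get(url)
    wait = Selenium::WebDriver::Wait.new(timeout: wait_seconds)
    wait.until { driver.find_elements(tag_name: "h3").any? }
    driver.page_source
  ensure
    driver.quit # always close the browser, even if the wait times out
  end
end
```

With the gem and a browser installed, something like `puts extract_h3_texts(rendered_html(url))` with the listing URL from the question would print whatever h3 text the browser ends up rendering. The phone numbers and the next-page link would need their own selectors, which have to be read off the rendered page.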

I don't have an answer to your specific question, but I thought I'd point to Ryan Bates' Railscasts episode on screen scraping with Ruby: http://railscasts.com/episodes/173-screen-scraping-with-scrapi

He uses a library called scrAPI instead of ScRUBYt, since he couldn't get ScRUBYt working. scrAPI may be a bit easier to use.

I hope this helps somewhat, good luck with your assignment! :)

-John

There is a good script posted at the Google group. It seems to extract the address, etc. You may want to look at the code in the script page.txt.




