I'm working on a multi-process spider in Python. It should start by scraping one page for links and work outwards from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in those categories, and the final, third-level pages list the participants in the events. I can't predict how many categories, events, or participants there will be.
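To make that hierarchy concrete, the parsing side would look roughly like this (BeautifulSoup and the selectors here are just stand-ins for my real parsing code):

```python
from bs4 import BeautifulSoup

def parse_categories(html):
    """Top-level page -> category page URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.category-link")]   # selector is made up

def parse_events(html):
    """Category page -> event page URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.event-link")]      # selector is made up

def parse_participants(html):
    """Event page -> participant rows destined for the DB."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("li.participant")]
```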
I'm at a bit of a loss as to how best to design such a spider, and in particular how to know when it has finished crawling (it's expected to keep going until it has discovered and retrieved every relevant page).
Ideally, the first scrape would be synchronous and everything else asynchronous, to maximise parallel parsing and writing to the DB, but I'm stuck on how to work out when the crawl is actually complete.
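For what it's worth, here is the rough shape I've been experimenting with. The seed URL, fetch(), save_participants(), and the pool of four workers are placeholders, and the parse_* functions are the ones sketched above (stubbed here so the snippet stands alone). The comments mark exactly where I get stuck:

```python
import multiprocessing as mp

# Stubs standing in for the real fetch/parse/store code (see the sketch above).
def fetch(url): return ""                  # e.g. requests.get(url).text
def parse_categories(html): return []
def parse_events(html): return []
def parse_participants(html): return []
def save_participants(rows): pass          # the DB insert

def worker(task_queue):
    # Each worker takes a (level, url) task, fetches the page, and pushes
    # any newly discovered URLs back onto the same queue.
    while True:
        level, url = task_queue.get()
        try:
            html = fetch(url)
            if level == "category":
                for event_url in parse_events(html):
                    task_queue.put(("event", event_url))
            elif level == "event":
                save_participants(parse_participants(html))
        finally:
            task_queue.task_done()
        # ...but an idle worker can't tell whether the queue is empty because
        # the crawl is finished or because the other workers simply haven't
        # pushed their discoveries yet.

if __name__ == "__main__":
    queue = mp.JoinableQueue()

    # Synchronous first scrape: seed the queue with every category URL.
    for cat_url in parse_categories(fetch("https://example.com/categories")):
        queue.put(("category", cat_url))

    workers = [mp.Process(target=worker, args=(queue,), daemon=True)
               for _ in range(4)]
    for w in workers:
        w.start()

    queue.join()   # is this a reliable "everything has been crawled" signal?
```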
How would you suggest I structure the spider in terms of parallel processes, and in particular, how should I handle the "when is it finished?" problem above?