I'm working on a multi-process spider in Python. It should start by scraping one page for links and work outwards from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in those categories, and the final, third-level pages list the participants in the events. I can't predict how many categories, events, or participants there will be.
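To make that hierarchy concrete, the parsing side would look roughly like this (BeautifulSoup and the selectors here are just stand-ins for my real parsing code):

```python
from bs4 import BeautifulSoup

def parse_categories(html):
    """Top-level page -> category page URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.category-link")]   # selector is made up

def parse_events(html):
    """Category page -> event page URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.event-link")]      # selector is made up

def parse_participants(html):
    """Event page -> participant rows destined for the DB."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("li.participant")]
```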
I'm at a bit of a loss as to how best to design such a spider, and in particular how to know when it has finished crawling (it's expected to keep going until it has discovered and retrieved every relevant page).
Ideally, the first scrape would be synchronous and everything else asynchronous, to maximise parallel parsing and writing to the DB, but I'm stuck on how to work out when the crawl is actually complete.
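For what it's worth, here is the rough shape I've been experimenting with. The seed URL, fetch(), save_participants(), and the pool of four workers are placeholders, and the parse_* functions are the ones sketched above (stubbed here so the snippet stands alone). The comments mark exactly where I get stuck:

```python
import multiprocessing as mp

# Stubs standing in for the real fetch/parse/store code (see the sketch above).
def fetch(url): return ""                  # e.g. requests.get(url).text
def parse_categories(html): return []
def parse_events(html): return []
def parse_participants(html): return []
def save_participants(rows): pass          # the DB insert

def worker(task_queue):
    # Each worker takes a (level, url) task, fetches the page, and pushes
    # any newly discovered URLs back onto the same queue.
    while True:
        level, url = task_queue.get()
        try:
            html = fetch(url)
            if level == "category":
                for event_url in parse_events(html):
                    task_queue.put(("event", event_url))
            elif level == "event":
                save_participants(parse_participants(html))
        finally:
            task_queue.task_done()
        # ...but an idle worker can't tell whether the queue is empty because
        # the crawl is finished or because the other workers simply haven't
        # pushed their discoveries yet.

if __name__ == "__main__":
    queue = mp.JoinableQueue()

    # Synchronous first scrape: seed the queue with every category URL.
    for cat_url in parse_categories(fetch("https://example.com/categories")):
        queue.put(("category", cat_url))

    workers = [mp.Process(target=worker, args=(queue,), daemon=True)
               for _ in range(4)]
    for w in workers:
        w.start()

    queue.join()   # is this a reliable "everything has been crawled" signal?
```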
How would you suggest I structure the spider in terms of parallel processes, and in particular, how should I handle the "when is it finished?" problem above?