I'm currently writing a web crawler (using the Python framework Scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind: it stores links when they get scheduled and marks them as processed once they actually have been crawled.
That way I'm able to fetch those links when resuming the spider (obviously a little more than just the URL is stored: the depth value, the domain the link belongs to, etc.), and so far everything works well.
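For reference, the bookkeeping is wired up roughly like this. This is a simplified sketch, not my exact code: I'm showing it here hooked into Scrapy's `request_scheduled` and `response_received` signals, and the class and attribute names are illustrative:

```python
from urllib.parse import urlparse

from scrapy import signals


class PauseResumeState:
    """Record links as they get scheduled; flag them once processed."""

    def __init__(self, crawler):
        # url -> metadata; in my real code this lives in MySQL (see below)
        self.links = {}
        crawler.signals.connect(self.request_scheduled,
                                signal=signals.request_scheduled)
        crawler.signals.connect(self.response_received,
                                signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def request_scheduled(self, request, spider):
        # store the URL plus the extra metadata mentioned above
        self.links.setdefault(request.url, {
            "depth": request.meta.get("depth", 0),
            "domain": urlparse(request.url).netloc,
            "processed": False,
        })

    def response_received(self, response, request, spider):
        # mark the link as processed once its response has been handled
        if request.url in self.links:
            self.links[request.url]["processed"] = True

    def pending(self):
        # on resume: everything scheduled but never processed gets re-queued
        return [(url, meta["depth"], meta["domain"])
                for url, meta in self.links.items()
                if not meta["processed"]]
```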
Right now, I've just been using a MySQL table to handle those storage actions, mostly for fast prototyping.
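Concretely, the prototype boils down to something like this (table and column names are made up for this example, and error handling is stripped out):

```python
import MySQLdb  # the driver used for the prototype

conn = MySQLdb.connect(host="localhost", user="crawler",
                       passwd="...", db="crawl")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS scheduled_links (
        url       VARCHAR(2048) NOT NULL,
        depth     INT           NOT NULL,
        domain    VARCHAR(255)  NOT NULL,
        processed TINYINT(1)    NOT NULL DEFAULT 0,
        PRIMARY KEY (url(255))
    )
""")

def on_scheduled(url, depth, domain):
    # one INSERT per scheduled link -- this is where the write load comes from
    cur.execute("INSERT IGNORE INTO scheduled_links (url, depth, domain)"
                " VALUES (%s, %s, %s)", (url, depth, domain))
    conn.commit()

def on_processed(url):
    cur.execute("UPDATE scheduled_links SET processed = 1 WHERE url = %s",
                (url,))
    conn.commit()

def pending():
    # on resume, re-queue everything scheduled but never processed
    cur.execute("SELECT url, depth, domain FROM scheduled_links"
                " WHERE processed = 0")
    return cur.fetchall()
```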
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and lightweight system that can still handle a large amount of data written in a short time.
For now, it should be able to handle crawling a few dozen domains, which means storing a few thousand links per second.
Thanks in advance for any suggestions.