Most optimized way to store crawler states?

I'm currently writing a web crawler (using the Python framework Scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled and marks them as processed once they actually are.
Thus, I'm able to fetch those links (obviously there is a little more stored than just a URL: the depth value, the domain the link belongs to, etc.) when resuming the spider, and so far everything works well.

Right now, I've just been using a MySQL table to handle that storage, mostly for fast prototyping.
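For reference, the pattern is roughly the following (sketched with sqlite3 instead of MySQL so the example runs standalone; the table and function names are just illustrative):

    import sqlite3

    # Each link is recorded when scheduled and flagged once processed.
    conn = sqlite3.connect("crawl_state.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS links (
            url TEXT PRIMARY KEY,
            depth INTEGER,
            domain TEXT,
            processed INTEGER DEFAULT 0
        )
    """)

    def schedule(url, depth, domain):
        conn.execute(
            "INSERT OR IGNORE INTO links (url, depth, domain) VALUES (?, ?, ?)",
            (url, depth, domain),
        )
        conn.commit()

    def mark_processed(url):
        conn.execute("UPDATE links SET processed = 1 WHERE url = ?", (url,))
        conn.commit()

    def pending():
        # On resume, fetch everything scheduled but not yet processed.
        return conn.execute(
            "SELECT url, depth, domain FROM links WHERE processed = 0"
        ).fetchall()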

Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and light system that can still handle a great amount of data written in a short time.

For now, it should be able to handle crawling a few dozen domains, which means storing a few thousand links a second...

Thanks in advance for any suggestions.

Best answer

The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time cost of storage. Upon restarting, you re-read the log and rebuild the in-memory structures that you were also building on the fly as you were appending to the log in the first place.

Your specific application can be further optimized, since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So your log file can be buffered and doesn't need to be obsessively fsync'ed.
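A minimal sketch of that approach in Python (the file name and record format are made up for illustration; the log is newline-delimited JSON and is simply replayed on restart):

    import json

    LOG_PATH = "crawl_log.jsonl"  # hypothetical file name

    # Open once for appending; the file object's default buffering is enough,
    # since losing a few tail entries in a crash just means re-crawling them.
    log = open(LOG_PATH, "a")

    def log_scheduled(url, depth, domain):
        log.write(json.dumps({"url": url, "depth": depth, "domain": domain}) + "\n")

    def log_done(url):
        log.write(json.dumps({"url": url, "done": True}) + "\n")

    def replay():
        # On restart, re-read the log and rebuild the in-memory state that
        # was being maintained while appending in the first place.
        pending = {}
        try:
            with open(LOG_PATH) as f:
                for line in f:
                    try:
                        entry = json.loads(line)
                    except ValueError:
                        continue  # torn final line from a crash; skip it
                    if entry.get("done"):
                        pending.pop(entry["url"], None)
                    else:
                        pending[entry["url"]] = (entry["depth"], entry["domain"])
        except FileNotFoundError:
            pass
        return pending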

I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for Bloom filters or anything fancy). If it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating detail about these options since it does not appear you will require them.
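If the periodic-dump variant were ever needed, it could look roughly like this (shelve standing in for the Berkeley DB file; the threshold and file name are arbitrary):

    import shelve

    DUMP_EVERY = 100_000  # arbitrary threshold for this sketch
    recent = set()        # URLs seen since the last dump
    disk = shelve.open("seen_urls.db")

    def mark_seen(url):
        recent.add(url)
        if len(recent) >= DUMP_EVERY:
            # Merge the recent entries to disk and drop them from memory.
            for u in recent:
                disk[u] = True
            disk.sync()
            recent.clear()

    def already_seen(url):
        return url in recent or url in disk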

Other answers

There was a talk at PyCon 2009 that you may find interesting, "Precise state recovery and restart for data-analysis applications" by Bill Gribble.

Another quick way to save your application state may be to use pickle to serialize it to disk.
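For instance, a minimal sketch (the path and the error handling are just for illustration):

    import pickle

    STATE_PATH = "crawler_state.pkl"  # hypothetical path

    def save_state(state):
        # Serialize the whole application state object in one shot.
        with open(STATE_PATH, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_state():
        try:
            with open(STATE_PATH, "rb") as f:
                return pickle.load(f)
        except (FileNotFoundError, pickle.UnpicklingError):
            return None  # no saved state yet, or a corrupt file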




