Question

I'm working on a project where I need a mature crawler, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that can save the data to disk, and I need it to recrawl only the updated resources of a site and skip the parts that are already crawled. Does anyone have experience working with the Nutch code directly in Java, not via the command line? I would like to start simple: create a crawler (or similar), minimally configure it and start it, nothing fancy. Is there an example of this, or some resource I should be looking at? I'm going over the Nutch documentation, but most of it is about the command line, search and other stuff. How usable is the Nutch crawling module without the need to index and search? Any help is appreciated. Thanks.

Answer 1

Nutch is most probably very different from anything you have worked with before. It is something like a framework: it not only has a front end for query and search (although Solr seems more powerful than the native Nutch search front end), it also has the crawling part and the indexing part (into a Lucene index).

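If you want to use the crawl for purposes other than search, you will need to develop your own programs and get familiar with Hadoop and the map-reduce scheme.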
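To give a feel for the map-reduce scheme mentioned above, here is a minimal sketch in plain Java. It does not use the Nutch or Hadoop APIs; the class name `HostCountSketch`, its methods and the sample URLs are all made up for illustration. It counts crawled URLs per host by emitting a (host, 1) pair for each URL, grouping the pairs by key, and summing each group, which is the same shape a real Hadoop job over crawl data would take.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of map, shuffle and reduce.
// None of these names come from Nutch or Hadoop.
public class HostCountSketch {

    // Map phase: turn one input record (a URL) into a (host, 1) pair.
    static Map.Entry<String, Integer> map(String url) {
        String host = URI.create(url).getHost();
        return new SimpleEntry<>(host, 1);
    }

    // Reduce phase: combine all values that share one key into a single sum.
    static int reduce(String host, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every URL, group the pairs by host (the "shuffle"),
    // then run reduce once per host.
    static Map<String, Integer> countByHost(List<String> urls) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String url : urls) {
            Map.Entry<String, Integer> pair = map(url);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for a list of fetched URLs.
        List<String> urls = List.of(
                "http://a.example/index.html",
                "http://a.example/about.html",
                "http://b.example/");
        System.out.println(countByHost(urls));
    }
}
```

In a real Hadoop job the map and reduce steps would run on different machines and the framework would do the grouping, but the division of work is the same.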

You don't say exactly what you want to do, but a simpler solution may be a better fit.
