Creating a directive robots.txt [closed]

I have a list of links which I want to get crawled. I would like all other links the crawler finds on its own not to be crawled.

Directions I looked into: creating a robots.txt that disallows all pages except those that exist in my site map. I saw information about how to create such a file, which states I can disallow parts of the site with:
Allow: /folder1/myfile.html
Disallow: /folder1/

But the links I do want crawled are not in one particular folder. I could write a huge robots.txt file that is effectively a site map, but that doesn't seem reasonable. What would you recommend?

Answers

The Robots Exclusion Protocol is limited in its URL specification capabilities. I don't know of any published maximum robots.txt file size, but it's generally not expected to be very large. It's just meant to be a recommendation to the crawlers, not an absolute.

You might consider referencing a sitemap in your robots.txt. The Wikipedia page on robots.txt mentions this capability. That would hint, to crawlers that support sitemaps, at the specific URLs you want indexed. I would assume they still follow links on those pages, though, so you would still need to specifically disallow anything internally linked that you didn't want crawled.
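
A minimal sketch of what that could look like (the domain and paths below are placeholders; note that Allow and Sitemap are extensions honored by major crawlers such as Googlebot and Bingbot, not part of the original standard, and the longest matching rule wins for those crawlers):

User-agent: *
Allow: /landing/page-one.html
Allow: /landing/page-two.html
Disallow: /
Sitemap: https://example.com/sitemap.xml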

Again, it would just be a request or recommendation, though. Crawlers aren't obligated to follow robots.txt.

If you have the time or energy, organizing your website with folders is very helpful in the long run.

As far as robots.txt is concerned, you can list the disallowed files or folders with no problem, but that could be time-consuming if you have lots of them. The original robots.txt standard only defines Disallow fields, by the way, so everything is allowed unless it is explicitly disallowed.
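
As a rough illustration (the folder and file names here are made up):

User-agent: *
Disallow: /private/
Disallow: /drafts/
Disallow: /old/legacy-page.html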

See http://en.wikipedia.org/wiki/Robots_exclusion_standard; the section near the bottom discusses the use of sitemaps rather than explicit disallow lists.

If the files you want to disallow are scattered around your site and don't follow a particular naming pattern that can be expressed with the simple wildcards that Google, Microsoft, and a few other crawlers support, then your only other option is to list each file in a separate Disallow directive in robots.txt. As you indicated, that's a huge job.
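
For reference, a sketch of the wildcard syntax those crawlers accept (the patterns below are hypothetical; * matches any sequence of characters and $ anchors the end of the URL):

User-agent: *
Disallow: /*.pdf$
Disallow: /*?print=1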

If it's important to prevent crawlers from accessing those pages, then you either list each one individually or you rearrange your site to make it easier to block the files you don't want crawled.




