Creating a directive robots.txt [closed]

I have a list of links which I want to get crawled. I would like all other links the crawler finds on its own not to be crawled.

Directions I looked into: creating a robots.txt that disallows all pages except those that exist in my site map. I saw information about how to create such a file, which states I can disallow parts of the site with:
Allow: /folder1/myfile.html
Disallow: /folder1/

But the links I do want crawled are not in one particular folder. I could write a huge robots.txt file that is effectively a site map, but that doesn't seem reasonable. What would you recommend?

Answers

The Robots Exclusion Protocol is limited in its URL specification capabilities. I don't know of any published maximum robots.txt file size, but it's generally not expected to be very large. It's just meant to be a recommendation to the crawlers, not an absolute.

You might consider referencing a sitemap in your robots.txt. The Wikipedia page on robots.txt mentions this capability. That would hint, to crawlers that support sitemaps, at the specific URLs you want indexed. I would assume they still follow links on those pages, though, so you would still need to specifically disallow anything internally linked that you didn't want crawled.
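
A minimal sketch of what that could look like (the domain and paths below are placeholders; note that Allow and Sitemap are extensions honored by major crawlers such as Googlebot and Bingbot, not part of the original standard, and the longest matching rule wins for those crawlers):

User-agent: *
Allow: /landing/page-one.html
Allow: /landing/page-two.html
Disallow: /
Sitemap: https://example.com/sitemap.xml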

Again, it would just be a request or recommendation, though. Crawlers aren't obligated to follow robots.txt.

If you have the time or energy, organizing your website with folders is very helpful in the long run.

As far as robots.txt is concerned, you can list the disallowed files or folders with no problem, but that could be time-consuming if you have lots of them. The original robots.txt standard only defines Disallow fields, by the way, so everything is allowed unless it is explicitly disallowed.
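
As a rough illustration (the folder and file names here are made up):

User-agent: *
Disallow: /private/
Disallow: /drafts/
Disallow: /old/legacy-page.html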

See http://en.wikipedia.org/wiki/Robots_exclusion_standard; the section near the bottom discusses the use of sitemaps rather than explicit disallow lists.

If the files you want to disallow are scattered around your site and don't follow a particular naming pattern that can be expressed with the simple wildcards that Google, Microsoft, and a few other crawlers support, then your only other option is to list each file in a separate Disallow directive in robots.txt. As you indicated, that's a huge job.
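
For reference, a sketch of the wildcard syntax those crawlers accept (the patterns below are hypothetical; * matches any sequence of characters and $ anchors the end of the URL):

User-agent: *
Disallow: /*.pdf$
Disallow: /*?print=1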

If it's important to prevent crawlers from accessing those pages, then you either list each one individually or you rearrange your site to make it easier to block the files you don't want crawled.




