What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.

Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know whether this is the best method or not.

I just want a massive list of URLs, but I want to build that list without running into brick walls in the future. Any thoughts?

I'm starting this project to learn Python, but that really has nothing to do with the question.

Best answer

You can register to get access to the entire .com and .net zone files at Verisign.

I haven't read the fine print of the terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
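If you go that route, turning the domain list into URLs is straightforward. Here is a minimal sketch in Python, assuming you have already extracted the domain names into a plain text file, one per line; the filename and the http:// scheme are assumptions for illustration, not anything Verisign specifies:

    # Minimal sketch: turn a plain list of domains (one per line) into URLs.
    # 'domains.txt' is a hypothetical file extracted from the zone data.
    def domains_to_urls(path):
        with open(path) as f:
            for line in f:
                domain = line.strip().lower()
                if domain:
                    # Assuming plain HTTP; many hosts redirect to www/HTTPS anyway.
                    yield 'http://%s/' % domain

    for url in domains_to_urls('domains.txt'):
        print(url)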

Other answers
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
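For example, here is a minimal sketch of fetching and reading that list in Python; it assumes the archive is still hosted at that S3 URL and contains a single CSV of rank,domain rows:

    import csv
    import io
    import urllib.request
    import zipfile

    # Download the Alexa top-1m archive into memory and walk the CSV inside it.
    URL = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
    data = urllib.request.urlopen(URL).read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(zf.namelist()[0]) as f:
            for rank, domain in csv.reader(io.TextIOWrapper(f, 'utf-8')):
                print(rank, domain)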

The modern terms are URI and URN; URL is the older, narrower term. I'd scan for sitemap files, which pack many addresses into a single file, and study the classic texts on spiders, wanderers, brokers, and bots, as well as RFC 3986 (Appendix B, p. 50), which gives a regular expression for parsing URIs.
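To illustrate both suggestions, here is a small sketch that pulls the <loc> URLs out of a sitemap file and splits each one with the regular expression from RFC 3986, Appendix B; the sitemap URL is a placeholder, not a real endpoint:

    import re
    import urllib.request
    import xml.etree.ElementTree as ET

    # The URI-splitting regular expression from RFC 3986, Appendix B.
    URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

    # Sitemap files put their <loc> elements in this XML namespace.
    NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

    def urls_from_sitemap(sitemap_url):
        # Yield every URL listed in the sitemap.
        tree = ET.parse(urllib.request.urlopen(sitemap_url))
        for loc in tree.iter(NS + 'loc'):
            yield loc.text.strip()

    # 'http://example.com/sitemap.xml' is a placeholder.
    for url in urls_from_sitemap('http://example.com/sitemap.xml'):
        scheme, authority = URI_RE.match(url).group(2, 4)
        print(scheme, authority, url)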


Related questions
Scrapy SgmlLinkExtractor question

I am trying to make the SgmlLinkExtractor work. This is the signature: SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href')...

Scrapy BaseSpider: How does it work?

This is the BaseSpider example from the Scrapy tutorial: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from dmoz.items import DmozItem class DmozSpider(...

Designing a multi-process spider in Python

I'm working on a multi-process spider in Python. It should start scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages ...

What is the best way to crawl login-based sites?

I have to automate a file download activity from a website (similar to, let's say, yahoomail.com). To reach the page which has this file download link, I have to log in, jump from page to page to provide ...

Twisted errors in Scrapy spider

When I run the spider from the Scrapy tutorial I get these error messages: File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults)....

Crawling not working on Windows 2008

We installed a new MOSS 2007 farm in a Windows 2008 SP2 environment. We used SQL 2008 too. The configuration is 1 index, 1 FE, and 1 server with 2008, all on ESX 4.0. All the services that need it use a ...

Is there a list of known web crawlers? [closed]

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but for many I'm not sure; they may or may not be ...

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the Python framework Scrapy). Recently I had to implement a pause/resume system. The solution I implemented is of the simplest kind and, basically, stores ...
