What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.

Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know whether this is the best method or not.

I just want a massive list of URLs, but I want to build that list without running into brick walls in the future. Any thoughts?

I'm starting this project to learn Python, but that really has nothing to do with the question.

Best answer

You can register to get access to the entire .com and .net zone files at Verisign.

I haven't read the fine print of the terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
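If you go that route, turning the domain list into URLs is straightforward. Here is a minimal sketch in Python, assuming you have already extracted the domain names into a plain text file, one per line; the filename and the http:// scheme are assumptions for illustration, not anything Verisign specifies:

    # Minimal sketch: turn a plain list of domains (one per line) into URLs.
    # 'domains.txt' is a hypothetical file extracted from the zone data.
    def domains_to_urls(path):
        with open(path) as f:
            for line in f:
                domain = line.strip().lower()
                if domain:
                    # Assuming plain HTTP; many hosts redirect to www/HTTPS anyway.
                    yield 'http://%s/' % domain

    for url in domains_to_urls('domains.txt'):
        print(url)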

Other answers
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
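For example, here is a minimal sketch of fetching and reading that list in Python; it assumes the archive is still hosted at that S3 URL and contains a single CSV of rank,domain rows:

    import csv
    import io
    import urllib.request
    import zipfile

    # Download the Alexa top-1m archive into memory and walk the CSV inside it.
    URL = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
    data = urllib.request.urlopen(URL).read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(zf.namelist()[0]) as f:
            for rank, domain in csv.reader(io.TextIOWrapper(f, 'utf-8')):
                print(rank, domain)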

The modern terms are URI and URN; URL is the older, narrower term. I'd scan for sitemap files, which pack many addresses into a single file, and study the classic texts on spiders, wanderers, brokers, and bots, as well as RFC 3986 (Appendix B, p. 50), which gives a regular expression for parsing URIs.
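To illustrate both suggestions, here is a small sketch that pulls the <loc> URLs out of a sitemap file and splits each one with the regular expression from RFC 3986, Appendix B; the sitemap URL is a placeholder, not a real endpoint:

    import re
    import urllib.request
    import xml.etree.ElementTree as ET

    # The URI-splitting regular expression from RFC 3986, Appendix B.
    URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

    # Sitemap files put their <loc> elements in this XML namespace.
    NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

    def urls_from_sitemap(sitemap_url):
        # Yield every URL listed in the sitemap.
        tree = ET.parse(urllib.request.urlopen(sitemap_url))
        for loc in tree.iter(NS + 'loc'):
            yield loc.text.strip()

    # 'http://example.com/sitemap.xml' is a placeholder.
    for url in urls_from_sitemap('http://example.com/sitemap.xml'):
        scheme, authority = URI_RE.match(url).group(2, 4)
        print(scheme, authority, url)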


Related questions
Scrapy SgmlLinkExtractor question

I am trying to make the SgmlLinkExtractor work. This is the signature: SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href')...

Scrapy BaseSpider: How does it work?

This is the BaseSpider example from the Scrapy tutorial: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from dmoz.items import DmozItem class DmozSpider(...

Designing a multi-process spider in Python

I'm working on a multi-process spider in Python. It should start scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages ...

What is the best way to crawl login-based sites?

I have to automate a file download activity from a website (similar to, let's say, yahoomail.com). To reach the page which has this file download link, I have to log in, jump from page to page to provide ...

Twisted errors in Scrapy spider

When I run the spider from the Scrapy tutorial I get these error messages: File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults)....

Crawling not working on Windows 2008

We installed a new MOSS 2007 farm in a Windows 2008 SP2 environment. We used SQL 2008 too. The configuration is 1 index, 1 FE, and 1 server with 2008, all on ESX 4.0. All the services that need it use a ...

Is there a list of known web crawlers? [closed]

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but for many I'm not sure; they may or may not be ...

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the Python framework Scrapy). Recently I had to implement a pause/resume system. The solution I implemented is of the simplest kind and, basically, stores ...
