I'm working on a crawler project and I'm stuck in a situation where the href values on a page keep repeating on other pages under that domain. For example, if the url is example.com, then the href values on these pages are hrefList = ['/hello/world', '/aboutus', '/blog', '/contact'].
So the urls for these pages would be example.com/hello/world, example.com/aboutus, etc.
Now on the page example.com/hello/world, the same hrefList is present again, so I get urls like example.com/hello/world/hello/world, example.com/hello/world/aboutus, etc.
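To make the failure concrete, here is a minimal repro of the joining step (base and href are illustrative names, not part of my crawler):

base = 'http://example.com/hello/world'  # page currently being crawled
href = '/hello/world'                    # root-relative href found on that page

# Gluing the href onto the current page url nests the path...
print(base + href)  # http://example.com/hello/world/hello/world

# ...whereas a browser would resolve it against the domain root,
# giving http://example.com/hello/world again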
Now out of these pages, example.com/hello/world/hello/world is a proper page with http status 200, and this keeps happening recursively. The rest of the pages are not found and can therefore be discarded.
So I keep getting a new list of urls which are not correct. Is there a way to overcome this?
Here is my code:
import httplib2
from bs4 import BeautifulSoup

for url in allUrls:
    if url not in visitedUrls:
        visitedUrls.append(url)
        http = httplib2.Http()
        response, content = http.request(url, headers={'User-Agent': 'Crawler-Project'})
        # Only parse pages that did not come back as a 4xx/5xx error
        if response.status / 100 < 4:
            soup = BeautifulSoup(content)
            links = soup.findAll('a', href=True)
            for link in links:
                href = link['href']
                # Skip one-char hrefs, ignored patterns, and fragment links
                if len(href) > 1 and not any(x in href for x in ignoreUrls) and href[0] != '#':
                    if 'http' in href:
                        # Absolute url: queue it as-is
                        allUrls.append(href)
                    elif url[-1] == '/' and href[0] == '/':
                        # Both sides have a slash: drop one before joining
                        allUrls.append(url + href[1:])
                    elif not (url[-1] == '/' or href[0] == '/'):
                        # Neither side has a slash: add one
                        allUrls.append(url + '/' + href)
                    else:
                        # Exactly one slash between them already
                        allUrls.append(url + href)
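What I suspect I need is proper url resolution instead of string concatenation. A minimal sketch with urljoin from the standard library (this is just an assumption about the fix, not something my code does yet):

from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

# Resolve a href against the page it was found on, the way a browser does.
# A root-relative href such as '/aboutus' then maps back to the domain root.
page = 'http://example.com/hello/world'
print(urljoin(page, '/aboutus'))      # http://example.com/aboutus
print(urljoin(page, '/hello/world'))  # http://example.com/hello/world (no nesting)

Combined with the visitedUrls check, this should stop the /hello/world/hello/world chain, since the resolved url would already be in the list.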