I'm working on a crawler project and I'm stuck in a situation where the href values on a page keep repeating on other pages under that domain. For example, if the url is example.com, then the href values on these pages are hrefList = ['/hello/world', '/aboutus', '/blog', '/contact'].
So the urls for these pages would be example.com/hello/world, example.com/aboutus, etc.
Now on the page example.com/hello/world, the same hrefList is present again, so I get urls like example.com/hello/world/hello/world, example.com/hello/world/aboutus, etc.
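To make the failure concrete, here is a minimal repro of the joining step (base and href are illustrative names, not part of my crawler):

base = 'http://example.com/hello/world'  # page currently being crawled
href = '/hello/world'                    # root-relative href found on that page

# Gluing the href onto the current page url nests the path...
print(base + href)  # http://example.com/hello/world/hello/world

# ...whereas a browser would resolve it against the domain root,
# giving http://example.com/hello/world again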
Now out of these pages, example.com/hello/world/hello/world is a proper page with http status 200, and this keeps happening recursively. The rest of the pages are not found and can therefore be discarded.
So I keep getting a new list of urls which are not correct. Is there a way to overcome this?
Here is my code:
import httplib2
from bs4 import BeautifulSoup

for url in allUrls:
    if url not in visitedUrls:
        visitedUrls.append(url)
        http = httplib2.Http()
        response, content = http.request(url, headers={'User-Agent': 'Crawler-Project'})
        # Only parse pages that did not come back as a 4xx/5xx error
        if response.status / 100 < 4:
            soup = BeautifulSoup(content)
            links = soup.findAll('a', href=True)
            for link in links:
                href = link['href']
                # Skip one-char hrefs, ignored patterns, and fragment links
                if len(href) > 1 and not any(x in href for x in ignoreUrls) and href[0] != '#':
                    if 'http' in href:
                        # Absolute url: queue it as-is
                        allUrls.append(href)
                    elif url[-1] == '/' and href[0] == '/':
                        # Both sides have a slash: drop one before joining
                        allUrls.append(url + href[1:])
                    elif not (url[-1] == '/' or href[0] == '/'):
                        # Neither side has a slash: add one
                        allUrls.append(url + '/' + href)
                    else:
                        # Exactly one slash between them already
                        allUrls.append(url + href)
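What I suspect I need is proper url resolution instead of string concatenation. A minimal sketch with urljoin from the standard library (this is just an assumption about the fix, not something my code does yet):

from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

# Resolve a href against the page it was found on, the way a browser does.
# A root-relative href such as '/aboutus' then maps back to the domain root.
page = 'http://example.com/hello/world'
print(urljoin(page, '/aboutus'))      # http://example.com/aboutus
print(urljoin(page, '/hello/world'))  # http://example.com/hello/world (no nesting)

Combined with the visitedUrls check, this should stop the /hello/world/hello/world chain, since the resolved url would already be in the list.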