How to handle repeating href in a web crawler?

I'm working on a crawler project. I'm stuck in a situation where the href values on a page keep repeating on other pages under the same domain. For example, if the URL is example.com, then the href values on these pages are hrefList = [/hello/world, /aboutus, /blog, /contact].

So the URLs for these pages would be example.com/hello/world, example.com/aboutus, etc.

Now on the page example.com/hello/world, the same hrefList is present again, so I'll get URLs such as example.com/hello/world/hello/world, example.com/hello/world/aboutus, etc.

Of these, example.com/hello/world/hello/world is a valid page with an HTTP status of 200, and this keeps happening recursively. The rest of the pages come back as page not found and can therefore be discarded.

So I keep getting a list of new URLs that are not proper. Is there a way to overcome this?

This is my code base:

import httplib2
from BeautifulSoup import BeautifulSoup

for url in allUrls:
    if url not in visitedUrls:
        visitedUrls.append(url)

        http = httplib2.Http()
        response, content = http.request(url, headers={'User-Agent': 'Crawler-Project'})
        if response.status / 100 < 4:
            soup = BeautifulSoup(content)
            links = soup.findAll('a', href=True)
            for link in links:
                if link.has_key('href'):
                    if len(link['href']) > 1:
                        if not any(x in link['href'] for x in ignoreUrls):
                            if link['href'][0] != "#":
                                if "http" in link["href"]:
                                    # Absolute URL: queue it as-is.
                                    allUrls.append(link["href"])
                                else:
                                    # Relative href: join it onto the current URL.
                                    if url[-1] == "/" and link['href'][0] == "/":
                                        allUrls.append(url + link['href'][1:])
                                    else:
                                        if not (url[-1] == "/" or link['href'][0] == "/"):
                                            allUrls.append(url + "/" + link['href'])
                                        else:
                                            allUrls.append(url + link['href'])
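
As an aside on the manual path joining above: the standard library's urlparse.urljoin resolves a root-relative href such as /aboutus against the current page's URL instead of appending it to the path, so it never builds example.com/hello/world/hello/world. A minimal sketch (Python 2, to match the httplib2/BeautifulSoup code; the base URL is just the example from the question):

from urlparse import urljoin

base = "http://example.com/hello/world"
for href in ["/hello/world", "/aboutus", "/blog", "/contact"]:
    # Root-relative hrefs resolve against the scheme and host of the
    # base URL rather than being concatenated onto its path.
    print urljoin(base, href)
# Prints http://example.com/hello/world, http://example.com/aboutus, ...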
Best answer

If we assume that the pages you are getting back are identical, one possible workaround is to compute a hash of each page and make sure you do not crawl two pages with the same hash.

How strict your hash is will determine how robust and how resource-intensive this check is. You could hash the whole page content, or some combination of its content/headers and the links your crawler found in it (or anything else that is unique enough per page other than the URL). Obviously, including the page's URL in the hash is not a good idea, since your problem right now is precisely that these pages have different URLs but the same content (including the same invalid links).

Although you can, you shouldn't have to implement workarounds for badly made web pages. It would be a never-ending story.
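
A minimal sketch of that hashing idea, assuming hashlib and a set of digests already seen; the helper name and the choice of SHA-1 are illustrative, not prescribed by the answer:

import hashlib

seenHashes = set()

def is_new_page(content):
    # Two URLs that return byte-identical bodies hash to the same
    # digest, so each distinct page body is processed only once.
    digest = hashlib.sha1(content).hexdigest()
    if digest in seenHashes:
        return False
    seenHashes.add(digest)
    return True

In the crawl loop above, such a check would sit right after http.request: if is_new_page(content) returns False, skip the link extraction for that URL even though its status is 200.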

Answers

No other answers yet.




