Crawl links which have href within one quote

I am crawling some web sites for images, and I have a problem with the href of the links: some links use single quotes (href='...') rather than double quotes (href="...").

When I allow all links with allow=(), the results contain only the double-quoted links. How can I solve this?
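To make the problem easy to reproduce, here is a minimal self-contained check of which links the extractor actually returns (the sample URLs are invented for illustration):

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# A page mixing both quoting styles; the URLs are made up.
body = """
<a href="http://example.com/double.jpg">double-quoted</a>
<a href='http://example.com/single.jpg'>single-quoted</a>
"""

response = HtmlResponse(url='http://example.com/', body=body)
# With no arguments, SgmlLinkExtractor allows every link it finds.
for link in SgmlLinkExtractor().extract_links(response):
    print(link.url)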

Answers

Are you using scrapy.contrib_exp.crawlspider.rules.Rule with SgmlLinkExtractor? The extractor handles single and double quotes alike. If you want to extract and follow all the links matched by this particular rule, use

Rule(SgmlLinkExtractor(allow=('.*',)), callback='parse_item')

allow=() is an empty tuple and therefore matches nothing.
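For context, here is roughly what that rule looks like inside a complete spider. This is a sketch against the stable scrapy.contrib.spiders API of that era (the answer references the experimental contrib_exp path); the spider name, domain, and start URL are placeholders:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ImageLinkSpider(CrawlSpider):
    # Name, domain, and start URL are hypothetical.
    name = 'imagelinks'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # allow=('.*',) matches every extracted URL; follow=True keeps
    # the crawl going through pages reached via matched links.
    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Inspect each crawled page here.
        self.log('Visited %s' % response.url)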

Related questions
Scrapy SgmlLinkExtractor question

I am trying to make the SgmlLinkExtractor work. This is the signature: SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href')...

Scrapy BaseSpider: How does it work?

This is the BaseSpider example from the Scrapy tutorial: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from dmoz.items import DmozItem class DmozSpider(...

Designing a multi-process spider in Python

I'm working on a multi-process spider in Python. It should start scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages ...

What is the best way to crawl a login based sites?

I have to automate a file download activity from a website (similar to, let's say, yahoomail.com). To reach a page which has this file download link, I have to log in and jump from page to page to provide ...

Twisted errors in Scrapy spider

When I run the spider from the Scrapy tutorial I get these error messages: File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults)....

Crawling not working windows2008

We installed a new MOSS 2007 farm in a Windows 2008 SP2 environment. We used SQL 2008 too. The configuration is 1 index, 1 FE and 1 server with 2008, all on ESX 4.0. All the services that need it use a ...

Is there a list of known web crawlers? [closed]

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but for many I'm not sure; they may or may not be ...

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the Python framework Scrapy). Recently I had to implement a pause/resume system. The solution I implemented is of the simplest kind and, basically, stores ...
