Python Scrapy: Convert relative paths to absolute paths

I have modified the code as per the solutions suggested by the great folks here; now I am getting the errors shown below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem

class DmozSpider(BaseSpider):
    name = "namastecopy2"
    allowed_domains = ["namastefoods.com"]
    start_urls = [
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
        items = []
        for site in sites:
            item = DmozItem()
            item['manufacturer'] = 'Namaste Foods'
            item['productname'] = site.select('td/h1/text()').extract()
            item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
            item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
            item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
            # insert code that will save the above image path for ninfo as an absolute path
            base_url = get_base_url(response)
            relative_url = site.select('//*[@id="showImage"]/@src').extract()
            item['image_urls'] = urljoin_rfc(base_url, relative_url)
            items.append(item)
        return items

My items.py looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    productid = Field()
    manufacturer = Field()
    productname = Field()
    description = Field()
    ingredients = Field()
    ninfo = Field()
    imagename = Field()
    image_paths = Field()
    relative_images = Field()
    image_urls = Field()
    pass

I need the relative path that the spider is extracting to be converted to an absolute path and saved in item['image_urls'], so that I can download the images from the spider. For example, the relative path the spider gets is /.files/images/small/8270-BrowniesHiResClip.jpg; this should get converted to http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg and stored in item['image_urls'].

I also need the ninfo image path saved as an absolute path.

The errors I get with the above code:

2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item[ image_urls ] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item[ image_urls ] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)

Thanks-TM

Answers

From the Scrapy docs (https://doc.scrapy.org/en/latest/intro/tutorial.html?highlight=rel#following-links):

def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

The Response object has a method to do this.
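
For instance, adapted to this question's image extraction (a minimal sketch assuming Scrapy >= 1.0, where Response exposes both xpath() and urljoin(); spider name and yielded dict are illustrative):

import scrapy

class NamasteImageSpider(scrapy.Spider):
    name = "namaste_images"
    start_urls = [
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
    ]

    def parse(self, response):
        # extract() returns a list of src strings; urljoin() resolves each
        # one against response.url (or the page's <base href>, if present)
        for src in response.xpath('//*[@id="showImage"]/@src').extract():
            yield {'image_urls': [response.urljoin(src)]}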

Alternatively:

import urlparse
...

def parse(self, response):
    ...
    urlparse.urljoin(response.url, extractedLink.strip())
    ...

Note the strip() call, because the href can sometimes look like this:

<a href="
              /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
            ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>
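
For example, joining such a padded href with strip() applied (a small standalone illustration; the page URL is hypothetical):

import urlparse

# href exactly as scraped, padded with the whitespace shown above
href = '\n  /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html\n'
print urlparse.urljoin('http://www.example.com/catalog/index.html', href.strip())
# -> http://www.example.com/MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
# without strip(), the newline and spaces end up inside the joined URL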
You can join each extracted URL against the page's base URL:

from scrapy.utils.response import get_base_url

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = [urljoin_rfc(base_url, ru) for ru in relative_url]

Or you could extract just one item:

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)

The error was because you were passing a list instead of a str to the urljoin_rfc function.
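
You can see the mismatch directly in a scrapy shell session (illustrative transcript, continuing from the parse() context above):

>>> relative_url = site.select('//*[@id="showImage"]/@src').extract()
>>> type(relative_url)
<type 'list'>
>>> urljoin_rfc(base_url, relative_url)      # list -> TypeError, as in the log
>>> urljoin_rfc(base_url, relative_url[0])   # single str -> joins fine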

Several notes:

items = []
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    items.append(item)
return items

I do it differently:

for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    yield item

Then:

relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)

extract() always returns a list, because an XPath query always selects a list of nodes.

Do this:

relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)

A more general approach to obtaining an absolute url would be

import urlparse

def abs_url(url, response):
    """Return absolute link"""
    base = response.xpath('//head/base/@href').extract()
    if base:
        base = base[0]
    else:
        base = response.url
    return urlparse.urljoin(base, url)

This also works when a base element is present.
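
For instance, if the page declared <base href="http://cdn.example.com/assets/">, abs_url() would join against that instead of response.url (hypothetical URLs):

import urlparse

print urlparse.urljoin('http://cdn.example.com/assets/', 'images/small/pic.jpg')
# -> http://cdn.example.com/assets/images/small/pic.jpg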

In your case, you'd use it like this:

def parse(self, response):
    # ...
    for site in sites:
        # ...
        image_urls = site.select('//*[@id="showImage"]/@src').extract()
        if image_urls:
            item['image_urls'] = abs_url(image_urls[0], response)
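
One caveat, assuming these URLs feed Scrapy's ImagesPipeline (the MyImagesPipeline entry in the log suggests they do): the pipeline iterates over item['image_urls'], so it expects a list of URL strings rather than a single string:

image_urls = site.select('//*[@id="showImage"]/@src').extract()
if image_urls:
    # wrap the single joined URL in a list for the images pipeline
    item['image_urls'] = [abs_url(image_urls[0], response)]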



