I am trying to get SgmlLinkExtractor to work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
I am just using allow=()
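As far as I can tell, allow is supposed to take a tuple of regular expressions that are matched against the URLs of the links the extractor finds, so my mental model is roughly this (just an illustration of what I think the parameter means, not code from my spider):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# keep only links whose URL matches the /aadler/ pattern
lx = SgmlLinkExtractor(allow=('/aadler/',))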
So, I enter
rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback= parse ),)
So, the initial url is http://www.whitecase.com/jacevedo/
and I am entering allow=('/aadler',)
and expect that
/aadler/
will get scanned as well. But instead, the spider scans the initial url and then closes:
[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)
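In case it helps, this is roughly how I assume the extractor could be exercised on its own, outside the Rule (I have not actually run this; the response variable here is hypothetical, just whatever HtmlResponse comes back for the /jacevedo/ page, and I may be misusing extract_links):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

lx = SgmlLinkExtractor(allow=('/aadler/',))
# extract_links should return Link objects for the <a>/<area> tags whose
# URLs match the allow pattern, so printing them would show whether the
# /aadler/ link is being picked up at all
for link in lx.extract_links(response):
    print link.url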
What am I doing wrong here?
Is there anyone here who has used Scrapy successfully and can help me finish this spider?
Thank you for the help.
I include the code for the spider below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u
class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/",)), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
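For what it is worth, the .re() call in parse is meant to grab the school text that comes after "JD, " together with the graduation year; my understanding of how the pattern behaves on a plain string is roughly this (the sample string is invented, not taken from the real page):

import re

sample = u'JD, Columbia Law School, 2005'
# the lookbehind anchors just after "JD, ", then a lazy group runs up to the year digits
print re.search('(?<=(JD,\s))(.*?)(\d+)', sample).groups()
# I expect something like (u'JD, ', u'Columbia Law School, ', u'2005')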
Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.