Question

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.

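I'm half-tempted to write my own crawler, but I don't have enough time right now. I've seen the Wikipedia list of open-source crawlers, but I'd prefer something written in Python. I realize I could just use one of the tools on the Wikipedia page and wrap it in Python. If anyone has advice about any of those tools, I'm open to hearing it. I've used Heritrix via its web interface and found it quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first Stack Overflow question!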

Answer 1

Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
Twill is a simple scripting language built on top of Mechanize.
BeautifulSoup + urllib2 also works quite nicely.
Scrapy looks like an extremely promising project; it's new.

Answer 2

Use Scrapy.

It is a Twisted-based web crawler framework. It is still under active development, but it already works. It has many goodies:

Built-in support for parsing HTML, XML, CSV, and JavaScript
A media pipeline for scraping items with images (or any other media) and downloading the image files as well
Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
Wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt handling, statistics, crawl depth restriction, etc.
Interactive scraping shell console, very useful for developing and debugging
Web management console for monitoring and controlling your bot
Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # Follow only links to torrent detail pages and pass them to parse_torrent
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
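
The spider above only follows links accepted by its rule. As a rough illustration of that filtering step on its own, here is a minimal standard-library sketch. It assumes the rule's pattern is the regular expression `/tor/\d+` (that is, `/tor/` followed by digits) and that the pattern may match anywhere in the URL. The name `select_torrent_links` and the sample hrefs are hypothetical, and this is not Scrapy's own implementation.

```python
import re

# Assumed pattern: "/tor/" followed by one or more digits.
TORRENT_LINK = re.compile(r"/tor/\d+")


def select_torrent_links(hrefs):
    """Return the hrefs that contain /tor/<digits>, preserving their order."""
    return [href for href in hrefs if TORRENT_LINK.search(href)]


if __name__ == "__main__":
    # Hypothetical hrefs, as they might appear on a listing page.
    sample = ["/tor/123", "/today", "/cat/5", "http://www.mininova.org/tor/45"]
    print(select_torrent_links(sample))
```

Only the two hrefs containing `/tor/` plus digits survive the filter; a crawler would then request each of those pages and hand the response to a callback such as `parse_torrent`.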

Answer 3

Check out HarvestMan, a multi-threaded web crawler written in Python, and also take a look at the spider.py module.

And here you can find code samples for building a simple web crawler.

Answer 4

I've used Ruya and found it pretty good.

Answer 5

I hacked the above script to include a login page, as I needed it to access a Drupal site. It's not pretty, but it may help someone out there.

#!/usr/bin/python

import httplib2
import urllib
import urllib2
from cookielib import CookieJar
import sys
import re
from HTMLParser import HTMLParser

class miniHTMLParser( HTMLParser ):

  viewedQueue = []
  instQueue = []
  headers = {}
  opener = ""

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)


  def gethtmlfile( self, site, page ):
    # Fetch a page through the logged-in opener; return '' on any error
    try:
      url = 'http://'+site+''+page
      response = self.opener.open(url)
      return response.read()
    except Exception, err:
      print " Error retrieving: "+page
      sys.stderr.write('ERROR: %s\n' % str(err))
    return ""

  def loginSite( self, site_url ):
    # Post the Drupal login form and keep the session cookies in the opener
    try:
      cj = CookieJar()
      self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

      url = 'http://'+site_url
      params = {'name': 'customer_admin', 'pass': 'customer_admin123', 'opt': 'Log in', 'form_build_id': 'form-3560fb42948a06b01d063de48aa216ab', 'form_id': 'user_login_block'}
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      self.headers = { 'User-Agent' : user_agent }

      data = urllib.urlencode(params)
      response = self.opener.open(url, data)
      print "Logged in"
      return response.read()

    except Exception, err:
      print " Error logging in"
      sys.stderr.write('ERROR: %s\n' % str(err))

    return 1

  def handle_starttag( self, tag, attrs ):
    if tag == 'a':
      # Assumes href is the first attribute of the <a> tag
      newstr = str(attrs[0][1])
      print newstr
      if re.search('http', newstr) == None:
        if re.search('mailto', newstr) == None:
          if re.search('#', newstr) == None:
            if (newstr in self.viewedQueue) == False:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr


def main():

  if len(sys.argv)!=3:
    print "usage is ./minispider.py site link"
    sys.exit(2)

  mySpider = miniHTMLParser()

  site = sys.argv[1]
  link = sys.argv[2]

  url_login_link = site+"/node?destination=node"
  print "\nLogging in", url_login_link
  x = mySpider.loginSite( url_login_link )

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( site, link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print "\ndone\n"

if __name__ == "__main__":
  main()
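
The script above targets Python 2 (`urllib2`, `cookielib`, `print` statements). For anyone on Python 3, the login step might look roughly like the sketch below. It only builds the cookie-aware opener and the POST request, and sends nothing until `opener.open(request)` is called. The function name `build_login_request` and its parameters are hypothetical; the form field names are taken from the script above and will differ from site to site.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_login_request(site_url, form_fields):
    """Return (opener, jar, request) for posting a login form.

    Nothing is sent here; call opener.open(request) to log in. The same
    opener then carries the session cookies on later requests.
    """
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # In Python 3 the POST body must be bytes, so encode the form data.
    data = urllib.parse.urlencode(form_fields).encode("ascii")
    request = urllib.request.Request(
        "http://" + site_url,
        data=data,
        headers={"User-Agent": USER_AGENT},
    )
    return opener, jar, request


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    fields = {"name": "alice", "pass": "secret", "form_id": "user_login_block"}
    opener, jar, request = build_login_request("example.com/node?destination=node", fields)
    print(request.get_method(), request.full_url)
```

Unlike the script above, which stores the User-Agent in `self.headers` but never passes it to the opener, this sketch attaches the header to the request itself.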

Answer 6

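Trust me, nothing is better than curl. The following code can crawl 10,000 URLs in parallel in less than 300 seconds on Amazon EC2.

CAUTION: Don't hit the same domain at such a high speed.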

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $

#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#          concurrent connections>]
#

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass


# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit


# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))


# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"


# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)


# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)


# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
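
If PycURL is not available, the same overall shape (build a queue of URLs, then fetch them with a bounded number of concurrent connections) can be approximated with the standard library. The sketch below is not the libcurl multi interface used above: it swaps in a thread pool from `concurrent.futures`, and it returns the bodies in memory instead of writing `doc_NNN.dat` files. The names `build_queue`, `fetch_url` and `fetch_all` are hypothetical, and no claim is made about matching the throughput quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def build_queue(lines):
    """Turn raw lines into (url, filename) pairs, skipping blanks and # comments."""
    queue = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        queue.append((url, "doc_%03d.dat" % (len(queue) + 1)))
    return queue


def fetch_url(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(queue, fetch=fetch_url, num_conn=10):
    """Fetch every queued URL using at most num_conn worker threads.

    Returns (ok, failed): ok maps filename to body, failed maps filename
    to the error message.
    """
    ok, failed = {}, {}
    workers = max(1, min(num_conn, len(queue)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): filename for url, filename in queue}
        for future, filename in futures.items():
            try:
                ok[filename] = future.result()
            except Exception as err:  # record the failure and keep going
                failed[filename] = str(err)
    return ok, failed
```

Passing a different `fetch` callable makes it easy to try the queue handling without touching the network, and it is also a natural place to add per-domain throttling.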

Answer 7

Another simple spider. It uses BeautifulSoup and urllib2. Nothing too sophisticated: it just reads all the a hrefs, builds a list, and goes through it.
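
The spider itself is not reproduced here, so what follows is only a guess at the general approach (read every a href, build a list, walk through it), written for Python 3. To avoid third-party dependencies it swaps in the standard library's `html.parser` and `urllib.request` for BeautifulSoup and urllib2. The names `LinkParser`, `fetch_html` and `crawl`, and the `max_pages` limit, are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_html(url):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    seen = {start_url}
    pending = deque([start_url])
    visited = []
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                pending.append(absolute)
    return visited
```

Calling `crawl("http://example.com/")` would fetch up to `max_pages` pages. The sketch does not restrict itself to the starting domain, so `max_pages` is the only thing that bounds the crawl.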

Answer 8

pyspider.py
