What is the best way to crawl login-based sites?

I have to automate a file download from a website (similar to, let's say, yahoomail.com). To reach the page that has the download link, I have to log in, move from page to page to provide some parameters such as dates, and finally click the download link.

I am thinking of three approaches:

  1. Using WatiN: develop a Windows service that periodically executes some WatiN code to traverse the pages and download the file.

  2. Using AutoIt (I don't know much about it).

  3. Using a simple HTML parsing technique (there are several open questions here, e.g., how do I maintain a session after logging in, and how do I log out afterwards?).

Best answer

Try a Selenium script, automated with Selenium Remote Control.

Other answers

I use Scrapy (scrapy.org), a Python library. It's quite good, actually. Spiders are easy to write, and it's very extensive in its functionality. Scraping sites after login is supported by the package.

Here is an example of a spider that would crawl a site after authentication.

import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "example.com"
    start_urls = ["http://www.example.com/users/login.php"]

    def parse(self, response):
        # fill in and submit the login form found on the start page
        return [FormRequest.from_response(response,
                formdata={"username": "john", "password": "secret"},
                callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

        # continue scraping with the authenticated session...

I have used mechanize for Python with success for a few things. It's easy to use and supports HTTP authentication, form handling, cookies, automatic HTTP redirection (30x), ... Basically the only thing missing is JavaScript, but if you need to rely on JS you're pretty much screwed anyway.
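To give a rough idea of what the cookie handling amounts to, here is a minimal sketch that uses only Python's standard library (urllib and http.cookiejar) rather than mechanize itself. The form field names and any URLs you pass in are assumptions; take the real ones from the site's login form. The point is that "maintaining a session" usually just means sending back the cookie the server set at login, and logging out is one more request made with that same cookie.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def new_session():
    """Build an opener whose cookie jar persists across requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login(session, login_url, username, password):
    """POST the login form; any session cookie the server sets lands in the jar."""
    # Field names are assumptions; copy the real ones from the site's login form.
    form = urllib.parse.urlencode({"username": username, "password": password})
    with session.open(login_url, data=form.encode("utf-8")) as response:
        return response.status


def download(session, url, destination):
    """Fetch a URL with the stored cookies and write the body to disk."""
    with session.open(url) as response, open(destination, "wb") as out:
        out.write(response.read())


def logout(session, logout_url):
    """Logging out is one more request sent with the same cookies."""
    with session.open(logout_url) as response:
        return response.status
```

Call new_session() once, then pass the same session object to login, download and logout in that order. A site that hides a token in the login form or builds the download link with JavaScript will need more than this.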

Free Download Manager is great for crawling, and you could use wget.
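For the wget route, a login can be handled in two calls: one that posts the form and saves the cookies, and one that loads them to fetch the file. Here is a small sketch that builds and runs those two commands from Python. The function names, form field names, and default file names are all hypothetical, and it assumes wget is installed and that the site uses a plain form POST for login.

```python
import subprocess
import urllib.parse


def wget_commands(login_url, file_url, username, password,
                  cookie_file="cookies.txt", output="report.bin"):
    """Return the two wget invocations: log in, then download with the saved cookies."""
    # Form field names are assumptions; use the ones from the real login form.
    post_data = urllib.parse.urlencode({"username": username, "password": password})
    login = ["wget", "--save-cookies", cookie_file, "--keep-session-cookies",
             "--post-data", post_data, "--delete-after", login_url]
    fetch = ["wget", "--load-cookies", cookie_file, "-O", output, file_url]
    return login, fetch


def run_download(login_url, file_url, username, password):
    """Run both steps in order, stopping if either wget call fails."""
    for command in wget_commands(login_url, file_url, username, password):
        subprocess.run(command, check=True)
```

The --keep-session-cookies flag matters here because many sites issue login cookies without an expiry date, and wget does not save those by default.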




