How can I get started with web page scraping using Perl?

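I am interested in learning Perl. I am using the Learning Perl book and the CPAN website for reference.

I would like to build a web/text scraping application in Perl to apply what I have learned.

Please suggest some good options to begin with.

(This is not homework. I want to do something in Perl that will help me exercise basic Perl features.)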

Best answer

If the web pages you want to scrape require JavaScript to function properly, you are going to need more than what WWW::Mechanize can provide you. You might even have to resort to controlling a specific browser via Perl (e.g. using Win32::IE::Mechanize or WWW::Mechanize::Firefox).

I have not tried it, but there is also the WWW::Scripter plugin.

Answers

As others have said, ...

Scrappy (http://search.cpan.org/dist/Scrappy) is also well worth a look: it lets you do a lot with very little code. An example from its documentation:


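    use Scrappy;

    my $spidy = Scrappy->new;

    # Crawl the page and run the callback on every element
    # matching the CSS selector.
    $spidy->crawl('http://search.cpan.org/recent', {
        '#cpansearch li a' => sub {
            print shift->text, "\n";
        }
    });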

Scrappy is built on another scraping module, which you might want to look at in its own right as another option.

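Also, if you need to extract data from HTML tables, HTML::TableExtract makes this dead easy: you can locate the table you are interested in by naming its headings, and extract the data very easily, for example: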


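    use HTML::TableExtract;

    # $html_string is assumed to already hold the page's HTML.
    # Select the table whose header row contains these column names.
    my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
    $te->parse($html_string) or die "Didn't find table";
    foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
    }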

The most popular web scraping module for Perl is WWW::Mechanize, which is excellent if you can't just retrieve your destination page but need to navigate to it using links or forms, for instance, to log in. Have a look at its documentation for inspiration. If your needs are simple, you can extract the information you need from the HTML using regular expressions (but beware your sanity); otherwise it might be better to use a module such as HTML::TreeBuilder to do the job.
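As a rough illustration of the HTML::TreeBuilder route, here is a minimal sketch. The sample page, the id "recent", and the helper names fetch_html and links_in are all made up for this example and are not from any module's documentation. It parses an inline HTML string by default, and only loads WWW::Mechanize if you pass a URL on the command line.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page, used when no URL is given on the command line.
my $sample = <<'HTML';
<html><body>
  <ul id="recent">
    <li><a href="/dist/Foo">Foo</a></li>
    <li><a href="/dist/Bar">Bar</a></li>
  </ul>
  <p><a href="/about">About</a></p>
</body></html>
HTML

# Hypothetical helper: fetch a page with WWW::Mechanize.
# The module is loaded lazily, so it is only needed when a URL is given.
sub fetch_html {
    my ($url) = @_;
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($url);
    return $mech->content;
}

# Hypothetical helper: return [text, href] pairs for every link inside
# the element whose id attribute equals $id.
sub links_in {
    my ( $html, $id ) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    if ( my $container = $tree->look_down( id => $id ) ) {
        for my $link ( $container->look_down( _tag => 'a' ) ) {
            push @links, [ $link->as_text, $link->attr('href') ];
        }
    }
    $tree->delete;    # free the parse tree explicitly
    return @links;
}

my $html = @ARGV ? fetch_html( $ARGV[0] ) : $sample;
print "$_->[0] => $_->[1]\n" for links_in( $html, 'recent' );
```

For a real page you would change the id passed to links_in to match that page's markup. Walking the parsed tree like this tends to survive small changes in the HTML better than a regular expression would.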

Another module looks interesting, though I have not really tried it: it is a subclass of WWW::Mechanize, but it supports JavaScript and AJAX, and integrates ...




