How can I extract or change links in HTML with Perl?
  • Time: 2008-12-12 05:53:15

I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">   <tbody><tr>     <td><table cellspacing="0" cellpadding="0" border="0" width="603">       <tbody><tr>         <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>         <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>       </tr>     </tbody></table></td>   </tr>   <tr>     <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">       <tbody><tr>         <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>       </tr>       <tr>         <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>         <td width="580"><p>&nbsp;what y all heard?</p><p>i m shark oysters.</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p></td>         <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>       </tr>       <tr>         <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>       </tr>     </tbody></table></td>   </tr> </tbody></table> <p>&nbsp;</p></body></html>

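As you can see, there are no newlines in this block of HTML text. I need to find all the image links in it, copy the images to a directory, and change each link inside the text to something like ./images/file_name.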

Currently, the Perl code I am using looks like this:

my ($old_src,$new_src,$folder_name);
    foreach my $record (@readfile) {
        ## so the if else case for the url replacement block below will be correct
        $old_src = "";
        $new_src = "";
        if ($record =~ /<img(.+)/){
            if($1=~/src="((\w|_|\\|-|\/|\.|:)+)"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                push (@images, $new_src);
                $folder_name = "images";
            }## end if
        }
        elsif($record =~ /background="(.+\.jpg)/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            push (@images, $new_src);
            $folder_name = "images";
        }
        elsif($record=~/<iframe(.+)/){
            if($1=~/src="((\w|_|\\|\?|=|-|\/|\.|:)+)"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                ## remove the ?rand behind the html file name
                if($new_src=~/\?rand/){
                    my ($fname,$rand) = split(/\?/,$new_src);
                    $new_src = $fname;
                    my ($fname,$rand) = split(/\?/,$old_src);
                    $old_src = $fname."\\?".$rand;
                }
        print "old_src::$old_src\n"; ##s7test
        print "new_src::$new_src\n\n"; ##s7test
                push (@iframes, $new_src);
                $folder_name = "iframes";
            }## end if
        }## end if

        my $new_record = $record;
        if($old_src && $new_src){
            $new_record =~ s/$old_src/$new_src/ ;
    print "new_record:$new_record\n"; ##s7test
            my @tmp = split(/\//,$new_src);
            $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
##  print "new_record2:$new_record\n\n"; ##s7test
        }## end if
        print WRITEFILE $new_record;
    } # foreach

This is only sufficient for HTML text that contains newlines, because it handles at most one link per line. I thought about simply looping the regex statement, but then I would have to change the matched line to some other text.

Do you have any idea whether there is an elegant Perl way to do this? Or maybe I'm just too dumb to see the obvious way of doing it. I also know that adding the global option doesn't work.

Thanks. ~steve

Best answer

If you must avoid any additional module, such as an HTML parser, you could try:

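# \? and = are added to the src character set so that iframe URLs such as ...?rand are captured
while ($string =~ m/(?:<\s*(?:img|iframe)[^>]+src\s*=\s*"((?:\w|_|\\|\?|=|-|\/|\.|:)+)"|background\s*=\s*"([^>]+\.jpg)|<\s*iframe)/g) {
  # $1 holds a src value, $2 a background value; the bare <iframe branch captures nothing
  my $old_src = defined $1 ? $1 : $2;
  next unless defined $old_src;
  my @tmp = split(/\/Elearning/,$old_src);
  my $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
  if($new_src=~/\?rand/){
    # remove rand and push in @iframes
    $new_src =~ s/\?.*//;
    push (@iframes, $new_src);
  }
  else
  {
    # push into @images
    push (@images, $new_src);
  }
}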

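That way, you apply this regex to the whole source (newlines included) and end up with more compact code. It also allows for any extra whitespace between an attribute and its value.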
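The loop above only collects the paths; the question also asks for the links in the text to be changed. A minimal sketch of doing both in one pass with a global substitution follows. The helper name rewrite_links, the sample markup, and the ?rand=42 query string are made up for illustration, and copying the files themselves is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rewrite every src="..." of an
# <img> or <iframe> tag and every background="..." attribute, and return
# the new HTML plus the original paths so the files can be copied later.
sub rewrite_links {
    my ($html) = @_;
    my (@images, @iframes);

    # Map one URL to ./images/<file> or ./iframes/<file>, remembering the original path.
    my $relocate = sub {
        my ($tag, $url) = @_;
        (my $path = $url) =~ s/\?.*//s;        # drop a query string such as ?rand=42
        my ($file) = $path =~ m{([^/]+)\z};    # last path component
        return $url unless defined $file;      # nothing usable, leave the URL alone
        if ($tag eq 'iframe') {
            push @iframes, $path;
            return "./iframes/$file";
        }
        push @images, $path;
        return "./images/$file";
    };

    # /g walks the whole string, so missing newlines do not matter;
    # /e evaluates the replacement as code for each match.
    $html =~ s{(<\s*(img|iframe)\b[^>]*?\bsrc\s*=\s*"|\bbackground\s*=\s*")([^"]+)}
              {$1 . $relocate->(lc($2 || 'img'), $3)}gie;

    return ($html, \@images, \@iframes);
}

# Made-up sample in the style of the question's input.
my $sample = '<td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif">'
           . '<img height="1" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>'
           . '<iframe src="/Elearning_Platform/pages/intro.html?rand=42"></iframe>';

my ($out, $images, $iframes) = rewrite_links($sample);
print "$out\n";
print "image:  $_\n" for @$images;
print "iframe: $_\n" for @$iframes;
```

Because the replacement runs once per match, no outer loop is needed and the matched text never has to be swapped for a placeholder. This remains a regex-based approach, so it shares the limits of the pattern above.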

Other answers

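There are some excellent HTML parsers for Perl; learn to use them and stick with them. HTML is complex: it allows > inside attribute values, uses nesting heavily, and so on. Parsing it with regexes is error-prone for anything beyond very simple tasks (or machine-generated code).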
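A small illustration of the > problem, as a sketch only: the tag and file name below are made up, and the first pattern is a simplified form of the regex from the answer above.

```perl
use strict;
use warnings;

# A valid tag whose alt attribute contains '>' (made-up example).
my $html = '<img alt="a > b" src="/Elearning_Platform/pic.jpg">';

# [^>]+ stops at the first '>', which here sits inside the alt value,
# so the pattern never reaches src.
my ($naive) = $html =~ /<\s*img[^>]+src\s*=\s*"([^"]+)"/;

# Skipping whole quoted values copes with this particular case,
# but it is still a pattern, not a parser.
my ($quote_aware) = $html =~ /<\s*img(?:[^>"']|"[^"]*"|'[^']*')*?\bsrc\s*=\s*"([^"]+)"/;

print defined $naive ? "naive: $naive\n" : "naive: no match\n";
print "quote-aware: $quote_aware\n";
```

The second pattern is already harder to read and still ignores comments, unquoted attribute values, and nested markup, which is the argument for using a parser module.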

I think you want my HTML::SimpleLinkExtor module:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );

my @imgs = $extor->img;

I'm not sure exactly what you are trying to do, but if my module can't do it, it certainly sounds like a job for one of the HTML parsing modules.




