Question

I need to count the number of images (in this case, 1 image). Presumably using len()?

This is the HTML:

<div class="detail-headline">
    Fotogal&#233;ria
        </div>
<div class="detail-indent">
    <table id="ctl00_ctl00_ctl00_containerHolder_mainContentHolder_innnerContentHolder_ZakazkaControl_ZakazkaObrazky1_ObrazkyDataList" cellspacing="0" border="0" style="width:100%;border-collapse:collapse;">
    <tr>
        <td align="center" style="width:25%;">
            <div id="ctl00_ctl00_ctl00_containerHolder_mainContentHolder_innnerContentHolder_ZakazkaControl_ZakazkaObrazky1_ObrazkyDataList_ctl02_PictureContainer">
                <a title="1-izb. Kaspická" class="highslide detail-img-link" onclick="return hs.expand(this);" href="/imgcache/cache231/3186-000393~8621457~640x480.jpg"><img src="/imgcache/cache231/3186-000393~8621457~120x120.jpg" class="detail-img" width="89" height="120" alt="1-izb. Kaspická" /></a>
            </div>
        </td><td></td>
    </tr>
</table>
</div>

I previously used HTMLParser, and the number of images has to be added to "self.srcData". My earlier code:

def handle_starttag(self, tag, attrs):
    # The headline text sits on the line after the tag; getpos() is 1-based,
    # so srcData[getpos()[0]] is that following line.
    if (tag == 'div' and len(attrs) > 1 and attrs[1][0] == 'class'
            and attrs[1][1] == 'detail-headline'
            and self.srcData[self.getpos()[0]].strip() == u'Realitn&#225; kancel&#225;ria'):
        self.status = 2

    if (self.status == 2 and tag == 'div' and len(attrs) > 0
            and attrs[0][0] == 'class' and attrs[0][1] == 'name'):
        self.record[-1] = decode(self.srcData[self.getpos()[0]].strip())
        self.status = 0

So (checking the start tag)... something like this?

if (tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class'
        and attrs[0][1] == 'detail-headline'
        and self.srcData[self.getpos()[0]].strip() == 'Fotogal&#233;ria'):
    self.status = 3

Is that OK? What else do I need? Thanks.

import urllib
import urllib2
import HTMLParser
import codecs
import time
from BeautifulSoup import BeautifulSoup

# decode string
def decode(istr):
  ostr = u''
  idx = 0
  while idx < len(istr):
    add = True
    if istr[idx] == '&' and len(istr) > idx + 1 and istr[idx + 1] == '#':
      iend = istr.find(';', idx)
      if iend > idx:
        ostr += unichr(int(istr[idx + 2:iend]))
        idx = iend
        add = False
    if add:
      ostr += istr[idx]
    idx += 1
  return ostr

# parser 1
class FlatDetailParser (HTMLParser.HTMLParser):
  def __init__ (self):
    HTMLParser.HTMLParser.__init__(self)

  def loadDetails(self, link):
    self.record = (len(self.characts) + 1) * ['']
    self.status = 0
    self.index = -1
    self.reset()
    request = urllib2.Request(link)
    data = urllib2.urlopen(request)  # URL obtained from the next class
    self.srcData = []
    for line in data:
      line = line.decode('utf8')
      self.srcData.append(line)
    for line in self.srcData:
      self.feed(line)
    self.close()
    return self.record


  def handle_starttag(self, tag, attrs):
    if (tag == 'div' and len(attrs) > 1 and attrs[1][0] == 'class' and attrs[1][1] == 'detail-headline'
        and self.srcData[self.getpos()[0]].strip() == u'Realitn&#225; kancel&#225;ria'):
      self.status = 2

    if (self.status == 2 and tag == 'div' and len(attrs) > 0 and attrs[0][0] == 'class'
        and attrs[0][1] == 'name'):
      self.record[-1] = decode(self.srcData[self.getpos()[0]].strip())
      self.status = 0

The next parser class follows, and the data is written to a txt file.

If I use BeautifulSoup instead, what do I pass to soup = BeautifulSoup(???)? How can I feed it the data in srcData? Can the two be combined, and how?

Answer 1

If you use BeautifulSoup, your job will be much easier.

Maybe something like this:

from BeautifulSoup import BeautifulSoup

def count_images(htmltext):
    soup = BeautifulSoup(htmltext)
    # counts the <div class="detail-indent"> gallery containers
    return len(soup.findAll('div', {'class': 'detail-indent'}))

Or use lxml:

from lxml.html.soupparser import fromstring

def count_images(htmltext):
    # './/div' searches all descendants; a bare 'div' matches direct children only
    return len([e.attrib for e in fromstring(htmltext).findall('.//div')
                if e.attrib.get('class') == 'detail-indent'])
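
If you would rather stay with the standard library, roughly the same idea can be sketched with an HTMLParser subclass that tracks whether it is inside the detail-indent div. This is only a minimal sketch and assumes Python 3 (html.parser), whereas the question's code uses the Python 2 HTMLParser module. The names ImageCounter and count_images_stdlib are hypothetical. Unlike the two functions above, it counts the img tags inside the container rather than the container divs themselves.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Count <img> tags nested inside <div class="detail-indent">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # div nesting depth inside the gallery; 0 means outside
        self.count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                # a nested div inside the gallery container
                self.depth += 1
            elif "detail-indent" in (attrs.get("class") or "").split():
                # entering the gallery container
                self.depth = 1
        elif tag == "img" and self.depth > 0:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def count_images_stdlib(htmltext):
    parser = ImageCounter()
    parser.feed(htmltext)
    parser.close()
    return parser.count
```

Since the question's parser already keeps the decoded page lines in self.srcData, something like count_images_stdlib(''.join(self.srcData)) should be enough to feed it the whole page; the same joined string could be passed to BeautifulSoup.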

Answer 2

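Just for fun, I tried a pyparsing approach. Pyparsing includes helpers for building HTML tag matching patterns, which handle attributes, unexpected whitespace, single or double quotes, and other hard-to-predict HTML markup quirks. Here is a pyparsing solution (assuming your HTML source has been read into a string variable html):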

from pyparsing import makeHTMLTags

# makeHTMLTags returns patterns for both opening and closing 
# tags, we just want the opening ones
aTag = makeHTMLTags("A")[0]
imgTag = makeHTMLTags("IMG")[0]

# find the matching tags
tagMatches = (aTag|imgTag).searchString(html)

# yes, use len() to see how many there are
print len(tagMatches)

# get the actual image names
for t in tagMatches:
    if t.startA:
        print t.href
    if t.startImg:
        print t.src

Prints:

2
/imgcache/cache231/3186-000393~8621457~640x480.jpg
/imgcache/cache231/3186-000393~8621457~120x120.jpg
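
Note that the count of 2 comes from a single picture matched twice: the A tag points at the full-size file and the IMG tag at its thumbnail. If you want the number of distinct pictures, one option is to group the collected URLs by the part before the size suffix. This is a minimal sketch in Python 3. The helper count_distinct_images is hypothetical, and the assumption that every URL ends in ~WIDTHxHEIGHT.ext is inferred only from the two URLs printed above.

```python
import re

# Assumed naming scheme: "<base>~<width>x<height>.<ext>" (inferred from the sample URLs).
SIZE_SUFFIX = re.compile(r"~\d+x\d+\.\w+$")


def count_distinct_images(urls):
    """Count URLs that differ by more than their size suffix."""
    bases = {SIZE_SUFFIX.sub("", url) for url in urls}
    return len(bases)
```

For the two URLs above this gives 1, which matches the single image in the gallery.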
