Question

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They use a variety of casings for the tag and its attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on fully enforced standards compliance when it rules out so much of the use of their library.

I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.

I'm currently using this as a workaround, but it's horrible!

for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')

    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content

It seems that the tag name meta is treated case-insensitively, but the attribute values are not. It would be even more annoying if meta were case-sensitive too!

Answer 1

Attribute values are case-sensitive. The HTML parser lowercases tag and attribute names, but it leaves attribute values exactly as written.

You can use an arbitrary regular expression (through the EXSLT regular-expressions namespace) to select an element:

#!/usr/bin/env python
from lxml import html

doc = html.fromstring("""
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
""")
# The "i" flag makes re:test match case-insensitively.
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print(html.tostring(meta, pretty_print=True, encoding="unicode"), end="")

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">
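
Building on the regular-expression query, here is a minimal sketch that collects both tags from the question in a single pass. The helper name extract_meta and the sample markup are made up for illustration, and it assumes the document was parsed with lxml.html, which lowercases attribute names such as Content.

```python
from lxml import html

# EXSLT regular-expression namespace, as used in the answer above.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}


def extract_meta(doc, wanted=("keywords", "description")):
    """Hypothetical helper: map lowercased meta names to their content."""
    # Build an alternation such as ^(keywords|description)$; the "i" flag
    # makes re:test ignore case. Names are assumed to be plain words.
    pattern = "^(%s)$" % "|".join(wanted)
    query = '//meta[re:test(@name, $pattern, "i")]'
    found = {}
    for meta in doc.xpath(query, namespaces=EXSLT_RE, pattern=pattern):
        content = meta.get("content")
        if content is not None:
            # Normalize the key so callers never have to care about casing.
            found[meta.get("name").lower()] = content
    return found


# Illustrative markup only; not taken from any real site.
sample = html.fromstring(
    "<html><head>"
    '<meta NAME="Description" Content="A sample page">'
    '<META name="KEYWORDS" content="lxml, xpath">'
    '<meta name="author" content="someone">'
    "</head><body></body></html>"
)
print(extract_meta(sample))
```

The pattern is passed as an XPath variable ($pattern) rather than pasted into the query string, which avoids quoting problems if the list of names changes.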

Answer 2

lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
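
For comparison, a minimal sketch of the same lookup with BeautifulSoup. It assumes the bs4 package is installed and uses Python's built-in html.parser backend; the helper name meta_by_name and the sample markup are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def meta_by_name(markup, name):
    """Hypothetical helper: content of the first <meta> whose name matches,
    ignoring case, or None if there is no such tag."""
    soup = BeautifulSoup(markup, "html.parser")
    # A compiled regex as an attribute filter is matched against the value;
    # re.IGNORECASE takes care of "Description", "DESCRIPTION", and so on.
    pattern = re.compile(r"^%s$" % re.escape(name), re.IGNORECASE)
    tag = soup.find("meta", attrs={"name": pattern})
    return tag.get("content") if tag is not None else None


# Illustrative markup only.
markup = (
    '<META NAME="Description" Content="Hello world">'
    '<meta name="KeyWords" content="a, b">'
)
print(meta_by_name(markup, "description"))
print(meta_by_name(markup, "keywords"))
```

The filter goes through the attrs dictionary because name is also the keyword BeautifulSoup uses for the tag name.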

Answer 3

You can use

doc.xpath("//meta[translate(@name, "
          "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
          "='description']")

It translates the value of "name" to lowercase and then matches.
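
translate() is needed here because lxml implements XPath 1.0, which has no lower-case() function. As a rough sketch, the query can be wrapped in a small helper that passes the alphabets and the wanted name as XPath variables; the name meta_content and the sample markup are assumptions for illustration. Note that this only folds the ASCII letters A to Z.

```python
import string

from lxml import html


def meta_content(doc, name):
    """Hypothetical helper: content of the first <meta> whose name
    attribute equals `name` ignoring ASCII case, or None."""
    # $upper, $lower and $wanted are XPath variables filled in by lxml
    # from the keyword arguments below.
    query = "//meta[translate(@name, $upper, $lower) = $wanted]/@content"
    values = doc.xpath(
        query,
        upper=string.ascii_uppercase,
        lower=string.ascii_lowercase,
        wanted=name.lower(),
    )
    return str(values[0]) if values else None


# Illustrative markup only.
page = html.fromstring(
    "<html><head>"
    '<meta name="Description" Content="Hello world">'
    "</head><body></body></html>"
)
print(meta_content(page, "description"))
```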
