I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They use a variety of casings for the tag and attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on strict standards compliance when it rules out so much of the use of their library.
I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.
I'm currently using this as a workaround, but it's horrible!
    for meta in doc.cssselect('meta'):
        name = meta.get('name')
        content = meta.get('content')
        if name and content:
            if name.lower() == 'keywords':
                keywords = content
            if name.lower() == 'description':
                description = content
It seems that the tag name meta is treated case-insensitively, but the attributes are not. It would be even more annoying if meta were case-sensitive too!
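For reference, one possible XPath equivalent is sketched below. XPath 1.0 has no lower-case function, but translate() can fold ASCII letters before comparing, and lxml lets you pass XPath variables as keyword arguments to xpath(). The helper name meta_content and the sample HTML are made up for illustration, and this assumes the document was parsed with lxml.html, which appears to lower-case attribute names such as Content while leaving attribute values untouched.

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def meta_content(doc, wanted):
    """Return the content of the first <meta> whose name equals `wanted`,
    ignoring ASCII case, or None if there is no such tag.

    Hypothetical helper, not part of lxml.
    """
    hits = doc.xpath(
        # translate() maps each upper-case ASCII letter in @name to lower case
        "//meta[translate(@name, $upper, $lower) = $wanted]/@content",
        upper=UPPER,
        lower=LOWER,
        wanted=wanted.lower(),
    )
    return hits[0] if hits else None


# Made-up sample page with mixed casing in tags, attributes and values
sample = (
    '<html><head>'
    '<META Name="Description" Content="A test page">'
    '<meta name="KEYWORDS" content="a, b">'
    '</head><body></body></html>'
)
doc = html.fromstring(sample)

description = meta_content(doc, "description")
keywords = meta_content(doc, "keywords")
```

Note that translate() only folds the letters listed, so this covers plain ASCII names like description and keywords but not accented characters.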