(lxml) XPath matching against nodes with unprintable characters

Sometimes you want to clean up HTML by removing tags with unprintable characters in them (whitespace, non breaking space, etc). Sometimes encoding this back and forth results in weird characters when the HTML is rendered. Anyways, here is the snippet you might find useful:


def clean_empty_tags(node):
    """
    Finds all tags with a whitespace in it. They come out broke and
    we won't need them anyways.
    """
    for empty in node.xpath("//p[.='\xa0']"):
        empty.getparent().remove(empty)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s