Parsing UTF-8 webpages with lxml

Here is how to use Python and lxml to parse web pages that contain Unicode characters encoded as UTF-8. It would be nice if lxml.html.parse(url) used the charset from the Content-Type HTTP header, but it doesn’t, so you have to tell lxml which encoding to use.
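If you fetch the page yourself, the encoding can usually be read out of the Content-Type header before handing the bytes to lxml. The following is only a minimal sketch for Python 3 using the standard library; the helper name charset_from_content_type and the fallback to 'utf-8' are assumptions made for illustration, not part of lxml.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The returned name can then be passed as the encoding argument when constructing the parser.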

>>> import lxml.etree
>>> url = 'मुखपृष्ठ'  # UTF-8 encoded bytes
>>> url
>>> utf8_html_parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> page = lxml.etree.parse(url, parser=utf8_html_parser)
>>> print(page.find('head/title').text)
>>> page.find('head/title').text
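The same idea can be tried without a network connection. The sketch below assumes Python 3 and parses an in-memory byte string that stands in for a downloaded page; the sample markup and the helper name parse_html are made up for illustration. It passes an explicit encoding to lxml.etree.HTMLParser, as in the session above.

```python
import io

import lxml.etree


def parse_html(data, encoding="utf-8"):
    """Parse raw HTML bytes, telling lxml explicitly how they are encoded."""
    parser = lxml.etree.HTMLParser(encoding=encoding)
    # parse() accepts a file-like object as well as a filename or URL.
    return lxml.etree.parse(io.BytesIO(data), parser=parser)


# Stand-in for the bytes of a downloaded page (hypothetical sample markup).
body = (
    "<html><head><title>मुखपृष्ठ</title></head>"
    "<body><p>नमस्ते</p></body></html>"
).encode("utf-8")

page = parse_html(body)
print(page.findtext("head/title"))
```

Because the encoding is given up front, the parser does not have to guess it from the bytes, so the title comes back as a proper Unicode string rather than as mis-decoded characters.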