Parsing UTF-8 webpages with lxml

Here is how to use Python and lxml to parse web pages that contain Unicode characters encoded as UTF-8. It would be nice if lxml.html.parse(url) picked up the charset from the Content-Type HTTP header, but it doesn't, so you have to tell lxml which encoding to use.

>>> import lxml.etree
>>> url = 'http://hi.wikipedia.org/wiki/मुखपृष्ठ'  # UTF-8 encoded bytes
>>> url
'http://hi.wikipedia.org/wiki/\xe0\xa4\xae\xe0\xa5\x81\xe0\xa4\x96\xe0\xa4\xaa\xe0\xa5\x83\xe0\xa4\xb7\xe0\xa5\x8d\xe0\xa4\xa0'
 
>>> utf8_html_parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> page = lxml.etree.parse(url, parser=utf8_html_parser)
 
>>> print page.find('head/title').text
विकिपीडिया
>>> page.find('head/title').text
u'\u0935\u093f\u0915\u093f\u092a\u0940\u0921\u093f\u092f\u093e'
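
The session above is Python 2 and hard-codes 'utf-8'. If you fetch the page yourself, you can read the charset from the Content-Type header and hand it to the parser. What follows is only a rough sketch for Python 3: the names charset_from_content_type and parse_html are made up for this example, and falling back to UTF-8 when the header names no charset is my assumption, not something lxml does.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


def parse_html(html_bytes, content_type):
    """Parse raw HTML bytes using the encoding from the Content-Type header."""
    # lxml is third-party; importing it here keeps the helper above
    # usable with the standard library alone.
    import lxml.etree

    encoding = charset_from_content_type(content_type)
    parser = lxml.etree.HTMLParser(encoding=encoding)
    return lxml.etree.fromstring(html_bytes, parser=parser)
```

With urllib.request.urlopen, the response's headers object already offers get_content_charset(), so the first helper is only needed when all you have is the raw header string. Pass parse_html the undecoded bytes from response.read(), not a decoded str, so the parser's encoding setting is what decides how they are interpreted.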