Parsing UTF-8 webpages with lxml
Here is how to use Python and lxml to parse web pages that contain Unicode characters encoded as UTF-8. It would be nice if lxml.html.parse(url)
used the charset given in the Content-Type HTTP header, but it doesn't, so you have to tell lxml which encoding to use.
>>> import lxml.etree
>>> url = 'http://hi.wikipedia.org/wiki/मुखपृष्ठ'  # utf-8 encoded bytes
>>> url
'http://hi.wikipedia.org/wiki/\xe0\xa4\xae\xe0\xa5\x81\xe0\xa4\x96\xe0\xa4\xaa\xe0\xa5\x83\xe0\xa4\xb7\xe0\xa5\x8d\xe0\xa4\xa0'
>>> utf8_html_parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> page = lxml.etree.parse(url, parser=utf8_html_parser)
>>> print page.find('head/title').text
विकिपीडिया
>>> page.find('head/title').text
u'\u0935\u093f\u0915\u093f\u092a\u0940\u0921\u093f\u092f\u093e'
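The session above is Python 2 (print statement, byte-string URL). As a rough Python 3 sketch of the same idea, the snippet below passes raw bytes to an HTMLParser created with encoding='utf-8'. The helper names parse_utf8_html and fetch_and_parse and the inline sample document are made up for illustration; fetch_and_parse assumes the URL is already percent-encoded and is not called here.

```python
import lxml.etree
from urllib.request import urlopen


def parse_utf8_html(raw_bytes):
    """Parse HTML bytes, telling lxml explicitly that they are UTF-8."""
    parser = lxml.etree.HTMLParser(encoding="utf-8")
    # With an HTMLParser, fromstring returns the root <html> element.
    return lxml.etree.fromstring(raw_bytes, parser=parser)


def fetch_and_parse(url):
    """Hypothetical helper: download a page and parse it as UTF-8.

    Assumes `url` is already percent-encoded (ASCII only).
    """
    with urlopen(url) as response:
        return parse_utf8_html(response.read())


# Illustrative stand-in for a downloaded page (not real Wikipedia content).
sample = (
    "<html><head><title>विकिपीडिया</title></head>"
    "<body><p>मुखपृष्ठ</p></body></html>"
).encode("utf-8")

root = parse_utf8_html(sample)
title = root.find("head/title").text
print(title)
```

Parsing bytes that you fetched yourself also lets you inspect the response headers and choose a different encoding when a page is not UTF-8.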