Scraping with lxml and XPath and displaying Chinese content correctly

Latest recommended article published 2022-08-18 08:31:54

weixin_30929195

Tags: python, web scraping

Original link: http://www.cnblogs.com/grandyang/p/7990505.html

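When scraping Chinese web pages with Python, the encoding must be stated explicitly when an lxml element is serialized to a string. etree.tostring defaults to ASCII output, so every non-ASCII character is written as a numeric character reference (&#NNNNN;) rather than as the character itself. Passing encoding="utf-8" keeps the Chinese text intact. The following simple example fetches the title of the Baidu home page: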

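from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2
from lxml import etree

# Download the raw page bytes and parse them into an element tree.
hfile = urlopen('http://www.baidu.com').read()
tree = etree.HTML(hfile)
# xpath() returns a list of matching elements; take the first <title>.
strs = tree.xpath("//title")
strs = strs[0]
# strs = etree.tostring(strs)  # default ASCII output: Chinese is not displayed correctly
strs = etree.tostring(strs, encoding="utf-8", pretty_print=True, method="html")  # Chinese is displayed correctly
# With an encoding name, tostring returns bytes; decode them before printing on Python 3.
print(strs.decode("utf-8"))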

If the encoding is not passed to tostring, the script prints:

<title>&#30334;&#24230;&#19968;&#19979;&#65292;&#20320;&#23601;&#30693;&#36947;</title>
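The &#NNNNN; sequences above are numeric character references: each number is the decimal Unicode code point of one character. As a minimal sketch using only the standard library (the variable names are illustrative and not part of the original post), the same escaping can be reproduced and reversed without lxml:

```python
import html

# The title text as it should appear on screen.
title = "百度一下，你就知道"

# Encoding to ASCII with the "xmlcharrefreplace" error handler replaces every
# character that ASCII cannot represent with a &#NNNNN; reference.
escaped = title.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# html.unescape turns the references back into the original characters.
restored = html.unescape(escaped)
print(restored == title)
```

The escaped form carries the same information as the original text, which is why a browser still renders it correctly; it is only unreadable when printed as plain text.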

whereas the correct output is:

<title>百度一下，你就知道</title>
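The difference can also be checked offline. The sketch below assumes a small in-memory HTML string (a hypothetical stand-in for the downloaded page) and compares the default serialization with two alternatives: encoding="utf-8", which returns bytes, and encoding="unicode", which lxml accepts to return a str directly. If only the text is needed rather than the markup, reading the element's .text attribute avoids serialization altogether.

```python
from lxml import etree

# Hypothetical in-memory page standing in for the downloaded HTML.
page = "<html><head><title>百度一下，你就知道</title></head><body></body></html>"
tree = etree.HTML(page)
title = tree.xpath("//title")[0]

# Default: ASCII bytes, non-ASCII characters become &#NNNNN; references.
ascii_bytes = etree.tostring(title)

# Explicit UTF-8: still bytes, but the characters are kept and can be decoded.
utf8_bytes = etree.tostring(title, encoding="utf-8", method="html")

# encoding="unicode": a str is returned, so no decode step is needed.
text = etree.tostring(title, encoding="unicode", method="html")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
print(text)

# The bare text content, without the surrounding <title> tags.
print(title.text)
```

On Python 3, encoding="unicode" is often the more convenient choice when the result is printed or concatenated with other strings, while encoding="utf-8" suits writing the result to a binary file.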

Reposted from: https://www.cnblogs.com/grandyang/p/7990505.html
