Parsing UTF-8/unicode strings with lxml HTML

I have been trying to parse UTF-8 encoded text with etree.HTML(), without success.

Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> import requests
>>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
>>> r = requests.get("http://www.rakuten.co.jp/", headers=headers)
>>> r.status_code
200
>>> r.headers
{'x-cache': 'MISS from www.rakuten.co.jp', 'transfer-encoding': 'chunked', 'set-cookie': 'wPzd=lng%3DNA%3Acnt%3DCA; expires=Tue, 13-Aug-2013 16:51:38 GMT; path=/; domain=www.rakuten.co.jp', 'server': 'Apache', 'pragma': 'no-cache', 'cache-control': 'private', 'date': 'Mon, 13 Aug 2012 16:51:38 GMT', 'content-type': 'text/html; charset=EUC-JP'}
>>> responsetext = r.text

So far so good. The response text is fine and is a unicode string. Getting the list of CSS URIs from it is no problem either.

>>> tree = etree.HTML(responsetext)
>>> csspathlist = tree.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist
['http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/common.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/layout.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/sidecolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/api.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/myrakuten_dpgs.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/leftcolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/header.css?v=1207111500', '/com/inc/home/20080930/opt/css/normal/footer.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/ipad.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/genre.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/supersale.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/rakuten_membership.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/noscript/set.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/suggest-2.0.1.css?v=1204231500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/liquid_banner.css?v=1203011138', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/area_announce.css?v=1203011138']

Now let's convert from unicode to UTF-8 and request the list of CSS URIs again.

>>> htmltext = responsetext.encode('utf-8')
>>> tree2 = etree.HTML(htmltext)
>>> csspathlist2 = tree2.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist2
[]

I get an empty list.

>>> etree.tostring(tree2)
'
'

In fact, the second parse stops immediately after the first Japanese character in the title:

【楽天市場】Shopping is Entertainment! : インターネット最大級の通信販売、通販オンラインショッピングコミュニティ
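One plausible explanation for a parse that stops at the first Japanese character is an encoding mismatch: the response headers above declare charset=EUC-JP, and if the page repeats that declaration in a meta tag (an assumption, since the page source is not shown here), a byte-level parser may try to read the UTF-8 bytes as EUC-JP. The sketch below uses only the standard library to show that the UTF-8 bytes of the title's first characters are not valid EUC-JP. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# The first characters of the page title; everything else is illustrative.
title = u"【楽天市場】"

utf8_bytes = title.encode("utf-8")

# Decoding with the matching codec round-trips cleanly.
roundtrip_ok = utf8_bytes.decode("utf-8") == title

# Decoding the same bytes with the codec named in the response headers
# (EUC-JP) is a mismatch and is rejected by the decoder.
try:
    utf8_bytes.decode("euc_jp")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(roundtrip_ok, mismatch_detected)
```

If the same mismatch happens inside the HTML parser, it would be consistent with the document being cut off at the title.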

I am still trying to understand what I am doing wrong.
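A possible fix, sketched under the assumption that the page carries a meta tag declaring EUC-JP that conflicts with the re-encoded UTF-8 bytes: tell lxml explicitly which encoding the byte string uses by passing an etree.HTMLParser(encoding=...) to etree.HTML(). The small HTML document below is a made-up stand-in for the real page, not its actual markup.

```python
# -*- coding: utf-8 -*-
from lxml import etree

# Hypothetical stand-in for the page: a meta tag declares EUC-JP,
# but the bytes we hand to the parser are UTF-8 (as in the session above).
html_text = (
    u'<html><head>'
    u'<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
    u'<title>【楽天市場】</title>'
    u'<link rel="stylesheet" href="/a.css">'
    u'</head><body></body></html>'
)
utf8_bytes = html_text.encode("utf-8")

# State the real encoding of the bytes so the parser does not have to
# rely on the (mismatching) declaration inside the document.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(utf8_bytes, parser=parser)

hrefs = tree.xpath('//link[@rel="stylesheet"]/@href')
print(hrefs)
```

Other options that avoid the mismatch would be to keep passing the unicode string, as in the first parse that worked, or to pass the undecoded bytes from r.content so that the bytes and the declared charset agree.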
