Learning Web Scraping from Scratch: Problem Log 01


Iwcsdn


Article link: https://blog.csdn.net/iwcsdn/article/details/81609731


The IDE I am using is JetBrains PyCharm Community Edition 2018.1.2 x64.

I started from this piece of code:

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
print(html_data)

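It ran without problems.

Then I replaced the link with one that contains Chinese characters:
https://wiki.52poke.com/wiki/妙蛙种子
This raised: UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-13: ordinal not in range(128)
From the post "python3 urlopen打开包含中文的url" (opening a URL that contains Chinese with python3 urlopen),
I learned that urllib.request.urlopen does not accept a URL string containing non-ASCII characters such as Chinese.
The non-ASCII part has to be percent-encoded first with urllib.parse.quote.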
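
As a minimal sketch of what that conversion does (the variable names are my own, and no network access is needed), quote turns each non-ASCII character into the percent-encoded form of its UTF-8 bytes, and unquote reverses it:

```python
from urllib.parse import quote, unquote

# Illustrative value: the page title taken from the URL above.
title = '妙蛙种子'

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded_title = quote(title)
print(encoded_title)

# The result contains only ASCII characters, so it can go into a URL path.
print(encoded_title.isascii())

# unquote() restores the original text.
print(unquote(encoded_title) == title)
```

Each Chinese character here takes three bytes in UTF-8, so it becomes three %XX groups in the encoded string.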

After changing the code according to that post:

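#coding=utf-8
from urllib import request
from urllib.parse import quote
import string
url = 'https://wiki.52poke.com/wiki/妙蛙种子'
# safe=string.printable leaves all printable ASCII characters (such as ':' and '/')
# untouched, so only the Chinese characters are percent-encoded
s = quote(url, safe=string.printable)
resp = request.urlopen(s)
# Read the raw response bytes and try to decode them as UTF-8
html_data = resp.read().decode('utf-8')
print(html_data)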

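It still failed, this time with: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
In other words, the UTF-8 codec cannot decode the byte 0x8b at position 1 because it is not a valid start byte.
So the problem was in the decoding step, but I still did not really understand it. After searching online for quite a while,
I found the post "Python 3.6中 'utf-8' codec can't decode byte invalid start byte?" ('utf-8' codec can't decode byte invalid start byte? in Python 3.6).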
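
The same error can be reproduced without any network access, which may make it easier to understand. This is a small sketch using a made-up page body as a stand-in for the server response: every gzip stream starts with the two bytes 0x1f 0x8b, and 0x8b cannot start a UTF-8 character, so decoding the compressed bytes directly fails at position 1.

```python
import gzip

# Stand-in for a compressed HTTP response body (hypothetical content).
compressed = gzip.compress('<html>妙蛙种子</html>'.encode('utf-8'))

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
print(compressed[:2] == b'\x1f\x8b')

error = None
try:
    # Decoding the compressed bytes as if they were text.
    compressed.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0x1f is a valid ASCII control character, so the failure is
    # reported for the second byte (index 1), which is 0x8b.
    error = exc
    print(exc)
```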
Following the code in that post, I changed my code to:

from urllib import request
from urllib.parse import quote
import string
from io import BytesIO
import gzip
url = 'https://wiki.52poke.com/wiki/妙蛙种子'
# Percent-encode the non-ASCII characters in the URL
encoded_url = quote(url, safe=string.printable)
print(encoded_url)
resp = request.urlopen(encoded_url)
content = resp.read()
# The body is gzip-compressed: decompress it first, then decode it as UTF-8
buff = BytesIO(content)
f = gzip.GzipFile(fileobj=buff)
res = f.read().decode('utf-8')
print(res)

This ran without problems, and the whole page was downloaded.
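
The code above always treats the response as gzip data, so it would fail on a site that sends an uncompressed body (such as the first example appeared to). One possible way to handle both cases is sketched below. The helper name decode_body and the sample page are my own assumptions, not something from the referenced posts; the helper checks for the gzip magic bytes before decompressing.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def decode_body(raw, encoding='utf-8'):
    """Hypothetical helper: return text from a body that may be gzip-compressed."""
    if raw[:2] == GZIP_MAGIC:
        # Compressed body: decompress before decoding.
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline check with a made-up page, both uncompressed and compressed.
page = '<html>妙蛙种子</html>'
plain_body = page.encode('utf-8')
gzipped_body = gzip.compress(plain_body)

print(decode_body(plain_body) == page)
print(decode_body(gzipped_body) == page)
```

With urllib this would be called as decode_body(resp.read()). Checking the response's Content-Encoding header, for example with resp.headers.get('Content-Encoding'), is another way to detect a compressed body.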

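Summary: first, Chinese characters in a URL are not supported and have to be percent-encoded; second, the site returned gzip-compressed data, so it has to be decompressed before it can be decoded as UTF-8.
Many thanks to the posts referenced above.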
