Fetching the Baidu homepage source with Python 3

ykf173

Category: python. Tags: Baidu homepage, python3, web scraping, gzip

Article link: https://blog.csdn.net/ykf173/article/details/83092924

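I have recently been learning web scraping with Python 3. When I tried to fetch the Baidu homepage I ran into a decoding error, and the examples I found online were written the same way as my code. Here is how I solved it.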

Python environment: Python 3.6.5

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf8'))

This produced the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-549c9418a074> in <module>()
      7 #buff = BytesIO(response.read()) # 把content转为文件对象
      8 #f = gzip.GzipFile(fileobj=buff)
----> 9 print(response.read().decode('utf8'))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The traceback reports a decoding error: the response bytes are not valid UTF-8.

If I change the URL to http://www.taobao.com, the same code runs without any problem.

Going back to the Baidu code, let's look at the response headers:

print(response.info())

Output:

Content-Type: text/html
Content-Encoding: gzip
Content-Length: 8394
Cache-Control: no-store
Pragma: no-cache
Expires: -1
Connection: close
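The Content-Encoding: gzip header lines up with the error above. A gzip stream always begins with the two magic bytes 0x1f 0x8b, so decoding compressed bytes as UTF-8 fails on the second byte, at position 1. The sketch below reproduces this locally. The sample HTML string is made up for illustration, and no request is sent to Baidu.

```python
import gzip

# Illustrative sample only: compress a small HTML snippet locally.
compressed = gzip.compress('<html>百度一下</html>'.encode('utf8'))

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
print(compressed[:2].hex())

# Decoding the compressed bytes directly fails the same way as in the traceback.
try:
    compressed.decode('utf8')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The first byte, 0x1f, happens to be a valid ASCII control character, which is why the decoder only gives up at position 1.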

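So the Baidu page is actually sent gzip-compressed (note Content-Encoding: gzip). The body has to be decompressed before it can be decoded: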

import urllib.request
from io import BytesIO
import gzip

response = urllib.request.urlopen('http://www.baidu.com')
buff = BytesIO(response.read())  # wrap the response bytes in a file-like object
f = gzip.GzipFile(fileobj=buff)  # decompress while reading from that object
print(f.read().decode('utf8'))

This time it works.
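The fix above always decompresses, which would fail for a site that returns an uncompressed body (as http://www.taobao.com apparently did in my test). A more general variant checks the Content-Encoding header first. This is only a minimal sketch: `decode_body` and `fetch_text` are hypothetical helper names, not part of urllib, and it assumes the page is UTF-8 and that gzip is the only compression in use.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding, charset='utf8'):
    """Decompress raw bytes if the header says gzip, then decode to str."""
    if (content_encoding or '').strip().lower() == 'gzip':
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url, charset='utf8'):
    """Hypothetical helper: fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        encoding = response.headers.get('Content-Encoding')
        return decode_body(response.read(), encoding, charset)


# Local check without network access (the sample bytes are made up):
sample = gzip.compress('<html>ok</html>'.encode('utf8'))
print(decode_body(sample, 'gzip'))            # gzip body is decompressed first
print(decode_body(b'<html>ok</html>', None))  # plain body is decoded directly
```

With this in place, `print(fetch_text('http://www.baidu.com'))` should behave the same whether or not the server compresses the response.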