Web Scraping Beginner Notes

YYHPLA

Article link: https://blog.csdn.net/weixin_43443913/article/details/107405582

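Encoding problems with requests

The problem

At first I used requests to fetch the Eastmoney (东方财富网) homepage, but the response text came out garbled.
The code was: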

import requests

# Fetch the page and print the body as decoded by requests
r = requests.get('http://www.eastmoney.com/')
print(r.text)

Why didn't that work? When a page comes out garbled, the cause is almost always an encoding problem.
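
To see how this kind of garbling arises without touching the network, here is a minimal sketch. The sample string and variable names are made up for illustration; the only point is that UTF-8 bytes decoded with ISO-8859-1 do not raise an error, they just come out as the wrong characters.

```python
# Hypothetical sample text standing in for a page's content.
original = "东方财富"

# A server sends text as bytes; here we assume it uses UTF-8.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec succeeds silently,
# producing one (wrong) character per byte.
garbled = raw.decode("iso-8859-1")

print(len(original), len(raw), len(garbled))
print(garbled == original)
```

Each of the four characters takes three bytes in UTF-8, so the garbled string is three times as long as the original.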

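Thinking it through

Character encoding as a concept is explained all over the web, so I won't go into detail here.
I'm borrowing from the blog of 年轻人——001 to explain the problem; his blog is worth reading as well.
First, let's check which encoding requests picked by default.
The response's encoding attribute tells us: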

import requests

r = requests.get('http://www.eastmoney.com/')
print(r.encoding)  # the encoding requests will use to decode r.text
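
Some background on what this attribute reports: requests derives `r.encoding` from the `Content-Type` response header, and for `text/*` responses that declare no charset it falls back to ISO-8859-1. The sketch below mimics a simplified version of that rule using only the standard library. `guess_encoding_from_content_type` is a hypothetical helper, not part of requests, and the header values are examples rather than what eastmoney.com actually sends.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical helper: simplified header-based encoding rule."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset  # the server declared an encoding explicitly
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"  # legacy HTTP default for text/* without a charset
    return None  # nothing to go on


for header in ("text/html; charset=utf-8", "text/html", "application/octet-stream"):
    print(header, "->", guess_encoding_from_content_type(header))
```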

The output is ISO-8859-1! requests also provides a way to detect the encoding from the content itself: apparent_encoding. Let's try it.

import requests

r = requests.get('http://www.eastmoney.com/')
print(r.encoding)           # encoding taken from the response headers
print(r.apparent_encoding)  # encoding guessed from the response body

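This time the detected encoding is UTF-8-SIG! So now we know: the page is actually UTF-8 encoded, but requests decoded it as ISO-8859-1, and that is what produced the garbled text.
So we need to undo the wrong decoding (using the encode and decode methods).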
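
A side note on the `-SIG` suffix: it suggests the detector found a byte order mark (BOM) at the start of the content. The sketch below shows how Python's two codecs treat such a payload. The payload is made up for illustration and is not the real page.

```python
import codecs

# Made-up payload: a UTF-8 BOM followed by UTF-8 encoded text.
payload = codecs.BOM_UTF8 + "东方财富".encode("utf-8")

with_bom = payload.decode("utf-8")          # BOM survives as U+FEFF
without_bom = payload.decode("utf-8-sig")   # BOM is stripped

print(len(with_bom), len(without_bom))
print(without_bom)
```

Decoding with `utf-8-sig` also works on content that has no BOM, so it is a safe choice when the detector reports it.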

The solution

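import requests

r = requests.get('http://www.eastmoney.com/')
print(r.encoding)           # encoding requests assumed (ISO-8859-1)
print(r.apparent_encoding)  # encoding detected from the content (UTF-8-SIG)
# encode: turn the wrongly decoded str back into the original bytes
# decode: decode those bytes into a str using the detected encoding
print(r.text.encode(r.encoding).decode(r.apparent_encoding))
# In Python 3, str always holds Unicode text, whatever the source encoding was;
# encode() returns bytes and decode() returns str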
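
Why the round trip works: ISO-8859-1 maps every byte value from 0 to 255 to exactly one character, so encoding the garbled text with it gives back the original bytes unchanged. Here is a minimal offline sketch of that idea. `repair_mojibake` is a hypothetical helper name, and the garbled input is simulated rather than fetched.

```python
def repair_mojibake(text, wrong="iso-8859-1", right="utf-8-sig"):
    """Hypothetical helper: undo a decode that used the wrong codec."""
    raw = text.encode(wrong)   # str -> bytes, recovering the original bytes
    return raw.decode(right)   # bytes -> str, this time with the right codec


# Simulate what happened: UTF-8 bytes decoded as ISO-8859-1.
garbled = "东方财富".encode("utf-8").decode("iso-8859-1")
print(repair_mojibake(garbled))
```

With requests itself, a common alternative is to assign `r.encoding = r.apparent_encoding` before reading `r.text`, or to decode `r.content` directly. Either avoids the round trip.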