2020-10-23 Python Web Scraping, Chapter 1: The urllib and requests Libraries, Section 2: A Closer Look at the requests Library

_落红_

Category column: Web Scraping for Beginners (小白学爬虫). Tags: python, cookie, web scraping, request, session

Article link: https://blog.csdn.net/qq_42704187/article/details/109246922

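Section 2: Learning requests

requests is one of the most important Python libraries for fetching the source of a web page. It covers setting the request method, modifying request headers, using IP proxies, simulating a login, and more. It works on the same principles as urllib, but its syntax is far more concise.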

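1. Sending requests with requests

1.1 The requests.get() function

Because this function is simple to call, I will cover the basic syntax first and give examples afterwards.

We use requests.get() to go through the most commonly used parameters. The other functions, such as requests.post(), accept essentially the same parameters as get(), so I will not repeat them.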

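import requests
url = 'http://httpbin.org/get'  # any URL you want to request
html = requests.get(url, headers=None, data=None, cookies=None, files=None, proxies=None, allow_redirects=True)

# Only the most commonly used parameters are listed here.
# headers  sets your own request headers; pass a dict.
# data     submits form data, such as an account login; pass a dict. It is normally used with requests.post.
# cookies  pass a dict: for each name=value pair, the text left of the equals sign is the key and the text right of it is the value.
#          For example, for cookie = 'a=123; b=scrs; c=12sd' we pass cookies = {'a': '123', 'b': 'scrs', 'c': '12sd'}.
#          Cookies can of course also go into headers, but then headers['cookie'] holds the raw string
#          (that is, headers = {'cookie': 'a=123; b=scrs; c=12sd'}).

# files    sends a file to the site; pass a dict, for example files = {'files': open('a.jpg', 'rb')}.
# proxies  routes the request through a proxy, so the site sees the proxy's IP address instead of yours.
# The request method can also be changed, for example requests.post.
# The other methods are listed below; they accept the same parameters as get:
# GET, OPTIONS, HEAD, POST, PUT, PATCH, DELETE

# allow_redirects  whether to follow redirects; defaults to True.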
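To make the cookies and headers parameters more concrete, here is a minimal sketch that builds a request without sending it and inspects what would go out. The helper cookie_string_to_dict and the User-Agent string 'my-crawler/0.1' are my own illustrative names, not part of requests; requests.Request(...).prepare() is the library's way of building a request object locally.

```python
import requests


def cookie_string_to_dict(cookie_string):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    cookies = {}
    for pair in cookie_string.split(';'):
        if '=' in pair:
            name, value = pair.strip().split('=', 1)
            cookies[name] = value
    return cookies


cookies = cookie_string_to_dict('a=123; b=scrs; c=12sd')

# Build the request locally; nothing is sent over the network.
request = requests.Request(
    'GET',
    'http://httpbin.org/get',
    headers={'User-Agent': 'my-crawler/0.1'},  # made-up User-Agent string
    cookies=cookies,
)
prepared = request.prepare()

print(cookies)                          # the dict form that cookies= expects
print(prepared.headers['User-Agent'])   # our custom header
print(prepared.headers['Cookie'])       # requests turns the dict back into a Cookie header
```

The last line shows that passing a cookies dict and putting the raw string into headers['cookie'] end up as the same kind of Cookie header on the wire.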

requests.get() returns a Response object, which has many methods and attributes. The important ones are listed below.

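# import requests
# html = requests.get(url)
# The most important attributes and commonly used methods of html:
# html.status_code          the status code of the response, as an int
# html.headers              the response headers
# html.content              the response body as bytes (binary)
# html.text                 the response body decoded into a string
# html.apparent_encoding    the encoding guessed from the content of the page
# html.encoding             the encoding used to decode html.text; mainly used to set the encoding yourself
# html.cookies              the cookies returned by the server
# html.url                  the URL of the current request
# html.close()              closes the current request
# html.json()               if the body is a JSON string, parses it into the corresponding Python object. Mainly used when scraping APIs; we will look at it in detail when we get to API scraping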
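The attributes above can be tried out without touching a real website. The following is only a sketch under my own assumptions: JsonHandler, its JSON payload and the /data path are invented for the demonstration, and a throwaway local server from the standard library stands in for a real site. The request goes through a Session with trust_env = False only so that proxy environment variables are ignored; the object it returns is the same Response type that requests.get returns.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Tiny local stand-in for a real site; always answers with JSON."""

    def do_GET(self):
        body = b'{"name": "demo", "items": [1, 2, 3]}'
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(('127.0.0.1', 0), JsonHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/data' % server.server_address[1]

with requests.Session() as session:
    session.trust_env = False           # ignore proxy environment variables
    html = session.get(url, timeout=5)  # same Response type as requests.get

print(html.status_code)                 # int status code
print(html.headers['Content-Type'])     # response header lookup
print(type(html.content))               # body as bytes
print(type(html.text))                  # body decoded to str
print(html.json()['items'])             # body parsed as JSON

server.shutdown()
server.server_close()
```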

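1.2 Simple uses of requests.get

Let's use three examples to get to know the attributes and methods of Response.

For the first example we request 'http://httpbin.org/get'. The output shows that the page's encoding is detected as ascii and that html.text is a string, which means we can use regular expressions to extract just the parts we want instead of printing the whole page.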

import requests
# We use 'http://httpbin.org/get' as the example
html = requests.get('http://httpbin.org/get')
print('Type of html:', type(html), html)
print('Status code of the request:', type(html.status_code), html.status_code)
print('Type of html.text:', type(html.text))
print(html.apparent_encoding)
print(html.headers)
print(html.url)
print(html.text)
#------------------------------
'''
# Output
Type of html: <class 'requests.models.Response'> <Response [200]>
Status code of the request: <class 'int'> 200
Type of html.text: <class 'str'>
ascii
{'Date': 'Fri, 23 Oct 2020 12:09:36 GMT', 'Content-Type': 'application/json', 'Content-Length': '307', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
http://httpbin.org/get
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-5f92c800-7273d33c1dd6004f7ae84275"
  }, 
  "origin": "223.87.210.203", 
  "url": "http://httpbin.org/get"
}
'''
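As a small illustration of the point about regular expressions, the sketch below extracts the "origin" field from a trimmed copy of the text shown above. The variable text stands in for html.text, and the pattern is my own choice; for a JSON response like this one, html.json() would usually be the more robust route.

```python
import re

# Trimmed copy of the response text shown above (stands in for html.text).
text = '''{
  "args": {},
  "origin": "223.87.210.203",
  "url": "http://httpbin.org/get"
}'''

# Capture whatever sits between the quotes after "origin":
match = re.search(r'"origin":\s*"([^"]+)"', text)
origin = match.group(1) if match else None
print(origin)
```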

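For the second example, we download an image.

The image URL is https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=3142343904,631493091&fm=26&gp=0.jpg

It is the CSDN logo. Note that an image can only be written to disk in binary mode. Before running the code, make sure the image link still opens; I simply picked an image at random.

In the output, cookies is an empty <RequestsCookieJar[]>, which means there are no cookies. If you open the URL above in a browser, you will see that the browser has no cookies for it either.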

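import requests
from fake_useragent import UserAgent  # third-party package: pip install fake-useragent
# The header name must be 'User-Agent'; UserAgent().random returns a random browser User-Agent string
html = requests.get('https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=3142343904,631493091&fm=26&gp=0.jpg', headers={'User-Agent': UserAgent().random})
if html.status_code == 200:  # only continue if the request succeeded
    # the cookies returned by the server
    print(html.cookies)
    # the URL of the current request
    print(html.url)
    # the response headers
    print(html.headers)
    with open('a.jpg', 'wb') as fp:
        # html.content is the response body as bytes
        content = html.content
        # write it to the file
        fp.write(content)
# close the request
html.close()
#-------------------------------------
'''
# Output; look in the folder of the current .py file to find the downloaded image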
<RequestsCookieJar[]>
https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=3142343904,631493091&fm=26&gp=0.jpg
{'Server': 'JSP3/2.0.14', 'Date': 'Fri, 23 Oct 2020 12:25:29 GMT', 'Content-Type': 'image/jpeg', 'Content-Length': '12700', 'Connection': 'keep-alive', 'ETag': '0cd1a5dcb7879c53c9c94084b791deee', 'Last-Modified': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Expires': 'Sun, 22 Nov 2020 22:20:42 GMT', 'Age': '287', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=2628000', 'Access-Control-Allow-Origin': '*', 'Ohc-Response-Time': '1 0 0 0 0 0', 'Ohc-Cache-HIT': 'cd4cm52 [4], qdcmcache96 [1]'}
'''

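For the third example we use Baidu Baike. Baidu Baike blocks the default User-Agent of requests, so I have to replace the User-Agent.

If we set html.encoding to ascii, the text comes out garbled; if we set it to utf-8, the text is returned correctly. If the content you get from a site is ever garbled, try changing html.encoding and see whether that gives you the correct text.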

import requests
# We use Baidu Baike as the example. The output is too long, so I do not show it here; copy the code and run it to see the difference
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}

html = requests.get('https://baike.baidu.com/item/CSDN/172150?fr=aladdin', headers=headers)
html.encoding = 'ascii'  # change this to 'utf-8' to get readable text
print('Type of html:', type(html), html)
print('Status code of the request:', type(html.status_code), html.status_code)
print('Type of html.text:', type(html.text))
print(html.apparent_encoding)
print(html.headers)
print(html.url)
print(html.text)
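The effect of html.encoding can be reproduced offline. This is a sketch under my own assumptions: PageHandler and the sample PAGE string are invented, and a local standard-library server replaces Baidu Baike. It serves UTF-8 bytes, then the same response is decoded twice, once as ascii and once as utf-8.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAGE = '<p>百度百科</p>'  # sample page content (made up for this demo)


class PageHandler(BaseHTTPRequestHandler):
    """Serves PAGE as UTF-8 bytes without declaring a charset."""

    def do_GET(self):
        body = PAGE.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(('127.0.0.1', 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

with requests.Session() as session:
    session.trust_env = False  # ignore proxy environment variables
    html = session.get(url, timeout=5)

html.encoding = 'ascii'   # wrong encoding: the Chinese characters cannot be decoded
garbled = html.text
html.encoding = 'utf-8'   # matching encoding: the text is restored
fixed = html.text

print(garbled)
print(fixed)

server.shutdown()
server.server_close()
```

html.text is decoded again each time it is read, so changing html.encoding after the request has finished is enough; no second request is needed.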

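1.3 Using the other parameters of requests.get()

Let's look at a few more examples. The first one uses headers and files.

By passing these parameters, both our request headers and our file are uploaded to the site. The parts of the output marked with an ellipsis were left out by me.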

import requests
# The request headers we want to send
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'Accept-Encoding': 'gzip, deflate, br',
           'Accept-Language': 'zh-CN,zh;q=0.9',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive'}
# Open the image downloaded above in binary mode and put it in a dict under the key 'files'
with open('a.jpg', 'rb') as fp:
    files = {'files': fp}
    # We are submitting data to the site here, so we use the post method
    html = requests.post('http://httpbin.org/post', headers=headers, files=files)
print(html.text)
#--------------------------------
'''
# Output
{
  "args": {}, 
  "data": "", 
  "files": {
    "files": "data:application/octet-stream;base64...
  }, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "zh-CN,zh;q=0.9", 
    ...
    "Content-Type": "multipart/form-data; boundary=3068266282867c527ee6fa7a052e265b", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36", 
   ...
}
'''
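To see what the files parameter does to a request before it leaves your machine, the request can be prepared locally and inspected. This is only a sketch: fake_image is a few stand-in bytes rather than a real a.jpg, and the tuple form (filename, data, content type) is one of the shapes requests accepts for a file entry.

```python
import requests

# Stand-in bytes instead of a real a.jpg, so no file on disk is needed.
fake_image = b'\xff\xd8\xff\xe0 not a real JPEG'

request = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'files': ('a.jpg', fake_image, 'image/jpeg')},
)
prepared = request.prepare()  # built locally, nothing is sent

content_type = prepared.headers['Content-Type']
print(content_type.split(';')[0])             # the multipart media type
print(b'filename="a.jpg"' in prepared.body)   # the file name travels in the body
print(fake_image in prepared.body)            # and so do the raw bytes
```

This matches the "Content-Type": "multipart/form-data; boundary=..." header that httpbin echoes back in the output above: requests picks the boundary and sets that header itself, so it should not be set by hand in headers.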

The second example uses data, headers and cookies.

This time we simulate logging in to the site https://login2.scrape.cuiqingcai.com/login with the username admin and the password admin.

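import requests
data = {'username': 'admin', 'password': 'admin'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
# To avoid mistakes, use https for both URLs. Pass in the URL of the login page
html = requests.post('https://login2.scrape.cuiqingcai.com/login', headers=headers, data=data)
# html.cookies only holds cookies set by the final response. If the login answered with a redirect,
# cookies set along the way are on the responses in html.history, which would explain the empty jar below
print(html.cookies)
# Pass the cookies from the login on to the next request, together with the URL of the page to visit
# Note: without the cookies, the following request cannot succeed
html_1 = requests.get('https://login2.scrape.cuiqingcai.com', cookies=html.cookies)
# This prints html.text, the body returned for the login POST itself; print html_1.text to inspect the second request
print(html.text)
#--------------------------------
'''
# Partial output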
<RequestsCookieJar[]>
 <img
                    data-v-7f856186=""
                    src="https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@464w_644h_1e_1c"
                    class="cover">
              </a>
            </div>
            <div data-v-7f856186="" class="p-h el-col el-col-24 el-col-xs-9 el-col-sm-13 el-col-md-16">
              <a data-v-7f856186="" href="/detail/2" class="name">
                <h2 data-v-7f856186="" class="m-b-sm">这个杀手不太冷 - Léon</h2>
'''
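An empty cookie jar after a login that otherwise worked is typical of logins that answer with a redirect. The sketch below reproduces that situation with a local stand-in server, so everything about it is an assumption of mine (LoginHandler, the sessionid cookie, the helper fresh_post); I am not claiming the site above behaves exactly like this. fresh_post does what requests.post does, a one-off Session per call, except that it ignores proxy environment variables.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST sets a cookie and redirects, GET serves a page."""

    def do_POST(self):
        self.rfile.read(int(self.headers.get('Content-Length', 0)))  # consume the form
        self.send_response(302)
        self.send_header('Set-Cookie', 'sessionid=abc123; Path=/')
        self.send_header('Location', '/')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_GET(self):
        body = b'index page'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def fresh_post(url, **kwargs):
    """Like requests.post, but ignores proxy environment variables."""
    with requests.Session() as session:
        session.trust_env = False
        return session.post(url, timeout=5, **kwargs)


server = HTTPServer(('127.0.0.1', 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
login_url = 'http://127.0.0.1:%d/login' % server.server_address[1]
data = {'username': 'admin', 'password': 'admin'}

# Default behaviour: the redirect is followed, so we get the final page.
followed = fresh_post(login_url, data=data)
print(followed.status_code, followed.cookies.get_dict())
print(followed.history[0].cookies.get_dict())  # the cookie sits on the 302 response

# With allow_redirects=False we get the 302 itself, cookie included.
not_followed = fresh_post(login_url, data=data, allow_redirects=False)
print(not_followed.status_code, not_followed.cookies.get_dict())

server.shutdown()
server.server_close()
```

Either html.history[0].cookies or a login sent with allow_redirects=False gives a jar that can be passed to the next request through cookies=.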

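For the third example, we simulate a login again, this time on the site https://pythonscraping.com/pages/cookies/welcome.php

This time the cookies are no longer empty. If you are interested, you can copy the cookies, turn them into a dict and pass that dict as the cookies parameter; this also logs you in to the site.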

import requests
data = {'username': 'admin', 'password': 'password'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
# To avoid mistakes, use https for both URLs. Pass in the URL of the login page
html = requests.post('https://pythonscraping.com/pages/cookies/welcome.php', headers=headers, data=data)
print(html.cookies)
# Pass the cookies from the login on to the next request, together with the page to visit
html_1 = requests.get('https://pythonscraping.com/pages/cookies/profile.php', cookies=html.cookies)
print(html_1.text)
#-----------------------------
'''
# Output
<RequestsCookieJar[<Cookie loggedin=1 for pythonscraping.com/pages/cookies>, <Cookie username=admin for pythonscraping.com/pages/cookies>]>
Hey admin! Looks like you're still logged into the site!
'''

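That covers most of the basic functionality of the requests library. With the syntax above you can access the vast majority of static pages, provided you can get past their anti-scraping measures, such as Baidu Baike blocking the default User-Agent of requests.

2. Learning requests.Session()

After logging in above, we had to pass cookies along whenever we wanted to visit a page behind the login; only then could we get in. But what should we do when we need to visit dozens of pages after logging in, and the cookies issued at login keep changing? For this, requests gives us a powerful class: requests.Session(). Once we use it, we no longer need to pass the cookies parameter to visit other pages. Let's look at the basic usage of requests.Session.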

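# import requests
# # Instantiate the Session class
# session = requests.Session()
# # Call the session's methods. A session has the same basic methods as requests (get, post, put and so on), and their parameters work and are passed in the same way
# # Again, only some of the parameters are listed here
# # session.get also returns a Response object, so we can use status_code, headers, cookies and the other attributes
# html = session.get(url, headers=None, data=None, cookies=None, files=None, proxies=None, allow_redirects=True)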

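A session is mainly used for sites that require cookies, because the program no longer has to pass the cookies parameter itself. Of course, a session can also be used to access ordinary sites.

Let's look at the example below. We do not pass cookies anywhere, yet we can still simulate the login and access the page successfully.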

import requests
# Instantiate the Session class
session = requests.Session()

data = {'username': 'a', 'password': 'password'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
# To avoid mistakes, use https for both URLs. Pass in the URL of the login page
html = session.post('https://pythonscraping.com/pages/cookies/welcome.php', headers=headers, data=data)
print(html.cookies)
# No cookies argument here: the session sends the cookies it stored at login
html_1 = session.get('https://pythonscraping.com/pages/cookies/profile.php')
print(html_1.text)
#-----------------------------------
'''
# Output
<RequestsCookieJar[<Cookie loggedin=1 for pythonscraping.com/pages/cookies>, <Cookie username=a for pythonscraping.com/pages/cookies>]>
Hey a! Looks like you're still logged into the site!
'''
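The difference a session makes can also be shown side by side on a local stand-in site. Everything in this sketch is my own assumption for illustration (SiteHandler, the loggedin cookie, the /login and /profile paths): one Session logs in and then requests the profile page, while a second, fresh Session requests the same page without logging in.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SiteHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST logs in via a cookie, GET checks that cookie."""

    def _reply(self, text, cookie=None):
        body = text.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        if cookie:
            self.send_header('Set-Cookie', cookie)
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):  # "login": always succeeds and sets a cookie
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        self._reply('welcome', cookie='loggedin=1; Path=/')

    def do_GET(self):  # "profile": only visible with the login cookie
        if 'loggedin=1' in self.headers.get('Cookie', ''):
            self._reply('profile page')
        else:
            self._reply('please log in')

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(('127.0.0.1', 0), SiteHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

# One session for login and profile: the cookie is stored and re-sent for us.
session = requests.Session()
session.trust_env = False  # ignore proxy environment variables
session.post(base + '/login', data={'username': 'a', 'password': 'password'}, timeout=5)
with_login = session.get(base + '/profile', timeout=5).text

# A fresh session that never logged in has no cookie to send.
fresh = requests.Session()
fresh.trust_env = False
without_login = fresh.get(base + '/profile', timeout=5).text

print(session.cookies.get_dict())
print(with_login)
print(without_login)

server.shutdown()
server.server_close()
```

The stored cookies live in session.cookies, so they can still be inspected or copied out if another tool needs them.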

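That is nearly everything about the important functions and classes of the requests library. With this library and these methods we can access about 90% of static pages; whether you can actually scrape the data you want depends on how well you deal with anti-scraping measures. In the next section I will summarize and compare the differences and similarities between urllib and requests, and go through more login simulation examples. If you are interested, please read on there.

#I have a science background and am not a strong writer; as long as you can follow what I mean, that is enough

#If you repost this article, please credit the source. Thank you

#If you find any mistakes in the article, please point them out; I am all ears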