Web Scraping Basics (2): Basic Usage of the Requests Module

Article link: https://blog.csdn.net/yhhuang17/article/details/101714590

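The Requests module

The two modules most commonly used for web scraping in Python are urllib and requests. requests is built on top of urllib3 and offers a simpler, more convenient API. It supports HTTP keep-alive and connection pooling, session persistence via cookies, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.

1. Installing the Requests module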

$ pip install requests

2. Usage in Python

2.1 Basic GET request

import requests

# Send a GET request to the URL; the call returns a Response object
response = requests.get("https://www.baidu.com")  # <Response [200]>

2.1.1 Use response.text to get the response body as a string (str)

response.encoding = 'utf-8'  # set the encoding used to decode response.text
html_str = response.text

2.1.2 [Recommended] Use response.content to get the response body as bytes

html_b = response.content
html_str = response.content.decode()  # decode() converts the bytes to a str (UTF-8 by default)
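The reason for decoding response.content yourself is that you control which encoding is applied. The sketch below illustrates this offline, without sending a request. The byte string raw_body is a hypothetical stand-in for response.content from a GBK-encoded page; it does not come from a real server.

```python
# Hypothetical stand-in for response.content: the bytes of a GBK-encoded page.
raw_body = "百度一下".encode("gbk")

# decode() with no argument assumes UTF-8, which fails on GBK bytes.
try:
    raw_body.decode()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the page's actual encoding recovers the text.
html_str = raw_body.decode("gbk")

print(utf8_ok)
print(html_str)
```

If decode() raises UnicodeDecodeError on a real page, the charset declared in the page's Content-Type header or its meta tag is usually the right value to pass.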

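2.2 Adding headers and params

headers: sent with the request so that it looks like it comes from a browser and the server returns the same content a browser would receive
– Format: dictionary
params: the query parameters, for example: https://www.baidu.com/s?wd=python
– Format: dictionary; the text before the equals sign is the key, and the text after it is the value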

import requests

url = "https://www.baidu.com/s"
p = {"wd": "python"}
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get(url, params=p, headers=headers)
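To see what the automatic encoding of a params dictionary amounts to, here is a minimal sketch that builds the query string by hand with the standard library. It approximates what requests does and is not its internal code. The names base_url, params, and full_url and the search term "python requests" are illustrative assumptions.

```python
from urllib.parse import urlencode

base_url = "https://www.baidu.com/s"
params = {"wd": "python requests"}  # illustrative search term containing a space

# urlencode() turns the dict into "key=value" pairs joined by "&";
# spaces become "+" and other unsafe characters are percent-encoded.
query = urlencode(params)
full_url = base_url + "?" + query

print(full_url)  # https://www.baidu.com/s?wd=python+requests
```

With requests you normally skip this step and pass the dictionary through params, which also avoids mistakes when a value contains characters such as & or =.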

Another way to build the same URL is string formatting with format:

"https://www.baidu.com/s?{}".format("wd=python")

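2.2.1 Other attributes of the Response object

# Full URL of the request, including the query string
print(response.url)

# Character encoding inferred from the response headers
print(response.encoding)

# HTTP status code
print(response.status_code)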
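In a scraper these attributes are usually checked before the page is parsed. The sketch below collects them into a dictionary and flags whether the status code is in the 2xx range. summarize_response is a hypothetical helper, not part of requests, and the SimpleNamespace object only stands in for a real Response so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect url, encoding and status_code, plus a 2xx success flag."""
    return {
        "url": response.url,
        "encoding": response.encoding,
        "status_code": response.status_code,
        "is_success": 200 <= response.status_code < 300,
    }


# Stand-in object with the same three attributes as a requests Response.
fake_response = SimpleNamespace(
    url="https://www.baidu.com/s?wd=python",
    encoding="utf-8",
    status_code=200,
)
summary = summarize_response(fake_response)
print(summary["status_code"], summary["is_success"])  # 200 True
```

With a real request you would pass the object returned by requests.get(...) instead of fake_response.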