1. Installing the requests library
Anaconda is recommended; requests ships with it.
2. Using requests

```python
import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)   # HTTP status code of the response
r.encoding = 'utf-8'   # set the encoding before reading r.text
print(r.text)
```
![](https://i-blog.csdnimg.cn/blog_migrate/bff65d723dca949455f3bd7b715cf1e7.png)
![](https://i-blog.csdnimg.cn/blog_migrate/306e6175763f29ce0c68ce76ca901d32.png)
2.1 The Requests library's get() method
![](https://i-blog.csdnimg.cn/blog_migrate/2ab0b3e96c508b42033c828bf816c6c3.png)
![](https://i-blog.csdnimg.cn/blog_migrate/53fa4ce22ab13a7ffc7cce451a3a8d06.png)
![](https://i-blog.csdnimg.cn/blog_migrate/3766d17f68c736ad143003f1762274d5.png)
![](https://i-blog.csdnimg.cn/blog_migrate/10ce1b1cf5ed7a1531a51a62b052a8ed.png)
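The full signature is `requests.get(url, params=None, **kwargs)`. A small offline sketch (building a `PreparedRequest`, so nothing is actually sent) shows how `params` becomes the query string:

```python
import requests

# Build, but do not send, a GET request to inspect the final URL.
req = requests.Request("GET", "https://www.baidu.com/s",
                       params={"wd": "python"}).prepare()
print(req.url)  # https://www.baidu.com/s?wd=python
```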
2.2 The Response object
![](https://i-blog.csdnimg.cn/blog_migrate/1111a7f1a3af6a3b0cec1ec2680a83f5.png)
(1) Checking whether a request succeeded

```python
assert response.status_code == 200
```
![](https://i-blog.csdnimg.cn/blog_migrate/b0297376ebda511585a8ceeabb4a03b0.png)
![](https://i-blog.csdnimg.cn/blog_migrate/7d4b9bd1e1e37896916257ad11d1341c.png)
![](https://i-blog.csdnimg.cn/blog_migrate/bfa4d02c18bbc33bb4bcc7c46aa3d947.png)
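The key Response attributes are `r.status_code`, `r.content` (raw bytes), `r.encoding`, and `r.text` (bytes decoded with `r.encoding`). As an offline illustration only (it pokes at the private `_content` field purely to avoid a network call; normally `requests.get()` returns the Response), here is how they relate:

```python
import requests

# Hand-built Response for illustration; _content is an internal field.
r = requests.models.Response()
r.status_code = 200
r._content = "你好".encode("utf-8")  # raw body bytes
r.encoding = "utf-8"                 # how r.text will decode them

print(r.status_code)  # 200
print(r.content)      # the raw bytes
print(r.text)         # bytes decoded with r.encoding -> 你好
```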
2.3 Sending requests with headers via the requests module
(1) Without mimicking a browser, the server returns only a small portion of the content
![](https://i-blog.csdnimg.cn/blog_migrate/b9f8fad09ba28bde3f76f5c87fb6b9e6.png)
(2) To mimic a browser, send the request with headers
![](https://i-blog.csdnimg.cn/blog_migrate/57da835a90464e3dc8ee6bd797dc3379.png)
![](https://i-blog.csdnimg.cn/blog_migrate/4027feda73aa89534450bef263c10431.png)
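Passing a `headers` dict sets the request headers. Again, a `PreparedRequest` lets us check the result without sending anything (the User-Agent string is just a sample browser UA):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/79.0.3945.88 Safari/537.36"}
req = requests.Request("GET", "https://www.baidu.com",
                       headers=headers).prepare()
print(req.headers["User-Agent"])  # the UA the server would see
```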
2.4 Sending requests with parameters
![](https://i-blog.csdnimg.cn/blog_migrate/a5a4587945b977353b6c46cbae19f399.png)
(1) URL encoding

https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dpython&logid=8596791949931086675&signature=aa5a72defcf92845bdcdac2e55e0aab3&timestamp=1579276087

Decoded:

https://www.baidu.com/s?wd=python&logid=8596791949931086675&signature=aa5a72defcf92845bdcdac2e55e0aab3&timestamp=1579276087
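The standard library can reproduce this encoding and decoding; `quote` with `safe=""` percent-encodes every reserved character:

```python
from urllib.parse import quote, unquote

url = "https://www.baidu.com/s?wd=python"
encoded = quote(url, safe="")   # percent-encode ':', '/', '?', '=' too
print(encoded)                  # https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dpython
print(unquote(encoded) == url)  # True: decoding round-trips
```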
(2) Another way to build the URL: string formatting

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}

# Option A: let requests build the query string from a dict
ps = {"wd": "python"}
r = requests.get("https://www.baidu.com/s", params=ps, headers=headers)

# Option B: format the query string yourself
url_tmp = "https://www.baidu.com/s?wd={}".format("python")
r = requests.get(url=url_tmp, headers=headers)

print(r.status_code)
print(r.request.url)   # the final URL that was actually sent
```
3. A general code framework
![](https://i-blog.csdnimg.cn/blog_migrate/9a4198c45d9f9d6c1530a5781ba550ea.png)
![](https://i-blog.csdnimg.cn/blog_migrate/a95cde52619f67e7730bc6ddec28c2cc.png)
![](https://i-blog.csdnimg.cn/blog_migrate/cbb075f14d157823ce265147563e9402.png)
![](https://i-blog.csdnimg.cn/blog_migrate/cee045380df16e0c10cf0871b3c0d4df.png)
![](https://i-blog.csdnimg.cn/blog_migrate/14410efdb1d522067d504cd535ad0bf4.png)
![](https://i-blog.csdnimg.cn/blog_migrate/77ebb8c7c8d7dd1cbfba2a85dd5474f4.png)
![](https://i-blog.csdnimg.cn/blog_migrate/76c801ea3957969542c3d810c258dd0c.png)
![](https://i-blog.csdnimg.cn/blog_migrate/71f4cc6f18cb567fc037e043c0680492.png)
![](https://i-blog.csdnimg.cn/blog_migrate/217e681c48d3edeb233084ea3fa583f8.png)
![](https://i-blog.csdnimg.cn/blog_migrate/43167b24e0d8d0188c7b11b16e2dd49b.png)
![](https://i-blog.csdnimg.cn/blog_migrate/afe6d293f46c179a391b816d903f20c1.png)
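The framework shown in the slides is commonly written like this (the function name here is mine): wrap the request in try/except, call `raise_for_status()` so 4xx/5xx responses become exceptions, and fix the encoding before returning the text:

```python
import requests

def get_html_text(url):
    """Fetch url and return its text, or an error marker on any failure."""
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()              # raise HTTPError on 4xx/5xx
        r.encoding = r.apparent_encoding  # guess encoding from the body
        return r.text
    except requests.RequestException:
        return "request failed"

print(get_html_text("http://nonexistent.invalid/"))  # request failed
```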
4. A few examples
![](https://i-blog.csdnimg.cn/blog_migrate/ea80bb731e63999d0310f5806cef04b3.png)
![](https://i-blog.csdnimg.cn/blog_migrate/3625b3edc6fc2149ae3a28c0d23a1c4e.png)
5. Example: a Baidu Tieba spider
```python
# -*- coding: utf-8 -*-
import requests


class TiebaSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        self.url_temp = "http://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}

    def get_url_list(self):  # 1. build the list of page URLs
        url_list = []
        for i in range(100):
            url_list.append(self.url_temp.format(i * 50))
        return url_list

    def parse_url(self, url):  # 2. send the request, get the response
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html_str, page_num):  # 3. save the HTML string
        file_path = "{}_第{}页.html".format(self.tieba_name, page_num)  # e.g. "成果_第1页.html"
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(html_str)

    def run(self):  # main logic
        # 1. build the URL list
        url_list = self.get_url_list()
        # 2. iterate: send each request, get each response
        for page_num, url in enumerate(url_list, start=1):
            html_str = self.parse_url(url)
            # 3. save, numbering pages from 1
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba_spider = TiebaSpider("成果")
    tieba_spider.run()
```
![](https://i-blog.csdnimg.cn/blog_migrate/c882c136f4663740ebab8fb00bf22894.png)
![](https://i-blog.csdnimg.cn/blog_migrate/87afdf926af61b9e11378f92b02f6760.png)
![](https://i-blog.csdnimg.cn/blog_migrate/cb87e5dec0c5d58d2f23907322eb46bc.png)
(1) Supplement: list comprehensions
![](https://i-blog.csdnimg.cn/blog_migrate/6d1c7761aaab72ad9111a6007d9497a6.png)
Flat is better than nested
![](https://i-blog.csdnimg.cn/blog_migrate/76403c2d48755a3698f4b129ee25e3f7.png)
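The loop inside get_url_list above is a textbook candidate: a list comprehension does the same job in one line, in the spirit of "flat is better than nested":

```python
tieba_name = "成果"
url_temp = "http://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"

# The append loop collapses into a single comprehension
url_list = [url_temp.format(i * 50) for i in range(100)]

print(len(url_list))  # 100
print(url_list[1])    # ...&pn=50
```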