Python Web Scraping: The requests Module
I. Installation
1. Standard installation
pip install requests
2. Installing from a mirror
pip install requests -i https://pypi.douban.com/simple
II. Quick Start
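1. Standard form of requests.get:
requests.get(url, params=params, headers=headers, cookies=cookies)
When the request has no query parameters, only url is needed.
When it has query parameters, pass them as the key-value pairs of the params dict.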
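To see what params does without sending anything, the request can be prepared offline. This is a minimal sketch: requests.Request(...).prepare() builds the final URL but performs no network I/O, and the keyword and offset are borrowed from the Tieba example in this section.

```python
import requests

# Build the request offline; nothing is sent over the network.
req = requests.Request(
    'GET',
    'https://tieba.baidu.com/f',
    params={'kw': '海贼王', 'pn': '150'},
)
prepared = req.prepare()

# requests percent-encodes each value and appends the query string to the URL.
print(prepared.url)
```

Printing prepared.url before a real call is a cheap way to check that the query string matches the one the browser shows.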
(1) Keyword in percent-encoded (%XX hexadecimal) form
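import requests
from urllib.parse import unquote
# Page: Baidu Tieba, forum "海贼王" (One Piece)
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0 page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50 page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100 page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150 page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
# requests percent-encodes params values itself, so decode the keyword first;
# passing the %XX string as-is would encode its '%' signs again (%25E6...).
params = {'kw': unquote('%E6%B5%B7%E8%B4%BC%E7%8E%8B'), 'pn': '150'}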
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)  # GET request with query parameters
print(response.text)
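A percent-encoded keyword needs care when it goes into params, because requests encodes params values itself. The sketch below uses only the standard library; urllib.parse.urlencode stands in for that encoding step, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote, urlencode

keyword = '海贼王'

# Percent-encoding writes the keyword's UTF-8 bytes as %XX escapes.
encoded = quote(keyword)
print(encoded)

# Encoding an already-encoded value escapes its '%' signs a second time.
double = urlencode({'kw': encoded})
print(double)

# Decoding first restores the original text, which is safe to pass in params.
print(unquote(encoded) == keyword)
```

In short, a %XX string belongs directly in the URL, while a params value should be the plain, undecoded text.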
(2) Keyword left as plain Chinese text
import requests
# Page: Baidu Tieba, forum "海贼王" (One Piece)
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0 page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50 page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100 page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150 page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
params = {'kw': '海贼王', 'pn': '100'}
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)  # GET request with query parameters
print(response.text)
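The commented URLs suggest that pn advances by 50 per page. Assuming that step holds for every page, the URLs can be generated rather than typed by hand. page_url and POSTS_PER_PAGE are hypothetical names introduced for this sketch.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f'
POSTS_PER_PAGE = 50  # assumption read off the commented URLs: pn = 0, 50, 100, 150


def page_url(keyword, page):
    """Return the list URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    query = urlencode({'kw': keyword, 'pn': (page - 1) * POSTS_PER_PAGE})
    return f'{BASE_URL}?{query}'


# Print the first four page URLs without sending any request.
for page in range(1, 5):
    print(page, page_url('海贼王', page))
```

With requests, the same offset could instead be passed as params={'kw': keyword, 'pn': offset} and left for the library to encode.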
2. Shorthand form of requests.get:
requests.get(url, headers=headers, cookies=cookies)
(1) Keyword in percent-encoded form
import requests
# Page: Baidu Tieba, forum "海贼王" (One Piece)
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0 page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50 page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100 page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150 page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Shorthand form
url2 = 'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=250'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url2, headers=headers)  # GET request with the query string already in the URL
print(response.text)
(2) Keyword left as plain Chinese text
import requests
# Page: Baidu Tieba, forum "海贼王" (One Piece)
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0 page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50 page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100 page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150 page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Shorthand form; requests percent-encodes the non-ASCII keyword when it prepares the URL
url3 = 'https://tieba.baidu.com/f?kw=海贼王&pn=250'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url3, headers=headers)  # GET request with the query string already in the URL
print(response.text)
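The standard and shorthand forms describe the same request. As a rough illustration using only the standard library, a shorthand URL can be split back into the base URL and the params dict that the standard form would take. The names base and params are just for this sketch.

```python
from urllib.parse import parse_qsl, urlsplit

url2 = 'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=250'

parts = urlsplit(url2)
base = f'{parts.scheme}://{parts.netloc}{parts.path}'
# parse_qsl splits the query string and decodes the %XX escapes.
params = dict(parse_qsl(parts.query))

print(base)
print(params)
```

The shorthand form is convenient for a URL copied from the browser. The standard form is easier when a value such as pn changes from request to request.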
3. Fixing garbled output (mojibake)
import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url, headers=headers)  # headers must be passed by keyword; the second positional argument is params
print(response.text)
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表æƒ
,å¯çˆ±çš„懒羊羊æžç¬‘图片_第1页_表æƒ
(remainder omitted)
'''
At this point the printed output is garbled.
Fix: replace response.text with response.content.decode('utf-8')
import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url, headers=headers)  # headers must be passed by keyword; the second positional argument is params
print(response.content.decode('utf-8'))
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情,可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情,可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情,可爱的懒羊羊搞笑图片,欢迎点击,免费下载图片" />
<script type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
(remainder omitted)
'''
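Cause:
response.text returns a str. It is what requests gets by decoding response.content. When the server does not declare the charset, requests has to guess one, and the guess is not always right, which is where the garbled text comes from.
response.content returns bytes. It is the raw data received from the site, with no decoding applied.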
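The difference can be reproduced without any network access. The sketch below assumes the wrong guess was ISO-8859-1 (the fallback requests uses when a text/* Content-Type header names no charset); the variable names are illustrative.

```python
title = '喜羊羊QQ表情'

# What the server sends: raw bytes, like response.content.
content = title.encode('utf-8')

# Decoding with the wrong codec does not fail; it yields the wrong characters.
garbled = content.decode('iso-8859-1')
# Decoding with the codec the page really uses recovers the text.
correct = content.decode('utf-8')

print(repr(garbled))
print(correct)

# ISO-8859-1 maps every byte to one character, so the damage is reversible here.
print(garbled.encode('iso-8859-1').decode('utf-8') == title)
```

The bytes are intact in both cases. Only the choice of codec at decode time decides whether the text is readable.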
Solutions:
Method 1. Use print(response.content.decode('utf-8'))
Method 2. Before printing response.text, set the encoding with response.encoding = 'utf-8', then call print(response.text).
import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url, headers=headers)  # headers must be passed by keyword; the second positional argument is params
response.encoding = 'utf-8'
print(response.text)
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情,可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情,可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情,可爱的懒羊羊搞笑图片,欢迎点击,免费下载图片" />
<script type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
</head>
(remainder omitted)
'''
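Both methods hard-code 'utf-8', which works here because the page's own meta tag declares charset=utf-8. For pages whose encoding is not known in advance, one possible approach is to read that declaration from the raw bytes before decoding. decode_html is a hypothetical helper, not part of requests, and the regular expression is a simplification that only looks at the first 2048 bytes.

```python
import re

# Matches charset=... as it appears in a <meta> tag, quoted or not.
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(content, default='utf-8'):
    """Decode raw page bytes using the charset the page declares (hypothetical helper)."""
    match = CHARSET_RE.search(content[:2048])
    encoding = match.group(1).decode('ascii') if match else default
    try:
        return content.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back and replace undecodable bytes.
        return content.decode(default, errors='replace')


sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    '<title>喜羊羊QQ表情</title>'
).encode('utf-8')
print(decode_html(sample))
```

With a real response this would be called as decode_html(response.content).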
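4. Setting a proxy IP in requests
1. Anonymity levels of proxy IPs
(1) Transparent: the server knows you are using a proxy IP and also knows your real IP.
(2) Anonymous: the server knows a proxy IP is in use but does not know the real IP.
(3) Elite (high anonymity): the server knows neither that a proxy IP is in use nor the real IP.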
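The three levels can be expressed as a small decision rule. This is only a heuristic sketch: it assumes a proxy reveals itself through the Via and X-Forwarded-For request headers, which is common but not universal. classify_proxy and the sample addresses are hypothetical.

```python
def classify_proxy(seen_headers, real_ip):
    """Guess the anonymity level from the headers the target server received.

    seen_headers: dict of request headers as observed by the server.
    real_ip: the client's real public IP address.
    Heuristic only: assumes proxies disclose themselves via Via / X-Forwarded-For.
    """
    headers = {name.lower(): value for name, value in seen_headers.items()}
    forwarded = headers.get('x-forwarded-for', '')
    if real_ip in [ip.strip() for ip in forwarded.split(',')]:
        return 'transparent'  # proxy detected and real IP exposed
    if 'via' in headers or forwarded:
        return 'anonymous'    # proxy detected, real IP hidden
    return 'elite'            # no sign of a proxy, real IP hidden


real = '203.0.113.7'  # documentation-range address used as a stand-in
print(classify_proxy({'Via': '1.1 proxy', 'X-Forwarded-For': real}, real))
print(classify_proxy({'Via': '1.1 proxy'}, real))
print(classify_proxy({'User-Agent': 'Mozilla/5.0'}, real))
```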
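2. Looking up your IP
- Internal (LAN) IP
Run ipconfig in cmd
- External (public) IP
1. Open IPIP.net in a browser
2. Use httpbin (http://httpbin.org/ip)
3. Setting the proxy in requests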
import requests
url = "http://httpbin.org/ip"
# Put the proxy address in a dict and pass it through the proxies parameter
proxy = {
'http': 'http://175.7.199.79:3256'
}
res = requests.get(url,proxies=proxy)
print(res.text)
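The keys of the proxies dict matter: requests picks a proxy by matching the URL scheme against them, so a dict with only an 'http' key leaves https:// URLs unproxied. The sketch below mimics that lookup in simplified form (requests also supports host-specific and 'all' keys, omitted here). proxy_for is a hypothetical helper, and reusing the same address for 'https' assumes the proxy can tunnel HTTPS, which the example above does not establish.

```python
from urllib.parse import urlsplit


def proxy_for(url, proxies):
    """Return the proxy that applies to url, or None (simplified, hypothetical helper)."""
    return proxies.get(urlsplit(url).scheme)


http_only = {'http': 'http://175.7.199.79:3256'}
both = {
    'http': 'http://175.7.199.79:3256',
    'https': 'http://175.7.199.79:3256',  # assumption: the proxy supports HTTPS tunnelling
}

print(proxy_for('http://httpbin.org/ip', http_only))      # proxy applies
print(proxy_for('https://tieba.baidu.com/f', http_only))  # None: request goes out directly
print(proxy_for('https://tieba.baidu.com/f', both))       # proxy applies
```

Comparing the address printed by http://httpbin.org/ip with and without proxies is a quick check that the proxy is actually in use.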