Python Web Scraping: The requests Module

Latest recommended article published on 2024-03-25 13:55:15

weixin_44831124

Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_44831124/article/details/119701183

I. Installation
- 1. Standard installation
- 2. Installing from a mirror
II. Quick start

I. Installation

1. Standard installation

pip install requests

2. Installing from a mirror

pip install requests -i https://pypi.douban.com/simple

II. Quick start

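1. Standard form of requests.get:

requests.get(url, params=params, headers=headers, cookies=cookies)
Without query parameters, only url is required.
With query parameters, pass them as key-value pairs in the params dict. headers and cookies must be passed as keyword arguments.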
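
To see what the params dict does to the URL, the following sketch prepares a request without sending it and prints the final address. requests.Request and prepare() are part of the requests API; the keyword and page offset are the ones used in this article.

```python
import requests

BASE_URL = 'https://tieba.baidu.com/f'

# No parameters: the URL is used unchanged.
plain = requests.Request('GET', BASE_URL).prepare()
print(plain.url)

# With parameters: each key-value pair is appended to the query string,
# and non-ASCII values are percent-encoded as UTF-8.
with_params = requests.Request(
    'GET', BASE_URL, params={'kw': '海贼王', 'pn': '150'}
).prepare()
print(with_params.url)
```

Nothing is sent over the network here, so this is a convenient way to check a query string before running the real request.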

(1) Keyword given in percent-encoded (URL-encoded) form

import requests
from urllib.parse import unquote

# Baidu Tieba forum for 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
# requests percent-encodes params values itself, so decode the keyword first;
# passing the encoded string as-is would encode it a second time ('%' becomes '%25').
params = {'kw': unquote('%E6%B5%B7%E8%B4%BC%E7%8E%8B'), 'pn': '150'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)  # GET request with query parameters
print(response.text)
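
One caveat when the keyword is already percent-encoded: values passed through params are percent-encoded by requests, so an encoded keyword would be encoded a second time. The sketch below uses only urllib.parse from the standard library to show the effect. It illustrates the encoding rule, not how Baidu handles such a request.

```python
from urllib.parse import quote, unquote, urlencode

encoded_kw = '%E6%B5%B7%E8%B4%BC%E7%8E%8B'

# Decoding recovers the original keyword.
keyword = unquote(encoded_kw)
print(keyword)

# Encoding the raw keyword gives back the form seen in the browser URL.
print(quote(keyword) == encoded_kw)

# Encoding the already-encoded string turns every '%' into '%25'.
double_encoded = urlencode({'kw': encoded_kw})
print(double_encoded)
```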

(2) Keyword passed directly in Chinese, without conversion

import requests
# Baidu Tieba forum for 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
params = {'kw': '海贼王', 'pn': '100'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)  # GET request with query parameters
print(response.text)
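
The URL comments above show pn growing by 50 per page (0, 50, 100, 150 for pages 1 to 4). Assuming that pattern holds for later pages, a small helper can build the params dict for any page. page_params is a hypothetical name introduced for this sketch.

```python
PAGE_SIZE = 50  # pn step between consecutive pages, taken from the URLs above

def page_params(keyword, page):
    """Return the query parameters for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return {'kw': keyword, 'pn': str((page - 1) * PAGE_SIZE)}

for page in range(1, 5):
    print(page, page_params('海贼王', page))
```

Each dict can be passed straight to requests.get through the params argument, for example requests.get(url1, params=page_params('海贼王', 3), headers=headers).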

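2. Shorthand form of requests.get:

requests.get(url, headers=headers, cookies=cookies)
Here the query string is written directly into the URL, so no params dict is needed.

(1) Keyword given in percent-encoded (URL-encoded) form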

import requests
# Baidu Tieba forum for 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150

# Shorthand form
url2 = 'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=250'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url2, headers=headers)  # GET request; the parameters are already in the URL
print(response.text)

(2) Keyword passed directly in Chinese, without conversion

import requests
# Baidu Tieba forum for 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150

# Shorthand form
url3 = 'https://tieba.baidu.com/f?kw=海贼王&pn=250'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url3, headers=headers)  # GET request; the parameters are already in the URL
print(response.text)
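
When the keyword is written into the URL in Chinese, requests percent-encodes it before sending. One way to check this without making a network call is to prepare the request and inspect its url attribute:

```python
import requests

url3 = 'https://tieba.baidu.com/f?kw=海贼王&pn=250'
prepared = requests.Request('GET', url3).prepare()

# The non-ASCII keyword has been converted to UTF-8 percent-encoding.
print(prepared.url)
```

So both spellings of the shorthand URL end up as the same request on the wire.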

3. Fixing garbled output

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/'
                   '537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/'
                   '1.70.3868.400 QQBrowser/10.8.4394.400')
}

# headers must be passed by keyword; the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
print(response.text)
'''
ï»¿<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>å–œç¾Šç¾ŠQQè¡¨æƒ
ï¼Œå¯çˆ±çš„æ‡’ç¾Šç¾Šæžç¬‘å›¾ç‰‡_ç¬¬1é¡µ_è¡¨æƒ
(remaining output omitted)
'''

At this point the printed result is garbled.

Fix: replace response.text with response.content.decode('utf-8')

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/'
                   '537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/'
                   '1.70.3868.400 QQBrowser/10.8.4394.400')
}

# headers must be passed by keyword; the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
print(response.content.decode('utf-8'))
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情，可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情，可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情，可爱的懒羊羊搞笑图片，欢迎点击，免费下载图片" />
<script  type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
(remaining output omitted)
'''

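Cause:
response.text returns a string. It is what requests gets by decoding response.content. When the server's Content-Type header does not declare a charset, requests has to choose one itself (for text responses it falls back to ISO-8859-1), and that choice is not always correct, which is why the output can be garbled.
response.content returns raw bytes. It is the data exactly as received from the site, with no decoding applied.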
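
The garbling can be reproduced without any network access. The sketch below encodes a short piece of the page title as UTF-8 bytes (standing in for response.content) and decodes it with two different codecs. ISO-8859-1 is used as the wrong choice because it is the fallback requests applies to text responses that declare no charset.

```python
title = '喜羊羊QQ表情'

# What the server sends: UTF-8 bytes (the role played by response.content).
raw = title.encode('utf-8')

# Decoding with the wrong codec yields mojibake, like the garbled output above.
garbled = raw.decode('iso-8859-1')
print(garbled == title)

# Decoding with the right codec restores the text.
print(raw.decode('utf-8'))

# The damage is reversible here because ISO-8859-1 maps every byte to a character.
print(garbled.encode('iso-8859-1').decode('utf-8') == title)
```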

Solutions:
Method 1: use print(response.content.decode('utf-8')).
Method 2: set the encoding through response.encoding before printing response.text.
  response.encoding = 'utf-8'
  print(response.text)

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/'
                   '537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/'
                   '1.70.3868.400 QQBrowser/10.8.4394.400')
}

# headers must be passed by keyword; the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
print(response.text)
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情，可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情，可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情，可爱的懒羊羊搞笑图片，欢迎点击，免费下载图片" />
<script  type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
</head>
(remaining output omitted)
'''
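
Both methods can be wrapped in a small helper. This is a minimal sketch: fetch_html is a hypothetical name, the timeout value is arbitrary, and falling back to response.apparent_encoding (the encoding requests detects from the body bytes) when no encoding is given is one possible choice, not something the examples above do.

```python
import requests

def fetch_html(url, headers=None, encoding=None, timeout=10):
    """Download a page and return its text decoded with an explicit encoding."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    # Use the caller's encoding if given, otherwise the one detected from the bytes.
    response.encoding = encoding or response.apparent_encoding
    return response.text

# Example call (requires network access):
# html = fetch_html('https://qq.yh31.com/zjbq/2920180.html', encoding='utf-8')
```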

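4. Setting a proxy IP in requests

1. Anonymity levels of proxy IPs

(1) Transparent: the server knows you are using a proxy and also knows your real IP.
(2) Anonymous: the server knows you are using a proxy but does not know your real IP.
(3) Elite (high anonymity): the server knows neither that you are using a proxy nor your real IP.

2. Looking up your IP

Internal (LAN) IP

Run ipconfig in cmd (Windows)

External (public) IP

1. Open IPIP.net in a browser
2. Request http://httpbin.org/ip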
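
The httpbin lookup can also be done from Python. http://httpbin.org/ip responds with a small JSON object whose origin field holds the caller's public IP. The helper name external_ip and the timeout are assumptions made for this sketch.

```python
import requests

def external_ip(proxies=None, timeout=10):
    """Return the public IP address that httpbin.org sees for this request."""
    response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.json()['origin']

# Example call (requires network access):
# print(external_ip())
```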

3. Setting a proxy in requests

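import requests

url = "http://httpbin.org/ip"
# Put the proxy address in a dict and pass it through the proxies parameter.
# The key is the scheme of the target URL; add an 'https' entry to proxy HTTPS requests as well.
proxy = {
    'http': 'http://175.7.199.79:3256'
}
# A timeout keeps the script from hanging if the proxy is no longer reachable.
res = requests.get(url, proxies=proxy, timeout=10)
print(res.text)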
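
To confirm that a proxy is actually in use, compare the IP httpbin reports with and without it. In this sketch build_proxies is a hypothetical helper; it registers the same proxy for both http and https URLs, which assumes the proxy server can handle both. The address is the sample one from above and may no longer be reachable.

```python
def build_proxies(host, port):
    """Return a proxies dict that routes both http and https URLs through one proxy."""
    address = f'http://{host}:{port}'
    return {'http': address, 'https': address}

proxies = build_proxies('175.7.199.79', 3256)
print(proxies)

# With network access and a working proxy, the two results below should differ:
# import requests
# print(requests.get('http://httpbin.org/ip', timeout=10).json())
# print(requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10).json())
```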
