Python Web Scraping: the requests Module

I. Installation

1. Standard installation

pip install requests

2. Installing from a mirror

pip install requests -i https://pypi.douban.com/simple

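II. Quick start

1. Standard form of requests.get:

requests.get(url, params=params, headers=headers, cookies=cookies)
Without query parameters, only url is required.
With query parameters, pass them as key-value pairs in the params dict. headers and cookies must be passed as keyword arguments.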
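
To see what a params dict turns into, the query string can be built by hand with the standard library. This is a minimal sketch using urllib.parse.urlencode rather than requests itself; the variable names are illustrative, and the keyword and page offset are taken from the Tieba examples in this section.

```python
from urllib.parse import urlencode

base_url = 'https://tieba.baidu.com/f'
params = {'kw': '海贼王', 'pn': '150'}

# urlencode percent-encodes each value as UTF-8 and joins the pairs with '&'
query = urlencode(params)
full_url = base_url + '?' + query
print(full_url)
```

The printed URL has the same shape as the simplified Tieba address quoted in the examples.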

(1) Keyword given in percent-encoded (URL-encoded) form

import requests
from urllib.parse import unquote

# Baidu Tieba page for the keyword 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
# requests percent-encodes every value in params, so an already-encoded
# keyword is decoded first; otherwise each '%' would be sent as '%25'.
params = {'kw': unquote('%E6%B5%B7%E8%B4%BC%E7%8E%8B'), 'pn': '150'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)      # GET request with query parameters
print(response.text)
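
One detail worth checking with this form: values in a params dict are percent-encoded when the URL is built, so a keyword that is already percent-encoded gets encoded a second time. The sketch below uses the standard library's quote, unquote, and urlencode to illustrate the effect; it does not call requests, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote, urlencode

keyword = '海贼王'
encoded = quote(keyword)           # percent-encode the UTF-8 bytes
print(encoded)

# Encoding an already-encoded value escapes every '%' as '%25'
double_encoded = urlencode({'kw': encoded})
print(double_encoded)

# Decoding first and letting the encoder run once gives the intended query
single_encoded = urlencode({'kw': unquote(encoded)})
print(single_encoded)
```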

(2) Keyword passed as plain Chinese text, without conversion

import requests
# Baidu Tieba page for the keyword 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150
# Standard form
url1 = 'https://tieba.baidu.com/f'
params = {'kw': '海贼王', 'pn': '100'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url=url1, params=params, headers=headers)      # GET request with query parameters
print(response.text)
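
The URLs listed in the comments advance pn by 50 per page (0, 50, 100, 150 for pages 1 to 4). Assuming that pattern holds for later pages, a small helper can compute the params for any page. page_params and PAGE_SIZE are hypothetical names introduced for this sketch.

```python
PAGE_SIZE = 50  # assumed from the pn values 0, 50, 100, 150 in the comments


def page_params(keyword, page):
    """Return the query parameters for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return {'kw': keyword, 'pn': str((page - 1) * PAGE_SIZE)}


for page in range(1, 5):
    print(page, page_params('海贼王', page))
```

Each dict can then be passed as params to requests.get, together with a headers dict like the one used in the examples.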

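2. Shorthand form of requests.get:

requests.get(url, headers=headers, cookies=cookies)
The query string is written directly into url, so no params dict is needed.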

(1) Keyword given in percent-encoded (URL-encoded) form

import requests
# Baidu Tieba page for the keyword 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150

# Shorthand form
url2 = 'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=250'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url2, headers=headers)      # GET request with the query string already in the URL
print(response.text)

(2) Keyword passed as plain Chinese text, without conversion

import requests
# Baidu Tieba page for the keyword 海贼王 (One Piece):
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0    page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50   page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100  page 3
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=150  page 4
# Simplified URL:
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=150

# Shorthand form; requests percent-encodes the non-ASCII characters before sending
url3 = 'https://tieba.baidu.com/f?kw=海贼王&pn=250'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3868.400 QQBrowser/10.8.4394.400'
}
response = requests.get(url3, headers=headers)      # GET request with the query string already in the URL
print(response.text)
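
The shorthand and standard forms describe the same request. As a rough illustration, the standard library can split a shorthand URL back into a base URL and a params dict; this sketch uses urllib.parse only, and split_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Split a URL into (base_url, params) with decoded parameter values."""
    parts = urlsplit(url)
    base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'
    params = dict(parse_qsl(parts.query))  # parse_qsl undoes the percent-encoding
    return base_url, params


base_url, params = split_url(
    'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=250'
)
print(base_url)
print(params)
```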

3. Fixing garbled text (mojibake) in the output

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}

response = requests.get(url, headers=headers)      # GET request with custom headers (headers must be a keyword argument)
print(response.text)
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表æƒ
,可爱的懒羊羊搞笑图片_第1页_表æƒ
(remainder omitted)
'''

Here the printed result contains garbled characters.

Fix: replace response.text with response.content.decode('utf-8').

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}

response = requests.get(url, headers=headers)      # GET request with custom headers (headers must be a keyword argument)
print(response.content.decode('utf-8'))
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情,可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情,可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情,可爱的懒羊羊搞笑图片,欢迎点击,免费下载图片" />
<script  type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
(remainder omitted)
'''

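Cause:
response.text returns a str. requests produces it by decoding response.content. The caller has not specified the encoding, so requests infers it, mainly from the charset in the Content-Type response header; when a text response declares no charset there, requests falls back to ISO-8859-1. The inference is not always right, which is how the garbled text appears.
response.content returns bytes. It is the raw body exactly as received from the site, with no decoding applied.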
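
The mismatch can be reproduced without any network access. The sketch below encodes a short string as UTF-8, the charset the page declares in its meta tag, and then decodes the bytes with ISO-8859-1, which is assumed here as the wrong guess; the variable names are illustrative.

```python
original = '表情'
raw = original.encode('utf-8')       # plays the role of response.content

wrong = raw.decode('iso-8859-1')     # wrong codec: every byte becomes one character
right = raw.decode('utf-8')          # correct codec restores the text

print(len(raw), len(wrong), len(right))
print(right == original)

# The damage is reversible as long as the wrongly decoded string is intact
repaired = wrong.encode('iso-8859-1').decode('utf-8')
print(repaired == original)
```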

Solutions:
Method 1. Print response.content.decode('utf-8') instead of response.text.
Method 2. Set response.encoding before printing response.text:

  response.encoding = 'utf-8'
  print(response.text)

Complete example of method 2:

import requests
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/\
537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/\
1.70.3868.400 QQBrowser/10.8.4394.400'
}

response = requests.get(url, headers=headers)      # GET request with custom headers (headers must be a keyword argument)
response.encoding = 'utf-8'
print(response.text)
'''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>喜羊羊QQ表情,可爱的懒羊羊搞笑图片_第1页_表情党</title>
<meta name="keywords" content="免费下载喜羊羊QQ表情,可爱的懒羊羊搞笑图片" />
<meta name="description" content="本站有大量喜羊羊QQ表情,可爱的懒羊羊搞笑图片,欢迎点击,免费下载图片" />
<script  type="text/javascript" src='/js/hodrduct.js'></script>
<link href="https://qq.yh31.com/css/qq.css" rel="stylesheet" type="text/css" />
<link href="https://qq.yh31.com/css/zt.css" rel="stylesheet" type="text/css" />
<script src="https://dup.baidustatic.com/js/ds.js"></script>
</head>
(remainder omitted)
'''
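
Hard-coding 'utf-8' works for this page because its meta tag declares charset=utf-8. As a sketch of a slightly more general approach, the charset can be read from the raw bytes before decoding. This helper is separate from requests; sniff_meta_charset and its regular expression are assumptions made for illustration and cover only simple meta tags.

```python
import re

# Matches charset=utf-8 or charset="utf-8" near the start of a page
_CHARSET_RE = re.compile(rb'charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def sniff_meta_charset(body, default='utf-8'):
    """Return the charset named in a meta tag, or default if none is found."""
    match = _CHARSET_RE.search(body[:2048])
    return match.group(1).decode('ascii').lower() if match else default


sample = ('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
          '<title>喜羊羊QQ表情</title>').encode('utf-8')
encoding = sniff_meta_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

With a real response, the result could be assigned to response.encoding before reading response.text.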

4. Setting a proxy IP in requests

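1. Anonymity levels of proxy IPs

(1) Transparent: the server knows you are using a proxy and also knows your real IP.
(2) Anonymous: the server knows you are using a proxy but does not know your real IP.
(3) Elite (high anonymity): the server knows neither that you are using a proxy nor your real IP.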

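2. Looking up your IP

  1. Internal (LAN) IP

On Windows, open cmd and run ipconfig.

  2. External (public) IP

1. Open IPIP.net in a browser.
2. Request http://httpbin.org/ip.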
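
httpbin's /ip endpoint answers with a small JSON object whose origin field holds the address the server saw. A minimal sketch for reading it follows; parse_origin and fetch_public_ip are hypothetical helper names, and the sample body uses a documentation-only address rather than real output.

```python
import json


def parse_origin(body):
    """Extract the caller's IP from an httpbin /ip response body."""
    return json.loads(body)['origin']


def fetch_public_ip(url='http://httpbin.org/ip', timeout=10):
    """Ask httpbin which address the request came from (needs network access)."""
    import requests  # imported here so parse_origin works without requests installed
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_origin(response.text)


# Offline check with a made-up body; 203.0.113.7 is a documentation-only address
print(parse_origin('{"origin": "203.0.113.7"}'))
```

Calling fetch_public_ip() with and without a proxy is one way to check whether the proxy is actually in use.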

3. Setting the proxy in requests

import requests

url = "http://httpbin.org/ip"
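# The proxy address goes in a dict that is passed through the proxies parameter.
# The key is the scheme of the target URL; the value is the proxy URL, scheme included.
proxy = {
    'http': 'http://175.7.199.79:3256'
}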
res = requests.get(url,proxies=proxy)
print(res.text)
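
The dict above only covers http:// URLs; requests picks the proxy entry by the scheme of the target URL, so https:// requests need their own key. Below is a small sketch that builds both entries, assuming the same proxy serves both schemes. build_proxies is a hypothetical helper, and the address is the one from the example, which may not be reachable.

```python
def build_proxies(host, port, scheme='http'):
    """Return a proxies dict that routes both http and https traffic."""
    proxy_url = f'{scheme}://{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('175.7.199.79', 3256)
print(proxies)

# Usage (needs network access and a working proxy):
# import requests
# res = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
# print(res.text)
```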