Web Scraping Network Requests: Using the urllib and requests Libraries

大葱一根

Last modified 2022-01-29 19:32:16


First published 2022-01-29 19:28:09

Article link: https://blog.csdn.net/qq_45126531/article/details/122720165


Contents

1. The urllib library (Python built-in)
2. The requests library (third-party)

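1. The urllib library (Python built-in)

The urlopen function

urlopen creates a file-like object for a remote URL; you then read from that object as you would from a local file to get the remote data.

url: the URL to request
data: the request body as bytes; if it is set, the request is sent as a POST instead of a GET
Return value: a file-like response object (an http.client.HTTPResponse for HTTP and HTTPS URLs)

from urllib import request
resp = request.urlopen('https://www.sogou.com/')
# print(resp.read())       # read the whole body (bytes)
# print(resp.readline())   # read one line
# print(resp.readlines())  # read all lines into a list
# print(resp.read(10))     # read 10 bytes
print(resp.getcode())  # HTTP status code, e.g. 200 (resp.status is the newer equivalent)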
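The effect of the data parameter can be checked without sending anything. The sketch below builds two Request objects and asks each which HTTP method it would use. The httpbin.org URLs and the form fields are assumptions chosen for illustration, not part of the original example.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
form = {'name': 'test', 'page': 1}
body = parse.urlencode(form).encode('utf-8')  # data must be bytes, not str

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# Actually sending the request needs network access:
# with request.urlopen(post_req, timeout=10) as resp:
#     print(resp.status)
```

Passing the same bytes as urlopen(url, data=body) has the same effect as wrapping them in a Request first.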

The urlretrieve function

Downloads a resource and saves it to a local file:
request.urlretrieve(url, 'filename')

Code example:

from urllib import request
# request.urlretrieve('https://www.sogou.com/', 'sougou.html')  # download a web page
# Download an image
request.urlretrieve('https://bkimg.cdn.bcebos.com/smart/0b46f21fbe096b63f1675b2903338744ebf8ac48-bkimg-process,v_1,rw_16,rh_9,maxl_640,pad_1?x-bce-process=image/format,f_auto', 'yangzi.jpg')
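urlretrieve returns a (filename, headers) tuple, which is easy to overlook. The sketch below shows this without network access by retrieving a local file through a file:// URL; the temporary file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote page: a small local file.
    source = Path(tmp) / 'source.html'
    source.write_text('<h1>hello</h1>', encoding='utf-8')

    target = Path(tmp) / 'copy.html'
    # as_uri() turns the absolute local path into a file:// URL.
    saved_path, headers = request.urlretrieve(source.as_uri(), str(target))
    copied = target.read_text(encoding='utf-8')

print(copied)
```

With an http or https URL the call looks the same; only the first argument changes.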

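The urlencode function: encoding

urlencode converts a dictionary into a URL-encoded query string.

The parse_qs function: decoding

parse_qs decodes a URL-encoded query string back into a dictionary whose values are lists.

Code example:

from urllib import request
from urllib import parse
data = {'wd': '杨紫'}  # a dictionary
qs = parse.urlencode(data)  # URL-encode the dictionary
print(qs)
# Output: wd=%E6%9D%A8%E7%B4%AB
print(parse.parse_qs(qs))  # decode
# Output: {'wd': ['杨紫']}
url = 'https://www.baidu.com/s?ie=utf-8&' + qs
resp = request.urlopen(url)
print(resp.read())
# Prints the page content; because of Baidu's anti-scraping measures it may differ from what a browser shows
# Extra: encoding a plain string
a = '杨紫'
b = parse.quote(a)
print(b)
# Output: %E6%9D%A8%E7%B4%AB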
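A few neighbouring helpers in urllib.parse are worth knowing alongside urlencode and parse_qs. The sketch below is a small tour of them; the 'tag' key and the 'a b/c' string are made-up inputs.

```python
from urllib import parse

# quote leaves '/' alone by default; quote_plus encodes it and turns spaces into '+'.
print(parse.quote('a b/c'))       # a%20b/c
print(parse.quote_plus('a b/c'))  # a+b%2Fc

# unquote reverses quote.
print(parse.unquote('%E6%9D%A8%E7%B4%AB'))  # 杨紫

# doseq=True expands a list value into repeated keys.
qs = parse.urlencode({'wd': '杨紫', 'tag': ['a', 'b']}, doseq=True)
print(qs)  # wd=%E6%9D%A8%E7%B4%AB&tag=a&tag=b

# parse_qsl returns (key, value) pairs instead of a dict of lists.
pairs = parse.parse_qsl(qs)
print(pairs)  # [('wd', '杨紫'), ('tag', 'a'), ('tag', 'b')]
```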

The urlparse and urlsplit functions: parsing a URL

The result of urlparse has a params attribute; the result of urlsplit does not, so any ;params part stays inside path.

from urllib import parse
url = 'http://www.baidu.com/index.html;user?id=S#comment'
result1 = parse.urlparse(url)  # parse the URL
result = parse.urlsplit(url)
print(result1)
# Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=S', fragment='comment')
print(result)
# Output: SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=S', fragment='comment')
print(result.path)
# Output: /index.html;user
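Because the parse results are named tuples, a URL can be taken apart, modified and put back together. A minimal sketch, reusing the URL above and a made-up replacement query:

```python
from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'
parts = parse.urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.baidu.com
print(parts.query)     # id=S

# _replace returns a modified copy; geturl() reassembles the URL.
new_query = parse.urlencode({'id': 'T', 'page': 2})  # hypothetical new parameters
rebuilt = parts._replace(query=new_query).geturl()
print(rebuilt)  # http://www.baidu.com/index.html;user?id=T&page=2#comment
```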

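The request.Request class: adding request headers

Steps: copy a User-Agent header from the browser, attach it with the request.Request class, then open the page.

from urllib import request
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
rq = request.Request('https://www.baidu.com/', headers=header)  # send a browser-like User-Agent
resp = request.urlopen(rq)
print(resp.read())

Purpose: with a browser User-Agent header, the request looks like it comes from a browser rather than from a script.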
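Before sending a request it can help to check which headers it actually carries. The sketch below inspects a Request object offline; the shortened User-Agent string and the Referer value are assumptions for illustration.

```python
from urllib import request

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # shortened UA string for illustration
rq = request.Request('https://www.baidu.com/', headers={'User-Agent': ua})

# Request stores header names capitalized, so the key becomes 'User-agent'.
print(rq.has_header('User-agent'))        # True
print(rq.get_header('User-agent') == ua)  # True

# Headers can also be added after construction.
rq.add_header('Referer', 'https://www.baidu.com/')
header_names = sorted(name for name, _ in rq.header_items())
print(header_names)  # ['Referer', 'User-agent']
```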

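The ProxyHandler handler (proxy settings): dealing with IP bans
1. How a proxy works
We send the request to the proxy server, the proxy server requests the target server, and then it forwards the response back to us.
2. http://httpbin.org/ip: this site makes it easy to check which IP a request comes from.
3. Example of using a proxy in code:

from urllib import request
# Without a proxy
# url = 'http://httpbin.org/ip'
# resp = request.urlopen(url)
# print(resp.read())
# With a proxy
# Steps
url = 'http://httpbin.org/ip'
# 1. Pass the proxy to ProxyHandler to build a handler
handler = request.ProxyHandler({'http': 'proxy_ip:port'})  # placeholder: fill in a real proxy address
# The dictionary key is the URL scheme: 'http' or 'https'
# 2. Build an opener from the handler created above
opener = request.build_opener(handler)
# 3. Send the request through the opener
resp = opener.open(url)
print(resp.read())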
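Free proxies fail often, so in practice the three steps above are usually wrapped with a timeout and error handling. The following is a minimal sketch under that assumption; fetch_via_proxy is a hypothetical helper name, and 127.0.0.1:9 is a placeholder address where no proxy is expected to be listening, so the call below demonstrates the failure path.

```python
from urllib import error, request


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch url through an HTTP proxy; return the body, or None on failure."""
    handler = request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    except (error.URLError, OSError) as exc:
        # Report the failure and let the caller try the next proxy.
        print(f'proxy {proxy} failed: {exc}')
        return None


# Placeholder proxy address: nothing is expected to listen on this port.
result = fetch_via_proxy('http://httpbin.org/ip', 'http://127.0.0.1:9', timeout=3)
print(result)
```

Returning None instead of raising keeps a loop over a list of candidate proxies simple.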

Common sources of proxy IPs (websites):
1. 西刺免费代理ip (Xici free proxy IPs)
2. 快代理 (Kuaidaili)
3. 代理云 (Dailiyun)

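cookie

Purpose: data that a website stores on the user's local device so that it can identify the user and track the session.
Format: Set-Cookie: NAME=VALUE; Expires=DATE (or Max-Age=SECONDS); Path=PATH; Domain=DOMAIN_NAME; Secure
Meaning of the parameters:
NAME: the cookie's name
VALUE: the cookie's value
Expires: when the cookie expires
Path: the path the cookie applies to
Domain: the domain the cookie applies to
Secure: whether the cookie is sent only over HTTPS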
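The fields listed above can be pulled out of a header value with the standard library's http.cookies.SimpleCookie. The header value below is made up for illustration.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie value (without the 'Set-Cookie:' prefix).
raw = 'sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure'

cookie = SimpleCookie()
cookie.load(raw)

morsel = cookie['sessionid']
print(morsel.value)       # abc123
print(morsel['max-age'])  # 3600
print(morsel['path'])     # /
print(morsel['domain'])   # example.com
print(morsel['secure'])   # True
```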

Simulating a login in a scraper

There are two approaches.
The first is to log in through the browser, copy the cookie into the request headers, and build the request with the Request class.
Code example:

from urllib import request
url = 'https://www.zhihu.com/hot'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'Cookie': 'your cookie'
}
rq = request.Request(url, headers=headers)
resp = request.urlopen(rq)
print(resp.read().decode('utf-8'))

The second approach
It has two steps: first log in, then visit the page.
Logging in with CookieJar takes three steps: create a CookieJar object, create an HTTPCookieProcessor object, and create an opener. Then URL-encode the account and password and send them as POST data with request.Request.
To visit the page, build a Request with the request headers and open it with the same opener, which sends the cookies saved at login.
Code example:

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

# Personal page: https://i.meishi.cc/cook.php?id=13686422

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

# 1. Log in
# 1.1 Create a CookieJar object
cookiejar = CookieJar()
# 1.2 Use the CookieJar to create an HTTPCookieProcessor object
handler = request.HTTPCookieProcessor(cookiejar)
# 1.3 Use the handler from the previous step to create an opener
opener = request.build_opener(handler)
# 1.4 Send the login request (account and password) through the opener

post_url = 'https://i.meishi.cc/login.php?redirect=https%3A%2F%2Fwww.meishij.net%2F'
post_data = parse.urlencode({
    'username': 'your username',
    'password': 'your password'
})
req = request.Request(post_url, data=post_data.encode('utf-8'), headers=headers)
opener.open(req)


# 2. Visit the personal page
url = 'https://i.meishi.cc/cook.php?id=13686422'
rq = request.Request(url, headers=headers)
resp = opener.open(rq)
print(resp.read().decode('utf-8'))

Saving and loading cookies

from urllib import request
from http.cookiejar import MozillaCookieJar
# Save
'''
cookiejar = MozillaCookieJar('cookie.txt')
# The same three steps
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
resp = opener.open('https://httpbin.org/cookies/set/math/123')
# Save; ignore_discard=True also writes session cookies, which are skipped by default
cookiejar.save(ignore_discard=True, ignore_expires=True)
'''
# Load
cookiejar = MozillaCookieJar('cookie.txt')
# ignore_discard=True is needed here too, otherwise session cookies are skipped on load
cookiejar.load(ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
resp = opener.open('https://httpbin.org/cookies/set/math/123')

for i in cookiejar:
    print(i)
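The role of ignore_discard can be seen offline with a round trip through a temporary file. The sketch below builds a session cookie by hand instead of fetching one; its field values are assumptions meant to mimic the cookie that the httpbin URL above would set.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

# A hand-built session cookie: expires=None and discard=True.
session_cookie = Cookie(
    version=0, name='math', value='123',
    port=None, port_specified=False,
    domain='httpbin.org', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')

    jar = MozillaCookieJar(path)
    jar.set_cookie(session_cookie)
    jar.save(ignore_discard=True)  # without this flag the session cookie is not written

    strict = MozillaCookieJar(path)
    strict.load()  # default settings: session cookies are skipped
    lenient = MozillaCookieJar(path)
    lenient.load(ignore_discard=True)

strict_count = len(strict)
loaded = [(c.name, c.value) for c in lenient]
print(strict_count)  # 0
print(loaded)        # [('math', '123')]
```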

2. The requests library (third-party)

Using the get method

The params argument carries the query data

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'}
kw = {'wd': '中国'}
# params accepts a dictionary or a string; a dictionary is URL-encoded automatically, so urlencode() is not needed
resp = requests.get('https://www.baidu.com/s', headers=headers, params=kw)
print(resp)
# Attributes
'''
# Inspect the response body
print(resp.text)     # body decoded to str (Unicode text)
print(resp.content)  # body as raw bytes
# If the text comes out garbled, decode the bytes manually
print(resp.content.decode('utf-8'))
'''
print(resp.url)
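Using the post method

import requests
url = 'the form URL, found by inspecting the page source'
headers = {'User-Agent': 'your User-Agent'}
data = {
    'form field name': 'form field value'
}
resp = requests.post(url, headers=headers, data=data)
print(resp.text)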
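As a concrete instance of the template above, the sketch below prepares a POST without sending it, so the encoded body and content type can be inspected. It assumes requests is installed; the httpbin.org URL and the form fields are placeholders, not a real login form.

```python
import requests

# Placeholder URL and form fields, for illustration only.
form = {'username': 'alice', 'password': 'secret'}
prepared = requests.Request('POST', 'https://httpbin.org/post', data=form).prepare()

print(prepared.method)                   # POST
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # username=alice&password=secret
```

A dictionary passed as data is sent as a URL-encoded form, which is what most login forms expect.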

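Proxy IPs

Only one argument is needed (proxies)

import requests
# The proxies key is matched against the URL scheme, so an http URL is used here;
# for an https URL, add an 'https' entry to the dictionary as well.
url = 'http://httpbin.org/ip'
proxy = {
    'http': '111.59.199.58:8118'
}
resp = requests.get(url, proxies=proxy)
print(resp.text)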

Simulating a login in a scraper

The first approach
Steps: URL, request headers, cookie

import requests
# Example 1: log in with a cookie
url = 'https://www.zhihu.com/hot'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'Cookie': 'your cookie'
}
rq = requests.get(url, headers=headers)
print(rq.text)

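The second approach
Log in with an account and password: log in first, then visit the page.
With requests, cookies can be shared between requests by using the Session object that the library provides.

# Example 2
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
post_data = {
    'username': 'your username',
    'password': 'your password'
}
post_url = 'https://i.meishi.cc/login.php?redirect=https%3A%2F%2Fwww.meishij.net%2F'
# Log in with a session object
session = requests.Session()
session.post(post_url, headers=headers, data=post_data)


# Visit the personal page; the session sends the cookies saved at login
url = 'https://i.meishi.cc/cook.php?id=13686422'
req = session.get(url, headers=headers)
print(req.text)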
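How a session carries cookies from one request to the next can be observed offline. In the sketch below a cookie is placed in the session by hand, standing in for one a login response would set, and a later request is prepared but not sent. It assumes requests is installed; example.com and the cookie name are placeholders.

```python
import requests

session = requests.Session()
# Stand-in for a cookie that a login response would set.
session.cookies.set('token', 'abc', domain='example.com', path='/')

# Prepare (but do not send) a later request made through the same session.
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com/profile')
)
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)  # token=abc
```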

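Handling untrusted SSL certificates:
Use the verify argument

import requests
url = ''  # fill in the URL of the site whose certificate is not trusted
resp = requests.get(url, verify=False)  # verify=False skips certificate verification
print(resp.content.decode('utf-8'))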
