Python Web Scraping 02: Request Modules, the requests Library, and JSON Data

烈风回响

Category: Python web scraping. Tags: python

Article link: https://blog.csdn.net/LonelyDragons/article/details/107618472

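This article covers Python's urllib.request module in detail, including its methods, the response object, and how to deal with anti-scraping measures. It also compares urllib with the requests library, focusing on installing requests, its common methods, the response object, POST requests, proxy settings, and the use of cookies and sessions, and it discusses how to handle JSON data. Several hands-on examples help the reader understand network requests and data processing.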

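1. The urllib.request module

First, download an image using the methods covered earlier.

Method 1

Create a file named 爬取图片.py, then open the image in a new browser tab. The address shown in that tab is the image's URL.

import requests

# URL of the image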
url='https://ss0.bdstatic.com/94oJfD_bAAcT8t7mm9GUKT-xh_/timg?image&quality=100&size=b4000_4000&sec=1596188551&di=3c06d3cd21c706c6c9c562f2bf76f56e&src=http://a3.att.hudong.com/14/75/01300000164186121366756803686.jpg'

req=requests.get(url)

fn=open('code.png','wb')
fn.write(req.content)

fn.close()

After running it, the image is saved as code.png.

Method 2

import requests

url='https://ss0.bdstatic.com/94oJfD_bAAcT8t7mm9GUKT-xh_/timg?image&quality=100&size=b4000_4000&sec=1596188551&di=3c06d3cd21c706c6c9c562f2bf76f56e&src=http://a3.att.hudong.com/14/75/01300000164186121366756803686.jpg'

req=requests.get(url)
#
# fn=open('code.png','wb')
# fn.write(req.content)
#
# fn.close()

with open('code2.png','wb') as f:
    f.write(req.content)

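The only change in Method 2 is the with statement, which closes the file automatically when the block ends. The sketch below illustrates this without a network request; the helper name save_bytes, the placeholder bytes standing in for req.content, and the temporary directory are assumptions made for the example.

```python
import os
import tempfile


def save_bytes(path, data):
    """Write data to path in binary mode and return the (closed) file object."""
    with open(path, 'wb') as f:
        f.write(data)
    # Leaving the with block has already closed the file
    return f


# Placeholder bytes standing in for req.content (not a real image)
fake_image = b'\x89PNG\r\n\x1a\n'
target = os.path.join(tempfile.mkdtemp(), 'code2.png')

f = save_bytes(target, fake_image)
print(f.closed)                  # the file was closed without calling close()
print(os.path.getsize(target))   # number of bytes written
```

With the explicit open()/close() pair from Method 1, an exception raised between the two calls would leave the file open; the with form avoids that.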

Method 3

import requests

from urllib import request

url='https://ss0.bdstatic.com/94oJfD_bAAcT8t7mm9GUKT-xh_/timg?image&quality=100&size=b4000_4000&sec=1596188551&di=3c06d3cd21c706c6c9c562f2bf76f56e&src=http://a3.att.hudong.com/14/75/01300000164186121366756803686.jpg'

request.urlretrieve(url,'code3.png')
# req=requests.get(url)
#
# fn=open('code.png','wb')
# fn.write(req.content)  # content is the raw binary data
#
# fn.close()
#
# with open('code2.png','wb') as f:
#     f.write(req.content)

Running it saves the image as code3.png.

1.1 Versions

Python 2: urllib2 and urllib
Python 3: urllib and urllib2 were merged into the urllib package; requests are made through urllib.request

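1.2 Common methods

• urllib.request.urlopen("URL"): sends a request to the site and returns the response
• bytes = response.read()
• string = response.read().decode("utf-8")
• urllib.request.Request("URL", headers=dict): builds a request object that carries custom headers; needed because urlopen() does not accept a headers argument and so cannot set the User-Agent itself

Hands-on

Create a new Python file: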

import urllib.request


# Send a request to Baidu and store the response in a variable

response=urllib.request.urlopen('https://www.baidu.com/')

print(response)


<http.client.HTTPResponse object at 0x0000000002504E20>
The result is a response object.

import urllib.request


# Send a request to Baidu and store the response in a variable

response=urllib.request.urlopen('https://www.baidu.com/')

# Read the data from the response object with read()
print(response.read())

b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'
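The response should contain the page source, but what came back is far too short.

The page source seen in a browser is much longer. The result above is so short because Baidu applies anti-scraping measures here.
Let's switch to another site for now and come back to this problem shortly.

Note also that the output is not plain text: the leading b marks it as bytes data.
We can check its type: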

import urllib.request


# Send a request to the site and store the response in a variable

response=urllib.request.urlopen('https://qq.yh31.com/')

# Read the data from the response object with read()
# print(response.read())
html=response.read()
print(type(html))

<class 'bytes'>

The type is bytes, but what we want is a string.
To convert bytes to a string, decode them with decode():

import urllib.request


# Send a request to the site and store the response in a variable

response=urllib.request.urlopen('https://qq.yh31.com/')

# Read the data from the response object with read()
# print(response.read())
html=response.read().decode('utf-8')

print(html)

The output is now readable text (the full source is too long to show here).
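The bytes/str conversion can also be tried without any network request. The following is a minimal sketch using a short string from this article instead of a downloaded page; the variable names are chosen for illustration only.

```python
raw = '海贼王'.encode('utf-8')    # str -> bytes
print(type(raw), raw)

text = raw.decode('utf-8')        # bytes -> str
print(type(text), text)

# Decoding with a codec that does not match the data raises an error
try:
    raw.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print('ascii decode failed:', decode_failed)
```

The same rule applies to response.read(): the codec passed to decode() has to match the encoding the page was actually served in, which is utf-8 for the pages used here.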

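Now back to Baidu's anti-scraping check. The likely cause is that we did not send any request headers, so let's try passing a headers dict: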

import urllib.request


# Send a request to Baidu and store the response in a variable
headers = {
   
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

response=urllib.request.urlopen('https://www.baidu.com/',headers=headers)

# Read the data from the response object with read()
# print(response.read())
html=response.read().decode('utf-8')

print(html)


TypeError: urlopen() got an unexpected keyword argument 'headers'

The call fails because urlopen() does not accept a headers argument.
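This behaviour can be checked offline, because the TypeError is raised while Python binds the arguments, before any connection is opened. The sketch below assumes a shortened, made-up User-Agent string; it also shows that a Request object accepts the same dict. Request stores header names with only the first letter capitalised, which is why the lookup uses 'User-agent'.

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (example)'}  # made-up UA for illustration

# urlopen() rejects the keyword before doing any network work
try:
    urllib.request.urlopen('https://www.baidu.com/', headers=headers)
    error_message = ''
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# A Request object does accept headers; nothing is sent until urlopen(req)
req = urllib.request.Request('https://www.baidu.com/', headers=headers)
print(req.get_header('User-agent'))
```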

Final version

• urllib.request.Request("URL", headers=dict)

import urllib.request


# Request Baidu's data (the page source)
url='https://www.baidu.com/'

headers = {
   
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# 1. Create the request object (sets the User-Agent)
req=urllib.request.Request(url,headers=headers)

# 2. Get the response object with urlopen()
res=urllib.request.urlopen(req)

# 3. Read the response body: read().decode('utf-8')

html=res.read().decode('utf-8')
print(html)

The output now matches the Baidu page source shown in Chrome.

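1.3 The response object

• read(): reads the content returned by the server (the result is bytes, so .decode() is needed afterwards)
• getcode(): returns the HTTP status code
• geturl(): returns the URL the data actually came from (useful when the request was redirected)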

import urllib.request


# Request Baidu's data (the page source)
url='https://www.baidu.com/'

headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# 1. Create the request object (sets the User-Agent)
req=urllib.request.Request(url,headers=headers)


# 2. Get the response object with urlopen()
res=urllib.request.urlopen(req)

# 3. Read the response body: read().decode('utf-8')

html=res.read().decode('utf-8')
# print(html)
print(res.getcode())  # HTTP status code

print(res.geturl())  # URL the data actually came from


200
https://www.baidu.com/

urllib is the request module that ships with Python; requests is a third-party request module.

2. The urllib.parse module

2.1 Common methods

• urlencode(dict): takes a dict
• quote(string): takes a string

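Hands-on

First, search Baidu for 海贼王 (One Piece).
Because this is a GET request, the query parameters appear in the URL.

Ignore the trailing parameters for now and keep only the key part: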

url='https://www.baidu.com/s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B'

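The characters 海贼王 are gone from the URL, replaced by %E6%B5%B7%E8%B4%BC%E7%8E%8B.
When we type a Chinese keyword such as 海贼王 into Baidu's search box, it has to be sent to the server as part of the URL, but a URL may only contain ASCII characters.
The browser therefore encodes the text as UTF-8 and writes each byte as a percent sign followed by two hexadecimal digits, which is how 海贼王 becomes %E6%B5%B7%E8%B4%BC%E7%8E%8B.
If a program builds the URL by string concatenation and appends the raw keyword (for example 美女), the request may fail, so we need to encode the Chinese text ourselves.
The urlencode() function in the urllib.parse module does this.

We can check that %E6%B5%B7%E8%B4%BC%E7%8E%8B really is 海贼王.
Each of these Chinese characters takes three bytes in UTF-8, so three %XX groups make up one character:
%E6%B5%B7 is 海, %E8%B4%BC is 贼, and %E7%8E%8B is 王.
An online URL decoding tool confirms this:

https://tool.chinaz.com/tools/urlencode.aspx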
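The same check can be done in Python instead of an online tool. This is a small sketch using quote() and unquote() from urllib.parse; it prints the UTF-8 byte count and the percent-encoded form of each character, then round-trips the whole word.

```python
from urllib.parse import quote, unquote

word = '海贼王'

# Each character: how many UTF-8 bytes it takes and its percent-encoded form
per_char = [(ch, len(ch.encode('utf-8')), quote(ch)) for ch in word]
for ch, size, encoded_ch in per_char:
    print(ch, size, encoded_ch)

encoded = quote(word)      # whole word, percent-encoded
decoded = unquote(encoded) # back to the original text
print(encoded)
print(decoded)
```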


import urllib.parse

# url='https://www.baidu.com/s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B'


r={
   'wd':'海贼王'}  # this is a dict; passing a set (values without keys) raises an error


result=urllib.parse.urlencode(r)

print(result)



wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B

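Exercise 1

# Import the modules
import urllib.parse

import urllib.request


# Search Baidu for some text, e.g. 美女, and save the result to a local file, 美女.html

# baseurl: the initial URL
baseurl='https://www.baidu.com/s?'  # the s? here matters and must not be dropped

content=input('Enter your search term:')

wd={
   'wd':content}  # the variable name is arbitrary (d={'wd':content} works too), but the key must be wd

content=urllib.parse.urlencode(wd)

# Build the full URL
url=baseurl+content

print(url)

Enter your search term:美女  # press Enter; any search term works here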
https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

Exercise 1, final version

# Import the modules
import urllib.parse

import urllib.request


# Search Baidu for some text, e.g. 美女, and save the result to a local file, 美女.html

# baseurl: the initial URL
baseurl='https://www.baidu.com/s?'  # the s? here matters and must not be dropped

content=input('Enter your search term:')

wd={
   'wd':content}  # the variable name is arbitrary (d={'wd':content} works too), but the key must be wd

content=urllib.parse.urlencode(wd)

# Build the full URL
url=baseurl+content

headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

# Create the request object
req=urllib.request.Request(url,headers=headers)

# Get the response object
res=urllib.request.urlopen(req)
# Read and decode the body
html=res.read().decode('utf-8')

# Save to a file
with open('美女.html','w',encoding='utf-8') as f:
    f.write(html)

Summary

To get past some anti-scraping checks, add headers such as User-Agent (UA), Referer, and Cookie to the request.
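As a rough illustration of that summary, the sketch below builds a request that carries all three headers. The helper name build_request, the search URL, and the Referer and Cookie values are hypothetical placeholders; a real cookie would be copied from the browser's developer tools. Nothing is sent over the network here.

```python
import urllib.request


def build_request(url, user_agent, referer=None, cookie=None):
    """Return a Request carrying User-Agent and, optionally, Referer and Cookie."""
    headers = {'User-Agent': user_agent}
    if referer:
        headers['Referer'] = referer
    if cookie:
        headers['Cookie'] = cookie
    return urllib.request.Request(url, headers=headers)


req = build_request(
    'https://www.baidu.com/s?wd=python',
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    referer='https://www.baidu.com/',   # placeholder value
    cookie='name=value',                # placeholder value, not a real cookie
)

# header_items() lists the headers that would be sent with the request
sent = dict(req.header_items())
print(sent)
```

Passing req to urllib.request.urlopen() would then send the request with these headers attached.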

2.2 Supplement: two common methods in the urllib.parse module

• urlencode(dict): takes a dict
• quote(string): takes a string

urlencode(dict)

import urllib.parse

baseurl='https://www.baidu.com/s?'


r={
   'wd':'海贼王'}

result=urllib.parse.urlencode(r)

url=baseurl+result
print(url)


https://www.baidu.com/s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B

quote(string)

import urllib.parse

key=input('Enter text:')

baseurl='https://www.baidu.com/s?wd='

r=urllib.parse.quote(key)

url=baseurl+r

print(url)


Enter text:姑娘
https://www.baidu.com/s?wd=%E5%A7%91%E5%A8%98

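The practical difference is that urlencode() takes a dict and produces the whole key=value pair, while quote() encodes only the string itself, so the wd= has to be written into baseurl by hand.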
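A side-by-side comparison makes the difference concrete. In this sketch the second parameter pn and the English phrase 'one piece' are additions made for illustration; they show that urlencode() joins several pairs with & and encodes a space as +, whereas quote() encodes a space as %20.

```python
from urllib.parse import urlencode, quote, parse_qs

# urlencode(): dict in, complete query string out
query = urlencode({'wd': '海贼王', 'pn': 10})   # pn is an illustrative extra parameter
print(query)

# quote(): string in, encoded string out; the key is added by hand
manual = 'wd=' + quote('海贼王')
print(manual)

# parse_qs() reverses urlencode(); every value comes back as a list of strings
parsed = parse_qs(query)
print(parsed)

# The two functions treat spaces differently
space_urlencode = urlencode({'wd': 'one piece'})
space_quote = quote('one piece')
print(space_urlencode, space_quote)
```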

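Exercise 1: scraping Baidu Tieba

Requirements:
1. Enter the name of the Tieba forum to scrape
2. Enter the first and last page to scrape
3. Save the content of each page to a local file
User-Agent strings for the request headers can be found online.

Analysis
https://tieba.baidu.com/f?kw=%E4%B8%AD%E5%9B%BD&ie=utf-8&pn=0 page 1
https://tieba.baidu.com/f?kw=%E4%B8%AD%E5%9B%BD&ie=utf-8&pn=50 page 2
https://tieba.baidu.com/f?kw=%E4%B8%AD%E5%9B%BD&ie=utf-8&pn=100 page 3
pn = (current page - 1) * 50
kw is the name of the forum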
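The page arithmetic from the analysis can be isolated in a small function and checked against the three URLs listed there. This is only a sketch: the function name build_tieba_url and the constant names are made up for the example, and the page size of 50 is taken from the URL pattern observed above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f?'
POSTS_PER_PAGE = 50  # step between pages, as observed in the URLs above


def build_tieba_url(name, page):
    """Return the list URL for the given forum name and 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    pn = (page - 1) * POSTS_PER_PAGE
    return BASE_URL + urlencode({'kw': name, 'ie': 'utf-8', 'pn': pn})


urls = [build_tieba_url('中国', page) for page in range(1, 4)]
for url in urls:
    print(url)
```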

import random
import urllib.request
import urllib.parse
# Pick a User-Agent at random


headers_list = [{
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'},{
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'},{
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}]

headers=random.choice(headers_list)

name=input('Enter the forum name:')

start=int(input('Enter the first page:'))
end=int(input('Enter the last page:'))

# URL-encode the forum name
kw={
   'kw':'%s'%name}

kw=urllib.parse.urlencode(kw)

# Build each URL, send the request, and read the response
for i in range(start,end+1):
    # Build the URL for page i

    pn=(i-1)*50
    baseurl='https://tieba.baidu.com/f?'

    url=baseurl+kw+'&pn='+str(pn)  # pn must be converted to str for string concatenation
    # Create the request object
    req=urllib.request.Request(url,headers=headers)
    # Get the response object
    res=urllib.request.urlopen(req)
    # Read and decode the body
    html=res.read().decode('utf-8')
    # Write to a file, e.g. 第1页中国贴吧.html
    filename='第'+str(i)+'页%s贴吧.html'%name
    with open(filename,'w',encoding='utf-8') as f:
        print('Scraping page %d'%i)
        f.write(html)

Exercise 2: scraping Baidu Tieba (using functions)

import urllib.request
import urllib.parse

# Read a page
def readPage(url):
    headers={
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.