Using the urllib Library for Python Web Scraping

！小白菜！y

Last modified 2022-06-08 10:54:44

First published 2022-05-12 13:36:51

Article link: https://blog.csdn.net/qq_45834835/article/details/124689976

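Table of Contents

Preface
- Required modules
I. Requests
II. Getting Response Headers
III. Fixing the 418 Error (keeping the site from detecting the crawler)
- 1. Cause of the 418 error:
- 2. Steps to fix it:
IV. Example: Accessing Douban as a Simulated Browser
Summary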

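Preface

Just some notes for my own reference~

Required modules

# coding=utf-8
# Import the packages
import urllib.request
import urllib.parse   # URL encoding helpers, used for POST requests
import urllib.error   # exception classes (URLError, HTTPError) used in the timeout example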

I. Requests

1. Sending a GET request

# 1. Send a GET request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))	# decode('utf-8') decodes the fetched page bytes as UTF-8 text
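The snippet above hard-codes UTF-8, which fails on pages served in another encoding such as GBK. As a minimal sketch, assuming the server declares its charset in the Content-Type header, the helper below decodes with the declared charset and falls back to UTF-8. `read_text` is a hypothetical name, not part of urllib, and a `data:` URL is used as an offline stand-in for a real page.

```python
import urllib.request


def read_text(url, default_charset='utf-8'):
    """Fetch url and decode the body using the charset from Content-Type."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; this reads "charset=..."
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# urllib resolves data: URLs locally, so no network is needed for this demo.
# The payload is the two GBK-encoded characters of a Chinese greeting.
sample = "data:text/plain;charset=gbk,%C4%E3%BA%C3"
text = read_text(sample)
print(text)  # 你好
```

For a real site, the call would simply be `read_text("http://www.baidu.com")`.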

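2. Sending a POST request

# 2. Send a POST request
# httpbin.org is a site for testing HTTP requests such as POST
# urllib.parse.urlencode builds the form data; encoding='utf-8' converts it to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'hello':'world'}),encoding='utf-8')  # the request body must be bytes
response = urllib.request.urlopen("http://httpbin.org/post",data = data)
print(response.read().decode('utf-8'))

Running the POST request returns a JSON document in which httpbin.org echoes back the submitted form data (the original screenshot is not available).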
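To see what actually gets sent, the body can be inspected without contacting the server. This is a small sketch; the `city` field is made up here to show how a space is encoded.

```python
import urllib.parse
import urllib.request

# Illustrative form fields; 'city' is a made-up key, not from the original notes
form = {'hello': 'world', 'city': 'New York'}

encoded = urllib.parse.urlencode(form)   # a str of percent-encoded key=value pairs
data = encoded.encode('utf-8')           # bytes, as required for a request body

# Constructing Request objects does not touch the network
get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=data)

print(encoded)                 # hello=world&city=New+York
print(get_req.get_method())    # GET: no body supplied
print(post_req.get_method())   # POST: supplying a body switches the default method
```

This is why passing `data` to `urlopen` is enough to turn the request into a POST.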

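3. Handling timeout exceptions

# 3. Timeout handling
# The request may time out, or the site may refuse crawlers
import socket

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)   # timeout=1: give up after 1 second
    print(response.read().decode('utf-8'))
except (urllib.error.URLError, socket.timeout) as e:
    # URLError covers connection failures; socket.timeout covers a server that answers too slowly
    print('time out:', e)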
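Catching only `URLError` can miss a timeout that happens while waiting for the server's reply, because urllib raises `socket.timeout` directly in that case. The sketch below shows this with a deliberately slow local server; `SlowHandler` and `fetch` are hypothetical names used only for this demo.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that waits before answering, like a slow site."""

    def do_GET(self):
        time.sleep(0.5)
        body = b"ok"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the console quiet


def fetch(url, timeout):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except socket.timeout:
        # Raised directly when the server is connected but answers too slowly
        return 'timed out'
    except urllib.error.URLError as e:
        # Connection-level failures; a connect timeout shows up as e.reason
        if isinstance(e.reason, socket.timeout):
            return 'timed out'
        return 'failed: %s' % e.reason


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

fast = fetch(url, timeout=0.1)   # shorter than the server's 0.5 s delay
slow = fetch(url, timeout=5)     # long enough to receive the reply
server.shutdown()
server.server_close()

print(fast, slow)
```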

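II. Getting Response Headers

1. Getting the response status

response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
# Response status: 200 means success. For 404 (not found) or 418 (detected as a crawler),
# urlopen raises urllib.error.HTTPError instead of returning a response.
print(response.status)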

2. Getting all response header names and values

response = urllib.request.urlopen('http://www.baidu.com')
# getheaders() returns every response header as a list of (name, value) tuples
print(response.getheaders())

3. Getting the value of a single response header

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getheader('Bdpagetype'))
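The three lookups in this section can be tried offline against a tiny local server. This is only a sketch: `DemoHandler`, `inspect_url`, and the `X-Demo` header are made up for the demo. It also shows that an error status such as 404 arrives as an `HTTPError` that still carries the code and headers.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: "/" answers 200, any other path answers 404."""

    def do_GET(self):
        status = 200 if self.path == "/" else 404
        body = b"hello" if status == 200 else b"not found"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("X-Demo", "urllib-notes")  # made-up header for the lookups below
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def inspect_url(url):
    """Return (status, headers dict) whether or not urlopen raises."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status, dict(response.getheaders())
    except urllib.error.HTTPError as e:
        # Error statuses arrive as an exception that still has code and headers
        return e.code, dict(e.headers.items())


server = ThreadingHTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

ok_status, ok_headers = inspect_url(base + "/")
missing_status, _ = inspect_url(base + "/nope")

with urllib.request.urlopen(base + "/", timeout=5) as response:
    demo = response.getheader("x-demo")              # names match case-insensitively
    absent = response.getheader("X-Absent", "n/a")   # default for a missing header

server.shutdown()
server.server_close()

print(ok_status, missing_status, demo, absent)
```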

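III. Fixing the 418 Error (keeping the site from detecting the crawler)

1. Cause of the 418 error:

The request headers do not match those of a real browser. By default urllib identifies itself with a User-Agent of the form Python-urllib/3.x, so the server can tell that the request comes from a script.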
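The default headers can be checked without sending anything. A quick sketch: `build_opener()` creates the same kind of opener that `urlopen` uses when none has been installed.

```python
import urllib.request

# The opener carries the headers that are added to every request by default
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The version suffix depends on the Python in use, e.g. Python-urllib/3.11
print(default_headers['User-agent'])
```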

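2. Steps to fix it:

# 1. Prepare the URL, headers, and form data
url = 'https://httpbin.org/post'
headers  = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39"
}
data = bytes(urllib.parse.urlencode({'name':'eric'}),encoding='utf-8')
# 2. Build the request object
req = urllib.request.Request(url = url,data = data,headers = headers,method = 'POST')   # url: page address; data: request body; headers: request headers to send; method: HTTP method to use
# 3. Send the request and get the response object
response = urllib.request.urlopen(req)

print(response.read().decode('utf-8'))

Where to find the 'User-Agent' value: open the browser's developer tools, select any request in the Network panel, and copy the User-Agent line from its request headers (the original screenshot is not available).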
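One detail worth knowing when checking a Request object: urllib stores header names with only the first letter capitalized, so a lookup has to use `'User-agent'`. A small offline sketch, with the User-Agent string shortened for readability:

```python
import urllib.parse
import urllib.request

url = 'https://httpbin.org/post'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # shortened for the example
data = urllib.parse.urlencode({'name': 'eric'}).encode('utf-8')

# Building the Request performs no network I/O, so it is safe to inspect
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')

print(req.data)                        # b'name=eric'
print(req.header_items())              # header names as urllib stores them
print(req.get_header('User-agent'))    # found: the name was stored via str.capitalize()
print(req.has_header('User-Agent'))    # False: the lookup is case-sensitive
```

The header is still sent correctly; only lookups on the Request object are affected.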

IV. Example: Accessing Douban as a Simulated Browser

# Access Douban
# Prepare the URL and headers
url = 'https://www.douban.com'
headers  = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39"
}
# Build the request object
req = urllib.request.Request(url = url,headers = headers)
# Send the request and get the response object
response = urllib.request.urlopen(req)

print(response.read().decode('utf-8'))
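When many pages are fetched, the browser User-Agent can be set once on an opener instead of on every Request. The sketch below is one possible approach, not the method used above. `PickyHandler` is a hypothetical local stand-in for a site that answers 418 to urllib's default User-Agent; how Douban actually decides is not known here.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39")


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that rejects urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 418 if agent.startswith("Python-urllib") else 200
        body = b"go away" if status == 418 else b"welcome"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def status_of(opener, url):
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as e:
        return e.code


server = ThreadingHTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

plain = urllib.request.build_opener()                 # sends Python-urllib/3.x
disguised = urllib.request.build_opener()
disguised.addheaders = [("User-Agent", BROWSER_UA)]   # replaces the default header

plain_status = status_of(plain, url)
disguised_status = status_of(disguised, url)
server.shutdown()
server.server_close()

print(plain_status, disguised_status)
```

Calling `urllib.request.install_opener(disguised)` would make later plain `urlopen()` calls use the same header.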