Fetching Web Page Content with the urllib Library

洗手不上厕所

Column: Python crawler notes. Tags: python, html, url, crawler

Article link: https://blog.csdn.net/weixin_50560109/article/details/119182297

1. Send a plain GET request (the site has no anti-crawler protection)

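# Send a GET request (no parameters needed)

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")  # fetch the page; the returned object holds the whole response
raw = response.read()            # read the body once; the result is bytes, and a second read() would return b""
print(raw)                       # the page source as raw bytes
print(raw.decode('utf-8'))       # decode as UTF-8 so that Chinese text displays correctly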

Result:
[Screenshot: page source returned by the GET request, as raw bytes]
[Screenshot: the same page source after decoding the bytes as UTF-8]
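
Two details of this step are easy to trip over: `read()` consumes the response stream, so a second call returns empty bytes, and not every page is encoded as UTF-8. The following is a minimal sketch rather than part of the original post. `decode_body` and `fetch_text` are hypothetical helper names, and the sketch assumes the response headers behave like `email.message.Message` (the base class of the headers object that `urlopen` returns), falling back to UTF-8 when no charset is declared.

```python
import io
import urllib.request
from email.message import Message


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset declared in Content-Type, if any."""
    charset = headers.get_content_charset() or default
    try:
        # errors="replace" keeps going when some bytes are invalid for the charset
        return raw.decode(charset, errors="replace")
    except LookupError:
        # the server declared a charset name Python does not know
        return raw.decode(default, errors="replace")


def fetch_text(url: str) -> str:
    """Hypothetical helper: fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # read once; the stream is empty afterwards
        return decode_body(raw, response.headers)


# Offline illustration of why a second read() yields nothing:
stream = io.BytesIO(b"<html></html>")
first = stream.read()   # the whole body
second = stream.read()  # empty bytes, the stream is already consumed
```

With this in place, `fetch_text("http://www.baidu.com")` would return the decoded page source in a single call.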

2. Send a POST request to a test site

Test site: http://httpbin.org

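# Send a POST request (the parameters must be encoded as bytes)

import urllib.parse             # used to encode the key-value pairs below
import urllib.request

data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")   # URL-encode the key-value pairs, then convert the string to UTF-8 bytes for the request body
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))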

Test result:
[Screenshot: response body returned by http://httpbin.org/post]
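
To make the encoding step concrete, here is a small offline sketch (no request is actually sent) of what `urlencode` produces and how the presence of `data` decides the HTTP method. The extra key `q` is an illustrative assumption and does not come from the original post.

```python
import urllib.parse
import urllib.request

params = {"hello": "world", "q": "爬虫"}  # "q" is an illustrative extra field

# urlencode builds a percent-encoded query string (a str, not bytes)
query = urllib.parse.urlencode(params)

# the request body must be bytes, hence the explicit encode step
data = query.encode("utf-8")

# a Request that carries data defaults to POST; without data it defaults to GET
post_req = urllib.request.Request("http://httpbin.org/post", data=data)
get_req = urllib.request.Request("http://httpbin.org/get")

print(query)
print(post_req.get_method(), get_req.get_method())

# parse_qs reverses the encoding and returns a list of values per key
decoded = urllib.parse.parse_qs(query)
print(decoded)
```

This is why the original snippet never names the method: passing `data` to `urlopen` is enough to turn the request into a POST.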

3. Handling timeouts

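import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)   # fetch the page with a timeout of 0.01 s
    print(response.read().decode('utf-8'))  # read the body (bytes) and decode it as UTF-8
except (urllib.error.URLError, socket.timeout) as e:   # a connection timeout arrives wrapped in URLError; a timeout during read() raises socket.timeout directly
    print("time out!")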

Timeout result:
[Screenshot: output of the timeout example]
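
Note that `URLError` covers more than timeouts (DNS failures and refused connections, for example), so printing "time out!" for every `URLError` can mislead. One possible way to tell the cases apart is sketched below. `is_timeout` and `fetch_with_timeout` are hypothetical names, and the 5-second default is an arbitrary assumption.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def is_timeout(exc: BaseException) -> bool:
    """Return True if exc represents a timeout rather than another failure."""
    if isinstance(exc, socket.timeout):
        # raised directly, e.g. when the timeout hits during read()
        return True
    # urlopen wraps connection-phase errors in URLError; the cause is in .reason
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, socket.timeout
    )


def fetch_with_timeout(url: str, timeout: float = 5.0) -> Optional[str]:
    """Hypothetical helper: return the page text, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # not a timeout: let the caller see the real error
```

Calling `fetch_with_timeout("http://httpbin.org/get", timeout=0.01)` would most likely return `None`, since 0.01 s is rarely enough to complete a request.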

4. Fetch a page protected against crawlers using POST

# Access the Douban site while posing as a browser (POST, so parameters are required)

import urllib.parse
import urllib.request

url = "https://www.douban.com"
headers = {                  # request headers; the User-Agent value is copied from a real browser request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
req = urllib.request.Request(url=url,                   # build a custom request that mimics a browser; the key part is the User-Agent header
                             data=data,
                             headers=headers,
                             method='POST')
response = urllib.request.urlopen(req)    # send the prepared request and get the response
print(response.read().decode("utf-8"))

Result:
[Screenshot: page fetched from the anti-crawler site using POST]
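
A `Request` object can be inspected before anything is sent, which is a handy way to confirm that the disguise is in place. The sketch below is an illustration only. `build_request` is a hypothetical helper that chooses POST when form data is supplied and GET otherwise.

```python
import urllib.parse
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def build_request(url, form=None, user_agent=USER_AGENT):
    """Hypothetical helper: POST when form data is given, otherwise GET."""
    data = urllib.parse.urlencode(form).encode("utf-8") if form else None
    return urllib.request.Request(url, data=data, headers={"User-Agent": user_agent})


req = build_request("https://www.douban.com", form={"hello": "world"})

# Nothing has been sent yet; the Request object can be examined offline.
print(req.get_method())              # the method urlopen would use
print(req.get_header("User-agent"))  # Request stores header names capitalized
print(req.data)                      # the encoded request body
```

Note the lookup key `"User-agent"`: `Request` normalizes header names with `str.capitalize()`, so `get_header("User-Agent")` would return `None`.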

5. Fetch a page protected against crawlers using GET

# Access the Douban site while posing as a browser (GET, so no parameters are needed)

import urllib.request

url = "https://www.douban.com"
headers = {                  # request headers; the User-Agent value is copied from a real browser request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
req = urllib.request.Request(url=url,                   # build a custom request that mimics a browser; the key part is the User-Agent header
                             headers=headers)
response = urllib.request.urlopen(req)    # send the prepared request and get the response
print(response.read().decode("utf-8"))

Result:
[Screenshot: page fetched from the anti-crawler site using GET]
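
When a site does reject a request, for instance because the User-Agent header is missing, `urlopen` raises `urllib.error.HTTPError`, which carries the status code. The sketch below shows one way to separate that case from network-level failures. `describe_error` and `fetch_page` are hypothetical helpers, and which status code a given site returns is not something the original post states.

```python
import urllib.error
import urllib.request
from typing import Optional


def describe_error(exc: urllib.error.URLError) -> str:
    """Summarize a urllib failure. HTTPError is checked first because it subclasses URLError."""
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    return f"network error: {exc.reason}"


def fetch_page(url: str, headers: dict) -> Optional[str]:
    """Hypothetical helper: GET a page with custom headers; print the failure and return None on error."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as exc:
        print(describe_error(exc))
        return None
```

Comparing `fetch_page(url, {})` with `fetch_page(url, headers)` is a quick way to check whether the User-Agent header is what makes the difference for a given site.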
