Python Crawler Study Notes (1): What Is urllib

Stepfen Shawn

Article link: https://blog.csdn.net/qq_43933657/article/details/105188966

What is `urllib`?

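urllib is a package in Python's standard library for working with URLs, which makes it a common starting point for writing crawlers. It contains four modules:

request: the basic module for opening URLs and sending HTTP requests.
error: the exceptions raised by urllib.request.
parse: the module for parsing and building URLs.
robotparser: the module for reading a site's robots.txt file.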
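To make the four modules a little more concrete, here is a minimal sketch that touches `parse`, `robotparser`, and `error` without going online. The path, query string, and robots.txt rules below are invented for illustration; only the host name comes from this article.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode, urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components.
# The path, query, and fragment here are hypothetical.
parts = urlparse("https://www.geekdigging.com/search?q=urllib#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urllib.parse: build a query string from a dict.
query = urlencode({"q": "python urllib", "page": 2})
print(query)

# urllib.robotparser: evaluate rules without fetching anything.
# These robots.txt lines are invented for the example.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed_home = robots.can_fetch("*", "https://www.geekdigging.com/index.html")
allowed_private = robots.can_fetch("*", "https://www.geekdigging.com/private/page")
print(allowed_home, allowed_private)

# urllib.error: HTTPError is a subclass of URLError,
# so an except clause for HTTPError should come first.
print(issubclass(HTTPError, URLError))
```

In a real crawler, `RobotFileParser` would usually be pointed at the site's actual robots.txt with `set_url()` and `read()` rather than fed hand-written rules.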

Using `urlopen`

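The prototype of urlopen():

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Only url is required; every other parameter has a default value. Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13, so newer code should pass an ssl.SSLContext through context instead.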
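The two optional parameters used most often are `data` and `timeout`. The sketch below shows one way to pass them; the helper names `encode_form` and `open_url` and the 10-second default are my own choices for illustration, not part of urllib.

```python
import urllib.parse
import urllib.request


def encode_form(form):
    """Encode a dict as form data. urlopen expects data to be bytes."""
    return urllib.parse.urlencode(form).encode("utf-8")


def open_url(url, form=None, timeout=10.0):
    """Open url with a timeout; send form as the request body if given.

    For HTTP(S) URLs, passing data turns the request into a POST.
    """
    data = encode_form(form) if form is not None else None
    return urllib.request.urlopen(url, data=data, timeout=timeout)
```

A call such as `open_url('https://www.geekdigging.com/', timeout=5)` would give up if the server does not respond within 5 seconds, instead of waiting on the global default socket timeout.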

A first crawler

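# Import the request module from urllib
import urllib.request
# Use urlopen to get the response for the page
response = urllib.request.urlopen('https://www.geekdigging.com/')
# read() returns bytes, so decode them as UTF-8 before printing
print(response.read().decode('utf-8'))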

Running this prints the full HTML source of the page.
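The snippet above assumes that the request succeeds and that the page is UTF-8. A slightly more defensive variant might look like the sketch below. `fetch_text` is a hypothetical helper name, and falling back to UTF-8 when the server declares no charset is an assumption rather than a rule.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10.0, fallback_encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with block closes the response even if decoding fails.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset from the Content-Type header when present.
            charset = response.headers.get_content_charset() or fallback_encoding
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {exc.code}: {exc.reason}")
    except urllib.error.URLError as exc:
        # The request never got a usable answer (DNS failure, refused connection, ...).
        print(f"Could not open {url}: {exc.reason}")
    return None
```

`HTTPError` is caught before `URLError` because it is a subclass of it; with the order reversed, the first clause would swallow both.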

What is `response`?

# Print the type of response
print(type(response))

The output shows that it is an HTTPResponse object:

<class 'http.client.HTTPResponse'>

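So what is an HTTPResponse object?
HTTPResponse wraps an HTTP response. It provides access to the response headers and the response body, and the response itself is an iterable object.
HTTPResponse mainly offers the methods read(), readline(), getheader(name), getheaders(), and fileno(), along with the attributes msg, version, status, reason, debuglevel, and closed.
version gives the HTTP protocol version used by the server (10 for HTTP/1.0, 11 for HTTP/1.1).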
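To see these attributes without depending on a live site, the sketch below parses a canned response offline. The `_FakeSocket` class is a testing trick of mine, not something urllib provides, and `parse_canned_response` and `summarize_response` are hypothetical helper names. With a real crawler, `summarize_response` would simply receive the object returned by `urlopen`.

```python
import io
from http.client import HTTPResponse


class _FakeSocket:
    """Stand-in for a socket, used only to feed canned bytes to HTTPResponse."""

    def __init__(self, raw):
        self._buffer = io.BytesIO(raw)

    def makefile(self, *args, **kwargs):
        return self._buffer


def parse_canned_response(raw):
    """Build an HTTPResponse from raw bytes instead of a live connection."""
    response = HTTPResponse(_FakeSocket(raw))
    response.begin()  # parse the status line and the headers
    return response


def summarize_response(response):
    """Collect the attributes and methods discussed above into one dict."""
    return {
        "status": response.status,      # numeric status code, e.g. 200
        "reason": response.reason,      # reason phrase, e.g. "OK"
        "version": response.version,    # 10 for HTTP/1.0, 11 for HTTP/1.1
        "content_type": response.getheader("Content-Type"),
        "header_names": [name for name, _ in response.getheaders()],
        "body": response.read(),        # the body as bytes
    }


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
summary = summarize_response(parse_canned_response(raw))
print(summary)
```

Note that `read()` consumes the body, so a second call on the same response returns an empty bytes object.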