Python Web Scraping: Basic Usage of the urllib and requests Libraries


License: CC 4.0 BY-SA

Tags: #python

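Python Web Scraping: Using the Basic urllib Library

In Python 3, requests are sent with the urllib library (Python 2's urllib2 has been merged into urllib).

If you have questions about urllib, see the official documentation: https://docs.python.org/3/library/urllib.html

The urllib library

urllib is Python's built-in HTTP request library. It contains four modules:

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception-handling module. If a request fails, the resulting exceptions can be caught so the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting and joining.
robotparser: mainly used to parse a website's robots.txt file.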
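To make the roles of parse, error, and robotparser concrete, here is a minimal sketch that runs without any network access. The URLs and the robots.txt rules are made-up illustrations and do not come from a real site; the rules are passed to parse() directly instead of being downloaded.

```python
import urllib.error
import urllib.parse
import urllib.robotparser

# parse: split a URL into components, join relative URLs, build query strings.
parts = urllib.parse.urlsplit("https://www.python.org/downloads/?page=2#top")
print(parts.netloc, parts.path, parts.query)  # www.python.org /downloads/ page=2

joined = urllib.parse.urljoin("https://www.python.org/downloads/", "../about/")
print(joined)  # https://www.python.org/about/

query = urllib.parse.urlencode({"q": "web scraping", "page": 2})
print(query)  # q=web+scraping&page=2

# error: HTTPError is a subclass of URLError, so catch HTTPError first
# when you want to treat HTTP status errors separately.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# robotparser: hypothetical rules fed in as lines of a robots.txt file.
rules = urllib.robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
blocked = rules.can_fetch("*", "https://example.com/private/data.html")
allowed = rules.can_fetch("*", "https://example.com/index.html")
print(blocked, allowed)  # False True
```

In a real crawler you would usually call set_url() and read() on the RobotFileParser to fetch the site's actual robots.txt before calling can_fetch().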

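Sending requests

urlopen()

The urllib.request module provides the most basic way to construct HTTP requests. With it you can simulate the process of a browser issuing a request.

First, the signature of urlopen(): urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). The meaning of each parameter is covered in detail later. Note that cafile, capath, and cadefault have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.

Here is a simple call to urlopen():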

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

This produces output like the following:

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
....# omitted
</body>
</html>
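The data and timeout parameters from the signature can be tried out without touching a real website. The sketch below is only an illustration under some assumptions: EchoHandler and the loopback server are hypothetical helpers written for this example, and the ProxyHandler({}) line is there only so that a system proxy does not intercept the local request. It shows that omitting data sends a GET, while passing bytes as data sends a POST, and that timeout is given in seconds.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that reports the method and body it received."""

    def _reply(self, body: bytes) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._reply(b"GET")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(b"POST " + self.rfile.read(length))

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# Assumption for this demo only: ignore any configured proxy for local requests.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# No data argument: urlopen() sends a GET request; give up after 5 seconds.
with urllib.request.urlopen(url, timeout=5) as response:
    get_result = response.read().decode("utf-8")

# data must be bytes; passing it turns the request into a POST.
form = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")
with urllib.request.urlopen(url, data=form, timeout=5) as response:
    post_result = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(get_result)   # GET
print(post_result)  # POST word=hello
```

Against a real site, a request that exceeds the timeout raises an exception, which is one reason to wrap urlopen() calls in a try block using the error module.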

Next, let's look at what urllib.request.urlopen() actually returns:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))  # print the type of response

The output is:

<class 'http.client.HTTPResponse'>

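As you can see, the return value is an HTTPResponse object. An HTTPResponse object provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), as well as attributes such as status. The following example prints the status attribute and the return value of getheaders().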

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())

This prints the status code and the response headers:

200
[('Connection', 'close'), ('Content-Length', '48730'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 27 Jul 2020 07:55:36 GMT'), ('Via',
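Besides getheaders(), a single header can be read with getheader(name), and the charset declared in Content-Type can be used instead of hard-coding 'utf-8' when decoding. The following is a rough sketch rather than a definitive recipe: PageHandler and the loopback server are hypothetical stand-ins for a real site, and the ProxyHandler({}) line is an assumption added only to keep a system proxy out of the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that serves one small HTML page."""

    def do_GET(self):
        body = b"<p>hello</p>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Assumption for this demo only: ignore any configured proxy for local requests.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

url = f"http://127.0.0.1:{server.server_port}/"
# The with statement closes the response once the block ends.
with urllib.request.urlopen(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader("Content-Type")
    # The second argument is the default returned when the header is absent.
    missing = response.getheader("X-Not-Sent", "n/a")
    # Fall back to UTF-8 when the server declares no charset.
    charset = response.headers.get_content_charset() or "utf-8"
    text = response.read().decode(charset)

server.shutdown()
server.server_close()

print(status)        # 200
print(content_type)  # text/html; charset=utf-8
print(missing)       # n/a
print(charset)       # utf-8
print(text)          # <p>hello</p>
```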
