Getting Started with Python Web Scraping: The urllib Request Module [1]

神秘的企鹅

Article link: https://blog.csdn.net/qq_59697980/article/details/120710661

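urllib.request: the module for making basic HTTP requests

urllib.error: the exception-handling module; if a request fails or times out, the exception can be caught and handled

urllib.parse: the module for parsing URLs

urllib.robotparser: parses robots.txt files to determine whether a site permits crawling a given URL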
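As a rough illustration of the two helper modules above, the sketch below splits a URL with urllib.parse and evaluates robots.txt rules with urllib.robotparser. The URLs and the robots.txt rules are made up for the example, and the rules are supplied in memory so that nothing is downloaded.

```python
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into its components.
parts = urllib.parse.urlparse('https://www.python.org/downloads/?lang=en#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urllib.robotparser: evaluate robots.txt rules.
# These rules are hypothetical; a real crawler would load them from the
# site's own robots.txt with set_url() and read().
rules = [
    'User-agent: *',
    'Disallow: /private/',
]
parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

allowed_public = parser.can_fetch('*', 'https://example.com/index.html')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```

Checking can_fetch() before each request is one simple way to respect a site's crawling policy.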

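Sending requests with the urlopen() method

The urllib.request module provides the urlopen() method, which sends a basic HTTP request and receives the data returned by the server. Its signature is as follows: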

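response = urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# ***************************
# Parameters:
# url       : the full URL of the site to request (a string or a Request object)
# data      : defaults to None and determines the request method;
#             if None, the request is sent as GET,
#             otherwise it is sent as POST. For a POST request the form data usually starts as a dictionary,
#             which must be URL-encoded and converted to bytes before it is passed as data
# timeout   : timeout in seconds; an exception is raised if the site does not respond within this time
# cafile    : a single file containing the set of trusted CA certificates for HTTPS requests
# capath    : a directory containing certificate files
# cadefault : ignored
# context   : an ssl.SSLContext instance describing the SSL options
# Note: cafile, capath and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead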
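To see how the data parameter selects the request method, the sketch below builds two urllib.request.Request objects and asks each for its method without sending anything. The byte string b'hello=python' is just a placeholder payload.

```python
import urllib.request

url = 'https://www.httpbin.org/post'

# Without data, the request defaults to GET.
get_request = urllib.request.Request(url)

# With data (which must be bytes), the request becomes POST.
post_request = urllib.request.Request(url, data=b'hello=python')

print(get_request.get_method())
print(post_request.get_method())
```

urlopen() accepts a Request object in place of a URL string, so the same rule applies when the URL and data are passed to urlopen() directly.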

Sending a GET request

import urllib.request                            # import the request submodule

url = 'https://www.baidu.com/'
response = urllib.request.urlopen(url=url)       # send the request
print('Response type:', type(response))          # inspect the type of the returned object

Output:

Response type: <class 'http.client.HTTPResponse'>

Process finished with exit code 0

Retrieving information with common HTTPResponse methods and attributes

import urllib.request                                                   # import the request submodule
url = 'https://www.python.org/'
response = urllib.request.urlopen(url=url)                            # send the request

print('Status code: ', response.status)
print('All response headers: ', response.getheaders())
print('A single response header: ', response.getheader('Accept-Ranges'))  # getheader() takes a header name; getheaders() takes no argument

print('HTML source of the Python website: \n', response.read().decode('utf-8'))      # read the HTML and decode it as UTF-8

Output (partial):

Status code:  200
All response headers:  [('Connection', 'close'), ('Content-Length', '49880'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 11 Oct 2021 12:43:28 GMT'), ('Age', '2662'), ('X-Served-By', 'cache-bwi5162-BWI, cache-tyo11939-TYO'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '2, 823'), ('X-Timer', 'S1633956208.134679,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
A single response header:  bytes
HTML source of the Python website: 
 <!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
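When reading a response, it can help to close it automatically with a with statement and to take the charset from the Content-Type header rather than hard-coding UTF-8. The sketch below is one way to do that. The helper name fetch_text and its parameters are hypothetical, and a data: URL (which urlopen() also supports) stands in for a real site so that the example runs without network access.

```python
import urllib.request


def fetch_text(url, timeout=10, fallback_charset='utf-8'):
    """Hypothetical helper: fetch a URL and decode the body as text."""
    # The with statement closes the response even if decoding fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in Content-Type, if there is one.
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset)


# A data: URL carrying the percent-encoded markup <h1>hello</h1>.
demo_url = 'data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E'
html = fetch_text(demo_url)
print(html)
```

For a real page, the call would look like fetch_text('https://www.python.org/').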

Sending a POST request

Using the urlopen() method

import urllib.request  # import the urllib.request module
import urllib.parse  # import the urllib.parse module

url = 'https://www.httpbin.org/post'  # test endpoint for POST requests
# urlencode() turns the dictionary of form data into a URL-encoded string,
# which is then encoded as UTF-8 bytes, the type that the data parameter requires
data = urllib.parse.urlencode({'hello': 'python'}).encode('utf-8')
response = urllib.request.urlopen(url=url, data=data)  # send the request
print(response.read().decode('utf-8'))  # read the response body and decode it as UTF-8

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {                                # the submitted form data appears here
    "hello": "python"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "12", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-6164335d-4ed5a2b053a663921dc5c491"
  }, 
  "json": null, 
  "origin": "117.70.185.26", 
  "url": "https://www.httpbin.org/post"
}


Process finished with exit code 0
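Two functions in urllib.parse are easy to confuse when preparing POST data: urlencode() serializes a whole dictionary into key=value pairs, while quote_plus() only escapes a single string. The sketch below compares them offline. The form fields are made up for the example.

```python
import urllib.parse

form = {'hello': 'python', 'q': 'web scraping'}

# urlencode() joins key=value pairs with & and escapes each part.
encoded = urllib.parse.urlencode(form)
print(encoded)

# quote_plus() escapes one string; it does not build key=value pairs.
quoted = urllib.parse.quote_plus('web scraping')
print(quoted)

# The data parameter of urlopen() needs bytes, so encode the string.
body = encoded.encode('utf-8')

# parse_qs() reverses urlencode(), which is handy for checking a payload.
decoded = urllib.parse.parse_qs(body.decode('utf-8'))
print(decoded)
```

For form submissions, urlencode() followed by encode('utf-8') is usually the combination that is needed.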

Setting a network timeout

import urllib.request

url = 'https://www.python.org/'
response = urllib.request.urlopen(url=url, timeout=0.1)             # send the request with a timeout (an exception is raised if no response arrives in time)
print(response.read().decode('utf-8'))                              # read the HTML and decode it as UTF-8

Output:

# Result when the request times out
Traceback (most recent call last):
  File "E:/pythonProject/timeout.py", line 5, in <module>
    print(response.read().decode('utf-8'))                              # read the HTML and decode it as UTF-8
  File "E:\Pythontool\Anaconda\lib\http\client.py", line 471, in read
    s = self._safe_read(self.length)
  File "E:\Pythontool\Anaconda\lib\http\client.py", line 612, in _safe_read
    data = self.fp.read(amt)
  File "E:\Pythontool\Anaconda\lib\socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "E:\Pythontool\Anaconda\lib\ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "E:\Pythontool\Anaconda\lib\ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

Process finished with exit code 1

# Result when the request does not time out (partial)
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
    <meta name="apple-mobile-web-app-title" content="Python.org">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">

Handling network timeouts by catching the exception

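import urllib.request
import urllib.error                                                     # import the urllib.error module
import socket                                                           # import the socket module

url = 'https://www.python.org/'
try:                                                                    # try guards the block so that except can catch and handle errors
    response = urllib.request.urlopen(url=url, timeout=0.1)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as error:                                  # a timeout while connecting is wrapped in URLError
    if isinstance(error.reason, socket.timeout):                        # check whether the cause is a timeout
        print('The current task has timed out; moving on to the next task!')
except socket.timeout:                                                  # a timeout while reading the body is raised directly
    print('The current task has timed out; moving on to the next task!')

Output:

The current task has timed out; moving on to the next task!

Process finished with exit code 0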
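The same pattern can be wrapped in a small function so that a crawler can skip a failed URL and continue. The sketch below is one possible shape, not a definitive implementation. The name fetch_or_none and its default timeout are assumptions, and the demo uses an unsupported URL scheme to trigger URLError without touching the network.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=5.0):
    """Hypothetical helper: return the decoded body, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', error.code)
    except urllib.error.URLError as error:
        # Connection-level failures, including a timeout while connecting.
        if isinstance(error.reason, socket.timeout):
            print('Timed out while connecting')
        else:
            print('Failed to reach the server:', error.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print('Timed out while reading the response')
    return None


# No handler exists for this made-up scheme, so urlopen() raises URLError.
missing = fetch_or_none('nosuchscheme://example.invalid/')
print(missing)
```

A crawler loop could call fetch_or_none() for each URL and move on whenever it returns None.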
