Web Scraping Study Notes 3

Latest recommended article published on 2023-10-17 00:05:02

笑揖峰头月一轮

Category: Study Notes. Tags: web scraping

Article link: https://blog.csdn.net/qq_19268039/article/details/84069551

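This article introduces Python's built-in urllib module in detail, including core concepts such as urlopen, Request, Handler, and OpenerDirector. It covers sending HTTP requests, exception handling, and URL parsing, and discusses authentication, cookies, proxy settings, and the robots.txt protocol. Examples show how to construct a complete HTTP request and how to process and analyze the response data.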

Using the Basic Libraries

urllib

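urllib is Python's built-in HTTP request library. It contains four modules:

request: the most basic HTTP request module, used to simulate sending a request. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library's functions to reproduce that process.
error: the exception handling module. If a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. It is used relatively rarely.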
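
As a rough illustration of the modules other than request, the sketch below exercises parse, error, and robotparser without touching the network. The example.com URLs and the two robots.txt rules are made up for the demonstration; they are assumptions, not part of the original notes.

```python
from urllib import error, parse, robotparser

# parse: split a URL into its components and resolve a relative link
parts = parse.urlparse("https://www.example.com/docs/index.html?lang=en#top")
joined = parse.urljoin("https://www.example.com/docs/index.html", "faq.html")
print(parts.netloc, parts.path, parts.query, parts.fragment)
print(joined)

# error: HTTPError is a subclass of URLError, so catching URLError covers both
http_error_is_url_error = issubclass(error.HTTPError, error.URLError)
print(http_error_is_url_error)

# robotparser: feed the rules in directly instead of downloading robots.txt
rules = robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
private_allowed = rules.can_fetch("*", "https://www.example.com/private/a.html")
public_allowed = rules.can_fetch("*", "https://www.example.com/public/a.html")
print(private_allowed, public_allowed)
```

In a real crawler the rules would normally be loaded with set_url() and read() from the site's own robots.txt rather than typed in by hand.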

Sending Requests

With urllib's request module, we can conveniently send a request and get the response.

urlopen

The urllib.request module provides the most basic way to construct HTTP requests. With it we can simulate the process of a browser issuing a request. It also handles authentication, redirection, browser cookies, and more.

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))
print(response.read().decode('utf-8'))

Output:

<class 'http.client.HTTPResponse'>
<html>
<head>
    <script>
        location.replace(location.href.replace("https://","http://"));
    </script>
</head>
<body>
    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

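The response contains the HTML returned for the Baidu home page (here, a small page that redirects to the http:// address). Its type is http.client.HTTPResponse. An HTTPResponse object mainly provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.
To check the response status code and response header fields: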

print(f"status:{response.status}")
print(f"headers:{response.getheaders()}")
print(f"server:{response.getheader('Server')}")

Result:

status:200
headers:[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ('Content-Type', 'text/html'), ('Date', 'Wed, 14 Nov 2018 09:14:55 GMT'), ('Etag', '"5be10158-e3"'), ('Last-Modified', 'Tue, 06 Nov 2018 02:50:00 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'), ('Set-Cookie', 'BIDUPSID=182555CB247734E41DC41EAD4E3D44A8; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1542186895; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Strict-Transport-Security', 'max-age=0'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close')]
server:BWS/1.1
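
The same attributes can be read inside a with block, which closes the connection automatically. The sketch below is only an illustration: so that it runs without Internet access, it starts a throwaway local server in place of a real site. HelloHandler and fetch are hypothetical names introduced for this example, and it assumes no HTTP proxy is configured for 127.0.0.1.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Return (status, Content-Type header, decoded body) for a URL."""
    with urllib.request.urlopen(url, timeout=5) as response:
        # Use the charset declared by the server, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode(charset),
        )


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, content_type, text = fetch(f"http://127.0.0.1:{server.server_port}/")
finally:
    server.shutdown()
    server.server_close()

print(status, content_type)
print(text)
```

Against a real site, only the fetch() call is needed; the local server exists purely to make the sketch reproducible.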

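Parameters of urlopen():
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
(This is the signature as of Python 3.6. The cafile, capath, and cadefault parameters were deprecated in 3.6 in favor of context and removed in Python 3.13.)
The data parameter is optional. If it is supplied, it must be a bytes object, so a string has to be converted first, for example with bytes(). In addition, once this parameter is passed, the request method is no longer GET but POST.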

import urllib.request
import urllib.parse

# urlencode: converts the parameter dict to a URL-encoded string; bytes: converts it to a byte stream
data = bytes(urllib.parse.urlencode({
   'word': 'Nice to meet you'}), encoding='utf-8')
# Test the POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "Nice to meet you"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "21", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "115.205.15.135", 
  "url": "http://httpbin.org/post"
}

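The parameter we passed appears in the form field, and the Content-Type also indicates a form submission. This shows that the request simulated a form submission and transmitted the data with POST.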
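
To see why the server reports a Content-Length of 21 and a POST request, the encoding step and the method selection can be inspected without sending anything. This is a minimal sketch; it builds urllib.request.Request objects only to query get_method(), and the variable names are chosen for illustration.

```python
import urllib.parse
import urllib.request

form = {'word': 'Nice to meet you'}

# urlencode produces a str; spaces become '+' in form encoding
encoded = urllib.parse.urlencode(form)
data = encoded.encode('utf-8')
print(encoded, len(data))  # word=Nice+to+meet+you 21

# Without data the request is a GET; with data it becomes a POST
get_request = urllib.request.Request('http://httpbin.org/post')
post_request = urllib.request.Request('http://httpbin.org/post', data=data)
print(get_request.get_method(), post_request.get_method())  # GET POST

# parse_qs reverses the encoding, which is roughly what the server's form field reflects
decoded = urllib.parse.parse_qs(encoded)
print(decoded)  # {'word': ['Nice to meet you']}
```

The 21 bytes of the encoded body match the Content-Length header in the httpbin.org response above.
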
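The timeout parameter sets the timeout in seconds. If the request exceeds this time without receiving a response, an exception is raised. If it is not specified, the global default timeout is used. This can be used to skip a page when it fails to respond for a long time.
Using a try/except statement: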

from urllib import request
from urllib import error
import socket
try:
    response = request.urlopen('http://httpbin.org/get', timeout=0.1)
except error.URLError as e:
    if isinstance(e.reason