Python Web Scraping Series (1): Making HTTP Requests / Parsing Data

Latest recommended article published on 2024-04-29 14:34:14

Chestimouse

Category: Python Web Scraping. Tags: python, json

Article link: https://blog.csdn.net/Lehi_Chiang/article/details/103575271

(1) Making HTTP/HTTPS Requests

Method 1: urllib

urllib is Python's built-in HTTP request library and can be used without installing anything. It contains 4 modules:

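request: the most basic HTTP request module, used to simulate sending requests
error: the exception-handling module; the exceptions raised when a request fails can be caught through it
parse: a utility module offering many URL-handling functions, such as splitting, parsing, and joining
robotparser: parses a site's robots.txt file to determine which pages may be crawled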
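
The parse and robotparser modules can be tried without any network access. The sketch below is only illustrative: the example.com URLs and the two robots.txt rules are made up for the demonstration, and the rules are fed in by hand rather than downloaded from a real site.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# parse: split a URL into its components
parts = urlparse("https://example.com/docs/index.html?page=2#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# parse: resolve a relative link against the page it was found on
joined = urljoin("https://example.com/docs/index.html", "../img/logo.png")
print(joined)  # https://example.com/img/logo.png

# robotparser: evaluate hand-written robots.txt rules (assumed, not fetched)
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
print(rules.can_fetch("*", "https://example.com/private/data.html"))  # False
print(rules.can_fetch("*", "https://example.com/public/page.html"))   # True
```

For a real site, set_url() followed by read() would download the actual robots.txt instead of the hand-written rules used here.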

Quickly fetching a web page

import urllib.request as ur

# Request the page and decode the response body as UTF-8 text
response = ur.urlopen("https://www.baidu.com")
html = response.read().decode("utf-8")
print(html)

#output
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
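
Hard-coding "utf-8" works for this page, but other sites may declare a different encoding. The sketch below reads the charset from the Content-Type header and falls back to UTF-8. The helper name fetch_text and its default values are assumptions made for illustration, not part of urllib.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch a URL and decode the body with the charset the response declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when no charset is declared
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the demonstration offline; a normal http(s) URL works the same way
print(fetch_text("data:text/plain;charset=utf-8,hello%20world"))
```

Using the response as a context manager also makes sure the connection is closed once the body has been read.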

1. urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

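urlopen() accepts the following parameters:

url: the address to request, as a str; it may also be a Request object

**data:** optional. Its content must be a byte stream, i.e. of type bytes. If data is passed, urlopen sends the request with POST.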

from urllib.request import urlopen
import urllib.parse

# data must be bytes: urlencode() from urllib.parse turns the parameter dict into a
# query string, and bytes() converts that string to bytes using the given encoding
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urlopen('http://httpbin.org/post', data=data)
print(response.read())

#output
b'{
........
"form":{"word":"hello"},  #the form field shows the data was submitted as a form, i.e. sent with POST
"headers":{"Accept-Encoding":"identity",
 .......}'
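
The switch from GET to POST can be checked without sending anything. The sketch below builds two Request objects against the same httpbin.org host used above and asks each which HTTP method it would use. The variable names are chosen here for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode() builds the form string; encode() turns it into the bytes that data requires
data = urlencode({"word": "hello"}).encode("utf-8")
print(data)  # b'word=hello'

get_req = Request("http://httpbin.org/get")
post_req = Request("http://httpbin.org/post", data=data)

# Nothing is sent yet; get_method() only reports what urlopen() would use
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Either Request object can then be passed to urlopen() in place of the URL string.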

**timeout:** sets the timeout in seconds. If no response arrives within this time, an exception is raised. It works for HTTP, HTTPS, and FTP requests.

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # a 0.1-second timeout is short enough to raise an exception
print(response.read())

#output
urllib.error.URLError: <urlopen error timed out>

Exception handling can be used to catch the error:

import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check whether the underlying cause is a timeout
        print(e.reason)  # print the error message

#output
timed out
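
A timeout does not always arrive wrapped in URLError: if the connection succeeds but the server is slow to answer, socket.timeout can be raised directly. The sketch below handles both cases in one place. The helper name fetch and the choice to return None on failure are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Failures while connecting or sending (including timeouts) are wrapped in URLError
        print("request failed:", e.reason)
    except socket.timeout:
        # A timeout while waiting for the response can surface unwrapped
        print("request timed out")
    return None
```

A call such as fetch('http://httpbin.org/get', timeout=0.1) would then return None instead of stopping the script when the deadline is missed.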

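Other parameters: context must be an ssl.SSLContext instance and specifies the SSL settings. cafile and capath specify a CA certificate file and a directory of CA certificates respectively, and are used for HTTPS connections. Note that cafile, capath, and cadefault have been deprecated since Python 3.6 in favor of context, and were removed in Python 3.13.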
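
A minimal sketch of the context parameter, assuming the system's default CA store is acceptable. The function names make_context and open_verified are hypothetical, chosen here for illustration.

```python
import ssl
import urllib.request


def make_context(cafile=None):
    """Build an SSLContext that verifies certificates and host names.

    cafile is optional; when given, it should point to a CA bundle file.
    """
    return ssl.create_default_context(cafile=cafile)


def open_verified(url, timeout=10):
    """Open an HTTPS URL with an explicit SSL context."""
    return urllib.request.urlopen(url, timeout=timeout, context=make_context())


context = make_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

Passing a CA bundle to ssl.create_default_context() covers the same need as the cafile argument of urlopen().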

2. The returned HTTPResponse object

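For HTTP and HTTPS URLs, urlopen() returns an object of type HTTPResponse, which provides the following methods and attributes:

Methods: read(), readinto(), getheader(name), getheaders(), fileno()

Attributes: msg, version, status, reason, debuglevel, closed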

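import urllib.request

response = urllib.request.urlopen('https://www.python.org')  # request the site and get an HTTPResponse object
print(response.read().decode('utf-8'))  # the page content
print(response.getheader('server'))  # the value of the Server response header
print(response.getheaders())  # all response headers as a list of (name, value) tuples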
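
To explore these methods and attributes without a network connection, a canned response can be parsed with http.client.HTTPResponse, the class behind the object urlopen() returns for HTTP URLs. This is only a sketch: FakeSocket is a hypothetical helper, the raw response bytes are made up, and begin() is normally called by the library itself rather than by user code.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket; serves canned bytes to HTTPResponse."""

    def __init__(self, raw):
        self._file = io.BytesIO(raw)

    def makefile(self, *args, **kwargs):
        return self._file


RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: demo\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)

response = http.client.HTTPResponse(FakeSocket(RAW))
response.begin()  # parse the status line and headers

print(response.status, response.reason)  # 200 OK
print(response.version)                  # 11, meaning HTTP/1.1
print(response.getheader("server"))      # demo (header lookup is case-insensitive)
headers = response.getheaders()          # list of (name, value) tuples
print(headers)
body = response.read()
print(body.decode("utf-8"))              # hello
```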
