Python Web Scraping: Using the urllib Library

Latest recommended article published on 2024-08-29 09:59:54

Crush317

Category: Python Web Scraping. Tags: python, web scraping, http

Original link: https://docs.python.org/3/howto/urllib2.html#introduction

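urllib Library Overview

What is urllib

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: the request module

urllib.error: the exception handling module

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module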
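
As a rough illustration of the two parsing modules, the sketch below splits a URL with urllib.parse and evaluates robots.txt rules with urllib.robotparser. The rule set and the example.com URLs are made up for the example, and the rules are passed in directly so that nothing is fetched from the network.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components.
parts = urlparse("http://www.baidu.com/s?wd=python")

# urllib.robotparser: evaluate robots.txt rules. The rules are fed in
# through parse() so no network access is needed; they are invented
# for illustration only.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(parts.scheme, parts.netloc, parts.path, parts.query)
print(rules.can_fetch("*", "http://example.com/index.html"))
print(rules.can_fetch("*", "http://example.com/private/data.html"))
```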

Fetching URLs

The simplest way to use urllib.request is as follows:

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
print(html)
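
Note that response.read() returns bytes, not str. One possible way to get text is sketched below: read_text is a hypothetical helper (not part of urllib) that decodes the body using the charset declared in the Content-Type header, falling back to UTF-8 as an assumption. A data: URL is used so the demo runs offline.

```python
import urllib.request

def read_text(url, fallback_charset="utf-8"):
    """Fetch url and return the body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # always bytes
        # Use the charset declared in Content-Type, if there is one.
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset, errors="replace")

# A data: URL keeps the demo offline; an http:// URL is read the same way.
text = read_text("data:text/plain;charset=utf-8,hello%20urllib")
print(text)
```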

In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL.

import urllib.request
req = urllib.request.Request("http://python.org/")
with urllib.request.urlopen(req) as response:
    the_page = response.read()
print(the_page)
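
A Request object can be inspected before anything is sent. The short sketch below prints a few of its attributes for the same URL; it makes no network call.

```python
import urllib.request

req = urllib.request.Request("http://python.org/")

print(req.full_url)      # the URL passed to the constructor
print(req.host)          # host part parsed from the URL
print(req.get_method())  # GET, because no data was supplied
```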

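Data

POST requests

To send data to a URL, as in the common case of an HTML form, the data must be encoded in the standard way and then passed to the Request object as the data argument.

A POST request carries its data in the request body, here as URL-encoded form fields.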

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'}
data = urllib.parse.urlencode(values)  # URL-encode the form fields
data = data.encode('ascii')            # the data argument must be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
print(the_page)
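
To see what the encoding step produces without contacting a server, the sketch below builds the same Request and inspects it. The variable names encoded and body are chosen for illustration.

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'}

encoded = urllib.parse.urlencode(values)  # str: key=value pairs joined by &
body = encoded.encode('ascii')            # bytes, as Request expects

req = urllib.request.Request(url, data=body)
print(encoded)
print(req.get_method())  # POST, because data is present
```

Spaces in the values are encoded as +, and the presence of data is what switches the method from GET to POST.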

GET requests

If you do not pass the data argument, urllib uses a GET request.

With GET, the data is encoded into the URL itself.

import urllib.request
import urllib.parse

data = {}
data['wd'] = '青岛'
url_values = urllib.parse.urlencode(data)
# Encoded form: wd=%E9%9D%92%E5%B2%9B
print(url_values)
url = 'http://www.baidu.com'
full_url = url + '/s?' + url_values
# full_url is http://www.baidu.com/s?wd=%E9%9D%92%E5%B2%9B
with urllib.request.urlopen(full_url) as response:
    print(response.read())
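
The URL-building step can also be wrapped in a small function and reversed with parse_qs. In the sketch below, build_url is a hypothetical helper, not a urllib function.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit

def build_url(base, path, params):
    """Join base and path, then append params as a query string."""
    return urljoin(base, path) + "?" + urlencode(params)

full_url = build_url("http://www.baidu.com", "/s", {"wd": "青岛"})
print(full_url)

# Reverse direction: decode the query string back into a dict of lists.
query = parse_qs(urlsplit(full_url).query)
print(query)
```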

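Headers

Browsers identify themselves through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself with a browser-style User-Agent string.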

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
# User-Agent string identifying the client software (not a unique user ID)
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"
values = {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'}
# Request headers to send
headers = {'User-Agent': user_agent}
data = urllib.parse.urlencode(values)  # URL-encode the form fields
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
print(the_page)
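
Headers can also be added after the Request has been created and read back for checking. The sketch below sends nothing over the network; the Accept-Language header is an assumed example. One detail worth knowing is that Request normalizes header names with str.capitalize(), so lookups need that spelling.

```python
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"
req = urllib.request.Request(url, headers={'User-Agent': user_agent})

# add_header() sets one more header on an existing Request
# (Accept-Language is just an illustrative choice).
req.add_header('Accept-Language', 'en-US')

# Names are stored capitalized: 'User-agent', 'Accept-language'.
print(req.get_header('User-agent'))
print(req.get_header('User-Agent'))  # None: the lookup is case-sensitive
print(req.header_items())
```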

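Handling Exceptions

URLError

URLError is usually raised because there is no network connection (no route to the specified server) or because the specified server does not exist. In that case the exception has a reason attribute, which in Python 3 is either a message string or another exception instance (for example an OSError carrying an error code and a text message).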

import urllib.error
import urllib.request

req = urllib.request.Request('https://www.pretend_server.org')
try:
    urllib.request.urlopen(req)
except urllib.error.URLError as e:
    print(e.reason)
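
One way to package this pattern is a small wrapper that never raises URLError to the caller. The sketch below is only an illustration: fetch is a hypothetical helper, and the 10-second timeout is an arbitrary default. Both demo calls stay offline: a data: URL succeeds, and a made-up scheme has no handler, which urllib reports as a URLError.

```python
import urllib.error
import urllib.request

def fetch(url, timeout=10.0):
    """Return (body, None) on success or (None, reason) on URLError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.URLError as e:
        # reason may be a str or an exception instance; str() covers both.
        return None, str(e.reason)

ok_body, ok_err = fetch("data:text/plain,hi")
bad_body, bad_err = fetch("nosuchscheme://example")
print(ok_body, ok_err)
print(bad_body, bad_err)
```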

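HTTPError

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server cannot fulfill the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirect" that asks the client to fetch the document from a different URL, urllib handles that for you). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).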

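import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # HTTP status code (404 if the page does not exist)
    print(e.code)
    # Body of the error page returned by the server
    print(e.read())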
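
The numeric code from e.code can be turned into its standard phrase with http.HTTPStatus from the standard library. The sketch below uses a hypothetical helper, describe_status, on the three codes mentioned above.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a standard HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Not a status code known to the standard library.
        return f"{code} (unknown status)"

for code in (404, 403, 401):
    print(describe_status(code))
```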

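Combined (catch HTTPError first, because it is a subclass of URLError and would otherwise be caught by the URLError clause):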

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request('http://www.python.org/fish.html')
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
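
The ordering of the two except clauses can be checked without a network connection. In the sketch below, describe is a hypothetical helper and the two exceptions are built by hand with assumed values, standing in for real failures.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError

def describe(exc):
    """Classify an exception with the same clause order as the combined example."""
    try:
        raise exc
    except HTTPError as e:  # must come before URLError
        return f"HTTP error {e.code}"
    except URLError as e:
        return f"unreachable: {e.reason}"

# Hand-built exceptions (assumed values, for illustration only).
http_exc = HTTPError("http://www.python.org/fish.html", 404, "Not Found",
                     Message(), io.BytesIO())
url_exc = URLError("no route to host")

print(issubclass(HTTPError, URLError))
print(describe(http_exc))
print(describe(url_exc))
```

If the two clauses were swapped, the HTTPError would be handled by the URLError branch and the status code would never be reported.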
