What is the urllib library?

bao_120973681

Category: Python web scraping

Article link: https://blog.csdn.net/bao_120973681/article/details/84000840

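urllib is a module in the Python standard library for working with URLs. Python 2.x shipped both urllib and urllib2; Python 3.x merged urllib2 into a single urllib package that is split into submodules. You will use this library constantly when scraping web pages.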

The merge moved many names to new locations. Some common changes:

Python 2.x import urllib2 becomes import urllib.request, urllib.error in Python 3.x

Python 2.x import urllib becomes import urllib.request, urllib.error, urllib.parse in Python 3.x

Python 2.x import urlparse becomes import urllib.parse in Python 3.x

Python 2.x urllib2.urlopen becomes urllib.request.urlopen in Python 3.x

Python 2.x urllib.urlencode becomes urllib.parse.urlencode in Python 3.x

Python 2.x urllib.quote becomes urllib.parse.quote in Python 3.x

Python 2.x cookielib.CookieJar becomes http.cookiejar.CookieJar in Python 3.x

Python 2.x urllib2.Request becomes urllib.request.Request in Python 3.x
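To make the new locations concrete, here is a minimal Python 3 sketch that touches several of the relocated names. The URL, the query values, and the variable names are placeholders chosen for illustration, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# urllib.urlencode (Python 2) now lives in urllib.parse
query = urllib.parse.urlencode({"q": "python urllib", "page": 2})
print(query)   # q=python+urllib&page=2

# urllib.quote (Python 2) now lives in urllib.parse
quoted = urllib.parse.quote("a b/c?d")
print(quoted)  # a%20b/c%3Fd

# urllib2.Request (Python 2) now lives in urllib.request
# (example.com is a placeholder host; the request is only built, not sent)
request = urllib.request.Request("http://example.com/search?" + query)
print(request.full_url)

# cookielib.CookieJar (Python 2) now lives in http.cookiejar
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
print(len(jar))  # no cookies have been stored yet
```

Calling opener.open(request) would actually send the request and collect any cookies the server sets into jar.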

The rest of this article describes what each module does in Python 3.x.

In Python 3.6.0 the urllib package consists of the following four submodules, which the official documentation describes as:

urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files
Put another way: urllib.request opens and reads URLs; urllib.error defines the exceptions that urllib.request raises; urllib.parse parses URLs; and urllib.robotparser parses robots.txt files, which tell web crawlers (spiders) what they may fetch.
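Of the four submodules, urllib.robotparser is the one this article does not return to, so here is a small sketch of how it might be used. The robots.txt rules, the crawler name "MyCrawler", and the example.com URLs are all made up for illustration; the rules are parsed from a string so the example runs without network access.

```python
import urllib.robotparser

# A made-up robots.txt: every crawler is barred from /private/
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())  # feed the rules in line by line

# Ask whether a crawler with a given name may fetch a given URL
allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed, blocked)
```

Against a live site you would instead call parser.set_url(...) with the address of the site's robots.txt and then parser.read() to download it.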

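The urllib.request.urlopen function opens a URL:

urlopen has three commonly used parameters:
urllib.request.urlopen(url, data=None[, timeout])
url is a URL string or a Request object; data is the request body as bytes (None sends a plain GET); timeout is the number of seconds to wait before a blocking operation gives up.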
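# -*- coding:utf-8 -*-
# Fetch and print the HTML of the Google home page
import urllib.request

# The with block closes the connection once the body has been read
with urllib.request.urlopen('http://www.google.com', timeout=10) as response:
    html = response.read()  # read() returns raw bytes
# Decode to text; bytes that are not valid UTF-8 are replaced rather than raising
print(html.decode('utf-8', errors='replace'))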
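The data parameter deserves a closer look: when it is supplied, the request is sent as a POST instead of a GET. A minimal sketch, assuming a hypothetical login form at example.com (the field names and URL are placeholders); the requests are only constructed here, not sent:

```python
import urllib.parse
import urllib.request

# data must be bytes, so encode the form fields after urlencode
form = {"user": "alice", "lang": "en"}
data = urllib.parse.urlencode(form).encode("utf-8")

# Without data the request is a GET; with data it becomes a POST
get_request = urllib.request.Request("http://example.com/login")
post_request = urllib.request.Request("http://example.com/login", data=data)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Either object can be passed to urlopen in place of a URL string, for example urllib.request.urlopen(post_request, timeout=10).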
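urlopen returns a file-like object that you can read from like a file. It also supports the following three methods:

info(): returns an object holding the response headers sent by the remote server.
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the page was not found.
geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL if the server redirected.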
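The three methods can be tried without depending on an outside website. The sketch below is only an illustration: it starts a throwaway HTTP server on the local machine (HelloHandler is a hypothetical handler written for this example) and then inspects the response that urlopen returns.

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/index.html" % server.server_address[1]

# Ignore any system proxy settings so the loopback request goes direct
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({})))

with urllib.request.urlopen(url, timeout=5) as response:
    status = response.getcode()                         # HTTP status code
    final_url = response.geturl()                       # URL actually retrieved
    content_type = response.info().get("Content-Type")  # one response header
    html = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, final_url, content_type)
```

Against a real site the same three calls work unchanged; only the URL passed to urlopen differs.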

URL parsing

urlparse
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

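This function parses a URL into six components and returns a 6-item named tuple. The components follow the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each item is a string and may be empty. The components are not broken down any further (for example, netloc stays a single string), and % escapes are not expanded.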

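The returned object has the following attributes. The first six are also available by tuple index; the last four are available by name only:

Attribute	Index	Value	Value if not present
scheme	0	URL scheme specifier	the scheme argument
netloc	1	Network location	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for the last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username	n/a	User name	None
password	n/a	Password	None
hostname	n/a	Host name (lower case)	None
port	n/a	Port number as an integer	None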

import urllib.parse

print(urllib.parse.urlparse("https://www.zhihu.com/question/50056807/answer/223566912"))
ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', params='', query='', fragment='')
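The Zhihu URL above leaves most fields empty, so here is a sketch with a made-up URL (the host, credentials, and query values are placeholders) that fills in every component, including the name-only attributes:

```python
import urllib.parse

# A made-up URL that exercises every component
url = "https://user:pw@www.example.com:8443/path/page;type=a?q=1&lang=en#top"
result = urllib.parse.urlparse(url)

# The six tuple items, reachable by index or by name
print(result.scheme, result.netloc, result.path)
print(result.params, result.query, result.fragment)
print(result[0] == result.scheme)

# Attributes derived from netloc, reachable by name only
print(result.username, result.password, result.hostname, result.port)

# parse_qs splits the query string into a dict of lists
fields = urllib.parse.parse_qs(result.query)
print(fields)

# urlunparse is the inverse: it joins the six items back into a URL
rebuilt = urllib.parse.urlunparse(result)
print(rebuilt == url)
```

Note that port comes back as an integer, while the other attributes are strings.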
