Study Notes: The urllib Library

最新推荐文章于 2024-08-25 23:24:39 发布

岳野

Category: Study Notes. Tags: python, urllib, web crawler

Article link: https://blog.csdn.net/weixin_37938228/article/details/88386481

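urllib is Python's built-in HTTP request library. It contains four modules:
urllib.request (sending requests), urllib.error (exception handling), urllib.parse (URL parsing, such as splitting and joining), and urllib.robotparser (robots.txt parsing).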
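I. The request module
1. The urlopen method
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Parameters: url: the address to open
data: the data to submit with a POST request; it must be a bytes object
timeout: how long to wait for the site to respond, in seconds
Methods provided by the object returned by urlopen:
read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request succeeded and 404 means the page was not found
geturl(): returns the URL that was actually retrieved (it may differ from the requested one after a redirect)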

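The request module makes it easy to fetch the content of a URL. When the data argument is empty, urlopen sends a GET request to the given page and returns the HTTP response.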
The data argument of urlopen() defaults to None. When it is not None, urlopen() submits the request with POST.
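
The difference between the two request styles can be sketched as follows. This is a minimal illustration, assuming Python 3; the helper names `encode_form` and `fetch` are made up for this example and are not part of urllib.

```python
import urllib.parse
import urllib.request


def encode_form(fields):
    """Encode a dict as form data; urlopen expects POST data as bytes."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def fetch(url, data=None, timeout=10):
    """Open url and return (status_code, final_url, body_bytes).

    With data=None the request is a GET; with bytes in data it is a POST.
    """
    with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
        return response.getcode(), response.geturl(), response.read()
```

With network access, `fetch("http://www.example.com/")` would send a GET, while `fetch(some_url, data=encode_form({"word": "hello"}))` would send a POST to a server that accepts form data. Both URLs are placeholders.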

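2. Wrapping a request with Request
urllib.request.Request(url, data=None, headers={}, method=None)

[Note] The headers argument carries the request header data:
User-Agent: can carry the browser name and version, the operating system name and version, and the default language.
Referer: the address of the page the request came from; servers can check it to prevent hotlinking.
Connection: indicates whether the connection should stay open after the request completes (for example keep-alive or close).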
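
As a rough sketch of how such headers are attached, the snippet below builds a Request without sending it. The URL and header values are assumptions chosen only for illustration.

```python
import urllib.request

# Hypothetical target URL and header values, used only for illustration.
url = "http://www.example.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.example.com/index.html",
}

request = urllib.request.Request(url, headers=headers, method="GET")

# Request stores header names in capitalized form, e.g. "User-agent".
print(request.get_method())
print(request.get_header("User-agent"))

# Sending is a separate step and needs network access:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.getcode())
```

Passing the Request object to urlopen() instead of a plain URL string is what makes the custom headers take effect.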
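3. Proxy IP
When crawling, a page that could be fetched a moment ago often becomes unreachable. A common cause is that the site's server has blocked your IP address. Sending requests through a proxy server works around this.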
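
One way to route requests through a proxy with urllib is ProxyHandler together with build_opener. The sketch below assumes a proxy listening at 127.0.0.1:8080, which is a hypothetical address and not one from the original notes.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)

# Requests made through this opener go via the proxy (network required):
# with opener.open("http://www.example.com/", timeout=10) as response:
#     print(response.read()[:100])

# Optionally make this opener the default for later urlopen() calls:
# urllib.request.install_opener(opener)
```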
4. Cookie
A cookie is a small piece of text data saved on the client. It records the client's identity and is used to maintain login state.
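
The original code for this part did not survive, so the following is only a sketch of one common approach: the standard library's http.cookiejar module combined with HTTPCookieProcessor. The helper `show_cookies` is a made-up name for this example.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses and replays them on later requests.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_handler)


def show_cookies(jar):
    """Return the cookies currently held in the jar as 'name=value' strings."""
    return [f"{cookie.name}={cookie.value}" for cookie in jar]


# After opener.open("http://www.example.com/") the jar would hold whatever
# cookies that server set (network required; the URL is a placeholder).
print(show_cookies(cookie_jar))
```

Because the same opener keeps the jar between calls, a login request followed by further requests through that opener would carry the session cookie along.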

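II. The error module (exception handling)
Catch HTTPError first and the other exceptions afterwards: HTTPError is a subclass of URLError, so an earlier except URLError clause would also catch it.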
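
The ordering can be sketched like this. The function name `safe_fetch` is an assumption for illustration, and the function returns a short description instead of raising so that the two branches are easy to tell apart.

```python
import urllib.error
import urllib.request


def safe_fetch(url, timeout=10):
    """Return a short description of the outcome instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK {response.getcode()}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return f"HTTPError {exc.code} {exc.reason}"
    except urllib.error.URLError as exc:
        # No valid response: DNS failure, refused connection, unknown scheme.
        return f"URLError {exc.reason}"
```

If the two except clauses were swapped, the HTTPError branch would never run, and the status code in `exc.code` would be lost.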

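III. The parse module (URL parsing)
1. urlparse: splits a URL into six components (scheme, netloc, path, params, query, fragment) and assigns each in order
parse.urlparse(urlstring, scheme='', allow_fragments=True)
urlstring: the URL to split
scheme: the default scheme, used only when the URL does not specify one
allow_fragments: whether the part after # is split off as the fragment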

Comparing the results for different argument values shows what each parameter does.
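
Since the original screenshots are missing, here is a small stand-in comparison. The sample URL is an assumption chosen so that all six components are present.

```python
from urllib.parse import urlparse

url = "http://www.baidu.com/index.html;user?id=5#comment"

# Default call: all six components are split out.
result = urlparse(url)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#             params='user', query='id=5', fragment='comment')

# allow_fragments=False: the '#comment' part stays attached to the query.
no_fragment = urlparse(url, allow_fragments=False)
print(no_fragment.query, repr(no_fragment.fragment))

# scheme is only a fallback for URLs that do not carry their own scheme.
fallback = urlparse("//www.baidu.com/index.html", scheme="https")
print(fallback.scheme, fallback.netloc)
```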
2. urlunparse (the inverse of urlparse)
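
urlunparse takes the six components in order and assembles them back into a URL. A brief sketch with assumed sample values:

```python
from urllib.parse import urlparse, urlunparse

# The six components in order: scheme, netloc, path, params, query, fragment.
parts = ("http", "www.baidu.com", "/index.html", "user", "a=6", "comment")
url = urlunparse(parts)
print(url)  # http://www.baidu.com/index.html;user?a=6#comment

# Round trip: parsing and reassembling gives back the same URL here.
round_trip = urlunparse(urlparse(url))
print(round_trip == url)  # True
```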

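3. urljoin (joining)
urljoin joins a base URL with a second string, whether that string is a proper relative link or arbitrary text. If the second argument is itself a complete URL starting with 'http' or 'https', no joining takes place and the second URL is returned.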
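
Both behaviours can be seen in a short sketch. The URLs below are sample values chosen for illustration.

```python
from urllib.parse import urljoin

# A relative second argument is resolved against the base URL.
joined_root = urljoin("http://www.baidu.com", "FAQ.html")
print(joined_root)  # http://www.baidu.com/FAQ.html

# The last path segment of the base is replaced by the relative link.
joined_sibling = urljoin("http://www.baidu.com/a/b.html", "c.html")
print(joined_sibling)  # http://www.baidu.com/a/c.html

# A complete second URL replaces the base entirely.
joined_absolute = urljoin(
    "http://www.baidu.com/about.html", "https://www.example.com/FAQ.html"
)
print(joined_absolute)  # https://www.example.com/FAQ.html
```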

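4. urlencode (converts a dict into GET request parameters)
parse.urlencode() converts all key-value pairs of a dict into query-string format and percent-encodes Chinese and other non-ASCII characters.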
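
A short sketch of the conversion, with an assumed parameter dict and a placeholder base URL:

```python
from urllib.parse import parse_qs, urlencode

params = {"wd": "爬虫", "page": 1}
query = urlencode(params)
print(query)  # wd=%E7%88%AC%E8%99%AB&page=1

# Append the query string to a base URL to form the address of a GET request.
# The base URL is a placeholder for illustration.
url = "http://www.example.com/s?" + query
print(url)

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(query)
print(decoded)  # {'wd': ['爬虫'], 'page': ['1']}
```

The Chinese text is encoded as UTF-8 bytes and each byte is written as %XX, which is why two characters become six escape sequences.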
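IV. The robotparser module (robots.txt parsing)
[Note]
1. What is robots.txt?
robots.txt is a plain-text file in which a site administrator declares the parts of the site that robots should not visit, or specifies that search engines may index only certain content. When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines what it may access from the contents of the file. If it does not, the robot simply follows the links it finds.
2. robots.txt syntax
# robots.txt file from http://www.seovip.cn
# All robots will spider the domain
User-agent: *
Disallow:
The rules above allow all search robots to access every file under www.seovip.cn.
Text after # is a comment
User-agent: is followed by the name of the search robot
Disallow: is followed by the file or directory that must not be accessed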
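3. About Robots META
Robots META applies to individual pages. It sits in the head section of an HTML file and tells search engine robots how to crawl that page. In the tag, name="Robots" addresses all search engines; a specific one can be targeted instead, for example name="BaiduSpider". The content part has four directive options: index, noindex, follow, and nofollow. INDEX tells the robot to crawl the page, and FOLLOW tells it that it may continue along the links on the page. (The defaults are INDEX and FOLLOW, except for Inktomi, whose defaults are INDEX and NOFOLLOW.)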

Methods provided by the object returned by urllib.robotparser.RobotFileParser():

1.set_url(url)
Sets the URL referring to a robots.txt file.
2.read()
Reads the robots.txt URL and feeds it to the parser.
3. parse(lines)
Parses the lines argument.
4.can_fetch(useragent, url)
Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
5.mtime()
Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
6. modified()
Sets the time the robots.txt file was last fetched to the current time.
7.crawl_delay(useragent)
Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter or it doesn’t apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
8.request_rate(useragent)
Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter or it doesn’t apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None
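
A few of these methods can be tried without network access by feeding rules to parse() directly instead of calling set_url() and read(). The robots.txt content and the www.example.com URLs below are assumptions made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration: block /private/ and ask for a 2 second delay.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_index = parser.can_fetch("*", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("*", "http://www.example.com/private/a.html")
delay = parser.crawl_delay("*")

print(allowed_index)    # True
print(allowed_private)  # False
print(delay)            # 2
```

Against a live site the usual sequence would be set_url() with the address of the site's robots.txt, then read(), then can_fetch() for each URL the crawler wants to visit.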
