Getting Started with the Python requests Library

Latest recommended article published on 2024-02-01 09:51:56

擎天小祝

Categories: Web Scraping, python. Tags: python, web scraping

Article link: https://blog.csdn.net/zhu_lizhe/article/details/130116238

I. Introduction and Installation

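Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache License 2.0. It is mainly used to send HTTP requests.

Installing with pip is recommended. Open a command-line window and enter: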

pip install requests

If the download is slow, you can switch to a mirror index like this:

pip install <package-name> -i <index-url>
e.g.:    pip install requests -i http://mirrors.aliyun.com/pypi/simple/

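Some mirrors in mainland China (for an http:// index, recent pip versions may also require --trusted-host <host>):

Alibaba Cloud (aliyun): http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China (USTC): https://pypi.mirrors.ustc.edu.cn/simple/ (also listed as http://pypi.mirrors.ustc.edu.cn/simple/)
Douban: http://pypi.douban.com/simple/
Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/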

II. Usage

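To fetch data from a website, the requests library offers, among other methods, get and post. POST is designed for submitting data to the server, while GET is designed for retrieving data from it. As a rule of thumb: if the data you need appears in the page source, a get on the page URL is enough; if it does not appear there (it is loaded afterwards by a separate request), find that request in the browser's developer tools and use the method it uses, which is often post. The example below uses get: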

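print("A first simple scraping attempt")  # fetch the Baidu home page

import requests  # import the requests library

url = "https://www.baidu.com"  # address of the site to fetch

resp = requests.get(url)  # send a GET request to the site

print(resp.text)  # print the body of the response

Output:

A first simple scraping attempt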
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

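The printed content looks complicated and hard to read. That is fine: it is HTML (HyperText Markup Language), which we will introduce briefly later. For now, back to the program itself.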

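The get function takes several parameters:

1. url: the address of the site being requested. It is required and goes first.

2. headers: optional. The request headers, given as a dict. Sometimes printing the page source raises no error yet shows no useful content; the User-Agent may be the cause, in which case pass a headers dict that includes a User-Agent.

3. params: optional. The query parameters, given as a dict, for example kw=小狗 ("puppy") for a search.

4. verify: optional. Controls SSL certificate verification and takes a bool. If accessing a site raises SSLError, setting verify=False makes the request without verifying the site's CA certificate. This also removes the protection that verification provides, so use it only for sites you trust. A warning may then appear at run time; to silence it, add

import urllib3
urllib3.disable_warnings()

before sending the request.

5. timeout: optional. If the request takes longer than the timeout value (in seconds), get raises a requests.exceptions.Timeout exception.

6. proxies: optional. A site may block your IP address and refuse your requests; in that case this parameter lets you reach the site through a proxy IP.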
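To see what headers and params actually do, a request can be built and inspected without sending it. The sketch below is illustrative only: the search path https://www.baidu.com/s and the shortened User-Agent string are assumptions, and kw is the parameter name used in the list above. It also contrasts get with post: the same dict ends up in the URL for GET and in the request body for POST.

```python
import requests

# Assumed for illustration: Baidu's search path and a shortened User-Agent.
url = "https://www.baidu.com/s"
headers = {"User-Agent": "Mozilla/5.0"}
params = {"kw": "小狗"}

# Build the requests without sending them, so the result can be inspected offline.
get_req = requests.Request("GET", url, headers=headers, params=params).prepare()
post_req = requests.Request("POST", url, headers=headers, data=params).prepare()

print(get_req.url)    # params are percent-encoded into the query string
print(post_req.url)   # the URL is unchanged for POST
print(post_req.body)  # the same data travels in the request body instead

# The equivalent live call would look like this (not executed here):
# resp = requests.get(url, headers=headers, params=params, timeout=10)
```

Preparing the request first is just a way to look at it; in everyday code you would pass the same keyword arguments straight to requests.get or requests.post.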

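When calling get, the request sometimes fails because of a poor network or an unstable connection. A frequently quoted remedy is to put requests.adapters.DEFAULT_RETRIES = 5 before get to set how many times a failed connection is retried. Be aware that this constant is only read as a default argument when requests.adapters is first imported, so reassigning it afterwards generally has no effect; the dependable route is to mount an HTTPAdapter with max_retries=5 on a requests.Session.

Retries pair well with get's timeout parameter, for example giving up on an attempt after 30 s and then reconnecting.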
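One way to get retries that reliably take effect is sketched below. It uses a Session with an HTTPAdapter whose max_retries is a urllib3 Retry object. The helper name make_session and the backoff value are assumptions chosen for illustration, not part of the original article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff=0.5):
    """Return a Session that retries failed connections (hypothetical helper)."""
    retry = Retry(total=total_retries, backoff_factor=backoff)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both plain and TLS connections.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_session()
# A live call would combine retries with a timeout (not executed here):
# resp = session.get("https://www.baidu.com", timeout=30)
```

With this setup each attempt is bounded by timeout, and a failed connection is retried up to total_retries times before an exception reaches your code.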

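A requests call returns a Response object that stores what the server sent back. Printing resp directly shows the status code: <Response [200]> means the request succeeded, whereas <Response [400]> means the server was reached but rejected the request as malformed (other 4xx and 5xx codes likewise signal errors). If the site cannot be reached at all, no Response is returned; requests raises an exception such as ConnectionError instead.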
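The difference between an error status and a failed connection can be sketched as follows. The helper names classify and check are hypothetical, and the wording of the returned messages is only illustrative.

```python
import requests


def classify(status_code):
    """Map an HTTP status code to a rough description (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error: server reached, request rejected"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


def check(url, timeout=10):
    """Request url and describe the outcome, including connection failures."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return "timed out"
    except requests.exceptions.ConnectionError:
        return "could not connect"
    return classify(resp.status_code)
```

Calling check("https://www.baidu.com") would perform a real request; classify(resp.status_code) alone can be applied to any response you already have.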

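Besides printing resp itself, you can print resp.text, the body of the response, that is, the source of the requested page. Sometimes the result contains many unreadable characters. This happens when the character set used to decode the body differs from the one the page is written in: requests takes resp.encoding from the response headers and, for text content with no charset declared there, falls back to ISO-8859-1; printing to a Windows console, whose default character set is GBK, can add further trouble. In the output above, the second line contains charset=utf-8, which shows that the Baidu page source uses utf-8, so setting resp.encoding="utf-8" before printing gives correctly decoded source.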
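Why a wrong character set produces unreadable text can be shown with plain bytes, without any network access. In this sketch the bytes stand in for what resp.content would hold; the variable names are illustrative.

```python
title = "百度一下，你就知道"        # text as it appears in the page
raw = title.encode("utf-8")         # bytes as sent by the server (like resp.content)

garbled = raw.decode("iso-8859-1")  # decoding with the wrong character set
correct = raw.decode("utf-8")       # decoding with the page's declared charset

print(ascii(garbled))               # escaped form of the unreadable characters
print(correct == title)             # True

# With a real response the same fix is applied before reading resp.text:
# resp.encoding = "utf-8"                  # or: resp.encoding = resp.apparent_encoding
# print(resp.text)
```

resp.apparent_encoding asks requests to guess the character set from the body itself, which helps when the page does not declare one.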

III. A First Encounter with Anti-Scraping

Open www.baidu.com in a desktop browser and right-click to view the page source. The source is very long and looks nothing like what we obtained earlier.

Excerpt of the actual Baidu page source:
</script><div id="head"><div id="s_top_wrap" class="s-top-wrap s-isindex-wrap "><div class="s-top-nav"></div><div class="s-center-box"></div></div><div id="u"><a class="toindex" href="/">百度首页</a><a href="javascript:;" name="tj_settingicon" class="pf">设置<i class="c-icon c-icon-triangle-down"></i></a><a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5" name="tj_login" class="lb" onclick="return false;">登录</a><div class="bdpfmenu"></div></div><div id="s-top-left" class="s-top-left-new s-isindex-wrap"><a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">新闻</a><a href="https://www.hao123.com?

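Looking closely at the data obtained earlier, some elements present on the page, such as the Baidu hot-search list, are missing. This suggests an anti-scraping measure: Baidu did not give us the information we wanted. The fix is simple and uses the optional headers parameter of get described above. Right-click the Baidu page and choose Inspect, open the Network tab, and refresh the page. Click any entry, open its Headers pane, and scroll down to Request Headers, where you will find a user-agent. Copy it and send it to the site as the headers argument: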

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

resp=requests.get(url,headers=headers)

Printing the page content again after this should give essentially what the browser shows.
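A likely reason the site could tell the script apart from a browser is the User-Agent that requests sends when none is supplied. The sketch below prints that default and shows one way to set a browser-like value once on a Session so every request carries it; using a Session here is a design choice for illustration, not something the original article does.

```python
import requests

# The User-Agent requests sends when no headers are given.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/" followed by the version

# Set a browser-like User-Agent once; all requests made through this session reuse it.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
})
print(session.headers["User-Agent"])

# A live call would then be (not executed here):
# resp = session.get("https://www.baidu.com", timeout=10)
```

A User-Agent only addresses this simplest check; sites may apply other measures that a header alone does not get around.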