Getting Started with the Requests Library: Python Web Crawling and Information Extraction 1 (BIT MOOC)

白金燐燐

Article link: https://blog.csdn.net/jjrm123/article/details/107444634

Getting Started with the Requests Library

Install it by running the following command in cmd:

pip install requests

Main methods of the Requests library:
[Figure: main methods of the Requests library]

The get() method

The simplest way to fetch a web page:

r = requests.get(url, params=None, **kwargs)

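Here r is the Response object (the name is case-sensitive) that requests.get(url) returns; it holds everything the server sent back. Internally, get() builds a Request object that asks the server for the resource.
The url argument of get() is the page link; params holds extra parameters to append to the URL, as a dict or byte sequence; **kwargs covers the other 12 optional parameters.
Attributes of the Response object:
[Figure: attributes of the Response object]
The basic flow when fetching a resource with get(): check r.status_code first. If it is 200, the other attributes can be used to read the result; otherwise the request hit an error or exception.
Example: fetching the Baidu home page: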

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>> r.encoding  # the text is garbled, so check which encoding was used
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = "utf-8"  # switch to the correct encoding
>>> r.text  # the garbled characters are gone
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

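r.encoding is taken from the charset field of the response's Content-Type header. If there is no charset field, the encoding defaults to ISO-8859-1, which cannot represent Chinese.
r.apparent_encoding is a guess at the encoding, made by analysing the content itself.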
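
To see why the first r.text above is garbled, here is a small sketch that uses only the standard library rather than Requests itself. It decodes the same UTF-8 bytes once as ISO-8859-1 and once as UTF-8. The sample string and the variable names are chosen for illustration only.

```python
# The page title as UTF-8 bytes, roughly what the server sends on the wire.
raw = "百度一下".encode("utf-8")

# Decoding with the fallback charset never fails, but it yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the charset the content is really in restores the text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)

# The garbled text can be repaired by reversing the wrong decoding step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == fixed)
```

Setting r.encoding = "utf-8" in the session above has the same effect as the second decode: the bytes stay the same, only the charset used to read them changes.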

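A general code framework for fetching pages

get() is not guaranteed to succeed, so exception handling matters when crawling pages.
Requests exceptions:
[Figure: Requests exceptions]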

ConnectTimeout means that connecting to the server timed out, whereas Timeout covers the whole process from connecting to receiving the response.
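
The relationship between these exceptions can be checked without touching the network. The sketch below assumes the class layout in requests.exceptions (ConnectTimeout as a subclass of Timeout, and everything deriving from RequestException); the classify helper is a made-up name for illustration.

```python
from requests.exceptions import (
    ConnectTimeout,
    HTTPError,
    RequestException,
    Timeout,
)


def classify(exc):
    """Return a short label for an exception raised by Requests."""
    # Test the most specific class first: ConnectTimeout is also a Timeout.
    if isinstance(exc, ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, Timeout):
        return "timeout"
    if isinstance(exc, HTTPError):
        return "http error"
    if isinstance(exc, RequestException):
        return "other requests error"
    return "not a requests error"


print(issubclass(ConnectTimeout, Timeout))
print(classify(ConnectTimeout()))
print(classify(Timeout()))
```

Because the classes nest this way, an except clause for Timeout also catches ConnectTimeout. Requests also accepts a (connect, read) tuple for timeout, for example timeout=(3, 30), when the two phases need different limits.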
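General code framework:

import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # covers timeouts, connection errors and HTTPError
        return "An exception occurred"


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))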

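The HTTP protocol

HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on a request/response model. Stateless means that one request is not tied to another; application-layer means that the protocol runs on top of TCP.
It uses URLs to identify and locate network resources.
URL format: http://host[:port][path]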

host: a valid Internet host name or IP address
port: optional port number; the default is 80
path: the path of the requested resource
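
The three parts can be pulled out of a URL with the standard library. This is a minimal sketch; describe is a hypothetical helper name, and the second URL is made up from the proxy address used later in these notes.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split an HTTP URL into host, port and path."""
    parts = urlsplit(url)
    host = parts.hostname
    # parts.port is None when the URL omits the port; HTTP then uses 80.
    port = parts.port if parts.port is not None else 80
    # An empty path refers to the root of the site.
    path = parts.path or "/"
    return host, port, path


print(describe("http://www.baidu.com"))
print(describe("http://10.10.10.1:1234/ws/index.html"))
```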
HTTP operations on resources:

requests.request()

requests.request(method, url, **kwargs)
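**kwargs stands for the 13 other optional parameters. method is the request type, one of the following seven (each behaves like the matching shortcut method such as get()):
[Figure: the seven request methods, i.e. GET, HEAD, POST, PUT, PATCH, DELETE and OPTIONS]
OPTIONS is generally used to query the parameters that the server uses when talking to clients, and is rarely needed.
The 13 optional parameters: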

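params: dict or byte sequence, appended to the URL as query parameters
data: dict, byte sequence or file object, sent as the body of the Request; used when submitting resources to the server
json: data in JSON format, sent as the body of the Request
headers: dict of custom HTTP headers
cookies: dict or CookieJar, the cookies to send with the Request
auth: tuple, enables HTTP authentication
files: dict, for uploading files
timeout: timeout in seconds
proxies: dict that sets the proxy servers to go through; login credentials can be included
allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when False the response body is downloaded immediately
verify: True/False, default True; switch for verifying the SSL certificate
cert: path to a local SSL certificate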

------------params parameter
>>> kv = {"key1":"value1", "key2":"value2"}
>>> r = requests.request("GET", "http://python123.io/ws", params=kv)
>>> r.url
'https://python123.io/ws?key1=value1&key2=value2'
>>> print(r.url)
https://python123.io/ws?key1=value1&key2=value2


------------data parameter
>>> kv = {"key1":"value1", "key2":"value2"}
>>> r = requests.request("POST", "http://python123.io/ws", data=kv)
>>> body = "主题内容"
>>> r = requests.request("POST", "http://python123.io/ws", data=body.encode('utf-8'))


------------json parameter
>>> kv = {"key1":"value1"}
>>> r = requests.request("POST", "http://python123.io/ws", json=kv)


------------headers parameter
>>> hd = {'user-agent':'Chrome/10'}
>>> r = requests.request("POST", "http://python123.io/ws", headers=hd)
# Override the user-agent field so the request presents itself as Chrome 10


------------files parameter
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)


------------proxies parameter
>>> pxs = {'http':'http://user:pass@10.10.10.1:1234','https':'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
# The request goes through the proxies in pxs, which hides the crawler's own address and makes it harder to trace back
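
What each of these parameters does to the outgoing request can be inspected without sending anything, by preparing the request locally. The sketch below relies on requests.Request(...).prepare(), which returns a PreparedRequest; the URL and values are the ones used in the examples above, and the single-letter variable names are just for illustration.

```python
import json

import requests

url = "http://python123.io/ws"

# params end up in the query string of the URL.
p = requests.Request("GET", url, params={"key1": "value1", "key2": "value2"}).prepare()
print(p.url)

# data given as a dict is form-encoded into the request body.
d = requests.Request("POST", url, data={"key1": "value1"}).prepare()
print(d.body, d.headers["Content-Type"])

# json is serialized into the body and labelled as JSON.
j = requests.Request("POST", url, json={"key1": "value1"}).prepare()
print(json.loads(j.body), j.headers["Content-Type"])

# headers are passed through; lookups on them are case-insensitive.
h = requests.Request("GET", url, headers={"user-agent": "Chrome/10"}).prepare()
print(h.headers["User-Agent"])
```

Preparing a request this way is handy for debugging, since nothing leaves the machine until the prepared request is handed to a Session.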

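For security reasons, most servers do not accept uploaded resources, so crawlers generally use only get() and head() (the latter for URLs that point to very large resources).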

Applications of web crawlers

Scales of web crawlers:
[Figure: scales of web crawlers]

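Problems caused by web crawlers

Nuisance: web servers are generally built to serve human visitors, while a crawler is a machine hitting the server. Depending on the skill and intent of its author, a crawler can sometimes put a huge load on the server.

Legal risk: the data on a server has an owner, and profiting from crawled data carries legal risk.

Privacy leaks: crawlers can get past simple access controls and obtain protected data, exposing personal information.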

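Restrictions on web crawlers

Source check: the server inspects the User-Agent field of the incoming HTTP headers and responds only to browsers or well-behaved crawlers. The user-agent override shown earlier gets past this check.
Published notice: the Robots protocol tells all crawlers what the site's crawling policy is and asks them to follow it.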
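
The source check works because Requests identifies itself by default. The following sketch shows the default value and how a custom header replaces it when a Session prepares the request; nothing is sent over the network. It assumes requests.utils.default_user_agent() and Session.prepare_request() behave as in current Requests releases.

```python
import requests

# What Requests announces itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A Session merges its default headers with the per-request ones,
# so the custom value replaces the default.
session = requests.Session()
req = requests.Request(
    "GET", "http://www.baidu.com/s", headers={"user-agent": "Chrome/10"}
)
prepared = session.prepare_request(req)
print(prepared.headers["User-Agent"])
```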

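Robots protocol: the Robots Exclusion Standard, normally published as a robots.txt file in the site's root directory.
https://www.jd.com/robots.txt is JD.com's file; here it is with my annotations: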

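# Annotations: * means all, / means the root directory

User-agent: *    # applies to every crawler
Disallow: /?*      # paths that start with ? must not be accessed
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider      # applies to EtaoSpider
Disallow: /      # no resource may be crawled; these two lines mark this crawler as malicious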
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /
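
Rules like these can be checked programmatically with urllib.robotparser from the standard library. The sketch below parses a shortened copy of the file from a string instead of downloading it; "MyCrawler" is a made-up crawler name. As far as I know, the standard-library parser matches paths by prefix and does not expand * inside a path, so only the plain rules are exercised here.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules quoted above.
rules = """\
User-agent: *
Disallow: /pop/*.html
User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "MyCrawler" is not named in the file, so only the * group applies to it.
print(parser.can_fetch("MyCrawler", "https://www.jd.com/"))

# EtaoSpider has its own group, which forbids everything.
print(parser.can_fetch("EtaoSpider", "https://www.jd.com/"))
```

For a live site, set_url() followed by read() fetches the file before can_fetch() is called.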

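In principle, human-like access (infrequent, for example once an hour, with little content fetched each time) does not have to follow the protocol.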

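Search engine keyword submission interfaces

Baidu:
http://www.baidu.com/s?wd=keyword
360:
http://www.so.com/s?q=keyword
Replace keyword with the search term to submit it.
To see the URL that was actually sent to Baidu, use r.request.url: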

>>> import requests
>>> kv = {'wd':'Python'}
>>> hd = {'user-agent':'Chrome/10'}  # without a browser-like user-agent, Baidu redirects to a verification page
>>> r = requests.get('http://www.baidu.com/s', headers=hd, params=kv)
>>> len(r.text)
558604
>>> r.request.url  # the URL that was sent to Baidu
'http://www.baidu.com/s?wd=Python'
>>> r.encoding
'utf-8'
>>> r.text[:1000]
'<!DOCTYPE html>\n<!--STATUS OK-->\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\n\n\n<html>\n\t<head>\n\t\t\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\t\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\n\t\t<meta content="always" name="referrer">\n        <meta name="theme-color" content="#2932e1">\n        <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />\n        <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">\n        <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" />\n\t\t\n\t\t\n<title>Python_百度搜索</title>\n\n\t\t\n\n\t\t\n<style data-for="result" type="text/css" id="css_newi_result">body{color:#333;background:#fff;padding:6px 0 0;margin:0;position:relative}\nbody,th,td,.p1,.p2{font-family:arial}\np,form,ol,ul,li,dl,dt,dd,h3{margin:0;padding:0;list-'
>>>
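
The same idea covers both interfaces listed above. Here is a minimal sketch that only builds the search URLs, using the standard library; ENGINES and search_url are hypothetical names, and the base URLs and parameter names are the ones given in these notes.

```python
from urllib.parse import urlencode

# Base URL and keyword parameter name for each interface listed above.
ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}


def search_url(engine, keyword):
    """Build the search URL for the given engine and keyword."""
    base, key = ENGINES[engine]
    # urlencode escapes spaces and non-ASCII text in the keyword.
    return base + "?" + urlencode({key: keyword})


print(search_url("baidu", "Python"))
print(search_url("360", "web crawler"))
```

Passing params to requests.get(), as in the session above, performs this encoding automatically; building the URL by hand is mostly useful for logging or testing.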

Crawling images from the web

Format of an image link: http://www.example.com/picture.jpg
Example:

>>> import requests
>>> url = "https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/history-magazine/2019/07-08/thamugadi/04-trajan-immersive.adapt.1900.1.jpg"
>>> root = "F:/"
>>> path = root + url.split("/")[-1]  # name the file after the last URL segment
>>> r = requests.get(url)
>>> with open(path, 'wb') as f:  # the with block closes the file automatically
...     f.write(r.content)  # r.content is the response body as bytes; images must be saved in binary form
...
718199

Code template for image crawling:
[Figure: code template for image crawling]
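
The original template image is not available, so the following is only one possible version, built from the interactive example and the general code framework earlier in these notes. The function names, the default "pics" directory and the returned messages are all assumptions.

```python
import os

import requests


def target_path(url, root):
    """Build the local file path from the last segment of the URL."""
    return os.path.join(root, url.split("/")[-1])


def save_image(url, root="pics"):
    """Download url into root unless the file is already there."""
    path = target_path(url, root)
    try:
        # Create the target directory on first use.
        os.makedirs(root, exist_ok=True)
        if os.path.exists(path):
            return "file already exists"
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # r.content holds the raw bytes, so the file is opened in binary mode.
        with open(path, "wb") as f:
            f.write(r.content)
        return "file saved"
    except (requests.RequestException, OSError):
        return "crawl failed"
```

A call such as save_image("http://www.example.com/picture.jpg") would then store picture.jpg under pics, and a second call with the same URL would skip the download.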

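Looking up the location of an IP address

Lookup site: https://www.ip138.com/
Just as with submitting a keyword to a search engine, an IP address can be submitted to the site to get the result.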
