Python Web Crawling and Information Extraction (Beijing Institute of Technology MOOC): Study Notes 1

陆空生

Article link: https://blog.csdn.net/weixin_43754153/article/details/105599924

Getting started with the Requests library

Installing the requests library

pip install requests

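The 7 main methods of the Requests library

Method	Purpose
requests.request()	Constructs a request; the base method that supports all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE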

requests.get()

r=requests.get(url)
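This constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resource.

Full signature of the get method:
requests.get(url,params=None,**kwargs)
url: the URL of the page to fetch
params: extra parameters to append to the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the request

The source code of get() shows that it simply calls request(); the request() method is the foundation.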
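
That get() is built on request() can be checked without touching the network. The sketch below temporarily replaces requests.api.request with a mock object and records what get() passes to it. It assumes get() lives in the requests.api module, which holds for the requests releases I am aware of; the URL is only a placeholder.

```python
from unittest import mock

import requests

url = "http://www.baidu.com"  # placeholder; no request is actually sent

# Swap out the underlying request() function and see how get() calls it.
with mock.patch("requests.api.request") as fake_request:
    requests.get(url, timeout=30)

method, called_url = fake_request.call_args[0]  # positional arguments
forwarded = fake_request.call_args[1]           # keyword arguments
print(method, called_url)
print(forwarded)
```

The first positional argument is the HTTP method name and the second is the URL; everything else, such as timeout, is forwarded as keyword arguments.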

The Response object

>>> import requests
>>> r=requests.get("http://www.baidu.com")
>>> print(r.status_code)
200		# a status code of 200 means the request succeeded
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sat, 18 Apr 2020 08:11:18 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

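Attributes of the Response object

Attribute	Description
r.status_code	The HTTP status returned for the request; 200 means the request succeeded, 404 means it failed
r.text	The HTTP response body as a string, i.e. the content of the page at the URL
r.encoding	The encoding of the response body, guessed from the HTTP header
r.apparent_encoding	The encoding of the response body, inferred from the content itself (a fallback encoding)
r.content	The HTTP response body in binary form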

>>> import requests
>>> r=requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding="utf-8"
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

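r.encoding: if the header does not declare a charset, the encoding is assumed to be ISO-8859-1. This is why the first r.text above is garbled until r.encoding is set to utf-8.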
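
The fallback can be reproduced offline. This is a minimal sketch: it calls requests.utils.get_encoding_from_headers on hand-written header dictionaries (the header values are made up for illustration), then shows how decoding UTF-8 bytes as ISO-8859-1 produces the kind of garbled text seen in the session above.

```python
from requests.utils import get_encoding_from_headers

# No charset declared for a text/* type: the guess falls back to ISO-8859-1.
guessed = get_encoding_from_headers({"content-type": "text/html"})
# A declared charset is used as-is.
declared = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})
print(guessed, declared)

# Decoding UTF-8 bytes with the wrong codec garbles the Chinese text.
raw = "百度一下".encode("utf-8")
garbled = raw.decode("iso-8859-1")
fixed = raw.decode("utf-8")
print(repr(garbled))
print(fixed)
```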

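A general code framework for fetching pages

Exception handling is important.
Exceptions in the Requests library:

Exception	Description
requests.ConnectionError	Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	The URL is missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out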
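
How these exceptions relate to each other can be inspected directly. The sketch below uses the spellings that requests actually exports (ConnectTimeout and Timeout) and only examines the class hierarchy, so nothing is sent over the network.

```python
import requests

# ConnectTimeout is both a ConnectionError and a Timeout.
is_connection_error = issubclass(requests.ConnectTimeout, requests.ConnectionError)
is_timeout = issubclass(requests.ConnectTimeout, requests.Timeout)
print(is_connection_error, is_timeout)

# Every exception in the table derives from requests.RequestException,
# so a single except clause can catch them all.
table = [
    requests.ConnectionError,
    requests.HTTPError,
    requests.URLRequired,
    requests.TooManyRedirects,
    requests.ConnectTimeout,
    requests.Timeout,
]
all_derived = all(issubclass(exc, requests.RequestException) for exc in table)
print(all_derived)
```

Catching requests.RequestException is therefore a narrower alternative to a bare except when only request failures should be handled.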

Method	Description
r.raise_for_status()	Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)
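
The behavior of raise_for_status() can be tried without a server. In this sketch the Response objects are built by hand purely for illustration (real ones come from requests.get and friends), and check is a hypothetical helper name.

```python
import requests


def check(status_code):
    """Return 'ok' or the HTTPError message for a hand-built Response."""
    r = requests.Response()  # built by hand, no network involved
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as err:
        return f"HTTPError: {err}"


for code in (200, 301, 404, 500):
    print(code, check(code))
```

Only the 4xx and 5xx codes raise; 200 and 301 pass through silently.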

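The general code framework:

import requests


def getHTMLText(url):
    """Return the page text at url, or an error message if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))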

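The HTTP protocol

HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on a request-response model.
The user sends a request and the server returns a response.
HTTP uses URLs to identify and locate network resources.
URL format: http://host[:port][path]
host: a valid Internet host domain name or IP address
port: the port number; the default is 80
path: the path of the requested resource
A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.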
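
The three URL parts can be pulled apart with the standard library. A minimal sketch, where describe is a hypothetical helper and the example.com URLs are placeholders:

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into (host, port, path), defaulting the port to 80."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80  # default for http
    path = parts.path or "/"
    return parts.hostname, port, path


print(describe("http://www.example.com/picture.jpg"))
print(describe("http://www.example.com:8080/docs/index.html"))
```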
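HTTP operations on resources

Method	Description
GET	Requests the resource at the URL
HEAD	Requests the response headers for the resource at the URL, i.e. only the resource's header information
POST	Requests that new data be appended to the resource at the URL
PUT	Requests that a resource be stored at the URL, overwriting the resource already there
PATCH	Requests a partial update of the resource at the URL, changing only part of its content
DELETE	Requests deletion of the resource stored at the URL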

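The network channel and the server are both black boxes; all that can be seen is the URL and the operations performed on it.
Compared with PUT, PATCH can save network bandwidth because only the changed part is sent.
POSTing a dictionary to a URL automatically encodes it as a form.

POSTing a string to a URL automatically encodes it as data.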
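
The difference between the two cases shows up in the prepared request, before anything is sent. This is a sketch only: http://httpbin.org/post is an assumed test endpoint, and the key/value pairs are made up.

```python
import requests

url = "http://httpbin.org/post"  # assumed test endpoint; nothing is sent here

# A dictionary is encoded as form fields.
form = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()
print(form.headers["Content-Type"], form.body)

# A string is passed through unchanged as the raw request body.
raw = requests.Request("POST", url, data="ABC").prepare()
print(raw.headers.get("Content-Type"), raw.body)
```

For the dictionary, requests sets the Content-Type header to the form type; for the string it sets no Content-Type and leaves the body as given.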

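Main methods of the Requests library in detail

request()

The request() method is the foundation.

json: JSON is the data format most commonly used in web development involving HTTP and HTML.

timeout: if no response arrives within the time set by timeout, a timeout exception is raised.

proxies: routing requests through a proxy server hides the crawler's original IP address, which guards against the crawler being traced back.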

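requests.get(url,params=None,**kwargs)

url: the URL of the page to fetch
params: extra parameters to append to the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the request

requests.head(url,**kwargs)

url: the URL of the page to fetch
**kwargs: 13 parameters that control the request

requests.post(url,data=None,json=None,**kwargs)

url: the URL of the page to update
data: a dictionary, byte sequence, or file, sent as the content of the Request
json: data in JSON format, sent as the content of the Request
**kwargs: 11 parameters that control the request

requests.put(url,data=None,**kwargs)

url: the URL of the page to update
data: a dictionary, byte sequence, or file, sent as the content of the Request
**kwargs: 12 parameters that control the request

requests.patch(url,data=None,**kwargs)

url: the URL of the page to update
data: a dictionary, byte sequence, or file, sent as the content of the Request
**kwargs: 12 parameters that control the request

requests.delete(url,**kwargs)

url: the URL of the page to delete
**kwargs: 13 parameters that control the request

The most commonly used method is get(), followed by head().
If the resource at a URL is very large, use head() to fetch just a summary from its headers.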

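Web crawlers: "honor among thieves"

Scale of web crawlers

1. Small scale: small data volume, crawl speed is not critical; use the Requests library (crawling pages)
2. Medium scale: larger data volume, crawl speed matters; use the Scrapy library (crawling sites)
3. Large scale: search engines, crawl speed is crucial; custom development (crawling the whole web)

Web crawlers can impose a heavy resource load on web servers.

Restrictions on web crawlers

Source checking: restrict by User-Agent (inspect the User-Agent field of the incoming HTTP headers and respond only to browsers or friendly crawlers)
Public notice: the Robots protocol (tell all crawlers the site's crawling policy and ask them to follow it)

The Robots protocol

Robots Exclusion Standard
Purpose: a site tells web crawlers which pages may be fetched and which may not
Form: a robots.txt file in the site's root directory

Basic syntax of the Robots protocol:

# starts a comment, * stands for all crawlers, / stands for the root directory
User-agent: *
Disallow: /

If a site has no Robots protocol, any crawler may crawl it.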
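
Rules in this syntax can be evaluated with the standard library's urllib.robotparser. The sketch below feeds rule lines in directly instead of downloading a robots.txt; the crawler name MyCrawler, the example.com URLs, and the second rule set are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


block_all = ["User-agent: *", "Disallow: /"]
block_private = ["User-agent: *", "Disallow: /private/"]  # hypothetical rule set

print(allowed(block_all, "MyCrawler", "http://www.example.com/index.html"))
print(allowed(block_private, "MyCrawler", "http://www.example.com/index.html"))
print(allowed(block_private, "MyCrawler", "http://www.example.com/private/a.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().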

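Complying with the Robots protocol

Web crawlers: identify robots.txt automatically or manually, then crawl the content
Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk
Access that resembles human behavior need not consult the Robots protocol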

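Requests library web crawling in practice

Crawling a JD.com product page

Full code from the course:

import requests
url = "http://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # show only the first 1000 characters
except requests.RequestException:
    print("Fetch failed")

When I actually ran it myself, though, what came back was JD's user login page.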

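Crawling an Amazon product page

Fetching the page returns an unexpected error (caused by the API).
In r.request.headers, the User-Agent value is 'python-requests/2.22.0'.
Change the User-Agent so that the crawler imitates a browser: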

import requests
kv = {'user-agent':'Mozilla/5.0'}
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
r = requests.get(url,headers=kv)
print(r.status_code)  # the status code is now 200
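
The effect of the headers argument can be seen without contacting Amazon. This sketch prepares the request through a Session, so that the library's default headers are merged in, and inspects the result instead of sending it.

```python
import requests

# The default User-Agent that requests announces, e.g. "python-requests/2.22.0".
default_ua = requests.utils.default_user_agent()
print(default_ua)

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
kv = {"user-agent": "Mozilla/5.0"}

# Prepare the request through a Session so default headers are merged in,
# but do not send it.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=kv))
print(prepared.headers["User-Agent"])
```

Header names are matched case-insensitively, so the lowercase 'user-agent' key replaces the default User-Agent rather than adding a second header.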

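Full code:

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # present the crawler as a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Fetch failed")

When I ran it, the API problem still occurred (perhaps Amazon has updated its screening mechanism again).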

Submitting search keywords to Baidu/360

Keyword submission interfaces of the search engines:

Baidu's keyword interface:
http://www.baidu.com/s?wd=keyword
360's keyword interface:
http://www.so.com/s?q=keyword
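
Both interfaces map onto the params argument of requests. The sketch below only builds the URL that would be requested, without sending anything; search_url is a hypothetical helper name.

```python
import requests


def search_url(base, key, keyword):
    """Return the URL requests would send for a keyword query (not sent)."""
    return requests.Request("GET", base, params={key: keyword}).prepare().url


print(search_url("http://www.baidu.com/s", "wd", "Python"))
print(search_url("http://www.so.com/s", "q", "Python"))
```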

>>> import requests
>>> kv={'wd':'Python'}
>>> r=requests.get("http://www.baidu.com/s",params=kv)
>>> r.status_code
200
>>> r.request.url
'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3DPython&logid=8627790867845123754&signature=8c6cf7300c8bb315f6f358e46dbd00a5&timestamp=1587210680'
>>> len(r.text)
1519

Full code:

import requests
keyword = "Python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Fetch failed")

Fetching and saving images from the web

Format of image links on the web

http://www.example.com/picture.jpg

>>> import requests
>>> path="D:/abc.jpg"
>>> url="http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
>>> r= requests.get(url)
>>> r.status_code
200
>>> with open(path,'wb')as f:
...    f.write(r.content)  # write the image to the file as binary data
...
228206

Full code:

import requests
import os
url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Fetch failed")
Automatic lookup of an IP address's location

The lookup is done on the IP138 website.
The interface has the form:

http://m.ip138.com/ip=ipaddress
