Python Web Crawling and Information Extraction

愿热爱常在

Article link: https://blog.csdn.net/qq_39427413/article/details/99191373

Common Python IDE tools

Text-editor IDEs

IDLE
Notepad++
Sublime Text
Vim & Emacs
Atom
Komodo Edit

Integrated IDEs

PyCharm
Wing
PyDev & Eclipse
Visual Studio
Anaconda & Spyder
Canopy

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> # 200 means the request succeeded
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>>

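The 7 main methods of the Requests library

requests.request() Constructs a request; the base method underlying all the methods below
requests.get() The main method for fetching an HTML page; corresponds to HTTP GET
requests.head() Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post() Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put() Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch() Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete() Submits a delete request to an HTML page; corresponds to HTTP DELETE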

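The get() method of the Requests library

r = requests.get(url) fetches the page at url and returns a Response object
requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch
params: extra parameters appended to the URL, as a dict or byte stream; optional
**kwargs: 12 parameters that control the request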

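Attributes of the Response object

r.status_code The status code returned for the HTTP request; 200 means success, 404 means the request failed
r.text The HTTP response body as a string, i.e. the content of the page at url
r.encoding The encoding of the response body as guessed from the HTTP header
r.apparent_encoding The encoding of the response body as inferred from the content itself
r.content The HTTP response body in binary form

r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1
r.apparent_encoding: the encoding inferred by analyzing the page content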
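
The header rule for r.encoding can be checked without any network access. The sketch below assumes the helper requests.utils.get_encoding_from_headers, which Requests uses to derive the encoding from the headers; the header dictionaries and the sample string are made up for illustration.

```python
import requests

# Header without a charset: the fallback for text content is ISO-8859-1.
no_charset = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html"})
# Header with an explicit charset: that charset is used.
with_charset = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"})
print(no_charset, with_charset)

# Decoding UTF-8 bytes with the wrong encoding gives garbled text,
# which is why r.encoding is set before r.text is read.
raw = "百度一下".encode("utf-8")
garbled = raw.decode("iso-8859-1")
correct = raw.decode("utf-8")
print(garbled != correct, correct)
```

When the header carries no charset and the page is not Latin-1, assigning r.apparent_encoding to r.encoding is the usual remedy in these notes.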

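Understanding Requests exceptions

requests.ConnectionError
Network connection error, e.g. DNS lookup failure or connection refused
requests.HTTPError
HTTP error
requests.URLRequired
A URL is required but missing
requests.TooManyRedirects
The maximum number of redirects was exceeded
requests.ConnectTimeout
Timed out while connecting to the remote server
requests.Timeout
The request to the URL timed out

The Response object provides the raise_for_status() method:
r.raise_for_status() raises requests.HTTPError if the status code indicates an error (4xx or 5xx)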
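
To see which status codes make raise_for_status() raise, a Response object can be built locally instead of being fetched. This is only a sketch: the helper name status_is_error is hypothetical, and a hand-built Response stands in for a real server reply.

```python
import requests

def status_is_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()      # an empty Response, built locally (no network)
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 301, 404, 500):
    print(code, status_is_error(code))

# The exceptions listed above derive from requests.RequestException,
# and ConnectTimeout is both a ConnectionError and a Timeout.
print(issubclass(requests.ConnectTimeout, requests.Timeout))
print(issubclass(requests.HTTPError, requests.RequestException))
```

Because all of these exceptions share the RequestException base class, a single except clause on it covers connection, timeout and HTTP errors.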

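# General code framework for fetching a web page
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except requests.RequestException:
        # covers connection errors, timeouts and HTTP errors
        return "Request failed"


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))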

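The HTTP protocol and Requests library methods

HTTP: Hypertext Transfer Protocol

HTTP is a stateless application-layer protocol based on a "request and response" model
HTTP uses URLs to identify and locate network resources

URL format: http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; defaults to 80
path: the path of the requested resource

A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource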
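
The host, port and path parts can be pulled out of a URL with the standard library. A minimal sketch, assuming an http:// URL; the function name describe_url and the example.com addresses are placeholders.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split an http URL into its host, port and path parts."""
    parts = urlsplit(url)
    # .port is None when the URL has no explicit port; HTTP then uses 80.
    port = parts.port if parts.port is not None else 80
    return parts.hostname, port, parts.path

print(describe_url("http://www.example.com:8080/docs/index.html"))
print(describe_url("http://www.example.com/docs/index.html"))
```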

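HTTP operations on resources

GET Request the resource at the URL
HEAD Request the response headers for the resource at the URL, i.e. only its header information
POST Request that new data be appended to the resource at the URL
PUT Request that a resource be stored at the URL, overwriting the resource already there
PATCH Request a partial update of the resource at the URL, i.e. change part of its content
DELETE Request deletion of the resource stored at the URL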

The main Requests methods in detail

requests.request(method, url, **kwargs)

method: the request method, one of 7 such as GET/PUT/POST
url: the URL of the page to fetch
**kwargs: parameters that control the request, 13 in total

method: the request method

r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

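**kwargs: parameters that control the request; all are optional

params: a dict or byte sequence, appended to the URL as query parameters

>>> import requests
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
https://python123.io/ws?key1=value1&key2=value2
# With params, the key-value pairs are appended to the URL when the resource is requested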
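
The effect of params can also be inspected without sending anything. The sketch below uses requests.Request(...).prepare(), which builds the final request locally; since no redirect takes place, the URL keeps the http scheme it was given.

```python
import requests

kv = {"key1": "value1", "key2": "value2"}
# prepare() builds the outgoing request in memory; nothing is sent.
prepared = requests.Request("GET", "http://python123.io/ws", params=kv).prepare()
print(prepared.method)
print(prepared.url)
```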

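data: a dict, byte sequence or file object, sent as the body of the Request
json: data in JSON format, sent as the body of the Request
headers: a dict of custom HTTP headers
cookies: a dict or CookieJar, the cookies sent with the Request
files: a dict, used to upload files
timeout: the timeout, in seconds
proxies: a dict that sets the proxy servers to use; login credentials can be included
allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred
verify: True/False, default True; switch for verifying the SSL certificate
cert: path to a local SSL certificate
auth: a tuple, enables HTTP authentication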
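
How data, json and headers shape the outgoing request can be sketched by preparing requests locally. The URL and the header value below are placeholders chosen for illustration, and nothing is sent over the network.

```python
import json
import requests

URL = "http://example.com/post"  # placeholder URL, never contacted

# data=dict is form-encoded into the request body.
form = requests.Request("POST", URL, data={"key1": "value1"}).prepare()
# json=... is serialized to JSON and sets the matching Content-Type.
as_json = requests.Request("POST", URL, json={"key1": "value1"}).prepare()
# headers=dict sets custom HTTP headers.
custom = requests.Request("GET", URL, headers={"User-Agent": "Chrome/10"}).prepare()

print(form.body, form.headers["Content-Type"])
print(json.loads(as_json.body), as_json.headers["Content-Type"])
print(custom.headers["User-Agent"])
```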

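Scale of web crawlers

Crawling web pages (playing with pages): small scale, small data volume, crawl speed not critical; Requests library
Crawling websites (or a series of sites): medium scale, larger data volume, crawl speed matters; Scrapy library
Crawling the whole web: large scale, search engines, crawl speed is critical; custom development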

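Restrictions on web crawlers

Source review: restrict access by checking the User-Agent

The site checks the User-Agent field of the incoming HTTP headers and responds only to browsers or friendly crawlers

Public notice: the Robots protocol

The Robots protocol

Purpose: the website tells crawlers which pages may be crawled and which may not
Form: a robots.txt file in the site's root directory

Using the Robots protocol

Crawlers: identify robots.txt automatically or manually, then crawl accordingly
Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk

Access that resembles human behavior may do without following the Robots protocol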
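
Automatic recognition of robots.txt can be sketched with the standard library's urllib.robotparser. The rules, crawler names and example.com URLs below are made up, and the file is parsed from a string rather than fetched from a site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory instead of fetched from a site.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("MyCrawler", "http://example.com/index.html"))
print(allowed("MyCrawler", "http://example.com/private/a.html"))
print(allowed("BadBot", "http://example.com/index.html"))
```

For a live site, the same parser can load the file with set_url() followed by read() before can_fetch() is called.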

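When a page fetch fails, one option is to disguise the request headers:

import requests
# Example: fetching an Amazon product page
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    # 'Mozilla/5.0' is a generic browser identification string
    kv = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Fetch failed")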
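
Why a site can tell a script from a browser becomes visible when the default and the customized User-Agent are compared. This sketch prepares both requests through a Session without sending them; the URL is a placeholder.

```python
import requests

URL = "http://example.com/"  # placeholder URL, never contacted
session = requests.Session()

# Without a custom header, Requests identifies itself as python-requests/<version>.
default = session.prepare_request(requests.Request("GET", URL))
# With headers=..., the custom User-Agent replaces the default one.
disguised = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": "Mozilla/5.0"}))

print(default.headers["User-Agent"])
print(disguised.headers["User-Agent"])
```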

Search engine keyword submission interfaces
Baidu's keyword interface:
http://www.baidu.com/s?wd=keyword
360's keyword interface:
http://www.so.com/s?q=keyword

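# Complete code for a Baidu search
import requests
keyword = "Python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Fetch failed")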
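
The two keyword interfaces differ only in the base URL and the parameter name, so the query string can be built by one helper. A rough sketch using the standard library; the helper name search_url and the sample keyword are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def search_url(base, param, keyword):
    """Build a keyword-search URL such as http://www.baidu.com/s?wd=keyword."""
    return base + "?" + urlencode({param: keyword})

baidu = search_url("http://www.baidu.com/s", "wd", "Python 爬虫")
so360 = search_url("http://www.so.com/s", "q", "Python 爬虫")
print(baidu)
print(so360)
# Decoding the query string recovers the original keyword.
print(parse_qs(urlsplit(baidu).query)["wd"][0])
```

Non-ASCII keywords are percent-encoded in the URL, which is also what the params argument of requests.get() does.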

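Fetching and saving images from the web

import requests
import os
url = "https://www.runoob.com/wp-content/uploads/2014/01/jsp_life_cycle.jpg"
root = "/Users/wangli/Pictures/"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)
        print("success")
    else:
        print("the file already exists")
except (requests.RequestException, OSError):
    print("failed")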
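
Taking the file name from url.split('/')[-1] works for the URL above but would keep any query string. A slightly more defensive sketch, assuming the file name is the last segment of the URL path; the helper name local_path_for, the directory "pictures" and the "?size=large" suffix are hypothetical.

```python
import os
import posixpath
from urllib.parse import urlsplit

def local_path_for(url, root):
    """Return the local file path for `url` inside the directory `root`."""
    # Take the file name from the URL path only, ignoring any ?query part.
    name = posixpath.basename(urlsplit(url).path)
    return os.path.join(root, name)

url = "https://www.runoob.com/wp-content/uploads/2014/01/jsp_life_cycle.jpg"
print(local_path_for(url, "pictures"))
print(local_path_for(url + "?size=large", "pictures"))
```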

The Beautiful Soup library

# Import the BeautifulSoup library:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', "html.parser")

Beautiful Soup is a library for parsing, traversing and maintaining the "tag tree"

Beautiful Soup parsers:

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, "html.parser")	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, "lxml")	pip install lxml
lxml's XML parser	BeautifulSoup(mk, "xml")	pip install lxml
html5lib's parser	BeautifulSoup(mk, "html5lib")	pip install html5lib
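
Since lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when they are missing. A minimal sketch, assuming bs4 raises FeatureNotFound for a parser that is not installed; the helper name make_soup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else with html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser (e.g. lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>data</p>")
print(soup.p.string)
```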

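Basic elements of the Beautiful Soup class:

Element	Description
Tag	A tag, the most basic unit of information, delimited by <> and </> at its start and end
Name	The tag's name; the name of <p>…</p> is 'p'; format: <tag>.name
Attributes	The tag's attributes, organized as a dict; format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <p>…</p>; format: <tag>.string
Comment	The comment part of a string inside a tag; a special string type of its own (Comment)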
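
The five element types can be told apart on a small inline document. The markup below is invented for illustration; note that .string returns a Comment object for the <b> tag and a plain NavigableString for the <p> tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p
print(isinstance(tag, Tag), tag.name, tag.attrs)
print(type(soup.p.string).__name__, soup.p.string)
print(type(soup.b.string).__name__, soup.b.string)
# Comment is itself a kind of NavigableString.
print(isinstance(soup.b.string, NavigableString))
```

Because .string hides the comment markers, checking the type is the way to tell a comment from ordinary text.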

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())


>>> soup.title
<title>This is a python demo page</title>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> 
>>> 
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> 
>>> 
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'

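Traversing HTML content with the bs4 library

Downward traversal
Upward traversal
Sideways traversal

Downward traversal of the tag tree:

Attribute	Description
.contents	A list of child nodes; all of <tag>'s children are stored in a list
.children	An iterator over child nodes; like .contents, used to loop over the children
.descendants	An iterator over descendant nodes; includes all descendants, used for looping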
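
A short sketch of downward traversal on an invented inline document. Child and descendant nodes include strings as well as tags, so the example labels tags by name and strings by their text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = ('<html><body><p id="first">One</p>'
          '<p id="second">Two <b>bold</b></p></body></html>')
soup = BeautifulSoup(markup, "html.parser")

def label(node):
    """Describe a node: tag name for tags, the text itself for strings."""
    return node.name if isinstance(node, Tag) else str(node)

# .contents is a list; .children iterates over the same direct children.
child_count = len(soup.body.contents)
child_labels = [label(child) for child in soup.body.children]
# .descendants also walks into nested tags and their strings.
descendant_labels = [label(node) for node in soup.body.descendants]
print(child_count, child_labels)
print(descendant_labels)
```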

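Upward traversal of the tag tree:

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over the ancestors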
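
Upward traversal can be sketched the same way on an invented inline document. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

markup = '<html><body><p>Two <b>bold</b></p></body></html>'
soup = BeautifulSoup(markup, "html.parser")

# .parent is the enclosing tag; .parents walks up to the document root.
parent_name = soup.b.parent.name
ancestor_names = [parent.name for parent in soup.b.parents]
print(parent_name)
print(ancestor_names)
# The soup object has no parent of its own.
print(soup.parent)
```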

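Sideways traversal of the tag tree:

Attribute	Description
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	An iterator over all following sibling nodes in HTML text order
.previous_siblings	An iterator over all preceding sibling nodes in HTML text order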
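
A sketch of sideways traversal on an invented inline document. It also shows that a sibling is not necessarily a tag: the neighbours of <b> here are plain strings, so code that expects tags should check the type first.

```python
from bs4 import BeautifulSoup

markup = ('<html><body><p id="first">One</p>'
          '<p id="second">Two <b>bold</b> end</p></body></html>')
soup = BeautifulSoup(markup, "html.parser")

first = soup.find(id="first")
# Siblings share the same parent; here the next sibling is the second <p>.
next_id = first.next_sibling["id"]
# The first child has no previous sibling.
no_previous = first.previous_sibling
# Around <b>, the siblings are strings rather than tags.
before_b = str(soup.b.previous_sibling)
after_b = [str(node) for node in soup.b.next_siblings]
print(next_id, no_previous)
print(repr(before_b), after_b)
```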

Three forms of information markup: