Python Web Crawlers [1]: the Requests Library, the Robots Protocol, the BeautifulSoup Library, and a Simple Crawler Project

Latest recommended article published on 2024-09-19 14:31:54

David Wolfowitz

Article link: https://blog.csdn.net/weixin_43763859/article/details/107031047

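Unit 1: Getting Started with Requests: the HTTP Protocol and Requests Library Methods (SHD)

1. The 7 main methods of the Requests library

Method	Description
requests.request()	Constructs a request; the base method that underpins all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE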

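2. The HTTP protocol

1. HTTP stands for Hypertext Transfer Protocol.

2. HTTP is a stateless application-layer protocol based on a "request and response" model.

Put simply, the user sends a request and the server returns a response; this is the request-and-response model. Stateless means that the first request and the second request are not related to each other. Application-layer protocol means that HTTP works on top of the TCP protocol.

3. HTTP uses URLs to identify and locate network resources.

URL format: http://host[:port][path]

+ host: a valid Internet host domain name or IP address
+ port: the port number; the default port is 80
+ path: the path of the requested resource (its internal path on the host or IP-address server)

Understanding an HTTP URL: a URL is an Internet path for accessing a resource through the HTTP protocol; each URL corresponds to one data resource.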
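
To make the three parts of the URL format concrete, here is a minimal sketch that splits a URL with the standard library's urllib.parse. The URL itself is a hypothetical one chosen for illustration, and falling back to port 80 is done by hand here because urlsplit reports a missing port as None.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "http://www.example.com:8080/docs/index.html"
parts = urlsplit(url)

host = parts.hostname      # the host part
port = parts.port or 80    # the explicit port, else the HTTP default of 80
path = parts.path          # the path of the resource on that host

print(host, port, path)

# A URL without an explicit port falls back to the default.
default_port = urlsplit("http://www.example.com/docs/").port or 80
print(default_port)
```

The same split works for any http:// URL, so it can be a quick way to check which host a crawler is about to contact.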

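4. Operations that HTTP performs on resources

Method	Description
GET	Requests the resource at the URL
HEAD	Requests the response headers for the resource at the URL, i.e. only the header information of that resource
POST	Requests that new data be appended to the resource at the URL
PUT	Requests that a resource be stored at the URL, overwriting the resource already there
PATCH	Requests a partial update of the resource at the URL, i.e. changes part of its content
DELETE	Requests deletion of the resource stored at the URL

HTTP locates resources by URL and manages them with the six common methods above.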

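Understanding the difference between PATCH and PUT:

Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.

Requirement: the user changes UserName and nothing else.

With PATCH, only a partial update containing UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL together; any field that is not submitted is deleted.

The main benefit of PATCH: it saves network bandwidth.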
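
The difference can be simulated without a server. The sketch below models the stored resource as a dict; apply_put and apply_patch are hypothetical helper names, and three fields stand in for the 20 fields of UserInfo. It only illustrates the semantics described above and is not how any particular server implements them.

```python
def apply_put(stored: dict, submitted: dict) -> dict:
    """PUT semantics: the submitted representation replaces the stored one."""
    return dict(submitted)


def apply_patch(stored: dict, submitted: dict) -> dict:
    """PATCH semantics: only the submitted fields are updated."""
    merged = dict(stored)
    merged.update(submitted)
    return merged


# Three made-up fields stand in for the 20 fields of UserInfo.
user_info = {"UserID": 1, "UserName": "old_name", "Email": "a@example.com"}
change = {"UserName": "new_name"}

after_patch = apply_patch(user_info, change)
after_put = apply_put(user_info, change)

print(after_patch)  # all three fields remain, UserName updated
print(after_put)    # only UserName remains
```

With PUT, the client would have to send every field to keep them, which is where the extra bandwidth goes.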

HTTP method	Requests method	Functionally consistent
GET	requests.get()	Yes
HEAD	requests.head()	Yes
POST	requests.post()	Yes
PUT	requests.put()	Yes
PATCH	requests.patch()	Yes
DELETE	requests.delete()	Yes

3. Using the Requests library

Using the request method (the basis of the other methods)

1) Syntax

requests.request(method, url, **kwargs)

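Options:
method: the request method; 7 in total, corresponding to get/put/post and so on
    r = requests.request('GET', url, **kwargs)
    r = requests.request('HEAD', url, **kwargs)
    r = requests.request('POST', url, **kwargs)
    r = requests.request('PUT', url, **kwargs)
    r = requests.request('PATCH', url, **kwargs)
    r = requests.request('DELETE', url, **kwargs)
    r = requests.request('OPTIONS', url, **kwargs)
url: the URL of the page to fetch
**kwargs: parameters that control the request, 13 in total
    params: dict or byte sequence, added to the URL as query parameters
    data: dict, byte sequence, or file object, sent as the body of the Request
    json: data in JSON format, sent as the body of the Request
    headers: dict, custom HTTP headers
    cookies: dict or CookieJar, the cookies in the Request
    auth: tuple, enables HTTP authentication
    files: dict, for uploading files
    timeout: the timeout, in seconds
    proxies: dict, sets the proxy servers for the request; login credentials can be included
    allow_redirects: True/False, default True; redirect switch
    stream: True/False, default False; when False, the response content is downloaded immediately
    verify: True/False, default True; SSL certificate verification switch
    cert: path to a local SSL certificate

The params parameter: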

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params = kv)
>>> r.url
'https://python123.io/ws?key1=value1&key2=value2'

The data parameter:

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data = kv)
>>> body = 'information'
>>> r = requests.request('POST', 'http://python123.io/ws', data = body)

The json parameter:

>>> r = requests.request('POST', 'http://python123.io/ws', json = kv)
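
To see what params, data, and json actually do to a request, one option is to build the request without sending it. The sketch below uses requests.Request(...).prepare(), which is a different entry point from requests.request() shown above; it is used here only so the example runs without network access.

```python
import json

import requests

kv = {'key1': 'value1', 'key2': 'value2'}
url = 'http://python123.io/ws'

# params: encoded into the query string of the URL.
with_params = requests.Request('GET', url, params=kv).prepare()
print(with_params.url)

# data (dict): form-encoded into the request body.
with_data = requests.Request('POST', url, data=kv).prepare()
print(with_data.headers['Content-Type'])
print(with_data.body)

# json: serialized to JSON in the request body.
with_json = requests.Request('POST', url, json=kv).prepare()
print(with_json.headers['Content-Type'])
print(json.loads(with_json.body))
```

The Content-Type header is what lets the server tell a form submission apart from a JSON body.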

The headers parameter:

>>> hd = {'user-agent': 'Chrome/10'}  # simulate a visit from the Chrome 10 browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers = hd)

The files parameter:

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files = fs)

The timeout parameter:

>>> r = requests.request('GET', 'http://python123.io/ws', timeout=10)

The proxies parameter:

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies = pxs)

Using the head method

1) Syntax

requests.head(url, **kwargs)

Options:
url: the URL of the page to fetch
**kwargs: the 13 parameters that control the request

2) Example

>>> import requests
>>> r = requests.head('http://www.baidu.com')
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 29 Jun 2020 06:22:52 GMT', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:26 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18'}
>>> r.text
''

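Using the post method

1) Syntax

requests.post(url, data=None, json=None, **kwargs)

Options:
url: the URL of the page to fetch
data: dict, byte sequence, or file; the body of the Request
json: data in JSON format; the body of the Request
**kwargs: the 11 parameters that control the request

2) Example

Submitting a dict (key-value pairs): the data is stored under the form field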

>>> payload = {'key1': 'value1', 'key2': 'value2'} 
>>> r = requests.post('http://httpbin.org/post', data = payload)
>>> print(r.text)
···
  "form": {
    "key1": "value1",
    "key2": "value2"
···

Submitting a string directly: the data is stored under the data field

>>> r = requests.post('http://httpbin.org/post', data = 'ABC')
>>> print(r.text)
{
"args": {},
"data": "ABC",
"files": {},
···
}

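Using the put method

1) Syntax

requests.put(url, data=None, **kwargs)

Options:
url: the URL of the page to update
data: dict, byte sequence, or file; the body of the Request
**kwargs: the 12 parameters that control the request

2) Example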

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data = payload)
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
      ...
}

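Using the patch method

1) Syntax

requests.patch(url, data=None, **kwargs)

Options:
url: the URL of the page to update
data: dict, byte sequence, or file; the body of the Request
**kwargs: the 12 parameters that control the request

Using the get method

1) Syntax

requests.get(url, params=None, **kwargs)

Options:
url: the URL of the page to fetch
params: extra parameters for the URL, as a dict or byte stream; optional
**kwargs: the 12 parameters that control the request (all except params)
Returns a Response object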

>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 29 Jun 2020 06:46:43 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
>>>

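2) Attributes of the Response object

Attribute	Description
r.status_code	The status returned for the HTTP request; 200 means success, 404 means failure
r.text	The HTTP response content as a string, i.e. the content of the page at the URL
r.encoding	The encoding of the response content, guessed from the HTTP headers
r.apparent_encoding	The encoding of the response content, inferred from the content itself (fallback encoding)
r.content	The HTTP response content in binary form

3) Processing flow for a Response object

4) Understanding the encoding of a Response

First, an example: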

>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div 
...
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> 
...

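Attribute	Description
r.encoding	The encoding of the response content, guessed from the HTTP headers
r.apparent_encoding	The encoding of the response content, inferred from the content itself (alternative encoding)

r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1.

r.apparent_encoding: the encoding inferred by analyzing the page content.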
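
The garbled title in the first r.text above can be reproduced without any network access. The sketch below decodes the same UTF-8 bytes twice, once with each encoding; it imitates what happens behind r.text and does not call Requests itself.

```python
title = '百度一下，你就知道'
raw = title.encode('utf-8')        # the bytes as the server sends them

wrong = raw.decode('ISO-8859-1')   # like r.text while r.encoding is 'ISO-8859-1'
right = raw.decode('utf-8')        # like r.text after r.encoding = 'utf-8'

print(repr(wrong))
print(right)
```

No bytes are lost in the wrong decoding, which is why setting r.encoding afterwards is enough to recover the readable text.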

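Using the delete method

1) Syntax

requests.delete(url, **kwargs)

Options:
url: the URL of the page to delete
**kwargs: the 13 parameters that control the request

4. A general code framework for fetching pages with Requests

1) Exceptions in the Requests library

Exception	Description
requests.ConnectionError	Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	Missing URL
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out
r.raise_for_status()	A method of the Response object; raises requests.HTTPError if the status code indicates an error (4xx or 5xx)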
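
The behaviour of raise_for_status() can be checked offline. In the sketch below, check is a hypothetical helper that builds an empty Response by hand and sets only its status code, which is an assumption made for illustration; in real code the Response comes back from requests.get().

```python
import requests


def check(status_code: int) -> str:
    """Build a bare Response by hand (no network) and test its status."""
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as exc:
        return f"HTTPError: {exc}"


print(check(200))
print(check(404))
print(check(500))

# The exceptions in the table share requests.RequestException as a base
# class, so a single except clause can cover all of them.
print(issubclass(requests.ConnectTimeout, requests.Timeout))
print(issubclass(requests.Timeout, requests.RequestException))
print(issubclass(requests.HTTPError, requests.RequestException))
```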

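2) General code framework

import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode using the encoding inferred from the content
        return r.text
    except requests.RequestException:  # base class of the Requests exceptions
        return "An exception occurred"


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(getHTMLText(url))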

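Unit 2: The Robots protocol

1) The Robots protocol

Robots Exclusion Standard: the exclusion standard for web crawlers

Purpose: a website tells web crawlers which pages may be crawled and which may not.

Basic syntax of the Robots protocol: a robots.txt file in the site's root directory pairs a User-agent line naming a crawler (* means all crawlers) with Disallow lines listing the paths that crawler must not fetch (/ means the root directory); lines starting with # are comments.

2) How the Robots protocol is followed

Using the Robots protocol:

Web crawlers: recognize robots.txt automatically or manually, then crawl the content.

Binding force: the Robots protocol is advisory rather than binding. A web crawler may ignore it, but doing so carries legal risk. (Prison-oriented crawling, haha.)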
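
Recognizing robots.txt automatically can be sketched with the standard library's urllib.robotparser. The rules and URLs below are made up for illustration, and the rules are passed in as lines so the example needs no network access; a real crawler would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as lines so no network access is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_public = parser.can_fetch("*", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("*", "http://www.example.com/private/data.html")

print(allowed_public)
print(allowed_private)
```

A crawler that wants to follow the protocol would call can_fetch before each request and skip any URL for which it returns False.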

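3) The scale of web crawlers

Unit 3: The BeautifulSoup library

1) Understanding the Beautiful Soup library

The Beautiful Soup library is a library for parsing, traversing, and maintaining the "tag tree".

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.

2) Beautiful Soup parsers

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib

3) Basic elements of the Beautiful Soup classes

Basic element	Description
Tag	A tag, the most basic unit of information organization; opened with <> and closed with </>
Name	The name of the tag; the name of <p>…</p> is 'p'; format: <tag>.name
Attributes	The attributes of the tag, organized as a dict; format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <>…</>; format: <tag>.string
Comment	The comment part of a string inside a tag; a special type derived from NavigableString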
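4) Traversing the tag tree

1) Downward traversal of the tag tree

Attribute	Description
.contents	A list of child nodes; all child nodes are stored in a list
.children	An iterator over the child nodes; similar to .contents, used to loop over the children
.descendants	An iterator over all descendant nodes, used for looping

2) Upward traversal of the tag tree

Attribute	Description
.parent	The parent tag of the node
.parents	An iterator over the node's ancestor tags, used to loop over the ancestors

3) Sideways traversal of the tag tree

Attribute	Description
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	Iterator; returns all following sibling nodes in HTML text order
.previous_siblings	Iterator; returns all preceding sibling nodes in HTML text order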
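
The three directions of traversal can be tried on a tiny tree. The markup below is made up and deliberately contains no whitespace between tags; in real pages the siblings of a tag are often strings (such as line breaks) rather than tags, so results there may differ.

```python
from bs4 import BeautifulSoup

markup = '<div><p>one</p><p>two</p><p>three</p></div>'
soup = BeautifulSoup(markup, 'html.parser')

div = soup.div
first, second, third = div.contents              # downward: list of child nodes

child_names = [child.name for child in div.children]
descendant_count = len(list(div.descendants))    # three <p> tags plus three strings

parent_name = second.parent.name                 # upward: the enclosing tag
ancestor_names = [p.name for p in second.parents]

next_text = second.next_sibling.string           # sideways: the following node
previous_text = second.previous_sibling.string
later = [s.string for s in first.next_siblings]

print(child_names, descendant_count)
print(parent_name, ancestor_names)
print(previous_text, next_text, later)
```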

5) HTML output with bs4

Formatting HTML

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')  # demo holds the HTML text to parse
>>> soup.prettify()  # adds '\n' after each tag
>>> print(soup.prettify())  # print shows the result more clearly
>>> soup.a.prettify()

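Unit 4: Information organization and extraction methods

1) Three ways of marking up information

XML, JSON, YAML

2) General methods of information extraction

Method 1: fully parse the markup format of the information, then extract the key information

XML JSON YAML

Requires a markup parser, for example the tag-tree traversal of the bs4 library

Advantage: the information is parsed accurately

Disadvantage: the extraction process is tedious and slow

Method 2: ignore the markup format and search for the key information directly

Search

A text search function over the information is enough.

Advantage: the extraction process is simple and fairly fast

Disadvantage: the accuracy of the result depends on the content of the information

Method 3: combined method: combine format parsing with searching to extract the key information

XML JSON YAML Search

Requires both a markup parser and a text search function

Example: extract all URL links from an HTML document

Approach: 1) find all <a> tags

2) parse the format of each <a> tag and extract the link that follows href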
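
The two steps of this approach can be sketched as follows. The page below is a made-up fragment with hypothetical URLs, and using .get('href') to skip <a> tags that have no href is one possible choice rather than the only way to do it.

```python
from bs4 import BeautifulSoup

# A made-up page, used only for illustration.
html = """
<html><body>
<p>Courses:
<a href="http://www.example.com/course/1" id="link1">Basic Python</a> and
<a href="http://www.example.com/course/2" id="link2">Advanced Python</a>.
<a name="anchor-without-href">no link here</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.find_all('a'):   # step 1: find every <a> tag
    href = a.get('href')       # step 2: read its href attribute, if any
    if href is not None:
        links.append(href)

print(links)
```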

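3) HTML content search methods in bs4

<tag>.find_all(name, attrs, recursive, string, **kwargs)
Equivalent to <tag>(..)


Returns:
a list holding the search results

Options:
name: search string for tag names
attrs: search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; default True
string: search string for the string region inside <>...</>
**kwargs: attribute filters given as keyword arguments, e.g. id='link1'

The name parameter:

>>> soup.find_all('a')  # find all a tags
>>> soup.find_all(['a', 'b'])  # find all a and b tags

The attrs parameter:

>>> soup.find_all('p', 'course')
>>> soup.find(id = 'link1')

The recursive parameter:

>>> soup.find_all('a', recursive = True)
>>> soup.find_all('a', recursive = False)

The string parameter:

>>> soup.find_all(string = 'Basic Python')

Shorthand:

<tag>(…) is equivalent to <tag>.find_all(…)

soup(…) is equivalent to soup.find_all(…)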

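Extension methods:

Method	Description
<>.find()	Searches and returns only one result (a single node rather than a list, or None if nothing matches); same parameters as .find_all()
<>.find_parents()	Searches the ancestor nodes; returns a list; same parameters as .find_all()
<>.find_parent()	Returns one result from the ancestor nodes (a single node, or None); same parameters as .find_all()
<>.find_next_siblings()	Searches the following sibling nodes; returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns one result from the following sibling nodes (a single node, or None); same parameters as .find_all()
<>.find_previous_siblings()	Searches the preceding sibling nodes; returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns one result from the preceding sibling nodes (a single node, or None); same parameters as .find_all()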
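
The search parameters and a few of the extension methods can be exercised together on one small document. The markup is made up for illustration, and the regular-expression filter is an extra not shown in the snippets above; it is included because find_all accepts compiled patterns as well as plain strings.

```python
import re

from bs4 import BeautifulSoup

# A made-up fragment, used only for illustration.
html = """
<p class="course">Courses:
<a href="http://www.example.com/course/1" id="link1">Basic Python</a>
<a href="http://www.example.com/course/2" id="link2">Advanced Python</a>
</p>
<p class="note"><b>Note</b></p>
"""
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find_all('a')                        # name
by_names = soup.find_all(['a', 'b'])                # several names at once
by_class = soup.find_all('p', 'course')             # attrs: matches the class value
by_id = soup.find_all(id='link1')                   # keyword filter on an attribute
by_id_regex = soup.find_all(id=re.compile('link'))  # pattern instead of exact value
top_level_a = soup.find_all('a', recursive=False)   # only direct children of soup
by_string = soup.find_all(string='Basic Python')    # exact string match
shorthand = soup('a')                               # same as soup.find_all('a')

first_a = soup.find('a')                            # a single Tag, or None
parent_p = first_a.find_parent('p')
following = first_a.find_next_siblings('a')

print(len(by_name), len(by_names), len(by_id_regex), top_level_a)
```

Here recursive=False finds nothing because the <a> tags sit inside <p> and are not direct children of the document.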

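Case study: crawling the Best Chinese Universities ranking site

import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] for each table row to ulist."""
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or unexpected layout
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between rows
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[4].string])


def printUnivList(ulist, num):
    # chr(12288) is the full-width space, used as the fill character for the name column
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "University", "Score", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows exist
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2020.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


if __name__ == '__main__':
    main()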
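
The format template in printUnivList is the least obvious part of the program, so here is a small sketch of it in isolation. The row values are made up for illustration. The point is that the third format argument, chr(12288), is the full-width space U+3000, which pads the Chinese name column with a character about as wide as the characters in the name.

```python
FULL_WIDTH_SPACE = chr(12288)   # U+3000, the full-width space

tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"

# Made-up row values, used only for illustration.
row = tplt.format("1", "清华大学", "852.5", FULL_WIDTH_SPACE)
rank, name, score = row.split("\t")

print(row)
print(len(name), name.count(FULL_WIDTH_SPACE))
```

Padding the name with ordinary ASCII spaces would still give a 10-character field, but the columns tend to drift in a terminal because an ASCII space is narrower than a Chinese character.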