Python Web Scraping Notes

Latest recommended article published on 2022-10-01 10:13:29

PenguinLeee

Category: Computer Miscellany

Article link: https://blog.csdn.net/weixin_43466027/article/details/118968729

These notes summarize Song Tian's advanced Python course.

0x01 The Requests library

Overview: this library sends HTTP requests and handles the responses.

Installation: run pip install requests.

Testing the installation:

import requests
r = requests.get("http://www.baidu.com")
print(r.status_code)
r.text

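Main methods of the requests library:

request (builds a request and underpins all the other methods; each method below simply calls request with its own HTTP method name filled in)
get (retrieve a resource)
head (retrieve only the response headers)
post (submit data, appending a new resource)
put (overwrite an entire resource)
patch (partially update a resource)
delete (delete a resource)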

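The get method

# r is a Response object holding the status code, response headers, body, and so on
r = requests.get(url)

# params: extra query parameters added to the URL; **kwargs: the 12 optional arguments that control the request
requests.get(url, params=None, **kwargs)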
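As a rough illustration of what params does, the sketch below builds a request without sending it and prints the final URL. The /s path and the wd key are assumptions chosen for the example, and requests.Request(...).prepare() is used only so that no network access is needed.

```python
import requests

# Build (but do not send) a GET request; "/s" and "wd" are illustrative assumptions
req = requests.Request("GET", "http://www.baidu.com/s", params={"wd": "Python"})
prepared = req.prepare()

# params is URL-encoded and appended to the URL as a query string
print(prepared.method)
print(prepared.url)
```

Calling requests.get(url, params=...) prepares the request the same way before sending it.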

Some attributes of the Response object

url = "http://www.baidu.com"
r = requests.get(url)

type(r)
# <class 'requests.models.Response'>

r.headers
# {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Wed, 21 Jul 2021 08:17:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:29 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

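# HTTP status code of the response
r.status_code

# Response body as a string
r.text

# Encoding guessed from the response headers
r.encoding

# Encoding guessed from the response body
r.apparent_encoding

# Response body as raw bytes
r.content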
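The difference between r.encoding and r.apparent_encoding matters when the server does not declare a charset, as in the Content-Type header shown above. The sketch below uses the helper requests.utils.get_encoding_from_headers to show the header-based guess without making a request; the two header dictionaries are made up for the example.

```python
from requests.utils import get_encoding_from_headers

# Header without a charset, like the 'Content-Type': 'text/html' response above
no_charset = get_encoding_from_headers({"content-type": "text/html"})
# Header that declares its charset explicitly
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})

print(no_charset)
print(with_charset)
```

When the header-based guess falls back to a default, r.text can come out garbled for Chinese pages, which is why assigning r.apparent_encoding to r.encoding is a common fix.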

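A general-purpose code template

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        # Raise requests.HTTPError for 4xx and 5xx responses
        r.raise_for_status()
        # Decode the body with the encoding guessed from its content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred!"

if __name__ == "__main__":
    url = "https://www.baidu.com"
    print(getHTMLText(url))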

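Exceptions in the requests library

Examples: network connection errors (ConnectionError), HTTP errors (HTTPError), a missing URL (URLRequired), exceeding the maximum number of redirects (TooManyRedirects), timeouts while connecting to the remote server (ConnectTimeout), and timeouts for the request as a whole (Timeout).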
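A minimal sketch of handling these exceptions separately, assuming a caller that only needs a short label for each failure. The function name describe_failure and the returned labels are hypothetical, not part of requests.

```python
import requests

def describe_failure(url):
    """Return a short label describing how a GET request to url ended."""
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()
        return "ok"
    except requests.Timeout:
        # Covers both ConnectTimeout and ReadTimeout
        return "timeout"
    except requests.ConnectionError:
        return "connection error"
    except requests.HTTPError:
        return "http error"
    except requests.TooManyRedirects:
        return "too many redirects"
    except requests.RequestException:
        # Base class of all requests exceptions, e.g. an invalid URL
        return "other request error"

# A URL without a scheme is rejected before any network traffic happens
print(describe_failure("www.baidu.com"))
```

Because every exception above derives from requests.RequestException, the last clause works as a catch-all, so the more specific clauses have to come first.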

Exceptions based on the response status

# Raises requests.HTTPError when the returned status code is a 4xx or 5xx error; does nothing otherwise
r.raise_for_status()
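To see which status codes trigger the exception, the sketch below fills in an empty Response object by hand instead of contacting a server. The helper name raises_http_error is hypothetical.

```python
import requests

def raises_http_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()       # empty Response built by hand, no request sent
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 302, 404, 500):
    print(code, raises_http_error(code))
```

Only the 4xx and 5xx codes raise, so a redirect status such as 302 passes through silently.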

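The HTTP protocol

Too lazy to write this part up... I used it a lot in CTFs and projects before.

Reference: Computer Networking: A Top-Down Approach, which covers it in great detail.

Configuring proxies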
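A minimal sketch of proxy configuration with requests, using its proxies dictionary. The proxy address 127.0.0.1:8080 is a hypothetical placeholder, and the actual request is left commented out so that the snippet runs without a proxy.

```python
import requests

# Hypothetical local proxy address; replace with a real proxy before use
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# Option 1: pass proxies to a single call (not executed here)
# r = requests.get("http://www.baidu.com", proxies=proxies, timeout=30)

# Option 2: store them on a Session so every request made through it uses them
session = requests.Session()
session.proxies.update(proxies)
print(session.proxies)
```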

0x02 The beautifulsoup4 library

Purpose: a library for parsing, traversing, and maintaining the HTML "tag tree".

import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://www.bit.edu.cn").text, 'html.parser')
print(soup.prettify())

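HTML tags form a tree through their ordering and nesting relationships.

bs4 uses these relationships to make an HTML page searchable.

Parsers

'html.parser': the HTML parser from Python's standard library, usable with bs4 without extra installs
'lxml': the HTML parser from the lxml library (requires installing lxml)
'xml': the XML parser from lxml (requires installing lxml)
'html5lib': the html5lib parser (requires installing html5lib)

Basic elements of the BeautifulSoup class in bs4

Tag: the most basic unit of information. Example: <a href=blablabla>hhhhh</a>
Name: the name of the tag; the tag above is named "a". Accessed as <tag>.name
Attributes: the tag's attributes, organized as a dictionary. Accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag, such as hhhhh in <a href=blablabla>hhhhh</a>. Accessed as <tag>.string
Comment: the comment part of a string inside a tag

Any tag in the page can be reached with soup.<tag>. If the page contains several tags with that name, soup.<tag> returns the first one.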

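from bs4 import BeautifulSoup
import requests
# BeautifulSoup needs the response text, not the Response object
soup = BeautifulSoup(requests.get("http://www.baidu.com").text, "html.parser")
soup.a.name                # name of the first <a> tag
soup.a.parent.name         # name of its parent tag
soup.a.parent.parent.name  # name of its grandparent tag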
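The basic elements listed above can also be checked on a small inline document, which avoids depending on a live page. The HTML string here is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Small inline document (made up for this example), so no network is needed
html = '<p><a href="blablabla">hhhhh</a><b><!--a comment--></b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                                # first <a> tag in the document
print(type(tag) is Tag)
print(tag.name)                             # the tag's name
print(tag.attrs)                            # attributes as a dictionary
print(tag.string)                           # the string inside the tag
print(type(tag.string) is NavigableString)
print(type(soup.b.string) is Comment)       # a comment is a special kind of string
```

Since .string returns both ordinary strings and comments, checking the type is one way to tell them apart.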

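Ways to traverse HTML tag content

<tag>.contents: a list of all direct child nodes
<tag>.children: the direct child nodes (iterator)
<tag>.descendants: all descendant nodes (iterator)
<tag>.parent: the parent node
<tag>.parents: all ancestor nodes, from the parent upward (iterator)
<tag>.next_sibling: the next sibling node under the same parent
<tag>.next_siblings: all following sibling nodes (iterator)
<tag>.previous_sibling: the previous sibling node under the same parent
<tag>.previous_siblings: all preceding sibling nodes (iterator)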
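The traversal attributes above can be tried on a small tree. This is a sketch on a made-up HTML string written without whitespace between tags; in real pages, line breaks between tags show up as extra string nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

# Made-up document with no whitespace between tags
html = '<html><body><p id="first">one<b>bold</b></p><p id="second">two</p></body></html>'
soup = BeautifulSoup(html, "html.parser")
first = soup.p                                               # the first <p> tag

child_names = [child.name for child in soup.body.children]   # downward, one level
descendant_count = len(list(first.descendants))              # downward, all levels
ancestor_names = [parent.name for parent in soup.b.parents]  # upward to the document
next_id = first.next_sibling["id"]                           # sideways, forward
previous = first.previous_sibling                            # sideways, backward

print(child_names)
print(descendant_count)
print(ancestor_names)
print(next_id)
print(previous)
```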

Formatted HTML output with bs4

To display HTML in a friendlier way, similar to the indented tags shown in a browser's developer tools, use the prettify method:

import requests
demo = requests.get("http://www.bit.edu.cn")
demo = demo.text

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
# prettify() returns the parsed tree as an indented string
print(soup.prettify())

Encoding in bs4

bs4 converts every HTML input to Unicode and encodes its output as UTF-8 by default.
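A small sketch of this conversion, assuming a page delivered as GBK bytes. The sample markup is made up, and from_encoding is passed explicitly so the example does not rely on encoding detection.

```python
from bs4 import BeautifulSoup

# GBK-encoded bytes standing in for a downloaded page (made-up sample)
raw = "<p>中文</p>".encode("gbk")

soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string          # a Unicode str inside the parsed tree
encoded = soup.p.encode()     # bytes; the default output encoding is UTF-8

print(text)
print(encoded)
```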

0x03 Information markup & information extraction

The three mainstream formats for marking up information: XML, JSON, YAML
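To make the comparison concrete, the sketch below writes one made-up record in JSON and XML and parses both with the standard library. YAML is shown only in a comment, because parsing it needs a third-party library such as PyYAML.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record in two of the three formats
json_text = '{"name": "BIT", "city": "Beijing"}'
xml_text = "<school><name>BIT</name><city>Beijing</city></school>"
# The YAML equivalent would be two lines: "name: BIT" and "city: Beijing"

from_json = json.loads(json_text)
root = ET.fromstring(xml_text)
# Each child element of <school> becomes one key/value pair
from_xml = {child.tag: child.text for child in root}

print(from_json)
print(from_xml)
```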

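General approaches to information extraction:

Parse the markup completely and then extract: slow
Search for keywords directly
Combine the two approaches

Example: extracting all URL links from an HTML page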

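import requests
from bs4 import BeautifulSoup
# demo holds the HTML text of the page to search
demo = requests.get("http://www.bit.edu.cn").text
soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

soup.find_all(name, attrs, recursive, string, **kwargs)
# Calling the soup object directly is equivalent
soup(name, attrs, recursive, string, **kwargs)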
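A sketch of how the find_all arguments can be combined, run on a made-up inline document. The example.com links and the id values are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="title"><b>Demo</b></p>'
        '<a href="http://example.com/1" id="link1">Basic Python</a>'
        '<a href="http://example.com/2" id="link2">Advanced Python</a>')
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                            # one tag name
by_names = soup.find_all(["a", "b"])                    # any of several tag names
by_attr = soup.find_all("a", id="link1")                # tag name plus attribute value
by_regex = soup.find_all(id=re.compile("link"))         # attribute matched by a regex
by_string = soup.find_all(string=re.compile("Python"))  # strings instead of tags
shorthand = soup("a")                                   # same as soup.find_all("a")

print([a["href"] for a in by_name])
print([t.name for t in by_names])
print(by_string)
```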

0x04 The regular expression library re

Syntax

Usage

import re
regex = "P(Y|YT|TYH)?N"
p = re.compile(regex)
# The compiled pattern p can then be used for matching, e.g. p.search(string)
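As a quick check of what the compiled pattern accepts, the sketch below tries it on a few candidate strings. The candidates and the sample text are made up for the example.

```python
import re

p = re.compile("P(Y|YT|TYH)?N")

# fullmatch() succeeds only if the whole string matches the pattern
candidates = ["PN", "PYN", "PYTN", "PTYHN", "PYTHN"]
full_matches = [s for s in candidates if p.fullmatch(s)]

# search() finds the first match anywhere in the string
found = p.search("xxPYTNxx")

print(full_matches)
print(found.group(0), found.group(1))
```

"PYTHN" is rejected because the optional group only allows Y, YT, or TYH between P and N.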

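Some functions

re.search(pattern, string, flags=0): scans the string and returns a match object for the first match
re.match(): matches only at the beginning of the string and returns a match object
re.findall(): returns all matches as a list
re.split(): splits the string at each matching substring
re.finditer(): returns an iterator of match objects
re.sub(): replaces every substring that matches the regular expression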
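The functions above can be compared side by side on one input. This sketch assumes a six-digit postal code pattern and a made-up sample string.

```python
import re

pattern = r"[1-9]\d{5}"          # six-digit postal code, an illustrative pattern
text = "BIT100081 TSU100084"

first = re.search(pattern, text)       # first match anywhere in the string
at_start = re.match(pattern, text)     # None: the string does not start with a match
all_codes = re.findall(pattern, text)  # list of matched strings
pieces = re.split(pattern, text)       # the text between matches
spans = [m.span() for m in re.finditer(pattern, text)]  # position of each match
masked = re.sub(pattern, ":zipcode", text)              # every match replaced

print(first.group(0), at_start)
print(all_codes)
print(pieces)
print(spans)
print(masked)
```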

0x05 The Scrapy framework

My task was not large enough to need it, so I did not look into it.
