Python Web Scraping Basics 02: Python Web Crawling and Information Extraction

Latest recommended article published on 2023-07-10 13:45:00

张小北哈哈

Category: Python web scraping. Tags: Python web scraping

Article link: https://blog.csdn.net/yulizan9165/article/details/89197720

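1. soup = BeautifulSoup('<p>data</p>', 'html.parser')
The first argument is the content BeautifulSoup should parse (markup, a variable holding markup, and so on); the second is the parser.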
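
A minimal sketch of the constructor call, assuming a short illustrative string `<p>data</p>` as the markup (the sample string is an assumption, not necessarily the author's input):

```python
from bs4 import BeautifulSoup

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup("<p>data</p>", "html.parser")

print(soup.p.name)    # name of the first <p> tag
print(soup.p.string)  # text held inside that tag
```

The built-in "html.parser" is used here because it ships with Python and needs no extra install.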
2. The BeautifulSoup library is a function library for parsing, traversing, and maintaining the "tag tree".

Attributes consist of key-value pairs.
    from bs4 import BeautifulSoup
    # Parse markup passed in as a string
    soup = BeautifulSoup("data", "html.parser")
    # Parse markup read from an open file
    soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
3. BeautifulSoup parsers
https://www.cnblogs.com/themost/p/7223907.html?utm_source=itdadao&utm_medium=referral
https://www.cnblogs.com/hanmk/p/8724162.html
https://www.jianshu.com/p/9cd7fb95b74f

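4. HTML structure
Downward traversal of the tag tree:
.contents: a list of child nodes; every direct child of the tag is stored in the list
.children: an iterator over child nodes; similar to .contents, used to loop over direct children
.descendants: an iterator over descendant nodes; covers all descendants, used for loop traversal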
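
The three downward attributes can be compared on a small tree. This is a sketch under stated assumptions: the `html` string is a hypothetical sample, not taken from the original notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one <b>bold</b></p><p>two</p></div>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents (note the trailing "s") is a plain list of direct children.
print(len(div.contents))

# .children yields the same direct children, but lazily, for use in loops.
child_names = [child.name for child in div.children]
print(child_names)

# .descendants walks every nested node, including text nodes,
# so the text nodes are filtered out here to keep only tags.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
print(descendant_tags)
```

Text inside a tag is also a node in the tree, which is why the last step filters with `isinstance(node, Tag)`.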
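Upward traversal of the tag tree:
.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, used to loop over ancestors
Example program:
    from bs4 import BeautifulSoup
    # demo is assumed to hold the HTML text of the page being parsed
    soup = BeautifulSoup(demo, "html.parser")
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)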

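Sibling traversal of the tag tree:
.next_sibling: returns the next sibling node in HTML text order
.previous_sibling: returns the previous sibling node in HTML text order
.next_siblings: an iterator that returns all following sibling nodes in HTML text order
.previous_siblings: an iterator that returns all preceding sibling nodes in HTML text order

Sibling traversal takes place only among nodes that share the same parent.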
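
A short sketch of sibling traversal, assuming a hypothetical one-line `html` sample. It also shows that a sibling may be a text node rather than a tag, so checking the node type before reading tag-only attributes is a reasonable precaution.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p><a>first</a> and <b>second</b><i>third</i></p>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# The node right after <a> is the text " and ", not the <b> tag.
print(repr(a.next_sibling))

# Collect only the tags among the following siblings of <a>.
following_tags = [s.name for s in a.next_siblings if isinstance(s, Tag)]
print(following_tags)

# Walk backwards from <i>: its nearest previous sibling is <b>.
print(soup.i.previous_sibling.name)
preceding = list(soup.i.previous_siblings)
print(len(preceding))
```

All of these nodes share the same parent `<p>`; nodes under a different parent never appear in each other's sibling iterators.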

5. HTML output with the bs4 library
The prettify() method of the bs4 library adds line breaks to HTML tags and their content.
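
A minimal sketch of prettify(), assuming a small hypothetical document passed in as a single-line string:

```python
from bs4 import BeautifulSoup

# Hypothetical single-line markup used only for illustration
soup = BeautifulSoup("<html><body><p>data</p></body></html>", "html.parser")

# prettify() returns a string with each tag and text node on its own line
pretty = soup.prettify()
print(pretty)

# It can also be called on a single tag rather than the whole document
print(soup.p.prettify())
```

The method returns a new string and leaves the parsed tree itself unchanged.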

6. Basic elements of the bs4 library
Tag, Name, Attributes, NavigableString, Comment
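
The five elements can be inspected on one small fragment. This is a sketch under stated assumptions: the `html` string, its `class` and `id` values, and the comment text are all hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample markup with attributes and an HTML comment
html = '<p class="title" id="intro">hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                # Tag: the first <p> element
print(tag.name)             # Name: the tag's name
print(tag.attrs)            # Attributes: a dict of key-value pairs
print(tag["id"])            # a single attribute, read like a dict entry

print(type(tag.string))     # NavigableString: the text inside the tag
print(type(soup.b.string))  # Comment: the text of the HTML comment
```

Comment is a subclass of NavigableString, so a type check is needed when comment text must be told apart from ordinary text.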
Traversal features of the bs4 library