Python crawler study notes (4): working with HTML text using the BeautifulSoup library...

weixin_39710249

Article link: https://blog.csdn.net/weixin_39710249/article/details/111955191

As long as the input you provide is tag-based markup, BeautifulSoup can parse it well.

How do you use the BeautifulSoup library?

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

For example:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text  # the page's HTML source as a string
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
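The constructor accepts either a string or an open file-like object, which is handy for trying things out without network access. The sketch below is a minimal illustration; the `demo` string is a hypothetical stand-in, not the content of the real demo.html page.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (assumption, not the real demo.html).
demo = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Parse the same markup from a string and from a file-like object.
soup_from_str = BeautifulSoup(demo, "html.parser")
soup_from_file = BeautifulSoup(io.StringIO(demo), "html.parser")

title_text = soup_from_str.title.string
same = str(soup_from_str) == str(soup_from_file)

print(title_text)
print(same)
```

Both calls should build the same tree, so the rest of these notes applies regardless of where the markup came from.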

Basic elements of the BeautifulSoup class

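| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information; it is opened with <> and closed with </> |
| Name | The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name |
| Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string |
| Comment | The comment portion of a string inside a tag, represented by the special Comment type (a subclass of NavigableString) |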
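The element types above can be told apart at runtime. The following is a minimal sketch on a hypothetical HTML fragment (the markup and variable names are illustrative assumptions), showing that `.string` may return either a NavigableString or a Comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from the original page.
html = '<p class="title"><b>Hello</b></p><p id="note"><!--a comment--></p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                          # Tag: the first <p> in the document
tag_name = first_p.name                   # Name
tag_attrs = first_p.attrs                 # Attributes, as a dict
text = first_p.string                     # NavigableString inside <b>
note = soup.find("p", id="note").string   # Comment inside the second <p>

print(type(first_p).__name__, tag_name, tag_attrs)
print(type(text).__name__, text)
print(type(note).__name__, note)
```

Because a comment is returned by `.string` without its `<!-- -->` markers, checking `isinstance(value, Comment)` is one way to avoid mistaking a comment for ordinary text.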

Usage

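tag = soup.a  # the first <a> tag in the document
tag.attrs  # all attributes, as a dict
tag.attrs['class']  # value of the class attribute
tag.string  # the tag's text content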
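A short sketch of attribute access, using a hypothetical `<a>` tag (the markup is an assumption for illustration). It shows that `class` comes back as a list and that `.get()` avoids a `KeyError` for attributes that are absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<a class="py1 extra" href="http://example.com/a" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
classes = tag.attrs["class"]   # multi-valued attribute, returned as a list
href = tag["href"]             # shorthand for tag.attrs["href"]
missing = tag.get("title")     # None when the attribute does not exist
label = tag.string             # the text between <a> and </a>

print(classes, href, missing, label)
```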

Downward traversal of the tag tree

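| Attribute | Description |
| --- | --- |
| .contents | A list of child nodes; all direct children are stored in the list |
| .children | An iterator over child nodes; similar to .contents, used to loop over direct children |
| .descendants | An iterator over descendant nodes; covers every descendant, used for loop traversal |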

Iterating over child nodes

for child in soup.body.children:
    print(child)

Iterating over descendant nodes

for child in soup.body.descendants:
    print(child)
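The difference between direct children and all descendants is easier to see on a small tree. This is a minimal sketch on a hypothetical fragment (markup and variable names are assumptions):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <body> has two <p> children, one of which nests a <b>.
html = "<body><p><b>one</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct = body.contents                                  # list of direct children
child_names = [child.name for child in body.children]   # same nodes, via an iterator
all_nodes = list(body.descendants)                      # tags and strings at every depth
descendant_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), child_names)
print(len(all_nodes), descendant_tags)
```

Note that `.descendants` yields text nodes as well as tags, which is why the sketch filters with `isinstance(n, Tag)` before reading `.name`.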

Upward traversal of the tag tree

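| Attribute | Description |
| --- | --- |
| .parent | The node's parent tag |
| .parents | An iterator over the node's ancestor tags, used to loop over ancestors |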

Upward traversal code (prints the names of all ancestors of the a tag):

soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    # Defensive check: print the name only when an ancestor exists.
    if parent is None:
        print(parent)
    else:
        print(parent.name)
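As a quick check of what the loop over `.parents` visits, here is a minimal sketch on a hypothetical document (the markup is an assumption). The last ancestor is the BeautifulSoup object itself, whose name is `[document]`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect ancestor names from the nearest parent up to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
direct_parent = soup.a.parent.name

print(direct_parent)
print(ancestor_names)
```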

Sibling (parallel) traversal of the tag tree

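| Attribute | Description |
| --- | --- |
| .next_sibling | The next sibling node in HTML text order |
| .previous_sibling | The previous sibling node in HTML text order |
| .next_siblings | An iterator over all following sibling nodes in HTML text order |
| .previous_siblings | An iterator over all preceding sibling nodes in HTML text order |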

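Condition for sibling traversal: it only takes place among nodes that share the same parent.

A node reached by sibling traversal is not necessarily a tag; it may be a ...String.

The iterator types are meant to be consumed by iteration, typically in a for ... in loop.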

Iterating over following siblings

for sibling in soup.a.next_siblings:
    print(sibling)

Iterating over preceding siblings

for sibling in soup.a.previous_siblings:
    print(sibling)
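Since siblings can be plain text as well as tags, it is often useful to filter them. The sketch below uses a hypothetical paragraph (markup and names are assumptions) and keeps only the siblings that are tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two <a> tags separated by plain text inside one <p>.
html = "<p>Intro <a id='first'>A</a> and <a id='second'>B</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)        # text nodes and tags after the first <a>
preceding = list(first.previous_siblings)    # nodes before it, nearest first
tag_siblings = [s for s in following if isinstance(s, Tag)]

print(following)
print(preceding)
print([t["id"] for t in tag_siblings])
```

Here the text fragments between the links appear as siblings too, so only one of the three following nodes is a tag.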

The prettify() method

It puts each tag and each string on its own line, which makes the structure easy to read.

soup = BeautifulSoup("<p>中文</p>", "html.parser")
soup.p.string  # '中文'
print(soup.p.prettify())
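To see what prettify() produces, the following minimal sketch formats a small nested fragment (the markup is a hypothetical example) and splits the result into lines:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup for illustration only.
soup = BeautifulSoup("<div><p>text</p></div>", "html.parser")

pretty = soup.div.prettify()
lines = pretty.splitlines()

print(pretty)
```

With the default settings, each nesting level is indented by one space, so the formatted output mirrors the depth of the tag tree.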
