Web Scraping, Part 4: Data Parsing (BeautifulSoup)

Latest recommended article published on 2023-01-24 21:16:47

CHEN-QING


Tags: python, web scraping, big data

Article link: https://blog.csdn.net/Love_Stars/article/details/119447939


I. Basic Concepts

Beautiful Soup is a Python library for extracting data from HTML and XML files, typically the web pages a crawler has downloaded.
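
As a quick illustration of what "extracting data" means here, the following is a minimal sketch. The HTML snippet and the variable names are made up for this example, and it uses Python's built-in "html.parser" instead of the "lxml" parser used in the rest of this article, so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p id='intro'>Hello</p></body></html>"
)

# "html.parser" ships with Python; the article's own examples use "lxml".
soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
title_text = soup.title.string

# Text of the <p> tag whose id attribute is "intro".
intro_text = soup.find("p", id="intro").get_text()

print(title_text)
print(intro_text)
```

The parser name is the second argument to `BeautifulSoup`; switching it to `'lxml'` should behave the same for this snippet, provided the lxml package is installed.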

II. Basics

1. Object types in bs4

The following example illustrates them:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
<h><!-- 在此处写注释 --></h>
"""

Tag: an element (tag) in the document

soup = BeautifulSoup(html_doc, 'lxml')

# tag
print(type(soup.html))  # <class 'bs4.element.Tag'>

NavigableString: a navigable string, i.e. the text contained in a tag

soup = BeautifulSoup(html_doc, 'lxml')

# NavigableString
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

BeautifulSoup: the object that represents the whole parsed document

soup = BeautifulSoup(html_doc, 'lxml')

# BeautifulSoup: the document object
print(type(soup))  # <class 'bs4.BeautifulSoup'>

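Comment: an HTML comment, a special kind of NavigableString

soup = BeautifulSoup(html_doc, 'lxml')

# Comment: the text of an HTML comment
print(type(soup.h.string))  # <class 'bs4.element.Comment'>
print(soup.h.string)  # prints the comment text " 在此处写注释 " ("write a comment here")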
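
The four types can also be checked side by side with `isinstance`. The sketch below uses a shortened, made-up HTML string and Python's built-in "html.parser" (both are assumptions of this example, not part of the article's code). It also shows two relationships that matter in practice: `Comment` is a subclass of `NavigableString`, and `BeautifulSoup` is a subclass of `Tag`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical, shortened document used only for this sketch.
html = "<p class='title'><b>Story</b></p><div><!-- a note --></div>"
soup = BeautifulSoup(html, "html.parser")

bold_text = soup.b.string   # the text inside <b>
note = soup.div.string      # the only child of <div> is a comment

print(isinstance(soup.p, Tag))
print(isinstance(bold_text, NavigableString))
print(isinstance(note, Comment))

# A Comment is also a NavigableString, so test for Comment first
# when comments should be skipped during text extraction.
print(isinstance(note, NavigableString))

# The BeautifulSoup object itself behaves like a Tag.
print(isinstance(soup, Tag))
```

Because a comment passes the `NavigableString` check as well, code that collects text nodes usually needs an explicit `isinstance(node, Comment)` test to leave comments out.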

2. Traversing the document tree

The following example illustrates this:

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">hhhhhhhhhhh</p>
<h><!-- 在此处写注释 --></h>
"""


soup = BeautifulSoup(html_doc, 'lxml')

(1) Traversing child nodes

contents: returns a list of all direct child nodes

a = soup.head.contents
print(a)  # [<title>The Dormouse's story</title>]

children: returns an iterator over the direct child nodes

a = soup.head.children
print(a)  # e.g. <list_iterator object at 0x000002ADC4A93710> (the exact type may vary with the bs4 version)
for i in a:
    print(i) # <title>The Dormouse's story</title>
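
One detail worth seeing in code: both `contents` and `children` include text nodes, so whitespace between tags counts as a child. The sketch below uses a made-up `<ul>` snippet and the built-in "html.parser" (assumptions of this example) to show the difference between the list and the iterator, and how to keep only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet; note the newline between the two <li> tags.
html = "<ul><li>one</li>\n<li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# .contents is a real list: it supports len() and indexing.
child_list = soup.ul.contents
first_child = child_list[0]

# .children walks the same nodes one by one; filter out the text nodes.
tag_names = [child.name for child in soup.ul.children if isinstance(child, Tag)]

print(len(child_list))
print(first_child)
print(tag_names)
```

Here `child_list` has three entries (the two `<li>` tags plus the newline between them), while `tag_names` keeps only the two tags.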

descendants: returns a generator that walks all descendants recursively (children, grandchildren, and so on)
```
a = soup.head
```
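
To make the difference from `contents` concrete, here is a minimal sketch under the same assumptions as before (a shortened document and the built-in "html.parser", neither taken from the article's code). The `<head>` tag has one direct child, but two descendants: the `<title>` tag and the string inside it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened document used only for this sketch.
html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")

direct = soup.head.contents              # direct children only
all_nodes = list(soup.head.descendants)  # children, grandchildren, ...

print(len(direct))
print(len(all_nodes))

# Label each descendant as a tag or a text node.
for node in all_nodes:
    kind = "tag" if isinstance(node, Tag) else "text"
    print(kind, str(node))
```

Wrapping the generator in `list()` is only for counting and indexing in this example; in a real crawler it is usually enough to loop over `descendants` directly.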
