BeautifulSoup Library: Tag Parsing and Traversal

small-white

Category: Python web scraping

Article link: https://blog.csdn.net/ndyd_csdn/article/details/105970451

Introduction to the BeautifulSoup library

Basic elements of the BeautifulSoup class

Traversing HTML content

Introduction to the BeautifulSoup library

Installation: pip install beautifulsoup4

<html>
    <body>tag tree</body>
    ...
</html>

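BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree".

<p>...</p> : a tag (Tag)

<p class="title">...</p>

Here p is the name (Name), which appears in both the opening and the closing tag; class="title" is an attribute (Attribute), and a tag can have zero or more attributes.

The Beautiful Soup library is also called beautifulsoup4 or bs4. Most work goes through its BeautifulSoup class.

The conventional imports are: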

from bs4 import BeautifulSoup

import bs4

A BeautifulSoup object represents the entire content of an HTML/XML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>...</html>", "html.parser")

soup = BeautifulSoup(open("filepath"), "html.parser")

# "html.parser" names the parser to use
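The two constructor forms above can be tried end to end with a small sketch. The sample markup and the temporary file are assumptions made for illustration; the temporary file simply stands in for a real path on disk.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = '<html><body><p class="title">Hello</p></body></html>'

# 1) Parse markup held in a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse markup from an open file handle.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    path = tmp.name

with open(path, encoding="utf-8") as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")
os.remove(path)  # clean up the temporary file

print(soup_from_string.p.string)  # Hello
print(soup_from_file.p.string)    # Hello
```

Opening the file in a with block closes the handle once parsing is done, which the one-line open("filepath") form does not guarantee.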

Parsers supported by BeautifulSoup

Parser | Usage | Requirement

bs4's built-in HTML parser | BeautifulSoup(mk, "html.parser") | pip install beautifulsoup4

lxml's HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml

lxml's XML parser | BeautifulSoup(mk, "xml") | pip install lxml

html5lib parser | BeautifulSoup(mk, "html5lib") | pip install html5lib
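Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The sketch below is one possible way to do that; make_soup is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser
        # is not installed (or the name is unknown).
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>tag tree</p>")
print(soup.p.string)  # tag tree
```

Different parsers can build slightly different trees from invalid markup, so results may vary between machines if the fallback kicks in.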

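Basic elements of the BeautifulSoup class

Tag: a tag, the most basic unit of information; its start and end are marked with <>...</>
Name: the name of a tag; the name of <p>...</p> is p; format: <tag>.name
Attributes: the attributes of a tag, such as class and id, organized as a dictionary; format: <tag>.attrs
NavigableString: the non-attribute string inside a tag, that is, the text between <>...</>; format: <tag>.string
Comment: the comment portion of the string inside a tag, represented by the special Comment type (a subclass of NavigableString)

Example: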
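To make the five element types concrete, here is a minimal sketch that pulls each one out of a one-line document. The markup is made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document containing text and a comment.
soup = BeautifulSoup('<p class="title">Hello<!--a note--></p>', "html.parser")
tag = soup.p

print(isinstance(tag, Tag))  # True
print(tag.name)              # p
print(tag.attrs)             # {'class': ['title']}

# The tag has two children: a text node and a comment node.
text, note = tag.contents
print(type(text).__name__)   # NavigableString
print(type(note).__name__)   # Comment

# Comment is a subclass of NavigableString.
print(isinstance(note, NavigableString))  # True
```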

<html>
    
    <head>
        <title>403 Forbidden</title></head>
    
    <body bgcolor="white">
        <h1>403 Forbidden</h1>
        <p>You don't have permission to access the URL on this server. Sorry for the inconvenience.</p>
        <p>Please report this message and include the following information to us. Thank you very much!</p>
        <table>
            <tr>
                <td>URL:</td>
                <td>https://www.jianshu.com/</td></tr>
            <tr>
                <td>Server:</td>
                <td>zurich</td></tr>
            <tr>
                <td>Date:</td>
                <td>2019/12/24 11:49:40</td></tr>
        </table>
    </body>

</html>

Parse the HTML returned by the request with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.jianshu.com')
r.encoding = 'utf-8'  # avoid garbled Chinese characters
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())  # pretty-print the tags, one per line

Parsing a Tag:

Any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several <tag> elements, only the first one is returned.

soup.p

Output: <p>You don't have permission to access the URL on this server. Sorry for the inconvenience.</p>
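The first-match behavior of soup.<tag> is easy to check on a tiny document. In this sketch the markup is invented for illustration, and find / find_all are shown only for contrast.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <p> tags and no <table>.
html = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p)                   # <p>first</p>
print(soup.find("p"))           # <p>first</p>
print(len(soup.find_all("p")))  # 2

# A tag that does not exist in the document yields None.
print(soup.table)               # None
```

Because a missing tag gives None, chained access such as soup.table.children raises AttributeError when the page has no table, so a None check may be worth adding for pages you do not control.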

Parsing a Tag's name:

Every tag has a name, available as <tag>.name; it is a string.

soup.p.name

Output: 'p'

Parsing a Tag's attributes (attrs):

A tag can have zero or more attributes; they are stored as a dictionary.

soup.body.attrs

Output: {'bgcolor': 'white'}

soup.body.attrs['bgcolor']

Output: 'white'
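A few more attribute cases are worth seeing side by side: the shorthand tag[...] lookup, a multi-valued attribute such as class, and a safe lookup for an attribute that may be absent. The markup below is an assumption made for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering one, several, and zero attributes.
html = (
    '<body bgcolor="white">'
    '<p class="title big" id="intro">x</p>'
    "<b>y</b>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.body.attrs)       # {'bgcolor': 'white'}
print(soup.body["bgcolor"])  # white (shorthand for .attrs["bgcolor"])

# class is multi-valued, so bs4 returns a list of its values.
print(soup.p["class"])       # ['title', 'big']
print(soup.p["id"])          # intro

# .get returns None instead of raising KeyError for a missing attribute.
print(soup.p.get("href"))    # None

# A tag with no attributes has an empty dictionary.
print(soup.b.attrs)          # {}
```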

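Parsing a Tag's NavigableString:

.string returns the non-attribute string inside a tag. It can reach through several levels of nesting, as long as each level has exactly one child.

soup.title.string

Output: '403 Forbidden'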
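The "multiple levels" behavior of .string has a limit that a short sketch makes visible: it works through single-child nesting but gives None once a tag has several children. The markup is invented for illustration, and get_text / stripped_strings are shown as possible alternatives for that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one deeply nested string, and one list with two items.
html = "<div><p><b>deep</b></p></div><ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Each level has exactly one child, so .string reaches the inner text.
print(soup.div.string)  # deep

# <ul> has two children, so .string cannot pick one and returns None.
print(soup.ul.string)   # None

# get_text() concatenates all strings below the tag.
print(soup.ul.get_text())              # onetwo
print(list(soup.ul.stripped_strings))  # ['one', 'two']
```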

Parsing a Tag's Comment:

A Comment is the comment portion of the string inside a tag.
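The section gives no example for Comment, so here is a minimal sketch under an assumed two-tag document. It shows that .string returns comment text without the <!-- --> markers, which is why a type check is needed to tell a comment from ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one tag holds a comment, the other plain text.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

# Both look like ordinary strings when printed.
print(soup.b.string)  # This is a comment
print(soup.p.string)  # This is not a comment

# The type is what distinguishes them.
print(type(soup.b.string).__name__)  # Comment
print(type(soup.p.string).__name__)  # NavigableString

is_comment = isinstance(soup.b.string, Comment)
print(is_comment)  # True
```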

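Traversing HTML content

HTML organizes information as a tag tree. Traversing the tree usually takes one of three forms: downward, upward, and sideways (between siblings).

Downward traversal: walking down from the current node

.contents: a list of child nodes; it holds every direct child (only the next level down)
.children: an iterator over child nodes; like .contents, but meant for looping over the direct children
.descendants: an iterator over all descendant nodes, for looping over every level below the current node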
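The difference between the three attributes shows up clearly when you count what each one yields. This sketch uses a cut-down version of the table from the example page, written without whitespace between tags so that no stray text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down table from the example page, with no whitespace between tags.
html = (
    "<table>"
    "<tr><td>URL:</td><td>https://www.jianshu.com/</td></tr>"
    "<tr><td>Server:</td><td>zurich</td></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").table

# .contents is a real list of the direct children.
print(len(table.contents))  # 2

# .children iterates over the same direct children.
child_names = [child.name for child in table.children]
print(child_names)  # ['tr', 'tr']

# .descendants walks every level, depth first, including text nodes.
descendants = list(table.descendants)
print(len(descendants))  # 10
tag_names = [d.name for d in descendants if isinstance(d, Tag)]
print(tag_names)  # ['tr', 'td', 'td', 'tr', 'td', 'td']
```

With the indented HTML from the example page, the whitespace between tags becomes extra NavigableString children, which is why the isinstance check against Tag is useful.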

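Use these attributes to parse the contents of the table:

import bs4

a = soup.table.children

for child in a:
    if isinstance(child, bs4.element.Tag):  # skip non-tag nodes such as whitespace strings
        for i in child.children:
            if isinstance(i, bs4.element.Tag):
                print(i.string)

# The .contents attribute can be used in the same way as .children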
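Upward and sideways traversal are mentioned above but not demonstrated. A minimal sketch, assuming a one-row version of the example table, could use .parent / .parents for moving up and .next_sibling / .previous_sibling for moving sideways.

```python
from bs4 import BeautifulSoup

# One-row version of the example table, with no whitespace between tags.
html = (
    "<html><body><table>"
    "<tr><td>Server:</td><td>zurich</td></tr>"
    "</table></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
first_td = soup.td

# Upward: .parent is the direct parent, .parents iterates up to the root.
print(first_td.parent.name)  # tr
ancestors = [p.name for p in first_td.parents]
print(ancestors)  # ['tr', 'table', 'body', 'html', '[document]']

# Sideways: siblings share the same parent.
print(first_td.next_sibling.string)  # zurich
print(first_td.previous_sibling)     # None
following = [s.string for s in first_td.next_siblings]
print(following)  # ['zurich']
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'. As with downward traversal, siblings in indented HTML are often whitespace strings rather than tags.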