② BeautifulSoup Basics

bear_n

Category: Python web crawler knowledge. Tags: python

Article link: https://blog.csdn.net/bear_n/article/details/52064790

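BeautifulSoup is a Python library for extracting data from HTML or XML files. It organizes complex web content by locating the tags in a document, and presents the document structure as simple, easy-to-use Python objects.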

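I. BeautifulSoup objects

BeautifulSoup objects fall into four kinds:
    Tag objects.
    BeautifulSoup objects.
    NavigableString objects.
    Comment objects.
<1> Tag object: a Tag object corresponds to a tag in the original XML or HTML document.
For example, the tag name h1 together with the text "An Interesting Title" makes up an h1 tag:
           <h1>An Interesting Title</h1>
A Tag object has many methods and attributes; the two most important are its name and its attributes.

① name: every tag has a name, available through .name.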

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Create a BeautifulSoup object named soup; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html_doc, "html.parser")

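# The soup object itself is a special case: its name is [document]
print(soup.name)

# soup.<tag name> is a convenient way to fetch a tag,
# but it only returns the first matching tag in the whole document
print(soup.title)
print(soup.a)

# For any ordinary tag inside the document, .name is the tag's own name
print(soup.head.name)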

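② attributes: a tag may have any number of attributes.
Attributes are read and modified the same way as dictionary entries, and .attrs returns all of them as a dictionary.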

from bs4 import BeautifulSoup

# Create a BeautifulSoup object named soup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

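# Print the name of the b tag (that is, b)
print(soup.b.name)

# Rename the b tag; the change shows up in any HTML generated from this BeautifulSoup object
soup.b.name = "blockquote"
print(soup)

# Print the value of the "class" attribute of the blockquote tag
# (class can hold several values, so the result is a list)
print(soup.blockquote["class"])

print("#"*30)

# Print all attributes of the blockquote tag as a dictionary
print(soup.blockquote.attrs)

# Modify an attribute of the blockquote tag
soup.blockquote['class'] = 'verybold'
print(soup.blockquote)

# Add an attribute to the blockquote tag
soup.blockquote['id'] = 1
print(soup.blockquote)

# Delete an attribute of the blockquote tag
del soup.blockquote['class']
print(soup.blockquote)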
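
The dictionary-style access described above can also be used defensively. The following is a minimal sketch, assuming a made-up one-line document; the markup and the variable names (tag, classes, link_id, missing) are illustrative only and do not come from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# class may hold several values, so BeautifulSoup returns a list
classes = tag["class"]
print(classes)

# Attributes that hold a single value come back as plain strings
link_id = tag["id"]
print(link_id)

# tag["title"] would raise KeyError because the attribute is missing;
# .get() returns None instead, just like dict.get()
missing = tag.get("title")
print(missing)

# has_attr() tests for an attribute without reading its value
print(tag.has_attr("href"))
```

Using .get() is usually the safer choice when scraping real pages, where an attribute may be present on some tags and absent on others.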

<2> BeautifulSoup object: the object used most often in the BeautifulSoup library is the BeautifulSoup object itself, which represents the whole parsed document.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html_doc, "html.parser")

# Print the type of the BeautifulSoup object
print(type(soup))

# Print the name of the BeautifulSoup object
print(soup.name)

# Print the attributes of the BeautifulSoup object
print(soup.attrs)

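The BeautifulSoup library has two more objects. They are used less often, but are worth knowing.

<3> NavigableString object: represents the text inside a tag.

<4> Comment object: represents a comment in an HTML document, <!-- like this -->.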
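
Since the article gives no code for these two objects, here is a minimal sketch of how they can be told apart. The markup and the variable names (text_node, comment_node) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup: a paragraph holding some text followed by a comment
markup = "<p>Plain text<!-- a hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# The <p> tag has two children: a text node and a comment node
text_node, comment_node = soup.p.contents

# Ordinary text is a NavigableString
print(type(text_node))

# A comment is a Comment, which is itself a subclass of NavigableString
print(type(comment_node))
print(isinstance(comment_node, NavigableString))

# Both behave like Python strings; the comment's text excludes the <!-- --> markers
print(str(text_node))
print(str(comment_node))
```

Because Comment is a subclass of NavigableString, code that collects the visible text of a page may want to skip nodes for which isinstance(node, Comment) is true.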

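II. Navigating the tree

Navigation tree (document tree): an HTML page can be mapped onto a tree, which makes it possible to find a tag by its position in the document.

<1> Child tags: as in a family tree, a child tag sits exactly one level below its parent, while a descendant is a tag at any level below the parent.
Every child is a descendant, but not every descendant is a child.
In general, BeautifulSoup functions work on the descendants of the current tag. For example, soup.body.h1 selects the first h1 tag among the descendants of body; it does not look for h1 tags outside body.
.children iterates over the direct children of the current tag.
.descendants iterates over all descendants of the current tag.
.contents returns the direct children of the current tag as a list.
Note: string nodes in BeautifulSoup do not support these attributes, because a string has no children.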

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Create a BeautifulSoup object named soup
soup = BeautifulSoup(html_doc, "html.parser")

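# The simplest way to navigate the tree is to ask for a tag by name
# To get the <head> tag, just use soup.head
print(soup.head)

# Get the first <b> tag inside the <body> tag
print(soup.body.b)

# Attribute-style access only returns the first tag with that name,
# so this returns only the first <a> in the soup object
print(soup.a)

# .contents returns the direct children of the current tag as a list
print(soup.head.contents)

# .children iterates over the direct children of the current tag
# (here, the children of <title>, the first child of <head>)
for child in soup.head.contents[0].children:
    print(child)

# .descendants iterates over all descendants of the current tag
for child in soup.head.descendants:
    print(child)

# If a tag has exactly one child and that child is a NavigableString, .string returns it
print(soup.head.contents[0].string)

# If a tag's only child is another tag that has a .string, the outer tag has the same .string
print(soup.head.string)

# If a tag contains several strings, loop over them with .strings
for string in soup.strings:
    print(repr(string))

# The strings may include many spaces and blank lines; .stripped_strings removes the extra whitespace
for string in soup.stripped_strings:
    print(repr(string))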
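
The difference between children and descendants is easiest to see by counting nodes. The following is a minimal sketch on a made-up fragment; the markup and the variable names (child_names, descendants, descendant_names) are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div> has two child tags, one of which has a nested <b>
markup = "<div><p>one <b>two</b></p><span>three</span></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# Direct children only: <p> and <span>
child_names = [child.name for child in div.children]
print(child_names)

# All descendants, in document order: tags and the strings inside them
descendants = list(div.descendants)
for node in descendants:
    print(repr(node))

# String nodes have no tag name, so .name is None for them
descendant_names = [node.name for node in descendants]
print(descendant_names)
```

Note that .descendants includes string nodes as well as tags, which is why it yields six nodes here although the fragment contains only three tags below the div.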

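<2> Sibling tags
.next_sibling returns the next sibling of the current tag.
.previous_sibling returns the previous sibling of the current tag.
.next_siblings iterates over all siblings that follow the current tag.
.previous_siblings iterates over all siblings that precede the current tag.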

from bs4 import BeautifulSoup

# <b>, <c>, <d> and <e> are all children of the <a> tag,
# so they are siblings of one another
soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d><e>text4</e></a>", "html.parser")
print(soup.prettify())

# <b> has a next sibling, so .next_sibling returns <c>
# <b> has no previous sibling, so .previous_sibling is None
print(soup.b.next_sibling)
print(soup.b.previous_sibling)

# <e> has no next sibling, so .next_sibling is None
# <e> has a previous sibling, so .previous_sibling returns <d>
print(soup.e.next_sibling)
print(soup.e.previous_sibling)

# Note: the strings "text1" and "text2" are not siblings, because they have different parents

# .next_siblings iterates over all siblings that follow the current tag
for sibling in soup.b.next_siblings:
    print(repr(sibling))

# .previous_siblings iterates over all siblings that precede the current tag
for sibling in soup.e.previous_siblings:
    print(repr(sibling))

# .next_element points to whatever was parsed immediately after the current object
print(soup.a.next_element)

# .previous_element points to whatever was parsed immediately before the current object
print(soup.e.previous_element)

# The .next_elements and .previous_elements iterators move forward or backward
# through the document in parse order, as if it were being parsed again
for element in soup.a.next_elements:
    print(repr(element))

for element in soup.e.previous_elements:
    print(repr(element))
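
In real documents the sibling of a tag is often a string rather than another tag, because the punctuation and line breaks between tags are nodes too. The following is a minimal sketch of that situation, assuming a shortened two-link fragment; the markup and the variable names (first, raw_next, tag_siblings) are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: two links separated by a comma and a line break
markup = """<p><a id="link1">Elsie</a>,
<a id="link2">Lacie</a></p>"""
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

# The next sibling of the first <a> is the text between the two links, not the second <a>
raw_next = first.next_sibling
print(repr(raw_next))

# Keeping only Tag instances skips the string nodes in between
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)
```

Filtering with isinstance(s, Tag) is one simple way to walk over sibling tags only; it relies on nothing beyond the .next_siblings iterator described above.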

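<3> Parent tags: when scraping web pages, looking up a parent tag is needed far less often than looking up children or siblings. When we study an HTML page with scraping in mind, we usually start from the top-level tags and then work out how to reach the block of data we want. Occasionally, though, BeautifulSoup's parent lookups are useful.
.parent returns the parent of the current tag.
.parents iterates over all ancestors of the current tag, from the nearest outward.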

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# .parent returns the parent of the current tag
print(soup.title.parent)

# The title string also has a parent: the <title> tag
print(soup.title.string.parent)

# The parent of a top-level tag such as <html> is the BeautifulSoup object
print(soup.html.parent)

# The parent of the BeautifulSoup object is None
print(soup.parent)

# .parents walks up through every ancestor of the tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
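
As a small usage example, the ancestors returned by .parents can be collected into a path that shows where a tag sits in the document. This is a minimal sketch on a shortened, made-up document; the markup and the names ancestor_names and path are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document
markup = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(markup, "html.parser")

# .parents yields the nearest ancestor first and ends with the BeautifulSoup object
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# Reversing the list gives the path from the document root down to the tag's parent
path = " > ".join(reversed(ancestor_names))
print(path)
```

The last name in the list is [document], the name of the BeautifulSoup object itself, which matches the earlier observation that the parent of the html tag is the BeautifulSoup object.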