BeautifulSoup Study Notes

Most recently recommended article published 2022-01-12 17:49:28

靠谱的人

Category: Web scraping

Article link: https://blog.csdn.net/weixin_43085185/article/details/104335308

1. Introduction

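Documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. It can save you hours or even days of work.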

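We follow the example from the documentation and analyse the same document here. The object is named html_doc and holds a snippet of HTML. In real scraping, what we parse is usually response.content.decode('utf-8').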

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

2. Parsing and pretty-printing

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
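
In real scraping the input usually arrives as bytes rather than as a ready-made string. The following is a minimal sketch of that step; raw_bytes is a made-up stand-in for response.content, so no network access is needed, and the sample markup is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content: raw UTF-8 bytes from a server.
raw_bytes = "<html><body><p>你好, Beautiful Soup</p></body></html>".encode("utf-8")

# Option 1: decode the bytes yourself, then parse the resulting string.
html_text = raw_bytes.decode("utf-8")
soup = BeautifulSoup(html_text, "html.parser")
print(soup.p.string)

# Option 2: hand the bytes to Beautiful Soup and name the encoding.
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
print(soup_from_bytes.p.string)

# prettify() returns the document as an indented string.
print(soup.prettify())
```

Decoding explicitly keeps the encoding decision visible in your own code, which helps when a site declares one encoding and serves another.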

3. A few quick ways to browse the document

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

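Everything above uses the soup.<tag name> form, which is a quick way to look at a tag but is quite limited, because it only ever returns the first matching tag. Let's dig deeper.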
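
To make that limitation concrete, here is a small sketch using a shortened copy of the sample document (the shortening is mine, not part of the original notes). Attribute-style access stops at the first match, while find_all() returns every match.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# soup.a is shorthand for soup.find("a"): only the first <a> is returned.
first_link = soup.a
print(first_link["id"])  # link1

# find_all() returns every match, so the links can be processed in a loop.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# A tag that does not exist yields None rather than raising an error.
print(soup.table)  # None
```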

4. Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
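
As a quick illustration, the sketch below parses a tiny made-up snippet (an assumption chosen so that all four kinds appear) and prints the type of each node.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
tag = soup.p
text = tag.contents[0]      # the plain string inside <p>
comment = tag.contents[1]   # the HTML comment inside <p>

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# BeautifulSoup is a subclass of Tag, and Comment of NavigableString.
print(isinstance(soup, Tag), isinstance(comment, NavigableString))
```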

Tag

Tags are the foundation of HTML. For example:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

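The element above is a Tag. Tags have many methods and attributes, which are explained in detail under navigating the tree and searching the tree. For now, the two most important features of a tag are its name and its attributes.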

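Name

Every tag has a name, which you can read through .name:
In the example above, soup.a.name is of course a, because it is an <a> tag. This may not look very useful on its own, but it pays off later when traversing the tree.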

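Attributes

A tag can have any number of attributes. For example, the tag <b class="boldest"> has a "class" attribute whose value is "boldest". Attributes are handled the same way as dictionary entries, and there are three ways to read one: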

soup.a['class']
soup.a.get('class')
soup.a.attrs['class']

To repeat: a tag's attributes are handled exactly like a dictionary!
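
The dictionary-style handling covers reading, assigning, and deleting. The following sketch uses one of the <a> tags from the sample document; the variable name link and the new id value are mine.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>',
    "html.parser",
)
link = soup.a

print(link["href"])        # http://example.com/tillie
print(link["class"])       # ['sister'] (class is multi-valued, so a list)
print(link.get("title"))   # None; link["title"] would raise KeyError
print(sorted(link.attrs))  # ['class', 'href', 'id']

link["id"] = "link-tillie"  # assign, like dict[key] = value
del link["class"]           # delete, like del dict[key]
print(link.attrs)
```

Using .get() is the safer choice when an attribute may be missing, just as with an ordinary dict.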

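NavigableString

Beautiful Soup wraps the text inside a tag in the NavigableString class. You reach it through tag.string:

soup.a.string  # result: 'Elsie'

To get all the text under a tag in one go, use .text. It works like an enhanced .string and is very handy for scraping novels.
A NavigableString behaves like an ordinary Python string and also supports some of the features described under navigating the tree and searching the tree. Call str() to convert a NavigableString into a plain string (the 4.4.0 documentation uses unicode(), which only exists in Python 2):

# Here tag is assumed to be <b class="boldest">Extremely bold</b>
plain_string = str(tag.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>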
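
The difference between .string and .text shows up as soon as a tag has more than one child. This is a minimal sketch on a made-up paragraph (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>Once upon a time <b>three</b> sisters</p>", "html.parser"
)
p = soup.p

# .string only works when a tag has exactly one child.
print(p.b.string)  # three
print(p.string)    # None, because <p> has several children

# .text is a shortcut for get_text(): it joins every string under the tag.
print(p.text)      # Once upon a time three sisters

# get_text() can also strip each piece and join with a separator.
print(p.get_text(" ", strip=True))
```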

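The string inside a tag cannot be edited in place, but it can be replaced with another string by calling replace_with():

# Here tag is a <blockquote> element
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>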

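A NavigableString supports most of the attributes defined under navigating the tree and searching the tree, but not all of them. In particular, a string cannot contain anything else (whereas a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.

If you want to use a NavigableString outside Beautiful Soup, call str() on it to turn it into an ordinary Python string. Otherwise the string keeps a reference to the whole parse tree even after you have finished with Beautiful Soup, which wastes memory.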
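
The points about conversion and replacement can be checked with a short sketch. The sample markup follows the "Extremely bold" example used in the documentation; the variable names nav and plain are mine.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav = soup.b.string

# A NavigableString is a str subclass that still points back into the tree.
print(isinstance(nav, str))  # True
print(nav.parent.name)       # b

# str() produces a plain string with no link to the parse tree.
plain = str(nav)
print(type(plain) is str)        # True
print(hasattr(plain, "parent"))  # False

# The text cannot be edited in place, but it can be swapped out.
soup.b.string.replace_with("No longer bold")
print(soup.b)  # <b>No longer bold</b>
```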

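BeautifulSoup object

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object, since it supports most of the methods described under navigating the tree and searching the tree.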

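Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. It is still sometimes useful to look at its .name, so the BeautifulSoup object is given a special .name with the value "[document]".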
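
A brief sketch of the BeautifulSoup object acting as the root of the tree (the one-line document is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

# The special name given to the document root.
print(soup.name)  # [document]

# Walking up from <p>: body -> html -> the BeautifulSoup object itself.
root = soup.p.parent.parent.parent
print(root is soup)  # True

# Search methods work on it just as they do on a Tag.
print(soup.find("p").string)  # hi
```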

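Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to have to think about is the comment: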

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

A Comment object is just a special kind of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'
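
Because a Comment looks like ordinary text when accessed through .string, it helps to detect comments explicitly before extracting text. The sketch below is one possible approach under my own sample markup: it finds comments with an isinstance() check and removes them.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every comment node in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])

# Remove the comments so they cannot leak into the extracted text.
for c in comments:
    c.extract()

print(soup.p.get_text())  # Visible text
```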

5. Some practical usage

Getting the text of every a tag

There are two ways to do it:

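# Method 1: read .string from each <a> tag
soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a')
for link in links:  # avoid naming the variable str, which shadows the built-in
    print(link.string)

# Method 2: take the first item yielded by .strings
soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(list(link.strings)[0])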

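The first way is easy to understand, and this example is a simple one. When tags are nested deeply, the second way is more convenient: .strings yields every piece of text under a tag, but the result is a generator, so convert it to a list with list() before indexing into it. If this is unclear, trying it a few times makes it click.
There are other similar attributes as well.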
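
To show how these attributes behave once tags are nested, here is a small sketch on a made-up link containing two child tags. It also uses .stripped_strings and get_text(), which are part of the Beautiful Soup API but are not covered in the notes above.

```python
from bs4 import BeautifulSoup

html = '<a href="#"><b>Elsie</b> <i>the eldest</i></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# .string gives up when a tag has more than one child.
print(link.string)                  # None

# .strings yields every piece of text, including whitespace-only pieces.
print(list(link.strings))           # ['Elsie', ' ', 'the eldest']

# .stripped_strings trims each piece and skips the empty ones.
print(list(link.stripped_strings))  # ['Elsie', 'the eldest']

# get_text() joins everything into a single string.
print(link.get_text())              # Elsie the eldest
```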
