BeautifulSoup Study Notes

Latest recommended article published on 2024-08-19 09:22:32

Author: 高级cv算法设计师

Category: BeautifulSoup    Tags: web scraping, data mining

Article link: https://blog.csdn.net/qq_44732013/article/details/114699631

1. Finding a tag by name: dot (.) access returns only the first matching tag and can be chained

soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.b  # dot access by tag name can be chained; each step returns only the first match
#<b>The Dormouse's story</b>
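The snippets in these notes assume a `soup` object already exists. The following is a minimal sketch of how it could be built and how dot access relates to `find()`. The shortened `html_doc` and the choice of the built-in `"html.parser"` are assumptions made for illustration, not taken from the original notes.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the document used in these notes (an assumption).
html_doc = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters</p>
<p class="story">...</p>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p              # first <p> in document order
same_p = soup.find("p")       # dot access is shorthand for find()
all_p = soup.find_all("p")    # every <p>, not just the first
missing = soup.table          # there is no <table>, so this is None

print(first_p is same_p)      # True
print(len(all_p))             # 3
print(missing)                # None
print(soup.p.b.string)        # The Dormouse's story
```

Because a missing tag yields None, a chained lookup such as `soup.table.tr` would raise AttributeError, so it is worth checking for None before chaining further.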

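2. Comparing .contents, .children, and .descendants

.contents returns a list of the tag's direct children (whitespace strings between tags included).
.children returns an iterator over the same direct children.
.descendants returns a generator over all descendants in document order; its first item is the tag's first child.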

soup.body
"""
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
"""
soup.body.contents
"""
['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']
 """
soup.body.children  # the same direct children as .contents, but returned as an iterator
#<list_iterator at 0x1cb15c3cdd8>
list(soup.body.children)  # convert to a list
"""
['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']
"""
for i in soup.body.children:
    print(i)
"""


<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
"""



# .descendants walks every descendant of a tag recursively and returns a generator
soup.p.descendants
#<generator object Tag.descendants at 0x000001CB15C157C8>
list(soup.p.descendants)
#[<b>The Dormouse's story</b>, "The Dormouse's story"]
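To put the three attributes side by side, here is a small sketch on a compact fragment. The `<div>` wrapper and the variable names are assumptions for illustration; the fragment has no whitespace between tags, so no '\n' strings appear among the children.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<div><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">...</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents                  # list of direct children
via_iter = list(div.children)          # same nodes, produced lazily
descendants = list(div.descendants)    # depth-first, document order

# Show tag names for tags and the text itself for strings.
labels = [d.name if isinstance(d, Tag) else str(d) for d in descendants]

print(len(direct))                     # 2
print(via_iter == direct)              # True
print(labels)  # ['p', 'b', "The Dormouse's story", 'p', '...']
print(descendants[0] is direct[0])     # True
```

Filtering with `isinstance(node, Tag)` is a common way to skip the string nodes when only element children matter.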

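3. Notes on .string

If a tag has exactly one child and that child is a string, .string returns it; if the only child is another tag, .string follows that tag down to its single string.
If a tag has more than one child, it is ambiguous which string .string should refer to, so it returns None.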

soup.body
"""
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
"""
soup.body.string  # None: <body> has several children, so nothing is displayed
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.string
#"The Dormouse's story"
soup.p.b.string
#"The Dormouse's story"
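The two cases can be checked directly. This is a sketch on a reduced fragment (the HTML and variable names are assumptions); `get_text()` is shown as one possible fallback when `.string` is None.

```python
from bs4 import BeautifulSoup

html = (
    '<body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

title_p = soup.find("p", class_="title")
story_p = soup.find("p", class_="story")

print(title_p.string)      # The Dormouse's story (single child chain p -> b -> text)
print(story_p.string)      # None (four children: text, <a>, text, <a>)
print(soup.body.string)    # None (two <p> children)

# When .string is None, get_text() still concatenates all nested text.
print(story_p.get_text())  # Once Elsie and Lacie
```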

4. .strings: returns a generator over every string in the document

type(soup.strings)
#generator
for string in soup.strings:  # yields each string in the document, including whitespace-only ones
    print(string)
"""
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.
"""

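The blank lines in the printed output above come from whitespace-only strings between tags. A small sketch that makes them visible (the fragment and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = "<body>\n<p>Elsie</p>\n<p>Lacie</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Materialize the generator so the whitespace strings can be inspected.
all_strings = [str(s) for s in soup.strings]
print(all_strings)   # ['\n', 'Elsie', '\n', 'Lacie', '\n']

# Keep only strings that contain something other than whitespace.
non_blank = [s for s in all_strings if s.strip()]
print(non_blank)     # ['Elsie', 'Lacie']
```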
5. .stripped_strings: yields each string with leading and trailing whitespace removed and skips whitespace-only strings

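As a minimal sketch of this behavior on a small fragment (the HTML and variable names are assumptions for illustration), note that stripping applies only to the ends of each string, so a newline inside a string survives. Joining with a space instead of an empty string keeps words from running together, and `get_text(separator=" ", strip=True)` gives the same result in one call.

```python
from bs4 import BeautifulSoup

html = (
    "<p>names were\n<a>Elsie</a>,\n<a>Lacie</a> and\n"
    "<a>Tillie</a>;\nin a well.</p>"
)
soup = BeautifulSoup(html, "html.parser")

pieces = list(soup.stripped_strings)
print(pieces)
# ['names were', 'Elsie', ',', 'Lacie', 'and', 'Tillie', ';\nin a well.']

glued = "".join(pieces)       # words run together
spaced = " ".join(pieces)     # one space between pieces
one_call = soup.get_text(separator=" ", strip=True)

print(glued == "names wereElsie,LacieandTillie;\nin a well.")  # True
print(spaced == one_call)                                      # True
```
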
# Join the stripped strings: whitespace-only strings are skipped and each string is stripped at both ends
lt = "".join(soup.stripped_strings)
print(lt)
"""
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,LacieandTillie;
and they lived at the bottom of a well....
"""
soup.get_text()  # getText() is the older alias; returns all text in the document as one string
"""
"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
"""