BeautifulSoup Study Notes

1. Finding a tag: dot (.) plus the tag name. This returns only the first matching tag, and the lookups can be chained.

soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.b  # dot access by tag name can be chained on the soup object; each step returns only the first match
#<b>The Dormouse's story</b>
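
To make the "first match only" behavior concrete, here is a minimal sketch. The two-paragraph HTML string and the variable names are made up for illustration, and it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, used only for this illustration.
html = '<p class="title"><b>First</b></p><p class="story">Second</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # first <p> found in the document
same_p = soup.find("p")     # the equivalent explicit lookup
all_p = soup.find_all("p")  # every <p>, when one match is not enough
missing = soup.table        # there is no <table>, so this is None

print(first_p["class"], len(all_p), missing)
```

When every match is needed, find_all is the usual choice. Because a missing tag gives None, a chained lookup such as soup.table.tr would raise AttributeError.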

2. Comparing .contents, .children, and .descendants

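  • .contents returns a list of the node's direct children
  • .children returns an iterator over the node's direct children
  • .descendants returns a generator over all of the node's descendants, in document order; the first item is the first child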
soup.body
"""
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
"""
soup.body.contents
"""
['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']
"""
soup.body.children  # same direct children as .contents, but returned as an iterator
#<list_iterator at 0x1cb15c3cdd8>
list(soup.body.children)  # convert to a list
"""
['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']
"""
for i in soup.body.children:
    print(i)
"""


<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
"""



# .descendants iterates recursively over all of a tag's descendants and returns a generator
soup.p.descendants
#<generator object Tag.descendants at 0x000001CB15C157C8>
list(soup.p.descendants)
#[<b>The Dormouse's story</b>, "The Dormouse's story"]
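
The difference between the three attributes is easier to see on a tiny tree. The following is a sketch only: the `<div>` markup and variable names are invented for this example, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a <div> with two direct children, one of them nested.
soup = BeautifulSoup("<div><p><b>bold</b> text</p><i>tail</i></div>", "html.parser")
div = soup.div

direct = div.contents              # list of direct children: the <p> and the <i>
via_children = list(div.children)  # the same nodes, produced by an iterator
all_nodes = list(div.descendants)  # depth-first: p, b, "bold", " text", i, "tail"

# Descendants include string nodes as well as tags.
tags_only = [node for node in all_nodes if isinstance(node, Tag)]
print(len(direct), len(all_nodes), len(tags_only))
```

Here .contents and .children see two nodes, while .descendants sees six, three of which are tags.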


3. Notes on .string

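If a tag has exactly one child and that child is a string, .string returns that string. If the only child is another tag that itself has a single string, .string returns that same string.
If the tag has more than one child, .string cannot tell which string is meant and returns None.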

soup.body
"""
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
"""
soup.body.string
#None (<body> has several children, so the REPL prints nothing)
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.string
#"The Dormouse's story"
soup.p.b.string
#"The Dormouse's story"
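
The two rules above can be checked side by side. This is an illustrative sketch with invented markup; get_text() is shown as one common fallback when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has a single chain of children,
# the second <p> has two children (a string and an <i> tag).
soup = BeautifulSoup("<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser")
first, second = soup.find_all("p")

single = first.string        # "only": reached through the lone <b> child
ambiguous = second.string    # None: more than one child
fallback = second.get_text() # "one two": concatenates all strings instead

print(repr(single), repr(ambiguous), repr(fallback))
```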

4. .strings: returns a generator over every string in the document

type(soup.strings)
#generator
for string in soup.strings:  # yields every string in the document, one at a time
    print(string)
"""
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.
"""
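
The blank lines in the printed output come from whitespace-only strings between tags. Collecting the strings into a list makes them visible; the snippet below is a sketch on a made-up fragment, parsed with "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines between the tags.
soup = BeautifulSoup("<p>\n<a>Elsie</a>,\n<a>Lacie</a>\n</p>", "html.parser")

raw = list(soup.strings)  # newline-only strings are kept as separate items
print(raw)
```

Each "\n" is its own item, which is why printing them one by one leaves empty lines.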

5. .stripped_strings: yields each string with leading and trailing whitespace removed, skipping strings that contain only whitespace

lt=""
for string in soup.stripped_strings:  # whitespace-only strings are skipped; the rest are stripped at both ends
    lt+=string
print(lt)
"""
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,LacieandTillie;
and they lived at the bottom of a well....
"""
soup.getText()  # alias of get_text(); shown for comparison, whitespace is kept
"""
"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
"""
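
Concatenating the stripped strings with no separator runs words together, as in "wereElsie" above. One possible way around this is to join with a space, or to call get_text with a separator and strip=True. The fragment and variable names below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray spaces and a newline around the text.
soup = BeautifulSoup(
    "<p>  Once upon a time  <a> Elsie </a>,\n <a>Lacie</a>  </p>", "html.parser"
)

parts = list(soup.stripped_strings)       # stripped, whitespace-only items dropped
joined = " ".join(parts)                  # explicit separator keeps words apart
via_get_text = soup.get_text(" ", strip=True)  # same idea in a single call

print(parts)
print(joined)
```

Note that the separator is also placed before punctuation that sits in its own string, so the result here reads "Elsie , Lacie".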
