Complete guide: Beautiful Soup 4.4.0 Documentation — Chinese translation of the Beautiful Soup 4.2.0 docs
Backup URL: https://github.com/DeronW/beautifulsoup/blob/v4.4.0/docs/index.rst
0. Fixing garbled Chinese text and image-loading problems
See "Fixing garbled Chinese text with Requests" — chaowanghn's blog on CSDN
1. Note that some HTML attributes are multi-valued (any attribute that was ever defined as multi-valued in some version of the HTML spec). Accessing a multi-valued attribute returns a list. If the document being parsed is XML, however, no attribute is treated as multi-valued.
Erratum: when the official documentation first introduces Tag attributes, its early examples show the class attribute as a plain string; that is incorrect — class is returned as a list.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
>>> soup.b['class']
['boldest']
>>> tag = soup.b
>>> tag['class']
['boldest']
>>> tag.attrs
{'class': ['boldest']}
>>>
>>> xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
>>> xml_soup.p['class']
'body strikeout'
>>>
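A quick sketch of the contrast (markup invented for illustration): class was defined as multi-valued in the HTML spec, while id never was, so id comes back as a plain string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="first"></p>', 'html.parser')

# class is a multi-valued attribute -> returned as a list
print(soup.p['class'])  # ['body', 'strikeout']

# id was never defined as multi-valued -> returned as a plain string
print(soup.p['id'])     # 'first'
```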
2. ele.prettify() renders the element as a structured string, one level per line; note that this differs from __repr__.
If you just want the resulting string and don't care about the formatting, call Python's str() on a BeautifulSoup or Tag object (unicode() in Python 2):
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> doc1="<html> <head> <title> The Dormouse's story </title> </head></html>"
>>> soup2=BeautifulSoup(doc1,"html.parser")
>>> print(soup2)
<html> <head> <title> The Dormouse's story </title> </head></html>
>>> print(soup2.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
</html>
3. Child tags can be reached with dotted access (.tag_name), but dotted access only returns the first tag with that name.
>>> doc3="""
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">
... Once upon a time there were three little sisters; and their names were
... <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
... <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
... <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
... ;and they lived at the bottom of a well.
... </p>
... </body>
... """
>>> soup3=BeautifulSoup(doc3,"html.parser")
>>> soup3.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup3.b
<b>The Dormouse's story</b>
4. .contents returns the list of direct children (note: the '\n' strings from the source markup are elements too and show up in the list; their .name is None), while .children returns a generator over the same direct children. Both cover only direct children; .descendants recursively yields every descendant, level by level (as a generator).
>>> doc4="""
... <body>
... <p>p1</p>
... <p>p2</p>
... </body>
... """
>>> soup4=BeautifulSoup(doc4,"html.parser")
>>> soup4.contents
['\n', <body>
<p>p1</p>
<p>p2</p>
</body>, '\n']
>>> soup4.body.contents
['\n', <p>p1</p>, '\n', <p>p2</p>, '\n']
>>> doc5="""
... <body>
... <p>p1</p><p>p2</p>
... </body>
... """
>>> soup5=BeautifulSoup(doc5,"html.parser")
>>> soup5.body.contents
['\n', <p>p1</p>, <p>p2</p>, '\n']
>>> soup3.body.contents
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
;and they lived at the bottom of a well.
</p>, '\n']
>>> list(soup3.body.descendants)
['\n', <p class="title"><b>The Dormouse's story</b></p>, <b>The Dormouse's story</b>, "The Dormouse's story", '\n', <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
;and they lived at the bottom of a well.
</p>, '\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 'Elsie', ',\n ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'Lacie', '
and\n ', <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>, 'Tillie', '\n ;and they lived at the bottom of a well.\n ', '\n']
5. .string returns the tag's single child node, whether that child is a NavigableString or a Tag. If the children are not unique, it returns None. To extract all NavigableString objects, use .strings, or .stripped_strings (which drops whitespace-only strings).
>>> doc5="""
... <a>
... line1
...
... line2
...
... line3
...
... </a>
... """
>>> soup=BeautifulSoup(doc5,"html.parser")
>>> soup.a.string
'\n line1\n \n line2\n \n line3\n \n '
>>> for line in soup.strings:
... print(repr(line))
...
'\n'
'\n line1\n \n line2\n \n line3\n \n '
'\n'
>>>
>>> doc6="""
... <head> line1 </head>
... <a> line2 </a>
... <p> line3 </p>
... """
>>>
>>> soup=BeautifulSoup(doc6,"html.parser")
>>> for line in soup.strings:
... print(repr(line))
...
'\n'
' line1 '
'\n'
' line2 '
'\n'
' line3 '
'\n'
>>> for line in soup.stripped_strings:
... print(repr(line))
...
'line1'
'line2'
'line3'
6. .parent returns the direct parent node; .parents iterates recursively over all ancestors.
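A minimal sketch (markup invented for illustration): .parents walks all the way up to the BeautifulSoup object itself, whose .name is '[document]':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><b>text</b></p></body></html>', 'html.parser')
b = soup.b

print(b.parent.name)                # 'p' -- the direct parent
print([t.name for t in b.parents])  # ['p', 'body', 'html', '[document]']
```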
7. Use the .next_sibling and .previous_sibling attributes to navigate between siblings in the tree. The first node has no .previous_sibling and the last node has no .next_sibling; both return None.
Note that in real documents a tag's .next_sibling and .previous_sibling are usually strings or whitespace, not tags.
The .next_siblings and .previous_siblings attributes iterate over all of the current node's siblings in either direction.
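A small sketch of both behaviors (markup invented for illustration): siblings at the edges return None, and in source with newlines the immediate sibling is usually a string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>a</p><p>b</p>', 'html.parser')
first, last = soup.find_all('p')
print(first.previous_sibling)  # None -- nothing before the first node
print(last.next_sibling)       # None -- nothing after the last node

# With newlines in the source, the immediate sibling is a whitespace string:
soup2 = BeautifulSoup('<ul>\n<li>one</li>\n<li>two</li>\n</ul>', 'html.parser')
li = soup2.li
print(repr(li.next_sibling))         # '\n'
print(li.next_sibling.next_sibling)  # <li>two</li>
```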
8. The .next_element attribute points to the next object parsed (a string or tag); .previous_element points to the previous one.
The .next_elements and .previous_elements iterators walk forward or backward through the document in parse order, as if the document were being parsed again.
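The difference from .next_sibling in one sketch (markup invented): the parser reads the string inside <b> before the string that follows it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>one</b>two</p>', 'html.parser')
b = soup.b

# Parse order is: <p>, <b>, 'one', 'two'
print(repr(b.next_element))  # 'one' -- the string inside <b> is parsed next
print(repr(b.next_sibling))  # 'two' -- but the sibling is the string after </b>
```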
9. soup.head.title is shorthand for tag-name navigation. The shorthand works by repeatedly calling find() on the current tag:
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
Likewise, tag.find_all('h1') and tag('h1') are equivalent.
10. find_all() and find() search only the current tag's descendants (children, grandchildren, and so on).
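To restrict the search to direct children only, pass recursive=False; a sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><head><title>t</title></head></html>', 'html.parser')

# By default the search descends through grandchildren too:
print(soup.html.find_all('title'))                   # [<title>t</title>]

# recursive=False checks direct children only; <title> is a grandchild:
print(soup.html.find_all('title', recursive=False))  # []
```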
11. The full family of search methods shares a similar signature (only find_all() and find() take the recursive argument):
find_all( name , attrs , recursive , string , **kwargs )
find( name , attrs , recursive , string , **kwargs )
find_parents( name , attrs , string , **kwargs )
find_parent( name , attrs , string , **kwargs )
find_next_siblings( name , attrs , string , **kwargs )
find_next_sibling( name , attrs , string , **kwargs )
find_previous_siblings( name , attrs , string , **kwargs )
find_previous_sibling( name , attrs , string , **kwargs )
find_all_next( name , attrs , string , **kwargs )
find_next( name , attrs , string , **kwargs )
find_all_previous( name , attrs , string , **kwargs )
find_previous( name , attrs , string , **kwargs )
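A sketch of a few of the variants against an invented three-paragraph document:

```python
from bs4 import BeautifulSoup

doc = '<body><p id="a">one</p><p id="b">two</p><p id="c">three</p></body>'
soup = BeautifulSoup(doc, 'html.parser')
middle = soup.find('p', id='b')

print(middle.find_previous_sibling('p'))  # <p id="a">one</p>
print(middle.find_next_sibling('p'))      # <p id="c">three</p>
print(middle.find_parent('body').name)    # 'body'
```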
12. Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.
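A few common selector forms (markup invented for illustration):

```python
from bs4 import BeautifulSoup

doc = ('<body><p class="title">T</p>'
       '<a class="sister" id="link1" href="#">Elsie</a></body>')
soup = BeautifulSoup(doc, 'html.parser')

print(soup.select('p.title'))      # by tag + class
print(soup.select('a#link1'))      # by tag + id
print(soup.select('body > a'))     # direct-child combinator
print(soup.select_one('.sister'))  # first match only (or None)
```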
13. get_text()
To get only the text a tag contains, call get_text(); it collects all the text in the tag, including text inside descendant tags, and returns it as a single Unicode string.
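A sketch, including the optional separator and strip arguments (markup invented):

```python
from bs4 import BeautifulSoup

doc = '<p>Hello, <a href="http://example.com/">example<i>.com</i></a>.</p>'
soup = BeautifulSoup(doc, 'html.parser')

print(soup.p.get_text())                 # 'Hello, example.com.'

# Join the pieces with a separator and strip whitespace from each:
print(soup.p.get_text('|', strip=True))  # 'Hello,|example|.com|.'
```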