Complete guide: Beautiful Soup 4.4.0 Documentation — Chinese translation of the Beautiful Soup 4.2.0 docs
Backup URL: https://github.com/DeronW/beautifulsoup/blob/v4.4.0/docs/index.rst
0. Fixing garbled Chinese text and image-loading problems
See "Fixing garbled Chinese text with Requests" — chaowanghn's blog on CSDN
1. Note that some HTML attributes are multi-valued (any attribute that was ever defined as multi-valued in some version of the HTML spec). Accessing a multi-valued attribute returns a list. If the document being parsed is XML, however, no attribute is treated as multi-valued.
Erratum: when the official documentation first introduces Tag attributes, its early examples show the class attribute as a plain string; that is incorrect — class is returned as a list.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
>>> soup.b['class']
['boldest']
>>> tag = soup.b
>>> tag['class']
['boldest']
>>> tag.attrs
{'class': ['boldest']}
>>>
>>> xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
>>> xml_soup.p['class']
'body strikeout'
>>>
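A quick sketch of the contrast (markup invented for illustration): class was defined as multi-valued in the HTML spec, while id never was, so id comes back as a plain string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="first"></p>', 'html.parser')

# class is a multi-valued attribute -> returned as a list
print(soup.p['class'])  # ['body', 'strikeout']

# id was never defined as multi-valued -> returned as a plain string
print(soup.p['id'])     # 'first'
```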
2. ele.prettify() renders the element as a structured string, one level per line; note that this differs from __repr__.
If you just want the resulting string and don't care about the formatting, call Python's str() on a BeautifulSoup or Tag object (unicode() in Python 2):
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> doc1="<html> <head> <title> The Dormouse's story </title> </head></html>"
>>> soup2=BeautifulSoup(doc1,"html.parser")
>>> print(soup2)
<html> <head> <title> The Dormouse's story </title> </head></html>
>>> print(soup2.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
</html>
3. Child tags can be reached with dotted access (.tag_name), but dotted access only returns the first tag with that name.
>>> doc3="""
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">
... Once upon a time there were three little sisters; and their names were
... <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
... <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
... <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
... ;and they lived at the bottom of a well.
... </p>
... </body>
... """
>>> soup3=BeautifulSoup(doc3,"html.parser")
>>> soup3.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup3.b
<b>The Dormouse's story</b>
4. .contents returns the list of direct children (note: the '\n' strings from the source markup are elements too and show up in the list; their .name is None), while .children returns a generator over the same direct children. Both cover only direct children; .descendants recursively yields every descendant, level by level (as a generator).
>>> doc4="""
... <body>
... <p>p1</p>
... <p>p2</p>
... </body>
... """
>>> soup4=BeautifulSoup(doc4,"html.parser")
>>> soup4.contents
['\n', <body>
<p>p1</p>
<p>p2</p>
</body>, '\n']
>>> soup4.body.contents
['\n', <p>p1</p>, '\n', <p>p2</p>, '\n']
>>> doc5="""
... <body>
... <p>p1</p><p>p2</p>
... </body>
... """
>>> soup5=BeautifulSoup(doc5,"html.parser")
>>> soup5.body.contents
['\n', <p>p1</p>, <p>p2</p>, '\n']
>>> soup3.body.contents
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
;and they lived at the bottom of a well.
</p>, '\n']
>>> list(soup3.body.descendants)
['\n', <p class="title"><b>The Dormouse's story</b></p>, <b>The Dormouse's story</b>, "The Dormouse's story", '\n', <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
;and they lived at the bottom of a well.
</p>, '\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 'Elsie', ',\n ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'Lacie', '
and\n ', <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>, 'Tillie', '\n ;and they lived at the bottom of a well.\n ', '\n']
5. .string returns the tag's single child node, whether that child is a NavigableString or a Tag. If the children are not unique, it returns None. To extract all NavigableString objects, use .strings, or .stripped_strings (which drops whitespace-only strings).
>>> doc5="""
... <a>
... line1
...
... line2
...
... line3
...
... </a>
... """
>>> soup=BeautifulSoup(doc5,"html.parser")
>>> soup.a.string
'\n line1\n \n line2\n \n line3\n \n '
>>> for line in soup.strings:
... print(repr(line))
...
'\n'
'\n line1\n \n line2\n \n line3\n \n '
'\n'
>>>
>>> doc6="""
... <head> line1 </head>
... <a> line2 </a>
... <p> line3 </p>
... """
>>>
>>> soup=BeautifulSoup(doc6,"html.parser")
>>> for line in soup.strings:
... print(repr(line))
...
'\n'
' line1 '
'\n'
' line2 '
'\n'
' line3 '
'\n'
>>> for line in soup.stripped_strings:
... print(repr(line))
...
'line1'
'line2'
'line3'
6. .parent returns the direct parent node; .parents iterates recursively over all ancestors.
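A minimal sketch (markup invented for illustration): .parents walks all the way up to the BeautifulSoup object itself, whose .name is '[document]':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><b>text</b></p></body></html>', 'html.parser')
b = soup.b

print(b.parent.name)                # 'p' -- the direct parent
print([t.name for t in b.parents])  # ['p', 'body', 'html', '[document]']
```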
7. Use the .next_sibling and .previous_sibling attributes to navigate between siblings in the tree. The first node has no .previous_sibling and the last node has no .next_sibling; both return None.
Note that in real documents a tag's .next_sibling and .previous_sibling are usually strings or whitespace, not tags.
The .next_siblings and .previous_siblings attributes iterate over all of the current node's siblings in either direction.
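A small sketch of both behaviors (markup invented for illustration): siblings at the edges return None, and in source with newlines the immediate sibling is usually a string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>a</p><p>b</p>', 'html.parser')
first, last = soup.find_all('p')
print(first.previous_sibling)  # None -- nothing before the first node
print(last.next_sibling)       # None -- nothing after the last node

# With newlines in the source, the immediate sibling is a whitespace string:
soup2 = BeautifulSoup('<ul>\n<li>one</li>\n<li>two</li>\n</ul>', 'html.parser')
li = soup2.li
print(repr(li.next_sibling))         # '\n'
print(li.next_sibling.next_sibling)  # <li>two</li>
```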
8. The .next_element attribute points to the next object parsed (a string or tag); .previous_element points to the previous one.
The .next_elements and .previous_elements iterators walk forward or backward through the document in parse order, as if the document were being parsed again.
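The difference from .next_sibling in one sketch (markup invented): the parser reads the string inside <b> before the string that follows it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>one</b>two</p>', 'html.parser')
b = soup.b

# Parse order is: <p>, <b>, 'one', 'two'
print(repr(b.next_element))  # 'one' -- the string inside <b> is parsed next
print(repr(b.next_sibling))  # 'two' -- but the sibling is the string after </b>
```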
9. soup.head.title is shorthand for tag-name navigation. The shorthand works by repeatedly calling find() on the current tag:
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
Likewise, tag.find_all('h1') and tag('h1') are equivalent.
10. find_all() and find() search only the current tag's descendants (children, grandchildren, and so on).
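To restrict the search to direct children only, pass recursive=False; a sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><head><title>t</title></head></html>', 'html.parser')

# By default the search descends through grandchildren too:
print(soup.html.find_all('title'))                   # [<title>t</title>]

# recursive=False checks direct children only; <title> is a grandchild:
print(soup.html.find_all('title', recursive=False))  # []
```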
11. The full family of search methods shares a similar signature (only find_all() and find() take the recursive argument):
find_all( name , attrs , recursive , string , **kwargs )
find( name , attrs , recursive , string , **kwargs )
find_parents( name , attrs , string , **kwargs )
find_parent( name , attrs , string , **kwargs )
find_next_siblings( name , attrs , string , **kwargs )
find_next_sibling( name , attrs , string , **kwargs )
find_previous_siblings( name , attrs , string , **kwargs )
find_previous_sibling( name , attrs , string , **kwargs )
find_all_next( name , attrs , string , **kwargs )
find_next( name , attrs , string , **kwargs )
find_all_previous( name , attrs , string , **kwargs )
find_previous( name , attrs , string , **kwargs )
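A sketch of a few of the variants against an invented three-paragraph document:

```python
from bs4 import BeautifulSoup

doc = '<body><p id="a">one</p><p id="b">two</p><p id="c">three</p></body>'
soup = BeautifulSoup(doc, 'html.parser')
middle = soup.find('p', id='b')

print(middle.find_previous_sibling('p'))  # <p id="a">one</p>
print(middle.find_next_sibling('p'))      # <p id="c">three</p>
print(middle.find_parent('body').name)    # 'body'
```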
12. Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.
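A few common selector forms (markup invented for illustration):

```python
from bs4 import BeautifulSoup

doc = ('<body><p class="title">T</p>'
       '<a class="sister" id="link1" href="#">Elsie</a></body>')
soup = BeautifulSoup(doc, 'html.parser')

print(soup.select('p.title'))      # by tag + class
print(soup.select('a#link1'))      # by tag + id
print(soup.select('body > a'))     # direct-child combinator
print(soup.select_one('.sister'))  # first match only (or None)
```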
13. get_text()
To get only the text a tag contains, call get_text(); it collects all the text in the tag, including text inside descendant tags, and returns it as a single Unicode string.
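A sketch, including the optional separator and strip arguments (markup invented):

```python
from bs4 import BeautifulSoup

doc = '<p>Hello, <a href="http://example.com/">example<i>.com</i></a>.</p>'
soup = BeautifulSoup(doc, 'html.parser')

print(soup.p.get_text())                 # 'Hello, example.com.'

# Join the pieces with a separator and strip whitespace from each:
print(soup.p.get_text('|', strip=True))  # 'Hello,|example|.com|.'
```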