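The Beautiful Soup library
Note: the B and the S are capitalized.
1. Purpose
- Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
Tag tree: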
<html>
<body>
<p class="title">...</p>
</body>
</html>
2. The BeautifulSoup class
- HTML page <-> tag tree <-> BeautifulSoup class
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser") # "html.parser" selects Python's built-in HTML parser
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
- A BeautifulSoup object corresponds to the entire content of one HTML/XML document
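A minimal sketch of both ways to build the object, from a string and from a file. The temporary file and its name are assumptions made only so the example runs anywhere; the with-block closes the file handle once parsing is done.

```python
import os
import tempfile
from bs4 import BeautifulSoup

html = '<html><body><p class="title">...</p></body></html>'

# Parse directly from a string.
soup = BeautifulSoup(html, "html.parser")

# Parse from a file object; the file is created here only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup2 = BeautifulSoup(f, "html.parser")

# The object stands for the whole document, so it reports a special root name.
print(soup.name)
# Both construction routes yield the same tree.
print(str(soup) == str(soup2))
```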
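3. Basic elements
- A NavigableString can span multiple tag levels: .string on an outer tag returns the text of a nested tag when each level has exactly one child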
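As a small illustration of the basic elements, the sketch below inspects a tag's name, attributes, and string. The HTML snippet is made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p class="title"><b>The demo</b></p>', "html.parser")

tag = soup.p              # Tag: the first <p> element in the document
print(type(tag))          # a bs4 Tag object
print(tag.name)           # the tag's name
print(tag.attrs)          # the tag's attributes as a dict
print(tag.string)         # the text, reached through the nested <b>
print(type(tag.string))   # a bs4 NavigableString object
```

Here <p> has no text of its own, yet tag.string still returns the text inside <b>, which is what "spanning tag levels" means in practice.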
4. Understanding the library
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string
'This is a comment'
type(newsoup.b.string)
<class 'bs4.element.Comment'>
newsoup.p.string
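- <!-- --> marks an HTML comment. The parser drops the markers, so .string returns only the text inside. The text in the b tag (a comment) and the text in the p tag (ordinary text) therefore look alike; check the type of the returned object to tell them apart.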
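One way to apply the type check is sketched below. The helper visible_text is hypothetical, not part of bs4. Comment is a subclass of NavigableString, so the test has to be against Comment specifically.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def visible_text(tag):
    """Return tag.string as plain text, or None if it is missing or a comment.

    Hypothetical helper written for this example.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

print(visible_text(newsoup.b))  # the comment is filtered out
print(visible_text(newsoup.p))  # ordinary text is kept
```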
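5. Traversing HTML content with bs4
- Downward traversal of the tag tree
- Upward traversal of the tag tree
- Sibling (parallel) traversal of the tag tree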
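The three directions can be sketched with the standard bs4 navigation attributes. The HTML string is invented for this example and deliberately contains no whitespace between tags, so that siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ("<html><head><title>Demo</title></head>"
        "<body><p>one</p><p>two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Downward: .children yields direct children, .descendants walks the whole subtree.
children = [child.name for child in soup.body.children]
descendants = [d.name for d in soup.html.descendants if isinstance(d, Tag)]

# Upward: .parent is the enclosing tag, .parents walks up to the document root.
parent = soup.title.parent.name
ancestors = [p.name for p in soup.title.parents]

# Sideways: siblings are nodes that share the same parent.
next_text = soup.body.p.next_sibling.string
prev_name = soup.body.previous_sibling.name

print(children, descendants, parent, ancestors, next_text, prev_name)
```

On real pages, line breaks between tags are nodes too, so .next_sibling often returns a NavigableString containing "\n" rather than the next tag.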
6. Formatting HTML output with bs4
- Prettifying the HTML output
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
>>>demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
>>>soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
print(soup.prettify()) # prettify: puts each tag and string on its own line, indented by nesting depth
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>