Beautiful Soup Library -- 05 -- Output

S_numb

Category: Python   Tags: python

Article link: https://blog.csdn.net/S_numb/article/details/120218236

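Summary (auto-generated by CSDN): This article covers three key points about output in the BeautifulSoup library: pretty-printing with prettify(), compact output with str() (unicode() in Python 2), and text extraction with get_text(). Short examples show how to render the HTML structure readably and how to pull out the text you need.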

1. Output

1.1 Pretty-printing

The prettify() method formats the Beautiful Soup parse tree and returns it as a Unicode string;
each XML/HTML tag and each string gets its own line;

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Name the parser explicitly; lxml or html5lib would also add <html>/<body> wrappers.
soup = BeautifulSoup(markup, "html.parser")
print(soup.prettify())

Output:

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>

Both the BeautifulSoup object and any of its Tag nodes can call prettify().
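
To illustrate calling prettify() on a single tag rather than the whole document, here is a minimal sketch. It assumes the built-in "html.parser" backend and reuses the markup from the example above; the variable name pretty_i is only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# prettify() on a Tag formats only that tag's subtree,
# with indentation starting from zero at the tag itself.
pretty_i = soup.i.prettify()
print(pretty_i)
```

Only the <i> element and its text appear in the result; the enclosing <a> tag is not included.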

1.2 Compact output

str() (unicode() in Python 2):
- returns just the markup string, with no extra formatting;

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
print(str(soup))
# Python 3 has no unicode(); str() already returns a Unicode string.
print(str(soup.a))

Output:

<a href="http://example.com/">I linked to <i>example.com</i></a>
<a href="http://example.com/">I linked to <i>example.com</i></a>

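In Python 3, str() returns a Unicode string (in Python 2 it returned a UTF-8 encoded byte string).
Call encode() to get bytes, with UTF-8 as the default and other encodings selectable by argument, or call decode() to get a Unicode string.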
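
The relationship between str(), encode(), and decode() can be sketched as follows. This is a minimal example assuming Python 3 and the "html.parser" backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_text = str(soup)              # Unicode string
as_bytes = soup.encode("utf-8")  # bytes in the requested encoding
tag_text = soup.a.decode()       # Unicode string for a single tag

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode("utf-8") == as_text)
```

Decoding the bytes with the same encoding gives back the string that str() produced.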

1.3 Output formatters

When parsing, Beautiful Soup converts HTML entities such as "&ldquo;" into the corresponding Unicode characters, and the output contains those characters;

from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")
print(str(soup))

Output:

“Dammit!” he said.

If you convert the document to bytes, the Unicode characters are encoded as UTF-8;
the original HTML entities do not come back;

from bs4 import BeautifulSoup
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")
print(soup.encode("utf-8"))

Output:

b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
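
If entities are wanted in the output, Beautiful Soup's output methods accept a formatter argument. The sketch below assumes the "html.parser" backend and the documented formatter="html" option, which substitutes named HTML entities for Unicode characters where it can. The exact entity names chosen may vary between library versions, so no literal output is shown.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# The default formatter ("minimal") leaves the curly quotes as Unicode characters.
default_output = soup.decode()
# formatter="html" replaces characters that have a named entity.
entity_output = soup.decode(formatter="html")

print(default_output)
print(entity_output)
```

The same formatter argument is also accepted by prettify() and encode().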

1.4 get_text()

The get_text() method returns the text contained in a tag;
it collects all the text inside the tag, including text in descendant tags, and returns the result as a single Unicode string;

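from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")
# repr() makes the leading and trailing newlines visible.
print(repr(soup.get_text()))
print(soup.i.get_text())

Output:

'\nI linked to example.com\n'
example.com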

A string argument sets the separator used to join the pieces of text:

print(repr(soup.get_text("|")))

Output:

'\nI linked to |example.com|\n'

Pass strip=True to strip leading and trailing whitespace from each piece of text:

print(soup.get_text("|", strip=True))

Output:

I linked to|example.com
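
When the individual pieces of text are needed rather than one joined string, the .stripped_strings generator is an alternative. A minimal sketch, assuming the "html.parser" backend and the same markup as above; the variable name pieces is illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
pieces = list(soup.stripped_strings)
print(pieces)
print(" ".join(pieces))
```

Working with a list this way leaves the choice of how to join or filter the text to the caller.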
