[Python] Using BeautifulSoup

Jmsp

Article link: https://blog.csdn.net/zz110731/article/details/56842813

title: [Python] Basic BeautifulSoup usage
type: categories
date: 2017-02-24 14:26:55
categories: Python

tags:

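BeautifulSoup is a Python library for parsing HTML, XML, and similar documents.

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.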

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>这个是b标签</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')   # name the parser explicitly
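
When no parser is named, Beautiful Soup picks the best one installed on the machine, so the resulting tree can differ between environments. The following is a minimal sketch that pins the standard-library parser; the `sample` markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not part of the article's html_doc.
sample = "<p class='title'><b>hello</b></p>"

# "html.parser" ships with Python, so no extra install is needed.
pinned_soup = BeautifulSoup(sample, "html.parser")

print(pinned_soup.p.b.string)  # hello
print(pinned_soup)             # <p class="title"><b>hello</b></p>
```

Other parsers such as lxml or html5lib can be named the same way if they are installed, but they may repair incomplete markup differently.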

1. Tag objects

tag = soup.p
print(type(tag))        # <class 'bs4.element.Tag'>

A tag's attributes can be added, removed, or modified, and they are accessed the same way as dictionary entries.

tag['class'] = "very"   # modify an attribute value
tag['id'] = 1           # add an id attribute
print(tag)              # <p class="very" id="1"><b>这个是b标签</b></p>

del tag['id']           # delete the id attribute
print(tag)              # <p class="very"><b>这个是b标签</b></p>
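
Because attributes behave like dictionary entries, they can also be read without risking a KeyError. The sketch below uses a made-up one-line document and shows `.attrs`, `.get()`, and `.has_attr()`; the markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
demo = BeautifulSoup('<p class="title" id="intro">text</p>', "html.parser")
p = demo.p

print(p.attrs)              # {'class': ['title'], 'id': 'intro'}
print(p.get("id"))          # intro
print(p.get("missing"))     # None (p["missing"] would raise KeyError)
print(p.has_attr("class"))  # True
```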

Multi-valued attributes such as class are returned as a list.

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])      # ['body', 'strikeout']

Attributes that are not multi-valued are returned as a single string.

n_css_soup = BeautifulSoup('<p id="body name"></p>', 'html.parser')
print(n_css_soup.p['id'])       # body name
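
The reverse direction works too: a list assigned to a multi-valued attribute is joined with spaces when the tag is printed. The sketch below also uses `get_attribute_list()`, which returns a list whether or not the attribute is multi-valued; it assumes a reasonably recent bs4 release, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
multi = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p_tag = multi.p

# Assign a list to a multi-valued attribute.
p_tag["class"] = ["body", "bold"]
print(p_tag)                           # <p class="body bold" id="my id"></p>

# id is not multi-valued, so the whole value stays one list item.
print(p_tag.get_attribute_list("id"))  # ['my id']
```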

2. NavigableString: the string inside a tag

print(type(tag.string))         # <class 'bs4.element.NavigableString'>
print(tag.string)               # 这个是b标签
tag.string.replace_with('you are beautiful')
print(tag.string)               # you are beautiful
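
A NavigableString is a subclass of Python's str that also keeps a reference to its place in the tree. The sketch below, with made-up markup and names, shows that relationship and how `str()` produces a plain string that no longer points back into the document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup used only for this example.
snippet = BeautifulSoup("<b>some text</b>", "html.parser")
text_node = snippet.b.string

print(isinstance(text_node, NavigableString))  # True
print(isinstance(text_node, str))              # True
print(text_node.parent.name)                   # b

# Convert to a plain str when the text is needed outside the tree.
plain = str(text_node)
print(type(plain))                             # <class 'str'>
```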

3. BeautifulSoup object

print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(soup.name)            # [document]

4. Comments and other special strings

A Comment object is a special type of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
mark_soup = BeautifulSoup(markup, 'html.parser')
print(type(mark_soup.b.string))     # <class 'bs4.element.Comment'>
print(mark_soup.b.string)           # Hey, buddy. Want to buy a used parser?
print(mark_soup.b.prettify())       # a comment can also be pretty-printed with prettify()
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
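
Since printing a Comment shows only its text, it can be mistaken for ordinary page content. One way to tell the two apart is an `isinstance` check, sketched below with invented markup and variable names.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one visible string followed by one comment.
mixed = BeautifulSoup("<p>visible<!-- hidden note --></p>", "html.parser")

# Split the direct children of <p> into comments and regular text.
comments = [node for node in mixed.p.contents if isinstance(node, Comment)]
texts = [node for node in mixed.p.contents if not isinstance(node, Comment)]

print(comments)  # [' hidden note ']
print(texts)     # ['visible']
```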

The .contents attribute returns a tag's child nodes as a list.

print(soup.head)                # <head><title>The Dormouse's story</title></head>
print(soup.head.contents)       # [<title>The Dormouse's story</title>]
print(soup.head.contents[0])    # <title>The Dormouse's story</title>
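
Note that whitespace between tags also counts as child nodes, so `.contents` is often longer than the number of child tags. The sketch below uses a made-up list and filters the children down to Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the <li> elements.
listing = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
nodes = listing.ul.contents

print(len(nodes))  # 5: three '\n' strings plus two <li> tags

# Keep only real tags and drop the whitespace strings.
items = [node for node in nodes if isinstance(node, Tag)]
print(items)       # [<li>one</li>, <li>two</li>]
```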

children

An iterator over a tag's direct child nodes.

print(type(soup.head.children))     # <class 'list_iterator'> (the exact iterator type may vary by bs4 version)
for child in soup.head.children:
    print(child)

descendants

A generator that recursively yields a tag's children and all of their descendants.

print(type(soup.head.descendants))  # <class 'generator'>
for child in soup.head.descendants:
    print(child)
    # <title>The Dormouse's story</title>   child node
    # The Dormouse's story                  grandchild node
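
The difference between `.children` and `.descendants` is easiest to see by counting. The sketch below uses a small made-up tree; the markup and names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <div> has two <p> children, one of which nests a <b>.
tree = BeautifulSoup("<div><p>a<b>b</b></p><p>c</p></div>", "html.parser")

direct = list(tree.div.children)         # only the two <p> tags
everything = list(tree.div.descendants)  # tags and strings at every depth

print(len(direct))      # 2
print(len(everything))  # 6: <p>, 'a', <b>, 'b', <p>, 'c'
```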

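strings

A generator over every string inside the tag, including whitespace-only strings.

for string in soup.strings:     # avoid shadowing the built-in str
    print(repr(string))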
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# 'you are beautiful'
# '\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# '...'
# '\n'

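stripped_strings

Like strings, but leading and trailing whitespace is stripped from each string and whitespace-only strings are skipped.

for string in soup.stripped_strings:
    print(repr(string))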
# "The Dormouse's story"
# 'you are beautiful'
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
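
A common use of `stripped_strings` is to collapse a fragment into one line of text. The sketch below joins the pieces with a space and compares the result with `get_text(" ", strip=True)`; the markup and names are invented for the example. Only the ends of each string are stripped, so a newline inside a string survives.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace around the text.
page = BeautifulSoup(
    "<div>\n <h1> Title </h1>\n <p>First\nline</p>\n</div>", "html.parser"
)

# Join the stripped pieces with a single space.
joined = " ".join(page.stripped_strings)
print(repr(joined))  # 'Title First\nline'

# get_text with a separator and strip=True builds the same text.
same = page.get_text(" ", strip=True)
print(joined == same)  # True
```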

http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id7
