[Python] Using BeautifulSoup

title: [Python] BeautifulSoup Basics
type: categories
date: 2017-02-24 14:26:55
categories: Python

tags:

BeautifulSoup is a powerful Python library for parsing documents such as HTML and XML.

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
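As a quick preview, here is a minimal sketch that produces one object of each kind from a single snippet. The markup and the variable names (`sample`, `doc`, `text`, `note`) are made up for illustration and are not part of the example document used in the rest of this post.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup, chosen so that all four object kinds appear.
sample = "<p>Hello<!-- a note --></p>"
doc = BeautifulSoup(sample, "html.parser")

p = doc.p
text, note = p.contents  # first child is the text, second is the comment

kinds = [type(doc).__name__, type(p).__name__,
         type(text).__name__, type(note).__name__]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```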

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>这个是b标签</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')   # name the parser explicitly so results do not depend on what is installed
1. Tag objects
tag = soup.p
print(type(tag))        # <class 'bs4.element.Tag'>

A tag's attributes can be added, deleted, or modified, and they are accessed just like dictionary entries.

tag['class'] = "very"   # modify an attribute value
tag['id'] = 1           # add an id attribute
print(tag)              # <p class="very" id="1"><b>这个是b标签</b></p>

del tag['id']           # delete the id attribute
print(tag)              # <p class="very"><b>这个是b标签</b></p>
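Since attributes behave like dictionary entries, they can also be read without risking a KeyError. The following is a small sketch on a made-up snippet (the markup and the name `snippet` are assumptions for illustration); it uses `Tag.attrs` and `Tag.get()`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, separate from html_doc above.
snippet = BeautifulSoup('<p class="title" id="intro">text</p>', "html.parser")
p = snippet.p

print(p.attrs)            # every attribute of the tag, as a dict
print(p.get("id"))        # intro
print(p.get("lang"))      # None: .get() does not raise for a missing attribute
print("lang" in p.attrs)  # False
```

Using `p["lang"]` here would raise a KeyError, which is why `.get()` is handy when an attribute is optional.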

Multi-valued attributes (such as class) are returned as a list

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])      # ['body', 'strikeout']

Attributes that are not multi-valued are returned as a single string

n_css_soup = BeautifulSoup('<p id="body name"></p>', 'html.parser')
print(n_css_soup.p['id'])       # body name
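Going the other way, a list assigned to a multi-valued attribute is joined with spaces when the tag is written back out. The sketch below shows that, plus `get_attribute_list()`, which always returns a list regardless of the attribute. The markup and the name `css` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one multi-valued and one single-valued attribute.
css = BeautifulSoup('<p class="body strikeout" id="main"></p>', "html.parser")
p = css.p

p["class"] = ["body", "bold"]      # assign a list to a multi-valued attribute
print(p)                           # <p class="body bold" id="main"></p>

# get_attribute_list() wraps single values in a list for uniform handling.
print(p.get_attribute_list("id"))  # ['main']
```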
2. NavigableString: the string inside a tag
print(type(tag.string))         # <class 'bs4.element.NavigableString'>
print(tag.string)               # 这个是b标签
tag.string.replace_with('you are beautiful')
print(tag.string)               # you are beautiful
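A NavigableString is a str subclass that also keeps a reference to its place in the tree. If only the text is needed, converting it with `str()` gives a plain string. A minimal sketch, with made-up markup and names (`soup2`, `nav`, `plain`):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, separate from html_doc above.
soup2 = BeautifulSoup("<b>bold text</b>", "html.parser")

nav = soup2.b.string
plain = str(nav)             # plain copy with no link back to the tree

print(isinstance(nav, str))  # True: NavigableString subclasses str
print(type(plain))           # <class 'str'>
print(nav.parent.name)       # b: the NavigableString still knows its parent tag
```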
3. BeautifulSoup object
print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(soup.name)            # [document]
4. Comment and other special strings

A Comment object is a special type of NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
mark_soup = BeautifulSoup(markup, 'html.parser')
print(type(mark_soup.b.string))     # <class 'bs4.element.Comment'>
print(mark_soup.b.string)           # Hey, buddy. Want to buy a used parser?
print(mark_soup.b.prettify())       # a comment is also kept in prettify() output
#<b>
# <!--Hey, buddy. Want to buy a used parser?-->
#</b>
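Because Comment is its own class, comments can be picked out with an `isinstance` check. The sketch below collects every comment in a made-up snippet and then removes them with `extract()`; the markup and the names `page` and `comments` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup containing one comment and one visible element.
page = BeautifulSoup(
    "<div><!-- hidden note --><span>visible</span></div>", "html.parser"
)

# Match only the strings that are Comment instances.
comments = page.find_all(string=lambda s: isinstance(s, Comment))
print(comments)   # [' hidden note ']

for c in comments:
    c.extract()   # detach the comment from the tree

print(page)       # <div><span>visible</span></div>
```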
contents

A tag's direct children, returned as a list

print(soup.head)                # <head><title>The Dormouse's story</title></head>
print(soup.head.contents)       # [<title>The Dormouse's story</title>]
print(soup.head.contents[0])    # <title>The Dormouse's story</title>
children

An iterator over a tag's direct children

print(type(soup.head.children))     # <class 'list_iterator'> (the exact iterator type may vary by bs4 version)
for child in soup.head.children:
    print(child)
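Note that direct children include the whitespace strings between tags, not only the tags themselves. A small sketch on a made-up list (the markup and the names `listing` and `items` are assumptions) shows how to keep just the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list markup with newlines between the items.
listing = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
ul = listing.ul

print(len(ul.contents))             # 5: three '\n' strings plus two <li> tags

# Keep only the children that are tags.
items = [child for child in ul.children if isinstance(child, Tag)]
print([li.string for li in items])  # ['one', 'two']
```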
descendants

A generator over all descendants: it recursively walks the children, the children's children, and so on

print(type(soup.head.descendants))  # <class 'generator'>
for child in soup.head.descendants:
    print(child)
    # <title>The Dormouse's story</title>   child node
    # The Dormouse's story                  grandchild node
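The descendants are visited in document order, with tags and strings interleaved. The following sketch, on made-up markup (the name `tree` is an assumption), lists the tag names encountered and counts all descendant nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested markup, separate from html_doc above.
tree = BeautifulSoup("<div><p>a<b>b</b></p><p>c</p></div>", "html.parser")

# Tags only, in the order the walk reaches them.
names = [node.name for node in tree.div.descendants if isinstance(node, Tag)]
print(names)   # ['p', 'b', 'p']

# Tags and strings together: p, 'a', b, 'b', p, 'c'
total = len(list(tree.div.descendants))
print(total)   # 6
```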
strings

All the strings contained in the tags, whitespace included

for string in soup.strings:     # avoid shadowing the built-in str
    print(repr(string))
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# 'you are beautiful'
# '\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# '...'
# '\n'
stripped_strings

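The same strings with leading and trailing whitespace stripped; strings that are only whitespace are skipped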

for string in soup.stripped_strings:    # avoid shadowing the built-in str
    print(repr(string))
# "The Dormouse's story"
# 'you are beautiful'
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
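A common use of stripped_strings is to collapse a fragment into one clean line of text. The sketch below does this by hand and with `get_text()`, which accepts a separator and a `strip` flag; the markup and the names `frag` and `joined` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with messy whitespace.
frag = BeautifulSoup("<p>\n  Hello, <b> world </b>\n</p>", "html.parser")

# Join the stripped pieces manually...
joined = " ".join(frag.stripped_strings)
print(joined)                          # Hello, world

# ...or let get_text() strip and join in one call.
print(frag.get_text(" ", strip=True))  # Hello, world
```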

http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id7
