BeautifulSoup usage notes

Latest recommended article published on 2022-09-08 14:39:53

linuxvfast

Category: Daily Notes. Tags: beautifulsoup

Article link: https://blog.csdn.net/linuxvfast/article/details/117626289

from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
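# from_encoding only applies to byte strings; bs4 ignores it (with a warning) if the markup is already Unicode
soup = BeautifulSoup(html_str, 'lxml', from_encoding='utf-8')
# Alternative: write html_str to index.html first, then parse the file:
# soup = BeautifulSoup(open('index.html'), 'lxml')
print(soup.prettify())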

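If no parser is named, BeautifulSoup picks the best one installed on the system (and warns that it guessed); the code above names lxml explicitly.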
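As a small sketch of naming the parser explicitly: the snippet below swaps in "html.parser", which ships with Python, for lxml so that it runs without extra packages. The variable names are illustrative only and not part of the original notes.

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser makes the result independent of what happens to be installed.
# "html.parser" is part of the standard library; "lxml" requires the lxml package.
soup_builtin = BeautifulSoup(html, "html.parser")

print(soup_builtin.p.b.string)   # The Dormouse's story
print(soup_builtin.p["class"])   # ['title']
```

Different parsers can build slightly different trees from the same fragment, so it is usually safer to pick one and keep it.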

BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
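The four kinds can be seen side by side in a short sketch. It assumes Python 3 and the built-in "html.parser"; the fragment and the names demo and kinds are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

doc = '<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a></p>'
demo = BeautifulSoup(doc, "html.parser")

kinds = {
    "document": type(demo).__name__,            # the whole parsed document
    "element": type(demo.p).__name__,           # an HTML element
    "text": type(demo.p.contents[0]).__name__,  # text directly inside <p>
    "comment": type(demo.a.string).__name__,    # the <!-- ... --> inside <a>
}
print(kinds)

# Comment is a special kind of NavigableString, and BeautifulSoup itself subclasses Tag.
print(issubclass(Comment, NavigableString), isinstance(demo, Tag))
```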

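1) Tag objects:

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>

Each of these elements, title and a, is represented by a Tag object.

How do we get at them?

Add the following to the code above:
print(soup.title)
print(soup.a)
print(soup.p)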

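Result (attribute access returns only the first matching tag):

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title"><b>The Dormouse's story</b></p>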
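Since soup.a only yields the first matching tag, here is a hedged sketch of how all matches could be collected with find_all. It assumes Python 3 and "html.parser"; the simplified fragment and the name links_soup are illustrative.

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>
  <a href="http://example.com/lacie" id="link2">Lacie</a>
  <a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
links_soup = BeautifulSoup(doc, "html.parser")

first = links_soup.a                   # attribute access: the first <a> only
all_links = links_soup.find_all("a")   # every <a>, in document order

print(first["id"])                     # link1
print([a["id"] for a in all_links])    # ['link1', 'link2', 'link3']
print(links_soup.table)                # None, because the document has no <table>
```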

Get the name of the soup object and of a tag:

print(soup.name)        # [document]
print(soup.title.name)  # title

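Rename a tag:

print(soup.title)            # <title>The Dormouse's story</title>
soup.title.name = 'mytitle'
print(soup.name)             # [document]
print(soup.mytitle.name)     # mytitle
print(soup.mytitle)          # <mytitle>The Dormouse's story</mytitle>
print(soup.title)            # None: no tag is called title any more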
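The renaming steps can be checked in isolation with a minimal sketch, assuming Python 3 and "html.parser". The names rename_soup and tag are illustrative.

```python
from bs4 import BeautifulSoup

rename_soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

tag = rename_soup.title   # keep a reference to the Tag before renaming it
tag.name = "mytitle"      # the rename changes the generated markup as well

print(rename_soup.title)    # None: lookups by the old name no longer match
print(rename_soup.mytitle)  # <mytitle>The Dormouse's story</mytitle>
print(str(tag))             # same element, reached through the saved reference
```

Holding on to the Tag object before renaming avoids having to look it up again under the new name.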

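Get a tag's attributes:

print(soup.p['class'])      # ['title']: class is multi-valued, so it comes back as a list
print(soup.p.get('class'))  # same value; get() returns None when the attribute is missing
print(soup.p.attrs)         # all attributes as a dict: {'class': ['title']}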

Modify a tag attribute:

soup.p['class'] = 'myclass'
print(soup.p)

Result:

<p class="myclass"><b>The Dormouse's story</b></p>
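A compact sketch of reading, changing and deleting attributes, assuming Python 3 and "html.parser". The fragment and the name attr_soup are made up for the example.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<p class="title bold" id="intro">text</p>', "html.parser")
p = attr_soup.p

print(p["class"])      # ['title', 'bold']: multi-valued attribute, returned as a list
print(p["id"])         # intro: single-valued attribute, returned as a string
print(p.get("lang"))   # None: get() does not raise for a missing attribute

p["class"] = "myclass"  # overwrite an attribute
del p["id"]             # remove an attribute
print(p.attrs)          # {'class': 'myclass'}
print(p)                # <p class="myclass">text</p>
```

Indexing with p["lang"] would raise KeyError here, which is why get() is the safer choice for optional attributes.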

2) NavigableString objects:

Get the text inside a tag:

print(soup.p.string)

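BeautifulSoup wraps the text inside a tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and unicode() converts it directly into a plain Unicode string (on Python 3, use str() instead).

unicode_string = unicode(soup.p.string)  # Python 2; on Python 3: str(soup.p.string)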
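The following sketch shows the same idea on Python 3, where str() takes the place of unicode(). It also illustrates that .string is None when a tag has more than one child. The parser choice ("html.parser") and the name text_soup are assumptions made for the example.

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup(
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>one <a>two</a></p>",
    "html.parser",
)

title_text = text_soup.p.string   # NavigableString reached through the single <b> child
plain = str(title_text)           # plain string that no longer points into the tree

story = text_soup.find("p", class_="story")
print(isinstance(title_text, str))  # True: NavigableString is a string subclass
print(plain)                        # The Dormouse's story
print(story.string)                 # None: <p> has two children, so .string is ambiguous
print(story.get_text())             # one two
```

Converting to a plain string is useful when the text should outlive the parsed tree, because a NavigableString keeps a reference to it.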

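3) BeautifulSoup objects:

The BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no name or attributes of its own.

To keep its interface consistent with Tag, it still exposes name and attrs:

print(type(soup.name))
print(soup.name)
print(soup.attrs)

Result (under Python 2; on Python 3 the first line is <class 'str'>):

<type 'unicode'>
[document]
{}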

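4) Comment objects: comments in the document

print(soup.a.string)        # prints: Elsie

print(type(soup.a.string))  # prints: <class 'bs4.element.Comment'>

The printed text carries no <!-- --> markers and looks like ordinary tag text, so check the type when comments need to be told apart:

from bs4.element import Comment
if isinstance(soup.a.string, Comment):
    print(soup.a.string)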
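Building on the type check, here is a hedged sketch of collecting every comment in a document at once, and of skipping comments when only visible link text is wanted. It assumes Python 3, "html.parser" and a bs4 version that accepts the string argument of find_all; the name comment_soup is illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

doc = """
<p class="story">
  <a id="link1"><!-- Elsie --></a>
  <a id="link2"><!-- Lacie --></a>
  <a id="link3">Tillie</a>
</p>
"""
comment_soup = BeautifulSoup(doc, "html.parser")

# find_all(string=...) accepts a function that is called with each string node.
comments = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
print([c.strip() for c in comments])   # ['Elsie', 'Lacie']

# Keep only links whose content is real text rather than a comment.
visible = [
    a.string
    for a in comment_soup.find_all("a")
    if not isinstance(a.string, Comment)
]
print(visible)                         # ['Tillie']
```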
