It's been a long time since I updated this blog. I'm planning to write a series on Python web scraping and data analysis. That said, I shouldn't set the flag too casually, or it may come back to embarrass me.
With Python scraping, fetching the content is the process, analyzing the data is the result, and drawing conclusions is the real goal. A crawler usually fetches content from web pages, so how do we extract the information we want from an HTML page? We need to parse it. The tools in common use are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone (and have always been a weak point of mine), so I'll cover the other three, starting today with BeautifulSoup.
1. Introduction
BeautifulSoup literally means "beautiful soup". It is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it gives you idiomatic ways to navigate, search, and modify the parse tree.
2. Installation
pip install beautifulsoup4
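The examples later in this post pass features="lxml" as well as features="html.parser". The built-in html.parser needs nothing extra, but lxml is a separate package, so install it too if you want to run those examples:

pip install lxml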
3. Preparing a test document
Here is a passage based on Alice in Wonderland (referred to below simply as the "Alice" document):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
We'll use this document for all the examples that follow.
4. Usage
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
# print(soup.prettify())

print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # title
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # head
print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])         # ['title']
print(soup.a)                  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))      # a list of all three <a> tags: Elsie, Lacie and Tillie
print(soup.find(id='link3'))   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
In the listing above, each comment shows the output of the statement it accompanies.
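find_all() is not the only way to search. BeautifulSoup also understands CSS selectors through select(). A quick sketch against the same soup object built above (the selectors are my own illustration, not part of the original examples):

# select() takes a CSS selector and always returns a list of matches
print(soup.select("p.story a"))  # all three sister links
print(soup.select("#link2"))     # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
# find_all() can filter on attributes too; class_ avoids Python's reserved word
print(soup.find_all("a", class_="sister", limit=2))  # just the first two links

Note that select() returns a list even when an id can only match once, which is why #link2 comes back wrapped in brackets.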
5. BeautifulSoup accepts a string or a file handle
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
tag = soup.b
print(tag)           # <b class="boldest">Extremely bold</b>

tag.name = "blockquote"
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>

print(tag['class'])  # ['boldest']
print(tag.attrs)     # {'class': ['boldest']}

tag['id'] = "stylebs"
print(tag)           # <blockquote class="boldest" id="stylebs">Extremely bold</blockquote>

del tag['id']
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>
Attributes such as class can hold more than one value, and BeautifulSoup represents these multi-valued attributes as lists; id, which is single-valued in HTML, stays a plain string:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p['class'])  # ['body', 'strikeout']

id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p['id'])      # my id

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a['rel'])    # ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)           # <p>Back to the <a rel="index contents">homepage</a></p>