Python Series (1): Using BeautifulSoup

It's been a while since I updated this blog. I plan to write a series on Python web scraping and data analysis. I shouldn't plant flags carelessly, though, lest I end up eating my words.

Scraping content with Python is the process, analyzing the data is the result, and drawing conclusions is the goal. A Python crawler usually fetches content from web pages, so how do we extract the information we want from an HTML page? We parse it. The common tools are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone (and have always been a weak point of mine), so I'll cover the other three, starting today with BeautifulSoup.

1. Introduction

BeautifulSoup translates literally as "beautiful soup". It is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it gives you idiomatic ways to navigate, search, and modify the parse tree.

2. Installation

pip install beautifulsoup4

Some of the examples below pass features="lxml", which requires the lxml parser as a separate package:

pip install lxml
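A quick way to confirm the installation worked is a minimal round trip; this sketch uses the stdlib html.parser so nothing beyond beautifulsoup4 is needed (the sample markup is my own, not from the post):

```python
from bs4 import BeautifulSoup

# Parse a one-tag document with Python's built-in parser and read the text back
soup = BeautifulSoup("<p>hello, soup</p>", "html.parser")
print(soup.p.string)  # hello, soup
```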

3. Preparing a test document

This is a snippet based on Alice's Adventures in Wonderland (referred to below as the Alice document):

```html
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
```

We'll use this document as the running example below.

4. Usage

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
# print(soup.prettify())

print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # title
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # head
print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])         # ['title']
print(soup.a)                  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))      # [<a ...>Elsie</a>, <a ...>Lacie</a>, <a ...>Tillie</a>]
print(soup.find(id='link3'))   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
```

Each comment above shows the output of the line it follows.
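Building on the calls above, find_all can filter on attributes as well as the tag name. A small sketch on the Alice document (the class_="sister" filter and the pairing into (text, href) tuples are my own additions, not from the original post):

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keyword filters narrow the search: only <a> tags whose class is "sister"
links = soup.find_all("a", class_="sister")
pairs = [(a.get_text(), a["href"]) for a in links]
print(pairs)
# [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie'), ('Tillie', 'http://example.com/tillie')]
```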

5. BeautifulSoup accepts a string or a file handle

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")

tag = soup.b
print(tag)           # <b class="boldest">Extremely bold</b>

# A tag's name can be changed in place
tag.name = "blockquote"
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>

print(tag['class'])  # ['boldest']
print(tag.attrs)     # {'class': ['boldest']}

# Attributes can be added and deleted like dictionary entries
tag['id'] = "stylebs"
print(tag)           # <blockquote class="boldest" id="stylebs">Extremely bold</blockquote>

del tag['id']
print(tag)           # <blockquote class="boldest">Extremely bold</blockquote>

# class is a multi-valued attribute, so its value comes back as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p['class'])  # ['body', 'strikeout']

# id is not multi-valued, so its value stays a single string
id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p['id'])      # my id

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a['rel'])    # ['index']

# Assigning a list joins the values with spaces in the rendered markup
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)           # <p>Back to the <a rel="index contents">homepage</a></p>
```
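The heading of this section also mentions file handles: besides a markup string, the BeautifulSoup constructor accepts an open file object. A minimal sketch of that path (the filename alice.html is my own choice, and the stdlib html.parser is used so no extra parser is required):

```python
from bs4 import BeautifulSoup

# Write a small page to disk so there is something to open
with open("alice.html", "w", encoding="utf-8") as f:
    f.write("<html><head><title>The Dormouse's story</title></head></html>")

# Pass the open file object straight to the constructor instead of a string
with open("alice.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, features="html.parser")

print(soup.title.string)  # The Dormouse's story
```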
