Parsing Web Pages with Beautiful Soup 4

aican_yu

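Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/aican_yu/article/details/8805336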

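Installing Beautiful Soup 4 and related issues

At the time of writing, the latest version of Beautiful Soup was 4.1.1. It can be downloaded from (http://www.crummy.com/software/BeautifulSoup/bs4/download/)

Documentation:

(http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Usage: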

from bs4 import BeautifulSoup

Example:

The HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Code:

from bs4 import BeautifulSoup

# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

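You can now start using the various features.

soup.X (where X is any tag name) returns that tag in full, including its attributes and contents.

For example: soup.title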

# <title>The Dormouse's story</title>

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.a  (note: this returns only the first match)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')  (find_all returns every match)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
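The difference between attribute-style access and find_all can be checked with a small sketch. The HTML here is a shortened, hypothetical stand-in for html_doc, and html.parser is chosen only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical stand-in for the html_doc used in this article.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access is shorthand for find(): both give the first match.
first_link = soup.a
same_link = soup.find("a")

# find_all() returns a list-like result with every match, in document order.
all_links = soup.find_all("a")

# A tag name that does not occur in the document yields None, not an error.
missing = soup.table

print(first_link["id"], len(all_links), missing)
```

Because a missing tag comes back as None, it is usually worth checking the result before chaining further attribute access onto it.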

find can also search by attribute:

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
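Searching by attribute is not limited to id. The sketch below shows a few common variants on a small made-up document: the class_ keyword (spelled with a trailing underscore because class is a Python keyword), a compiled regular expression as the attribute value, and the attrs dictionary form. The third link and its "external" class are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the third link is invented for this sketch.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.org/other" class="external" id="link3">Other</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on the id attribute; find() returns the first hit or None.
by_id = soup.find(id="link2")

# Match on the CSS class via the class_ keyword.
sisters = soup.find_all("a", class_="sister")

# Match attribute values against a regular expression.
dot_com = soup.find_all("a", href=re.compile(r"example\.com"))

# The attrs dictionary works for any attribute name.
by_dict = soup.find_all("a", attrs={"class": "external"})

print(by_id.get_text(), len(sisters), len(dot_com), by_dict[0]["id"])
```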

To read a particular attribute from each tag, combine find_all with get:

for link in soup.find_all('a'):
    print(link.get('href'))

# http://example.com/elsie

# http://example.com/lacie

# http://example.com/tillie
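One reason to prefer get over subscripting in such a loop is that not every tag carries the attribute. A minimal sketch, using invented markup in which the second anchor has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> deliberately has no href attribute.
html = """
<a href="http://example.com/elsie" class="sister big">Elsie</a>
<a name="anchor-only">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute instead of raising KeyError.
hrefs = [link.get("href") for link in soup.find_all("a")]

# Passing href=True keeps only the tags that actually have the attribute,
# so subscripting is safe afterwards.
with_href = [link["href"] for link in soup.find_all("a", href=True)]

# class is treated as multi-valued, so it comes back as a list of names.
classes = soup.a["class"]

print(hrefs, with_href, classes)
```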

To extract all of the text in the HTML document, use get_text():

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...
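The raw get_text() output keeps the whitespace of the source document, which explains the blank lines above. As a sketch of how to tidy it, get_text also accepts a separator and a strip flag, and stripped_strings yields the same pieces one at a time. The markup below is a made-up fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace around the heading text.
html = "<div><h1> Title </h1><p>First <b>bold</b> line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated exactly as they appear.
raw_text = soup.get_text()

# separator joins the text nodes; strip=True trims each one and drops
# nodes that are empty after trimming.
joined = soup.get_text("|", strip=True)

# stripped_strings is a generator over the same trimmed pieces.
pieces = list(soup.stripped_strings)

print(repr(raw_text))
print(joined)
print(pieces)
```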

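To parse an HTML file on disk instead, use:

# The with-block closes the file once parsing is done.
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")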
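A self-contained way to try parsing from a file is sketched below. It writes a throwaway file first so that it runs anywhere; the temporary directory, the file contents, and the explicit utf-8 encoding are assumptions made for the example, not part of the original article.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a throwaway file so the example is self-contained; in practice
# the path would point at an existing file such as index.html.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>Saved page</title></head></html>")

    # Pass the open file object straight to the constructor; the
    # with-block closes it once parsing is done.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file has been closed and removed.
title_text = soup.title.string
print(title_text)
```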

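Objects in BeautifulSoup

tag (corresponds to a tag in the HTML)

tag.attrs (returns all of the tag's attributes as a dictionary)

A tag's attributes can be added, removed, and modified directly, just like entries in a dictionary: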

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None
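The session above starts from an existing tag. A runnable sketch of the same steps follows, assuming the tag comes from a one-line blockquote document (the excerpt does not show how tag was created, so this setup is a guess). The id is assigned as a string here, which serializes the same way.

```python
from bs4 import BeautifulSoup

# Assumed setup: the excerpt does not show where `tag` comes from.
soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

# Add attributes exactly as if the tag were a dictionary.
tag["class"] = "verybold"
tag["id"] = "1"
after_add = str(tag)

# Remove them again with del.
del tag["class"]
del tag["id"]
after_del = str(tag)

# Subscripting a missing attribute raises KeyError; get() returns None.
try:
    tag["class"]
    raised = False
except KeyError:
    raised = True
missing = tag.get("class")

print(after_add)
print(after_del)
print(raised, missing, tag.attrs)
```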

X.contents (where X is a tag) returns the tag's children as a list.

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
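Besides contents, a tag exposes its children in a few other ways. The sketch below, on a minimal head-only document, compares contents with the children and descendants iterators and with the string shortcut for a tag that holds a single text node.

```python
from bs4 import BeautifulSoup

# Minimal document: just the <head> part of the article's example.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

# contents is a list of direct children.
title_tag = head_tag.contents[0]

# children iterates over the same direct children without building a list.
child_names = [child.name for child in head_tag.children]

# descendants walks the whole subtree: the <title> tag, then its text node.
descendants = list(head_tag.descendants)

# string is a shortcut for a tag whose only child is a text node.
text = title_tag.string

print(child_names, len(descendants), text)
```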

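Fixing garbled characters when parsing a page:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and tell Beautiful Soup which encoding to try first.
# bs4 spells these from_encoding / original_encoding; BeautifulSoup 3
# used fromEncoding / originalEncoding.
page = urllib.request.urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')

print(soup.original_encoding)
print(soup.prettify())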

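If a Chinese page is encoded as gb2312 or gbk, passing gb18030 as the encoding override to the BeautifulSoup constructor (fromEncoding="gb18030" in BeautifulSoup 3, from_encoding="gb18030" in bs4) resolves the garbled text, because gb18030 is a superset of both encodings. According to the author, parsing a utf8 page with the gb18030 override does not produce garbled text either.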
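The effect of the encoding override can be checked without a network connection. The sketch below builds gbk-encoded bytes by hand (the sample markup is invented) and parses them with the bs4 spelling of the parameter; the override only applies when the input is bytes, not an already-decoded string.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as gbk to mimic a legacy Chinese site.
raw = (
    "<html><head><title>中文标题</title></head>"
    "<body><p>你好，世界</p></body></html>"
).encode("gbk")

# from_encoding asks Beautiful Soup to try gb18030 first when decoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

# original_encoding reports which encoding was actually used.
detected = soup.original_encoding
title_text = soup.title.string
body_text = soup.p.get_text()

print(detected, title_text, body_text)
```

Printing original_encoding after parsing is a cheap way to confirm that the override was honored for a given page.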
