Installing and Using BeautifulSoup4

二月齐飞

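I. Installing BeautifulSoup4
Method 1: open a command prompt (cmd) and run: pip install beautifulsoup4
(with older tooling, easy_install beautifulsoup4 also works). Note that the package named BeautifulSoup is the legacy 3.x release, not BeautifulSoup4.
Method 2: download the source from http://www.crummy.com/software/BeautifulSoup/bs4/download/
then open cmd, change into the downloaded directory, and run: python setup.py install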
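
A quick way to confirm the installation is to import the package and look at its version string. This is only a minimal sketch, and it assumes the package was installed into the same interpreter that runs the check:

```python
import bs4

# bs4.__version__ is a string such as "4.x.y"; a leading 4 means
# BeautifulSoup4 (not the legacy 3.x package) is the one being imported.
print(bs4.__version__)
major = int(bs4.__version__.split(".")[0])
```

If the import fails with ModuleNotFoundError, the package most likely went into a different Python installation than the one in use.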

II. Using BeautifulSoup4
1. Import
from bs4 import BeautifulSoup
Note: if the installed version of BeautifulSoup is 3.x, the import is instead: from BeautifulSoup import BeautifulSoup
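2. Example
The HTML document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""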

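The code:
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(html_doc, "html.parser")

From here the various features can be used.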

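soup.X (X is any tag name; returns the whole tag, including its attributes and contents)

For example: soup.title

# <title>The Dormouse's story</title>

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.a (note: returns only the first match)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a') (find_all returns every match)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]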
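
The difference between attribute-style access and find_all can be sketched as follows. The two-link snippet is a hypothetical input made up for illustration, not part of the example document above:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
html = '<p><a href="/a" id="first">A</a> <a href="/b" id="second">B</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # only the first <a> in the document
same_as_first = soup.find("a")   # soup.a is shorthand for find("a")
all_links = soup.find_all("a")   # list-like result holding every <a>
missing = soup.table             # no <table> exists, so this is None

print(first["id"], len(all_links), missing)
```

Because a missing tag yields None rather than raising, it is worth checking the result before chaining further attribute access on it.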

find can also search by attribute
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
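
Other attributes can be searched in a similar way. The sketch below uses a cut-down variant of the three links (the third link's class is changed to "other" purely to make the filter visible, which is an assumption of this example):

```python
from bs4 import BeautifulSoup

# Modified copy of the three sister links; class="other" is made up here.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="other" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="link3")                    # keyword argument = attribute filter
by_class = soup.find_all("a", class_="sister")   # class_ because class is a Python keyword
by_attrs = soup.find_all("a", attrs={"href": "http://example.com/lacie"})

names = [a.get_text() for a in by_class]
print(by_id.get_text(), names, by_attrs[0]["id"])
```

The attrs dictionary form is handy when the attribute name is not a valid Python identifier, such as data-* attributes.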

To read a particular attribute from tags, combine find_all with get
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
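
One detail worth knowing is how get behaves when the attribute is absent. A minimal sketch, using a hypothetical two-anchor snippet in which the second anchor has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <a> deliberately lacks an href attribute.
html = '<a href="http://example.com/elsie">Elsie</a><a name="anchor">no href</a>'
soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that actually carry the attribute,
# so indexing with ["href"] is safe here.
only_links = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs)
print(only_links)
```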

To extract all of the text in the HTML document, use get_text()
print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
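
get_text also accepts a separator and a strip flag, which helps when the raw output contains stray whitespace. A small sketch on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = "<p>Once upon a time <b>three</b> sisters</p><p>...</p>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.get_text()                    # text pieces concatenated as-is
joined = soup.get_text("|", strip=True)    # each piece stripped, joined with "|"
pieces = list(soup.stripped_strings)       # the same pieces as a list

print(plain)
print(joined)
print(pieces)
```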

To parse an HTML file on disk, pass an open file object:
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
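
A self-contained version of the file case might look like the sketch below. It first writes a throwaway file so that it can run anywhere; the file contents and the temporary path are assumptions of this example, not part of the original article:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small throwaway HTML file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Demo page</title></head><body></body></html>")

# The with-block closes the file once parsing is done.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

title_text = soup.title.string
print(title_text)
```

Passing an explicit encoding to open avoids depending on the platform default, which differs between systems.
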
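Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
tag.attrs (returns all of the tag's attributes as a dictionary)
A tag's attributes can be added, deleted, and modified directly, exactly as with a dictionary

# Assumption: tag is a <blockquote> element; it is created here so the lines below can run
tag = BeautifulSoup('<blockquote>Extremely bold</blockquote>', 'html.parser').blockquote

tag['class'] = 'verybold'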

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None
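
The dictionary-like behaviour can be seen end to end in a short sketch. The blockquote markup and the attribute values here are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical tag, used only for this illustration.
markup = '<blockquote class="boldest" id="q1">Extremely bold</blockquote>'
tag = BeautifulSoup(markup, "html.parser").blockquote

# class is a multi-valued attribute, so its value is parsed into a list.
attrs_before = dict(tag.attrs)

tag["id"] = "q2"         # modify an existing attribute
tag["title"] = "quote"   # add a new attribute
del tag["class"]         # delete an attribute

attrs_after = dict(tag.attrs)
class_value = tag.get("class")   # None, because the attribute is gone
print(attrs_before, attrs_after, class_value)
```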

X.contents (X is a tag; returns the tag's children as a list)

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
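
Alongside .contents, the children can be iterated with .children, and the text of a tag with a single child is available through .string. A minimal sketch on the head fragment from the example:

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")
head_tag = soup.head

child_count = len(head_tag.contents)                        # contents is a plain list
child_names = [child.name for child in head_tag.children]   # children is an iterator

# .string works on <title> directly, and also on <head>, because
# <head> has exactly one child, which itself holds a single string.
title_text = head_tag.title.string
head_text = head_tag.string

print(child_count, child_names, title_text)
```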

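Fixing garbled characters when parsing a web page:
import urllib.request
from bs4 import BeautifulSoup

# Read the raw bytes; from_encoding only applies to byte input
page = urllib.request.urlopen('http://www.leeon.me').read()
soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')

print(soup.original_encoding)
print(soup.prettify())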
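
The same idea can be tried offline, without a network request. The sketch below encodes a made-up page as gb18030 bytes and then tells the parser which encoding to use; the page content is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical page, encoded by hand to simulate bytes fetched from a server.
raw = "<html><body><p>中文网页</p></body></html>".encode("gb18030")

# from_encoding tells BeautifulSoup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")
text = soup.p.get_text()
detected = soup.original_encoding

# Alternative: decode the bytes yourself and hand over a str.
soup2 = BeautifulSoup(raw.decode("gb18030"), "html.parser")
text2 = soup2.p.get_text()

print(text, detected, text2)
```

Decoding manually is often the simpler choice when the server's declared encoding is already known.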
