Installing and Using BeautifulSoup

Latest recommended article published on 2024-08-19 09:22:32

lakeheart879

Category: Python

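BeautifulSoup is a great little tool.

Official site: http://www.crummy.com/software/BeautifulSoup/

Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/ . The attachment contains the 4.1.2 source package.

Documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html . This is a Chinese translation, but it is dated: it covers version 3.0, so it is of little use here.

I recommend reading http://www.crummy.com/software/BeautifulSoup/bs4/doc/ instead. This is the official English documentation for Beautiful Soup 4, and it is far more readable and much clearer.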

If you want to read web pages in Python with jQuery-style queries, without being tripped up by malformed markup and nonstandard tags, BeautifulSoup is a good choice. According to the official site, BeautifulSoup is a screen-scraping library, designed to save you the work of handling HTML tags and pulling information off the web. A few strengths make it especially powerful:

1: Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application. Keywords: Pythonic style, simple methods.

2: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding. Keywords: encoding conversion. Anyone who uses Python knows how tedious its encoding handling can be; BeautifulSoup simplifies this.

3: Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. Keywords: works with other HTML parsers, so you can swap them freely.
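To make the encoding point concrete, here is a minimal sketch in Python 3. The input bytes are a made-up example, and I pass from_encoding explicitly rather than relying on autodetection, so treat it as an illustration under those assumptions rather than a recipe.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no declared charset.
raw = b"<p>caf\xe9</p>"

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string               # Unicode text inside the parse tree
utf8_bytes = soup.encode("utf-8")  # the document re-encoded as UTF-8 bytes

print(repr(text))
print(utf8_bytes)
```

Inside the tree everything is Unicode; bytes only appear again when you ask for an encoded copy.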

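Having seen these features, some of you are probably tempted. Let's look at installing BeautifulSoup first:

Installation methods:

1: apt-get install python-bs4

2: easy_install beautifulsoup4

3: pip install beautifulsoup4

4: Install from source: python setup.py install

Choose the method that suits your operating system. All of them install the same package; they differ only in the tool used. I used the fourth method on my own system, so here is a brief walkthrough of it: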

Shell commands

# -o writes a fresh file; the original ">>" would append to an existing file and corrupt the archive
curl -o beautifulsoup4-4.1.2.tar.gz http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/beautifulsoup4-4.1.2.tar.gz
tar zxvf beautifulsoup4-4.1.2.tar.gz
cd beautifulsoup4-4.1.2
python setup.py install

OK, the installation output should now report success.

With the install done, you will want to try it right away. Open a Python prompt and happily type:

Python code

from BeautifulSoup import BeautifulSoup

Sorry: ImportError. Why an import error when everything is installed? A second look at the official site explains it: what I installed is BeautifulSoup 4.1, and this import is the 3.x form. Back at the prompt, type:

Python code

from bs4 import BeautifulSoup

Hmm, no output. Congratulations: the BeautifulSoup package imported successfully.
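If you would rather see something printed than trust the silence, a small check like the following should do. It is only a sketch; it assumes the installed package exposes a __version__ string, which the bs4 releases I know of do.

```python
import bs4
from bs4 import BeautifulSoup

# The bs4 package records its release number in __version__.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
print("Class imported:", BeautifulSoup.__name__)
```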

Following my previous post, http://isilic.iteye.com/blog/1733560 , I wanted to try dir to see which methods BeautifulSoup provides:

Python code

dir(BeautifulSoup)

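That dumps a pile of names in one lump, which is a bit overwhelming. Printing them one per line is much easier to read.

Python code

>>> for method in dir(BeautifulSoup):
...     print(method)
...

Look closely at the findXxx, nextXxx, and previousXxx methods. They provide traversal, backtracking, searching, and matching over an HTML page, which is already enough to pull information out of a page.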
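To pick those three families out of the full listing, you can filter the dir output by prefix. This is just a convenience sketch in Python 3; the variable names are my own.

```python
from bs4 import BeautifulSoup

# Skip names that start with an underscore (special and internal members).
public_names = [n for n in dir(BeautifulSoup) if not n.startswith("_")]

# Group the remaining names by the three prefixes mentioned in the text.
groups = {
    prefix: [n for n in public_names if n.startswith(prefix)]
    for prefix in ("find", "next", "previous")
}

for prefix, names in groups.items():
    print(prefix, len(names), names[:5])
```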

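Let's take the Baidu home page as an example and try out what BeautifulSoup can do.

Python code

>>> import urllib2  # Python 2; on Python 3 use urllib.request instead
>>> page = urllib2.urlopen('http://www.baidu.com')
>>> soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
>>> print(soup.title)
>>> soup.title.string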

The result looks good. A hello-world tutorial really is satisfying.
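If you want to try the same two calls without touching the network, you can feed BeautifulSoup a string. The HTML below is a made-up stand-in for a downloaded page, not Baidu's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded document.
html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # only the text inside it

print(title_tag)
print(title_text)
```

soup.title gives you the tag itself, while .string gives you just its text.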

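To go a step further, I want to find every link on the Baidu home page. That sounds hard, with all sorts of regular expressions and special handling. But wait, we are talking about BeautifulSoup here, so let's see how it does the job.

Python code

>>> for link in soup.find_all('a'):
...     print(link.get('href'))  # get() returns None instead of raising KeyError when href is missing
...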

See the output? Simple, isn't it.
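One detail worth knowing: not every a tag carries an href. A variation that skips such tags up front might look like this; the sample markup is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor has no href attribute at all.
html = """
<a href="http://example.com/">Home</a>
<a href="#">Top</a>
<a name="anchor-only">No href</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True matches only tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```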

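For anyone used to jQuery and CSS, this style is torture: you have to keep looping over whatever was selected. The output above also contains many abnormal URLs such as #. To filter all of those out, the select syntax makes it easy.

Python code

>>> for link in soup.select('a[href^="http"]'):
...     print(link['href'])
...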

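Some will ask: can't I just test each URL and handle it myself? Of course you can; I only wanted to try the select syntax here. As for what select accepts, you can look that up yourselves. Frankly, the select syntax deserves an article of its own.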
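As a small taste of select, here is a sketch run against an invented snippet. The attribute selectors used are standard CSS (a prefix match with ^=); the sample URLs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a mix of link styles.
html = """
<a href="http://example.com/">Home</a>
<a href="https://example.com/secure">Secure</a>
<a href="#">Top</a>
<a href="/duty">Duty</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href^="http" keeps links whose href starts with "http" (this covers https too).
http_links = [a["href"] for a in soup.select('a[href^="http"]')]

# href^="/" keeps root-relative links instead.
root_relative = [a["href"] for a in soup.select('a[href^="/"]')]

print(http_links)
print(root_relative)
```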

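Going further: links such as / or /duty are meaningful too; they are root-relative paths on this site. How do we keep these /-prefixed links from being filtered out? And how do we keep absolute URLs from being filtered out? What if the href attribute holds a javascript: snippet; how do we filter that? If CSS and JS files count, how do we find their URLs as well? And further still, how do we work out the addresses of the ajax requests made in the JS? All of these are possible extensions.

Fine, I admit this kind of URL filtering goes beyond what BeautifulSoup itself does. But if you care about a complete feature, these all have to be considered. It is enough to think through how each could be implemented; if that leads you to learn more, count it as a bonus from this article.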
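For the curious, here is one possible way to approach the first few questions, combining BeautifulSoup with the standard urllib.parse module (Python 3). The base URL, the sample markup, and the helper name collect_urls are all made up for this sketch, and it does not attempt the ajax question at all.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

BASE_URL = "http://www.example.com/index.html"  # hypothetical page address

# Hypothetical page with relative, absolute, fragment, and javascript: links.
html = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
</head><body>
<a href="/">Home</a>
<a href="/duty">Duty</a>
<a href="http://other.example.org/page">Elsewhere</a>
<a href="#">Top</a>
<a href="javascript:void(0)">Click</a>
</body></html>
"""


def collect_urls(markup, base_url):
    """Return absolute http(s) URLs found in a, link, and script tags."""
    soup = BeautifulSoup(markup, "html.parser")
    candidates = [a["href"] for a in soup.find_all("a", href=True)]
    candidates += [tag["href"] for tag in soup.find_all("link", href=True)]
    candidates += [tag["src"] for tag in soup.find_all("script", src=True)]

    urls = []
    for raw in candidates:
        raw = raw.strip()
        if not raw or raw.startswith("#"):
            continue  # empty value or in-page fragment
        absolute = urljoin(base_url, raw)  # resolves / and relative paths
        if urlparse(absolute).scheme in ("http", "https"):
            urls.append(absolute)  # javascript: and similar schemes are dropped
    return urls


found = collect_urls(html, BASE_URL)
for url in found:
    print(url)
```

Resolving everything to an absolute URL first means the filter only has to look at the scheme, rather than guessing from the shape of the raw string.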

Now a quick pass over what BeautifulSoup offers:

dir(BeautifulSoup) output, annotated

DEFAULT_BUILDER_FEATURES
FORMATTERS
ROOT_TAG_NAME
STRIP_ASCII_SPACES: built-in attributes of BeautifulSoup (this name and the three above)
__call__
__class__
__contains__
__delattr__
__delitem__
__dict__
__doc__
__eq__
__format__
__getattr__
__getattribute__
__getitem__
__hash__
__init__
__iter__
__len__
__module__
__ne__
__new__
__nonzero__
__reduce__
__reduce_ex__
__repr__
__setattr__
__setitem__
__sizeof__
__str__
__subclasshook__
__unicode__
__weakref__
_all_strings
_attr_value_as_string
_attribute_checker
_feed
_find_all
_find_one
_lastRecursiveChild
_last_descendant
_popToTag: the names from __call__ down to here are special and internal methods; using them requires a deeper knowledge of Python.
append: modifies the element tree
attribselect_re
childGenerator
children
clear: removes a tag's contents
decode
decode_contents
decompose
descendants
encode
encode_contents
endData
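extract: an important method, discussed later
fetchNextSiblings: following sibling elements
fetchParents: the set of parent elements
fetchPrevious: preceding elements
fetchPreviousSiblings: preceding sibling elements. These fetch* methods search among the current element's parents and siblings.
find: like a search with limit set to 1; returns only the first match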
findAll
findAllNext
findAllPrevious
findChild
findChildren: the child elements
findNext: the next element
findNextSibling: the next sibling
findNextSiblings: all following siblings
findParent: the parent element
findParents: all parent elements
findPrevious
findPreviousSibling
findPreviousSiblings: the find* methods above traverse and search starting from the current element and its children.
find_all_next
find_all_previous
find_next
find_next_sibling
find_next_siblings
find_parent
find_parents
find_previous
find_previous_sibling
find_previous_siblings: these underscore names are the bs4 spellings; prefer them
format_string
get
getText
get_text: returns the text inside a tag, without the tags or their attributes
handle_data
handle_endtag
handle_starttag
has_attr
has_key
index
insert
insert_after
insert_before: modifies the element tree
isSelfClosing
is_empty_element
new_string
new_tag
next
nextGenerator
nextSibling
nextSiblingGenerator
next_elements
next_siblings
object_was_parsed
parentGenerator
parents
parserClass
popTag
prettify: pretty-prints the HTML document
previous
previousGenerator
previousSibling
previousSiblingGenerator
previous_elements
previous_siblings
pushTag
recursiveChildGenerator
renderContents
replaceWith
replaceWithChildren
replace_with
replace_with_children: modifies the contents of elements in the tree
reset
select: selection using the CSS selector syntax familiar from jQuery and CSS.
setup
string
strings
stripped_strings
tag_name_re
text
unwrap
wrap

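Note that some BeautifulSoup methods have two spellings, one camelCase and one with underscores, and the two mean the same thing. This is mainly for compatibility with 3.x: the camelCase names are the 3.x spelling and the underscore names are the 4.x spelling. Prefer the latter, the underscore names.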
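A quick way to convince yourself that the two spellings agree is to call both on the same tree. This sketch assumes your bs4 release still ships the camelCase aliases; newer releases may emit a deprecation warning for them, which is silenced here.

```python
import warnings

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>two</b></p>", "html.parser")

new_style = soup.find_all("b")  # 4.x spelling

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the old name may trigger a warning
    old_style = soup.findAll("b")    # 3.x spelling kept for compatibility

same_result = new_style == old_style
print(same_result)
print([b.string for b in new_style])
```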

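With these methods you have what you need to traverse, extract from, modify, and normalize a document. If you get to use BeautifulSoup at work, your understanding will go much deeper.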
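Here is a short sketch of the modifying side, using a few names from the listing above (new_tag, append, unwrap, get_text, prettify). The markup and the link target are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Replace the text inside <b>.
soup.b.string = "soup"

# Build a new tag and append it to the paragraph.
new_link = soup.new_tag("a", href="http://example.com/")  # hypothetical URL
new_link.string = "link"
soup.p.append(new_link)

# Remove the <b> wrapper but keep its contents.
soup.b.unwrap()

result = str(soup)
plain_text = soup.get_text()  # text only, without tags or attributes

print(result)
print(plain_text)
print(soup.prettify())        # indented, one element per line
```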

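BeautifulSoup supports several parsers. By default it parses HTML; you can also name a parser explicitly: xml, lxml, html5lib, or html.parser. html.parser ships with Python, while lxml (which also provides the xml parser) and html5lib must be installed separately.

html: the default when no parser is named. BeautifulSoup then uses the best HTML parser installed, so the exact output depends on your environment.

Python code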

BeautifulSoup("<a></a>")
# <html><head></head><body><a></a></body></html>

The xml parser

Python code

BeautifulSoup("<a></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a></a>

The lxml parser

Python code

BeautifulSoup("<a>", "lxml")
# <html><body><a></a></body></html>

The html5lib parser

Python code

BeautifulSoup("<a>", "html5lib")
# <html><head></head><body><a></a></body></html>

The html.parser parser

Python code

BeautifulSoup("<a>", "html.parser")
# <a></a>

These examples show how the parsers differ.
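To see which of these parsers your own environment has, and what each makes of the same input, something like the following may help. The helper name parse_with is my own; FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser_names):
    """Parse markup with each named parser; None marks a missing parser."""
    results = {}
    for name in parser_names:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None  # that parser library is not installed
    return results


results = parse_with("<a>", ["html.parser", "lxml", "html5lib"])
for name, output in results.items():
    print(name, "->", output if output is not None else "(not installed)")
```

Only html.parser is guaranteed to be present, since it comes with Python; the other two entries show up as not installed until you add the libraries.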

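When BeautifulSoup parses a document, it loads the whole thing into memory as one large, dense data structure. If all you want from it is a single string, keeping all that data in memory is wasteful. Moreover, if you hold on to a Tag, that Tag points to other Tag objects, so every part of the tree has to be kept; in other words, the whole tree stays in memory. The extract method breaks these links by detaching part of the tree. If you extract the Tag you need and drop your references to everything else, the rest of the tree can be reclaimed by the garbage collector. It works the other way too: if there is a block of the document you don't care about, you can extract it from the tree and leave it to the garbage collector, reducing memory use while still getting your work done.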
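A minimal sketch of extract in action, on a made-up two-paragraph document. It only shows the detaching behaviour; whether memory is actually freed afterwards depends on what references your program still holds.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one paragraph we want, one we do not.
soup = BeautifulSoup(
    '<div><p id="keep">keep me</p><p id="drop">drop me</p></div>',
    "html.parser",
)

# extract() removes the tag from the tree and returns it.
dropped = soup.find("p", id="drop").extract()

remaining = str(soup)
print(remaining)
print(dropped.parent)  # None: the extracted tag no longer belongs to the tree
```

If you never need the removed piece again, decompose does the removal and destroys the tag in one step.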

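As Leonard, the author of BeautifulSoup, puts it, he wrote BeautifulSoup to save other people time and reduce their workload. Once you are used to it, you can get through the content of many sites quickly. This is the spirit of open source: automate as much of the work as possible and cut the effort needed. In a sense programmers ought to be a bit lazy, yet that very laziness is what drives the software industry forward.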