beautifulsoup4 Tutorial (1): Basics and a First Crawler

Latest recommended article published 2024-02-22 17:40:57

tyson Lee


Category: Crawlers

Article link: https://blog.csdn.net/chinaltx/article/details/86748755


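beautifulsoup4 Tutorial (1): Basics and a First Crawler

 beautifulsoup4 Tutorial (2): The Four Main Objects in bs4

 beautifulsoup4 Tutorial (3): Traversing and Searching the Document Tree

 beautifulsoup4 Tutorial (4): CSS Selectors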

I. Basics

1.

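Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.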
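As an illustration of stating the original encoding, here is a minimal sketch. It assumes Python 3 and a current bs4 release; the GBK-encoded sample markup is made up for this example. The `from_encoding` argument tells Beautiful Soup how to decode the raw bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no charset declaration.
raw = u"<html><body><p>你好，世界</p></body></html>".encode("gbk")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()            # text inside the tree is Unicode
utf8_bytes = soup.encode("utf-8")   # serialized output as UTF-8 bytes

print(utf8_bytes)
```

If detection succeeds on its own, `from_encoding` can simply be left out.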

2.

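Beautiful Soup 3 is no longer developed, and Beautiful Soup 4 is recommended for current projects. The library now lives in the bs4 package, so it is imported with import bs4. The version used here is Beautiful Soup 4.3.2 (BS4 for short). This post was written on Python 2.7.7, but BS4 also supports Python 3, and recent BS4 releases run on Python 3 only. Readers on Python 3 should therefore install BS4 as well; BS3 runs only on Python 2 and is not an option there.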

3.

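It can be installed with pip or easy_install; either of the following works (easy_install is deprecated in current setuptools, so pip is the better choice today):
easy_install beautifulsoup4
pip install beautifulsoup4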
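A quick way to confirm that the installation worked is to import the package and parse a tiny snippet. This is only a sanity-check sketch, assuming Python 3; the sample markup is made up for the example.

```python
import bs4
from bs4 import BeautifulSoup

# Show which Beautiful Soup version was installed.
print(bs4.__version__)

# Parse a trivial document with the built-in parser.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```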

4.

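Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup uses Python's built-in one. The lxml parser is more capable and faster, so installing it is recommended.
pip install lxml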
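One way to prefer lxml while still working on machines where it is missing is sketched below. The helper name `make_soup` is hypothetical and not part of bs4; the sketch relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml if it is installed, else use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to Python's standard library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.get_text())
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from the same markup.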

5.

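Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Poor leniency in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best leniency; parses documents the way a browser does; produces HTML5 documents	Slow; a separate package to install (pure Python, no C extension needed)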
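The differences in the table can be seen by feeding the same invalid markup to each parser. The sketch below assumes Python 3; it tries the three parser names from the table and skips any that are not installed, so the output depends on the environment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: closes a tag that was never opened

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This parser is not installed here; record that and move on.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Each parser repairs the markup in its own way, which is why the same page can yield different trees under different parsers.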

II. A First Simple Crawler

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

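# Create the BeautifulSoup object, naming the parser explicitly
# A local HTML file can also be used to create it, for example:
# soup = BeautifulSoup(open('index.html'), 'html.parser')
soup = BeautifulSoup(html, 'html.parser')

# Pretty-print the parsed document
print(soup.prettify())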
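Once the document is parsed, pulling data out of it takes only a few more lines. The sketch below is a possible next step, not part of the original script: it assumes Python 3, uses a shortened copy of the same sample markup, and reads the page title and the link targets.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used above.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag.
title = soup.title.string

# The href attribute of every <a> tag, in document order.
links = [a["href"] for a in soup.find_all("a")]

print(title)
for link in links:
    print(link)
```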
