Python Web Scraping with BeautifulSoup

Latest recommended article published on 2024-07-26 17:15:21

雪小妮


Category: Python Basic Web Scraping. Tags: python

Article link: https://blog.csdn.net/qq_35249586/article/details/117266227


Reference: https://blog.csdn.net/weixin_34127717/article/details/90583410?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.control

I. BeautifulSoup
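1. Introduction: Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolbox that parses a document and hands you the data you want to scrape.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.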
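The encoding behavior described above can be sketched as follows. This is a minimal illustration, not the article's own code: the GBK-encoded sample bytes are a made-up input, and html.parser is used only so the sketch has no extra dependency. The from_encoding argument is how the original encoding is stated when it cannot be detected.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no charset declaration.
raw = "<html><body><p>你好，世界</p></body></html>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # decoded text (a Unicode string)
utf8_bytes = soup.encode("utf-8")   # serialize the document as UTF-8 bytes

print(text)
print(utf8_bytes.decode("utf-8"))
```

When the input is already a Python str, no encoding argument is needed at all.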
Installation: install Beautiful Soup and lxml (for example, with pip install beautifulsoup4 lxml).

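2. Import the library: from bs4 import BeautifulSoup
Create a BeautifulSoup object: soup = BeautifulSoup(html, 'lxml')
Open the local file index.html and create a soup object from it: soup = BeautifulSoup(open('index.html'), 'lxml')
Print the contents of the soup object with pretty formatting: print(soup.prettify())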

from bs4 import BeautifulSoup
html= '''<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>'''
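soup = BeautifulSoup(html, 'lxml')  # create the soup object from the HTML string
# Alternative: create the soup object from a local file instead
# soup = BeautifulSoup(open('index.html'), 'lxml')
print(soup.prettify())  # pretty-printed output
print(soup.title.string)  # soup.title selects the title node; its string attribute returns the text inside it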

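3. The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:

Tag: put simply, a Tag is one of the tags in the HTML. An HTML tag such as title or a, together with the content it contains, is a Tag. The following are Tags: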

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.title)
# <title>The Dormouse's story</title>
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

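Note: this returns only the first matching tag in the whole document. To get all matching tags, use find_all() (see the method selectors in part 4).
a. A Tag has two important attributes: name (the tag's name) and attrs (the tag's attributes).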

print(soup.name)
print(soup.head.name)
# [document]
# head

b. To get a single attribute, index the tag by the attribute name. For example, to find out what its class is (indexing attrs works the same way):

print(soup.p['class'])
# ['title']
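As a small sketch of how name and attrs relate, the snippet below reuses the first p tag from the sample document. The get() call is a standard Tag method and is included here on the assumption that a missing attribute should not raise an error; html.parser is used only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
tag_name = p.name          # the tag's name, not its HTML "name" attribute
attributes = p.attrs       # all attributes as a dict
css_classes = p["class"]   # class is multi-valued, so this is a list
html_name = p["name"]      # the HTML attribute called "name"
missing = p.get("id")      # get() returns None instead of raising KeyError

print(tag_name, attributes, css_classes, html_name, missing)
```

Indexing with p["id"] would raise KeyError here, which is why get() is often the safer choice when an attribute may be absent.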

NavigableString: the content inside a tag, obtained with .string. Its type is NavigableString, that is, a string that can be navigated.

print(soup.p.string)
# The Dormouse's story

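BeautifulSoup: the BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it is a special Tag, and we can get its type, name, and attributes: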

print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dict)

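Comment: a Comment object is a special type of NavigableString. Its output does not include the comment markers, so if it is not handled carefully it can cause unexpected trouble in text processing.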

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

The code below first checks whether the string is of type Comment, and only then performs other operations such as printing it.

import bs4

if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)

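Navigating related nodes

Children and descendants: after selecting a node, you can get its direct children through the contents attribute. The result is a list of the direct child nodes.

The children attribute gives the same nodes, but returns an iterator rather than a list. You can then use a for loop to output each child.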
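The difference between contents and children might look like this. The HTML fragment is a made-up sample with a span nested inside the second link, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="story">Once <a id="link1">Elsie</a> and '
        '<a id="link2"><span>Lacie</span></a></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# contents is a list of direct children: text nodes and tags alike.
direct_children = p.contents

# children yields the same nodes one at a time; keep only the tags here.
child_tag_names = [child.name for child in p.children if isinstance(child, Tag)]

for child in p.children:
    print(repr(child))
```

Note that the nested span is not among the direct children; only the two a tags and the two text nodes are.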

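To get all descendant nodes, use the descendants attribute.

The result is again a generator. descendants queries all child nodes recursively, so if a direct child contains a nested node such as a span, iterating over descendants outputs that span node as well.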
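A minimal sketch of the recursive behavior, using a made-up fragment in which a span is nested inside a link (html.parser is an assumption chosen to keep the example dependency-free):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="story">Once <a id="link1">Elsie</a> and '
        '<a id="link2"><span>Lacie</span></a></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Direct children only: the nested span does not appear.
child_tag_names = [node.name for node in p.children if isinstance(node, Tag)]

# All descendants, visited depth-first: the nested span appears too.
descendant_tag_names = [node.name for node in p.descendants if isinstance(node, Tag)]

print(child_tag_names)
print(descendant_tag_names)
```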
Parent and ancestor nodes
To get the parent of a node, use the parent attribute.
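As an illustration, the sketch below reads parent for the direct parent and the related parents attribute for the whole chain of ancestors. The fragment is a made-up sample, and html.parser is assumed so that no extra wrapper tags are added.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# parent is the single enclosing node.
parent_name = a.parent.name

# parents walks upward through every ancestor, ending at the document itself.
ancestor_names = [node.name for node in a.parents]

print(parent_name)
print(ancestor_names)
```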

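The nodes reached through these related-node selections are ordinary nodes, so their information, such as text and attributes, is obtained in the same way as before.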

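4. Method selectors

find_all(): returns all elements that match the given conditions.
find(): returns a single element, namely the first match (or None if nothing matches).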
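The two methods can be compared on a trimmed version of the sample document. This is only a sketch: the filters shown (tag name, id, and class_) are a few of the arguments find_all() accepts, and html.parser is used to keep it dependency-free.

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>'''
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")               # every a tag
by_id = soup.find_all("a", id="link2")       # filter on an attribute
by_class = soup.find_all(class_="sister")    # class_ avoids the Python keyword
first_link = soup.find("a")                  # only the first match
no_table = soup.find("table")                # None when nothing matches

for link in all_links:
    print(link["href"], link.string)
```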

II. BeautifulSoup + CSS
To use CSS selectors, simply call the select() method and pass in the CSS selector.
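A short sketch of select() on a made-up fragment of the sample document follows. The selectors are ordinary CSS; select_one() is the companion method that returns only the first match. html.parser is an assumption made to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = '''<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, "html.parser")

bold_titles = soup.select("p.title b")           # descendant selector
sister_links = soup.select("p.story a.sister")   # class selectors
second_link = soup.select_one("#link2")          # id selector, first match only

print([tag.get_text() for tag in sister_links])
print(second_link["href"])
```

select() always returns a list, even when only one element matches.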

Nested selection
For example, first select all ul nodes, then iterate over each ul node and select its li nodes.
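The nested selection described above could be sketched like this. The HTML fragment and its ids (list-1, list-2) are invented for illustration, and html.parser is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = '''<div>
<ul id="list-1"><li>Foo</li><li>Bar</li></ul>
<ul id="list-2"><li>Baz</li></ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")

items_by_list = {}
for ul in soup.select("ul"):
    # select() can be called on any Tag, so the search is limited to this ul.
    items_by_list[ul["id"]] = [li.get_text() for li in ul.select("li")]

print(items_by_list)
```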

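Prefer the lxml parser; fall back to html.parser when necessary.
Selecting nodes by attribute access (such as soup.title) offers weak filtering but is fast.
Use find() or find_all() to query for a single match or for multiple matches.
If you are familiar with CSS selectors, you can use the select() method.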
