Lecture 3: Web Scraping with BeautifulSoup (Python)

Latest recommended article published on 2024-06-19 10:36:05

荔枝科研社

Category: Web Scraping & Office Automation. Tags: BeautifulSoup, HTML parsing, data extraction, Python web scraping, tag matching

Article link: https://blog.csdn.net/weixin_46039719/article/details/124002313

0 Review

1 What BeautifulSoup Is

2 How to Use BeautifulSoup

2.1 Installing BeautifulSoup

2.2 Parsing Data with BeautifulSoup

2.3 Extracting Data with BeautifulSoup

3 How the Objects Change

0 Review

Lecture 1: The Most Beginner-Friendly Web Scraping Tutorial (Python Implementation)

Lecture 2: HTML Basics (Python)

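1 What BeautifulSoup Is

BeautifulSoup is used to parse and extract data from web pages:

(1) Parsing data: translating the HTML source returned by the server into a form we can work with;

(2) Extracting data: picking out the data we need from everything else on the page.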

2 How to Use BeautifulSoup

2.1 Installing BeautifulSoup

Windows: pip install beautifulsoup4;

Mac: pip3 install beautifulsoup4.
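
To confirm that the installation worked, a quick check along these lines may help. It is only a sketch: it assumes nothing beyond the fact that the package is imported under the name bs4, and the variable name version is chosen here for illustration.

```python
# Quick sanity check that beautifulsoup4 installed correctly.
import bs4

version = bs4.__version__  # version string of the installed package
print("beautifulsoup4 version:", version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.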

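2.2 Parsing Data with BeautifulSoup

bs_object = BeautifulSoup(text_to_parse, 'parser')

The call takes two arguments:

①. Argument 0 is the markup to parse, passed here as a string;

②. Argument 1 names the parser. We use html.parser, which is built into Python's standard library. (It is not the only parser available.)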

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text  # the response body as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
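
The same parsing step can be tried without a network connection. The sketch below swaps the downloaded page for a small hand-written HTML string (hypothetical content, not taken from the course page) and passes it to the same two-argument call.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for res.text from a real request.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Argument 0: the markup as a string. Argument 1: the parser name.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # class name of the parsed object
print(soup.title.text)      # text inside the <title> tag
print(soup.p.text)          # text inside the first <p> tag
```

Working on an inline string like this is a convenient way to experiment with the parser before pointing the code at a live URL.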

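2.3 Extracting Data with BeautifulSoup

(1) find() and find_all()

find() and find_all() are two methods of the BeautifulSoup object. They match HTML tags and attributes and extract the matching data from the BeautifulSoup object:

①. find() extracts only the first match.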

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element and store it in item
print(item)  # print item
# Result: <div>大家好，我是一个块</div>  (the page text means "Hello everyone, I am a block")

②. find_all() extracts every match.

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # use find_all() to extract every match and store the results in items
print(items)  # print items
# Result: [<div>大家好，我是一个块</div>, <div>我也是一个块</div>, <div>我还是一个块</div>]
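
The difference between the two methods also shows up when nothing matches. The following is a minimal offline sketch; the HTML string is hypothetical, loosely modelled on the three <div> blocks in the results above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three <div> blocks and no <table>.
html = "<div>block one</div><div>block two</div><div>block three</div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")             # a single element: the first match
every = soup.find_all("div")         # a list-like collection of all matches
missing = soup.find("table")         # no match: find() gives None
none_found = soup.find_all("table")  # no match: find_all() gives an empty collection

print(first.text)
print(len(every))
print(missing, len(none_found))
```

Because find() can return None, it is worth checking the result before reading .text from it; otherwise a missing element raises AttributeError.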

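Note:

The arguments to find() and find_all() can be a tag, an attribute, or both, depending on what we want to extract from the page.

(1) The class_ argument in the parentheses ends with an underscore to distinguish it from Python's reserved keyword class and avoid a syntax error. Besides class, other attributes such as style can also be used for matching;

(2) If one argument is enough to pinpoint the content, search with just that one. If the content can only be pinpointed when the tag and the attribute both match, pass the two arguments together.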

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup class
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')  # returns a Response object, assigned to res
html = res.text  # the content of the Response object as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'
print(items)  # print items
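
To see how a tag and an attribute narrow a search together, here is a small offline sketch. The markup, the class names, and the style value are all made up for illustration and do not come from the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two elements share class "books", only one is a <div>.
html = """
<div class="books">Book section</div>
<p class="books">A paragraph that shares the class</p>
<div class="movies" style="color:red">Movie section</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="books")         # attribute only
by_both = soup.find_all("div", class_="books")   # tag and attribute together
by_style = soup.find_all(style="color:red")      # a different attribute as a keyword
by_attrs = soup.find_all("div", attrs={"class": "movies"})  # same idea via an attrs dict

print(len(by_class), len(by_both), len(by_style), len(by_attrs))
```

The attrs dictionary form is handy when an attribute name cannot be written as a Python keyword argument.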

(2) The Tag object

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup class
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')  # returns a Response object, assigned to res
html = res.text  # the content of the Response object as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'
for item in items:  # iterate over the list items
    kind = item.find('h2')  # within each element, match the <h2> tag
    title = item.find(class_='title')  # within each element, match the attribute class='title'
    brief = item.find(class_='info')  # within each element, match the attribute class='info'
    # print each book's category, title, link, and description
    print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)
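
The loop above relies on three things a Tag object offers: searching inside it with find(), reading its text with .text, and reading an attribute with square brackets. A minimal offline sketch follows, using a hypothetical snippet whose structure only assumes what the loop implies (an <h2>, a link with class title, and an element with class info).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URL and text are placeholders.
html = """
<div class="books">
  <h2>Science fiction</h2>
  <a class="title" href="https://example.com/book-1">Example Book</a>
  <p class="info">A short description.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
item = soup.find(class_="books")    # a Tag object

kind = item.find("h2")              # Tag.find() searches only inside this Tag
title = item.find(class_="title")

print(kind.text)         # text contained in the tag
print(title.name)        # the tag's name
print(title["href"])     # value of the href attribute
print(title.get("id"))   # .get() gives None when the attribute is absent
```

Square-bracket access raises KeyError for a missing attribute, so .get() is the safer choice when an attribute may not be present.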

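3 How the Objects Change

The objects are transformed in this order: Response object -> string -> BeautifulSoup object. From the BeautifulSoup object there are two paths:

①. BeautifulSoup object -> Tag object (using find());

②. BeautifulSoup object -> list -> Tag object (using find_all(), then iterating over the list).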
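
The two paths can be checked with isinstance. This sketch starts from a hypothetical HTML string, so the Response -> string step (res.text) is skipped; the Tag and ResultSet classes are imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical string standing in for res.text.
html = "<div class='books'><h2>One</h2></div><div class='books'><h2>Two</h2></div>"

soup = BeautifulSoup(html, "html.parser")  # string -> BeautifulSoup object

tag = soup.find("div")        # path 1: BeautifulSoup object -> Tag
tags = soup.find_all("div")   # path 2: BeautifulSoup object -> list (a ResultSet)

print(isinstance(tag, Tag))
print(isinstance(tags, ResultSet), isinstance(tags, list))

# Iterating over the list yields Tag objects again.
headings = [t.find("h2").text for t in tags]
print(headings)
```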
