02 BeautifulSoup

KeepChasing1

Category: 风变编程_爬虫精进 (Fengbian Programming: Advanced Web Crawling)

Article link: https://blog.csdn.net/qq_40678779/article/details/107094747

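This article is an introductory tutorial on the BeautifulSoup library. It covers the basic operations for scraping and parsing web pages: using selectors, iterating over elements, and extracting data.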

# Author:Nimo_Ding

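'''
The four steps of a web crawler:
Fetch the data - done with the requests library
Parse the data - done with the BeautifulSoup parsing library
Extract the data - done with the BeautifulSoup parsing library
Save the data

BeautifulSoup is currently at version 4.
Install: pip3 install beautifulsoup4
'''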
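
The four steps listed above can be sketched end to end as follows. This is only an illustration under stated assumptions: the HTML string is hypothetical and stands in for the fetch step so that the sketch runs offline, and an in-memory buffer stands in for a real CSV file in the save step.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (fetch) is replaced by an inline string so the sketch runs offline.
# This markup is hypothetical; it only imitates a small book listing.
html = """
<div class="books">
  <h2>Science Fiction</h2>
  <a class="title" href="https://example.com/book1">Book One</a>
  <p class="info">A short description.</p>
</div>
"""

# Step 2: parse the string into a BeautifulSoup object.
soup = BeautifulSoup(html, 'html.parser')

# Step 3: extract the fields of interest from each matching block.
rows = []
for item in soup.find_all('div', class_='books'):
    title = item.find(class_='title')
    rows.append([item.find('h2').text, title.text, title['href']])

# Step 4: save the data. io.StringIO stands in for an open file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['category', 'title', 'link'])
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file, the buffer could be swapped for `open('books.csv', 'w', newline='')`; the file name is only an example.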


# Import the requests library
import requests

# Import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

# Fetch the page source; requests.get() returns a Response object, assigned to res
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')

# Get the content of the Response object as a string
html = res.text

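# 1. Parse the data:
# Parse the page into a BeautifulSoup object.
# The first argument is the text to parse, passed here as a string.
# The second argument is the parser. 'html.parser' is built into Python's
# standard library, so it is the simplest choice and needs no extra install.
soup = BeautifulSoup(html, 'html.parser')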

print(type(soup))
# <class 'bs4.BeautifulSoup'>
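# Checking the type shows that soup is a BeautifulSoup object.

# print(soup)
# Printing soup shows essentially the same source code as printing res.text.
# The difference is the type: res.text is a string, <class 'str'>,
# while soup is a <class 'bs4.BeautifulSoup'>.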
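
The difference between the raw string and the parsed object can be seen in a small offline sketch. The one-line HTML snippet is hypothetical and stands in for `res.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for res.text.
html = '<p class="info">hello</p>'
soup = BeautifulSoup(html, 'html.parser')

print(type(html))  # <class 'str'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>

# A plain string knows nothing about tags; the soup object lets us navigate them.
print(soup.p.text)

# str(soup) serializes the parsed tree back into markup.
print(str(soup))
```

The printed markup looks alike in both cases, but only the parsed object supports tag lookups such as `soup.p`.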

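# 2. Extract the data:
#       I: find() and find_all() are two methods of the BeautifulSoup object.
#           They match HTML tags and attributes and pull the matching
#           data out of the BeautifulSoup object.
#           Both are called the same way; they differ in how much they return.
#           find(): returns only the first match
#           find_all(): returns every match
#           Usage: BeautifulSoup_object.find(tag, attribute)
#                   soup.find('div', class_='books')
#                   soup.find_all('div', class_='books')
#                   The trailing underscore in class_ is needed because class is a
#                   reserved keyword in Python and cannot be used as an argument name.
#       II: Tag objects (each match returned by find() or find_all() is a Tag)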
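
A short sketch of how the two methods differ in what they return, including the case where nothing matches. The HTML is hypothetical and the class name `no-such-class` is chosen only so that the lookup fails.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two blocks that share a class.
html = """
<div class="books"><h2>First</h2></div>
<div class="books"><h2>Second</h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag.
first = soup.find('div', class_='books')
# find_all() returns a list-like ResultSet with every matching Tag.
all_divs = soup.find_all('div', class_='books')

# With no match, find() returns None and find_all() returns an empty result.
missing = soup.find('div', class_='no-such-class')
none_found = soup.find_all('div', class_='no-such-class')

print(first.h2.text)
print(len(all_divs))
print(missing)
print(len(none_found))
```

Because find() can return None, calling `.text` on its result directly raises an AttributeError when the element is absent.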

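# Test: scrape the title, link, and description of the three books on the page.
# Locate the data by its class attribute (the tag name 'div' could be passed as well).
items = soup.find_all(class_='books')
# items is a <class 'bs4.element.ResultSet'>,
# which can be treated like a list.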

for item in items:
    print(item.find('h2').text)  # Book category; only the first match is needed, so use find().
    print(type(item.find('h2')))  # A Tag: <class 'bs4.element.Tag'>

    print(item.find(class_='title').text)  # Book title
    a = item.find(class_='title')
    print(a['href'])  # Book link
    print(type(item.find(class_='title')))  # A Tag: <class 'bs4.element.Tag'>

    print(item.find(class_='info').text)  # Book description
    print(type(item.find(class_='info')))  # A Tag: <class 'bs4.element.Tag'>
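
The loop above assumes every block contains all three elements. A more defensive variant might look like the sketch below. The HTML is hypothetical (the second block deliberately lacks a link and a description), and `text_or_default` is a helper introduced here for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second block has no href and no description.
html = """
<div class="books">
  <h2>Complete</h2>
  <a class="title" href="https://example.com/a">Book A</a>
  <p class="info">Has every field.</p>
</div>
<div class="books">
  <h2>Incomplete</h2>
  <a class="title">Book B</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def text_or_default(tag, default=''):
    """Return the stripped text of a Tag, or a default when find() returned None."""
    return tag.text.strip() if tag is not None else default


records = []
for item in soup.find_all(class_='books'):
    title = item.find(class_='title')
    records.append({
        'category': text_or_default(item.find('h2')),
        'title': text_or_default(title),
        # Tag.get() returns None for a missing attribute instead of raising KeyError.
        'link': title.get('href') if title is not None else None,
        'info': text_or_default(item.find(class_='info')),
    })

for record in records:
    print(record)
```

Using `Tag.get('href')` rather than `Tag['href']` is a design choice here: the subscript form raises KeyError when the attribute is absent, which would stop the whole loop.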


'''

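Common HTML tags:
    <html> defines an HTML document
    <head> defines the document head
    <body> defines the document body
    <a> defines a hyperlink
    <audio> defines audio content
    <button> defines a button
    <div> defines a block-level section
    <h1>, <h2>, <h3> define headings
    <p> defines a paragraph
    <img> defines an image
    <ol> defines an ordered list
    <ul> defines an unordered list
    <li> defines a single list item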

HTML attributes:
    class assigns one or more class names to an HTML element
    id defines a unique id for the element
    href defines the target of a link
    style specifies the element's inline style

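Response object:
    response.status_code  -- the HTTP status code of the response
    response.content      -- the response body as raw bytes
    response.text         -- the response body decoded to a string
    response.encoding     -- the encoding used to decode response.text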

BeautifulSoup_object.find(tag, attribute)
BeautifulSoup_object.find_all(tag, attribute)

soup.find('div', class_='books')
soup.find_all('div', class_='books')
You can search by tag alone, by attribute alone, or by both.

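Common attributes and methods of a Tag object:
    Tag.find()     -- extract the first matching Tag nested inside this Tag
    Tag.find_all() -- extract all matching Tags nested inside this Tag
    Tag.text       -- extract the text inside the Tag
    Tag['attribute_name'] -- takes an attribute name and returns that attribute's value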


<class 'bs4.element.ResultSet'>
This is a list-like structure and can be handled like a list.

Note:
Besides find() and find_all(), BeautifulSoup also provides select(), which achieves the same goal using CSS selectors.
'''
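
As a rough illustration of the note on select(), the sketch below extracts the same links once with find_all() and once with a CSS selector. The HTML is hypothetical, and the selector string `div.books a.title` is just one possible way to express the same match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two book entries.
html = """
<div class="books"><a class="title" href="https://example.com/a">Book A</a></div>
<div class="books"><a class="title" href="https://example.com/b">Book B</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Tag name plus class_ keyword, as used throughout this article.
via_find_all = [a['href'] for a in soup.find_all('a', class_='title')]

# The same match written as a CSS selector: <a class="title"> inside <div class="books">.
via_select = [a['href'] for a in soup.select('div.books a.title')]

# select_one() is the single-result counterpart, similar in spirit to find().
first_title = soup.select_one('a.title')

print(via_find_all)
print(via_select)
print(first_title.text)
```

Which style to use is largely a matter of taste; CSS selectors tend to be more compact when the match depends on nesting.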

Summary: