1-Basic introduction to the BeautifulSoup4 library
Note: once a document is loaded, BeautifulSoup builds the full document tree automatically, so it carries a relatively high overhead; lxml is written in C and is faster.
2-Basic usage of the BeautifulSoup4 library
Simple example:
from bs4 import BeautifulSoup
html = """
<a href="https://www.doutula.com/article/detail/6394359" class="list-group-item random_list tg-article">
<div class="random_title">斗图<div class="date">2020-09-07</div>
</div>
<div class="random_article">
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
alt="" data-was-processed="true">
<p></p>
</div>
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
alt="" data-was-processed="true">
<p></p>
</div>
</div>
</a>
"""
# Create a BeautifulSoup object
# parsed with lxml
soup = BeautifulSoup(html, 'lxml')  # automatically completes missing tags such as <body> and <html>
print(soup.prettify())  # pretty-print the parsed document
Parsers:
Beautiful Soup supports several parsers: Python's built-in html.parser, lxml (fast), lxml-xml (for XML documents), and html5lib (parses the way a browser does).
Note: html5lib has the strongest fault tolerance (it can automatically repair non-standard HTML).
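As a small sketch of parser tolerance (using only the built-in html.parser so no extra install is needed): even the standard parser repairs simple problems such as an unclosed tag.

```python
from bs4 import BeautifulSoup

# The built-in parser closes the unclosed <p> tag for us.
soup = BeautifulSoup("<p>broken markup", "html.parser")
print(soup)  # <p>broken markup</p>
# html5lib (pip install html5lib) would additionally wrap this in full
# <html><head></head><body>...</body></html> scaffolding.
```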
Four commonly used objects:
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object; all of the objects fall into four classes:
- Tag
- NavigableString
- BeautifulSoup
- Comment
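A minimal sketch showing all four classes on a tiny made-up document (the class names come from bs4.element):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

soup = BeautifulSoup("<b>bold<!--a comment--></b>", "html.parser")
print(type(soup))           # the BeautifulSoup object for the whole document
b = soup.b
print(type(b))              # Tag
print(type(b.contents[0]))  # NavigableString ("bold")
print(type(b.contents[1]))  # Comment
```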
3-Extracting data with the BeautifulSoup4 library
# Requirements:
# 1. Get all tr tags
# 2. Get the second tr tag
# 3. Get all tags whose class equals "even"
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# 5. Get the href attribute of every a tag
# 6. Get all the job information (plain text)
from bs4 import BeautifulSoup
# html is assumed to hold the markup of tencent.html, e.g.:
# with open('tencent.html', encoding='utf-8') as f:
#     html = f.read()
soup = BeautifulSoup(html, 'lxml')
# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print('=' * 30)
    print(type(tr))  # bs4.element.Tag (its string-conversion method lets it print as markup)
# 2. Get the second tr tag
# find_all returns a list; limit caps how many elements are returned
tr = soup.find_all('tr', limit=2)[1]
print(tr)
# 3. Get all tags whose class equals "even"
# trs = soup.find_all('tr', class_='even')  # "class" is a Python keyword, so the keyword argument is spelled class_
# Equivalent: pass the attribute through attrs
trs = soup.find_all('tr', attrs={'class': 'even'})
for tr in trs:
    print(tr)
    print('=' * 30)
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# aList = soup.find_all('a', id='test', class_='test')
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
for a in aList:
    print(a)
# 5. Get the href attribute of every a tag
aList = soup.find_all('a')
for a in aList:
    # Option 1: subscript access
    href = a['href']
    print(href)
    # Option 2: through the attrs dictionary
    href = a.attrs['href']
    print(href)
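One extra detail worth knowing (a sketch, not from the notes above): subscript access raises KeyError when the attribute is missing, whereas Tag.get() returns None or a chosen default.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/jobs">jobs</a><a>no link</a>', "html.parser")
# a['href'] would raise KeyError on the second tag; .get() never raises.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['/jobs', None]
```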
# 6. Get all the job information (plain text)
trs = soup.find_all('tr')[1:]
movies = []
for tr in trs:
    movie = {}
    # Option 1: read each td's string
    # tds = tr.find_all('td')
    # title = tds[0].string
    # category = tds[1].string
    # nums = tds[2].string
    # city = tds[3].string
    # pubtime = tds[4].string
    # movie['title'] = title
    # movie['category'] = category
    # movie['nums'] = nums
    # movie['city'] = city
    # movie['pubtime'] = pubtime
    # movies.append(movie)
    # Option 2:
    # infos = tr.strings  # every non-tag string, returned as a generator
    # infos = list(infos)  # convert to a list
    infos = list(tr.stripped_strings)  # like .strings, but whitespace-only strings are dropped
    movie['title'] = infos[0]
    movie['category'] = infos[1]
    movie['nums'] = infos[2]
    movie['city'] = infos[3]
    movie['pubtime'] = infos[4]
    movies.append(movie)
print(movies)
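Since tencent.html itself is not shown, here is a self-contained sketch of the same steps on a made-up inline table (all cell text and class names are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Title</td><td>Category</td><td>City</td></tr>
<tr class="even"><td>Engineer</td><td>Tech</td><td>Shenzhen</td></tr>
<tr class="odd"><td>Designer</td><td>Design</td><td>Beijing</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

second_tr = soup.find_all("tr", limit=2)[1]    # the second tr
even_trs = soup.find_all("tr", class_="even")  # filter by class
jobs = []
for tr in soup.find_all("tr")[1:]:             # skip the header row
    infos = list(tr.stripped_strings)
    jobs.append({"title": infos[0], "category": infos[1], "city": infos[2]})
print(jobs)
```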
BeautifulSoup summary:
4-BeautifulSoup odds and ends
1) The Comment type:
html = """
<p>
<!--I am a comment string-->
</p>
"""
# the <!-- --> above is an HTML comment
from bs4 import BeautifulSoup
# from bs4.element import Tag
# from bs4.element import NavigableString
soup = BeautifulSoup(html, 'lxml')
p = soup.find('p')
print(type(p))
print(type(p.string))
Result:
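With the multi-line markup above, p.string may actually come back as None, because the newlines around the comment are kept as separate child nodes. When the comment is the tag's only child, the Comment class shows up clearly (a minimal sketch):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, Comment

soup = BeautifulSoup("<p><!--I am a comment string--></p>", "html.parser")
p = soup.find("p")
print(type(p))         # Tag
print(type(p.string))  # Comment: the comment is the tag's only child
print(p.string)        # the comment text, without the <!-- --> markers
```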
2) contents and children: .contents returns a tag's direct children as a list, while .children returns the same nodes as an iterator.
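A small sketch of the contents/children difference (the sample markup is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
div = soup.div
print(div.contents)        # list of direct children: [<p>a</p>, <p>b</p>]
print(list(div.children))  # .children yields the same nodes lazily
```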
3) Note: string vs. contents
1. For single-line data:
<p>string</p>
p = soup.find('p')
print(p.string) yields "string"
2. For data in the three-line form:
<p>
string
</p>
- print(p.string) may come back as None: when the tag's content is split across several child nodes (for example when the surrounding '\n' newlines are kept as separate strings, or nested tags are mixed in), .string cannot decide which one to return
- p.contents prints every direct child, e.g. ['\n', 'string', '\n']
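The distinction can be demonstrated reliably with a tag whose content spans several nodes (a sketch using html.parser):

```python
from bs4 import BeautifulSoup

single = BeautifulSoup("<p>string</p>", "html.parser").p
print(single.string)   # 'string': exactly one child, so .string works

multi = BeautifulSoup("<p>\n<b>string</b>\n</p>", "html.parser").p
print(multi.string)    # None: several children, .string cannot choose
print(multi.contents)  # ['\n', <b>string</b>, '\n']
```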