A Simple Guide to the BeautifulSoup Parsing Library

BeautifulSoup

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
from bs4 import BeautifulSoup

html = """
<ol class="grid_view">
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1836121人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">2</em>
                    <a href="https://movie.douban.com/subject/1291546/">
                        <img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1291546/" class="">
                            <span class="title">霸王别姬</span>
                                <span class="other">&nbsp;/&nbsp;再见,我的妾  /  Farewell My Concubine</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 陈凯歌 Kaige Chen&nbsp;&nbsp;&nbsp;主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
                            1993&nbsp;/&nbsp;中国大陆 中国香港&nbsp;/&nbsp;剧情 爱情 同性
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.6</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1349883人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">风华绝代。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">3</em>
                    <a href="https://movie.douban.com/subject/1292720/">
                        <img width="100" alt="阿甘正传" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p1484728154.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292720/" class="">
                            <span class="title">阿甘正传</span>
                                    <span class="title">&nbsp;/&nbsp;Forrest Gump</span>
                                <span class="other">&nbsp;/&nbsp;福雷斯特·冈普</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 罗伯特·泽米吉斯 Robert Zemeckis&nbsp;&nbsp;&nbsp;主演: 汤姆·汉克斯 Tom Hanks / ...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;剧情 爱情
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.5</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1405913人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">一部美国近现代史。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>
"""
soup = BeautifulSoup(html, 'lxml')
# Get all <li> tags
lis = soup.find_all('li')
for li in lis:
    print(type(li))
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
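# Get the second <li> tag
# limit caps the number of elements returned; find_all still returns a list
second_li = soup.find_all('li', limit=2)[1]
print(second_li)

# Unlike XPath, where the position is written inside the expression, bs4 returns a list that you index like any Python list (0-based)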
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img alt="霸王别姬" class="" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" width="100"/>
</a>
</div>
<div class="info">
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1291546/">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾  /  Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
                            导演: 陈凯歌 Kaige Chen   主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br/>
                            1993 / 中国大陆 中国香港 / 剧情 爱情 同性
                        </p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
</div>
</div>
</div>
</li>
# Get all <div> tags whose class is "star"
divs = soup.find_all('div', class_='star')     # further attribute filters can be added as extra keyword arguments
for div in divs:
    print(div)
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span content="10.0" property="v:best"></span>
<span>1836121人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.5</span>
<span content="10.0" property="v:best"></span>
<span>1405913人评价</span>
</div>
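# Same query, with the filter passed through the attrs dictionary
divs = soup.find_all('div', attrs={'class': 'star'})   # the dictionary can hold several attribute filters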
for div in divs:
    print(div)
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span content="10.0" property="v:best"></span>
<span>1836121人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.5</span>
<span content="10.0" property="v:best"></span>
<span>1405913人评价</span>
</div>
# Get the href attribute of every <a> tag
# 1. Extract the attribute by subscript access (recommended)
alists = soup.find_all('a')
for alist in alists:
    href = alist['href']
    print(href)
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1292720/
https://movie.douban.com/subject/1292720/
# 2. Extract the attribute through the attrs dictionary
alists = soup.find_all('a')
for alist in alists:
    href = alist.attrs['href']
    print(href)
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1292720/
https://movie.douban.com/subject/1292720/

Notes

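find_all:
The first argument is the tag name. To filter on tag attributes as well, either pass each attribute name and its value as a keyword argument (use class_ for the class attribute, since class is a reserved word in Python), or put all the attributes and their values into a dictionary and pass it as the attrs argument.
To avoid extracting too many tags, use the limit parameter to cap the number of results.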
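
As a rough illustration of the three options above, here is a minimal sketch. The HTML fragment, the data-rank attribute, and the variable names are made up for this example, and it uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, not the Douban snippet above).
html = """
<ul>
  <li class="movie" data-rank="1">A</li>
  <li class="movie" data-rank="2">B</li>
  <li class="other" data-rank="3">C</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword-argument filter; class is a Python keyword, so bs4 spells it class_.
by_keyword = soup.find_all("li", class_="movie")

# The same filter expressed through the attrs dictionary.
by_attrs = soup.find_all("li", attrs={"class": "movie"})

# attrs also handles attribute names that are not valid Python identifiers.
rank_two = soup.find_all("li", attrs={"data-rank": "2"})

# limit stops the search once the given number of matches is reached.
first_only = soup.find_all("li", limit=1)

print([li.get_text() for li in by_keyword])  # ['A', 'B']
print([li.get_text() for li in by_attrs])    # ['A', 'B']
print([li.get_text() for li in rank_two])    # ['B']
print(len(first_only))                       # 1
```

The attrs form is the one to reach for when an attribute name contains a hyphen, because such a name cannot be written as a keyword argument.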

find vs. find_all:
find: returns the first tag that matches (None if nothing matches)
find_all: returns every tag that matches, as a list
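
The difference is easiest to see on a tiny document. The sketch below uses a made-up fragment and 'html.parser' (an assumption, chosen so the example needs nothing beyond bs4); it also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two matching <span> tags and no <p> tag.
html = "<div><span class='title'>One</span><span class='title'>Two</span></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("span", class_="title")      # a single Tag
every = soup.find_all("span", class_="title")  # a list-like ResultSet of Tags

missing_one = soup.find("p")      # None when nothing matches
missing_all = soup.find_all("p")  # empty list when nothing matches

print(first.get_text())                # One
print([s.get_text() for s in every])   # ['One', 'Two']
print(missing_one)                     # None
print(len(missing_all))                # 0
```

Because find can return None, it is worth checking the result before subscripting or calling methods on it.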

Filters accepted by find and find_all:
Keyword arguments
The attrs argument

Getting a tag's attributes:
By subscript access
Through the attrs attribute

# By subscript access (a is an <a> Tag, e.g. from soup.find('a'))
href = a['href']
# Through the attrs attribute
href = a.attrs['href']
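
Both forms raise KeyError when the attribute is absent. The sketch below, on a made-up fragment parsed with 'html.parser' (an assumption for portability), shows the two forms side by side plus Tag.get, which returns None instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link with an href, one without.
html = '<a href="https://example.com/" class="link external">x</a><a>no href</a>'
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

by_subscript = with_href["href"]        # subscript access
by_attrs = with_href.attrs["href"]      # same value through the attrs dict

safe_lookup = without_href.get("href")  # None, no exception

try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True

# class is a multi-valued attribute, so bs4 returns it as a list.
classes = with_href["class"]

print(by_subscript)  # https://example.com/
print(safe_lookup)   # None
print(raised)        # True
print(classes)       # ['link', 'external']
```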

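The string, strings, and stripped_strings attributes and the get_text method
string: the single text string directly inside a tag. Returns a string (None if the tag has more than one child).
strings: all text strings among the tag's descendants. Returns a generator.
stripped_strings: same as strings, but leading and trailing whitespace is removed and whitespace-only strings are skipped. Returns a generator.
get_text: all text among the tag's descendants, joined into a single string rather than returned as a list.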
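
To compare the four side by side, here is a small sketch on a made-up fragment (parsed with 'html.parser', an assumption so it runs without lxml). The padding spaces inside the spans are deliberate, to show what stripping does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <p> with two <span> children, each padded with spaces.
html = "<p><span> Hello </span><span> world </span></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
first_span = p.find("span")

single = first_span.string   # ' Hello ' (the span has exactly one text child)
ambiguous = p.string         # None (the <p> has two children)

all_strings = list(p.strings)        # generator -> [' Hello ', ' world ']
stripped = list(p.stripped_strings)  # generator -> ['Hello', 'world']

joined = p.get_text()                   # ' Hello  world '
tidy = p.get_text(" ", strip=True)      # 'Hello world'

print(repr(single))
print(ambiguous)
print(all_strings)
print(stripped)
print(repr(joined))
print(repr(tidy))
```

Passing a separator together with strip=True to get_text is a convenient way to get one clean line of text out of a nested tag.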
