BeautifulSoup
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
from bs4 import BeautifulSoup
html = """
<ol class="grid_view">
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>1836121人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
</div>
</div>
</div>
</li>
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1291546/" class="">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾 / Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 陈凯歌 Kaige Chen 主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
1993 / 中国大陆 中国香港 / 剧情 爱情 同性
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1349883人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
</div>
</div>
</div>
</li>
<li>
<div class="item">
<div class="pic">
<em class="">3</em>
<a href="https://movie.douban.com/subject/1292720/">
<img width="100" alt="阿甘正传" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p1484728154.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292720/" class="">
<span class="title">阿甘正传</span>
<span class="title"> / Forrest Gump</span>
<span class="other"> / 福雷斯特·冈普</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 罗伯特·泽米吉斯 Robert Zemeckis 主演: 汤姆·汉克斯 Tom Hanks / ...<br>
1994 / 美国 / 剧情 爱情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.5</span>
<span property="v:best" content="10.0"></span>
<span>1405913人评价</span>
</div>
<p class="quote">
<span class="inq">一部美国近现代史。</span>
</p>
</div>
</div>
</div>
</li>
"""
soup = BeautifulSoup(html, 'lxml')
# Get all <li> tags
lis = soup.find_all('li')
for li in lis:
    print(type(li))
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
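# Get the second <li> tag
tr = soup.find_all('li', limit=2)[1]  # limit caps the number of matches; find_all still returns a list
print(tr)
# Unlike XPath, where the position is written into the expression, bs4 hands back a Python list that you index yourself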
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img alt="霸王别姬" class="" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" width="100"/>
</a>
</div>
<div class="info">
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1291546/">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾 / Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 陈凯歌 Kaige Chen 主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br/>
1993 / 中国大陆 中国香港 / 剧情 爱情 同性
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
</div>
</div>
</div>
</li>
# Get all <div> tags whose class is "star"
divs = soup.find_all('div', class_='star')  # further attribute filters can be added as extra keyword arguments
for div in divs:
    print(div)
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span content="10.0" property="v:best"></span>
<span>1836121人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.5</span>
<span content="10.0" property="v:best"></span>
<span>1405913人评价</span>
</div>
divs = soup.find_all('div', attrs={'class': 'star'})  # the dict can hold several attribute filters
for div in divs:
    print(div)
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span content="10.0" property="v:best"></span>
<span>1836121人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1349883人评价</span>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.5</span>
<span content="10.0" property="v:best"></span>
<span>1405913人评价</span>
</div>
# Get the href attribute of every <a> tag
# 1. Extract the attribute by subscripting (recommended)
alists = soup.find_all('a')
for alist in alists:
    href = alist['href']
    print(href)
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1292720/
https://movie.douban.com/subject/1292720/
# 2. Through the attrs attribute
alists = soup.find_all('a')
for alist in alists:
    href = alist.attrs['href']
    print(href)
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1291546/
https://movie.douban.com/subject/1292720/
https://movie.douban.com/subject/1292720/
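Notes
find_all:
When extracting tags, the first argument is the tag name. To filter by attribute as well, either pass each attribute name and its value as a keyword argument, or put all the attribute names and values in a dict and pass it as the attrs argument.
To avoid collecting too many tags, use the limit argument to cap the number of results.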
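As a rough illustration of the three options in this note, the sketch below runs them side by side. The snippet is a cut-down fragment modelled on the page above, and 'html.parser' is used only as an assumption so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# A shortened fragment modelled on the page above (illustration only).
snippet = """
<div class="star">
  <span class="rating_num" property="v:average">9.7</span>
  <span property="v:best" content="10.0"></span>
</div>
<div class="star">
  <span class="rating_num" property="v:average">9.6</span>
  <span property="v:best" content="10.0"></span>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Keyword-argument form: class_ carries a trailing underscore because
# class is a reserved word in Python.
by_keyword = soup.find_all("span", class_="rating_num")

# attrs form: every attribute in the dict has to match.
by_attrs = soup.find_all(
    "span", attrs={"class": "rating_num", "property": "v:average"}
)

# limit stops the search once that many matches have been found.
first_only = soup.find_all("span", class_="rating_num", limit=1)

ratings = [span.get_text() for span in by_keyword]
print(ratings)
print(len(by_attrs), len(first_only))
```

Both forms select the same two spans here; the dict form is mainly convenient for attribute names that are not valid Python identifiers.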
Difference between find and find_all:
find: returns the first tag that matches
find_all: returns all tags that match
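A small sketch of that difference, including what each method gives back when nothing matches. The two-item list is made up for the example, and 'html.parser' is an assumption chosen to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
snippet = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("li")               # a single Tag
every = soup.find_all("li")           # a list-like collection of Tags
missing = soup.find("table")          # nothing matches: None
missing_all = soup.find_all("table")  # nothing matches: empty list

print(first.get_text())
print(len(every))
print(missing is None)
print(len(missing_all))
```

Because find can return None, it is worth checking the result before chaining further calls onto it.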
Filter conditions for find and find_all:
Keyword arguments
The attrs argument
Getting a tag's attributes:
By subscript
Through the attrs attribute
# By subscript
href = a['href']
# Through the attrs attribute
href = a.attrs['href']
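The two lines above assume the attribute exists. The sketch below, which goes slightly beyond the original notes, shows what happens when it does not: subscripting raises KeyError, while Tag.get returns None. The second, href-less <a> tag is invented for the example, and 'html.parser' is again an assumption.

```python
from bs4 import BeautifulSoup

# The second <a> deliberately has no href (illustration only).
snippet = (
    '<a href="https://movie.douban.com/subject/1292052/">link</a>'
    "<a>no href</a>"
)
soup = BeautifulSoup(snippet, "html.parser")
with_href, without_href = soup.find_all("a")

# Both access styles read from the same underlying attribute dict.
print(with_href["href"])
print(with_href.attrs["href"])

# get() returns None instead of raising when the attribute is absent.
print(without_href.get("href"))

try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True
print(raised)
```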
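The string, strings, and stripped_strings attributes and the get_text method
string: gets the non-tag string directly under a tag. Returns a string, or None when the tag has more than one child.
strings: gets every non-tag string among a tag's descendants. Returns a generator.
stripped_strings: like strings, but strips surrounding whitespace and skips whitespace-only strings. Returns a generator.
get_text: gets every non-tag string among a tag's descendants, joined into a single string rather than returned as a list.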
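To make the four behaviours concrete, here is a minimal sketch on two title spans taken from the page above, wrapped in a bare <div> for the example. 'html.parser' is an assumption chosen to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Two title spans from the page above, separated by a newline.
snippet = (
    '<div><span class="title">阿甘正传</span>\n'
    '<span class="title"> / Forrest Gump</span></div>'
)
soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div")
first_span = div.find("span")

# .string works only when the tag has exactly one child.
print(first_span.string)   # the span holds a single string
print(div.string)          # the div has several children, so None

# .strings and .stripped_strings are generators; list() materialises them.
all_strings = list(div.strings)
clean_strings = list(div.stripped_strings)
print(all_strings)
print(clean_strings)

# get_text() joins everything into one string; separator and strip
# control how the pieces are joined and trimmed.
raw_text = div.get_text()
tidy_text = div.get_text(" ", strip=True)
print(repr(raw_text))
print(tidy_text)
```

When scraping text that is surrounded by line breaks and indentation, stripped_strings or get_text with strip=True usually needs the least clean-up afterwards.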