Python Web Scraping, Part 5: Installing and Using beautifulsoup4

Latest recommended article published on 2024-08-14 12:08:11

慢羊羊6379.*?

Category column: Python web scraping study

Article link: https://blog.csdn.net/weixin_49088841/article/details/107588676

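1. Introduction to bs4

① Concept: Beautiful Soup is a library for extracting data from HTML or XML files.
② Installation: install the lxml parser first, then bs4: pip install lxml, then pip install bs4 (on PyPI the bs4 package is a thin wrapper that installs beautifulsoup4). This is the most basic installation method; if it fails, see the guide on how to import third-party libraries.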
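A quick way to check the installation is to parse a tiny document. The sketch below is only an illustration: the helper name make_soup is hypothetical, and it assumes you want to fall back to the standard-library html.parser when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
print(soup.p.string)
```

If this prints the paragraph text without an ImportError, bs4 is installed and at least one parser is usable.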

2. Basic usage of bs4

from bs4 import BeautifulSoup


html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
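soup = BeautifulSoup(html, 'lxml')  # the second argument (features) selects the parser
# print(type(soup), soup)
# a = soup.prettify()  # pretty-print the document with structured indentation
a = soup.title.name  # the tag's name
a = soup.title.string  # the tag's text content (overwrites the previous value of a)
# a = soup.find_all('a')  # find all <a> tags
print(a)
# for i in a:
#     print(i.get('href'))  # the link stored in each href attribute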

2.1 bs4 object types

Tag: a tag
NavigableString: a navigable string
BeautifulSoup: the bs object that represents the whole document
Comment: a comment

from bs4 import BeautifulSoup


html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<c><!--dd--></c>
"""
soup = BeautifulSoup(html,'lxml')
print(type(soup.p))#<class 'bs4.element.Tag'>
print(type(soup.a.string))#<class 'bs4.element.NavigableString'>
print(type(soup))#<class 'bs4.BeautifulSoup'>
print(type(soup.c.string))#<class 'bs4.element.Comment'>

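3. Traversal

① contents returns a list.
soup.contents returns the direct children of soup as a list; iterate over the list to get the data you want.
② children returns an iterator over the same direct children.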

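# soup2 is a BeautifulSoup object created elsewhere; its document must contain a <div>
links = soup2.div.children
print(type(links))  # <class 'list_iterator'> in the author's environment
for link in links:
    print(link)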

③ descendants returns a generator that walks through every descendant (children, grandchildren, and so on).

for x in soup.descendants:
    print('----------------')
    print(x)

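④ string gets the content inside a tag.
⑤ strings returns a generator used to get the content of multiple tags.
⑥ stripped_strings is essentially the same as strings, but it removes the surplus whitespace.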
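The six attributes above can be compared side by side on one small document. This is a minimal sketch: the sample markup is made up for illustration, and html.parser is used only so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, parsed with the standard-library parser
sample = '<div><p> first </p><p>second <b>bold</b></p></div>'
soup = BeautifulSoup(sample, 'html.parser')
div = soup.div

direct = div.contents                                    # list of direct children
child_names = [c.name for c in div.children]             # same nodes, via an iterator
all_names = [d.name for d in div.descendants if d.name]  # tags at any depth

single = div.p.string                # the first <p> has exactly one string child
mixed = div.find_all('p')[1].string  # None: this <p> holds text plus a <b> tag
raw = list(div.strings)              # every string, whitespace preserved
clean = list(div.stripped_strings)   # same strings, surrounding whitespace removed

print(child_names, all_names)
print(raw, clean)
```

Note that string only works when a tag has a single child; for tags with mixed content, strings or stripped_strings is the safer choice.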

3.1 Traversing the tree: parent nodes

parent gets the direct parent node.
parents iterates over all ancestor nodes.
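A small sketch of both attributes, using a made-up document and html.parser (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
sample = '<html><body><p><b>text</b></p></body></html>'
soup = BeautifulSoup(sample, 'html.parser')

b_tag = soup.b
direct_parent = b_tag.parent.name            # name of the immediate parent
ancestors = [p.name for p in b_tag.parents]  # nearest ancestor first

print(direct_parent)
print(ancestors)
```

The last item yielded by parents is the BeautifulSoup object itself, whose name is '[document]'.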

3.2 Traversing the tree: sibling nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes
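The four sibling attributes can be tried on a short example. The markup below is hypothetical and deliberately has no whitespace between the tags; in real pages (including the sample document earlier in this post) the text between tags counts as a sibling too, so next_sibling often returns a string rather than a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no text between the <a> tags
sample = ('<p><a id="link1">Elsie</a><a id="link2">Lacie</a>'
          '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(sample, 'html.parser')

first = soup.find('a', id='link1')
middle = soup.find('a', id='link2')
last = soup.find('a', id='link3')

nxt = middle.next_sibling['id']                        # sibling right after link2
prev = middle.previous_sibling['id']                   # sibling right before link2
following = [s['id'] for s in first.next_siblings]     # everything after link1
preceding = [s['id'] for s in last.previous_siblings]  # everything before link3, nearest first

print(nxt, prev, following, preceding)
```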

4. Searching the tree

4.1 String filter

a_tag = soup.find('a')  # find one: returns the first match directly
a_tags = soup.find_all('a')  # find all: returns a list
print(a_tags)

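4.2 Regular-expression filter

Compile a pattern with re.compile() and pass it to find() or find_all(). Beautiful Soup then matches the pattern against tag names using the pattern's search() method, which gives a regular-expression filter.

import re

print(soup.find(re.compile('title')))  # first tag whose name contains 'title'
print(soup.find_all(re.compile('t')))  # every tag whose name contains 't'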

4.3 List filter

print(soup.find_all(['p','a']))
print(soup.find_all(['title','b']))

4.4 Function filter

def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))

5. find_all() and find()

5.1 find_all()

find_all() returns all matching tags as a list.
find() returns only the first match (or None if nothing matches).

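def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):

name: the tag name
attrs: the tag's attributes
recursive: whether to search recursively (all descendants rather than only direct children)
text: the text content
limit: the maximum number of results to return
kwargs: keyword arguments (attribute filters such as id= or class_=)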
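The parameters listed above can be exercised on a small document. This is a sketch under a few assumptions: the sample markup is invented, html.parser is used to avoid extra dependencies, and string= is used for the text filter because recent bs4 releases accept it as the newer spelling of text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
sample = """
<div id="box">
  <a class="sister" id="link1" href="/elsie">Elsie</a>
  <p><a class="sister" id="link2" href="/lacie">Lacie</a></p>
  <a class="brother" id="link3" href="/tom">Tom</a>
</div>
"""
soup = BeautifulSoup(sample, 'html.parser')
box = soup.find('div', id='box')

by_name = [a['id'] for a in box.find_all('a')]                              # name
by_attrs = [a['id'] for a in box.find_all('a', attrs={'class': 'sister'})]  # attrs
limited = [a['id'] for a in box.find_all('a', class_='sister', limit=1)]    # kwargs + limit
direct_only = [a['id'] for a in box.find_all('a', recursive=False)]         # direct children only
by_text = box.find_all(string='Lacie')                                      # text content

first_link = box.find('a')  # find() returns the first match
missing = box.find('span')  # or None when nothing matches

print(by_name, by_attrs, limited, direct_only, by_text)
```

With recursive=False the link nested inside the p tag is skipped, because only direct children of box are examined.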

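5.2 Other find methods

find_parents(): searches all ancestors
find_parent(): searches for a single ancestor
find_next_siblings(): searches all following siblings
find_next_sibling(): searches for a single following sibling
find_previous_siblings(): searches all preceding siblings
find_previous_sibling(): searches for a single preceding sibling
find_all_next(): searches all elements that come later in the document
find_next(): searches for a single element that comes later in the document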
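The methods above follow the same calling convention as find() and find_all(), only the search direction changes. A minimal sketch on an invented document (the markup and ids are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
sample = ('<div id="outer"><section>'
          '<p id="p1">one</p><p id="p2">two</p><p id="p3">three</p>'
          '</section></div>')
soup = BeautifulSoup(sample, 'html.parser')
p1 = soup.find('p', id='p1')
p2 = soup.find('p', id='p2')
p3 = soup.find('p', id='p3')

nearest_div = p2.find_parent('div')['id']                              # single ancestor
ancestor_names = [t.name for t in p2.find_parents(['section', 'div'])]  # nearest first
next_one = p2.find_next_sibling('p')['id']                             # single following sibling
prev_one = p2.find_previous_sibling('p')['id']                         # single preceding sibling
all_after = [t['id'] for t in p1.find_next_siblings('p')]              # all following siblings
all_before = [t['id'] for t in p3.find_previous_siblings('p')]         # all preceding siblings
next_elem = p1.find_next('p')['id']                                    # next match in document order
later_elems = [t['id'] for t in p1.find_all_next('p')]                 # all later matches

print(nearest_div, ancestor_names, next_one, prev_one)
print(all_after, all_before, next_elem, later_elems)
```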

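6. Modifying the document tree

Modify a tag's name and attributes.
Assign to the string attribute: this replaces the tag's existing content with the new content.
append(): adds content to a tag, much like the .append() method of a Python list.
decompose(): removes a tag and its contents from the tree; use it to delete paragraphs that are not needed.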
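The four operations above might look like this in practice. The markup, the new tag name and the attribute values are all invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
sample = '<div><b class="old">hello</b><p class="ad">remove me</p></div>'
soup = BeautifulSoup(sample, 'html.parser')

tag = soup.b
tag.name = 'strong'       # rename the tag
tag['class'] = 'new'      # change an existing attribute
tag['id'] = 'greeting'    # add a new attribute
tag.string = 'hi'         # replace the tag's content
tag.append(' there')      # add more content at the end

soup.find('p', class_='ad').decompose()  # remove the unwanted paragraph

print(soup)
```

After these calls the document no longer contains a b tag or the p tag, and the renamed tag holds the new text.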

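7. Worked example

Use bs4 to scrape cities across China and their corresponding temperatures (the URL list below covers three regional pages).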

import requests
from bs4 import BeautifulSoup


# Parse one regional page and print each city with its temperature
def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # print(response.content.decode('utf-8'))
    text = response.content.decode('utf-8')
    # Parse the page. The html5lib parser must be installed first: pip install html5lib
    soup = BeautifulSoup(text, 'html5lib')
    # First get the div tag whose class is conMidtab
    conMidtab = soup.find('div', class_='conMidtab')
    # Find all the table tags inside it
    tables = conMidtab.find_all('table')
    for table in tables:
        # print('----------------------------')
        # Find all the tr tags and skip the first two
        trs = table.find_all('tr')[2:]
        # enumerate(trs) yields two values: the index and the item at that index
        for index, tr in enumerate(trs):
            # Find the td tags that hold the city and its temperature
            tds = tr.find_all('td')
            city_td = tds[0]
            # Handle the province / municipality column: in the first row
            # (index 0) the city is in the second cell
            if index == 0:
                city_td = tds[1]  # works for municipalities and for provincial capitals
            temp_td = tds[-2]
            city = list(city_td.stripped_strings)[0]  # city
            temp = list(temp_td.stripped_strings)[0]  # temperature
            print('City:', city, 'Temperature:', temp)

            # print(tr)
        # break  # stop after the first table (Beijing)


def main():
    # url = 'http://www.weather.com.cn/textFC/hb.shtml'  # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'  # Northeast China
    # url = 'http://www.weather.com.cn/textFC/gat.shtml'  # Hong Kong, Macau and Taiwan
    urls = [
        'http://www.weather.com.cn/textFC/hb.shtml',
        'http://www.weather.com.cn/textFC/db.shtml',
        'http://www.weather.com.cn/textFC/gat.shtml',
    ]
    for url in urls:
        parse_page(url)


if __name__ == '__main__':
    main()
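The index == 0 branch in parse_page can be tried offline. The sketch below is not the site's real markup: it assumes a table in which the first data row carries an extra leading province cell, which is the layout that branch appears to handle. The table content, the helper name parse_rows and the use of html.parser are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two leading rows to skip, then data rows where only
# the first one starts with a province cell
sample = """
<table>
  <tr><td>header 1</td></tr>
  <tr><td>header 2</td></tr>
  <tr><td rowspan="2">ProvinceA</td><td>CityA</td><td>Sunny</td><td>10</td><td>detail</td></tr>
  <tr><td>CityB</td><td>Cloudy</td><td>12</td><td>detail</td></tr>
</table>
"""


def parse_rows(markup):
    """Return (city, temperature) pairs using the same indexing as parse_page."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = soup.find('table').find_all('tr')[2:]
    results = []
    for index, tr in enumerate(rows):
        tds = tr.find_all('td')
        # The first data row has the province in tds[0], so the city is tds[1]
        city_td = tds[1] if index == 0 else tds[0]
        temp_td = tds[-2]
        city = next(city_td.stripped_strings)
        temp = next(temp_td.stripped_strings)
        results.append((city, temp))
    return results


rows = parse_rows(sample)
print(rows)
```

Without the index check, the first row would report the province name in place of the city.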