Python Web Scraping, Lecture 9: bs4 (Part 1)


yerennuo

Article link: https://blog.csdn.net/yerennuo/article/details/118363340


bs4

Overview of bs4
bs4 quick start
bs4 object types
Traversing the document tree
The find() and find_all() methods
Modifying the document tree

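Overview of bs4

What is bs4?
BeautifulSoup4 is a library for extracting data from HTML or XML files.
What is it for?
Parsing web pages and extracting data from them.
Why learn it?
As your skills grow you will run into more and more kinds of websites, and each one calls for the technique that suits it best:
Regular expressions: hard to write and easy to get wrong
xpath: requires memorizing some syntax
bs4: only requires remembering a handful of methods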
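
To make the comparison above concrete, here is a minimal sketch that pulls the same title out of a page once with a regular expression and once with bs4. The sample markup and variable names are made up for illustration, and the built-in "html.parser" is used instead of lxml only so that the snippet has no extra dependency.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real site.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Regular expression: the pattern has to describe the markup exactly.
match = re.search(r"<title>(.*?)</title>", html)
title_via_regex = match.group(1) if match else None

# bs4: parse once, then navigate the tree by tag name.
soup = BeautifulSoup(html, "html.parser")
title_via_bs4 = soup.title.string

print(title_via_regex)
print(title_via_bs4)
```

Both approaches return the same text here; the regex version breaks as soon as the tag gains an attribute or a line break, while the bs4 lookup stays the same.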

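How to learn it?
1 Chinese-language documentation is available
2 The one thing to master in the bs4 module is its core class, BeautifulSoup
3 The methods packaged in that core class are the learning goals of this lesson
Aside: legend for the letter markers
C Class
m Method
f Field
p Property (the @property decorator)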

Official site
 4.9.0 documentation

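bs4 quick start

How to get started?
1 Install
pip install lxml
pip install bs4

2 Import
from bs4 import BeautifulSoup

3 Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: the markup to parse; 'lxml': the parser to use

4 Call methods as needed
For example soup.find() / soup.find_all()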

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features='lxml')
# print(soup.prettify())  # print the HTML document, pretty-printed
# print(soup.title)  # <title>The Dormouse's story</title>
# print(soup.title.string)  # The Dormouse's story
# print(soup.find_all("p"))  # find every p tag; returns a list-like ResultSet

# Print the URL stored in each link's href attribute
for i in soup.find_all("a"):
    print(i.get("href"))

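bs4 object types

Tag : a tag (element) in the document
NavigableString : the text inside a tag, a string you can navigate from
BeautifulSoup : the soup object, representing the whole parsed document
Comment : a comment, a special kind of NavigableString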
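
The four object types can be seen side by side in a short sketch like the one below. The markup is a made-up fragment, and "html.parser" is used here only to keep the snippet dependency-free; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative fragment containing a tag, some text and a comment.
markup = "<p class='title'><b>The Dormouse's story</b><!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                  # a Tag
text = soup.b.string          # the NavigableString inside <b>
comment = soup.p.contents[1]  # the Comment that follows <b>

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that wants only visible text usually has to check for Comment explicitly.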

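Traversing the document tree

# print(soup.strings)  # a generator object, e.g. <generator object Tag._all_strings at 0x0000020EA20CD2E0>
# for i in soup.strings:
#     print(i)  # every string in the document, including whitespace-only ones

for i in soup.stripped_strings:  # the same strings with surrounding whitespace stripped and blank ones skipped
    print(i)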
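
The difference between the two generators is easier to see on a tiny document. This is a minimal sketch with made-up markup, parsed with the built-in "html.parser" as an assumption to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Illustrative markup with indentation and padded text.
markup = "<div>\n  <p> first </p>\n  <p>second</p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")

all_strings = list(soup.strings)             # keeps newlines and padding
clean_strings = list(soup.stripped_strings)  # trims text, drops blank strings

print(all_strings)
print(clean_strings)
```

Converting the generators to lists is only for printing; in a real loop you can iterate over them directly.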

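The find() and find_all() methods

String filter
soup.find("p")  # the first matching tag, or None if there is none
soup.find_all("a")  # every matching tag
List filter
print(soup.find_all(["p", "a"]))  # find both p and a tags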
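
The two filters above can be tried on a small document as follows. This is a sketch under assumptions: the markup is a shortened, made-up version of the sisters example, and "html.parser" stands in for lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one paragraph holding two links.
markup = """
<p class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>
  <a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.find("p")             # string filter, first match only
all_a = soup.find_all("a")           # string filter, every match
p_and_a = soup.find_all(["p", "a"])  # list filter, matches either name
missing = soup.find("table")         # no match: find() returns None

names = [tag.name for tag in p_and_a]
print(names)
print(missing)
```

Results come back in document order, so the p tag appears before the a tags it contains.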

from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html,'lxml')

# 1 Get all tr tags
# trs = soup.find_all('tr')
# for tr in trs:
#     print(tr)
#     print('-'*80)

# 2 Get the second tr tag
# tr = soup.find_all('tr')[1]
# print(tr)

# 3 Get all tr tags whose class attribute equals even
# class is a Python keyword, so the keyword argument is spelled class_ (trailing underscore)
# trs = soup.find_all('tr',class_='even')
# # trs = soup.find_all('tr',class1='even')  # class1 is not class: it would filter on an attribute named class1
# for tr in trs:
#     print(tr)
#     print('-' * 80)

# The same filter written with an attrs dictionary
# trs = soup.find_all('tr',attrs={'class':'even'})
# for tr in trs:
#     print(tr)
#     print('-' * 80)


# 4 Extract every a tag whose id equals test and whose class equals test
# lst = soup.find_all('a',id='test',class_='test')
# for a in lst:
#     print(a)


# 5 Get the href attribute of every a tag
# a_lst = soup.find_all('a')
# for a in a_lst:
#     href = a.get('href')
#     print(href)


# The same with dictionary-style access (raises KeyError if the attribute is missing, while get() returns None)
# a_lst = soup.find_all('a')
# for a in a_lst:
#     href = a['href']
#     print(href)


# 6 Get the job titles
trs = soup.find_all('tr')[1:]  # skip the header row
for tr in trs:
    tds = tr.find_all('td')
    job_name = tds[0].string  # the first cell holds the job title
    print(job_name)
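
Building on step 6, one possible way to collect every column of each row, not just the job title, is sketched below. It uses two rows trimmed from the table above with shortened href values; the English field names in FIELDS are hypothetical labels chosen for this example, and "html.parser" is an assumption made to avoid the lxml dependency. The last lines also check that class_ and attrs select the same rows.

```python
from bs4 import BeautifulSoup

# Two rows trimmed from the table above; href values shortened for illustration.
html = """
<table class="tablelist">
  <tr class="h"><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
  <tr class="even">
    <td class="l square"><a href="position_detail.php?id=33824">22989-金融云区块链高级研发工程师（深圳）</a></td>
    <td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td>
  </tr>
  <tr class="odd">
    <td class="l square"><a href="position_detail.php?id=29938">22989-金融云高级后台开发</a></td>
    <td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Hypothetical English field names, one per column.
FIELDS = ["name", "category", "headcount", "location", "published"]

jobs = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    job = dict(zip(FIELDS, cells))
    job["href"] = tr.find("a").get("href")
    jobs.append(job)

# class_ and attrs express the same filter.
even_by_keyword = soup.find_all("tr", class_="even")
even_by_attrs = soup.find_all("tr", attrs={"class": "even"})

for job in jobs:
    print(job)
```

get_text(strip=True) is used instead of .string because .string returns None as soon as a cell contains more than one child node.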

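Modifying the document tree

'''
• Change a tag's name and attributes
• Change string: assigning to the attribute replaces the tag's original contents with the new text
• append() adds content to a tag, just like the .append() method of a Python list
• decompose() removes a tag and everything inside it, handy for dropping paragraphs you do not need
'''

# The snippets in this section operate on html_doc from the quick start (the job table has no p tags),
# so rebuild the soup from it first
soup = BeautifulSoup(html_doc, 'lxml')

# Change a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
#
# tag_p.name = 'w'  # change the tag name
# tag_p['class'] = 'content'  # change the tag's attribute
# print(tag_p)


# Change string: assigning to the attribute replaces the tag's original contents with the new text
# tag_p = soup.p
# print(tag_p.string)
#
# tag_p.string = 'you need python'
# print(tag_p.string)

# append() adds content to a tag, just like the .append() method of a Python list
# tag_p = soup.p
# print(tag_p)
# tag_p.append('abc')
# print(tag_p)

# decompose() removes a tag and everything inside it, handy for dropping paragraphs you do not need
r = soup.find(class_='title')
r.decompose()
print(soup)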
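
The four operations listed in this section can be run end to end on a small document, as in the sketch below. The markup is a made-up two-paragraph fragment, the variable name story is illustrative, and "html.parser" is used as an assumption so that no html or body wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: a title paragraph and a story paragraph.
markup = (
    '<div><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time</p></div>'
)
soup = BeautifulSoup(markup, "html.parser")

story = soup.find("p", class_="story")
story.name = "section"            # rename the tag
story["class"] = "content"        # overwrite an attribute
story.string = "you need python"  # replace the tag's contents
story.append(" now")              # add text after the existing contents

soup.find(class_="title").decompose()  # remove the title paragraph entirely

print(soup)
```

All changes happen in place on the parsed tree, so printing soup afterwards shows the modified document.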