Python Web Scraping: bs4 from Beginner to Mastery (Part 1)
Introduction to BeautifulSoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work.
1. It is used to parse data.
2. Different parsing approaches handle websites with different page structures (a short sketch comparing the three follows this list):
Regex: match the data with regular expressions; relatively complex.
XPath: path-based syntax; the node relationships can be hard to work out.
bs4: the find() and find_all() methods.
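As a quick comparison, here is a minimal sketch, using a tiny HTML snippet of my own, that extracts the text of the first <a> tag with each of the three approaches (re is in the standard library; lxml and bs4 are installed in the quick-start section below):

import re
from lxml import etree
from bs4 import BeautifulSoup

html = '<p>Hello <a href="http://example.com">Example</a> world</p>'

# 1. Regex: works, but fragile against attribute order, nesting and whitespace
match = re.search(r'<a[^>]*>(.*?)</a>', html)
print(match.group(1))                # Example

# 2. XPath (lxml): concise, but you need to know the node paths
tree = etree.HTML(html)
print(tree.xpath('//a/text()')[0])   # Example

# 3. bs4: find()/find_all() read closest to plain English
soup = BeautifulSoup(html, 'lxml')
print(soup.find('a').string)         # Example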
Basic Concepts
Beautiful Soup is a web page information extraction library that can pull data out of HTML or XML files.
Source Code Analysis
When browsing the library's source in an IDE (e.g. PyCharm's structure view), the abbreviations mean:
c class: a class
m method: a method
f field: a field
p: a method decorated with @property, which can be used like an attribute (see the tiny example after this list)
v variable: a variable
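A tiny illustrative example (my own class, not part of bs4) of what the "p" entry means: a method decorated with @property is accessed without parentheses, as if it were an attribute.

class Page:
    def __init__(self, html):
        self._html = html

    @property
    def length(self):            # defined as a method...
        return len(self._html)

page = Page('<html></html>')
print(page.length)               # ...but used like an attribute: prints 13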
bs4 Quick Start
1. Installation
First install lxml: pip install lxml
Then install bs4: pip install bs4 (the library is published on PyPI as beautifulsoup4, so pip install beautifulsoup4 installs the same thing)
2. Import the module
from bs4 import BeautifulSoup
3. Create a soup object
soup = BeautifulSoup(html_doc, 'lxml')
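The second argument chooses the parser. The examples here use lxml; as a minimal fallback sketch, Python's built-in 'html.parser' also works without any extra install:

from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"

# 'lxml' is fast and lenient but requires the lxml package;
# 'html.parser' ships with the standard library
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)   # The Dormouse's story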
bs4 Object Types
● Tag: a tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object (the whole document)
● Comment: a comment
Code demo with detailed comments
# 1. Import the module
from bs4 import BeautifulSoup
"""
These are just for awareness:
● Tag: a tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object
● Comment: a comment
"""
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
# 2. Create the soup object
soup = BeautifulSoup(html_doc, features='lxml')
# print(type(soup.title)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.a)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.p)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.body)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.title.string)) # <class 'bs4.element.NavigableString'>  a navigable string
# print('-' * 30)
# print(type(soup)) # <class 'bs4.BeautifulSoup'>  the soup object
# print('-' * 30)
# print(type(soup.span.string)) # <class 'bs4.element.Comment'>  a comment
Traversing the Document Tree
contents, children, descendants
● contents returns a list of all direct child nodes
● children returns an iterator over the direct child nodes
● descendants returns a generator that walks all descendants recursively
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
# print(head.contents) # [<title>The Dormouse's story</title>]  returns a list of all direct child nodes
# print('-' * 30)
# for i in head.contents:
#     print(i)
# print(head.children) # <list_iterator object at 0x000001B13C39B708>  returns an iterator over the direct child nodes
# print('-' * 30)
# for i in head.children:  # (anything that is an iterator can be looped over)
#     print(i)
# print(head.descendants) # <generator object Tag.descendants at 0x000001B13C456648>  returns a generator that walks all descendants
# print('-' * 30)
# for i in a.descendants:
#     print(i)
# Note: newline characters are also treated as child nodes and show up in the results
html = soup.html
# print(html.contents)
# print(html.descendants)
"""
[<head><title>The Dormouse's story</title></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body>]
"""
# for h in html.descendants:
#     print(h)
string, strings, stripped_strings
● string gets the text inside a single tag
● strings returns a generator used to get the text of multiple tags
● stripped_strings is basically the same as strings, but it also strips extra whitespace
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
"""
需要重点掌握
string strings stripped_strings
● string获取标签里面的内容
● strings 返回是一个生成器对象用过来获取多个标签内容
● stripped_strings 和strings基本一致 但是它可以把多余的空格去掉
"""
# string: gets the text inside a single tag
# print(soup.title.string)
# strings: returns a generator used to get the text of multiple tags
html = soup.html
# print(html.strings) # <generator object Tag._all_strings at 0x000001E54C0D85C8>
# for i in html.strings:
#     print(i)
# stripped_strings: basically the same as strings, but it also strips extra whitespace
# print(html.stripped_strings) # <generator object PageElement.stripped_strings at 0x000001A59E8A85C8>
# for i in html.stripped_strings:
#     print(i)
'''
<generator object PageElement.stripped_strings at 0x000001D66FDA6648>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
'''
parent and parents
● parent gets the direct parent node
● parents gets all ancestor nodes (as a generator)
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
"""
遍历文档树 遍历父节点
parent 和 parents
● parent直接获得父节点
● parents获取所有的父节点
"""
title = soup.title
# parent: find the direct parent node
# print(title.parent)
# parents: returns a generator
# print(title.parents) # <generator object PageElement.parents at 0x0000018B3E8C8548>
# for p in title.parents:
#     print(p)
'''
1. First the direct parent of title is found: <head><title>The Dormouse's story</title></head>
2. Then the parent of that parent (the parent of head):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body></html>
3. Finally the parent of that node, i.e. the parent of html, which is the whole document, so the same markup is printed once more:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body></html>
'''
# The parent of html is the entire document (the BeautifulSoup object itself)
find() and find_all()  [Key Topic]
● find_all() returns every matching tag in a list
● find() returns only the first match
● find_all() parameters:
● name: the tag name
● attrs: the tag's attributes
● recursive: whether to search recursively
● text: match by text content (called string in newer bs4 versions)
● limit: limit the number of results returned
● kwargs: keyword arguments such as id=... or class_=... (a sketch exercising these parameters follows the demo below)
Code demo with detailed comments
from bs4 import BeautifulSoup
# html_doc is the page source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# String filter
# Get all the a tags; find_all() returns every match in a list
# a_list = soup.find_all('a')
# print(a_list)
# print('-' * 70)
# for a in a_list:
#     print(a)
#     print('-' * 70)
# print(soup.find('a'))  # find() returns only the first match
# Find the title node and the p nodes
# result = soup.find_all('title', 'p')  # returns []: the second positional argument is attrs, not another tag name
result = soup.find_all(['title', 'p'])  # pass a list to match several tag names at once
print(result)
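The demo above only exercises the name filter. As a minimal sketch run against the same soup object, the remaining parameters listed earlier (attrs, keyword arguments, limit, string/text, recursive) look like this:

# attrs: filter by attributes passed as a dict
print(soup.find_all('a', attrs={'class': 'sister'}))
# keyword arguments: the same filter written as kwargs (class needs a trailing underscore)
print(soup.find_all('a', class_='sister', id='link1'))
# limit: stop after the first N matches
print(soup.find_all('a', limit=2))
# string (older versions call it text): match by the text content
print(soup.find_all(string='Elsie'))
# recursive=False: only search direct children instead of all descendants
print(soup.html.find_all('head', recursive=False))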
Practice Example: Reviewing bs4
from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'lxml')
# 1. Get all the tr nodes
# tr_list = soup.find_all('tr')
# for tr in tr_list:
#     print(tr)
#     print('-'*74)
# 2. Get the second tr node
# print(tr_list[1])
# 3. Find all tr nodes with class="even"
# Option 1: use the class_ keyword argument
# class_list = soup.find_all('tr', class_='even')
# for c in class_list:
#     print(c)
#     print('-'*74)
# Option 2: pass the attributes as a dict
# class_list = soup.find_all('tr', attrs={'class': 'even'})
# for c in class_list:
#     print(c)
#     print('-' * 74)
# 4. Locate the a tag with id="test"
# a_list = soup.find_all('a', id="test")
# for a in a_list:
#     print(a)
#     print('-'*74)
# a_lists = soup.find_all('a', attrs={'id': "test", 'class': 'test'})
# for a in a_lists:
#     print(a)
#     print('-' * 74)
# 5. Get the href attribute value of every a tag
# a_list = soup.find_all('a')
# for a in a_list:
#     # The first form is recommended
#     print(a.get('href'))
#     # print(a.attrs['href'])
#     # print(a['href'])
# 6. Get all the position names
# The first tr is the table header, so skip it
tr_list = soup.find_all('tr')[1:]
for tr in tr_list:
    a = tr.find('a')
    print(a.string)  # the position name is the text of the a tag
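Building on the same table, a short sketch (my own extension of the example, assuming the same soup object as above) that collects every column of each data row into a dict:

jobs = []
for tr in soup.find_all('tr')[1:]:       # skip the header row
    tds = tr.find_all('td')
    jobs.append({
        'name': tds[0].string,           # position name (text of the <a> in the first cell)
        'category': tds[1].string,       # position category
        'headcount': tds[2].string,      # number of openings
        'location': tds[3].string,       # location
        'published': tds[4].string,      # publish date
    })

for job in jobs:
    print(job)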