bs4

Latest recommended article published 2024-01-23 17:55:23

License: CC 4.0 BY-SA

Introduction to bs4

I. What is bs4?
bs4 is short for beautifulsoup4.
Beautiful Soup is a library for extracting data from HTML or XML files.

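II. Basic concept
Beautiful Soup parses an HTML or XML file into a tree of Python objects, and data is extracted by navigating and searching that tree.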

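III. Why use it?
1. Scraping data from desktop websites, such as Baidu or Tencent sites.
As the variety of websites grows, you need to pick the technique that best fits each site.
2. Regular expressions: they are sometimes hard to write and easy to get wrong.
3. XPath: you have to remember its syntax.
4. bs4: you only need to remember a handful of methods.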
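
The difference between the regex and bs4 approaches can be sketched with a small comparison. This is only an illustration: the markup and the variable names are made up for this example, and the built-in "html.parser" is used so that no extra parser is required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this comparison.
sample = '<p><a class="x" href="/a">A</a> <a href=\'/b\' id="second">B</a></p>'

# Regex approach: this pattern only handles double-quoted href values,
# so it silently misses the second link, which uses single quotes.
regex_links = re.findall(r'<a[^>]*href="([^"]*)"', sample)

# bs4 approach: the parser deals with quoting and attribute order for us.
cmp_soup = BeautifulSoup(sample, "html.parser")
bs4_links = [a.get("href") for a in cmp_soup.find_all("a")]

print(regex_links)  # ['/a']
print(bs4_links)    # ['/a', '/b']
```

A more careful regular expression could handle both quoting styles, but every such case has to be anticipated by hand.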

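IV. Reading the bs4 source code
Why learn to read source code?
1. To understand how the library works and how it is meant to be used
2. To learn the authors' way of thinking
3. It is an essential skill for a developer
4. When browsing the source, small icons mark the kind of each symbol:

c Class
m Method
f Field
p Property (defined with the @property decorator)
v Variable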
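
To make the five symbol kinds concrete, here is a minimal sketch of where each one would appear in ordinary Python code. The class and all names in it are hypothetical and are not taken from the bs4 source.

```python
# Hypothetical names, chosen only to illustrate each kind of symbol.
DEFAULT_PARSER = "lxml"           # v: module-level variable


class Document:                   # c: class
    def __init__(self, markup):
        self.markup = markup      # f: field (instance attribute)

    def parse(self):              # m: method
        return self.markup.strip()

    @property
    def length(self):             # p: property, defined with @property
        return len(self.markup)


doc = Document("  <p>hi</p>  ")
print(doc.parse())   # <p>hi</p>
print(doc.length)    # 13
```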

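Getting started with bs4

1. Install
pip install lxml
pip install bs4
2. Import
from bs4 import BeautifulSoup
3. Create the soup object
soup = BeautifulSoup(html_doc, "lxml")  # html_doc is the markup string to parse
4. Call the methods of the object
For example find() and find_all()
5. Object types in bs4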

'''
Tag : a tag
NavigableString : a navigable string
BeautifulSoup : the soup object
Comment : a comment
'''
from bs4 import BeautifulSoup
from bs4.element import NavigableString
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# # Tag : a tag
# soup = BeautifulSoup(html_doc, "lxml")
# print(type(soup.title))  # output: <class 'bs4.element.Tag'>
# print(type(soup.a))  # output: <class 'bs4.element.Tag'>
# print(type(soup.p))  # output: <class 'bs4.element.Tag'>
#
# # NavigableString : a navigable string
# print(type(soup.title.string))  # output: <class 'bs4.element.NavigableString'>
#
# # BeautifulSoup : the soup object
# print(type(soup))  # output: <class 'bs4.BeautifulSoup'>

# Comment : a comment
html = '<a><!--I love python--></a>'
soup2 = BeautifulSoup(html, 'lxml')
print(type(soup2.string))  # output: <class 'bs4.element.Comment'>
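
One detail worth knowing when telling these types apart is that Comment is a subclass of NavigableString, so the order of isinstance checks matters. The following is a small sketch under that observation; the markup and the names `types_demo` and `kinds` are made up for the example, and "html.parser" is used to avoid depending on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup with one text node, one comment and one tag.
types_demo = BeautifulSoup("<p>text<!--note--><b>bold</b></p>", "html.parser")

kinds = []
for node in types_demo.p.contents:
    # Check Comment first: it subclasses NavigableString.
    if isinstance(node, Comment):
        kinds.append("Comment")
    elif isinstance(node, NavigableString):
        kinds.append("NavigableString")
    elif isinstance(node, Tag):
        kinds.append("Tag")

print(kinds)  # ['NavigableString', 'Comment', 'Tag']
```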

Quick start example

# pip install lxml  installs lxml, which provides the "lxml" parser used below
# pip install bs4  installs bs4
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# soup = BeautifulSoup(html_doc, features="lxml")
soup = BeautifulSoup(html_doc, "lxml")
# Print the document in a tidier, consistently indented form
print(soup.prettify())

# Task: get the text inside the title tag, The Dormouse's story
# Select the tag first, then extract the data from it
# print(soup.title)  # output: <title>The Dormouse's story</title>
# print(soup.title.string)  # output: The Dormouse's story
# Get the name of the title tag
# print(soup.title.name)  # output: title

# Task: get all p paragraphs
# print(soup.p)  # output: <p class="title"><b>The Dormouse's story</b></p> (only the first p)
# r = soup.findAll('p')  # returns a list-like result; findAll is the older name of find_all
# r = soup.find_all('p')  # same effect as the line above
# print(len(r), r)  # len is 3

# Task: get the href link of every a tag
links = soup.find_all('a')
for link in links:
    print(link.get('href'))  # read the href attribute of each a tag
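
As a variation on the loop above, the sketch below collects link text and href into a dictionary and shows how find() and .get() behave when something is missing. The trimmed markup (with the href deliberately removed from the third link) and the names `quick`, `first` and `link_map` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third link has no href on purpose.
quick_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
quick = BeautifulSoup(quick_html, "html.parser")

# find() returns the first match (or None); find_all() returns every match.
first = quick.find("a")
nothing = quick.find("table")

# .get() returns None for a missing attribute, while tag["href"] would raise KeyError.
link_map = {a.get_text(): a.get("href") for a in quick.find_all("a")}

print(first["id"])   # link1
print(nothing)       # None
print(link_map)
```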

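Traversing the document tree

I. Traversing child nodes
1. contents returns a list of all direct children
2. children returns an iterator over the direct children
3. descendants returns a generator that walks all children, grandchildren and so on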

from bs4 import BeautifulSoup
from bs4.element import NavigableString
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "lxml")
'''
contents returns a list of all direct children
children returns an iterator over the direct children
descendants returns a generator that walks all children, grandchildren and so on
'''
# head_tag = soup.head
# contents returns a list of all direct children
# print(head_tag.contents)  # output: [<title>The Dormouse's story</title>]
# children returns an iterator over the direct children
# print(head_tag.children)  # output: <list_iterator object at 0x00000209775B41F0>
# for i in head_tag.children:  # output: <title>The Dormouse's story</title>, because head has only one child
#     print(i)
# descendants returns a generator that walks all children, grandchildren and so on
# html_tag = soup.html
# for i in html_tag.descendants:  # prints every node below html
#     print(i)
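
The difference between the three attributes is easiest to see on a very small tree. This is a minimal sketch with made-up markup; the helper `label` and the name `tree_demo` are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a div with two p children, one of which contains a b tag.
tree_demo = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = tree_demo.div


def label(node):
    # Tags are shown by name, text nodes by their text.
    return f"<{node.name}>" if isinstance(node, Tag) else str(node)


direct = [label(n) for n in div.children]
everything = [label(n) for n in div.descendants]

print(len(div.contents))  # 2
print(direct)             # ['<p>', '<p>']
print(everything)         # ['<p>', 'one ', '<b>', 'two', '<p>', 'three']
```

Note that descendants includes text nodes as well as tags, in document order.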

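string returns the text inside a tag
strings returns a generator that yields the text of multiple tags
stripped_strings works like strings, but strips the extra whitespace

'''
string returns the text inside a tag
strings returns a generator that yields the text of multiple tags
stripped_strings works like strings, but strips the extra whitespace
'''
# string, first way of writing it
# title_tag = soup.title   # get the title tag
# print(title_tag.string)  # output: The Dormouse's story
# Second way of writing it
# print(soup.title.string)  # output: The Dormouse's story
# print(soup.head.string)  # output: The Dormouse's story (head has a single child, title)
# strings returns a generator that yields the text of multiple tags
# print(soup.html.strings)
# s = soup.html.strings  # store soup.html.strings in s
# for i in s:  # iterate over it
#     print(i)  # the output contains whitespace-only strings
#
# stripped_strings works like strings, but strips the extra whitespace
# print(soup.html.stripped_strings)
# s = soup.html.stripped_strings  # store soup.html.stripped_strings in s
# for i in s:  # iterate over it
#     print(i)  # the whitespace has been removed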
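
A common surprise with string is that it returns None when a tag has more than one child. The sketch below illustrates this on made-up markup and also shows get_text() as one alternative; the name `text_demo` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has three children (text, <b>, text).
text_demo = BeautifulSoup("<p>Hello <b>world</b> !</p>", "html.parser")
p = text_demo.p

print(p.string)                   # None, because <p> has several children
print(p.b.string)                 # world
print(list(p.strings))            # ['Hello ', 'world', ' !']
print(list(p.stripped_strings))   # ['Hello', 'world', '!']
print(p.get_text())               # Hello world !
```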

II. Traversing parent nodes
1. parent returns the direct parent node
2. parents yields all ancestors, from the nearest outward

'''
parent returns the direct parent node
parents yields all ancestors, from the nearest outward
'''
# title_tag = soup.title
# print(title_tag)  # <title>The Dormouse's story</title>
# # parent returns the direct parent node
# print(title_tag.parent)  # the parent of title: <head><title>The Dormouse's story</title></head>
# print(soup.html.parent)  # the parent of html: the whole document

# parents yields all ancestors
a_tag = soup.a
# print(a_tag.parents)  # output: a generator object, <generator object PageElement.parents at 0x000002102E6BB200>

for p in a_tag.parents:
    print(p)
    print('-'*50)
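
Printing every ancestor in full produces a lot of output, so a shorter way to see the chain is to print only the tag names. This is a sketch on made-up markup; the names `parent_demo`, `link` and `ancestor_names` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link nested three levels deep.
parent_demo = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)
link = parent_demo.a

ancestor_names = [node.name for node in link.parents]

print(link.parent.name)     # p
print(ancestor_names)       # ['p', 'body', 'html', '[document]']
print(parent_demo.parent)   # None, the soup object itself has no parent
```

The last ancestor, named '[document]', is the BeautifulSoup object.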

III. Traversing sibling nodes
1. next_sibling is the next sibling node
2. previous_sibling is the previous sibling node
3. next_siblings yields all following sibling nodes
4. previous_siblings yields all preceding sibling nodes

'''
next_sibling is the next sibling node
previous_sibling is the previous sibling node
next_siblings yields all following sibling nodes
previous_siblings yields all preceding sibling nodes
'''
# html2 = '<a><b>bbb</b><c>ccc</c></a>'
# soup2 = BeautifulSoup(html2, 'lxml')
# print(soup2.prettify())
# b_tag = soup2.b
# print(b_tag)
# # next_sibling is the next sibling node
# print(b_tag.next_sibling)  # output: <c>ccc</c>
# # previous_sibling is the previous sibling node
# c_tag = soup2.c
# print(c_tag.previous_sibling)  # output: <b>bbb</b>
# # next_siblings yields all following sibling nodes
# a_tag = soup.a
# print(a_tag.next_siblings)  # output: a generator object, <generator object PageElement.next_siblings at 0x000002A0DFF6C270>
# for p in a_tag.next_siblings:  # iterate to get each sibling
#     print(p)
#     print('-'*50)
# # previous_siblings yields all preceding sibling nodes
# a_tag = soup.a
# print(a_tag.previous_siblings)  # output: a generator object, <generator object PageElement.previous_siblings at 0x00000256929DC270>
# for p in a_tag.previous_siblings:  # iterate to get each sibling
#     print(p)
#     print('-'*50)
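
In real pages the next sibling of a tag is often a text node (punctuation or a line break) rather than another tag. The sketch below shows this on a shortened version of the sample paragraph and uses find_next_sibling() and find_next_siblings() to skip to the next matching tag; the names `sib_demo` and `first_a` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the links are separated by text, as in the sample document.
sib_html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>'
sib_demo = BeautifulSoup(sib_html, "html.parser")
first_a = sib_demo.a

# The immediate sibling is the text between the two links, not the second link.
print(type(first_a.next_sibling).__name__)   # NavigableString
print(str(first_a.next_sibling) == ",\n")    # True

# find_next_sibling() skips ahead to the next sibling that matches the filter.
print(first_a.find_next_sibling("a")["id"])  # link2
following_ids = [a["id"] for a in first_a.find_next_siblings("a")]
print(following_ids)                         # ['link2', 'link3']
```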

Searching the tree

1. String filter
2. List filter

print(soup.find('a'))  # here 'a' is a string filter
print(soup.find_all(['title','b']))  # here ['title','b'] is a list filter
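
The two filters above can be combined with keyword arguments of find_all(). The following sketch repeats the string and list filters and adds class_, id and limit, which are not covered in the text above; the trimmed markup and all variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
filter_html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
</body></html>
"""
filter_demo = BeautifulSoup(filter_html, "html.parser")

by_name = filter_demo.find_all("a")                    # string filter: tags named a
by_list = filter_demo.find_all(["title", "b"])         # list filter: tags named title or b
by_class = filter_demo.find_all("a", class_="sister")  # keyword filter on the class attribute
by_id = filter_demo.find(id="link2")                   # keyword filter on the id attribute
limited = filter_demo.find_all("a", limit=1)           # stop after the first match

print(len(by_name))                 # 2
print([t.name for t in by_list])    # ['title', 'b']
print(len(by_class))                # 2
print(by_id.get_text())             # Lacie
print(len(limited))                 # 1
```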

The select() method

Data can also be extracted with CSS selectors. Note that this requires knowing CSS selector syntax.
https://www.w3school.com.cn/cssref/css_selectors.asp

from bs4 import BeautifulSoup
# html_doc = """
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p>
# """
# # soup = BeautifulSoup(html_doc,features="lxml")
# soup = BeautifulSoup(html_doc,"lxml")

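# 1 Find the a tags
# print(soup.select('a'))  # look up by tag name

# 2 Find by class name, class="sister"
'''
.intro selects all elements with class="intro"
class="sister" --> .sister
'''
# print(soup.find_all(class_='sister'))  # class_ is a find_all() keyword; select() takes a selector string instead
# print(soup.select('.sister'))


# 3 Find by id
'''
#firstname selects the element with id="firstname"

id="link1" --> #link1
'''
# print(soup.select('#link1'))

# Child combinator: title elements that are direct children of head
# print(soup.select('head > title'))

# Get the text content
# print(soup.select('title')[0].string)
# print(soup.select('title')[0].get_text())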
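
A few more selector forms may be useful in practice. The sketch below uses select_one(), a class-plus-child selector and an attribute selector; it assumes a bs4 installation where CSS selector support is available, and the shortened markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
css_html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>
"""
css_demo = BeautifulSoup(css_html, "html.parser")

first_link = css_demo.select_one("#link1")       # first match, or None
missing = css_demo.select_one("#missing")
child_ids = [a["id"] for a in css_demo.select("p.story > a")]  # a tags directly inside p.story
tillie = css_demo.select('a[href$="tillie"]')    # href ends with "tillie"

print(first_link.get_text())  # Elsie
print(missing)                # None
print(child_ids)              # ['link1', 'link2', 'link3']
print(tillie[0]["id"])        # link3
```

select() always returns a list, so select_one() is convenient when only one element is expected.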

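Modifying the document tree

1. Change the name and attributes of a tag
2. Assign to string: the new content replaces the tag's original content
3. append() adds content to a tag, just like the .append() method of a Python list
4. decompose() removes a tag and its contents, which is handy for deleting paragraphs you do not need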

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# soup = BeautifulSoup(html_doc,features="lxml")
soup = BeautifulSoup(html_doc,"lxml")
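'''
• Change the name and attributes of a tag
• Assign to string: the new content replaces the tag's original content
• append() adds content to a tag, just like the .append() method of a Python list
• decompose() removes a tag and its contents, which is handy for deleting paragraphs you do not need
'''

# tag_p = soup.p
# print(tag_p)  # output: <p class="title"><b>The Dormouse's story</b></p> (original)
# # Change the name and attributes of a tag
# tag_p.name = 'w'  # rename the tag
# tag_p['class'] = 'content'  # replace the class attribute
# print(tag_p)  # output: <w class="content"><b>The Dormouse's story</b></w> (after the change)

# tag_p = soup.p
# print(tag_p.string)  # output: The Dormouse's story (original)
# # Assign to string: the new content replaces the tag's original content
# tag_p.string = 'you need python'
# print(tag_p.string)  # output: you need python (after the change)

# tag_p = soup.p
# print(tag_p)  # output: <p class="title"><b>The Dormouse's story</b></p> (original)
# # append() adds content to a tag, just like the .append() method of a Python list
# tag_p.append('123')
# print(tag_p)  # output: <p class="title"><b>The Dormouse's story</b>123</p> (after the change)

# decompose() removes a tag and its contents, which is handy for deleting paragraphs you do not need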
r = soup.find(class_='title')
r.decompose()
print(soup)
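
The four operations can also be applied one after another on a single small document. This is a minimal sketch with made-up markup; the names `edit_demo` and `title` and the class values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph to edit and one to delete.
edit_demo = BeautifulSoup(
    '<div><p class="title">old</p><p class="ad">buy now</p></div>', "html.parser"
)

title = edit_demo.find("p", class_="title")
title.name = "h1"              # rename the tag
title["class"] = "headline"    # replace an attribute value
title.string = "new"           # replace the tag's content
title.append("!")              # add a text node at the end

# Remove the unwanted paragraph together with everything inside it.
edit_demo.find("p", class_="ad").decompose()

result = str(edit_demo)
print(result)  # <div><h1 class="headline">new!</h1></div>
```

All changes are made in place on the tree, so printing the soup object afterwards shows the combined result.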