【Python模块学习】BeautifulSoup简介

最新推荐文章于 2023-07-06 21:52:05 发布

Buffedon

最新推荐文章于 2023-07-06 21:52:05 发布

阅读量982

点赞数 1

分类专栏： python基础文章标签： python beautifulsoup 爬虫 tag

本文链接：https://blog.csdn.net/weixin_46684578/article/details/121172631

版权

python基础专栏收录该内容

13 篇文章 0 订阅

订阅专栏

BeautifulSoup

是python 的一个库，最主要的功能就是 从网页爬取我们需要的数据。

BeautifulSoup 将 html 解析为对象处理，全部页面转变为字典或者数组，相对于正则表达式的方式，可以大大简化处理过程

BeautifulSoup 有四种对象：Tag、NavigableString、BeautifulSoup、Comment

HTML 转化为对象的过程

import requests
from bs4 import BeautifulSoup

response=request.get("https://www.baidu.com")
html=response.text
soup=BeautifulSoup(html,"html.parser")

四种对象(类型)

一、Tag

通俗来讲，就是 HTML 中的标签

<html>
<head>
   <title>Hello world</title>
</head>
<body>
  <h2>这是标题</h2>
  <p class="xie" name="p标签">你好，世界</p>
  <img src="1.jpg">
  <form action='action.php' method="get">
  用户名：<input type="text" name="username" />    <br/>
  密码：  <input type="password" name="passwd" />  <br/>
          <input type="submit" value="登录" />
  </form>
</body>
</html>

上面的title、p 等 HTML标签加上里面的内容就是 Tag

BeautifulSoup获取标签内容

import requests
from bs4 import BeautifulSoup

response=requests.get("https://www.baidu.com")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html,"html.parser")
print(soup.title)
print('-'*80)
print(soup.p) 
print(soup.p.name)              #p标签的名字
print(soup.p.string)            #p标签的内容,只针对于class类型
print(soup.p["id"])             #p标签的id属性
print('-'*80)
print(soup.a) 
print(soup.a.name) 
print(soup.a.string)
print(soup.a['class'])                    
print(soup.a.attrs)

对于Tag ，它有两个重要的属性，是name 和 attrs

每个tag 都有自己的名字，通过 .name 来获取

attrs 返回一个字典，其中包含了 Tag 的各种属性

输出：

请添加图片描述

二、NavigableString

获取标签和获取标签的内容

import requests
from bs4 import BeautifulSoup

response=requests.get("https://www.baidu.com")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(type(soup.a.string))
print(soup.a.string)
print('-'*80)
print(type(soup.name))
print(soup.name)
print(soup.attrs)

输出：

在这里插入图片描述

三、BeautifulSoup

BeautifulSoup 对象表示的是一个文档的全部内容。是一个特殊的 Tag，我们可以分别获取它的类型，名称，以及属性

import requests
from bs4 import BeautifulSoup

response=requests.get("https://www.baidu.com")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(type(soup))
#print(soup)

在这里插入图片描述

四、Comment

Comment 对象是一个特殊类型的 NavigableString，其输出的内容仍然不包括注释符号

print(type(soup.p.string))
print(soup.p.string)

五、四种对象的区别

如何区分四个对象

其中 NavigableString 和 Comment 的区别在于网页源代码的差别

NavigableString

<p class="Buffedon" name="p标签">你好，世界</p>

Comment

<p class="Buffedon" name="p标签"><!--你好，世界!--></p>

import requests
from bs4 import BeautifulSoup

response=requests.get("http://127.0.0.1/crawler/spider.html")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(type(soup))                   # 也表明了 BeautifulSoup 对象是整个网页
print(type(soup.p))                 # Tag 表示一个 标签
print(type(soup.p.string))          # NavigableString 表示一个 标签的内容，comment 是特殊的 NavigableString对象

在这里插入图片描述

多值属性

这些属性有 class id rel rev headers 等

使用方式 print(soup.p[‘class’]) 返回的是列表

import requests
from bs4 import BeautifulSoup

response=requests.get("http://127.0.0.1/crawler/spider.html")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(soup.p)
print(soup.p['class'])
print(soup.attrs)

在这里插入图片描述

遍历文档树

方法一：.contents 返回一个list列表，通过索引的方式获取
方法二：.children 返回一个list生产器，我们需要遍历所有子孙节点

遍历直接子节点

不使用递归的方式

import requests
from bs4 import BeautifulSoup

response = requests.get("http://127.0.0.1/crawler/spider.html")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(soup.body.contents)
print('-'*60)
print(soup.body.contents[1])
print(soup.body.contents[3])
print(soup.body.contents[5])
print(soup.body.contents[7])        #输出了整个form表单，每个tag 占用一个索引
print('-'*60)

print(soup.body.children)           # 遍历子节点
for i in soup.body.children:
    print(i)

在这里插入图片描述

遍历所有子孙节点

.contents 和 .children 仅仅包含tag 的直接直接子节点

.descendants 属性可以对所有 tag 的子孙节点进行递归循环，和 .children 相似

也就是先输出一个tag，然后再递归输出tag中的所有内容

import requests
from bs4 import BeautifulSoup

response = requests.get("http://127.0.0.1/crawler/spider.html")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

for j in soup.body.descendants:
    print(j)

在这里插入图片描述

搜索标签（重点）

类似于 re 模块中的 正则表达式
函数
    find(name,sttrs,recursive,text,**kwargs)         返回第一个符合条件的标签
    find_all(name,sttrs,recursive,text,**kwargs)     返回所有符合条件的标签

name 参数

可以传tag 名称、正则表达式、列表、True、方法

attrs 参数

该参数被当做tag的属性，匹配符合条件的Tag

example: print(soup.find_all(value='登录'))

text 参数

和name 参数一样，可以传入tag 名称、正则表达式、列表、True、方法

recursive 参数

缺省值检索子孙节点

print(soup.find_all('head',recursive=False)) 检索直接子节点

limit 参数

限制返回的数量

soup.find_all('p',limit=2)          #返回前两个符合条件的tag

代码实例

import requests
import re
from bs4 import BeautifulSoup

response = requests.get("http://127.0.0.1/crawler/spider.html")
response.encoding='utf-8'
html=response.text
soup=BeautifulSoup(html)

print(soup.find('form'))
print('-'*60)
print(soup.find_all(['p','h2']))
print(soup.find_all(['p','h2'])[1])
print('-'*60)
print(soup.find_all(value='登录'))

#匹配 标签名为 title 的 标签
for i in soup.find_all(re.compile('title')):                
    print(i)

在这里插入图片描述

CSS选择器

我们在写CSS的时候，标签名不加任何修饰，类名前加点，id名前加#，在这里我们可以用类似的方法进行筛选
使用的是soup.select(),返回的类型是list
功能作用和上面的 find_all()是一样的，只不过语法有些区别

 import requests
 import re
 from bs4 import BeautifulSoup
 
 response = requests.get("http://127.0.0.1/crawler/spider.html")
 response.encoding='utf-8'
 html=response.text
 soup=BeautifulSoup(html)
 
 print(soup.select('.Buffedon'))             #返回class值为 Buffedon 的Tag
 print(soup.select('#Buffedon'))             #返回id值为 Buffedon 的Tag
 print(soup.select('h2'))

在这里插入图片描述

Buffedon

关注

1
点赞
踩
3

收藏

觉得还不错? 一键收藏
打赏
0
评论
【Python模块学习】BeautifulSoup简介

BeautifulSoup是python 的一个库，最主要的功能就是从网页爬取我们需要的数据。BeautifulSoup 将 html 解析为对象处理，全部页面转变为字典或者数组，相对于正则表达式的方式，可以大大简化处理过程BeautifulSoup 有四种对象：Tag、NavigableString、BeautifulSoup、CommentHTML 转化为对象的过程import requestsfrom bs4 import BeautifulSoupresponse=request.
复制链接

扫一扫