Python Web Crawling and Information Extraction MOOC Notes: Getting Started with the Beautiful Soup Library


二5678七

Article link: https://blog.csdn.net/LTCSDN7/article/details/113177283


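The Beautiful Soup Library

Beautiful Soup is a third-party Python library that parses HTML and XML documents and extracts information from them. It turns any markup document it is given into a parse tree. (Think of whatever document you hand it as a pot of soup that the library cooks down.)

I. Installing the Beautiful Soup Library
Open cmd and, if pip is not on your PATH, change to the Scripts directory under the Python installation directory.
Run the install command: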

pip install beautifulsoup4

The installation completes successfully.
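
A quick way to confirm the install is to import the package and print its version. This is a minimal sketch; it assumes pip installed into the same interpreter that runs the script.

```python
# Sanity check after installation: the package is installed as "beautifulsoup4"
# but imported as "bs4".
import bs4

version = bs4.__version__  # version string of the installed package
print("Beautiful Soup version:", version)
```

If the import raises ModuleNotFoundError, pip most likely installed the package for a different Python interpreter.
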
II. Using the Beautiful Soup Library
Website: http://www.crummy.com/software/BeautifulSoup
There are three ways to get at the HTML source of the demo page https://python123.io/ws/demo.html:

Open the page in a browser, right-click, and choose "View page source".

Fetch the page with the Requests library:

import requests
r = requests.get("http://python123.io/ws/demo.html")
r.text
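
The three-line fetch above works when the network and the server cooperate. A slightly more defensive version might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the choice to return an empty string on failure are all assumptions made for illustration, not part of the original notes or of Requests itself.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or an empty string if the request fails.

    fetch_html is a hypothetical helper name, not part of Requests.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding guessed from the body
        return r.text
    except requests.RequestException:
        return ""

# Usage (performs a real network request):
# demo = fetch_html("http://python123.io/ws/demo.html")
```

Returning an empty string keeps the calling code simple; a real crawler would more likely log the exception or re-raise it.
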

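Parse it with the BeautifulSoup library:

from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package
demo = r.text  # the HTML source fetched above
soup = BeautifulSoup(demo, "html.parser")  # "cook the soup": first the HTML to parse, then the parser to use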
print(soup.prettify())

Basic usage of the Beautiful Soup library

Import:
from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package
import bs4
Parse:
soup = BeautifulSoup('<p>data</p>', 'html.parser')  # arguments: the HTML to parse, then the parser to use
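
The import-then-parse pattern can be tried without any network access by parsing a literal string. The HTML below is made up for illustration; only the `BeautifulSoup(markup, "html.parser")` call comes from the notes.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document so the example runs offline.
html = "<html><head><title>Demo</title></head><body><p class='note'>data</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install needed

title_text = soup.title.string  # text inside <title>
p_text = soup.p.string          # text inside the first <p>
print(title_text, p_text)
```
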

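III. Basic Elements of the Beautiful Soup Library
The Beautiful Soup library (installed as beautifulsoup4, imported as bs4) parses, traverses, and maintains a "tag tree". The BeautifulSoup class is the type that represents this tag tree: one BeautifulSoup object corresponds to the entire content of one HTML/XML document.

  BeautifulSoup('<p>data</p>', 'html.parser')  # HTML content + parser
  BeautifulSoup(open("D://demo.html"), 'html.parser')  # an open file object + parser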
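
When parsing from a file, wrapping `open` in a `with` block makes sure the file handle is closed. The sketch below writes a throwaway file first so it can run anywhere; the file name and its content are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small HTML file to a temporary directory (hypothetical content and path).
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><body><p>data</p></body></html>")

# Pass the open file object to BeautifulSoup; it is read during construction,
# and the with block closes the file afterwards.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

first_p = soup.p.string
print(first_p)
```
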

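Beautiful Soup parsers
The examples in these notes use Python's built-in "html.parser".
Basic elements of the BeautifulSoup class
The basic elements are Tag, Name, Attributes, NavigableString, and Comment.
(The title is the tab title shown at the top of the browser when the page is opened.)

Getting a Tag and inspecting its attributes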

# Store the page's HTML in the demo variable
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

# Import the BeautifulSoup class
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")  # demo holds r.text from above
# Inspect the <title> element
soup.title
# Get a tag
tag = soup.a  # the first <a> tag (when there are several, only the first is returned); any tag in the HTML can be accessed as soup.<tag>
tag  # display the tag

# Tag name (a str)
soup.a.name
soup.a.parent.name   # name of a's parent (the enclosing tag)
soup.a.parent.parent.name  # 'body', one level further up

# Tag attributes: a tag has zero or more attributes
tag = soup.a
tag.attrs  # a dict mapping attribute names to values
tag.attrs['class']  # value of the class attribute: ['py1']
tag.attrs['href']
type(tag.attrs)  # <class 'dict'>
type(tag)   # <class 'bs4.element.Tag'>

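# NavigableString: the non-attribute text inside a tag
soup.a
soup.a.string  # 'Basic Python'
soup.p
soup.p.string  # a NavigableString can span several tag levels (the text may sit inside a nested tag)
type(soup.p.string)  # <class 'bs4.element.NavigableString'>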

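# Comment
newsoup = BeautifulSoup("<b><!-- a comment--></b><p>This is not comment</p>","html.parser")
newsoup.b.string  # ' a comment' (the <!-- --> markers are stripped)
type(newsoup.b.string)  # <class 'bs4.element.Comment'>
newsoup.p.string  # 'This is not comment'
type(newsoup.p.string)  # <class 'bs4.element.NavigableString'>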
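
Because `.string` returns the text of a comment without its `<!-- -->` markers, a comment and ordinary text look alike when printed. The only reliable way to tell them apart is a type check. The following is a minimal offline sketch using a made-up snippet of HTML; the URL and attribute values are placeholders.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Tag

# Made-up markup: a comment inside <b>, ordinary text inside <a>.
html = ('<p class="title"><b><!-- a comment --></b>'
        '<a href="http://example.com/" id="link1">Basic Python</a></p>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
print(isinstance(tag, Tag), tag.name)  # a Tag object and its name
print(tag.attrs)                       # dict of attribute names to values
print(soup.p["class"])                 # class is multi-valued, so it is a list

for node in (soup.b.string, soup.a.string):
    kind = "comment" if isinstance(node, Comment) else "text"
    print(kind, repr(node))
```

Comment is a subclass of NavigableString, so the check must test for Comment first.
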

IV. Traversing HTML Content with bs4
HTML is basically a tree: the nested <>…</> pairs establish the parent-child relationships.

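There are three traversal directions: downward, upward, and sideways.

1. Downward traversal: from a parent node to its children

A BeautifulSoup object is the root node of the tag tree. Of the attributes used here, .contents returns a list of child nodes, while .children and .descendants are iterators.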

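soup = BeautifulSoup(demo,"html.parser")
soup.head   # the <head> node
soup.head.contents  # clearer: a list of all its child nodes
soup.body.contents
len(soup.body.contents)   # how many child nodes there are
soup.body.contents[1]   # the child at index 1 (indexing starts at 0)

for child in soup.body.children:  # iterate over the direct children
	print(child)
for child in soup.body.descendants:  # iterate over all descendants (usually more nodes)
	print(child)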
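
Child nodes include the whitespace strings between tags, not just tags, which is why the counts can be larger than expected. The sketch below uses a small made-up document to show the difference and filters for Tag objects only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup with newlines between the tags, as in a typical page.
html = "<body>\n<p>one <b>bold</b></p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

n_children = len(soup.body.contents)  # counts the newline strings too

# Keep only Tag objects, skipping the NavigableString nodes.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(n_children, child_tags, descendant_tags)
```
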

2. Upward traversal: from a child node to its ancestors

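soup.title.parent
soup.html.parent  # the soup object itself
soup.parent   # None: the root has no parent

for parent in soup.a.parents:  # walks every ancestor up to and including soup itself, so guard against a missing parent
	if parent is None:
		print(parent)
	else:
		print(parent.name)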
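
To see exactly which nodes `.parents` visits, it helps to collect the ancestor names into a list. This is a small offline sketch on made-up markup.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Names of every ancestor of <a>, from the nearest outward.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

root_parent = soup.parent  # the BeautifulSoup object itself has no parent
print(root_parent)
```

The last name in the list is '[document]', which is the name bs4 gives the BeautifulSoup object at the root of the tree.
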

3. Sideways traversal
Note: sibling traversal only covers nodes under the same parent.

soup.a.next_sibling
soup.a.next_sibling.next_sibling
soup.a.previous_sibling
soup.a.previous_sibling.previous_sibling
soup.a.parent
# Iterate over the following siblings
for sibling in soup.a.next_siblings:
	print(sibling)
# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
	print(sibling)
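
A sibling is not necessarily a tag: the text between two tags is a sibling node as well. The sketch below, on made-up markup, lists the siblings of the first `<a>` and then filters for tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p>Intro <a id='link1'>Basic</a> and <a id='link2'>Advanced</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a
following = list(first_a.next_siblings)      # text and tag nodes after the first <a>
preceding = list(first_a.previous_siblings)  # nodes before it, nearest first

# Keep only the sibling tags, dropping the text nodes.
following_tags = [s for s in following if isinstance(s, Tag)]

print(following, preceding, following_tags)
```
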

V. Formatting HTML Output with bs4

Three ways to output the page:

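demo   # with demo = r.text from Requests, echo the raw string (messy)
soup.prettify()  # returns a string with \n added after each tag and text node; echoed raw it still looks untidy. prettify() is a method that also works on a single tag
print(soup.prettify())  # printing renders the \n as real line breaks (clear)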
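
The difference between the second and third forms is only in how the same string is displayed. A short offline sketch on made-up markup makes this visible:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()        # one str containing \n characters and indentation
print(repr(pretty))             # raw view: the \n escapes are shown literally
print(pretty)                   # printed view: one tag or text node per line

tag_pretty = soup.p.prettify()  # prettify() also works on a single tag
print(tag_pretty)
```
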

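Encoding in bs4
bs4 converts any HTML input to Unicode and produces utf-8 on output. Python 3.x strings are Unicode and its default source encoding is utf-8, so parsing works without encoding trouble.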
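
The conversion can be observed by feeding bs4 bytes in a non-utf-8 encoding. In this sketch the encoding is passed explicitly through `from_encoding` so the result does not depend on automatic detection; the GBK sample text is made up for illustration.

```python
from bs4 import BeautifulSoup

raw = "<p>中文</p>".encode("gbk")  # bytes in a non-utf-8 encoding
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # inside the tree, text is a Unicode str
utf8_bytes = soup.encode("utf-8")  # serialize the tree back to utf-8 bytes

print(text)
print(utf8_bytes)
```
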
