[Data Parsing] bs4 and xpath

Latest recommended article published 2024-10-17 15:16:02

AWei02

Category: Crawler Basics. Tags: crawler

Article link: https://blog.csdn.net/AWei02/article/details/139479603

Contents

bs4
xpath

Crawler workflow

Specify the URL
Send the request
Get the response data
Parse the data (locate tags, extract content)
Persist the results
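
The five steps above can be sketched with the standard library alone. This is a minimal sketch, not a full crawler: the function names `fetch`, `parse`, and `save` are hypothetical, and a regular expression stands in for the parsing step only to keep the example dependency-free.

```python
import json
import re
import urllib.request


def fetch(url):
    """Steps 1-3: take a URL, send the request, return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse(page_text):
    """Step 4: locate the <title> tag and extract its content."""
    match = re.search(r"<title>(.*?)</title>", page_text, re.S | re.I)
    return {"title": match.group(1).strip() if match else None}


def save(record, path):
    """Step 5: persist the parsed record as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


# Offline demonstration on a literal page; for a live run, pass the
# result of fetch(<some URL>) to parse() instead.
record = parse("<html><head><title> Demo </title></head></html>")
print(record)
```

In practice the `parse` step is where bs4 or xpath, covered below, would replace the regular expression.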

Data parsing toolkit

Regular expressions
bs4 (Python only)
xpath (flexible, general-purpose)
pyquery

bs4

Installation

pip install beautifulsoup4  # the package that is imported as bs4
pip install lxml  # optional parser
pip install html5lib  # optional parser

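Parser overview

Workflow

Instantiate a BeautifulSoup object and load the page source into it

[Parse] BeautifulSoup(fp, 'lxml'): fp is an open local HTML file whose contents are parsed
[Parse] BeautifulSoup(page_text, 'lxml'): page_text is the page source (a string) returned by a network request
[Locate & extract] Call the attributes and methods of the BeautifulSoup object to locate tags and extract data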

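Hands-on

Parsing

from bs4 import BeautifulSoup
# Two ways to load a document
with open("index.html", encoding="utf-8") as fp:    # pass an open file (assumes the file is UTF-8)
    soup = BeautifulSoup(fp, 'lxml')
soup = BeautifulSoup("<html>data</html>", 'lxml')   # pass a string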

# Small demo
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly; if omitted, bs4 picks one itself and emits a warning
print(soup.find_all("a"))  # returns a list of all <a> tags

More queries

Locating

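# Method 1: soup.tagName
# Only locates the first occurrence of the tag
soup.title  # the first <title> tag; soup.title.text is the text inside it
soup.p  # the first <p> tag


# Method 2: soup.find(tagName, attrName='value')
# find returns only the first matching tag (None if nothing matches); to filter on the class attribute, use the class_ parameter
# Locate the div tag whose class is "song"
div_tag = soup.find('div', class_='song')
# Locate the a tag whose class is "du"
a_tag = soup.find('a', class_='du')
# Locate the a tag whose id is "feng"
a_tag = soup.find('a', id='feng')


# Method 3: soup.find_all(tagName, attrName='value')
# Note: find_all returns every tag that matches, as a list
tags = soup.find_all('a', class_='du')


# Method 4: CSS selectors
# Common selectors: class selector (.classValue), id selector (#idValue)
tags = soup.select('#feng')  # all tags whose id is "feng"
tags = soup.select('.du')  # all tags whose class is "du"
# Hierarchy selectors: > means exactly one level (direct child); a space means any number of levels (descendant)
tags = soup.select('.tang > ul > li > a')
tags = soup.select('.tang a')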
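
The snippets above assume a page that already contains these classes and ids. As a self-contained sketch, the following runs the four methods against a small made-up page. The HTML is hypothetical and only mirrors the names used above, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page; class names and ids mirror the ones used above.
html = """
<div class="song">
  <a id="feng" class="du" href="/feng">Feng</a>
  <a class="du" href="/du">Du</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                                  # method 1: first <a> in the document
song_div = soup.find("div", class_="song")        # method 2: first match only
missing = soup.find("div", class_="missing")      # find returns None when nothing matches
du_links = soup.find_all("a", class_="du")        # method 3: every match, as a list
feng = soup.select("#feng")                       # method 4: CSS selector, always a list

print(first_a["href"], song_div.name, missing, len(du_links), len(feng))
```

Because `find` can return `None` while `find_all` and `select` return a possibly empty list, it is worth checking the result before indexing or reading attributes.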

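Selector-based locating (hierarchy selectors)

How soup.select('.tang > ul > li > a') differs from soup.select('.tang a')

soup.select('.tang > ul > li > a') matches only <a> tags that are direct children of an <li>, where that <li> is a direct child of a <ul>, and that <ul> is a direct child of the class="tang" element.
soup.select('.tang a') matches every <a> tag nested anywhere inside the class="tang" element, whether it is a direct child or several levels down. The structure in between does not matter, as long as the <a> tag is a descendant of the class="tang" element.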
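
A short sketch makes the difference concrete. The page below is made up for illustration: the second link is wrapped in an extra `<b>` tag, so it is a descendant of `.tang` but not a direct child of its `<li>`.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second <a> sits one level deeper than the first.
html = """
<div class="tang">
  <ul>
    <li><a href="/li-bai">Li Bai</a></li>
    <li><b><a href="/du-fu">Du Fu</a></b></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: every step must be a direct parent-child relation.
direct = [a.text for a in soup.select(".tang > ul > li > a")]
# Descendant combinator: any depth below .tang qualifies.
any_depth = [a.text for a in soup.select(".tang a")]

print(direct)      # ['Li Bai']
print(any_depth)   # ['Li Bai', 'Du Fu']
```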

Extraction

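# Method 1: extract the text inside a tag
# tag.string: returns the text only when the tag contains a single string; otherwise it is None
# tag.text: returns all text inside the tag, including text in nested tags
tag = soup.find('a', id='feng')
content = tag.string

div_tag = soup.find('div', class_='tang')
content = div_tag.text  # nested text is concatenated; newlines present in the source are kept


# Method 2: extract an attribute value with tag['attrName']
img_tag = soup.find('img')
img_src = img_tag['src']  # shorthand for img_tag.attrs['src']
print(img_src)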
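
To see how `.string`, `.text`, and attribute access behave on nested markup, here is a small sketch on a made-up fragment (the HTML is hypothetical). It also shows `get_text` and `tag.get`, which are part of the bs4 API but were not used above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> mixes its own text with a nested <b> tag.
html = (
    '<div class="tang">'
    '<a id="feng" href="/feng">Feng</a>'
    '<p>Hello <b>world</b></p>'
    '<img src="/a.png" alt="">'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

a_tag = soup.find("a", id="feng")
p_tag = soup.find("p")
img_tag = soup.find("img")

a_string = a_tag.string          # 'Feng': the tag holds a single string
p_string = p_tag.string          # None: <p> has more than one child
p_text = p_tag.text              # 'Hello world': all nested text joined
p_clean = p_tag.get_text(" ", strip=True)  # strip each piece, join with a space
img_src = img_tag["src"]         # raises KeyError if the attribute is absent
img_title = img_tag.get("title") # returns None instead of raising

print(a_string, p_string, p_text, img_src, img_title)
```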

xpath

Installation

pip install lxml

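Parsing

from lxml import etree

# tree = etree.parse('test.html')  # alternative: parse a local file
page_text = "<html><head><title>demo</title></head></html>"  # page source as a string
selector = etree.HTML(page_text)  # convert the source into a tree that XPath can query
result = selector.xpath('//title')  # xpath() takes an expression and returns a list

Locating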

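from lxml import etree

# Use the HTML parser: the default XML parser rejects HTML that is not well-formed
tree = etree.parse('test.html', etree.HTMLParser())

# Tag-based locating
# xpath() returns a list containing every tag that matches
title_tag = tree.xpath('/html/head/title')  # the title tag under head under html
title_tag = tree.xpath('//title')  # every title tag anywhere in the page source

# Attribute-based locating
div_tag = tree.xpath('//div[@class="song"]')  # div tags whose class attribute equals "song"

# Index-based locating
div_tag = tree.xpath('//div[1]')  # indexes start at 1; matches each div that is the first div child of its parent

# Hierarchy-based locating
# / means one level; // means any number of levels
a_list = tree.xpath('//div[@class="tang"]/ul/li/a')
a_list = tree.xpath('//div[@class="tang"]//a')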
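
The same expressions can be tried without a `test.html` file by parsing a string with `etree.HTML`. The page below is made up for illustration. It also shows a detail of index-based locating that is easy to miss: `//div[1]` can match several tags, one per parent, while `(//div)[1]` picks the first div in the whole document.

```python
from lxml import etree

# Hypothetical page; class names mirror the ones used above.
html = """
<html><head><title>Demo</title></head>
<body>
<div class="song"><p>first</p></div>
<div class="tang">
  <ul>
    <li><a href="/li-bai">Li Bai</a></li>
    <li><b><a href="/du-fu">Du Fu</a></b></li>
  </ul>
  <div class="note">n</div>
</div>
</body></html>
"""
tree = etree.HTML(html)

titles = tree.xpath("/html/head/title")                # absolute path from the root
song = tree.xpath('//div[@class="song"]')              # attribute-based
first_divs = tree.xpath("//div[1]")                    # first div child of each parent
very_first = tree.xpath("(//div)[1]")                  # first div in the document
direct = tree.xpath('//div[@class="tang"]/ul/li/a')    # one level per step
nested = tree.xpath('//div[@class="tang"]//a')         # any depth below the div

print([d.get("class") for d in first_divs])  # ['song', 'note']
print(len(direct), len(nested))              # 1 2
```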

Extraction

# Data extraction
# 1. Text content: /text() returns the tag's direct text; //text() returns all text beneath it
a_content = tree.xpath('//a[@id="feng"]/text()')[0]
div_content = tree.xpath('//div[@class="song"]//text()')
# 2. Attribute values: //tag/@attrName
img_src = tree.xpath('//img/@src')[0]
print(img_src)
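
Since `xpath()` always returns a list, indexing with `[0]` raises an `IndexError` when nothing matches. The sketch below, on a made-up fragment, shows the difference between `/text()` and `//text()` and one way to guard against an empty result.

```python
from lxml import etree

# Hypothetical fragment: <p> mixes its own text with a nested <b> tag.
html = (
    '<div class="song">'
    '<a id="feng" href="/feng">Feng</a>'
    '<p>Hello <b>world</b></p>'
    '<img src="/a.png">'
    '</div>'
)
tree = etree.HTML(html)

direct_text = tree.xpath("//p/text()")     # only text directly inside <p>
all_text = tree.xpath("//p//text()")       # text at any depth below <p>
joined = "".join(all_text)                 # join the pieces into one string
srcs = tree.xpath("//img/@src")            # attribute values, as a list

# Guard against an empty result instead of indexing blindly.
missing = tree.xpath('//a[@id="nope"]/text()')
first_or_none = missing[0] if missing else None

print(direct_text, all_text, joined, srcs, first_or_none)
```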