python数据解析（bs4\xpath）

lovers_better

于 2024-06-01 09:24:16 发布

阅读量713

点赞数 17

分类专栏： python爬虫 python 文章标签： python tensorflow 深度学习

本文链接：https://blog.csdn.net/Al_lover/article/details/139365527

版权

python 同时被 2 个专栏收录

5 篇文章 1 订阅

订阅专栏

python爬虫

2 篇文章 0 订阅

订阅专栏

python数据解析(bs4\xpath)

一、何为数据解析

概念：可以将爬取到的数据中指定的数据进行单独提取。
作用：实现聚焦爬虫。
数据解析通用原理：
- 在一张页面中，想要解析的数据是存在于相关的html的标签中。
- 可以先将指定的标签进行定位，然后可以将该标签中展示的数据进行提取。
聚焦爬虫编码流程:
- 指定url
- 发起请求
- 获取页面源码数据
- 数据解析
- 持久化存储
python中可以实现数据解析的技术：
- 正则表达式（复杂度高）
- bs4（python独有，学习成本较低）
- xpath（通用性最强，最重要）
- pyquery（css语句）

二、数据解析的主流策略

2.1 bs4

环境安装：
- pip install bs4
- pip install lxml
bs4数据解析的流程:
- 创建一个BeautifulSoup对象，把被解析的数据加载到该对象中。
- 调用BeautifulSoup对象相关的属性或者方法进行标签定位和数据提取

具体解析的操作：

在当前目录下新建一个test.html文件，然后将下述内容拷贝到该文件中

<html lang="en">
<head>
	<meta charset="UTF-8" />
	<title>测试bs4</title>
</head>
<body>
	<div>
		<p>百里守约</p>
	</div>
	<div class="song">
		<p>李清照</p>
		<p>王安石</p>
		<p>苏轼</p>
		<p>柳宗元</p>
		<a href="http://www.song.com/" title="赵匡胤" target="_self">
			<span>this is span</span>
		宋朝是最强大的王朝，不是军队的强大，而是经济很强大，国民都很有钱</a>
		<a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
		<img src="http://www.baidu.com/meinv.jpg" alt="" />
	</div>
	<div class="tang">
		<ul>
			<li><a href="http://www.baidu.com" title="qing">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村</a></li>
			<li><a href="http://www.163.com" title="qin">秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山</a></li>
			<li><a href="http://www.126.com" alt="qi">岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君</a></li>
			<li><a href="http://www.sina.com" class="du">杜甫</a></li>
			<li><a href="http://www.dudu.com" class="du">杜牧</a></li>
			<li><b>杜小月</b></li>
			<li><i>度蜜月</i></li>
			<li><a href="http://www.haha.com" id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a></li>
		</ul>
	</div>
</body>
</html>

有了test.html文件后，在练习如下操作

from bs4 import BeautifulSoup
fp = open('test.html','r')
#1.创建一个BeautifulSoup的工具对象，然后把即将被解析的页面源码数据加载到该对象中
#参数1：被解析的页面源码数据
#参数2：固定形式的lxml(一种解析器)
soup = BeautifulSoup(fp,'lxml')

#2.可以调用BeautifulSoup对象的相关函数和属性进行标签定位和数据提取

#标签定位-方式1:soup.tagName(只可以定位到第一次出现的该标签)
title_tag = soup.title
p_tag = soup.p

#标签定位-方式2（属性定位）:soup.find(tagName,attrName='value')
#注意：find只可以定位满足要求的第一个标签，如果使用class属性值的话，find参数class_
#定位到了class属性值为song的div标签
div_tag = soup.find('div',class_='song')
#定位到class属性值为du的a标签
a_tag = soup.find('a',class_='du')
#定位到了id的属性值为feng的a标签
a_tag = soup.find('a',id='feng')

#标签定位-方式3（属性定位）:soup.find_all(tagName,attrName='value')
#注意：find_all可以定位到满足要求的所有标签
tags = soup.find_all('a',class_='du')
#标签定位-方式4(选择器定位):
    #常用的选择器：class选择器(.class属性值)  id选择器(#id的属性值)
tags = soup.select('#feng') #定位到id的属性值为feng对应的所有标签
tags = soup.select('.du') #定位到class属性值为du对应的所有标签
#层级选择器：>表示一个层级  一个空格可以表示多个层
tags = soup.select('.tang > ul > li > a')
tags = soup.select('.tang a')
# print(tags)

#定位到标签内部数据的提取
#方式1：提取标签内的文本数据
    #tag.string:只可以将标签直系的文本内容取出
    #tag.text:可以将标签内部所有的文本内容取出
tag = soup.find('a',id='feng')
content = tag.string

div_tag = soup.find('div',class_='tang')
content = div_tag.text

#方式2：提取标签的属性值 tag['attrName']
img_tag = soup.find('img')
img_src = img_tag['src']
print(img_src)

案例应用：碧血剑文本爬取

url：https://bixuejian.5000yan.com/
需求：将每一个章节的标题和内容进行爬取然后存储到文件中

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

url = 'https://bixuejian.5000yan.com/'
#获取了首页对应的页面源码数据
response = requests.get(url=url,headers=headers)
response.encoding = 'utf-8'
page_text = response.text
#在首页页面源码数据中进行数据解析（章节的标题）
soup = BeautifulSoup(page_text,'lxml')
#所有的a标签定位保存到了a_list这个列表中
a_list = soup.select('.paiban > li > a')
for a in a_list:
    #章节的标题
    title = a.string
    detail_url = a['href'] #章节详情页的url
    #对详情页的url进行请求，为了获取详情页的页面源码数据，将章节内容进行解析
    detail_response = requests.get(url=detail_url,headers=headers)
    detail_response.encoding = 'utf-8'
    detail_page_text = detail_response.text
    #解析详情页，将章节内容进行提取
    detail_soup = BeautifulSoup(detail_page_text,'lxml')
    div_tag = detail_soup.find('div',class_='grap')
    #章节内容
    content = div_tag.text

    fileName = 'xiaoshuo/' + title + '.txt' #xiaoshuo/章节1.txt
    with open(fileName,'w') as fp:
        fp.write(title+'\n'+content)
    print(title,':爬取保存成功！')

2.2 xpath

环境安装：
- pip install lxml
xpath解析的编码流程:
- 创建一个etree类型的对象，把被解析的数据加载到该对象中
- 调用etree对象中xpath函数结合不同形式的xpath表达式进行标签定位和数据的提取
xpath表达式如何理解？

from lxml import etree
#1.创建一个etree的工具对象，然后把即将被解析的页面源码数据加载到该对象中
tree = etree.parse('test.html')
#2.调用etree对象的xpath函数然后结合着不用形式的xpath表达式进行标签定位和数据提取
#xpath函数返回的是列表，列表中存储的是满足定位要求的所有标签
#/html/head/title定位到html下面的head下面的title标签
title_tag = tree.xpath('/html/head/title')
#//title在页面源码中定位到所有的title标签
title_tag = tree.xpath('//title')
#属性定位
    #定位到所有的div标签
div_tags = tree.xpath('//div')
    #定位到class属性值为song的div标签 //tagName[@attrName='value']
div_tag = tree.xpath('//div[@class="song"]')
#索引定位://tag[index]
    #注意：索引是从1开始的
div_tag = tree.xpath('//div[1]')
#层级定位
    # /表示一个层级  //表示多个层级
a_list = tree.xpath('//div[@class="tang"]/ul/li/a')
a_list = tree.xpath('//div[@class="tang"]//a')

#数据提取
    #1.提取标签中的文本内容:/text()取直系文本  //text()取所有文本
a_content = tree.xpath('//a[@id="feng"]/text()')[0]
div_content = tree.xpath('//div[@class="song"]//text()')
    #2.提取标签的属性值：//tag/@attrName
img_src = tree.xpath('//img/@src')[0]
print(img_src)

图片数据爬取：

http://pic.netbian.com/4kmeinv/

将爬取到的图片存储到指定的文件夹中
爬取第一页

from lxml import etree
import requests
import os
#新建一个文件夹
dirName = 'girls'
if not os.path.exists(dirName):#如果文件夹不存在，则新建，否则不新建
    os.mkdir(dirName)

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}

url = 'https://pic.netbian.com/4kmeinv/index.html'
response = requests.get(url=url,headers=headers)
response.encoding = 'gbk'
page_text = response.text

#数据解析：图片地址+图片名称
tree = etree.HTML(page_text)#HTML()专门用来解析网络请求到的页面源码数据
#该列表中存储的是每一个li标签
li_list = tree.xpath('//div[@class="slist"]/ul/li')
for li in li_list:
    #局部解析：将li标签中指定的内容解析出来
    img_title = li.xpath('./a/b/text()')[0]+'.jpg'# 左侧./表示xpath的调用者对应的标签
    img_src = 'https://pic.netbian.com'+li.xpath('./a/img/@src')[0]

    #对图片发起请求，存储图片数据
    img_data = requests.get(url=img_src,headers=headers).content
    # girls/123.jpg
    img_path = dirName + '/' + img_title
    with open(img_path,'wb') as fp:
        fp.write(img_data)
    print(img_title,'下载保存成功！')

爬取多页

from lxml import etree
import requests
import os
#新建一个文件夹
dirName = 'girls'
if not os.path.exists(dirName):#如果文件夹不存在，则新建，否则不新建
    os.mkdir(dirName)

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
#创建一个通用的url:除了第一页其他页码的通用url
url = 'https://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1,6):
    if page == 1:
        new_url = 'https://pic.netbian.com/4kmeinv/index.html'
    else:
        new_url = format(url%page)
    print('----------正在请求下载第%d页的图片数据----------'%page)
    response = requests.get(url=new_url,headers=headers)
    response.encoding =2 'gbk'
    page_text = response.text

    #数据解析：图片地址+图片名称
    tree = etree.HTML(page_text)#HTML()专门用来解析网络请求到的页面源码数据
    #该列表中存储的是每一个li标签
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        #局部解析：将li标签中指定的内容解析出来
        img_title = li.xpath('./a/b/text()')[0]+'.jpg'# 左侧./表示xpath的调用者对应的标签
        img_src = 'https://pic.netbian.com'+li.xpath('./a/img/@src')[0]

        #对图片发起请求，存储图片数据
        img_data = requests.get(url=img_src,headers=headers).content
        # girls/123.jpg
        img_path = dirName + '/' + img_title
        with open(img_path,'wb') as fp:
            fp.write(img_data)
        print(img_title,'下载保存成功！')