Data parsing categories
Focused crawling
Overview of data parsing principles
Data parsing with regular expressions
Data parsing with bs4
Data parsing with XPath
Fixing garbled Chinese in responses
Data parsing categories
- Regex parsing
- bs4 parsing
- XPath parsing (the main focus)
Focused crawling
- Crawls specified content within a page (built on top of a general-purpose crawler)
- Workflow (a minimal sketch follows this list)
  - Specify the URL
  - Send the request
  - Get the response data
  - Parse the data
  - Persist the results
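A minimal sketch of these five steps, assuming a placeholder URL (https://www.example.com) and a hypothetical output file title.txt:
import requests
import re

if __name__ == "__main__":
    # 1. specify the URL (placeholder; substitute the real target)
    url = 'https://www.example.com'
    # 2. send the request (with UA spoofing, as in the examples below)
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url=url, headers=headers)
    # 3. get the response data
    page_text = response.text
    # 4. parse the data (here just the <title> text, via a simple regex)
    titles = re.findall('<title>(.*?)</title>', page_text, re.S)
    # 5. persist the result
    with open('./title.txt', 'w', encoding='utf-8') as fp:
        fp.write(titles[0] if titles else '')
    print(titles)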
Overview of data parsing principles
- The local text to be parsed is stored either between tags or in a tag's attributes
- First locate the target tag (tag positioning)
- Then extract (parse) the data stored in the tag or in its attributes
Data parsing with regular expressions
Requirement: crawl all images under the 糗图 (funny-pictures) board of 糗事百科 (Qiushibaike)
import requests
import re
import os

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    # create a folder to store the images
    if not os.path.exists('./newQiutuLibs'):
        os.mkdir('./newQiutuLibs')
    # after inspecting the site, a generic URL template can be set up
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    for pageNum in range(1, 3):
        # URL of the corresponding page
        new_url = url % pageNum
        # fetch the entire page (the general-crawling step)
        page_text = requests.get(url=new_url, headers=headers).text
        # parse out every image on the page (the focused-crawling step)
        # regular expression
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        # use re.S (dot matches newlines) when applying the regex to page source
        img_src_list = re.findall(ex, page_text, re.S)
        # print(img_src_list)
        for src in img_src_list:
            # build a complete URL
            src = 'https:' + src
            # fetch the image as binary data
            img_data = requests.get(url=src, headers=headers).content
            # image file name
            img_name = src.split('/')[-1]
            # image storage path
            img_path = './newQiutuLibs/' + img_name
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded successfully!')
bs4 parsing
How bs4 data parsing works
- Instantiate a BeautifulSoup object and load the page source data into it
- Locate tags and extract data by calling the BeautifulSoup object's properties and methods
Environment setup
- pip install bs4
- pip install lxml
- If installation fails, refer to the linked article
How to instantiate a BeautifulSoup object
- from bs4 import BeautifulSoup
- Instantiating the object
- Load data from a local HTML file into the object
# load the data of a local HTML file into the object
fp = open('./test.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
print(soup)
- Load page source fetched from the internet into the object
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')
Methods and properties provided for data parsing
- select
- getting the text between tags
- getting a tag's attribute values
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # load the data of a local HTML file into the object
    fp = open('./test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    # print(soup)  # soup.tagName returns the first tagName tag that appears in the HTML
    # print(soup.find('div'))  # equivalent to soup.div
    # print(soup.find('div', class_='song'))  # locate by attribute
    # print(soup.find_all('a'))  # returns all matching tags (a list)
    # print(soup.select('.tang'))  # takes any selector (id, class, tag); returns a list
    # print(soup.select('.tang > ul > li > a')[0])  # hierarchical selector: > is one level, a space spans multiple levels
    # getting the text between tags: soup.a.text / soup.a.string / soup.a.get_text()
    #   text / get_text(): all text content inside a tag
    #   string: only the direct text content of the tag itself
    # print(soup.select('.tang > ul a')[0].get_text())  # same as above; the space spans multiple levels
    # getting an attribute value from a tag
    print(soup.select('.tang > ul a')[0]['href'])
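test.html is not included in this article, so the sketch below uses a made-up HTML string to show the same accessors end to end:
from bs4 import BeautifulSoup

# hypothetical HTML standing in for test.html
html_doc = '''
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">静夜思</a></li>
    <li><a href="http://example.com/2">春晓</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.a)                                         # first <a> tag in the document
print(soup.find_all('a'))                             # all <a> tags (a list)
print(soup.select('.tang > ul > li > a')[0].string)   # direct text: 静夜思
print(soup.select('.tang > ul a')[1]['href'])         # attribute value: http://example.com/2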
Crawling the chapter titles and text of 三国演义 (Romance of the Three Kingdoms)
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # 1. specify the URL of the catalogue page
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    # 2. UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    # instantiate a BeautifulSoup object
    soup = BeautifulSoup(page_text, 'lxml')
    # parse out the chapter titles and detail-page URLs
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com' + li.a['href']
        # request the detail page to get the chapter text
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        # parse the chapter text out of the detail page
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        # chapter text extracted
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, 'crawled successfully!')
    fp.close()
XPath parsing
- The most commonly used, most convenient, and most efficient parsing approach
How XPath parsing works
- Instantiate an etree object and load the source data of the page to be parsed into it
- Call the etree object's xpath method with an XPath expression to locate tags and capture content
Environment setup
- pip install lxml
- If installation fails, refer to the linked article
How to instantiate an etree object
- Load the source data of a local HTML file into an etree object
from lxml import etree
etree.parse(filePath)
- Or load page source fetched from the internet into the object
from lxml import etree
etree.HTML(page_text)
XPath expressions
import requests
from lxml import html

if __name__ == "__main__":
    etree = html.etree
    # instantiate an etree object and load the local file into it
    # (if test.html is not well-formed XML, pass etree.HTMLParser() as a second argument)
    tree = etree.parse('test.html')
    # a leading / starts from the root node (note this); each / is one level
    r = tree.xpath('/html/body/div')
    # r = tree.xpath('/html//div')  # same as above; // spans multiple levels
    # r = tree.xpath('//div')  # same as above; // can also start matching from anywhere
    # r = tree.xpath('//div[@class="song"]')  # locate by attribute
    # r = tree.xpath('//div[@class="song"]/p[3]')  # locate by index (indexing starts at 1)
    # r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]  # /text() gets the content; note xpath returns a list, [0] takes the item
    # /text(): direct text of the tag; //text(): all text under the tag
    # /@attrName: get an attribute value, e.g. img/@src
    print(r)
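Likewise, since test.html is not provided, here is a self-contained sketch with a made-up HTML string that exercises the same kinds of expressions:
from lxml import etree

# hypothetical HTML standing in for test.html
html_doc = '''
<html><body>
  <div class="song"><p>one</p><p>two</p><p>three</p></div>
  <div class="tang">
    <ul>
      <li><a href="http://example.com/du">杜甫</a></li>
      <li><a href="http://example.com/li">李白</a></li>
    </ul>
  </div>
</body></html>
'''
tree = etree.HTML(html_doc)
print(tree.xpath('/html/body/div'))                           # absolute path, one level per /
print(tree.xpath('//div[@class="song"]/p[3]/text()'))         # attribute + index (1-based): ['three']
print(tree.xpath('//div[@class="tang"]//li[2]/a/text()')[0])  # direct text: 李白
print(tree.xpath('//div[@class="tang"]//a/@href'))            # attribute values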
Practical examples
Fixing garbled Chinese in responses
- Two approaches are given here (one of them usually solves the problem)
- Approach 1
response = requests.get(url=url, headers=headers)
# manually set the encoding of the response data
response.encoding = 'utf-8'
page_text = response.text
- Approach 2
# a more general fix for garbled Chinese (img_name here is a str)
img_name = img_name.encode('iso-8859-1').decode('gbk')
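Why approach 2 works, in a minimal round-trip illustration (assuming the server actually sent GBK bytes that requests decoded as ISO-8859-1):
raw = '中文'.encode('gbk')                         # bytes the server actually sent
wrong = raw.decode('iso-8859-1')                   # what response.text yields when the headers claim ISO-8859-1
fixed = wrong.encode('iso-8859-1').decode('gbk')   # re-encode to recover the bytes, then decode correctly
print(wrong, '->', fixed)                          # ÖÐÎÄ -> 中文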
Parsing and crawling 4K images
import requests
import os
from lxml import html

if __name__ == "__main__":
    etree = html.etree
    url = 'http://pic.netbian.com/4kmeinv/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # garbled image names: manually setting the response encoding does not fix it here
    # response.encoding = 'utf-8'
    page_text = response.text
    # instantiate the etree object
    tree = etree.HTML(page_text)
    # get the li tag objects
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    # create a folder
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')
    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # the more general fix for garbled Chinese
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        # print(img_name, img_src)
        # request the image and persist it
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully')
    print('over!!!')
Crawling second-hand housing listings on 58同城
import requests
from lxml import html

if __name__ == "__main__":
    etree = html.etree
    url = 'https://suzhou.58.com/ershoufang/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    # data parsing
    tree = etree.HTML(page_text)
    # li tag objects
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
    fp = open('./58.txt', 'w', encoding='utf-8')
    for li in li_list:
        # note: ./ refers to the current node
        title = li.xpath('./div[2]/h2/a/text()')[0]
        print(title)
        fp.write(title + '\n')
    fp.close()
Crawling city names nationwide
import requests
from lxml import html

if __name__ == "__main__":
    # first version: parse hot cities and all cities in two separate passes
    # etree = html.etree
    #
    # url = 'https://www.aqistudy.cn/historydata/'
    # headers = {
    #     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    # }
    # page_text = requests.get(url=url, headers=headers).text
    # tree = etree.HTML(page_text)
    # host_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
    # all_city_name = []
    # # names of the hot cities
    # for li in host_li_list:
    #     host_city_name = li.xpath('./a/text()')[0]
    #     all_city_name.append(host_city_name)
    #
    # # names of all cities
    # city_names_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    # for li in city_names_list:
    #     city_name = li.xpath('./a/text()')[0]
    #     all_city_name.append(city_name)
    #
    # print(all_city_name, len(all_city_name))

    # second version: one pass using the XPath union operator |
    etree = html.etree
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # div/ul/li/a          hierarchy of the hot-city a tags
    # div/ul/div[2]/li/a   hierarchy of the all-city a tags
    # list holding both hot cities and all cities
    all_city_name = []
    # a tag objects of both hot cities and all cities
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_name.append(city_name)
    print(all_city_name, len(all_city_name))
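The | operator simply merges the node sets matched by the two expressions; a tiny self-contained sketch (made-up HTML) shows it in isolation:
from lxml import etree

html_doc = '''
<div class="bottom">
  <ul class="hot"><li><a>北京</a></li><li><a>上海</a></li></ul>
  <ul class="all"><li><a>苏州</a></li></ul>
</div>
'''
tree = etree.HTML(html_doc)
# both branches are evaluated in a single call and their results combined
names = tree.xpath('//ul[@class="hot"]/li/a/text() | //ul[@class="all"]/li/a/text()')
print(names)  # ['北京', '上海', '苏州']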
Crawling résumé templates
import requests
import os
from lxml import html

if __name__ == "__main__":
    etree = html.etree
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    # fix garbled Chinese here (approach 1 above)
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    # data parsing
    tree = etree.HTML(page_text)
    # list for the template detail-page URLs
    # all_jianli_url = []
    # parse out the a tag objects
    a_list = tree.xpath('//div[@id="container"]/div/a')
    # create a folder
    if not os.path.exists('./jianli'):
        os.mkdir('./jianli')
    print(a_list)
    for a in a_list:
        jianli_url = 'https:' + a.xpath('./@href')[0]
        # name of the résumé template
        jianli_name = a.xpath('./img/@alt')[0] + '.rar'
        print(jianli_name)
        # jianli_name = jianli_name.encode('iso-8859-1').decode('gbk')
        # all_jianli_url.append(jianli_url)
        # request the detail page
        page_text = requests.get(url=jianli_url, headers=headers).text
        tree = etree.HTML(page_text)
        # download link of the template
        jianli_load_url = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[4]/a/@href')[0]
        # storage path
        # jianli_path = 'jianli/' + jianli_load_url.split('/')[-1]
        jianli_path = 'jianli/' + jianli_name
        # request the archive
        jianli_data = requests.get(url=jianli_load_url, headers=headers).content
        # persist it
        with open(jianli_path, 'wb') as fp:
            fp.write(jianli_data)
        print(jianli_path, 'downloaded successfully')
    print('perfect')