1. Data Parsing Overview
Crawlers generally fall into three types:
1) general-purpose crawlers;
2) focused crawlers;
3) incremental crawlers.
A focused crawler scrapes only the specified content within a page.
Data parsing approaches:
- regular expressions
- beautifulsoup4
- xpath
How data parsing works:
- the local text content to be parsed is stored either between tags or in a tag's attributes;
- locate the specified tag;
- extract (parse) the data stored in the tag or in its attributes.
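To make the idea concrete, here is a tiny sketch (the HTML string, tag names, and patterns are made up purely for illustration): the title text sits between a tag pair, while the image address sits in an attribute of a tag.
import re

# Hypothetical page fragment: the text to extract sits between <a>...</a>,
# while the image address sits in the src attribute of <img>
html = '<div class="item"><a href="/detail/1">First item</a><img src="/img/1.jpg"></div>'
print(re.findall(r'<a href=".*?">(.*?)</a>', html))  # ['First item']   -> text stored between tags
print(re.findall(r'<img src="(.*?)">', html))        # ['/img/1.jpg']   -> value stored in an attribute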
2. Data Parsing with Regular Expressions
2.1 Regex practice
import re

# 1. Extract "python"
key1 = 'javapythonc++php'
print(re.findall(r'python', key1))
############################################################
# 2. Extract "hello world"
key2 = '<html><h1>hello world</h1></html>'
print(re.findall(r'<h1>(.*)</h1>', key2))
############################################################
# 3. Extract "170"
key3 = '小明身高170厘米'
print(re.findall(r'\d+', key3))
##############################################################
# 4. Extract "http://" and "https://"
key4 = 'http://baidu.com and https://boob.com'
print(re.findall(r'https?://', key4))
#############################################################
# 5. Extract "hello", i.e. output <html>hello</HtMl>
key5 = 'lalalala<html>hello</HtMl>lalalalaa'
print(re.findall(r'<.*>', key5))
##############################################################
# 6. Extract "hit."
key6 = 'robot@hit.com'
print(re.findall(r'h.*?\.', key6))
##############################################################
# 7. Match "sas", "saas" and "saaaaas"
key7 = 'sasodfsaasspppssaaaaassaaas'
print(re.findall(r'sa{1,2}s|saaaaas', key7))
2.2 Using regular expressions: scraping all images from the funny-pictures (糗图) board of Qiushibaike (糗事百科)
import requests
import re
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# Create a folder to hold all the images
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')
# Use a generic URL template and loop over the pages
for page_num in range(1, 36):
    url = f'https://www.qiushibaike.com/imgrank/page/{page_num}/'
    page_text = requests.get(url=url, headers=headers).text
    # Extract every image src inside a "thumb" div (re.S lets '.' match newlines)
    img_src_list = re.findall('<div class="thumb">.*?<img src="/(.*?)" alt.*?</div>', page_text, re.S)
    for src in img_src_list:
        # Build the full image URL (the captured group starts after the leading slash)
        src = 'https:/' + src
        # Request the binary response data
        img_data = requests.get(url=src, headers=headers).content
        # Build the image file name: split on '/' and take the last part
        img_name = src.split('/')[-1]
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully!')
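One detail worth calling out in the pattern above is the re.S (DOTALL) flag: by default '.' does not match newline characters, and the <div class="thumb"> block spans several lines in the page source. A minimal sketch with a made-up fragment (pic.example.com is hypothetical, and this pattern keeps the full src value instead of stripping the leading slash):
import re

html = '<div class="thumb">\n<img src="//pic.example.com/a.jpg" alt="demo">\n</div>'
pattern = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
print(re.findall(pattern, html))        # [] -- '.' stops at newlines, so nothing matches
print(re.findall(pattern, html, re.S))  # ['//pic.example.com/a.jpg'] -- re.S lets '.' match '\n' too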
3. Data Parsing with BeautifulSoup4
3.1 bs4 overview
How bs4 data parsing works:
- 1. Instantiate a BeautifulSoup object and load the page source into it;
- 2. Call the BeautifulSoup object's properties and methods to locate tags and extract data.
Environment setup:
- pip install bs4
- pip install lxml
How to instantiate a BeautifulSoup object:
- from bs4 import BeautifulSoup
- Two ways to instantiate it:
  - 1. Load the data of a local HTML document into the object:
from bs4 import BeautifulSoup
fp = open('./huazhuangpin.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
print(soup)
  - 2. Load page source fetched from the internet into the object:
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')
Properties and methods provided for data parsing:
- 1. soup.tagName: returns the first tagName tag that appears in the document (e.g. soup.div)
- 2. find and find_all:
  - (1) soup.find('tagName'): equivalent to soup.tagName
  - (2) soup.find('div', class_='hzblist'): locate by class_, id, or another attribute
  - (3) soup.find_all('tagName'): returns all matching tags as a list
- 3. select:
  - (1) soup.select('selector'): takes any CSS selector (id, class, tag, ...) and returns a list;
  - (2) hierarchical selectors: soup.select('.hzbscbox > .hzbscin > .hzbtabs > span'), where '>' means one level down and a space means any number of levels down.
Getting the text between tags:
- soup.a.text / soup.a.string / soup.a.get_text()
- text and get_text() return all text inside a tag, including text in nested tags
- string returns only the text directly under the tag
Getting a tag's attribute:
- soup.a['href']
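To make these properties and methods concrete, here is a minimal, self-contained sketch against a made-up HTML fragment (the class name hzblist and the href values are hypothetical, chosen only for illustration):
from bs4 import BeautifulSoup

html = '''
<div class="hzblist">
    <a href="/item/1" id="first">first <span>link</span></a>
    <a href="/item/2">second link</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a)                               # first <a> tag in the document
print(soup.find('div', class_='hzblist'))   # first <div> whose class is "hzblist"
print(soup.find_all('a'))                   # list of all <a> tags
print(soup.select('.hzblist > a'))          # CSS selector, one level down
print(soup.a.text)                          # all text inside the tag: "first link"
print(soup.a.string)                        # None here, because the tag contains a nested <span>
print(soup.a['href'])                       # attribute value: "/item/1"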
3.2 bs4: scraping a novel's chapter titles and contents
# -*-coding:utf-8-*-
from bs4 import BeautifulSoup
import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
list_ = soup.select('.book-mulu > ul > li')
fp = open('./三国演义.text', 'w', encoding='utf-8')
for li in list_:
    title = li.a.string
    detail_url = 'https://www.shicimingju.com' + li.a['href']
    # Request the detail page of this chapter
    detail_pagetext = requests.get(url=detail_url, headers=headers).text
    print(detail_pagetext)
    # Parse the chapter content out of the detail page
    detail_soup = BeautifulSoup(detail_pagetext, 'lxml')
    div_tag = detail_soup.find('div', class_='chapter_content')
    # The chapter text
    content = div_tag.text
    fp.write(title + ':' + content + '\n')
    print(title, 'scraped successfully')
fp.close()
I am not sure why the output came out garbled; it was fixed after replacing part of the code. Reference: https://www.cnblogs.com/Yemilice/p/6201224.html
After the change:
# -*-coding:utf-8-*-
from bs4 import BeautifulSoup
import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# print(requests.get(url).encoding)
page_text = requests.get(url=url, headers=headers)
page_text.encoding = 'UTF-8'
page_text = page_text.text
soup = BeautifulSoup(page_text, 'lxml')
list_ = soup.select('.book-mulu > ul > li')
fp = open('./三国演义.text', 'w', encoding='utf-8')
for li in list_:
    title = li.a.string
    detail_url = 'https://www.shicimingju.com' + li.a['href']
    # Request the detail page and force its encoding before reading the text
    detail_page = requests.get(url=detail_url, headers=headers)
    detail_page.encoding = 'utf-8'
    detail_pagetext = detail_page.text
    # Parse the chapter content out of the detail page
    detail_soup = BeautifulSoup(detail_pagetext, 'lxml')
    div_tag = detail_soup.find('div', class_='chapter_content')
    # The chapter text
    content = div_tag.text
    fp.write(title + ':' + content + '\n')
    print(title, 'scraped successfully')
fp.close()
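For reference, the garbling happens because requests guesses the response encoding from the HTTP headers; when no charset is declared it typically falls back to ISO-8859-1, which mangles Chinese text. Forcing response.encoding as above works; a slightly more general pattern (sketched here, reusing the url and headers variables from the script above) is to use the encoding detected from the response body:
resp = requests.get(url=url, headers=headers)
# apparent_encoding is detected from the response body itself,
# so it also works when the HTTP headers do not declare a charset
resp.encoding = resp.apparent_encoding
page_text = resp.text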
4. Data Parsing with XPath
Environment:
- Python 3.6.1
- lxml 4.1.0
4.1 XPath overview
How XPath data parsing works:
- 1. Instantiate an etree object and load the page source into it;
- 2. Call the etree object's xpath method with an XPath expression to locate tags and extract data.
Environment setup:
- pip install lxml
How to instantiate an etree object:
- 1. Load the source of a local HTML document into an etree object:
  - etree.parse(filePath)
- 2. Load page source fetched from the internet into the object:
  - etree.HTML(page_text)
- 3. Then call xpath('xpath expression') on the resulting object.
XPath expressions (a minimal runnable sketch follows this list):
- /: locate starting from the root node; each single slash is one level down.
- //: any number of levels down; at the start of an expression it means "locate from anywhere in the document".
- Locate by attribute: //div[@attrName="attrValue"]
- Locate by index: //div[@class="hzbtabs"]/span[1] (indexing starts at 1)
- Getting text:
  - /text() returns only the text directly inside the tag
  - //text() returns all text inside the tag, including nested tags
- Getting an attribute: /@attrName, e.g. //img/@src
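A minimal runnable sketch of these expressions against a made-up fragment (the class names hzbtabs and other, and the image path, are hypothetical):
from lxml import etree

html = '''
<html><body>
    <div class="hzbtabs"><span>first</span><span>second</span></div>
    <div class="other"><img src="/img/demo.jpg"/></div>
</body></html>'''
tree = etree.HTML(html)
print(tree.xpath('/html/body/div'))                           # locate level by level from the root
print(tree.xpath('//div[@class="hzbtabs"]'))                  # // matches at any depth, [@...] filters by attribute
print(tree.xpath('//div[@class="hzbtabs"]/span[1]/text()'))   # index starts at 1; /text() takes direct text -> ['first']
print(tree.xpath('//div[@class="hzbtabs"]//text()'))          # //text() takes all nested text -> ['first', 'second']
print(tree.xpath('//img/@src'))                               # /@attrName takes an attribute value -> ['/img/demo.jpg']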
4.2 XPath: scraping second-hand housing listings from 58.com
from lxml import etree
import requests

# Fetch the page source
url = "https://nj.58.com/ershoufang/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
# Parse the data
tree = etree.HTML(page_text)
list1 = tree.xpath('//section[@class="list"][1]/div')
for li in list1:
    title = li.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
    print(title)
4.3 XPath: scraping 4K images
from lxml import etree
import requests
import os

# Fetch the page source
url = "https://pic.netbian.com/index.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers)
# The site is served as GBK, so set the encoding before reading the text
page_text.encoding = 'gbk'
page_text = page_text.text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul[@class="clearfix"]/li')
if not os.path.exists('./picLibs'):
    os.mkdir('./picLibs')
for li in li_list:
    img_src = 'https://pic.netbian.com' + li.xpath('./a//img/@src')[0]
    img_name = li.xpath('./a//img/@alt')[0] + '.jpg'
    # Request the image data and persist it
    img_data = requests.get(url=img_src, headers=headers).content
    img_path = 'picLibs/' + img_name
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
    print(img_name, 'downloaded successfully!!')
4.4 XPath: scraping the names of cities nationwide
from lxml import etree
import requests

# Fetch the page source
url = "https://www.aqistudy.cn/historydata/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
# Parse the data
# First attempt: parse the hot cities and the full city list separately
# hot_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
# all_city_names = []
# # Names of the hot cities
# for li in hot_city_list:
#     hot_city_names = li.xpath('./a/text()')[0]
#     all_city_names.append(hot_city_names)
#
# city_names_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
# for li in city_names_list:
#     city_name = li.xpath('./a/text()')[0]
#     all_city_names.append(city_name)
tree = etree.HTML(page_text)
# a tags of the hot cities:  //div[@class="bottom"]/ul/li/a
# a tags of all the cities:  //div[@class="bottom"]/ul/div[2]/li/a
# The "|" operator unions the two expressions so a single xpath call covers both
all_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
all_city_names = []
for a in all_list:
    city_name = a.xpath('./text()')[0]
    all_city_names.append(city_name)
print(all_city_names, len(all_city_names))
4.5 XPath: batch downloading PPT templates
The tutorial assignment was to download the free résumé templates from the 站长素材 (sc.chinaz.com) site, but that site no longer offers free résumés, so I chose to download the free PPT templates instead.
from lxml import etree
import requests
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# Build the list of listing-page URLs: page 1 plus pages 2 through 20
url_list = ['https://sc.chinaz.com/ppt/free.html']
i = 2
while i < 21:
    url = f'https://sc.chinaz.com/ppt/free_{i}.html'
    url_list.append(url)
    i += 1
# Create the download folder
if not os.path.exists('./PPT'):
    os.mkdir('./PPT')
# Walk the listing pages and collect the detail-page links
detail_url_list = []
for url in url_list:
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    detail_url = tree.xpath('//div[@class="container clearfix"]//div[@class="bot-div"]/a/@href')
    for i in detail_url:
        detail_url_list.append(i)
# Walk the detail pages, parse the download link and template name, then download and store the PPT
for detail_url in detail_url_list:
    detail_url = 'https://sc.chinaz.com/' + detail_url
    detail_page_text = requests.get(url=detail_url, headers=headers)
    detail_page_text.encoding = 'utf-8'
    detail_page_text = detail_page_text.text
    d_tree = etree.HTML(detail_page_text)
    download_url = d_tree.xpath('//div[@class = "download-url"]/a/@href')[0]
    download_name = d_tree.xpath('//div[@class = "title-box clearfix"]/h1/text()')[0] + '.ppt'
    PPT_data = requests.get(url=download_url, headers=headers).content
    PPT_path = 'PPT/' + download_name
    with open(PPT_path, 'wb') as fp:
        fp.write(PPT_data)
    print(download_name, 'downloaded successfully!!')