Note: case ③ no longer works, because the target page has changed over time.
With postgraduate admission interviews coming up, I have added a new case: scraping the list of 985/211 universities and saving it as an Excel spreadsheet.
0. What is a web crawler?
Where does the image data on a page actually come from?
Images are sent from a server over the network, so all you need to do is inspect the browser's network requests:
- Right-click → Inspect
- Or press F12 (Fn+F12 on some keyboards)
There you can see every piece of data the page has requested; it is all stored temporarily on your computer.
Click the Headers tab to see the details of each individual request.
With the request URL in hand, you can fetch the image you want directly, as sketched below.
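A minimal sketch, assuming a placeholder URL (copy a real one from the Network panel):

import requests

# hypothetical request URL taken from the browser's Network panel
img_url = 'https://example.com/some-image.jpg'
headers = {'User-Agent': 'Mozilla/5.0'}  # present ourselves as a normal browser

resp = requests.get(img_url, headers=headers)
with open('image.jpg', 'wb') as f:
    f.write(resp.content)  # .content holds the raw bytes of the response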
1. What is XPath?
- XPath uses path expressions to navigate XML documents
- XPath contains a standard function library
- XPath is a major element in XSLT
- XPath is a W3C standard
- It is a language for addressing parts of an XML document
XPath is the most common and most widely used way to parse scraped data.
2. How XPath parsing works
① Instantiate an etree object and load the page source to be parsed into it.
② Call the etree object's xpath method with an XPath expression to locate tags and capture their content, as in the sketch below.
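A minimal sketch of the two steps on an inline HTML string (the markup is made up):

from lxml import etree

# ① instantiate an etree object from the page source
tree = etree.HTML('<html><body><div>hi</div></body></html>')
# ② call its xpath method with an XPath expression
print(tree.xpath('//div/text()'))  # ['hi']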
3. Setting up the environment
pip install lxml
4. How to instantiate etree
Import etree: from lxml import etree
① Load the source of a local HTML file into an etree object:
etree.parse(filepath)
# filepath is the path to the HTML file
② Load source fetched from the internet into the object:
etree.HTML(page_text)
# page_text is the response text you fetched
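A runnable sketch of both paths, assuming a local file index.html and a placeholder URL:

from lxml import etree
import requests

# ① parse a local HTML file; HTMLParser tolerates sloppy real-world markup
tree = etree.parse('index.html', etree.HTMLParser())

# ② parse source fetched over HTTP
page_text = requests.get('https://example.com').text
tree = etree.HTML(page_text)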
5. XPath expressions
① Locating nodes
Ⅰ. Locating tags by hierarchy
Locate level by level from the root node: /html/body/div
/
denotes a single level of hierarchy
e.g. /html/body/div
//
denotes any number of levels
e.g. /html//div
//
can also mean "start matching from any position"
e.g. //div
./
denotes the current node
e.g. div.xpath('./ul')
selects the ul under the previously selected div
On a page where every div is a direct child of body, the three expressions above (/html/body/div, /html//div, //div) select the same nodes, as the sketch below confirms.
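A quick check on a toy document (the HTML string is made up):

from lxml import etree

html = '<html><body><div>a</div><div>b</div></body></html>'
tree = etree.HTML(html)

print(tree.xpath('/html/body/div'))  # absolute path, level by level
print(tree.xpath('/html//div'))      # any depth below html
print(tree.xpath('//div'))           # any position in the document
# all three print the same two div elements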
Ⅱ. Locating a tag precisely by attribute
tree.xpath('//div[@class="class-value"]')
Ⅲ. Locating precisely by id
tree.xpath('//div[@id="id-value"]')
Ⅳ. Locating by index
Note: XPath indexes start at 1, not 0.
tree.xpath('//div[@class="class-value"]/p[3]')
# the third p tag under the div with that class
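A short sketch of all three forms on a made-up document:

from lxml import etree

html = '''<div class="list"><p>one</p><p>two</p><p>three</p></div>
<div id="footer">bye</div>'''
tree = etree.HTML(html)

print(tree.xpath('//div[@class="list"]'))              # match by class attribute
print(tree.xpath('//div[@id="footer"]'))               # match by id
print(tree.xpath('//div[@class="list"]/p[3]/text()'))  # ['three'], 1-based index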
② Extracting values
Ⅰ. Getting text
Direct text children only: /text()
All text, including descendants: //text()
Ⅱ. Getting an attribute
/@attribute-name
e.g. getting the src attribute of an img:
img/@src
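A minimal sketch of both extraction forms (the HTML is made up):

from lxml import etree

html = '<div><a href="/home">home <b>page</b></a><img src="/logo.png"/></div>'
tree = etree.HTML(html)

print(tree.xpath('//a/text()'))   # ['home '], direct text only
print(tree.xpath('//a//text()'))  # ['home ', 'page'], includes descendants
print(tree.xpath('//img/@src'))   # ['/logo.png']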
6. Example code
① Scraping rental-listing titles from 58.com
from lxml import etree
import requests

url = 'https://wx.58.com/chuzu/?PGTID=0d200001-0005-de5c-a9c6-d0273a8518f9&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.56'
}
# headers must be passed by keyword; the second positional argument of requests.get is params
text_page = requests.get(url, headers=headers).text
tree = etree.HTML(text_page)
li_list = tree.xpath('//div[@class="list-box"]//li[contains(@class,"house-cell")]')
fp = open('58.txt', 'w', encoding='utf-8')
ll = len(li_list)
for i in range(ll):
    title = li_list[i].xpath('./div[@class="des"]/h2/a/text()')[0]
    print(title)
    fp.write(title + '\n')
fp.close()
######## the following loop works just as well #############
# for detail in li_list:
#     print(detail.xpath('./div[@class="des"]/h2/a/text()')[0])
#     fp.write(detail.xpath('./div[@class="des"]/h2/a/text()')[0] + '\n')
② Scraping wallpapers from pic.netbian.com
http://pic.netbian.com/4kmeinv/
import requests
from lxml import etree
import os

print("Enter the URL of the page you are on")
# e.g. http://pic.netbian.com/
url = input("")
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.56'
}
response = requests.get(url, headers=header)
# response.encoding = 'utf-8'  # mojibake fix, method one
page_text = response.text
tree = etree.HTML(page_text)
# grab the li elements of the image grid
li_list = tree.xpath('//div[@class="slist"]/ul[@class="clearfix"]/li')
# create the output folder
if not os.path.exists('./meinvpic'):
    os.mkdir('./meinvpic')
for nav in li_list:
    # URL of the image's detail page
    img_url = 'http://pic.netbian.com' + nav.xpath('./a/@href')[0]
    # print(img_url)
    detail_text = requests.get(img_url, headers=header).text
    detail_tree = etree.HTML(detail_text)
    # extract the image address and name
    img_src = 'http://pic.netbian.com' + detail_tree.xpath('//div[2]/div[1]/div[@class="photo"]/div[1]/div[2]/a/img/@src')[0]
    name = detail_tree.xpath('//div[2]/div[1]/div[@class="photo"]/div[1]/div[1]/h1/text()')[0] + '.jpg'
    # mojibake fix, method two
    name = name.encode('iso-8859-1').decode('gbk')
    img_data = requests.get(img_src, headers=header).content
    # save the image
    with open('./meinvpic/' + name, 'wb') as s:
        s.write(img_data)
    print(name)
    print(img_src)
③ Scraping the list of all cities
import requests
import os
from lxml import etree

if __name__ == '__main__':
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.56'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    # create the output folder
    if not os.path.exists('./city'):
        os.mkdir('./city')
    # output txt file
    f = open('./city/city.txt', 'w', encoding='utf-8')
    response_text = requests.get(url, headers=header).text
    tree = etree.HTML(response_text)
    # li elements for the regular (non-hot) cities
    normal_li_list = tree.xpath('//div[@class="container"]//div[@class="all"]/div[2]/ul/div[2]/li')
    # walk every li tag and write out the city name
    for normal_li in normal_li_list:
        detail = normal_li.xpath('./a/text()')
        f.write(detail[0] + '\n')
    f.close()
④ Scraping résumé templates from sc.chinaz.com. The results are numerous and large, so remember to stop the script once you have what you need.
import requests
import os
from lxml import etree

url = 'http://sc.chinaz.com/jianli/free.html'
if not os.path.exists('./moban'):
    os.mkdir('./moban')
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.56'
}
if __name__ == '__main__':
    while True:
        print(url)
        response = requests.get(url, headers=header)
        response.encoding = 'utf-8'
        page_text = response.text
        detail_tree = etree.HTML(page_text)
        mo_list = detail_tree.xpath('//div[@class="bggray clearfix pt20"]/div[3]/div/div/div')
        for src in mo_list:
            mo_url = src.xpath('./a/@href')[0]
            name = src.xpath('./a/img/@alt')[0]
            # name = name.encode('iso-8859-1').decode('gbk')
            detail_text = requests.get('http:' + mo_url, headers=header).text
            tree = etree.HTML(detail_text)
            # download link of the template archive
            source = tree.xpath('//div[@class="bggray clearfix"]/div[2]/div[2]/div[1]/div[@class="down_wrap"]/div[2]//li/a/@href')[0]
            resource = requests.get(source, headers=header).content
            with open('./moban/' + name + '.rar', 'wb') as s:
                s.write(resource)
            print(name)
            print(source)
        # follow the "next page" link until there is none
        next_list = detail_tree.xpath('//div[@class="bggray clearfix pt20"]/div[4]/div/a[@class="nextpage"]/@href')
        if not next_list:
            break
        url = 'http://sc.chinaz.com/jianli/' + next_list[0]
⑤ Scraping the list of 985/211 universities
from lxml import etree
from openpyxl import Workbook
import requests

url = 'https://www.dxsbb.com/news/2799.html'
headers = {
    'content-type': 'text/html',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50',
    'cookie': 'Hm_lvt_0fde2fa52b98e38f3c994a6100b45558=1677049250; Hm_lpvt_0fde2fa52b98e38f3c994a6100b45558=1677049250; acw_tc=700f252016770492503528680e5b8a7a7c3dbef48c9a79df43a5c8500e; ASPSESSIONIDSUACRRTD=GGJDBFJCCEIBIEOLADMHDBHO; __bid_n=18677ecc366d72a0a54207; FPTOKEN=2rV/4xrYE0/G+5iuP2rdN3LcLKqO0dBcg8kl0QMiF36A2ht7MI4OzTeCi/5ACge9CN/oaNo7VJQbKVZ7BxLfiqc3GaEavH4PCZ5LpTNaumoettr0Hhql9EhdTY2QSdZVM5Ip2YMRgb0f/HuFPNShu+HF5dNuYBg/zck5CAullAvW5Z9K7YEOU5zrafBzIH+iIauoJTwF6RUuIjUPdQR4seER4DsiVNzjGfrAwtHG61qGq1DnPGGdSzcQt5xPBsAzFt2rYOUL+IQ+1OefIDjyC2hBdFUVeyszHycoxO6zjIZiHI1CqH+F/MTxGIjJXa66h+UDqu4lmT45t0TVjJtBrbTbaXYPMkEgO8wX2D9f5sPLesAPZnRMGi9hCHTnORnbkjkhw//zdSWy0XQ9Cc7cxw==|FLUiRm1mfnZW2vuT8Zdr/AEdVwZ19V05jcU4xoEniJg=|10|e90907472456a902878502f635afb7ac',
}
text_page = requests.get(url, headers=headers)
text_page.encoding = 'gbk'  # the page is gbk-encoded
text_page = text_page.text
tree = etree.HTML(text_page)
li_list = tree.xpath('//div[@class="tablebox"][2]/table//tr')
ll = len(li_list)
workbook = Workbook()
sheet = workbook.active
sheet.title = "1"
for i in range(ll):
    row = li_list[i].xpath('./td')
    row_text = []
    # the cells contain nested a tags, so a plain ./text() would miss part of the text
    for td in row:
        text_list = td.xpath('.//text()')
        row_text.append(''.join(text_list))
    print(row_text)
    sheet.append(row_text)
workbook.save(filename="school2.xlsx")