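Scraping PPT Templates with Python
This post scrapes the PPT templates listed at http://www.ypppt.com/moban/. The site has some anti-scraping measures, so the URLs need careful analysis before the scrape works correctly!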
# -*- coding: utf-8 -*-
# @Time: 2020-08-13 16:43
# @Author: 来瓶安慕嘻
# @File: 免费简历爬取.py
# Let's start a wonderful day @Q_Q@
import requests
import os
from lxml import etree
import re

if __name__ == "__main__":
    url = 'http://www.ypppt.com/moban/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text

    # Create the folder that stores the PPT templates
    if not os.path.exists('./ppt模板'):
        os.mkdir('./ppt模板')

    # Build the etree object
    tree = etree.HTML(page_text)
    # li_list holds the <li> element of each PPT template on the first page
    li_list = tree.xpath('//ul[@class="posts clear"]/li')

    # Walk through each li and extract the URL and name of that PPT
    for li in li_list:
        ppt_url = 'http://www.ypppt.com' + li.xpath('./a[1]/@href')[0]
        ppt_name = li.xpath('./a[2]/text()')[0]
        # print(ppt_url)
        # print(ppt_name)

        # Fetch the page of each PPT and find the URL of its download entry
        ppt_response = requests.get(url=ppt_url, headers=headers)
        ppt_response.encoding = 'utf-8'
        ppt_text = ppt_response.text
        ppt_tree = etree.HTML(ppt_text)
        load_path = 'http://www.ypppt.com' + ppt_tree.xpath('//div[@class="button"]/a/@href')[0]

        # Fetch the download-entry page and locate the download button
        load_response = requests.get(url=load_path, headers=headers)
        load_response.encoding = 'utf-8'
        final_text = load_response.text
        final_tree = etree.HTML(final_text)
        final_url = final_tree.xpath('//ul[@class="down clear"]/li[1]/a/@href')[0]

        # The site uses a simple anti-scraping measure here: some download links are
        # relative, e.g. /uploads/soft/200810/1-200Q0113H8.zip
        # while others are absolute, e.g. http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip
        # so a regular expression decides whether the site prefix must be added
        if not re.match(r'https?:', str(final_url)):
            final_url = 'http://www.ypppt.com' + final_url

        # Request the download; the zip file also comes back as binary content
        final_ppt = requests.get(url=final_url, headers=headers).content

        # Save the downloaded PPT
        with open('./ppt模板/' + ppt_name + '.zip', 'wb') as fp:
            fp.write(final_ppt)
        print(ppt_name + '----download complete')

    print('来瓶安慕嘻: crawl finished!')
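The script handles the mixed relative and absolute download links with a regular expression. As an alternative sketch (not what the script itself does), the standard library's `urllib.parse.urljoin` can cover both cases in one call. The helper name `absolutize` and the constant `BASE_URL` are made up for this illustration.

```python
from urllib.parse import urljoin

# Assumption: the site root used throughout the script.
BASE_URL = 'http://www.ypppt.com'


def absolutize(href, base=BASE_URL):
    """Return an absolute URL for href, whether it is relative or already absolute."""
    # urljoin leaves absolute URLs untouched and resolves relative ones against base.
    return urljoin(base + '/', href)


if __name__ == "__main__":
    print(absolutize('/uploads/soft/200810/1-200Q0113H8.zip'))
    print(absolutize('http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip'))
```

Both calls should print the same absolute link, so the `if` check on `final_url` would no longer be needed if this helper were used.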
Result of the crawl:
The folder looks as shown in the figure above!
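The files in that folder are named after the template titles scraped from the page. If a title ever contains a character that is not allowed in a file name, `open` may fail. Here is a minimal sketch of cleaning the name first; `safe_filename` is a hypothetical helper, and the character set is an assumption based on the characters Windows forbids in file names.

```python
import re


def safe_filename(name, replacement='_'):
    """Replace characters that are unsafe in file names and trim whitespace."""
    # Assumption: these are the characters to avoid (forbidden on Windows).
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a placeholder if nothing is left.
    return cleaned or 'untitled'


if __name__ == "__main__":
    print(safe_filename('a/b:c'))
```

In the script this would wrap `ppt_name` before it is joined into the output path.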
Note: please don't scrape the site maliciously; use this only for learning how crawlers work~
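One simple way to stay polite is to pause between requests. The sketch below is only an illustration: `polite_pause` is a hypothetical helper, and the 1 to 3 second range is an assumed value, not something the site documents.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen delay in seconds."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    waited = polite_pause(0.1, 0.2)
    print('waited', round(waited, 2), 'seconds')
```

Calling `polite_pause()` at the end of each loop iteration in the script would space out the downloads.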