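Python Baidu Wenku Crawler: PPT Files
I cover each file type in its own article. Link:
I. Page Analysis
The content of a PPT file is actually served as a series of images, so all we need to do is download those images and save them.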
from IPython.display import Image
Image("./Images/ppt_0.png",width="600px",height="400px")
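Since the whole task comes down to downloading images and saving them, here is a minimal sketch of that step. The helper names `fetch_bytes` and `save_images`, the `.jpg` extension, and the page-number file naming are assumptions made for illustration, not part of the original program. It uses only the standard library; `requests.get(url).content` would serve the same purpose.

```python
import os
import urllib.request
from typing import Callable, Iterable, List


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def save_images(urls: Iterable[str], out_dir: str,
                fetch: Callable[[str], bytes] = fetch_bytes) -> List[str]:
    """Save each image as <out_dir>/<page number>.jpg and return the paths.

    The .jpg extension is an assumption; adjust it if the server
    returns a different image format.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for page, url in enumerate(urls, start=1):
        path = os.path.join(out_dir, f"{page}.jpg")
        with open(path, "wb") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths
```

Passing the downloader in as the `fetch` argument keeps the saving logic testable without network access, for example `save_images(urls, "out", fetch=lambda u: b"fake")`.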
II. Data Link
Image("./Images/ppt_1.png",width="600px",height="400px")
Opening the link shows the same data that we saw on the page:
Image("./Images/ppt_3.png",width="600px",height="400px")
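Judging from the pattern this article matches later (`{"zoom":"...","page"`), each page entry in the data link's response seems to carry a `zoom` field holding an image URL. Assuming the response body is valid JSON, the following sketch collects those fields with the `json` module instead of a regular expression. The function name `collect_zoom_urls` and the sample payload are hypothetical; the real response may be shaped differently.

```python
import json
from typing import Any, List


def collect_zoom_urls(payload: Any) -> List[str]:
    """Recursively collect every string stored under a "zoom" key."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key == "zoom" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_zoom_urls(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(collect_zoom_urls(item))
    return found


# Hypothetical sample shaped like the pattern above; not real response data.
sample = r'[{"zoom":"https:\/\/example.com\/a?pn=1","page":1},{"zoom":"https:\/\/example.com\/a?pn=2","page":2}]'
urls = collect_zoom_urls(json.loads(sample))
print(urls)
```

Because `json.loads` decodes the escaped `\/` sequences itself, no separate step is needed to strip backslashes from the URLs.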
III. Debugging the Program
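import json
import re
import requests

# Reuse one session so cookies persist across requests
session=requests.session()
url=input("Enter the URL of the file to download: ")
# Decode the page source as GBK
content=session.get(url).content.decode('gbk')
# The document id is the part of the URL between "view/" and ".html"
doc_id=re.findall(r'view/(.*?)\.html',url)[0]
# The document type and title are embedded in the page source
types=re.findall(r"docType.*?:.*?'(.*?)'",content)[0]
title=re.findall(r"title.*?:.*?'(.*?)'",content)[0]
Enter the URL of the file to download: https://wenku.baidu.com/view/b906673ed1d233d4b14e852458fb770bf68a3b18.html?fr=search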
doc_id
'b906673ed1d233d4b14e852458fb770bf68a3b18'
types
'ppt'
title
'精品课件-爬虫技术'
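Indexing `re.findall(...)[0]` raises an `IndexError` whenever a pattern is missing from the page. As a sketch of a more defensive variant, the helper below wraps the same three patterns and returns `None` for anything it cannot find. The names `first_match` and `parse_doc_info` and the sample page snippet are hypothetical, introduced only for illustration.

```python
import re
from typing import Dict, Optional


def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group, or None if the pattern is absent."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


def parse_doc_info(url: str, html: str) -> Dict[str, Optional[str]]:
    """Extract the document id from the URL, and type and title from the page."""
    return {
        "doc_id": first_match(r"view/(.*?)\.html", url),
        "type": first_match(r"docType.*?:.*?'(.*?)'", html),
        "title": first_match(r"title.*?:.*?'(.*?)'", html),
    }


sample_url = "https://wenku.baidu.com/view/b906673ed1d233d4b14e852458fb770bf68a3b18.html?fr=search"
# Hypothetical page snippet; the real page source is much larger.
sample_html = "var info = {docType: 'ppt', title: 'demo slides'};"
info = parse_doc_info(sample_url, sample_html)
print(info)
```

A caller can then check `info["type"] == "ppt"` before requesting the image list, instead of catching an exception.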
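# Data link that returns the image URL of every page in the document
content_url='https://wenku.baidu.com/browse/getbcsurl?doc_id='+doc_id+'&pn=1&rn=9999&type=ppt'
content=session.get(content_url).content.decode('gbk')
# Each page entry has the form {"zoom":"<image URL>","page":...}
url_list=re.findall(r'\{"zoom":"(.*?)","page"',content)
# Strip the backslashes that escape "/" in the response text
url_list=[addr.replace('\\','') for addr in url_list]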
url_list
['https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=1&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=0-487135&jpg=0-133194',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=2&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=487136-641849&jpg=133195-323627',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=3&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=641850-727977&jpg=323628-515968',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=4&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=727978-856818&jpg=515969-615553',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=5&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=856819-942946&jpg=615554-777861',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=6&o=jpg_6&md5sum=f2be3d1fed17d9e67fa325fbdfbafa6c&sign=c05e1cdb4f&png=942947-1029074&jpg=777862-946638',
'https://wkretype.bdimg.com/retype/zoom/0836f08558f5f61fb636661b?pn=7&o=jp