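A classmate's first attempt got me started, so I wrote this script to download PPT slides from MOOC.
The main code is below. The key point is that you have to add your cookie when you use it.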
import os
import requests

# Page image URL; {} is filled in with the 1-based page number.
url = 'https://cnmooc.org/view/doc.mooc?viewer=html&resid=174814&format=jpg&start={}'
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
    "cookie": "add your cookies here",
    "Host": "cnmooc.org",
}

directory_name = input("Directory name: ")
page = int(input("Max pages: "))

if os.path.exists(directory_name):
    print("There is already a directory named {}".format(directory_name))
    # Try name_0, name_1, ... until an unused name is found.
    base_name = directory_name
    num = 0
    while True:
        directory_name = "{}_{}".format(base_name, num)
        try:
            os.mkdir(directory_name)
        except FileExistsError:
            num += 1
        else:
            break
else:
    os.mkdir(directory_name)
print("All files are saved in {}".format(directory_name))

for i in range(1, page + 1):
    try:
        response = requests.get(url.format(i), headers=header)
        # Save the raw response body as <page number>.jpg.
        with open('./{}/{}.jpg'.format(directory_name, i), 'wb') as file:
            file.write(response.content)
    except Exception as e:
        # Stop at the first network or file error.
        print(e)
        break
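The script builds the URL for each page one at a time. Be sure to check that the cookie is still valid, otherwise what gets downloaded is not an image.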
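One way to catch an invalid cookie early is to look at the first bytes of each response before saving it. The sketch below is my own addition and not part of the original script: `looks_like_jpeg` is a hypothetical helper, and it assumes the server really returns JPEG data when `format=jpg` succeeds and something else (for example an HTML page) when the cookie is rejected.

```python
# Hypothetical helper, not part of the original script.
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files begin with these three bytes


def looks_like_jpeg(data):
    """Return True if the byte string starts with the JPEG signature."""
    return data[:3] == JPEG_MAGIC
```

In the download loop, this could be called as `looks_like_jpeg(response.content)` before the file is written, stopping with a message to refresh the cookie when it returns `False`.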
The finished download is available here: download link, extraction code: klv8
Now that we know how to scrape the PPT for 基电实验, we can scrape any other PPT resource on MOOC the same way:
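- Open the Network panel on the MOOC page and find the corresponding url
- Then use the code above and add your own cookie
- Enter a directory name of your choice and look up how many pages the PPT has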
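The steps above mean swapping in a different url for each resource. As a rough sketch, a url copied from the Network panel can be turned into per-page urls by rewriting its `start` query parameter. The helper name `page_url` is made up for illustration, and the sketch assumes other resources use the same `start` parameter for the page number as the url in this post does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(copied_url, page):
    """Return copied_url with its `start` query parameter set to `page`."""
    parts = urlsplit(copied_url)
    # Keep every query parameter except any existing `start`.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "start"]
    query.append(("start", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `page_url(copied, i)` could replace `url.format(i)` in the loop, so the copied url does not need to be edited by hand into a `{}` template.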
Important: the cookie has to be replaced after a while.
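Since the cookie has to be swapped regularly, it may be convenient to keep it out of the script itself. The following is only a sketch under my own assumptions: `build_header` and the environment variable name `CNMOOC_COOKIE` are hypothetical and do not come from the original code.

```python
import os


def build_header(cookie=None, env_var="CNMOOC_COOKIE"):
    """Build the request headers, reading the cookie from the environment if not given."""
    if cookie is None:
        # Hypothetical variable name; set it in the shell before running the script.
        cookie = os.environ.get(env_var, "")
    if not cookie:
        raise ValueError("no cookie given; set {} or pass cookie=...".format(env_var))
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
        "cookie": cookie,
        "Host": "cnmooc.org",
    }
```

With this, `header = build_header()` picks up a fresh cookie on every run, and the script fails right away with a clear message when no cookie has been set.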