I. Background
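I have been learning Python recently. After working through the basic tutorial on 菜鸟教程 (Runoob), I moved on to the video courses at 极客学院 (Jikexueyuan) to get some hands-on practice.
Downloading Jikexueyuan videos requires an annual membership. The desktop client does support batch downloads, but the downloaded files have no directory structure, and their file names and extensions are hidden, so they can only be watched inside the client. The client itself is not very user-friendly: it cannot group videos by course, every lesson ends up in one flat list, and the accompanying course materials are incomplete.
I happened to come across web crawling, which solves exactly this problem: download everything automatically, bundle the course materials with the videos, and save it all in a multi-level "type/stage/course/video" directory tree. That saves a lot of effort.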
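Why write an automated script? The reasons are:
1 Downloading a single video takes at least two clicks
2 The file names are just strings of codes
3 There is no batch download, so downloading a whole course is extremely tedious
That is about all the annual membership offers: apart from the download and caching privileges, the service is mediocre.
Enough talk, let's go.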
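Note: this example is based on Python 3 and was developed on Ubuntu 14.04, which ships with both Python 2 and Python 3. Typing "python" in the console starts Python 2 by default; use the "python3" command to run Python 3.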
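Because the plain "python" command starts Python 2 on this setup, a script written for Python 3 can fail in confusing ways when launched with the wrong interpreter. A minimal sketch of an early guard follows; the helper name is_python3 is hypothetical and is not part of the scripts in this article.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


if not is_python3():
    # Stop early with a hint instead of failing later on Python 3 only code.
    sys.exit("Please run this script with python3")
```

Placing a check like this at the top of the entry script makes the mistake obvious right away.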
II. Fetching the Course Paths
Crawl every course path listed on Jikexueyuan's knowledge-system map and collect the link to each one.
http://www.jikexueyuan.com/path/
PathSpider.py
#!/usr/bin/python3
import requests
import SpiderUtil
from lxml import etree
from CoursePathSpider import CoursePathSpider


class PathSpider(object):
    """Crawler for the course path map."""
    URL_PATH = 'http://www.jikexueyuan.com/path/'
    # Every course path is one <a> element on the page
    XPATH_PATH_LINK = '//a[@class="pathlist-one cf"]'
    # Name and introduction, relative to that <a> element
    XPATH_PATH_NAME = 'div[@class="pathlist-txt"]/h2/text()'
    XPATH_PATH_INTRO = 'div[@class="pathlist-txt"]/p/text()'

    def __init__(self):
        super(PathSpider, self).__init__()
        self.path_info_list = []
        self.response = None
        self.selector = None

    def parse_html(self):
        print("Fetching the course path list...")
        # try:
        self.response = requests.get(PathSpider.URL_PATH)
        # print(self.response.text)
        self.selector = etree.HTML(self.response.text)
        for link_ele in self.selector.xpath(PathSpider.XPATH_PATH_LINK):
            self.path_info_list.append(_PathInfo(link_ele))
        # except Exception:
        #     print("Connection error")
        #     # exit()
        # else:
        #     pass

    def show(self):
        print("Found", len(self.path_info_list), "learning paths:")
        # Number the paths starting from 1 for display
        for n, info in enumerate(self.path_info_list, start=1):
            print(str(n) + ".", info.name)

    def show_detail(self, index):
        if SpiderUtil.is_valid_index(index, len(self.path_info_list)):
            self.path_info_list[index].show()
            return "OK"
        else:
            return "error"

    def download(self, index):
        if SpiderUtil.is_valid_index(index, len(self.path_info_list)):
            print("Starting download:", self.path_info_list[index].name)
            self.path_info_list[index].download()
            return "OK"
        else:
            return "error"
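

class _PathInfo(object):
    """Name, introduction and link of one course path."""

    def __init__(self, selector):
        super(_PathInfo, self).__init__()
        # print("Fetching the course path list...")
        self.selector = selector
        self.name = selector.xpath(PathSpider.XPATH_PATH_NAME)[0]
        # xpath() returns a list of text nodes; it is kept as a list here
        self.intro = selector.xpath(PathSpider.XPATH_PATH_INTRO)
        self.url = selector.xpath('@href')[0]

    def show(self):
        print("Course:", self.name)
        print("Intro:", self.intro)
        print("Link:", self.url)

    def sub_spider(self):
        spider = CoursePathSpider(self.url, self.name)
        return spider

    def download(self, path='.'):
        # CoursePathSpider.download() requires a target directory;
        # default to the current directory so callers may omit it
        self.sub_spider().download(path)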
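The two-step lookup used above (first select every link element, then read the name, introduction and href relative to each one) can be tried offline. The sketch below is only an illustration under stated assumptions: the markup is a simplified, made-up stand-in for the real page, the example.com URLs are placeholders, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath, so findtext() and get() take the place of the text() and @href steps that lxml accepts.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup: a simplified, well-formed stand-in for the real page.
SAMPLE_HTML = """
<div id="container">
  <a class="pathlist-one cf" href="http://example.com/path/python/">
    <div class="pathlist-txt">
      <h2>Python</h2>
      <p>Sample introduction</p>
    </div>
  </a>
  <a class="pathlist-one cf" href="http://example.com/path/web/">
    <div class="pathlist-txt">
      <h2>Web</h2>
      <p>Another introduction</p>
    </div>
  </a>
</div>
"""


def extract_paths(markup):
    """Return one dict per link element, mirroring the fields of _PathInfo."""
    root = ET.fromstring(markup)
    paths = []
    # Step 1: select every link element anywhere in the document.
    for link in root.findall(".//a[@class='pathlist-one cf']"):
        # Step 2: evaluate the remaining lookups relative to that element.
        paths.append({
            "name": link.findtext("div[@class='pathlist-txt']/h2"),
            "intro": link.findtext("div[@class='pathlist-txt']/p"),
            "url": link.get("href"),
        })
    return paths


for info in extract_paths(SAMPLE_HTML):
    print(info["name"], info["url"])
```

Keeping the inner expressions relative to the link element is what lets each _PathInfo object read its own name and URL without matching the fields of its neighbours.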
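III. Parsing a Course Path
Parse an individual course path page, group its lessons by chapter, and collect the link to each lesson's videos.
The result is a two-level directory structure, as shown below: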
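- Python快速入门 (Python Quick Start)
    - Python语言集成开发环境搭建 (Setting Up a Python IDE)
    - Python语言基本语法 (Basic Python Syntax)
    - Python语言Web开发框架web2py (The web2py Web Framework)
- Python初级课程 (Python Beginner Courses)
    ……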
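When the videos are downloaded, this hierarchy is used as the storage path, which yields a nested directory tree that is easy to browse.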
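As a rough illustration of turning such a hierarchy into storage paths, the sketch below builds one directory per lesson with os.path.join and os.makedirs. The function name build_lesson_dirs, the dictionary layout and the sample names are all assumptions made for this example; they are not taken from the scripts in this article.

```python
import os
import tempfile


def build_lesson_dirs(root, course_name, chapters):
    """Create root/course/chapter/lesson directories and return their paths."""
    created = []
    for chapter_name, lessons in chapters.items():
        for lesson_name in lessons:
            lesson_dir = os.path.join(root, course_name, chapter_name, lesson_name)
            # exist_ok=True makes the call safe to repeat on a later run.
            os.makedirs(lesson_dir, exist_ok=True)
            created.append(lesson_dir)
    return created


# Placeholder names standing in for the chapters and lessons of one course path.
sample_chapters = {
    "Python Quick Start": ["Setting Up a Python IDE", "Basic Python Syntax"],
}

with tempfile.TemporaryDirectory() as tmp:
    for lesson_dir in build_lesson_dirs(tmp, "Python", sample_chapters):
        # Print the path relative to the temporary root for readability.
        print(os.path.relpath(lesson_dir, tmp))
```

Real lesson titles may contain characters that are not allowed in file names (a slash, for instance), so a real downloader would probably need to sanitize each name before joining it into a path.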
The code is as follows:
CoursePathSpider.py
#!/usr/bin/python3
import requests
import SpiderUtil
import os
from lxml import etree
from LessonVideoSpider import VideoSpider


class CoursePathSpider(object):
    """Parser for a course path page."""
    # XPath of each chapter block
    XPATH_CHAPTER = '//*[@id="container"]/div/div[@class="pathstage mar-t30"]'
    # Chapter name
    xpath_chapter_name = 'div[@class="pathstage-txt"]/h2/text()'
    # Lesson list under a chapter
    xpath_chapter_lesson_list = 'div/div[@class="stagewidth lesson-list"]/ul[@class="cf"]/li'
    # Lesson name and link
    xpath_lesson_name = 'div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/text()'
    xpath_lesson_link = 'div[@class="lesson-infor"]/h2[@class="lesson-info-h2"]/a/@href'

    def __init__(self, url, simple_name):
        super(CoursePathSpider, self).__init__()
        self.url = url
        self.simple_name = simple_name
        self.response = None
        self.selector = None
        self.title = ''
        self.chapter_list = []

    def parse_html(self):
        print("Opening URL:", self.url)
        self.response = requests.get(self.url)
        print("Processing the response...")
        self.selector = etree.HTML(self.response.text)
        self.title = self.selector.xpath('//title/text()')[0]
        if self.simple_name == '':
            # Fall back to the page title, truncated to at most 10 characters
            self.simple_name = self.title[:10]
        print("Course name:", self.title)
        for chapterEle in self.selector.xpath(CoursePathSpider.XPATH_CHAPTER):
            self.add_chapter(_Chapter(chapterEle))

    def add_chapter(self, chapter):
        if isinstance(chapter, _Chapter):
            self.chapter_list.append(chapter)
        else:
            raise ValueError("chapter is not an instance of _Chapter")

    def download(self, path, index='a'):
        # Store everything under <path>/<course name>
        path = path + "/" + self.simple_name
        if SpiderUtil.is_all(index):
            print("Downloading the complete path")
            self.download_all(path)
        else:
            index2 = SpiderUtil.is_valid_index(index