Scraping China University MOOC Videos with Selenium + Python + Chrome

Siven_L

Tags: selenium, python, web scraping

Article link: https://blog.csdn.net/weixin_41965056/article/details/86307030

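Winter break is coming, so I am scraping some MOOC courses to watch at home.
The course I scraped is Peking University's 离散数学概论 (Introduction to Discrete Mathematics).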

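There are ready-made programs on GitHub, but I came to programming late and am not comfortable crafting HTTP requests by hand, so I took the blunt approach and drove a real browser with selenium.
I parse the pages with BeautifulSoup.
The idea is simple: on the courseware page, go through every chapter, every lesson in each chapter, and every unit in each lesson, and grab the video from each unit. That is just three nested loops.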
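
The nested traversal can be sketched without a browser. The snippet below is a minimal illustration, assuming the course outline has already been collected into a plain dictionary. The `outline` data and the `iter_video_units` name are hypothetical and not part of the original script. It walks chapter, lesson, and unit in that order and keeps only units whose title starts with 视频 ("video"), which is the same filter the script applies.

```python
def iter_video_units(outline):
    """Yield (chapter, lesson, unit) for every video unit in a nested outline.

    `outline` maps chapter name -> {lesson name -> [unit titles]}.
    """
    for chapter_name, lessons in outline.items():
        for lesson_name, units in lessons.items():
            for unit_title in units:
                # Keep only video units, like the "^视频" regex in the script.
                if unit_title.startswith("视频"):
                    yield chapter_name, lesson_name, unit_title


# Hypothetical sample data, for illustration only.
outline = {
    "Chapter 1": {
        "Lesson 1.1": ["视频：Intro", "文档：Slides"],
        "Lesson 1.2": ["视频：Sets"],
    },
    "Chapter 2": {
        "Lesson 2.1": ["测验：Quiz"],
    },
}

found = list(iter_video_units(outline))
```

In the real script each level has to be read from the live page after a click, which is why the loops there re-parse `browser.page_source` at every step.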
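Some difficulties I ran into:

Both boxes in the courseware section are hidden, so before simulating a click in the browser I have to change the element's display value with JavaScript.
The elements' id attributes change on every click, so I locate elements by the title attribute or other attributes instead of by id.
I also could not get the scrape to work in headless mode. This is probably a problem with my environment; I don't know whether anyone else has run into it.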
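
The first two points can be sketched as small helpers. This is a minimal sketch, not the author's code: `reveal_js`, `xpath_literal`, and `title_xpath` are hypothetical names. `reveal_js` builds the JavaScript that un-hides the n-th `div.down` box. `title_xpath` builds a locator on the `title` attribute and falls back to XPath `concat()` when a title contains both quote characters, since XPath 1.0 string literals have no escape syntax.

```python
def reveal_js(index):
    """Return JavaScript that makes the index-th hidden div.down box visible."""
    return 'document.querySelectorAll("div.down")[%d].style.display="block";' % index


def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return "'%s'" % text
    if '"' not in text:
        return '"%s"' % text
    # Both quote types present: split on ' and glue the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % p for p in parts) + ")"


def title_xpath(tag, title):
    """Build an XPath that selects <tag> elements by their title attribute."""
    return "//%s[@title=%s]" % (tag, xpath_literal(title))
```

The resulting strings would be passed to `browser.execute_script` and to the XPath lookup, in place of the string concatenation used in the script.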

Code:

# -*- coding:utf-8 -*-
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import json
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--disable-gpu")
# Selenium 4 style: the driver path goes through Service, and elements are
# located with find_element(By.XPATH, ...). On Selenium 3 the older
# executable_path= and find_element_by_xpath() calls apply instead.
browser = webdriver.Chrome(service=Service(executable_path='G:\\chromedriver.exe'), options=chrome_options)
browser.get('https://www.icourse163.org/learn/NJTU-1002530017#/learn/content?type=detail&id=1004513821')    # target page
time.sleep(3)
video = {}    # maps "chapter lesson unit" to the video URL

# Read the chapter list from the breadcrumb dropdown.
soup = BeautifulSoup(browser.page_source, 'html.parser')
c_l = soup.find("div", attrs={"class": "j-breadcb f-fl"})
chapter_all = c_l.find("div", attrs={"class": "f-fl j-chapter"})
chapter = chapter_all.find_all("div", attrs={"class": "f-thide list"})
for chap in chapter:
    # The chapter box is hidden, so make it visible before clicking.
    js = 'document.querySelectorAll("div.down")[0].style.display="block";'
    browser.execute_script(js)
    chapter_name = chap.text
    # ids change on every click, so locate the entry by its title instead.
    a = browser.find_element(By.XPATH, "//div[@title = '"+chapter_name+"']")
    a.click()
    time.sleep(3)
    # Re-parse the page to get the lessons of the selected chapter.
    soup1 = BeautifulSoup(browser.page_source, 'html.parser')
    c_l1 = soup1.find("div", attrs={"class": "j-breadcb f-fl"})
    lesson_all = c_l1.find("div", attrs={"class": "f-fl j-lesson"})
    lesson = lesson_all.find_all("div", attrs={"class": "f-thide list"})
    for les in lesson:
        # The lesson box is the second hidden div.down.
        js1 = 'document.querySelectorAll("div.down")[1].style.display="block";'
        browser.execute_script(js1)
        lesson_name = les.text
        b = browser.find_element(By.XPATH, "//div[@title = '"+lesson_name+"']")
        b.click()
        time.sleep(3)
        soup2 = BeautifulSoup(browser.page_source, 'html.parser')
        units = soup2.find_all("li", attrs={"title": re.compile(r"^视频")})    # video units only
        for unit in units:
            video_name = unit.get("title")
            video_link = browser.find_element(By.XPATH, "//li[@title = '"+video_name+"']")
            video_link.click()
            time.sleep(3)
            soup2 = BeautifulSoup(browser.page_source, 'html.parser')
            video_src = soup2.find("source")
            if video_src is None:
                # No <source> tag was found for this unit; skip it.
                continue
            video[chapter_name + " " + lesson_name + video_name] = video_src.get("src")
browser.quit()
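
The script imports `json` but never writes the `video` dictionary anywhere, so the collected links are lost when it exits. A minimal sketch of persisting them follows. The `save_links` and `load_links` names and the `videos.json` file name are assumptions, not part of the original.

```python
import json


def save_links(video, path):
    """Write the {title: url} mapping to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file.
        json.dump(video, f, ensure_ascii=False, indent=2)


def load_links(path):
    """Read the mapping back, e.g. for a separate download step."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_links(video, 'videos.json')` at the end of the script would keep the results for a later download step.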