[China University MOOC] Scraping MOOC Subtitles

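I was planning to take notes while watching videos on China University MOOC when I noticed that the site ships its own subtitles (though not every course has them). So I looked for a way to scrape them and use them as a rough draft for my notes.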

Hoping to reuse what already existed, I grabbed a few scripts from around the web, but none of them worked.

So I wrote my own. Compared with the scripts I found online, it is shorter, and it actually works.

import requests

# Example URL: https://www.icourse163.org/learn/BIT-1001870002?tid=1472922453#/learn/content?type=detail&id=1259485818
course_id = 1472922453  # the tid value from the course URL
# Paste your cookie string here
cookies = 'XXX'
# Copy the value of the "NTESSTUDYSI" field from your cookies
csrfKey = "XXXX"
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
    'cookie': cookies,
}

def get_Units(term_id):
    # Accept the term id as either a string or an int
    if isinstance(term_id, str):
        term_id = int(term_id)
    url = f'https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey={csrfKey}'
    data = {
        'termId': term_id
    }
    html = requests.post(url, headers=header, data=data).json()
    chapters = html['result']['mocTermDto']['chapters']
    lessonsIds = []
    contentIds = []
    names = []
    # Walk chapters -> lessons -> units and keep only units with contentType == 1,
    # the type whose subtitles are requested in get_Subtitle
    for chapter in chapters:
        for lesson in chapter["lessons"]:
            for unit in lesson['units']:
                if unit['contentType'] == 1:
                    contentIds.append(unit['contentId'])
                    lessonsIds.append(unit['id'])
                    names.append(unit['name'])
    return lessonsIds, contentIds, names

def get_Subtitle(lessonsId, contentId):
    url = f"https://www.icourse163.org/mm-course/web/j/mocCourseBean.getVideoSubtitle.rpc?csrfKey={csrfKey}"
    # Send the ids as plain values rather than wrapping each one in a set
    data = {
        "lessonUnitId": lessonsId,
        "videoId": contentId
    }

    segs = requests.post(url, headers=header, data=data).json()['result']['mergedSentences']

    # Join every sentence of every segment, ending each with a full stop
    sentences = ''
    for seg in segs:
        for sub in seg['sentences']:
            sentences += sub['text'] + "。"
    return sentences

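if __name__ == '__main__':
    lessonsIds, contentIds, names = get_Units(course_id)
    # Fetch the subtitle text of each unit, one request per unit
    Subs = [get_Subtitle(lessonsId, contentId)
            for lessonsId, contentId in zip(lessonsIds, contentIds)]

    # One block per unit: the unit name, then its subtitle text
    Res = ''
    for name, sub in zip(names, Subs):
        Res += name + "\n" + sub + "\n"

    # Append everything to <course_id>.txt
    with open(f"{course_id}.txt", 'a', encoding="utf-8") as file:
        file.write(Res)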
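
To try the parsing logic without a login or any network access, here is a minimal offline sketch. It assumes the two responses have only the shape the script above reads (`result.mocTermDto.chapters[].lessons[].units[]` and `result.mergedSentences[].sentences[].text`). The sample payloads and the helper names `extract_video_units` and `join_sentences` are made up for illustration; they are not part of the site's API.

```python
# Hypothetical sample payloads: they contain only the keys the script reads,
# and every value is invented for illustration.
sample_term = {
    "result": {"mocTermDto": {"chapters": [
        {"lessons": [
            {"units": [
                {"id": 101, "contentId": 9001, "contentType": 1, "name": "1.1 Intro"},
                {"id": 102, "contentId": 9002, "contentType": 3, "name": "1.1 Slides"},
            ]},
        ]},
        {"lessons": [
            {"units": [
                {"id": 201, "contentId": 9003, "contentType": 1, "name": "2.1 Basics"},
            ]},
        ]},
    ]}}
}

sample_subtitle = {
    "result": {"mergedSentences": [
        {"sentences": [{"text": "hello"}, {"text": "world"}]},
        {"sentences": [{"text": "bye"}]},
    ]}
}


def extract_video_units(term_json):
    """Return (unit id, content id, name) for every unit with contentType == 1."""
    found = []
    for chapter in term_json["result"]["mocTermDto"]["chapters"]:
        for lesson in chapter["lessons"]:
            for unit in lesson["units"]:
                if unit["contentType"] == 1:
                    found.append((unit["id"], unit["contentId"], unit["name"]))
    return found


def join_sentences(subtitle_json):
    """Concatenate all sentence texts, ending each with a full stop."""
    segments = subtitle_json["result"]["mergedSentences"]
    return "".join(sub["text"] + "。" for seg in segments for sub in seg["sentences"])


video_units = extract_video_units(sample_term)
subtitle_text = join_sentences(sample_subtitle)
print(video_units)
print(subtitle_text)
```

Returning each unit as one tuple keeps the ids and the name together, so they cannot drift out of step the way three parallel lists can.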