Python Web Scraper: Collecting Data from China University MOOC (Datasets Included)

1. Overview

This post was written by a second-year undergraduate. It scrapes three kinds of data from the China University MOOC site (icourse163.org): course categories, courses, and course reviews. The datasets are linked below for anyone who wants them; the main goal is to get feedback, so corrections and suggestions on the code are very welcome.

2. Datasets

Course review dataset: 343,525 rows (review id, review timestamp, reviewing user's id, reviewing user's nickname, review text, and the id of the course the review belongs to)

Course dataset: 29,196 rows (course id, course name, enrollment count, instructor name, university, start date, end date, and the id of the course category it belongs to)

Course category dataset: 23 rows (category id, category name, and a short category description)
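
As a quick way to explore the three files together, they can be joined on their id columns. The following sketch is not from the original post; the file names (comments.csv, courses.csv, channel_data.csv) and the column names course_id / category_id are assumptions, so adjust them to match the actual headers in the download.

import pandas as pd

# Hypothetical file names; rename to match the downloaded datasets.
comments = pd.read_csv("comments.csv")
courses = pd.read_csv("courses.csv")
categories = pd.read_csv("channel_data.csv")

# Reviews reference their course, and courses reference their category,
# so two merges link all three tables (assumed column names).
merged = (
    comments.merge(courses, left_on="course_id", right_on="id",
                   suffixes=("_comment", "_course"))
            .merge(categories, left_on="category_id", right_on="id",
                   suffixes=("", "_category"))
)
print(merged.head())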

Link: https://pan.baidu.com/s/10m_kWEvLaom41sFvb8CQgA?pwd=8888
Extraction code: 8888

3. Code

3.1 Fetching Course Categories

3.1.1 Full Code
import csv
import json
import requests

# Fetch the course category data

# Request URL
baseurl = "https://www.icourse163.org/web/j/channelBean.listChannelCategoryDetail.rpc?csrfKey=9ddd9641afce4905aa429bf754db5b1b"
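# Note (added): the csrfKey query parameter above matches the NTESSTUDYSI
# cookie value in the Cookie header below. Both come from one browser
# session and must be replaced with values captured from your own session.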

# Request headers (copied from a captured browser request)
headers = {
"Accept":"*/*",
"Accept-Encoding":"gzip, deflate, br",
"Accept-Language":"zh-CN,zh;q=0.9",
"Cache-Control":"no-cache",
"Content-Length":"192",
"Content-Type":"application/x-www-form-urlencoded;charset=UTF-8",
"Cookie":"NTESSTUDYSI=9ddd9641afce4905aa429bf754db5b1b; EDUWEBDEVICE=7a18a0d811e443da8466a956d9571abd; Hm_lvt_77dc9a9d49448cf5e629e5bebaa5500b=1714007316; __yadk_uid=lxVu0GBpcgHrjnjdBP4LXojZlFfwpS4n; WM_NI=4G7PWoEQ02m4%2BvueYIsLLEdVSrkmtq2QbnYz6oN0vU7DuVNiljn7xkLMqPibUCA0Y3KTw4e3PffcgFfwwieW1RRmO7vvCHyP7%2FfjlmJia7I03OR%2FMP0xMocc%2FWmx3lHnckU%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee91e159a5adb6b5f162b1a88aa6d15f968b9e86d83eb0bc83dae67dbc8699a6db2af0fea7c3b92af899f9b1d665878f00d3f040e9aabb86dc43b19484a6e75e91aea58ef77bf28fae8fdc72f6b182d1f55ff59cbbb7cd63b0b6a2d8fb619093bfb1ec6a979d8fd1dc39898cb9d5e83994baaa84e64ea6a9b98ae85df38e9e8ece3db6aa8da9fc5cf5f1aa8bef678ab7afd8cc3cf1919ed0d05cf4a88390e269ed89a399b2688cb2ad8cd837e2a3; WM_TID=uZLSXz1D89ZBAUAUVQKFu0YZ91x1YWRD; Hm_lpvt_77dc9a9d49448cf5e629e5bebaa5500b=1714009886",
"Edu-Script-Token":"9ddd9641afce4905aa429bf754db5b1b",
"Origin":"https://www.icourse163.org",
"Pragma":"no-cache",
"Referer":"https://www.icourse163.org/channel/2001.htm?cate=-1&subCate=-1",
"Sec-Ch-Ua":'"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
"Sec-Ch-Ua-Mobile":"?1",
"Sec-Ch-Ua-Platform":"Android",
"Sec-Fetch-Dest":"empty",
"Sec-Fetch-Mode":"cors",
"Sec-Fetch-Site":"same-origin",
"User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Mobile Safari/537.36"
}

# POST body (a fixed payload for this endpoint)
data = "includeALLChannels=true&includeDefaultChannels=false&includeMyChannels=false"
# Send the POST request
response = requests.post(baseurl,data=data, headers=headers,timeout=10)
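# Note (added): response.raise_for_status() could be called here to fail
# fast on HTTP errors instead of parsing an error page as JSON.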

# Parse the JSON response
parsed_data = json.loads(response.text)

# json.loads above already decodes any \uXXXX escapes into proper str
# objects, so no extra encode/decode round-trip is needed. (The original
# post re-encoded values via encode('utf-8').decode('unicode-escape'),
# which mangles non-ASCII text that is already decoded.)

# Pretty-print the parsed category JSON
print(json.dumps(parsed_data, indent=4, ensure_ascii=False))

# Fields to write to the CSV header
fields = ['id', 'name', 'shortDesc', 'weight', 'defaultChannel', 'charge', 'includeLearningProgress', 'newGiftBag', 'suportOoc', 'suportVocationalMooc', 'suportNavigationType', 'icon', 'recommendWord', 'seoKeywords', 'showChildCategory', 'childrenChannelIds', 'childrenChannelDtoList']

# Create the CSV file and write the data
with open('channel_data.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fields)

    # Walk the parsed JSON and write one row per category.
    # The original post is truncated here; the result key
    # ('channelCategoryDtoList') and the row-writing line below are a
    # best-guess completion based on the field list above.
    for channel_category in parsed_data['result']['channelCategoryDtoList']:
        writer.writerow([channel_category.get(field) for field in fields])
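
One practical caveat (not in the original code): the hard-coded csrfKey and Cookie stop working once the captured session expires. Since the csrfKey in the URL equals the NTESSTUDYSI cookie value, a requests.Session could in principle fetch fresh cookies first and build the URL from them. The sketch below is untested and assumes the homepage sets the NTESSTUDYSI cookie:

import requests

UA = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/114.0.5735.289 Mobile Safari/537.36")

session = requests.Session()
session.headers["User-Agent"] = UA

# Visit the site once so the server sets the session cookies
# (assumption: NTESSTUDYSI is issued on this request).
session.get("https://www.icourse163.org/", timeout=10)

csrf_key = session.cookies.get("NTESSTUDYSI")
if csrf_key:
    url = ("https://www.icourse163.org/web/j/"
           f"channelBean.listChannelCategoryDetail.rpc?csrfKey={csrf_key}")
    resp = session.post(
        url,
        data="includeALLChannels=true&includeDefaultChannels=false&includeMyChannels=false",
        headers={
            "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
            "Edu-Script-Token": csrf_key,
        },
        timeout=10,
    )
    print(resp.status_code)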