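Douban Review Scraper for "Love Between Fairy and Devil" (《苍兰诀》)
Project Overview
This project is a Douban review scraper that collects the reviews of a specified film and saves them to an Excel spreadsheet.
Usage
- Download the project code to your computer.
- Install the required Python libraries.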
pip install pandas
pip install requests
pip install beautifulsoup4
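To confirm that the installation worked before running the scraper, a small check along these lines may help. It is only a sketch: the helper name `missing_modules` is hypothetical and not part of the project, and the list assumes the script imports exactly the three packages installed above.

```python
import importlib.util

# Import names of the packages installed above. Note that the
# beautifulsoup4 distribution is imported under the name "bs4".
REQUIRED_MODULES = ["pandas", "requests", "bs4"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If the script writes .xlsx files through pandas, an Excel engine such as openpyxl is typically needed as well. Whether scrape_douban.py requires it is an assumption here, not something the project states.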
- Open scrape_douban.py and set the EXCEL_FILE_PATH variable to the path where the Excel file should be saved.
# Path where the Excel file is saved
EXCEL_FILE_PATH = 'dataset/douban/douban_comments.xlsx'
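The default path points into nested folders, so writing may fail if dataset/douban does not exist yet. The sketch below shows one way to create the folder first. The helper `ensure_parent_dir` is a hypothetical name, and whether scrape_douban.py already creates the folder itself is not stated in this README.

```python
from pathlib import Path

EXCEL_FILE_PATH = 'dataset/douban/douban_comments.xlsx'


def ensure_parent_dir(file_path):
    """Create the folder that will contain file_path, if it is missing."""
    parent = Path(file_path).parent
    # parents=True creates intermediate folders; exist_ok=True makes
    # the call safe to repeat when the folder is already there.
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# Typical use, before the Excel file is written:
# ensure_parent_dir(EXCEL_FILE_PATH)
```

The function only creates directories and never touches the Excel file itself.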
- Replace the cookie values with those from your own logged-in session. The code to change is:
# Cookies captured after logging in
cookies = {
'bid': 'SzNvd-H3HQ8',
'douban-fav-remind': '1',
'_pk_id.100001.8cb4': '6a1812e422687410.1687587816.',
'__yadk_uid': 'bKeDpqamOUqVFgySo0c9Q7mlFZofu5jn',
'll': '108288',
'viewed': '1477390_35972849',
'_pk_ref.100001.8cb4': '%5B%22%22%2C%22%22%2C1708219364%2C%22https%3A%2F%2Fcn.bing.com%2F%22%5D',
'_pk_ses.100001.8cb4': '1',
'__utma': '30149280.1879374568.1687587817.1704897114.1708219366.5',
'__utmc': '30149280',
'__utmz': '30149280.1708219366.5.4.utmcsr=cn.bing.com|utmccn=(referral)|utmcmd=referral|utmcct=/',
'ap_v': '0,6.0',
'push_noty_num': '0',
'push_doumail_num': '',
'frodotk_db': '4c92f1b16efda7a3e5b8d5144cf8f6e9',
'dbcl2': '164014520:FawKNUeWwVk'
}
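To obtain the cookie values:
Before logging in, press F12 to open the browser's developer tools and select the Network tab:
Then log in to Douban. After the login succeeds, select the request whose name contains login_success, open its Headers panel, and right-click to copy the request headers:
The copied text looks similar to the following: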
GET /stat.html?&login_success_duration=164.801&platform=douban&login_end_time=1708237806205&callback=jsonp_1prvlv0djqnw1n5 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control: no-cache
Connection: keep-alive
Cookie: bid=SzNvd-H3HQ8; douban-fav-remind=1; _pk_id.100001.8cb4=6a1812e422687410.1687587816.; __yadk_uid=bKeDpqamOUqVFgySo0c9Q7mlFZofu5jn; ll="108288"; viewed="1477390_35972849"; push_noty_num=0; push_doumail_num=0; __utmc=30149280; __utmz=30149280.1708230192.6.5.utmcsr=cn.bing.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmv=30149280.16401; frodotk_db="23de5732df33ae736b9b05989d9b0b59"; __utma=30149280.1879374568.1687587817.1708230192.1708232910.7; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1708237621%2C%22https%3A%2F%2Fwww.bing.com%2F%22%5D; _pk_ses.100001.8cb4=1; dbcl2="164014520:8puZEtmVRzI"
Host: www.douban.com
Pragma: no-cache
Referer: https://accounts.douban.com/
Sec-Fetch-Dest: script
Sec-Fetch-Mode: no-cors
Sec-Fetch-Site: same-site
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0
sec-ch-ua: "Not A(Brand";v="99", "Microsoft Edge";v="121", "Chromium";v="121"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
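Convert the Cookie line from the copied headers into a Python dictionary, with one entry per name=value pair and without the double quotes that surround values such as ll and dbcl2:
# Cookies captured after logging in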
cookies = {
'bid': 'SzNvd-H3HQ8',
'douban-fav-remind': '1',
'_pk_id.100001.8cb4': '6a1812e422687410.1687587816.',
'__yadk_uid': 'bKeDpqamOUqVFgySo0c9Q7mlFZofu5jn',
'll': '108288',
'viewed': '1477390_35972849',
'_pk_ref.100001.8cb4': '%5B%22%22%2C%22%22%2C1708219364%2C%22https%3A%2F%2Fcn.bing.com%2F%22%5D',
'_pk_ses.100001.8cb4': '1',
'__utma': '30149280.1879374568.1687587817.1704897114.1708219366.5',
'__utmc': '30149280',
'__utmz': '30149280.1708219366.5.4.utmcsr=cn.bing.com|utmccn=(referral)|utmcmd=referral|utmcct=/',
'ap_v': '0,6.0',
'push_noty_num': '0',
'push_doumail_num': '',
'frodotk_db': '4c92f1b16efda7a3e5b8d5144cf8f6e9',
'dbcl2': '164014520:FawKNUeWwVk'
}
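Converting the Cookie line by hand is error-prone, so the conversion can also be scripted. The following is a minimal sketch using only the standard library. The function name `parse_cookie_header` is hypothetical and not part of the project, and it assumes the header was copied as a single line in the `name=value; name=value` form shown above.

```python
def parse_cookie_header(header: str) -> dict:
    """Turn a raw Cookie request header into a plain dict of strings."""
    # Drop the optional "Cookie:" prefix copied along with the value.
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only; values such as __utmz
        # contain "=" characters themselves.
        name, sep, value = pair.partition("=")
        if not sep:
            continue
        # Some values are wrapped in double quotes; strip them.
        cookies[name.strip()] = value.strip().strip('"')
    return cookies


raw = 'Cookie: bid=SzNvd-H3HQ8; ll="108288"; push_noty_num=0'
cookies = parse_cookie_header(raw)
```

The resulting dictionary has the same shape as the hand-written cookies variable, so it could be pasted into the script or assigned directly.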
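- Update the request headers:
# Request headers
headers = {'user-agent': 'Mozilla/5.0'}
To change it: in the Network tab, click any request, open its Headers panel, scroll down, and copy the full User-Agent value to replace the placeholder above.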
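If the whole block of request headers has already been copied, the User-Agent value can be picked out programmatically rather than by scrolling. This is a rough sketch: the function name `extract_user_agent` is hypothetical, and it assumes each header sits on its own line in `Name: value` form.

```python
from typing import Optional


def extract_user_agent(raw_headers: str) -> Optional[str]:
    """Return the User-Agent value from a copied block of request headers."""
    for line in raw_headers.splitlines():
        # Header lines have the form "Name: value". Header names are
        # case-insensitive, so compare in lower case.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            return value.strip()
    # No User-Agent line was found in the copied text.
    return None


copied = """GET /stat.html HTTP/1.1
Host: www.douban.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
headers = {'user-agent': extract_user_agent(copied)}
```

The function returns None when no User-Agent line is present, so it is worth checking the result before using it in the headers dictionary.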