Scraping Douban Ratings
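Hi everyone, I'm Te. Today we're going to write a scraper that grabs movie ratings from Douban. Come on, let's go!
A quick note first: you can't scrape Douban too heavily. Scrape too much and your IP gets banned for a few hours, and that's no fun, right? So we'll only scrape the ratings.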
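One way to stay on the safe side is to leave a pause between requests. Here is a minimal sketch of that idea. The helper name throttled and the interval values are my own assumptions, not anything Douban documents, and fake_fetch is just a stand-in so you can try it without touching the network.

```python
import time

def throttled(fetch, min_interval=2.0):
    """Wrap fetch so that consecutive calls start at least min_interval seconds apart."""
    last_call = None  # start time of the previous call, None before the first one

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)  # pause until the interval has passed
        last_call = time.monotonic()
        return fetch(*args, **kwargs)

    return wrapper

# Stand-in for a real download function (hypothetical, no network access)
def fake_fetch(url):
    return "page for " + url

polite_fetch = throttled(fake_fetch, min_interval=0.2)
start = time.monotonic()
pages = [polite_fetch(u) for u in ["a", "b", "c"]]
elapsed = time.monotonic() - start  # roughly two pauses of 0.2 seconds
```

In the scraper itself you could wrap requests.get the same way, for example throttled(requests.get, min_interval=2.0), and pick whatever interval you are comfortable with.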
First, import the libraries
import requests
import time
import matplotlib as mpl  # used for plotting
import matplotlib.pyplot as plt
import os  # operating system library
import json
from bs4 import BeautifulSoup
The main function
def main():
    x = []  # movie names
    y = []  # movie scores
    n = int(input("Number of movies>>>"))
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.54"
    }
    # Ask for each movie's name and path number, then fetch its ratings
    for count in range(1, n + 1):
        name = input("Name of movie " + str(count) + ">>>")
        num = int(input("Movie path number of movie " + str(count) + ">>>"))
        url = "https://movie.douban.com/subject/" + str(num) + "/?tag=%E7%83%AD%E9%97%A8&from=gaia"
        movie(x=x, y=y, url=url, name=name, headers=headers)
    # Draw the chart once every movie has been processed
    paint(x, y)
Finding and cleaning the ratings
def movie(x, y, url, name, headers):
    # Download the movie page and parse it
    req = requests.get(url, headers=headers)
    bf = BeautifulSoup(req.text, 'html.parser')
    # The ".rating_per" elements hold the percentages, ordered from 5 stars to 1 star
    find = bf.select(".rating_per")
    ratings = {}
    for i, stars in enumerate([5, 4, 3, 2, 1]):
        # The label on the page reads "5星", "4星", and so on
        label = bf.select(".stars" + str(stars))[0].string.replace("\n", "").strip()
        ratings[label] = find[i].string.split("%")[0]
    # Weighted sum of the percentages (5 stars count five times, 1 star once), rounded down
    s = sum(float(ratings[str(k) + "星"]) * k for k in range(1, 6))
    s = s // 1
    # list.append changes the list in place, so the caller sees the new entries
    x.append(name)
    y.append(s)
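To make the arithmetic easier to follow, here is a small worked example of the same weighted sum. The function name weighted_score is hypothetical and the percentages are made-up sample numbers, not real Douban data.

```python
def weighted_score(percent_strings):
    """percent_strings: five strings such as "50.0%", ordered from 5 stars to 1 star."""
    # Strip the percent sign and convert to numbers
    percents = [float(p.split("%")[0]) for p in percent_strings]
    # Multiply each percentage by its number of stars and add everything up
    total = sum(p * stars for p, stars in zip(percents, [5, 4, 3, 2, 1]))
    return total // 1  # round down, like the script does

sample = ["50.0%", "30.0%", "15.0%", "3.0%", "2.0%"]  # made-up numbers
score = weighted_score(sample)
print(score)  # 50*5 + 30*4 + 15*3 + 3*2 + 2*1 = 423.0
```

If the percentages add up to 100, dividing the result by 100 gives the average number of stars, 4.23 for this sample.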
Finally, draw the chart and call the main function
def paint(x, y):
    # Use a font that can display Chinese characters
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # Draw the bar chart
    plt.bar(x, y)
    plt.show()

if __name__ == "__main__":
    main()
Note: the movie path number is the number that comes after subject/ in the URL of the movie's page
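If you'd rather paste the whole URL than pick the number out by hand, something like this could do it for you. It's just a sketch: subject_id is a hypothetical helper, and the number in the example URL is a placeholder, not a real movie.

```python
import re

def subject_id(url):
    """Return the number after 'subject/' in a Douban movie URL, or None if there is none."""
    match = re.search(r"/subject/(\d+)", url)
    return int(match.group(1)) if match else None

# Placeholder URL, the number does not point at a real movie
example = "https://movie.douban.com/subject/1234567/?tag=%E7%83%AD%E9%97%A8&from=gaia"
print(subject_id(example))  # 1234567
```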
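While I'm here, let me recommend a website for converting file formats: https://www.aconvert.com/
OK, that's it for today. If you enjoyed it, please like, comment, and follow. Love you all!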