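On Douban Movie, a film that has not been released yet does not display an aggregate score for its preview screenings, so to get one you have to crawl every individual rating and take the average. The same crawl can also collect usernames and comment text.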
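The averaging step on its own can be sketched as follows. This is a minimal illustration, not the crawler itself: `average_score` and `LABEL_TO_STARS` are hypothetical names, the sample list is made up, and the sketch assumes any entry that is not one of the five rating labels belongs to an unrated comment and should be ignored.

```python
# Hypothetical helper: the label-to-star mapping mirrors Douban's five rating labels.
LABEL_TO_STARS = {'力荐': 5, '推荐': 4, '还行': 3, '较差': 2, '很差': 1}


def average_score(labels):
    """Return the mean rating on a 10-point scale, or None if nothing is rated."""
    # Keep only entries that are real rating labels; anything else is treated as unrated.
    stars = [LABEL_TO_STARS[label] for label in labels if label in LABEL_TO_STARS]
    if not stars:
        return None  # avoid dividing by zero when no comment carries a rating
    # Stars run from 1 to 5, so doubling the mean gives a 10-point score.
    return sum(stars) / len(stars) * 2


# Made-up sample: three rated comments and one unrated entry.
sample = ['力荐', '推荐', '还行', 'not a rating label']
print(average_score(sample))  # 8.0
```

Returning `None` for an empty list is a design choice of this sketch; it keeps the caller from hitting a `ZeroDivisionError` when a page has no rated comments.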
This example uses The Wandering Earth (流浪地球), which had not yet been released:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

names, stars, texts = [], [], []
# Map Douban's rating labels to star counts.
ch = {'力荐': 5, '推荐': 4, '还行': 3, '较差': 2, '很差': 1}
# Histogram of how many comments gave each star count.
star = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}

# Request the comment pages 20 items at a time and stop at the first empty page.
for start in range(0, 1000, 20):
    url = 'https://movie.douban.com/subject/26266893/comments?start=' + str(start) + '&limit=20&sort=new_score&status=P'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    pl = soup.find_all(class_='comment-item')
    if len(pl) == 0:
        break
    for item in pl:
        span = item.select('.comment-info')[0]
        # The second span's title is the rating label. A title starting with '2'
        # is skipped (presumably a timestamp shown when the user gave no rating).
        title = span.find_all('span')[1]['title']
        if title[0] != '2':
            pf = ch[title]
            stars.append(pf)
            star[pf] += 1
            # Uncomment to also collect the comment text and the username.
            # texts.append(item.select('.short')[0].text)
            # names.append(span.select('a')[0].text)

# Mean star rating doubled to a 10-point score, then the per-star breakdown.
print('score:', sum(stars) / len(stars) * 2)
for i in star:
    print(str(i) + 'star:' + str(star[i]) + ',' + str(round(star[i] / len(stars) * 100, 2)) + '%')
print('sum:', len(stars))
Output:
score: 8.4739336492891
1star:5,2.37%
2star:6,2.84%
3star:22,10.43%
4star:79,37.44%
5star:99,46.92%
sum: 211
[Finished in 4.6s]
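As a sanity check, the score can be recomputed from the printed star distribution alone. The sketch below copies the five counts from the output above; `score_from_counts` is a hypothetical helper name, and the calculation assumes the score is simply the weighted mean of the stars multiplied by 2.

```python
# Star counts copied from the output above: {stars: number of comments}.
counts = {1: 5, 2: 6, 3: 22, 4: 79, 5: 99}


def score_from_counts(counts):
    """Weighted mean of the star values, doubled to a 10-point scale."""
    total = sum(counts.values())
    weighted = sum(stars * n for stars, n in counts.items())
    return weighted / total * 2


print(sum(counts.values()))                 # 211
print(round(score_from_counts(counts), 2))  # 8.47
```

The counts add up to 211 rated comments and the weighted sum of stars is 894, so the score is 894 / 211 × 2 ≈ 8.47, which agrees with the `score:` line.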