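Maybe you were still shaken after Avengers 3 when the Flerken in Captain Marvel came along and left you scared of cats. Whatever. This post shows how to scrape the short reviews of Captain Marvel from Douban, process them, and generate a word cloud.
Scraping the reviews
First we scrape the reviews. As usual, requests and BeautifulSoup are all we need. A look at the page source shows that every short review sits in a span tag whose class is short, which makes things easy: a single find_all call does the job.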
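To see what that find_all call does without hitting the network, here is a minimal sketch. The HTML fragment, the `SAMPLE_HTML` name, and the `extract_short_comments` helper are made up for illustration; they only imitate the span/short structure described above, and the real Douban page carries far more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure described above;
# it is not copied from the real Douban page.
SAMPLE_HTML = """
<div class="comment">
  <p><span class="short">The Flerken stole the show.</span></p>
</div>
<div class="comment">
  <p><span class="short">A solid origin story.</span></p>
  <span class="votes">12</span>
</div>
"""


def extract_short_comments(html):
    """Return the text of every <span class="short"> in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_ (with the trailing underscore) filters on the CSS class,
    # because `class` is a reserved word in Python.
    spans = soup.find_all('span', class_='short')
    return [span.get_text(strip=True) for span in spans]


comments = extract_short_comments(SAMPLE_HTML)
print(comments)
```

The span with class votes is skipped because it does not match the class filter, so only the review text comes back.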
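# -*- coding:utf-8 -*-
import requests
import time
import random
from bs4 import BeautifulSoup

# Build the URL of every comment page: 20 comments per page, offsets 0 to 480
urls = []
for i in range(0, 500, 20):
    urls.append('https://movie.douban.com/subject/26213252/comments?start=' + str(
        i) + '&limit=20&sort=new_score&status=P')  # pagination of the comments


def singlepage_comment(url):
    # Fetch the comments on a single page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/65.0.3325.162 Safari/537.36'
    }
    # headers must be passed by keyword; the second positional argument of
    # requests.get is params, which would send the dict as a query string
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'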
soup