Scraping Douban Movie Reviews with a Python Crawler

Author: 不想上学的小菜鸟

Article link: https://blog.csdn.net/qq_36151472/article/details/102672942

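There are many examples online of scraping Douban movie reviews with Python and turning them into a word cloud. I followed the article "Python爬虫实战" (Python Crawler in Practice), which explains the steps in detail. I still ran into quite a few problems while reproducing it, so I am recording them here.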

# coding: utf-8

import warnings
warnings.filterwarnings("ignore")
import jieba    # Chinese word segmentation package
import numpy    # numerical computing package
import codecs   # codecs.open lets you specify a file's encoding; the content is converted to Unicode on read
import re
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
# %matplotlib inline

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

# Parse the "now playing" page and return a list of {'id', 'name'} dicts
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        # The movie title is taken from the alt attribute of the poster image
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
        # Append once per movie so an item with several images is not duplicated
        nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# Fetch one page of short comments (20 per page) for the given movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        # Invalid page number: return an empty list so callers always get a list
        return eachCommentList
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_list = soup.find_all('div', class_='comment')
    for item in comment_div_list:
        # The comment text sits in the <span> inside the first <p>
        comment = item.find_all('p')[0].span.string
        if comment is not None:
            eachCommentList.append(comment)
    return eachCommentList

def main():
    # Fetch the first 10 pages of comments for the first movie in the list
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # Convert the list data into a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # Keep only Chinese characters, which drops punctuation and other symbols
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # Segment the Chinese text with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words (quoting=3 is QUOTE_NONE: no quote handling at all)
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='gbk')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # Count word frequencies. Recent pandas versions reject a renaming dict
    # passed to agg() on a single column, so size() is used instead.
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by='count', ascending=False)

    # Collect the 1000 most frequent words as {word: count}
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
    # fit_words expects a dict, not a list of tuples (see tip 4 below)
    wfl = dict(word_frequence_list)

    # Render the word cloud; font_path must point to a font with Chinese glyphs
    wordcloud = WordCloud(scale=5, font_path='./fonts/simhei.ttf', max_font_size=40, relative_scaling=.5).fit_words(wfl)
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

# Entry point
if __name__ == '__main__':
    main()
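
The cleaning and counting steps in main() can be tried without touching the network. The sketch below is only an illustration: the sample string and the token list are made up, the helper names keep_chinese and count_words are hypothetical, and collections.Counter stands in for the pandas groupby used above.

```python
import re
from collections import Counter

# Same pattern as in main(): runs of characters in the range U+4E00..U+9FA5
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def keep_chinese(text):
    """Return only the Chinese characters of text, joined into one string."""
    return ''.join(CHINESE_RUN.findall(text))

def count_words(words, stopwords=()):
    """Count words, skipping any that appear in stopwords."""
    stop = set(stopwords)
    return Counter(w for w in words if w not in stop)

# Made-up sample in the shape produced by str(list_of_comments)
sample = "['好看！值得一看。', 'so-so，剧情一般 3/5']"
cleaned = keep_chinese(sample)
print(cleaned)  # 好看值得一看剧情一般

# Hand-written token list standing in for the output of jieba.lcut
tokens = ['好看', '值得', '一', '看', '剧情', '一般', '好看']
freq = count_words(tokens, stopwords=['一'])
print(freq.most_common(1))  # [('好看', 2)]
```

Brackets, quotes, Latin letters, digits and full-width punctuation all fall outside the character range, which is why the script can get away with calling str() on a list before filtering.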

A few small bugs and tips came up while porting the code; I have recorded them below.
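(1) The statement %matplotlib inline is a Jupyter (IPython) magic command, not Python. PyCharm reports it as invalid syntax, so just comment it out. In Jupyter it means that when you call a plotting function such as plot() from matplotlib.pyplot, or create a figure canvas, the image is rendered directly in the notebook output rather than in a separate window.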
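(2) stopwords.txt has to be downloaded separately (a Baidu search will find one); without it the script fails with a file-not-found error.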
(3) The final image shows only empty boxes and no text.
This happens because the default font cannot render Chinese characters, so pass font_path='./fonts/simhei.ttf' to WordCloud.
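
If the font file is missing, the failure can be hard to read, so it may help to check the path before building the word cloud. This is a minimal sketch under assumptions: pick_font is a hypothetical helper, and the second candidate path is only a typical Windows location, not something the original article specifies.

```python
import os

def pick_font(candidates):
    """Return the first path in candidates that exists, or raise a clear error."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError('No usable font found; tried: ' + ', '.join(candidates))

# Candidate locations are assumptions; adjust them for your machine.
FONT_CANDIDATES = [
    './fonts/simhei.ttf',           # path used in this article
    'C:/Windows/Fonts/simhei.ttf',  # typical location on Windows (assumption)
]

try:
    font_path = pick_font(FONT_CANDIDATES)
    print('Using font:', font_path)
    # The result would then be passed on as WordCloud(font_path=font_path, ...)
except FileNotFoundError as err:
    print(err)
```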

(4) Error: 'list' object has no attribute 'items'. fit_words expects a dict, so passing a list raises this error. Convert the list first: wfl = dict(word_frequence_list).
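(5) The extracted comments come back empty. There are two ways to fix this:
        a. In the second loop, change item.string to item.span.string;
        b. Change the find_all call itself to find_all('span', 'short').
        To explain: the p element contains a span element, so calling .string on it directly would normally give something of the form ****. Why does the code get nothing instead? Because the response content contains \n newline characters. The p element therefore holds not just the span but also two \n, one on each side of it. bs4 treats each newline as a child node as well, and .string only works when the tag has exactly one child, so here .string returns None.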
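
The behaviour is easy to reproduce offline. The sketch below uses a made-up HTML fragment shaped like the markup described above; the comment text is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a <p> holding a newline, a <span class="short">, and another newline
html = '<div class="comment"><p>\n<span class="short">很好看</span>\n</p></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

print(len(p.contents))  # 3: '\n', the <span>, '\n'
print(p.string)         # None, because <p> has more than one child
print(p.span.string)    # 很好看

# Option b: select the spans directly by tag name and CSS class
shorts = [str(s.string) for s in soup.find_all('span', 'short')]
print(shorts)           # ['很好看']
```

Option b is a little more robust, since it does not depend on the span being the first thing inside the first p of each comment div.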
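(6) Reading stopwords.txt may raise a decoding error. This depends on the encoding of the file you downloaded, which is usually either gbk or utf-8; change the encoding argument to match.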
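
Instead of editing the encoding by hand, the file can be read by trying each encoding in turn. This is a sketch only: read_stopwords is a hypothetical helper that uses plain open() rather than pd.read_csv, and the temporary file stands in for a downloaded stopwords.txt.

```python
import os
import tempfile

def read_stopwords(path, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn and return the stop words as a set."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return {line.strip() for line in f if line.strip()}
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError('could not decode %s with any of %s' % (path, encodings))

# Write a small GBK-encoded file to stand in for a downloaded stopwords.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='gbk', delete=False) as tmp:
    tmp.write('的\n了\n')

words = read_stopwords(tmp.name)
os.remove(tmp.name)
print(sorted(words))  # ['了', '的']
```

utf-8 is tried first because GBK-encoded Chinese text almost never decodes as valid UTF-8, whereas UTF-8 bytes can sometimes be decoded as GBK without an error and silently produce garbage.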