Experiment 2: Data Acquisition and Analysis

Latest recommended article published on 2024-06-21 12:52:40

qq_73931224

Tags: python, web scraping

Article link: https://blog.csdn.net/qq_73931224/article/details/138084980

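Use web-scraping techniques (the urllib, BeautifulSoup, and re libraries) to fetch the review data for one film from the "Douban Movie TOP250" list on Douban (a film rated no higher than 8.6). For each review, extract five fields, ['name', 'score', 'time', 'usefulNum', 'comment'], and write all of the data to the file douban.xls.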

Then use pandas and matplotlib to complete the following tasks:

1) Use pandas to create a DataFrame, read the data in douban.xls, and store it in the DataFrame, with the first row serving as the column index ['name', 'score', 'time', 'usefulNum', 'comment'].

2) Count the total number of missing values and the total number of blank values in each column.

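3) Process the score column

   - Filter out the missing and blank values in the score column.

   - Map ['力荐', '推荐', '还行', '较差', '很差'] (highly recommend, recommend, okay, poor, very poor) to 5, 4, 3, 2, 1 respectively, and compute the film's average score.

   - Compute the average number of "useful" votes received by reviews rated '力荐', and draw a bar chart.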

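4) Process the usefulNum column

   - Filter out the missing and blank values in the usefulNum column.

   - Sort the data and find the review that users consider most "useful", that is, the record with the largest usefulNum value.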

5) Correlation analysis

   - Filter out the missing and blank values in the score and usefulNum columns.

   - Analyze the correlation between score and usefulNum.

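6) Comment analysis

   - Filter out the missing and blank values in the comment column.

   - Segment all comments into words, print the 3 most frequent words, and draw a word cloud from the full vocabulary.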
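
Tasks 2 to 5 all distinguish missing values from blank ones, which `isnull()` alone does not do. The following is a minimal sketch of one way to count and filter both kinds, assuming a "blank" cell means a string that is empty or contains only whitespace. The helper names `count_missing_and_blank` and `drop_missing_and_blank` and the toy data are made up for illustration and are not part of the assignment.

```python
import pandas as pd


def is_blank(value):
    """True when value is a string that is empty or whitespace only."""
    return isinstance(value, str) and value.strip() == ""


def count_missing_and_blank(df):
    """Return per-column counts of missing (NaN/None) and blank cells."""
    missing = df.isna().sum()
    blank = df.apply(lambda col: col.map(is_blank).astype(bool).sum())
    return pd.DataFrame({"missing": missing, "blank": blank})


def drop_missing_and_blank(df, column):
    """Return a copy of df without rows whose value in column is missing or blank."""
    blank_mask = df[column].map(is_blank).astype(bool)
    return df[df[column].notna() & ~blank_mask].copy()


# Toy data standing in for douban.xls (values are invented for illustration)
sample = pd.DataFrame({
    "score": ["力荐", None, " ", "推荐"],
    "usefulNum": [10, 3, None, 7],
})
counts = count_missing_and_blank(sample)
cleaned = drop_missing_and_blank(sample, "score")
print(counts)
print(cleaned)
```

The same `drop_missing_and_blank` call can be repeated for usefulNum and comment before the later tasks.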

import requests
from bs4 import BeautifulSoup
import pandas as pd
import xlwt
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud


def getHtml():
    """Fetch six pages (20 comments each) of short comments for one film."""
    htmls = []
    headers = {"User-Agent":
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
               }
    url = "https://movie.douban.com/subject/3011051/comments"
    # start = 0, 20, ..., 100 selects one page of 20 comments at a time
    for start in range(0, 101, 20):
        params = {"start": start, "limit": 20, "status": "P", "sort": "new_score"}
        r = requests.get(url, params=params, headers=headers, timeout=10)
        if r.status_code != 200:
            raise Exception(f"request failed with status {r.status_code} (start={start})")
        htmls.append(r.text)
    return htmls


def parse_single_html(html):
    """Extract [name, score, time, useful, comment] from every comment on one page."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find("div", id="content").find("div", class_="grid-16-8 clearfix").find("div", class_="article") \
        .find_all("div", class_="comment-item")
    my_list = []
    # Iterate over the comments actually present; a page may hold fewer than 20
    for item in items:
        info = item.find("span", class_="comment-info")
        name = info.find("a").get_text()

        # The rating is the title of a span inside comment-info; score stays None
        # (written as a blank cell) when the reviewer gave no rating
        score = None
        for label in ("力荐", "推荐", "还行", "较差", "很差"):
            if info.find("span", title=label) is not None:
                score = label
                break

        comment_time = info.find("span", class_="comment-time").get_text().strip()

        useful = item.find("span", class_="comment-vote").find("span", class_="votes vote-count").get_text()

        comment = item.find("span", class_="short").get_text()

        my_list.append([name, score, comment_time, int(useful), comment])
    return my_list


all_htmls = getHtml()
data = []
for html in all_htmls:
    # extend keeps data as one flat list of rows, one per comment
    data.extend(parse_single_html(html))

# Create the Excel workbook object
workbook = xlwt.Workbook(encoding="utf-8")
# Create the worksheet
worksheet = workbook.add_sheet("Sheet1")
# Column names (the assignment calls the fourth column usefulNum)
columns = ["name", "score", "time", "useful", "comment"]
# Write the column names in the first row
for j, column in enumerate(columns):
    worksheet.write(0, j, column)
# Write the data below the header, one comment per row
for i, row in enumerate(data, start=1):
    for j, value in enumerate(row):
        worksheet.write(i, j, value)

# Save the file as comment.xls
workbook.save("comment.xls")

# Read the file saved above (reading .xls files requires the xlrd package)
df = pd.read_excel("comment.xls", sheet_name="Sheet1", header=0)
# Count the missing values in each column
print(df.isnull().sum())
# Map the rating labels to numbers; any other value (missing, blank, "None") becomes NaN
score_map = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}
df["score"] = df["score"].map(score_map)
# Drop the rows that have no score
df = df.dropna(subset=["score"])
print(df.isnull().sum())

print(df.head())
print(df.tail())
# Mean of the score column
print(f"Average score for this film: {df['score'].mean():.2f}")

# Rows rated 力荐 (score 5)
df1 = df[df["score"] == 5]
# Mean number of "useful" votes on reviews rated 力荐
average = df1["useful"].mean()
print(f"Mean number of 'useful' votes on reviews rated 力荐: {average}")

# The chart is plain; consider redrawing it yourself
# Set the fonts before plotting: prefer KaiTi, fall back to SimHei
plt.rcParams['font.sans-serif'] = ['KaiTi', 'SimHei']
# Render the minus sign correctly
plt.rcParams['axes.unicode_minus'] = False
plt.bar(1, average, color='lightblue', bottom=0, width=0.4)
plt.ylabel("Mean useful votes")
plt.show()

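from collections import Counter
# Draw the word cloud
# Segment every non-missing comment into words with jieba
words = []
for comment in df["comment"].dropna():
    words.extend(jieba.lcut(str(comment)))
# WordCloud needs the words separated, so join them with spaces
result_txt = ' '.join(words)
# msyh.ttc is a font file; copy it from your own system fonts into the same directory as this script
wordcloud = WordCloud(font_path="msyh.ttc").generate(result_txt)
wordcloud.to_file('豆瓣评论词云图.jpg')

print("Correlation coefficient between useful and score:")
print(df["useful"].corr(df["score"]))
# Print the three most frequent words (single-character tokens, mostly punctuation and particles, are skipped)
print(Counter(w for w in words if len(w.strip()) > 1).most_common(3))
# Print the row with the largest useful value
print(df.loc[df["useful"].idxmax()])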
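
For task 6, word frequencies have to be counted over tokens rather than over the characters of one long string. Here is a small sketch of that counting step using only the standard library. The function name `top_words`, its `min_len` parameter, and the English sample comments are assumptions for illustration; with the real data, `jieba.lcut` would be passed as `tokenize`.

```python
from collections import Counter


def top_words(comments, tokenize, n=3, min_len=2):
    """Return the n most frequent tokens across comments as (word, count) pairs."""
    counter = Counter()
    for comment in comments:
        for token in tokenize(comment):
            token = token.strip()
            # Skip very short tokens, which are mostly punctuation or particles
            if len(token) >= min_len:
                counter[token] += 1
    return counter.most_common(n)


# Whitespace-separated toy comments; str.split stands in for jieba.lcut here
sample_comments = ["good story good cast", "good music", "slow story"]
top = top_words(sample_comments, str.split, n=2)
print(top)
```

Passing the tokenizer as an argument keeps the counting logic testable without jieba installed.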
