Python Data Analysis and Visualization: Wine Review Analysis Report

一语梦千城

Tags: python, data analysis, data mining

Article link: https://blog.csdn.net/m0_53514303/article/details/133636802

Python Data Analysis and Visualization: Wine Review Analysis Report (pandas library)

Task requirements
Approach
Functions and methods used
- Functions
- Methods
Code implementation
- Sub-functions
- Final implementation code

Task requirements

Problem restatement

The file "winemag-data.csv" contains 12,974 rows of wine review data in 6 columns: number, country, description, points, price, and province. The data format is shown below:

number,country,description,points,price,province
30,France,"Red cherry fruit comes laced with…",86,15,Beaujolais
50,Italy,"This blend of Nero Avola and Syrah…",86,15,Sicily
100,US,"Fresh apple, lemon and pear flavors…",88,18,New York

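By analyzing this data, consumers can choose a wine that suits them by origin, score, and price, while merchants can analyze purchasing habits to offer products that better fit the market and target customers more precisely.

Requirements

Input '国家名列表' (country list): collect the wine-producing countries that appear in the file and output the list of distinct country names, sorted alphabetically in ascending order. If a record has no country name, skip it; the returned list must not contain empty-string elements.
Input '平均分' (average score): compute the average score of each country's wines (at most 2 decimal places); the return value is a list of [country name, score] pairs.
Input '平均分排序' (sorted average score): compute the average score of each country's wines; the return value is a list of [country name, score] pairs, sorted by score in descending order.
Input '评分最高' (highest score): output the number, country, points, and price of the ten highest-scoring wines, in descending order of score.
Input '价格最高' (highest price): output the number, country, points, and price of the twenty most expensive wines, in descending order of price.
Input '葡萄酒评分' (wine scores): count how many wines received each score, then:
- output the list of per-score wine counts, sorted by score in ascending order;
- output the score that has the most wines, together with that count;
- output the average price of the wines that received that score.
For any other input, output "输入错误" (input error).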

File-reading example:

import pandas as pd


def csv_to_ls(file):
    """Take a file name, read the data into a DataFrame with pandas,
    convert its data part (values) to a 2-D list with tolist(),
    and return that 2-D list.
    @param file: file name, str
    """
    wine_list = pd.read_csv(file).values.tolist()
    return wine_list
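
Two properties of this reader matter for every function that follows: pandas consumes the header line, so the returned list holds data rows only, and a missing cell comes back as a float nan rather than an empty string. The sketch below illustrates both on a made-up three-row sample (the sample text is an assumption, not taken from the real file).

```python
import io
import math

import pandas as pd

# Hypothetical sample in the same column layout as winemag-data.csv.
SAMPLE_CSV = """number,country,description,points,price,province
30,France,"Red cherry fruit",86,15,Beaujolais
50,,"Nero Avola and Syrah blend",86,,Sicily
100,US,"Fresh apple, lemon and pear",88,18,New York
"""

# read_csv treats the first line as the header, so it is not part of .values.
rows = pd.read_csv(io.StringIO(SAMPLE_CSV)).values.tolist()

print(len(rows))  # number of data rows; the header is not counted
print(rows[1])    # the missing country and price appear as nan
print(isinstance(rows[1][1], float), math.isnan(rows[1][4]))
```

Because the header is already gone, none of the functions below need to skip the first element of the list.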

Input/output examples (for format illustration only; unrelated to the test cases)

Example 1

Input:
国家名列表
Output:
['Argentina', 'Armenia', … 'US', 'Ukraine', 'Uruguay']

Example 2

Input:
平均分
Output:
[['Argentina', 86.72], ['Armenia', 87.0],… ['Ukraine', 83.0], ['Uruguay', 88.0]]

Example 3

Input:
葡萄酒评分
Output:
[[80, 38], [81, 71], … [95, 140], [96, 50], [97, 26], [98, 8], [99, 3]]
[86, 1743]
31.02

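Approach

Collect the wine-producing countries that appear in the file and output the list of distinct country names, sorted alphabetically in ascending order; skip records whose country name is missing, so the result contains no empty-string elements.
Compute the average score of each country's wines and return a list of [country name, score] pairs.
Compute the average score of each country's wines and return a list of [country name, score] pairs, sorted by score in descending order.
Output the number, country, points, and price of the ten highest-scoring wines, in descending order of score.
Output the number, country, points, and price of the twenty most expensive wines, in descending order of price.
Count how many wines received each score and output a list of [score, count] pairs.
Output the score that has the most wines, together with that count.
Output the average price of the wines that received that score.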
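
The article solves these steps on a plain 2-D list. For comparison, the per-country average can also be sketched directly on a DataFrame with groupby. The miniature data set below is invented for illustration, and this is an alternative technique rather than the approach used in the rest of the article.

```python
import pandas as pd

# Hypothetical miniature data set; the real file has 6 columns.
df = pd.DataFrame({
    "country": ["France", "Italy", "France", None, "US"],
    "points": [86, 88, 91, 90, 88],
})

# groupby drops rows whose key is missing (dropna=True is the default),
# which matches the "skip records without a country" requirement.
# Group keys are sorted by default, so the result is in alphabetical order.
avg = df.groupby("country")["points"].mean().round(2)

avg_list = [[name, float(score)] for name, score in avg.items()]
print(avg_list)  # [['France', 88.5], ['Italy', 88.0], ['US', 88.0]]

# Descending order of average score, as required by '平均分排序'.
print(sorted(avg_list, key=lambda x: x[1], reverse=True))
```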

Code implementation

Sub-functions

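Input '国家名列表': collect the wine-producing countries that appear in the file and output the list of distinct country names, sorted alphabetically in ascending order. Records with a missing country name are skipped, and the returned list contains no empty-string elements.

def country_ls(wine_list):
    """Take the wine data as a 2-D list and return the distinct country names,
    sorted alphabetically in ascending order. Rows whose country is missing
    (read by pandas as float nan) are skipped, so the result has no empty entries.
    @param wine_list: wine data, list
    """
    country_list = []
    for x in wine_list:
        # A missing country is a float nan, not a str; skipping it also keeps sort() from failing
        if isinstance(x[COUNTRY], str) and x[COUNTRY] not in country_list:
            country_list.append(x[COUNTRY])
    country_list.sort()
    return country_list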
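
The membership test against a growing list rescans the list for every row. A set comprehension does the same de-duplication in one expression. The sketch below is a hypothetical variant (the name country_set and the sample rows are assumptions), using the same COUNTRY column index as the article.

```python
COUNTRY = 1  # same column index as in the article


def country_set(wine_list):
    """Hypothetical set-based variant of country_ls()."""
    # The set removes duplicates; isinstance() filters out float nan.
    return sorted({row[COUNTRY] for row in wine_list if isinstance(row[COUNTRY], str)})


rows = [
    [30, "France", "", 86, 15.0, "Beaujolais"],
    [50, float("nan"), "", 86, 15.0, "Sicily"],
    [100, "US", "", 88, 18.0, "New York"],
    [101, "France", "", 90, 40.0, "Burgundy"],
]
print(country_set(rows))  # ['France', 'US']
```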

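Input '平均分': compute the average score of each country's wines (at most 2 decimal places); the return value is a list of [country name, score] pairs.

def avg_point(wine_list, country):
    """Take the wine data as a 2-D list and the list of country names, compute the
    average score of each country's wines, and return a list of [country name, score] pairs.
    @param wine_list: wine data, list
    @param country: country names, list
    """
    avg_point_per_country = []
    for country_name in country:
        # Scores of this country's wines; pandas has already consumed the header row,
        # so every element of wine_list is a data row and none may be skipped
        point_of_country = [x[POINTS] for x in wine_list if x[COUNTRY] == country_name]
        avg_point_per_country.append([country_name, round(sum(point_of_country) / len(point_of_country), 2)])
    return avg_point_per_country  # average score of each country's wines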

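Input '平均分排序': compute the average score of each country's wines; the return value is a list of [country name, score] pairs, sorted by score in descending order.

def avg_point_sort(wine_list, country):
    """Take the wine data as a 2-D list and the list of country names, compute the
    average score of each country's wines, and return a list of [country name, score]
    pairs sorted by score in descending order.
    @param wine_list: wine data, list
    @param country: country names, list
    """
    avg_point_per_country = []
    for country_name in country:
        # Scores of this country's wines; wine_list holds data rows only (no header row)
        point_of_country = [float(x[POINTS]) for x in wine_list if x[COUNTRY] == country_name]
        avg_point_per_country.append([country_name, round(sum(point_of_country) / len(point_of_country), 2)])
    return sorted(avg_point_per_country, key=lambda x: x[1], reverse=True)  # sorted by average score, descending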
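
Both averaging functions scan the whole data set once per country. As a sketch of an alternative, the totals can be accumulated in a dictionary in a single pass. The helper name avg_point_single_pass and the sample rows are hypothetical, and the column indexes are the ones defined in the article.

```python
import math
from collections import defaultdict

COUNTRY, POINTS = 1, 3  # same column indexes as in the article


def avg_point_single_pass(wine_list):
    """Hypothetical helper: average points per country in one pass over the rows."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [sum of points, row count]
    for row in wine_list:
        country, points = row[COUNTRY], row[POINTS]
        if not isinstance(country, str) or math.isnan(points):
            continue  # skip rows with a missing country or a missing score
        totals[country][0] += points
        totals[country][1] += 1
    # Country names are unique dict keys, so sorting the items orders by name.
    return [[name, round(s / n, 2)] for name, (s, n) in sorted(totals.items())]


rows = [
    [1, "Italy", "", 88, 20.0, "Sicily"],
    [2, "France", "", 86, 15.0, "Beaujolais"],
    [3, "France", "", 91, float("nan"), "Burgundy"],
    [4, float("nan"), "", 90, 30.0, "Unknown"],
]
result = avg_point_single_pass(rows)
print(result)  # [['France', 88.5], ['Italy', 88.0]]
```

Sorting the result with key=lambda x: x[1] and reverse=True gives the '平均分排序' ordering.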

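Input '评分最高': output the number, country, points, and price of the ten highest-scoring wines, in descending order of score.

def top_10_point(wine_list):
    """Take the wine data as a 2-D list and return the number, country, points and price
    of the ten highest-scoring wines, in descending order of score.
    A score may be missing, in which case the value is nan.
    `not math.isnan(x)` is True when x is not nan.
    nan is a float, so it cannot be detected with string checks.
    @param wine_list: wine data, list
    """
    wine_top_point = [[x[NUMBER], x[COUNTRY], x[POINTS], x[PRICE]] for x in wine_list if not math.isnan(x[POINTS])]
    # Sort by score in descending order, then slice off the first ten rows
    return sorted(wine_top_point, key=lambda x: x[2], reverse=True)[:10]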

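Input '价格最高': output the number, country, points, and price of the twenty most expensive wines, in descending order of price.

def top_20_price(wine_list):
    """Take the wine data as a 2-D list and return the number, country, points and price
    of the twenty most expensive wines, in descending order of price.
    A price may be missing, in which case the value is nan.
    `not math.isnan(x)` is True when x is not nan.
    nan is a float, so it cannot be detected with string checks.
    @param wine_list: wine data, list
    """
    wine_top_price = [[x[NUMBER], x[COUNTRY], x[POINTS], x[PRICE]] for x in wine_list if not math.isnan(x[PRICE])]
    # Sort by price in descending order, then slice off the first twenty rows
    return sorted(wine_top_price, key=lambda x: x[3], reverse=True)[:20]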
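
The two top-N functions differ only in the column they rank by and in N. A possible generalisation, sketched here with heapq.nlargest from the standard library, avoids sorting the full list when only a few rows are needed. The function name top_n and the sample rows are assumptions made for this example.

```python
import heapq
import math

NUMBER, COUNTRY, POINTS, PRICE = 0, 1, 3, 4  # column indexes used in the article


def top_n(wine_list, column, n):
    """Hypothetical generalisation of top_10_point() and top_20_price()."""
    candidates = (
        [row[NUMBER], row[COUNTRY], row[POINTS], row[PRICE]]
        for row in wine_list
        if not math.isnan(row[column])  # ignore rows where the ranking value is missing
    )
    # Output rows are [number, country, points, price], so map the source
    # column to its position inside the shortened row.
    position = {POINTS: 2, PRICE: 3}[column]
    return heapq.nlargest(n, candidates, key=lambda r: r[position])


rows = [
    [1, "US", "", 90, 50.0, "California"],
    [2, "France", "", 95, float("nan"), "Burgundy"],
    [3, "Italy", "", 88, 120.0, "Tuscany"],
]
print([r[0] for r in top_n(rows, POINTS, 2)])  # [2, 1]
print([r[0] for r in top_n(rows, PRICE, 2)])   # [3, 1]
```

Under this sketch, top_n(wine, POINTS, 10) and top_n(wine, PRICE, 20) would play the roles of the two functions above.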

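Input '葡萄酒评分': count how many wines received each score, then:
- output the list of per-score wine counts, sorted by score in ascending order;
- output the score that has the most wines, together with that count;
- output the average price of the wines that received that score.

Count how many wines received each score

def amount_of_point(wine_list):
    """Take the wine data as a 2-D list and return the number of wines for each score,
    sorted by score in ascending order and ignoring rows without a score.
    For example, [...[84, 645], [85, 959],...] means 645 wines scored 84 and 959 wines scored 85.
    @param wine_list: wine data, list
    """
    amount = [x[POINTS] for x in wine_list if not math.isnan(x[POINTS])]  # every valid score
    point_list = sorted(set(amount))  # distinct scores, ascending
    amount_of_points = [[point, amount.count(point)] for point in point_list]
    return amount_of_points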
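
Calling count() once per distinct score rescans the score list each time. A minimal sketch of the same tally with collections.Counter, which counts everything in one pass, is shown below. The function name amount_of_point_counter and the sample rows are hypothetical.

```python
import math
from collections import Counter

POINTS = 3  # same column index as in the article


def amount_of_point_counter(wine_list):
    """Hypothetical variant of amount_of_point() built on collections.Counter."""
    counts = Counter(row[POINTS] for row in wine_list if not math.isnan(row[POINTS]))
    # Scores are unique dict keys, so sorting the items orders by score.
    return [[point, n] for point, n in sorted(counts.items())]


rows = [[0, "US", "", p, 10.0, ""] for p in (86, 88, 86, float("nan"), 90, 86)]
result = amount_of_point_counter(rows)
print(result)                           # [[86, 3], [88, 1], [90, 1]]
print(max(result, key=lambda x: x[1]))  # [86, 3]
```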

Output the score that has the most wines, together with that count

def most_of_point(amount_of_points):
    """Take the list of per-score wine counts and return the [score, count] pair
    with the largest count.
    @param amount_of_points: number of wines for each score, list
    """
    return max(amount_of_points, key=lambda x: x[1])

Output the average price of the wines that received the most common score

def avg_price_of_most_point(wine_list, most_of_points):
    """Take the wine data as a 2-D list and the [score, count] pair of the most common score.
    Ignoring rows with a missing price, return the average price of the wines
    with that score, rounded to 2 decimal places.
    @param wine_list: wine data, list
    @param most_of_points: the most common score and its count, list
    """
    price_of_point = [x[PRICE] for x in wine_list if x[POINTS] == most_of_points[0] and not math.isnan(x[PRICE])]
    avg_price_of_point = sum(price_of_point) / len(price_of_point)
    return round(avg_price_of_point, 2)

To obtain that average price, call the avg_price_of_most_point() function written above, passing the wine data list and the result of most_of_point():

avg_price_of_most_point(wine, most_point)

With the sub-functions above in place, the judge() function takes a string argument and calls the matching function according to its value.

def judge(txt):
    filename = './data/winemag-data.csv'
    wine = csv_to_ls(filename)
    country = country_ls(wine)
    if txt == '国家名列表':
        print(country)
    elif txt == '平均分':
        print(avg_point(wine, country))  # average score of each country's wines
    elif txt == '平均分排序':
        print(avg_point_sort(wine, country))  # average score per country, descending
    elif txt == '评分最高':
        print(top_10_point(wine))  # ten highest-scoring wines: number, country, points, price
    elif txt == '价格最高':
        print(top_20_price(wine))  # twenty most expensive wines: number, country, points, price
    elif txt == '葡萄酒评分':
        amount_point = amount_of_point(wine)
        most_point = most_of_point(amount_point)
        print(amount_point)  # number of wines for each score
        print(most_point)  # most common score and its count
        print(avg_price_of_most_point(wine, most_point))  # average price of wines with the most common score
    else:
        print('输入错误')  # "input error"
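
An if/elif chain like this can also be expressed as a dictionary that maps each command string to a callable. The sketch below shows that structure in isolation; make_judge and the stand-in handlers are hypothetical, and a real version would wire in country_ls(), avg_point(), and the other sub-functions instead.

```python
def make_judge(handlers):
    """Return a judge-like function that looks the command up in a dict.

    `handlers` maps a command string to a zero-argument callable. This
    structure is an illustration, not the article's implementation.
    """
    def judge(txt):
        handler = handlers.get(txt)
        if handler is None:
            return '输入错误'  # same "input error" message as in the article
        return handler()
    return judge


# Stand-in handlers that return fixed values for demonstration only.
demo = make_judge({
    '国家名列表': lambda: ['France', 'Italy'],
    '评分最高': lambda: [[2, 'France', 95, 40.0]],
})
print(demo('国家名列表'))  # ['France', 'Italy']
print(demo('unknown'))     # 输入错误
```

Adding a command then means adding one dictionary entry rather than another elif branch.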

Final implementation code

import pandas as pd
import math

# Named constants for the column indexes, so each index has a clear meaning
NUMBER = 0
COUNTRY = 1
DESCRIPTION = 2
POINTS = 3
PRICE = 4
PROVINCE = 5


def csv_to_ls(file):
    """Take a file name, read the data into a DataFrame with pandas, convert
    its values to a 2-D list with tolist(), and return that list.
    @param file: file name, str
    """
    wine_list = pd.read_csv(file).values.tolist()
    return wine_list


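def country_ls(wine_list):
    """Return the distinct country names in ascending order, skipping missing ones."""
    country_list = []
    for x in wine_list:
        # A missing country is a float nan, not a str; skipping it also keeps sort() from failing
        if isinstance(x[COUNTRY], str) and x[COUNTRY] not in country_list:
            country_list.append(x[COUNTRY])
    country_list.sort()
    return country_list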


def avg_point(wine_list, country):
    """Return [country name, average score] pairs, rounded to 2 decimal places."""
    avg_point_per_country = []
    for country_name in country:
        # wine_list holds data rows only (pandas consumed the header), so no row is skipped
        point_of_country = [x[POINTS] for x in wine_list if x[COUNTRY] == country_name]
        avg_point_per_country.append([country_name, round(sum(point_of_country) / len(point_of_country), 2)])
    return avg_point_per_country  # average score of each country's wines


def avg_point_sort(wine_list, country):
    """Return [country name, average score] pairs sorted by score, descending."""
    avg_point_per_country = []
    for country_name in country:
        # wine_list holds data rows only (pandas consumed the header), so no row is skipped
        point_of_country = [float(x[POINTS]) for x in wine_list if x[COUNTRY] == country_name]
        avg_point_per_country.append([country_name, round(sum(point_of_country) / len(point_of_country), 2)])
    return sorted(avg_point_per_country, key=lambda x: x[1], reverse=True)  # sorted by average score, descending


def top_10_point(wine_list):
    wine_top_point = [[x[NUMBER], x[COUNTRY], x[POINTS], x[PRICE]] for x in wine_list if not math.isnan(x[POINTS])]
    return sorted(wine_top_point, key=lambda x: x[2], reverse=True)[:10]


def top_20_price(wine_list):
    wine_top_price = [[x[NUMBER], x[COUNTRY], x[POINTS], x[PRICE]] for x in wine_list if not math.isnan(x[PRICE])]
    return sorted(wine_top_price, key=lambda x: x[3], reverse=True)[:20]


def amount_of_point(wine_list):
    """Return [score, count] pairs in ascending order of score, ignoring missing scores."""
    amount = [x[POINTS] for x in wine_list if not math.isnan(x[POINTS])]  # every valid score
    point_list = sorted(set(amount))  # distinct scores, ascending
    amount_of_points = [[point, amount.count(point)] for point in point_list]
    return amount_of_points


def most_of_point(amount_of_points):
    return max(amount_of_points, key=lambda x: x[1])


def avg_price_of_most_point(wine_list, most_of_points):
    price_of_point = [x[PRICE] for x in wine_list if x[POINTS] == most_of_points[0] and not math.isnan(x[PRICE])]
    avg_price_of_point = sum(price_of_point) / len(price_of_point)
    return round(avg_price_of_point, 2)


def judge(txt):
    filename = './data/winemag-data.csv'
    wine = csv_to_ls(filename)
    country = country_ls(wine)
    if txt == '国家名列表':
        print(country)
    elif txt == '平均分':
        print(avg_point(wine, country))  # average score of each country's wines
    elif txt == '平均分排序':
        print(avg_point_sort(wine, country))  # average score per country, descending
    elif txt == '评分最高':
        print(top_10_point(wine))  # ten highest-scoring wines: number, country, points, price
    elif txt == '价格最高':
        print(top_20_price(wine))  # twenty most expensive wines: number, country, points, price
    elif txt == '葡萄酒评分':
        amount_point = amount_of_point(wine)
        most_point = most_of_point(amount_point)
        print(amount_point)  # number of wines for each score
        print(most_point)  # most common score and its count
        print(avg_price_of_most_point(wine, most_point))  # average price of wines with the most common score
    else:
        print('输入错误')  # "input error"


if __name__ == '__main__':
    text = input()
    judge(text)