Python for Data Analysis, beginner worked examples, part 3: the USDA food database

Article link: https://blog.csdn.net/m0_56162579/article/details/131690472

Contents

- Problem description
- Source dataset
- Data analysis

Problem description

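The US Department of Agriculture (USDA) provides a database of food nutrient information. Each food has a number of identifying attributes along with two lists, one of nutrients and one of portion sizes. Data in this form is not well suited to analysis, so some work is needed to reshape it into a better form.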

Source dataset

Click here to get the data for parts 1-4 of this series
Extraction code: if5a

Data analysis

Load the JSON dataset into db and check how many records it contains.

import json
import pandas as pd
# open the file in a with-block so it is closed after loading
with open("datasets/usda_food/database.json") as f:
    db = json.load(f)
len(db)
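
To make the shape of each record concrete, here is a minimal sketch that parses a tiny hand-written record. The record is hypothetical: its values are invented for illustration and are not taken from the real file. Only the key names mirror those used later in this post.

```python
import json

# Hypothetical, hand-written record; all values are invented for illustration.
sample_text = """
[
  {
    "id": 1,
    "description": "Example food",
    "group": "Example group",
    "manufacturer": "",
    "nutrients": [
      {"value": 1.5, "units": "g", "description": "Protein", "group": "Composition"},
      {"value": 0.2, "units": "mg", "description": "Zinc, Zn", "group": "Elements"}
    ]
  }
]
"""

sample_db = json.loads(sample_text)

print(len(sample_db))                # number of food records
print(sorted(sample_db[0].keys()))   # identifying attributes plus the nutrients list
print(sample_db[0]["nutrients"][0])  # each nutrient entry is itself a dict
```

The point is that the top level is a list of dicts, and each dict nests another list of dicts under nutrients, which is why the data needs reshaping before analysis.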

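First list all the keys of the first record; then take the first entry under its nutrients key; then convert that record's nutrients list into a DataFrame; finally show the first seven rows.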

db[0].keys()
db[0]["nutrients"][0]
nutrients = pd.DataFrame(db[0]["nutrients"])
nutrients.head(7)
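
As a small illustration of why this conversion works, the sketch below builds a DataFrame from a list of dicts. The two entries are hypothetical and their values are invented.

```python
import pandas as pd

# Hypothetical nutrient entries; values are invented for illustration.
sample_nutrients = [
    {"value": 1.5, "units": "g", "description": "Protein", "group": "Composition"},
    {"value": 0.2, "units": "mg", "description": "Zinc, Zn", "group": "Elements"},
]

# Each dict becomes one row; the dict keys become the column names.
frame = pd.DataFrame(sample_nutrients)

print(frame.columns.tolist())
print(frame.shape)
```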

Extract the "description", "group", "id", and "manufacturer" fields from every record and store them in a DataFrame, then show the first five rows and the summary info.

info_keys = ["description", "group", "id", "manufacturer"]
info = pd.DataFrame(db, columns=info_keys)
info.head()
info.info()
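
The columns argument does two things here: it keeps only the listed keys, and it fills a key that is missing from a record with NaN. A minimal sketch on hypothetical records (the names and values are invented):

```python
import pandas as pd

# Hypothetical records; the second one has no "manufacturer" key.
records = [
    {"id": 1, "description": "Food A", "group": "G1",
     "manufacturer": "Acme", "extra": "ignored"},
    {"id": 2, "description": "Food B", "group": "G2"},
]

keys = ["description", "group", "id", "manufacturer"]
sample_info = pd.DataFrame(records, columns=keys)

print(sample_info)
# The missing manufacturer shows up as a null value.
print(sample_info["manufacturer"].isna().tolist())
```

This is why info.info() can report fewer non-null entries for manufacturer than for the other columns.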

Count how often each value occurs in the group column and return the ten most frequent.

# the top-level pd.value_counts is deprecated in recent pandas; use the Series method
info["group"].value_counts()[:10]
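
A quick sketch of what value_counts returns, using a hypothetical Series of group labels: the result is sorted by count in descending order, so slicing the front gives the most frequent values.

```python
import pandas as pd

# Hypothetical group labels, invented for illustration.
groups = pd.Series(["Dairy", "Beef", "Dairy", "Pork", "Dairy", "Beef"])

counts = groups.value_counts()  # sorted from most to least frequent
top2 = counts.head(2)

print(top2)
```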

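For each record in db, convert its nutrients list to a DataFrame, add a column holding the food's id, and append it to the nutrients list. Then concatenate all the pieces into a single DataFrame.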

nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
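
The sketch below repeats this loop on a tiny hypothetical dataset and, for comparison, flattens the same data with pd.json_normalize. The json_normalize route is an alternative shown for illustration, not the approach this post uses; the data is invented.

```python
import pandas as pd

# Hypothetical records; values are invented for illustration.
sample_db = [
    {"id": 1, "nutrients": [{"description": "Protein", "value": 1.0},
                            {"description": "Zinc, Zn", "value": 0.2}]},
    {"id": 2, "nutrients": [{"description": "Protein", "value": 3.0}]},
]

# Loop version: one small frame per food, tagged with that food's id.
pieces = []
for rec in sample_db:
    piece = pd.DataFrame(rec["nutrients"])
    piece["id"] = rec["id"]
    pieces.append(piece)
combined = pd.concat(pieces, ignore_index=True)

# Alternative: flatten the nested list and carry the parent id along.
normalized = pd.json_normalize(sample_db, record_path=["nutrients"], meta=["id"])

print(combined)
print(normalized)
```

ignore_index=True matters in the loop version: without it, every piece would keep its own 0-based index and the combined frame would have repeated index labels.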

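Count the duplicate rows in nutrients and drop them. Both info and nutrients have description and group columns, so rename those columns in each frame to make clear which is which.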

nutrients.duplicated().sum()  # number of duplicates
nutrients = nutrients.drop_duplicates()
col_mapping = {"description" : "food",
               "group"       : "fgroup"}
info = info.rename(columns=col_mapping)
info.info()
col_mapping = {"description" : "nutrient",
               "group" : "nutgroup"}
nutrients = nutrients.rename(columns=col_mapping)
nutrients
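
A minimal sketch of the deduplicate-then-rename step on a hypothetical three-row frame (values invented). Note that duplicated() flags only the second and later copies of a row, so its sum is the number of rows that drop_duplicates() will remove.

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row.
frame = pd.DataFrame({
    "description": ["Protein", "Protein", "Zinc, Zn"],
    "value": [1.0, 1.0, 0.2],
    "id": [1, 1, 1],
})

n_dupes = int(frame.duplicated().sum())  # rows that repeat an earlier row
deduped = frame.drop_duplicates()        # keeps the first copy of each row

# rename returns a new frame; deduped itself is left unchanged.
renamed = deduped.rename(columns={"description": "nutrient"})

print(n_dupes)
print(renamed.columns.tolist())
```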

Merge the nutrients and info DataFrames on the id column and assign the result to ndata. Then return the row of ndata at position 30000.

ndata = pd.merge(nutrients, info, on="id")
ndata.info()
ndata.iloc[30000]
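
To see why the columns were renamed before merging, the sketch below merges two hypothetical frames that both have a description column. Without renaming, pandas disambiguates the clash with the default _x and _y suffixes, which is harder to read. All values are invented.

```python
import pandas as pd

# Hypothetical frames; both carry a "description" column.
foods = pd.DataFrame({"id": [1, 2], "description": ["Food A", "Food B"]})
nuts = pd.DataFrame({
    "id": [1, 1, 2],
    "description": ["Protein", "Zinc, Zn", "Protein"],
    "value": [1.0, 0.2, 3.0],
})

# Merging as-is: the overlapping column gets _x / _y suffixes.
clash = pd.merge(nuts, foods, on="id")

# Renaming first gives self-explanatory column names.
clean = pd.merge(
    nuts.rename(columns={"description": "nutrient"}),
    foods.rename(columns={"description": "food"}),
    on="id",
)

print(clash.columns.tolist())
print(clean.columns.tolist())
```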

Draw a horizontal bar chart of the median "Zinc, Zn" value for each fgroup.

result = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)
result["Zinc, Zn"].sort_values().plot(kind="barh")
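
The 0.5 quantile is the median, so the grouped computation can be checked by hand on a small hypothetical frame. The values below are invented, and the plotting call is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Hypothetical long-format data; values are invented for illustration.
toy = pd.DataFrame({
    "nutrient": ["Zinc, Zn"] * 5 + ["Protein"],
    "fgroup": ["Beef", "Beef", "Beef", "Dairy", "Dairy", "Beef"],
    "value": [4.0, 6.0, 5.0, 1.0, 3.0, 20.0],
})

grouped = toy.groupby(["nutrient", "fgroup"])["value"]
medians = grouped.median()
quantiles = grouped.quantile(0.5)  # same numbers as the median

# Selecting the outer level leaves a Series indexed by fgroup.
zinc = medians["Zinc, Zn"].sort_values()

print(zinc)
```

Sorting before plotting is what makes the bars in a barh chart appear in ascending order.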

Build a DataFrame max_foods that holds, for each (nutgroup, nutrient) group, the row with the largest value.

by_nutrient = ndata.groupby(["nutgroup", "nutrient"])

def get_maximum(x):
    return x.loc[x.value.idxmax()]

max_foods = by_nutrient.apply(get_maximum)[["value", "food"]]

# truncate food names to 50 characters to keep the display compact
max_foods["food"] = max_foods["food"].str[:50]
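
The same result can be sketched without apply by asking each group for the index label of its largest value and then selecting those rows. This is an alternative shown under stated assumptions, not the method used above; the frame is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical merged data; values are invented for illustration.
toy = pd.DataFrame({
    "nutgroup": ["Elements", "Elements", "Elements", "Composition"],
    "nutrient": ["Zinc, Zn", "Zinc, Zn", "Iron, Fe", "Protein"],
    "food": ["Oysters", "Milk", "Liver", "Chicken"],
    "value": [90.0, 0.4, 30.0, 25.0],
})

# idxmax per group returns the row label of that group's largest value.
top_rows = toy.groupby(["nutgroup", "nutrient"])["value"].idxmax()

# Select those rows and index them by the grouping keys.
top = (
    toy.loc[top_rows, ["nutgroup", "nutrient", "value", "food"]]
    .set_index(["nutgroup", "nutrient"])
)

print(top)
```

This relies on the frame having unique index labels, since idxmax returns labels rather than positions.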