第七章食物数据库例子

最新推荐文章于 2022-03-10 14:57:49 发布

豆乳_艾米

最新推荐文章于 2022-03-10 14:57:49 发布

阅读量1.9k

点赞数 1

文章标签： python

本文链接：https://blog.csdn.net/yunini2/article/details/73604631

版权

In [3]: import json

In [4]: db=json.load(open('E:/foods-2011-10-03.json'))

In [5]: len(db)
Out[5]: 6636

In [6]:

db中的每条目都是一个含有某种食物全部数据的字典，nutrients是字典的一个列表，其中每个字典对应一种营养成分：

db[0].keys()
Out[6]: 
[u'portions',
 u'description',
 u'tags',
 u'nutrients',
 u'group',
 u'id',
 u'manufacturer']
In [7]: db[0]['nutrients'][0]Out[7]: 
{u'description': u'Protein',
 u'group': u'Composition',
 u'units': u'g',
 u'value': 25.18}
In [8]: from pandas import DataFrame
In [9]: nutrients=DataFrame(db[0]['nutrients'])
In [11]: nutrients[:7]Out[11]: 
                   description        group units    value
0                      Protein  Composition     g    25.18
1            Total lipid (fat)  Composition     g    29.20
2  Carbohydrate, by difference  Composition     g     3.06
3                          Ash        Other     g     3.28
4                       Energy       Energy  kcal   376.00
5                        Water  Composition     g    39.28
6                       Energy       Energy    kJ  1573.00

将字典列表转换为DataFrame时，可以只抽取其中一部分字段，将抽取食物的名称、分类、编号以及制造商信息：

In [12]: info_keys=['description','group','id','manufacturer']

In [13]: info=DataFrame(db,columns=info_keys)

In [14]: info[:5]
Out[14]: 
                          description                   group    id  
0                     Cheese, caraway  Dairy and Egg Products  1008   
1                     Cheese, cheddar  Dairy and Egg Products  1009   
2                        Cheese, edam  Dairy and Egg Products  1018   
3                        Cheese, feta  Dairy and Egg Products  1019   
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028   
  manufacturer  
0               
1               
2               
3               
4 


import pandas as pd



#通过value_counts,查看食物类别分布情况



pd.value_counts(info.group)[:10]

Out[20]: 

Vegetables and Vegetable Products    812

Beef Products                        618

Baked Products                       496

Breakfast Cereals                    403

Legumes and Legume Products          365

Fast Foods                           365

Lamb, Veal, and Game Products        345

Sweets                               341

Fruits and Fruit Juices              328

Pork Products                        328

Name: group, dtype: int64#==============================================================================

# 将所有食物的营养成分整合到一个大表中：

#1、将各食物的营养成分列表转换为一个DataFrame，并添加一个表示编号的列

#2、该DataFrame添加到一个表中

#3、通过concat将这些东西连接起来

#==============================================================================

nutrients=[]

for rec in db:

    fnuts=DataFrame(rec['nutrients'])

    fnuts['id']=rec['id']

    nutrients.append(fnuts)

nutrients=pd.concat(nutrients,ignore_index=True)

nutrients[:10]

Out[23]:
description group units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008
5 Water Composition g 39.28 1008
6 Energy Energy kJ 1573.00 1008
7 Fiber, total dietary Composition g 0.00 1008
8 Calcium, Ca Elements mg 673.00 1008
9 Iron, Fe Elements mg 0.64 1008
In [24]: #重复项



In [25]: nutrients.duplicated().sum()
Out[25]: 14179



In [26]: nutrients=nutrients.drop_duplicates()



In [27]: col_mapping={'description':'food','group':'fgroup'}



In [28]: info=info.rename(columns=col_mapping,copy=False)



In [29]: info[:10]
Out[29]:

food fgroup \
0 Cheese, caraway Dairy and Egg Products 
1 Cheese, cheddar Dairy and Egg Products 
2 Cheese, edam Dairy and Egg Products 
3 Cheese, feta Dairy and Egg Products 
4 Cheese, mozzarella, part skim milk Dairy and Egg Products 
5 Cheese, mozzarella, part skim milk, low moisture Dairy and Egg Products

6 Cheese, romano Dairy and Egg Products 
7 Cheese, roquefort Dairy and Egg Products 
8 Cheese spread, pasteurized process, american, ... Dairy and Egg Products

9 Cream, fluid, half and half Dairy and Egg Products 



id manufacturer 
0 1008 
1 1009 
2 1018 
3 1019 
4 1028 
5 1029 
6 1038 
7 1039 
8 1048 
9 1049 
In [64]: col_mapping={'description':'nutrient','group':'nutgroup'}



In [65]: info=nutrients.rename(columns=col_mapping,copy=False)



In [66]: info[:10]
Out[66]:

nutrient nutgroup units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008
5 Water Composition g 39.28 1008
6 Energy Energy kJ 1573.00 1008
7 Fiber, total dietary Composition g 0.00 1008
8 Calcium, Ca Elements mg 673.00 1008
9 Iron, Fe Elements mg 0.64 1008



In [67]: #可以将info1和info合并起来



In [68]: ndata=pd.merge(info,info1,on='id',how='outer')



In [69]: ndata[:10]
Out[69]:

nutrient nutgroup units value id \
0 Protein Composition g 25.18 1008 
1 Total lipid (fat) Composition g 29.20 1008 
2 Carbohydrate, by difference Composition g 3.06 1008 
3 Ash Other g 3.28 1008 
4 Energy Energy kcal 376.00 1008 
5 Water Composition g 39.28 1008 
6 Energy Energy kJ 1573.00 1008 
7 Fiber, total dietary Composition g 0.00 1008 
8 Calcium, Ca Elements mg 673.00 1008 
9 Iron, Fe Elements mg 0.64 1008 



food fgroup manufacturer 
0 Cheese, caraway Dairy and Egg Products 
1 Cheese, caraway Dairy and Egg Products 
2 Cheese, caraway Dairy and Egg Products 
3 Cheese, caraway Dairy and Egg Products 
4 Cheese, caraway Dairy and Egg Products 
5 Cheese, caraway Dairy and Egg Products 
6 Cheese, caraway Dairy and Egg Products 
7 Cheese, caraway Dairy and Egg Products 
8 Cheese, caraway Dairy and Egg Products 
9 Cheese, caraway Dairy and Egg Products 



In [70]: ndata.ix[30000]
Out[70]:

nutrient Glycine
nutgroup Amino Acids
units g
value 0.04
id 6158
food Soup, tomato bisque, canned, condensed
fgroup Soups, Sauces, and Gravies
manufacturer 
Name: 30000, dtype: object



In [71]: result=ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
import matplotlib

%matplotlib inline

result['Zinc, Zn'].order().plot(kind='barh')


In [85]: by_nutrient=ndata.groupby(['nutgroup','nutrient'])



In [86]: get_maxinum=lambda x:x.xs(x.value.idxmax())



In [87]: get_mininum=lambda x:x.xs(x.value.idxmin())



In [88]: max_foods=by_nutrient.apply(get_maxinum)[['value','food']]



In [89]: #让food小一点



In [90]: max_foods.food=max_foods.food.str[:50]



In [91]: max_foods.ix['Amino Acids']['food']
Out[91]:

nutrient
Alanine Gelatins, dry powder, unsweetened
Arginine Seeds, sesame flour, low-fat
Aspartic acid Soy protein isolate
Cystine Seeds, cottonseed flour, low fat (glandless)
Glutamic acid Soy protein isolate
Glycine Gelatins, dry powder, unsweetened
Histidine Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine Fish, cod, Atlantic, dried and salted
Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline Gelatins, dry powder, unsweetened
Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, dtype: object