第七章 食物数据库例子


In [3]: import json

In [4]: db=json.load(open('E:/foods-2011-10-03.json'))

In [5]: len(db)
Out[5]: 6636

In [6]:
    db中的每条目都是一个含有某种食物全部数据的字典,nutrients是字典的一个列表,其中每个字典对应一种营养成分:

db[0].keys()
Out[6]: 
[u'portions',
 u'description',
 u'tags',
 u'nutrients',
 u'group',
 u'id',
 u'manufacturer']
In [7]: db[0]['nutrients'][0]

Out[7]:

{u'description': u'Protein',

u'group': u'Composition',

u'units': u'g',

u'value': 25.18}

In [8]: from pandas import DataFrame

In [9]: nutrients=DataFrame(db[0]['nutrients'])

In [11]: nutrients[:7]

Out[11]:

description group units value

0 Protein Composition g 25.18

1 Total lipid (fat) Composition g 29.20

2 Carbohydrate, by difference Composition g 3.06

3 Ash Other g 3.28

4 Energy Energy kcal 376.00

5 Water Composition g 39.28

6 Energy Energy kJ 1573.00

    将字典列表转换为DataFrame时,可以只抽取其中一部分字段,将抽取食物的名称、分类、编号以及制造商信息:

In [12]: info_keys=['description','group','id','manufacturer']

In [13]: info=DataFrame(db,columns=info_keys)

In [14]: info[:5]
Out[14]: 
                          description                   group    id  
0                     Cheese, caraway  Dairy and Egg Products  1008   
1                     Cheese, cheddar  Dairy and Egg Products  1009   
2                        Cheese, edam  Dairy and Egg Products  1018   
3                        Cheese, feta  Dairy and Egg Products  1019   
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028   
  manufacturer  
0               
1               
2               
3               
4 


import pandas as pd

#通过value_counts,查看食物类别分布情况

pd.value_counts(info.group)[:10]
Out[20]:
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64#==============================================================================
# 将所有食物的营养成分整合到一个大表中:
#1、将各食物的营养成分列表转换为一个DataFrame,并添加一个表示编号的列
#2、该DataFrame添加到一个表中
#3、通过concat将这些东西连接起来
#==============================================================================
nutrients=[]
for rec in db:
    fnuts=DataFrame(rec['nutrients'])
    fnuts['id']=rec['id']
    nutrients.append(fnuts)
nutrients=pd.concat(nutrients,ignore_index=True)
nutrients[:10]
Out[23]:

description group units value id

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008

In [24]: #重复项


In [25]: nutrients.duplicated().sum()

Out[25]: 14179


In [26]: nutrients=nutrients.drop_duplicates()


In [27]: col_mapping={'description':'food','group':'fgroup'}


In [28]: info=info.rename(columns=col_mapping,copy=False)


In [29]: info[:10]

Out[29]:

food fgroup \

0 Cheese, caraway Dairy and Egg Products

1 Cheese, cheddar Dairy and Egg Products

2 Cheese, edam Dairy and Egg Products

3 Cheese, feta Dairy and Egg Products

4 Cheese, mozzarella, part skim milk Dairy and Egg Products

5 Cheese, mozzarella, part skim milk, low moisture Dairy and Egg Products

6 Cheese, romano Dairy and Egg Products

7 Cheese, roquefort Dairy and Egg Products

8 Cheese spread, pasteurized process, american, ... Dairy and Egg Products

9 Cream, fluid, half and half Dairy and Egg Products


id manufacturer

0 1008

1 1009

2 1018

3 1019

4 1028

5 1029

6 1038

7 1039

8 1048

9 1049

In [64]: col_mapping={'description':'nutrient','group':'nutgroup'}


In [65]: info=nutrients.rename(columns=col_mapping,copy=False)


In [66]: info[:10]

Out[66]:

nutrient nutgroup units value id

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008


In [67]: #可以将info1和info合并起来


In [68]: ndata=pd.merge(info,info1,on='id',how='outer')


In [69]: ndata[:10]

Out[69]:

nutrient nutgroup units value id \

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008


food fgroup manufacturer

0 Cheese, caraway Dairy and Egg Products

1 Cheese, caraway Dairy and Egg Products

2 Cheese, caraway Dairy and Egg Products

3 Cheese, caraway Dairy and Egg Products

4 Cheese, caraway Dairy and Egg Products

5 Cheese, caraway Dairy and Egg Products

6 Cheese, caraway Dairy and Egg Products

7 Cheese, caraway Dairy and Egg Products

8 Cheese, caraway Dairy and Egg Products

9 Cheese, caraway Dairy and Egg Products


In [70]: ndata.ix[30000]

Out[70]:

nutrient Glycine

nutgroup Amino Acids

units g

value 0.04

id 6158

food Soup, tomato bisque, canned, condensed

fgroup Soups, Sauces, and Gravies

manufacturer

Name: 30000, dtype: object


In [71]: result=ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)

import matplotlib
%matplotlib inline
result['Zinc, Zn'].order().plot(kind='barh')

In [85]: by_nutrient=ndata.groupby(['nutgroup','nutrient'])


In [86]: get_maxinum=lambda x:x.xs(x.value.idxmax())


In [87]: get_mininum=lambda x:x.xs(x.value.idxmin())


In [88]: max_foods=by_nutrient.apply(get_maxinum)[['value','food']]


In [89]: #让food小一点


In [90]: max_foods.food=max_foods.food.str[:50]


In [91]: max_foods.ix['Amino Acids']['food']

Out[91]:

nutrient

Alanine Gelatins, dry powder, unsweetened

Arginine Seeds, sesame flour, low-fat

Aspartic acid Soy protein isolate

Cystine Seeds, cottonseed flour, low fat (glandless)

Glutamic acid Soy protein isolate

Glycine Gelatins, dry powder, unsweetened

Histidine Whale, beluga, meat, dried (Alaska Native)

Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...

Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...

Methionine Fish, cod, Atlantic, dried and salted

Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Proline Gelatins, dry powder, unsweetened

Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Tryptophan Sea lion, Steller, meat with fat (Alaska Native)

Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Name: food, dtype: object




















  • 1
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
数据库系统概论第七章主要讨论了关系数据库的设计与规范化。在设计一个关系数据库之前,需要对关系模式进行设计,并对其进行规范化,以消除数据冗余和数据插入异常。 对于关系模式的设计,需要确定关系的名称、属性和主键。属性的选择应该合理,能够准确描述实体的特征,并且尽量避免数据冗余。主键是用来唯一标识关系中的元组,因此应该选择能够保证唯一性的属性作为主键。 规范化是一种将关系模式转化为满足特定要求的标准化过程。常用的规范化方法有1NF、2NF、3NF等。1NF要求每个属性都是不可分割的原子值,即每个属性不能再分解为其他更小的值。2NF要求满足1NF的基础上,非主键属性依赖于整个主键而不是部分主键。3NF要求满足2NF的基础上,消除所有传递依赖。 规范化的目的是为了消除数据冗余和数据插入异常。数据冗余指的是同样的数据数据库中存储了多次,这不仅浪费了存储空间,也容易导致数据一致性问题。数据插入异常是指在插入新数据时,可能会因为依赖关系的存在而导致插入失败或者引入不正确的数据。 通过规范化的过程,可以有效地设计出高效、合理的关系数据库。但规范化也有一定的局限性,过度规范化可能会导致查询的复杂性增加,影响数据库的性能。因此,在进行规范化设计时,需要根据实际情况进行权衡,并根据具体需求进行灵活的调整。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值