Pandas Basic Operations and Common Interfaces

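Notes

The code in this article is based on Python 3.6.5 and pandas 0.23.0.
pandas is a module built on top of NumPy that provides a rich set of data preprocessing interfaces.
Before using pandas, import the module at the top of your code: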

import pandas as pd

Reading a CSV file

Suppose we have a file named food_info.csv in the following format:

NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),Calcium_(mg),Iron_(mg),Magnesium_(mg),Phosphorus_(mg),Potassium_(mg),Sodium_(mg),Zinc_(mg),Copper_(mg),Manganese_(mg),Selenium_(mcg),Vit_C_(mg),Thiamin_(mg),Riboflavin_(mg),Niacin_(mg),Vit_B6_(mg),Vit_B12_(mcg),Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
01001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0,0.06,24,0.02,2,24,24,643,0.09,0,0,1,0,0.005,0.034,0.042,0.003,0.17,2499,684,2.32,1.5,60,7,51.368,21.021,3.043,215

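The first line is the header and the second line is a data record. The file contains many data rows; only one is shown here to save space. Values in a CSV file are usually separated by commas, and the file can be opened as a table in tools such as Excel.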

food_info = pd.read_csv("food_info.csv")
print(type(food_info))  # <class 'pandas.core.frame.DataFrame'>
print(food_info.dtypes)
# NDB_No               int64
# Shrt_Desc           object
# Water_(g)          float64
# Energ_Kcal           int64
# Protein_(g)        float64
# Lipid_Tot_(g)      float64
# Ash_(g)            float64
# Carbohydrt_(g)     float64
# Fiber_TD_(g)       float64
# Sugar_Tot_(g)      float64
# Calcium_(mg)       float64
# Iron_(mg)          float64
# Magnesium_(mg)     float64
# Phosphorus_(mg)    float64
# Potassium_(mg)     float64
# Sodium_(mg)        float64
# Zinc_(mg)          float64
# Copper_(mg)        float64
# Manganese_(mg)     float64
# Selenium_(mcg)     float64
# Vit_C_(mg)         float64
# Thiamin_(mg)       float64
# Riboflavin_(mg)    float64
# Niacin_(mg)        float64
# Vit_B6_(mg)        float64
# Vit_B12_(mcg)      float64
# Vit_A_IU           float64
# Vit_A_RAE          float64
# Vit_E_(mg)         float64
# Vit_D_mcg          float64
# Vit_D_IU           float64
# Vit_K_(mcg)        float64
# FA_Sat_(g)         float64
# FA_Mono_(g)        float64
# FA_Poly_(g)        float64
# Cholestrl_(mg)     float64
# dtype: object

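The code above reads food_info.csv and stores the result in a DataFrame object. A DataFrame can be thought of as a matrix-like, two-dimensional table.
Looking at the type of each column, most are int64 or float64, but there is also an object type. pandas stores strings as object, which here applies to the Shrt_Desc column. Note that NDB_No was inferred as int64, so the leading zero of 01001 in the file is not preserved.
The commonly used data types are:

Type      Description
object    string values
int       integer values
float     floating-point values
datetime  date and time values
bool      Boolean values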
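
As a small illustration of how type inference affects the result, the sketch below reads a one-row excerpt of the file shown earlier from an in-memory string (an assumption made so the example runs without food_info.csv on disk). It compares the default inference with passing dtype for NDB_No, which keeps the code as text.

```python
import io

import pandas as pd

# One-row excerpt of the file shown earlier, held in memory for the example.
csv_text = (
    "NDB_No,Shrt_Desc,Water_(g),Energ_Kcal\n"
    "01001,BUTTER WITH SALT,15.87,717\n"
)

# Default inference: NDB_No is parsed as an integer, so the leading zero is lost.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["NDB_No"].dtype)    # int64
print(inferred.loc[0, "NDB_No"])   # 1001

# Asking for str keeps the value exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"NDB_No": str})
print(as_text.loc[0, "NDB_No"])    # 01001
```

Whether the identifier should be kept as text depends on how it is used later; this is only one possible choice.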

If the read_csv() function is still unclear, you can view its documentation with the following code:

help(pd.read_csv)  # pass the function itself; calling pd.read_csv() with no arguments raises a TypeError

We can use the DataFrame instance method head() to check whether the data was read in correctly:

print(food_info.head())
#    NDB_No                 Shrt_Desc       ...        FA_Poly_(g)  Cholestrl_(mg)
# 0    1001          BUTTER WITH SALT       ...              3.043           215.0
# 1    1002  BUTTER WHIPPED WITH SALT       ...              3.012           219.0
# 2    1003      BUTTER OIL ANHYDROUS       ...              3.694           256.0
# 3    1004               CHEESE BLUE       ...              0.800            75.0
# 4    1005              CHEESE BRICK       ...              0.784            94.0
# 
# [5 rows x 36 columns]

head() takes a parameter n (default 5) that sets how many rows to show; you can pass your own value.
Similarly, tail() shows the last few rows of the loaded data:

print(food_info.tail())
#       NDB_No      ...       Cholestrl_(mg)
# 8613   83110      ...                 95.0
# 8614   90240      ...                 41.0
# 8615   90480      ...                  0.0
# 8616   90560      ...                 50.0
# 8617   93600      ...                 50.0
# 
# [5 rows x 36 columns]
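
To show the n parameter in action, here is a minimal sketch on a small made-up frame (the column names id and value are hypothetical and stand in for food_info):

```python
import pandas as pd

# Small made-up frame standing in for food_info.
df = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```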

The columns attribute returns the header (column names) of the CSV table:

print(food_info.columns)
# Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
#        'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
#        'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
#        'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
#        'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
#        'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
#        'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
#        'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
#        'Cholestrl_(mg)'],
#       dtype='object')

As mentioned earlier, a DataFrame can be viewed as a matrix, so the shape attribute gives its dimensions:

print(food_info.shape)  # (8618, 36)

The row count excludes the header row: the data contains 8618 samples, each with 36 features.
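
Because shape is an ordinary tuple, it can be unpacked directly, and columns can be converted to a plain Python list. A minimal sketch on a made-up frame (the column names a and b are hypothetical):

```python
import pandas as pd

# Made-up frame with 3 rows and 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = df.shape            # shape is a (rows, columns) tuple
column_names = df.columns.tolist()   # Index object converted to a list

print(n_rows, n_cols)   # 3 2
print(column_names)     # ['a', 'b']
print(len(df))          # 3, the number of rows
```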

Indexing and computation

To get a single row of a DataFrame, use the loc indexer with the row label inside "[]". For example:

print(food_info.loc[0])
# NDB_No                         1001
# Shrt_Desc          BUTTER WITH SALT
# Water_(g)                     15.87
# Energ_Kcal                      717
# Protein_(g)                    0.85
# Lipid_Tot_(g)                 81.11
# Ash_(g)                        2.11
# Carbohydrt_(g)                 0.06
# Fiber_TD_(g)                      0
# Sugar_Tot_(g)                  0.06
# Calcium_(mg)                     24
# Iron_(mg)                      0.02
# Magnesium_(mg)                    2
# Phosphorus_(mg)                  24
# Potassium_(mg)                   24
# Sodium_(mg)                     643
# Zinc_(mg)                      0.09
# Copper_(mg)                       0
# Manganese_(mg)                    0
# Selenium_(mcg)                    1
# Vit_C_(mg)                        0
# Thiamin_(mg)                  0.005
# Riboflavin_(mg)               0.034
# Niacin_(mg)                   0.042
# Vit_B6_(mg)                   0.003
# Vit_B12_(mcg)                  0.17
# Vit_A_IU                       2499
# Vit_A_RAE                       684
# Vit_E_(mg)                     2.32
# Vit_D_mcg                       1.5
# Vit_D_IU                         60
# Vit_K_(mcg)                       7
# FA_Sat_(g)                   51.368
# FA_Mono_(g)                  21.021
# FA_Poly_(g)                   3.043
# Cholestrl_(mg)                  215
# Name: 0, dtype: object
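
A single row selected with loc comes back as a Series whose index is the column names, which is why the output above is printed vertically. The sketch below rebuilds a two-row frame from values shown in the head() output earlier (an assumption made so the example runs without the CSV file) and reads individual fields from it:

```python
import pandas as pd

# Two rows rebuilt by hand from the head() output shown earlier.
df = pd.DataFrame({
    "NDB_No": [1001, 1002],
    "Shrt_Desc": ["BUTTER WITH SALT", "BUTTER WHIPPED WITH SALT"],
    "Cholestrl_(mg)": [215.0, 219.0],
})

row = df.loc[0]                          # one row, returned as a Series
print(type(row))
print(row["Shrt_Desc"])                  # BUTTER WITH SALT

# Row label and column label together select a single cell.
print(df.loc[0, "Cholestrl_(mg)"])       # 215.0
```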

Slicing also works here for retrieving multiple rows:

print(food_info.loc[3:6])  # rows labeled 3, 4, 5, 6 (a loc slice includes the end label)
print(food_info.loc[[2
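
The following sketch contrasts label-based slicing with loc, position-based slicing with iloc, and selection by a list of labels. It runs on a small made-up frame, and the label list [2, 5, 7] is a hypothetical choice for illustration, not the one used in the original code.

```python
import pandas as pd

# Made-up frame with the default integer index 0..7.
df = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17]})

by_label = df.loc[3:6]       # labels 3, 4, 5, 6: the end label is included
by_position = df.iloc[3:6]   # positions 3, 4, 5: the end position is excluded
picked = df.loc[[2, 5, 7]]   # an explicit list of labels (hypothetical choice)

print(by_label)
print(by_position)
print(picked)
```

With the default integer index, labels and positions coincide, so the only visible difference between loc and iloc here is whether the end of the slice is included.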