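Basic pandas Operations and Common APIs
Note
The code in this article was written against Python 3.6.5 and pandas 0.23.0.
pandas is a module built on top of numpy that provides a rich set of interfaces for data preprocessing.
Before using pandas, import the module at the top of your code: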
import pandas as pd
Reading a CSV File
Suppose we have a file named food_info.csv with the following format:
NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),Calcium_(mg),Iron_(mg),Magnesium_(mg),Phosphorus_(mg),Potassium_(mg),Sodium_(mg),Zinc_(mg),Copper_(mg),Manganese_(mg),Selenium_(mcg),Vit_C_(mg),Thiamin_(mg),Riboflavin_(mg),Niacin_(mg),Vit_B6_(mg),Vit_B12_(mcg),Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
01001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0,0.06,24,0.02,2,24,24,643,0.09,0,0,1,0,0.005,0.034,0.042,0.003,0.17,2499,684,2.32,1.5,60,7,51.368,21.021,3.043,215
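The first line is the header and the second is a data record. The file contains many rows; only one is shown here to save space. Values in a CSV file are usually separated by commas, and the file can be opened as a table in tools such as Excel.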
food_info = pd.read_csv("food_info.csv")
print(type(food_info)) # <class 'pandas.core.frame.DataFrame'>
print(food_info.dtypes)
# NDB_No int64
# Shrt_Desc object
# Water_(g) float64
# Energ_Kcal int64
# Protein_(g) float64
# Lipid_Tot_(g) float64
# Ash_(g) float64
# Carbohydrt_(g) float64
# Fiber_TD_(g) float64
# Sugar_Tot_(g) float64
# Calcium_(mg) float64
# Iron_(mg) float64
# Magnesium_(mg) float64
# Phosphorus_(mg) float64
# Potassium_(mg) float64
# Sodium_(mg) float64
# Zinc_(mg) float64
# Copper_(mg) float64
# Manganese_(mg) float64
# Selenium_(mcg) float64
# Vit_C_(mg) float64
# Thiamin_(mg) float64
# Riboflavin_(mg) float64
# Niacin_(mg) float64
# Vit_B6_(mg) float64
# Vit_B12_(mcg) float64
# Vit_A_IU float64
# Vit_A_RAE float64
# Vit_E_(mg) float64
# Vit_D_mcg float64
# Vit_D_IU float64
# Vit_K_(mcg) float64
# FA_Sat_(g) float64
# FA_Mono_(g) float64
# FA_Poly_(g) float64
# Cholestrl_(mg) float64
# dtype: object
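The code above reads food_info.csv and stores the result in a DataFrame object. A DataFrame can be thought of as a matrix-like structure: a two-dimensional table with labeled rows and columns.
Looking at the data type of each column, most are int64 or float64, but there is also an object column. pandas stores strings as object, which here is the Shrt_Desc column. Note that NDB_No was inferred as int64, so the value 01001 in the file is read as 1001.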
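Type inference can be overridden when a column should keep its original text. The sketch below is only an illustration: it embeds a three-column excerpt of the file in a string (so it runs without food_info.csv) and passes the dtype parameter of read_csv to keep NDB_No as a string. The variable names csv_text, inferred, and as_text are made up for this example.

```python
import io

import pandas as pd

# Three-column excerpt of food_info.csv, embedded so the sketch is self-contained.
csv_text = "NDB_No,Shrt_Desc,Water_(g)\n01001,BUTTER WITH SALT,15.87\n"

# Default inference: "01001" is parsed as the integer 1001.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: keep NDB_No as text so the leading zero survives.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"NDB_No": str})

print(inferred["NDB_No"].iloc[0])  # 1001
print(as_text["NDB_No"].iloc[0])   # 01001
```

Whether to keep identifiers such as NDB_No as text depends on how they are used later; if they are only ever compared as numbers, the default inference may be fine.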
The common data types are described below:
Type | Description |
---|---|
object | for string values |
int | for integer values |
float | for float values |
datetime | for time values |
bool | for Boolean values |
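To see these types side by side, the following sketch builds a small DataFrame with one column per type and prints its dtypes. The column names and values are invented for illustration and do not come from food_info.csv. The exact dtype labels printed (for example the name used for string columns) can differ between pandas versions, so no expected output is shown.

```python
import pandas as pd

# Hypothetical data: one column for each type in the table above.
sample = pd.DataFrame({
    "name": ["item_a", "item_b"],                               # strings
    "count": [1, 2],                                            # integers
    "weight": [0.5, 1.25],                                      # floats
    "measured": pd.to_datetime(["2018-05-01", "2018-05-02"]),   # datetimes
    "in_stock": [True, False],                                  # booleans
})

# One dtype per column, in the same layout as food_info.dtypes.
print(sample.dtypes)
```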
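If read_csv() is still unclear, the following code displays its documentation. Pass the function itself, without calling it:
help(pd.read_csv)
We can use the DataFrame instance method head() to check that the data was read in the expected format: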
print(food_info.head())
# NDB_No Shrt_Desc ... FA_Poly_(g) Cholestrl_(mg)
# 0 1001 BUTTER WITH SALT ... 3.043 215.0
# 1 1002 BUTTER WHIPPED WITH SALT ... 3.012 219.0
# 2 1003 BUTTER OIL ANHYDROUS ... 3.694 256.0
# 3 1004 CHEESE BLUE ... 0.800 75.0
# 4 1005 CHEESE BRICK ... 0.784 94.0
#
# [5 rows x 36 columns]
head() takes a parameter n (default 5) that sets the number of rows shown; you can pass any value you like.
Similarly, tail() shows the last few rows of the data:
print(food_info.tail())
# NDB_No ... Cholestrl_(mg)
# 8613 83110 ... 95.0
# 8614 90240 ... 41.0
# 8615 90480 ... 0.0
# 8616 90560 ... 50.0
# 8617 93600 ... 50.0
#
# [5 rows x 36 columns]
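Both head() and tail() accept the row count n. The sketch below uses a made-up ten-row frame (demo, with hypothetical columns row_id and value) in place of food_info so that it runs on its own.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for food_info.
demo = pd.DataFrame({
    "row_id": list(range(10)),
    "value": [i * 1.5 for i in range(10)],
})

first_three = demo.head(3)  # first three rows: row_id 0, 1, 2
last_two = demo.tail(n=2)   # last two rows: row_id 8, 9

print(first_three)
print(last_two)
```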
The columns attribute returns the header (column names) of the CSV table:
print(food_info.columns)
# Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
# 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
# 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
# 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
# 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
# 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
# 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
# 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
# 'Cholestrl_(mg)'],
# dtype='object')
As mentioned earlier, a DataFrame can be viewed as a matrix, so the shape attribute gives its dimensions:
print(food_info.shape) # (8618, 36)
The row count excludes the header line: the data contains 8618 samples, each with 36 columns (features).
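Because shape is a plain (rows, columns) tuple, it can be unpacked directly. The sketch below uses a small invented frame (demo) rather than food_info; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = demo.shape

print(n_rows, n_cols)     # 3 2
print(len(demo))          # 3, same as the row count
print(len(demo.columns))  # 2, same as the column count
```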
Indexing and Computation
To get a single row of a DataFrame, use the loc indexer with the row's index label inside square brackets. For example:
print(food_info.loc[0])
# NDB_No 1001
# Shrt_Desc BUTTER WITH SALT
# Water_(g) 15.87
# Energ_Kcal 717
# Protein_(g) 0.85
# Lipid_Tot_(g) 81.11
# Ash_(g) 2.11
# Carbohydrt_(g) 0.06
# Fiber_TD_(g) 0
# Sugar_Tot_(g) 0.06
# Calcium_(mg) 24
# Iron_(mg) 0.02
# Magnesium_(mg) 2
# Phosphorus_(mg) 24
# Potassium_(mg) 24
# Sodium_(mg) 643
# Zinc_(mg) 0.09
# Copper_(mg) 0
# Manganese_(mg) 0
# Selenium_(mcg) 1
# Vit_C_(mg) 0
# Thiamin_(mg) 0.005
# Riboflavin_(mg) 0.034
# Niacin_(mg) 0.042
# Vit_B6_(mg) 0.003
# Vit_B12_(mcg) 0.17
# Vit_A_IU 2499
# Vit_A_RAE 684
# Vit_E_(mg) 2.32
# Vit_D_mcg 1.5
# Vit_D_IU 60
# Vit_K_(mcg) 7
# FA_Sat_(g) 51.368
# FA_Mono_(g) 21.021
# FA_Poly_(g) 3.043
# Cholestrl_(mg) 215
# Name: 0, dtype: object
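Note that loc looks rows up by index label, not by position. With the default index produced by read_csv the two coincide, which is why loc[0] returns the first row above. The sketch below uses an invented frame with a non-default index (labels 10, 20, 30) to make the difference visible, and contrasts loc with the position-based iloc indexer. All names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 20, 30 instead of 0, 1, 2.
demo = pd.DataFrame(
    {"item": ["a", "b", "c"], "kcal": [100, 200, 300]},
    index=[10, 20, 30],
)

by_label = demo.loc[10]     # the row whose index label is 10
by_position = demo.iloc[0]  # the first row, regardless of its label

print(by_label["kcal"])     # 100
print(by_position["kcal"])  # 100

# There is no row labeled 0 in this frame, so loc[0] raises KeyError.
try:
    demo.loc[0]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)  # True
```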
Slices can also be used here to get multiple rows:
print(food_info.loc[3:6]) # rows labeled 3, 4, 5, and 6 (loc slices include the end label)
print(food_info.loc[[2