Learning Python as a Beginner: pandas, Part 03. pandas in Depth: Basic Functionality

HPL__001

Category: python

Article link: https://blog.csdn.net/HPL__001/article/details/96148364

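pandas in Depth: Basic Functionality

1. Reading data files

1. The read_xxx family of functions in pandas reads data from a file into a DataFrame. The most commonly used one is read_csv, which reads delimited text data.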

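(1)
	import pandas as pd
	# data01.txt has no header row, so pandas assigns integer column labels
	df01 = pd.read_csv('data01.txt', sep=',', header=None)
	print(df01)
	# data01.csv starts with a header row, which becomes the column names
	df02 = pd.read_csv('data01.csv')
	print(df02)

Output: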

	       0   1  2      3
	0    Tom  25  1   8000
	1    Jim  28  2   9000
	2   Jack  21  3  12000
	3    Bob  30  4  20000
	4  Susan  26  5  11000
	   name  num  salary
	0   Tom    1    8000
	1  Jack    2    9000
	2   Bob    3   10000
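
The same call can be tried without any file on disk. The following is a minimal sketch that reads comma-separated text from an in-memory string; the sample rows and the column names passed to names are illustrative assumptions, not the contents of the original data01.txt.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a header-less text file
raw = "Tom,25,1,8000\nJim,28,2,9000\nJack,21,3,12000\n"

# header=None tells pandas the first line is data, not column names;
# names= supplies the column labels in the same call
df = pd.read_csv(io.StringIO(raw), sep=',', header=None,
                 names=['name', 'age', 'num', 'salary'])
print(df)
print(df.dtypes)  # shows the type pandas inferred for each column
```

Passing names at read time avoids the separate step of assigning df.columns afterwards.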

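2. Filtering and selecting data

1. Indexing a DataFrame selects specific columns or rows and returns them as a new DataFrame, which is convenient for later statistical calculations.

(1)
	df01 = pd.read_csv('data01.txt', sep=',', header=None)
	columns = ['name', 'age', 'num', 'salary']
	df01.columns = columns      # replace the default integer column labels
	print(df01)
	df02 = df01[columns[2:]]    # a list of column names selects those columns
	print(df02)
	df03 = df01[1:4]            # a slice selects rows by position (end excluded)
	print(df03)

Output: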

	    name  age  num  salary
	0    Tom   25    1    8000
	1    Jim   28    2    9000
	2   Jack   21    3   12000
	3    Bob   30    4   20000
	4  Susan   26    5   11000
	   num  salary
	0    1    8000
	1    2    9000
	2    3   12000
	3    4   20000
	4    5   11000
	   name  age  num  salary
	1   Jim   28    2    9000
	2  Jack   21    3   12000
	3   Bob   30    4   20000
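
Beyond plain brackets, rows are often selected with a boolean condition or with loc/iloc. The sketch below rebuilds the table shown in the output above and is meant only as an illustration of those three selection styles.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Jack', 'Bob', 'Susan'],
    'age': [25, 28, 21, 30, 26],
    'num': [1, 2, 3, 4, 5],
    'salary': [8000, 9000, 12000, 20000, 11000],
})

# Boolean filter: keep rows whose salary is above 10000
high_paid = df[df['salary'] > 10000]
# loc selects by label (end label included): rows 1..3, two named columns
by_label = df.loc[1:3, ['name', 'salary']]
# iloc selects by position (end excluded): rows 1..3, first two columns
by_position = df.iloc[1:4, :2]

print(high_paid)
print(by_label)
print(by_position)
```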

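3. Handling missing values (NaN)

1. NaN values in a DataFrame/Series are usually handled either by dropping the affected rows/columns or by filling in a default value.
(1) dropna: filters out (drops) axis labels whose values contain missing data; the thresh parameter controls how much missing data is tolerated.
(2) notnull: the negation of isnull.
(3) isnull: returns an object of boolean values indicating which values are missing (NA).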

	import numpy as np
	import pandas as pd

	df01 = pd.DataFrame([
	    ['Tom', 26, 9000, 'M'],
	    ['Jack', 23, np.nan, 'M'],
	    ['Bob', 22, 10000, np.nan],
	    ['Susan', 28, 12000, 'W']
	])
	columns = ['name', 'age', 'salary', 'gender']
	df01.columns = columns
	print(df01)
	print('\n')
	print(df01.isnull())
	print('\n')
	print(df01.notnull())
	print(df01.dropna())    # by default, drops rows that contain any missing value
	print(df01.dropna(how='all'))   # drops only rows in which every value is missing
	print(df01.dropna(axis=1))  # drops columns that contain any missing value
	print(df01.dropna(axis=0))  # drops rows that contain any missing value

Output:

	    name  age   salary gender
	0    Tom   26   9000.0      M
	1   Jack   23      NaN      M
	2    Bob   22  10000.0    NaN
	3  Susan   28  12000.0      W

	    name    age  salary  gender
	0  False  False   False   False
	1  False  False    True   False
	2  False  False   False    True
	3  False  False   False   False
	
	   name   age  salary  gender
	0  True  True    True    True
	1  True  True   False    True
	2  True  True    True   False
	3  True  True    True    True
	
	    name  age   salary gender
	0    Tom   26   9000.0      M
	3  Susan   28  12000.0      W
	
	    name  age   salary gender
	0    Tom   26   9000.0      M
	1   Jack   23      NaN      M
	2    Bob   22  10000.0    NaN
	3  Susan   28  12000.0      W
	
	    name  age
	0    Tom   26
	1   Jack   23
	2    Bob   22
	3  Susan   28
	
	    name  age   salary gender
	0    Tom   26   9000.0      M
	3  Susan   28  12000.0      W
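
The tolerance mentioned for dropna is set with the thresh parameter, and subset limits the check to chosen columns. The sketch below shows both; the last row of the sample data is a hypothetical addition so that thresh has something to drop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['Tom', 26, 9000, 'M'],
     ['Jack', 23, np.nan, 'M'],
     ['Bob', 22, 10000, np.nan],
     [np.nan, np.nan, np.nan, 'W']],   # hypothetical row with one known value
    columns=['name', 'age', 'salary', 'gender'])

# thresh=3 keeps rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)
# subset restricts the missing-value check to the listed columns
has_salary = df.dropna(subset=['salary'])

print(at_least_three)
print(has_salary)
```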

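(4) fillna: fills missing data with a specified value, or by propagating neighboring values forward (ffill) or backward (bfill).

	df02 = pd.DataFrame(np.random.randn(7,3))
	# .ix was removed in pandas 1.0; .loc slices by label and includes the end label
	df02.loc[:4, 1] = np.nan
	df02.loc[:3, 2] = np.nan
	print(df02)
	df03 = df02.fillna(0)   # replace every NaN with 0
	print(df03)
	df04 = df02.fillna({1:0.5, 2:-1})   # a dict sets a different fill value per column
	print(df04)

Output:

	          0         1         2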
	0 -1.626751       NaN       NaN
	1  1.290152       NaN       NaN
	2  0.316259       NaN       NaN
	3  0.718937       NaN       NaN
	4 -0.482139       NaN -0.033808
	5 -0.145318 -0.045726 -1.315794
	6  1.184503 -0.950017  0.111288
	
	          0         1         2
	0 -1.626751  0.000000  0.000000
	1  1.290152  0.000000  0.000000
	2  0.316259  0.000000  0.000000
	3  0.718937  0.000000  0.000000
	4 -0.482139  0.000000 -0.033808
	5 -0.145318 -0.045726 -1.315794
	6  1.184503 -0.950017  0.111288
	
	          0         1         2
	0 -1.626751  0.500000 -1.000000
	1  1.290152  0.500000 -1.000000
	2  0.316259  0.500000 -1.000000
	3  0.718937  0.500000 -1.000000
	4 -0.482139  0.500000 -0.033808
	5 -0.145318 -0.045726 -1.315794
	6  1.184503 -0.950017  0.111288
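
The forward and backward fills mentioned under fillna are not exercised in the example above. A minimal sketch on a small hand-made Series (the values are illustrative) could look like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()               # carry the last valid value forward
backward = s.bfill()              # pull the next valid value backward
mean_filled = s.fillna(s.mean())  # fill with the mean of the known values

print(forward.tolist())
print(backward.tolist())
print(mean_filled.tolist())
```

Note that bfill leaves the final NaN in place, because no later valid value exists to copy from.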

4. Common mathematical and statistical methods

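(1) count(): by default counts the non-missing values in each column; axis=1 counts across each row instead.

	# column labels: 语文 = Chinese, 数学 = Math, 英语 = English
	df01 = pd.DataFrame({
	    '语文' : [86,88,91,79],
	    '数学' : [100,95,86,96],
	    '英语' : [79,84,85,93]
	},index=('Tom', 'Jack', 'Bob','Jim'))
	print(df01)
	print(df01.count())

Output: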

	      语文   数学  英语
	Tom   86  100  79
	Jack  88   95  84
	Bob   91   86  85
	Jim   79   96  93
	语文    4
	数学    4
	英语    4
	dtype: int64
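
Because the table above has no missing values, every count is simply the number of rows. The sketch below uses a small hypothetical table with gaps to show that count() skips NaN and that axis=1 counts along rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [4.0, 5.0, np.nan],
})

per_column = df.count()       # non-missing values in each column
per_row = df.count(axis=1)    # non-missing values in each row

print(per_column)
print(per_row)
```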

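(2) describe() and sum()

	# column labels: 语文 = Chinese, 数学 = Math, 英语 = English
	df01 = pd.DataFrame({
	    '语文' : [86,88,91,79],
	    '数学' : [100,95,86,96],
	    '英语' : [79,84,85,93]
	},index=('Tom', 'Jack', 'Bob','Jim'))
	print(df01)
	print(df01.describe())      # count, mean, std, min, quartiles and max per column
	print(df01.sum())           # sums each column by default
	print(df01.sum(axis = 1))   # axis=1 sums each row; axis=0 sums each column

Output: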

	      语文    数学    英语
	Tom   86     100       79
	Jack  88      95       84
	Bob   91      86       85
	Jim   79      96       93
	
	        语文          数学         英语
	count   4.00000    4.000000   4.000000
	mean   86.00000   94.250000  85.250000
	std     5.09902    5.909033   5.795113
	min    79.00000   86.000000  79.000000
	25%    84.25000   92.750000  82.750000
	50%    87.00000   95.500000  84.500000
	75%    88.75000   97.000000  87.000000
	max    91.00000  100.000000  93.000000
	
	语文    344
	数学    377
	英语    341
	dtype: int64
	
	Tom     265
	Jack    267
	Bob     262
	Jim     268
	dtype: int64

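(3) quantile() and median(): median() returns the median, the middle value once the data is sorted. quantile(q) generalizes median(), with 0 <= q <= 1. q=0.5 (the default) gives the median. With the five rows used here, q=0.75 picks the fourth value in ascending order, because the five sorted values sit at quantile positions 0, 0.25, 0.5, 0.75 and 1.

	df01 = pd.DataFrame(np.random.randint(1,10,size=(5,3)))
	print(df01)
	print(df01.median())
	print(df01.quantile())      # q defaults to 0.5, the median
	print(df01.quantile(0.75))

Output: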

	   0  1  2
	0  3  7  1
	1  3  1  2
	2  4  5  1
	3  3  7  5
	4  6  7  3
	
	0    3.0
	1    7.0
	2    2.0
	dtype: float64
	
	0    3.0
	1    7.0
	2    2.0
	Name: 0.5, dtype: float64
	
	0    4.0
	1    7.0
	2    3.0
	Name: 0.75, dtype: float64
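
When the requested quantile falls between two data points, pandas interpolates linearly by default. The sketch below uses a hypothetical four-value Series, where the median lies between the two middle values, to show that behavior and the interpolation parameter.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# With 4 values the 0.5 position falls between 2 and 3,
# so the default linear interpolation returns their midpoint
mid = s.quantile(0.5)
# interpolation='lower' returns the lower of the two neighboring data values
lower = s.quantile(0.5, interpolation='lower')
# A list of q values returns a Series indexed by q
quartiles = s.quantile([0.25, 0.75])

print(mid, s.median())
print(lower)
print(quartiles)
```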

(4) cumprod() and cumsum()

	df01 = pd.DataFrame(np.random.randint(1,10,size=(5,3)))
	print(df01)
	print(df01.cumsum())		# running sum down each column
	print(df01.cumprod())		# running product down each column

Output:

	   0  1  2
	0  1  2  3
	1  2  3  5
	2  6  7  4
	3  2  6  6
	4  2  3  6
	
	    0   1   2
	0   1   2   3
	1   3   5   8
	2   9  12  12
	3  11  18  18
	4  13  21  24
	
	    0    1     2
	0   1    2     3
	1   2    6    15
	2  12   42    60
	3  24  252   360
	4  48  756  2160

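(5) pct_change(): each value minus the previous one, divided by the previous one (for example, 0.142857 = (8 - 7) / 7). The first row has no previous value, so it is NaN.

	df01 = pd.DataFrame(np.random.randint(1,10,size=(5,3)))
	print(df01)
	print(df01.pct_change())

Output: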

	   0  1  2
	0  7  6  1
	1  8  2  9
	2  2  3  2
	3  7  7  7
	4  1  3  3
	
	          0         1         2
	0       NaN       NaN       NaN
	1  0.142857 -0.666667  8.000000
	2 -0.750000  0.500000 -0.777778
	3  2.500000  1.333333  2.500000
	4 -0.857143 -0.571429 -0.571429
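
The definition of pct_change can be checked by hand with shift(). This sketch reuses column 0 of the output above and compares a manual calculation with the built-in method.

```python
import pandas as pd

s = pd.Series([7, 8, 2, 7, 1])   # column 0 of the output above

# shift(1) moves every value down one row, so it holds "the previous value"
previous = s.shift(1)
manual = (s - previous) / previous
builtin = s.pct_change()

print(pd.DataFrame({'manual': manual, 'builtin': builtin}))
```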

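5. Correlation coefficient and covariance

1. Correlation coefficient: measures how strongly two samples/variables are linearly related. It is the covariance made dimensionless, that is, standardized by dividing by the two standard deviations.

2. Covariance (COV): measures how two samples/variables vary together. Its sign gives the direction of the relationship, but its magnitude depends on the units of the data.

	df01 = pd.DataFrame({
	    'GDP' : [12,23,34,45,56],
	    'air_temperature' : [23,25,26,27,30],
	    'year' : ['2001','2002','2003','2004','2005']
	})
	# 'year' holds strings, so limit the calculation to numeric columns
	# (the numeric_only parameter is available in pandas 1.5 and later)
	print('Correlation:\n', df01.corr(numeric_only=True))
	print('Covariance:\n', df01.cov(numeric_only=True))
	print(df01['GDP'].corr(df01['air_temperature']))
	print(df01['GDP'].cov(df01['air_temperature']))

Output:

	Correlation:
	                       GDP  air_temperature
	GDP              1.000000         0.977356
	air_temperature  0.977356         1.000000
	
	Covariance: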
	                    GDP  air_temperature
	GDP              302.5             44.0
	air_temperature   44.0              6.7
	
	0.9773555548504418
	
	44.0
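
The relationship between the two measures can be verified numerically: dividing the covariance by the product of the two standard deviations should reproduce the correlation coefficient. A minimal sketch using the GDP and air_temperature values from the example:

```python
import pandas as pd

gdp = pd.Series([12, 23, 34, 45, 56])
temp = pd.Series([23, 25, 26, 27, 30])

cov = gdp.cov(temp)                            # sample covariance (divides by n - 1)
corr_manual = cov / (gdp.std() * temp.std())   # std() also divides by n - 1
corr_builtin = gdp.corr(temp)                  # Pearson correlation by default

print(cov)
print(corr_manual)
print(corr_builtin)
```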

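6. Unique values, value counts, and membership

1. The unique method returns an array of the distinct values in a Series (the data with duplicates removed).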

	ser = pd.Series(['a', 'b', 'c', 'c', 'b'])
	print(ser)
	print(ser.unique())
	df = pd.DataFrame({
	    'Tom' : [2, 6, 8, 9, 2, 6],
	    'Jack' : [3, 3, 5, 6, 5, 6]
	})
	print(df)
	print(df['Jack'].unique())  # unique works on one-dimensional data, so select a single column of a DataFrame first

Output:

	0    a
	1    b
	2    c
	3    c
	4    b
	dtype: object
	
	['a' 'b' 'c']

	   Tom  Jack
	0    2     3
	1    6     3
	2    8     5
	3    9     6
	4    2     5
	5    6     6
	
	[3 5 6]

2. The value_counts method counts how often each value occurs in a Series.

	ser02 = pd.Series(['a', 'b', 'c', 'c', 'b', 'a', 'b', 'c', 'c',])
	print(ser02)
	print(ser02.value_counts())

Output:

	0    a
	1    b
	2    c
	3    c
	4    b
	5    a
	6    b
	7    c
	8    c
	dtype: object
	
	c    4
	b    3
	a    2
	dtype: int64
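
Counts can also be expressed as relative frequencies. The sketch below reuses the Series from the example and passes normalize=True, a standard value_counts parameter.

```python
import pandas as pd

ser = pd.Series(['a', 'b', 'c', 'c', 'b', 'a', 'b', 'c', 'c'])

counts = ser.value_counts()                 # sorted by count, largest first
shares = ser.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```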

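3. The isin method performs a vectorized set-membership test. The resulting boolean mask can be used to select a subset of a Series or of a DataFrame column.

	ser02 = pd.Series(['a', 'b', 'c', 'c', 'b', 'a', 'b', 'c', 'c',])
	print(ser02)
	mask = ser02.isin(['b','c'])
	print(mask)
	print(ser02[mask])		# keeps only the elements that are 'b' or 'c'

Output: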

	0    a
	1    b
	2    c
	3    c
	4    b
	5    a
	6    b
	7    c
	8    c
	dtype: object
	
	0    False
	1     True
	2     True
	3     True
	4     True
	5    False
	6     True
	7     True
	8     True
	dtype: bool
	
	1    b
	2    c
	3    c
	4    b
	6    b
	7    c
	8    c
	dtype: object
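
The same mask works on a DataFrame column, and the ~ operator inverts it to select everything outside the set. A short sketch using the Tom/Jack table from the unique example:

```python
import pandas as pd

df = pd.DataFrame({
    'Tom': [2, 6, 8, 9, 2, 6],
    'Jack': [3, 3, 5, 6, 5, 6],
})

mask = df['Jack'].isin([3, 6])
selected = df[mask]     # rows where Jack is 3 or 6
excluded = df[~mask]    # ~ inverts the mask: rows where Jack is neither

print(selected)
print(excluded)
```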

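7. Hierarchical indexing

1. A hierarchical index has multiple (two or more) index levels along one axis.
2. With hierarchical indexing, pandas can handle higher-dimensional data in a lower-dimensional form.
3. Hierarchical indexing also makes it possible to aggregate data level by level.
4. Both Series and DataFrame support hierarchical indexes.

(1) Example 1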

	# inner labels: 苹果 = apple, 香蕉 = banana, 西瓜 = watermelon
	data = pd.Series([988.23, 95862, 3694, 45987, 1989],
	                 index=[
	                     ['2001', '2001', '2001', '2002','2002'],
	                     ['苹果','香蕉','西瓜','苹果','香蕉']
	                 ]
	                 )
	print(data)

Output:

	2001  苹果      988.23
	      香蕉    95862.00
	      西瓜     3694.00
	2002  苹果    45987.00
	      香蕉     1989.00
	dtype: float64
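
Once a Series has two index levels, values can be selected by either level or reshaped into a table. The sketch below mirrors the data above but, as an assumption for readability, replaces the Chinese fruit labels with their English translations.

```python
import pandas as pd

data = pd.Series([988.23, 95862, 3694, 45987, 1989],
                 index=[['2001', '2001', '2001', '2002', '2002'],
                        ['apple', 'banana', 'watermelon', 'apple', 'banana']])

year_2001 = data.loc['2001']                # everything under the outer label '2001'
one_value = data.loc[('2002', 'banana')]    # a single value addressed by both levels
apples = data.loc[:, 'apple']               # inner label 'apple' across all years
wide = data.unstack()                       # inner level becomes columns; missing pairs are NaN

print(year_2001)
print(one_value)
print(apples)
print(wide)
```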

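(2) Example 2

	df = pd.DataFrame({
	    'year': [2001,2001,2002,2002,2003],
	    'fruit': ['apple','banana','apple','banana','apple'],
	    'production': [2345, 3124, 5668, 2535, 2136],
	    'profits': [233.14,4452.6,1225.3,7845.3,2365.9]
	})
	print(df)
	df01 = df.set_index(['year','fruit'])
	print(df01)
	# .ix was removed in pandas 1.0; use .loc with a tuple of labels.
	# df.loc[(2002, 'apple')] would fail: df itself is unchanged and still has the default index 0, 1, 2, 3, 4
	print(df01.loc[(2002, 'apple')])
	# sum(level=...) and mean(level=...) were removed in pandas 2.0; group by the index level instead
	print(df01.groupby(level='year').sum())
	print(df01.groupby(level='fruit').mean())

Output: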

	   year   fruit  production  profits
	0  2001   apple        2345   233.14
	1  2001  banana        3124  4452.60
	2  2002   apple        5668  1225.30
	3  2002  banana        2535  7845.30
	4  2003   apple        2136  2365.90
	
	             production  profits
	year fruit                      
	2001 apple         2345   233.14
	     banana        3124  4452.60
	2002 apple         5668  1225.30
	     banana        2535  7845.30
	2003 apple         2136  2365.90
	
	production    5668.0
	profits       1225.3
	Name: (2002, apple), dtype: float64
	
	      production  profits
	year                     
	2001        5469  4685.74
	2002        8203  9070.60
	2003        2136  2365.90
	
	        production  profits
	fruit                      
	apple       3383.0  1274.78
	banana      2829.5  6148.95
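
A few other operations are commonly used with a hierarchical DataFrame index. The sketch below rebuilds the table from Example 2 and shows a cross-section with xs, several aggregates per level, and a return to ordinary columns. It is an illustration and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2001, 2002, 2002, 2003],
    'fruit': ['apple', 'banana', 'apple', 'banana', 'apple'],
    'production': [2345, 3124, 5668, 2535, 2136],
    'profits': [233.14, 4452.6, 1225.3, 7845.3, 2365.9],
})
df01 = df.set_index(['year', 'fruit'])

apples = df01.xs('apple', level='fruit')     # cross-section on the inner level
totals = df01.groupby(level='year').sum()    # aggregate over the outer level
stats = df01.groupby(level='fruit')['profits'].agg(['mean', 'max'])
flat = df01.reset_index()                    # index levels become columns again

print(apples)
print(totals)
print(stats)
print(flat)
```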