Essential Skills for Data Analysis and Data Science: Using Pandas (Part 1)

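This post summarizes commonly used Pandas functions and methods. It is split into two parts, and the content may be extended later.
This part covers sections 1 through 7; the next part covers sections 8 through 15.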

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import math

1. Pandas data structures

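Data structure   Dimensions   Description
Series           1            One-dimensional labeled array with homogeneous data; values are mutable, size is immutable
DataFrame        2            Two-dimensional table of rows and columns with heterogeneous data; values and size are both mutable
Panel            3            Three-dimensional structure with heterogeneous data; values and size are both mutable (deprecated in pandas 0.20 and removed in 0.25)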

2. Creating Pandas objects

2.1 Creating a pd.Series object

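Constructor: pd.Series(data, index, dtype, copy)

Parameter   Description
data        The data, in one of several forms: ndarray, list, scalar constant, or dict
index       Index labels; they must be hashable and match the length of the data. If no index is passed, it defaults to np.arange(n)
dtype       Data type; if omitted, it is inferred from the data
copy        Whether to copy the input data; defaults to False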
# Create simple Series
# Create an empty Series
pd.Series()
# Create a Series from an ndarray
data=np.array(['a','s','d','f'])
pd.Series(data)
pd.Series(data,index=[101,102,103,104])
# Create a Series from a dict
data = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(data)
pd.Series(data,index=['b','c','d','a']) # 'd' is not a key of data, so its value is NaN
# Create a Series from a scalar
pd.Series(5, index=['a','b','c','d'])

Series([], dtype: float64)
0 a
1 s
2 d
3 f
dtype: object

101 a
102 s
103 d
104 f
dtype: object

a 0.0
b 1.0
c 2.0
dtype: float64

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

a 5
b 5
c 5
d 5
dtype: int64

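# Access data in a Series (similar to accessing elements of a NumPy array)
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s[0] # same as s['a']; recent pandas versions deprecate positional s[0] on a labeled index, so s.iloc[0] is safer
s[:3] # same as s[['a','b','c']]
s[3:]
s[1:3]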

1

a 1
b 2
c 3
dtype: int64

d 4
e 5
dtype: int64

b 2
c 3
dtype: int64
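
The positional and label-based lookups above can also be written with explicit indexers. The following is a minimal sketch, assuming a reasonably recent pandas version; the variable names are only illustrative. It shows that .iloc always works by position and .loc always by label, and that a label slice includes its end label while a position slice does not.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first_by_position = s.iloc[0]    # element at position 0
first_by_label = s.loc['a']      # element with label 'a'
head_by_position = s.iloc[:3]    # positions 0, 1, 2 (end excluded)
head_by_label = s.loc['a':'c']   # labels 'a' through 'c' (end included)

print(first_by_position, first_by_label)
print(head_by_position.equals(head_by_label))
```

Using the explicit indexers avoids any ambiguity about whether an integer key means a position or a label.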

2.2 Creating a pd.DataFrame object

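Constructor: pd.DataFrame(data, index, columns, dtype, copy)

Parameter   Description
data        The data, in one of several forms: ndarray, Series, map, list, dict, constant, or another DataFrame
index       Row labels for the resulting frame; optional, defaults to np.arange(n) if no index is passed
columns     Column labels; optional, defaults to np.arange(n) if no column labels are passed
dtype       Data type of each column
copy        Whether to copy the input data; defaults to False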
# Create simple DataFrames
# Create an empty DataFrame
pd.DataFrame()
# Create a DataFrame from lists
data = [1,2,3]
pd.DataFrame(data)
data = [['Alex',10],['Bob',12],['Clarke',13]]
pd.DataFrame(data,columns=['Name','Age'],dtype=float)
   0
0  1
1  2
2  3

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
# Create a DataFrame from a dict of ndarrays/lists
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
pd.DataFrame(data)
pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky

       Age   Name
rank1   28    Tom
rank2   34   Jack
rank3   29  Steve
rank4   42  Ricky
# Create a DataFrame from a list of dicts
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data,index=['first', 'second'])
pd.DataFrame(data, index=['first', 'second'], columns=list('ab'))
pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1']) # columns can be written either way; 'b1' is not a key in data, so that column is all NaN
        a   b     c
first   1   2   NaN
second  5  10  20.0

        a   b
first   1   2
second  5  10

        a  b1
first   1 NaN
second  5 NaN
# Create a DataFrame from a dict of Series
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=list('abcd'))} # index can be written either way
pd.DataFrame(d)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

2.2.1 Indexing and selecting data in a DataFrame

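Target         Methods
Rows/columns   df[]
Region         df.loc[], df.iloc[], df.ix[]
Single cell    df.at[], df.iat[]

Method   Description
loc[]    Selects by row index label or column name
iloc[]   Selects by the integer position of the row/column
at[]     Fast access to a single element by row label and column label
iat[]    Like at[], but addresses the element by integer position
ix[]     A hybrid of loc and iloc that accepts both labels and positions (deprecated in pandas 0.20 and removed in 1.0)

Row/column selection   Dimensions        Axis      Accepted key types
df[]                   one-dimensional   rows      integer slice, label slice, <boolean array>
                                         columns   single label, list of labels, callable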
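# Examples of df[]
df = pd.DataFrame(np.random.randn(6,4), index=list('abcdef'), columns=list('ABCD'))
df[:3] # integer slice (rows)
# df[0] and df['a'] raise KeyError: a single key is looked up among the columns, so rows need a slice
df['a':'c'] # label slice (rows, end label included)
# df[['a','c']] raises KeyError: a list of labels is looked up among the columns, so rows need a slice
df[[True,True,True,False,False,False]] # first three rows (the boolean array length must equal the number of rows)
df[df['A']>0] # rows where column A is greater than 0
df[(df['A']>0) & (df['C']>0)] # rows where both column A and column C are greater than 0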
          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
b  0.639703  0.982318 -0.785517  0.708336
e  1.199448  0.555940  1.222145 -1.744408
f  0.623923 -0.388683  0.273779  0.468964

          A         B         C         D
e  1.199448  0.555940  1.222145 -1.744408
f  0.623923 -0.388683  0.273779  0.468964
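df['A'] # single column label, same as df.A
# df[0:3,'A'] and df['a':'c','A'] are errors: df[] takes one key, not a (row, column) pair; use df.loc or df.iloc for that
df[['A','C']] # list of column labels
df[lambda df: df.columns[0]] # callable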

a -0.897015
b 0.639703
c -0.402066
d -1.296602
e 1.199448
f 0.623923
Name: A, dtype: float64

          A         C
a -0.897015 -0.589268
b  0.639703 -0.785517
c -0.402066 -0.511191
d -1.296602  0.488359
e  1.199448  1.222145
f  0.623923  0.273779

a -0.897015
b 0.639703
c -0.402066
d -1.296602
e 1.199448
f 0.623923
Name: A, dtype: float64

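Label-based region selection   Dimensions        Axis      Accepted key types
df.loc[]                       two-dimensional   rows      single label, label slice, list of labels, <boolean array>, callable
                                                 columns   single label, label slice, list of labels, <boolean array>, callable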
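# Examples of df.loc[]
df.loc['a', :] # single row label; same as df.loc['a'] or df.iloc[0]
df.loc['a':'d', :] # row label slice (end label included); same as df.loc['a':'d']
df.loc[['a','b','c'], :] # list of row labels
df.loc[[True,True,True,False,False,False], :] # first three rows (the boolean array length must equal the number of rows)
df.loc[df.A<0, :] # rows where column A is less than 0
df.loc[df.loc[:,'A']>0, :] # rows where column A is greater than 0
df.loc[df.iloc[:,0]>0, :] # the same, with column A taken by position
df.loc[lambda _df: _df.A > 0, :] # the same, using a callable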

A -0.897015
B -1.883345
C -0.589268
D 1.215273
Name: a, dtype: float64

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571
d -1.296602 -0.446351  0.488359  1.219049

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
c -0.402066  0.054401 -0.511191  1.703571
d -1.296602 -0.446351  0.488359  1.219049

          A         B         C         D
b  0.639703  0.982318 -0.785517  0.708336
e  1.199448  0.555940  1.222145 -1.744408
f  0.623923 -0.388683  0.273779  0.468964

          A         B         C         D
b  0.639703  0.982318 -0.785517  0.708336
e  1.199448  0.555940  1.222145 -1.744408
f  0.623923 -0.388683  0.273779  0.468964

          A         B         C         D
b  0.639703  0.982318 -0.785517  0.708336
e  1.199448  0.555940  1.222145 -1.744408
f  0.623923 -0.388683  0.273779  0.468964
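df.loc[:, 'A'] # single column label
df.loc[:, 'A':'C'] # column label slice
df.loc[:, ['A','B','C']]  # list of column labels
df.loc[:, [True,True,True,False]] # first three columns (the boolean array length must equal the number of columns)
df.loc[:, df.loc['a']>0] # columns where row 'a' is greater than 0
df.loc[:, df.iloc[0]>0] # columns where the row at position 0 is greater than 0
df.loc[:, lambda _df: ['A', 'B']] # callable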

a -0.897015
b 0.639703
c -0.402066
d -1.296602
e 1.199448
f 0.623923
Name: A, dtype: float64

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          D
a  1.215273
b  0.708336
c  1.703571
d  1.219049
e -1.744408
f  0.468964

          D
a  1.215273
b  0.708336
c  1.703571
d  1.219049
e -1.744408
f  0.468964

          A         B
a -0.897015 -1.883345
b  0.639703  0.982318
c -0.402066  0.054401
d -1.296602 -0.446351
e  1.199448  0.555940
f  0.623923 -0.388683
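df.A.loc[lambda s: s > 0] # callable on a Series: the elements of column A greater than 0
df.loc[['a','d'], ['A','B']] # lists of row and column labels
df.loc['a':'c', 'A':'C'] # slices of row and column labels
df.loc['a':'c', ['A','B']] # mixed
df.loc['a','A'] # a single scalar element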

b 0.639703
e 1.199448
f 0.623923
Name: A, dtype: float64

          A         B
a -0.897015 -1.883345
d -1.296602 -0.446351

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191

          A         B
a -0.897015 -1.883345
b  0.639703  0.982318
c -0.402066  0.054401

-0.8970154528725585

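Position-based selection   Dimensions        Axis      Accepted key types
df.iloc[]                  two-dimensional   rows      single integer, integer slice, list of integers, <boolean array>, callable
                                             columns   single integer, integer slice, list of integers, <boolean array>, callable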
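# Examples of df.iloc[]
df.iloc[3, :] # single row position; same as df.iloc[3]
df.iloc[0:3, :] # row position slice; same as df.iloc[0:3]
df.iloc[[0,2,4], :] # list of row positions
df.iloc[[True,True,True,False,False,False], :] # first three rows (the boolean array length must equal the number of rows)
#df.iloc[df['A']>0, :] # fails: iloc is purely positional and rejects a boolean Series, which carries labels
#df.iloc[df.loc[:,'A']>0, :] # fails for the same reason
#df.iloc[df.iloc[:,0]>0, :] # fails for the same reason; pass a plain array instead, e.g. (df['A']>0).to_numpy()
df.iloc[lambda _df: [0, 1], :] # callable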

A -1.296602
B -0.446351
C 0.488359
D 1.219049
Name: d, dtype: float64

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
c -0.402066  0.054401 -0.511191  1.703571
e  1.199448  0.555940  1.222145 -1.744408

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
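df.iloc[:, 1] # single column position
df.iloc[:, 0:3] # column position slice
df.iloc[:, [0,1,2]] # list of column positions
df.iloc[:, [True,True,True,False]] # first three columns (the boolean array length must equal the number of columns)
#df.iloc[:, df.loc['a']>0] # fails: iloc rejects a boolean Series; convert it with .to_numpy() first
#df.iloc[:, df.iloc[0]>0] # fails for the same reason
df.iloc[:, lambda _df: [0, 1]] # callable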

a -1.883345
b 0.982318
c 0.054401
d -0.446351
e 0.555940
f -0.388683
Name: B, dtype: float64

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B
a -0.897015 -1.883345
b  0.639703  0.982318
c -0.402066  0.054401
d -1.296602 -0.446351
e  1.199448  0.555940
f  0.623923 -0.388683
df.iloc[[0,1], [0,1,2]] # lists of row and column positions
df.iloc[1:3, 0:3] # slices of row and column positions
df.iloc[[0,1], 0:3] # mixed
df.iloc[1,3] # a single scalar value
          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517

          A         B         C
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517

0.7083361725183475
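
The commented-out iloc calls above fail because iloc works purely by position, while a boolean Series carries an index of labels. A minimal sketch of the usual workaround follows, assuming a small hand-made frame (the values are illustrative and are not the random data used above): convert the mask to a plain NumPy array before passing it to iloc.

```python
import pandas as pd

df = pd.DataFrame(
    [[-1.0, 2.0], [3.0, -4.0], [5.0, 6.0]],
    index=list('abc'),
    columns=['A', 'B'],
)

mask = df['A'] > 0    # boolean Series, indexed by the labels 'a', 'b', 'c'

by_label = df.loc[mask, :]                  # loc aligns the mask on labels
by_position = df.iloc[mask.to_numpy(), :]   # iloc needs a plain boolean array

# Passing the Series itself to iloc raises an error
try:
    df.iloc[mask, :]
    iloc_accepts_series = True
except (NotImplementedError, ValueError):
    iloc_accepts_series = False

print(by_position)
print(iloc_accepts_series)
```

Both selections return rows 'b' and 'c'; only the way the mask is interpreted differs.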

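Mixed selection   Dimensions        Axis      Accepted key types
df.ix[]           two-dimensional   rows      label/integer index, label/integer slice, label/integer list, <boolean array>, callable
                                    columns   label/integer index, label/integer slice, label/integer list, <boolean array>, callable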
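# Examples of df.ix[] (deprecated; ix was removed in pandas 1.0, so these only run on older versions)
df.ix[0, :] # single row position
df.ix[0:3, :] # row position slice
df.ix[[0,1,2], :] # list of row positions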

D:\Program Files\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

A -0.897015
B -1.883345
C -0.589268
D 1.215273
Name: a, dtype: float64

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571
df.ix['a', :] # single row label
df.ix['a':'c', :] # row label slice
df.ix[['a','b','c'], :] # list of row labels

A -0.897015
B -1.883345
C -0.589268
D 1.215273
Name: a, dtype: float64

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571

          A         B         C         D
a -0.897015 -1.883345 -0.589268  1.215273
b  0.639703  0.982318 -0.785517  0.708336
c -0.402066  0.054401 -0.511191  1.703571
df.ix[:, 0] # single column position
df.ix[:, 0:3]
df.ix[:, [0,1,2]]

a -0.897015
b 0.639703
c -0.402066
d -1.296602
e 1.199448
f 0.623923
Name: A, dtype: float64

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779
df.ix[:, 'A'] # single column label
df.ix[:, 'A':'C']
df.ix[:, ['A','B','C']]

a -0.897015
b 0.639703
c -0.402066
d -1.296602
e 1.199448
f 0.623923
Name: A, dtype: float64

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359
e  1.199448  0.555940  1.222145
f  0.623923 -0.388683  0.273779
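df.ix[[0,1,2], 'A':'B'] # list of row positions, column label slice
df.ix[0:3, ['A','B','C']] # row position slice, list of column labels
df.ix['a':'d', 0:3] # row label slice, column position slice
df.ix[['a','b','c'], 'A':'B'] # list of row labels, column label slice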
          A         B
a -0.897015 -1.883345
b  0.639703  0.982318
c -0.402066  0.054401

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191

          A         B         C
a -0.897015 -1.883345 -0.589268
b  0.639703  0.982318 -0.785517
c -0.402066  0.054401 -0.511191
d -1.296602 -0.446351  0.488359

          A         B
a -0.897015 -1.883345
b  0.639703  0.982318
c -0.402066  0.054401
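
Since ix is no longer available in current pandas, a mixed selection such as row positions combined with column labels has to be expressed through loc or iloc. The sketch below shows two possible translations, assuming a small frame built with np.arange (illustrative values, not the random data above): either convert the positions to labels and use loc, or convert the labels to positions and use iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=list('abcdef'),
    columns=list('ABCD'),
)

# Old style: df.ix[[0, 1, 2], 'A':'B']  (row positions, column labels)

# Option 1: translate the row positions to labels, then use loc
rows_as_labels = df.index[[0, 1, 2]]
via_loc = df.loc[rows_as_labels, 'A':'B']

# Option 2: translate the column labels to positions, then use iloc
col_positions = df.columns.get_indexer(['A', 'B'])
via_iloc = df.iloc[[0, 1, 2], col_positions]

print(via_loc.equals(via_iloc))
```

Which option reads better depends on whether the surrounding code thinks in labels or in positions.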
Cell selection   Purpose                    Axis      Accepted key types
df.at[]          locates exactly one cell   rows      single label
                                            columns   single label
# Example of df.at[]
df.at['a', 'A']
-0.8970154528725585
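Cell selection   Purpose                    Axis      Accepted key types
df.iat[]         locates exactly one cell   rows      single integer position
                                            columns   single integer position
# Example of df.iat[]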
df.iat[0, 0]

-0.8970154528725585

2.3 Creating a pd.Panel object

Constructor: pd.Panel(data, items, major_axis, minor_axis, dtype, copy)

Parameter    Description
data         The data, in one of several forms: ndarray, Series, map, list, dict, constant, or another DataFrame
items        axis=0
major_axis   axis=1
minor_axis   axis=2
dtype        Data type of each column
copy         Whether to copy the input data; defaults to False
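# Create a Panel (Panel was removed in pandas 0.25, so this only runs on older versions)
# Create from a 3D ndarray
data = np.random.rand(2,4,5)
pd.Panel(data)
# Create a Panel from a dict of DataFrame objects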
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
data
pd.Panel(data)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

{'Item1':           0         1         2
0 -0.434674 -1.516325 -1.090963
1  1.343774 -0.143883 -0.213383
2  0.349254  1.666228 -0.654386
3 -1.416395 -0.913377 -0.475773, 'Item2':           0         1
0  0.643990  0.304934
1 -0.456729  0.210168
2 -1.620677 -0.479168
3  1.265655 -1.812357}

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
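
On pandas versions where Panel no longer exists, one common substitute is a DataFrame with a two-level row index. The following is a minimal sketch of that idea rather than a drop-in replacement; the level names 'item' and 'row' are illustrative assumptions. pd.concat on a dict of frames uses the dict keys as the outer index level.

```python
import numpy as np
import pandas as pd

frames = {
    'Item1': pd.DataFrame(np.arange(12.0).reshape(4, 3)),
    'Item2': pd.DataFrame(np.arange(8.0).reshape(4, 2)),
}

# Stack the frames; the dict keys become the outer level of a row MultiIndex
stacked = pd.concat(frames, names=['item', 'row'])

# Recover one "item" as an ordinary two-dimensional frame
item1 = stacked.loc['Item1']
item2 = stacked.loc['Item2']

print(stacked)
```

As with the Panel example above, the narrower frame is padded with NaN in the column it does not have.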

3. Basic Pandas attributes

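Attribute or method       Description
dtypes                    Returns the data type (dtype) of each column
shape                     Returns the size of the DataFrame as (rows, columns)
head()                    Returns the first n rows (5 by default)
tail()                    Returns the last n rows (5 by default)
values                    Returns the underlying data as an ndarray
index                     Returns the row index
columns                   Returns the column names
axes                      Returns a list of the axis labels (row index and column labels)
empty                     Returns True if the object is empty
ndim                      Returns the number of dimensions of the underlying data (1 for a Series, 2 for a DataFrame)
size                      Returns the number of elements in the underlying data
df.T                      Transposes the DataFrame
sort_index()              Sorts by the index labels
sort_values(by='listA')   Sorts by values: rows are ordered by column listA, and the other columns follow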
# Examples of the pandas attributes above
# df = pd.DataFrame({'A':np.random.randn(24)
#                   ,'B': ['A', 'B', 'C'] * 8
#                   ,'C': ['Female', 'Male', 'Male', 'Male']*6}
#                   , index=pd.date_range('20170101', periods=24))
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(d)
df.dtypes
df.shape
df.head() # df.head(7)
df.tail()

Age int64
Name object
Rating float64
dtype: object

(7, 3)
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20

   Age   Name  Rating
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Minsu    4.60
6   23   Jack    3.80
df.values
df.index
df.columns
df.axes
df.empty

array([[25, 'Tom', 4.23],
       [26, 'James', 3.24],
       [25, 'Ricky', 3.98],
       [23, 'Vin', 2.56],
       [30, 'Steve', 3.2],
       [29, 'Minsu', 4.6],
       [23, 'Jack', 3.8]], dtype=object)

RangeIndex(start=0, stop=7, step=1)

Index(['Age', 'Name', 'Rating'], dtype='object')

[RangeIndex(start=0, stop=7, step=1),
 Index(['Age', 'Name', 'Rating'], dtype='object')]

False

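df.ndim
df.size
df[['Age','Rating']].describe # no parentheses, so this shows the bound method rather than the statistics; call .describe() to compute them
df.T
df.sort_index # no parentheses, so this also shows the bound method; df.sort_index() sorts in ascending order by default
df.sort_index(ascending=False) # descending order
# df.sort_index(axis=1) # sort by column labels
df.sort_values(by=['Age','Rating']) # sort by values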

2

21

<bound method NDFrame.describe of Age Rating
0 25 4.23
1 26 3.24
2 25 3.98
3 23 2.56
4 30 3.20
5 29 4.60
6 23 3.80>

           0      1      2     3      4      5     6
Age       25     26     25    23     30     29    23
Name     Tom  James  Ricky   Vin  Steve  Minsu  Jack
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8

<bound method DataFrame.sort_index of Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Minsu 4.60
6 23 Jack 3.80>

   Age   Name  Rating
6   23   Jack    3.80
5   29  Minsu    4.60
4   30  Steve    3.20
3   23    Vin    2.56
2   25  Ricky    3.98
1   26  James    3.24
0   25    Tom    4.23

   Age   Name  Rating
3   23    Vin    2.56
6   23   Jack    3.80
2   25  Ricky    3.98
0   25    Tom    4.23
1   26  James    3.24
5   29  Minsu    4.60
4   30  Steve    3.20
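
Two of the outputs above read "bound method ..." because describe and sort_index were referenced without parentheses. A minimal sketch of the called versions follows, using the same seven-row data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack'],
    'Age': [25, 26, 25, 23, 30, 29, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8],
})

not_called = df.sort_index    # a bound method; nothing is computed yet

summary = df[['Age', 'Rating']].describe()                  # DataFrame of statistics
by_index_desc = df.sort_index(ascending=False)              # rows 6, 5, ..., 0
by_age_then_rating = df.sort_values(by=['Age', 'Rating'])   # ties in Age broken by Rating

print(summary)
print(by_age_then_rating)
```

Sorting by ['Age', 'Rating'] orders the rows by Age first and uses Rating only to order rows that share the same Age.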

4. Descriptive statistics in Pandas

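Function    Description
count()     Number of non-null observations
sum()       Sum of the values
mean()      Mean of the values
median()    Median of the values
mode()      Most frequent value(s)
std()       Standard deviation of the values
min()       Minimum value
max()       Maximum value
abs()       Absolute value
prod()      Product of the values
cumsum()    Cumulative sum
cumprod()   Cumulative product

Function: describe(include=)
Description: summary statistics of the data; by default only numeric (float, int) columns are summarized
Values of include:
  number (the default behavior)   summarizes the numeric columns
  object                          summarizes the string columns
  all                             summarizes all columns together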
# Examples of the descriptive statistics functions
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
     'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
df=pd.DataFrame(d)
df.count()
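df.sum() # axis=0 by default (one result per column)
df.sum(axis=1) # one result per row; recent pandas versions may need numeric_only=True here because of the string column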

Age 12
Name 12
Rating 12
dtype: int64

Age 382
Name TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe…
Rating 44.92
dtype: object

0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64

df.mean() # on pandas 2.0 and later, pass numeric_only=True to skip the string column
df.median() # likewise

Age 31.833333
Rating 3.743333
dtype: float64

Age 29.50
Rating 3.79
dtype: float64

df.mode()
df.std()
df.max()
     Age    Name  Rating
0   23.0  Andres    2.56
1   25.0  Betina    2.98
2   30.0   David    3.20
3    NaN  Gasper    3.24
4    NaN    Jack    3.65
5    NaN   James    3.78
6    NaN     Lee    3.80
7    NaN   Minsu    3.98
8    NaN   Ricky    4.10
9    NaN   Steve    4.23
10   NaN     Tom    4.60
11   NaN     Vin    4.80

Age 9.232682
Rating 0.661628
dtype: float64

Age 51
Name Vin
Rating 4.8
dtype: object

df.prod()
df.cumsum()
df[['Age','Rating']].cumprod() # abs and cumprod only work on numeric columns

Age 7.158408e+17
Rating 6.320128e+06
dtype: float64

    Age                                               Name  Rating
0    25                                                Tom    4.23
1    51                                           TomJames    7.47
2    76                                      TomJamesRicky   11.45
3    99                                   TomJamesRickyVin   14.01
4   129                              TomJamesRickyVinSteve   17.21
5   158                         TomJamesRickyVinSteveMinsu   21.81
6   181                     TomJamesRickyVinSteveMinsuJack   25.61
7   215                  TomJamesRickyVinSteveMinsuJackLee   29.39
8   255             TomJamesRickyVinSteveMinsuJackLeeDavid   32.37
9   285       TomJamesRickyVinSteveMinsuJackLeeDavidGasper   37.17
10  336  TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...   41.27
11  382  TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...   44.92

             Age        Rating
0   2.500000e+01  4.230000e+00
1   6.500000e+02  1.370520e+01
2   1.625000e+04  5.454670e+01
3   3.737500e+05  1.396395e+02
4   1.121250e+07  4.468465e+02
5   3.251625e+08  2.055494e+03
6   7.478738e+09  7.810877e+03
7   2.542771e+11  2.952512e+04
8   1.017108e+13  8.798485e+04
9   3.051325e+14  4.223273e+05
10  1.556176e+16  1.731542e+06
11  7.158408e+17  6.320128e+06
df.describe()
df.describe(include=['object'])
df.describe(include='all')
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000

       Name
count    12
unique   12
top     Lee
freq      1

              Age Name     Rating
count   12.000000   12  12.000000
unique        NaN   12        NaN
top           NaN  Lee        NaN
freq          NaN    1        NaN
mean    31.833333  NaN   3.743333
std      9.232682  NaN   0.661628
min     23.000000  NaN   2.560000
25%     25.000000  NaN   3.230000
50%     29.500000  NaN   3.790000
75%     35.500000  NaN   4.132500
max     51.000000  NaN   4.800000
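
The mean and median outputs above silently skip the Name column, which is how older pandas versions behaved. On newer versions the numeric columns usually have to be chosen explicitly. The sketch below shows two ways to do that on a shortened, illustrative version of the data: the numeric_only flag, and select_dtypes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin'],
    'Age': [25, 26, 25, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56],
})

# Ask the reduction itself to ignore non-numeric columns
numeric_means = df.mean(numeric_only=True)

# Or narrow the frame to its numeric columns first
same_means = df.select_dtypes(include='number').mean()

# include='all' summarizes numeric and string columns in one table
everything = df.describe(include='all')

print(numeric_means)
print(everything)
```

Either approach gives the same means; select_dtypes is handy when several statistics are computed on the same numeric subset.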

5. Statistical functions in Pandas

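Function       Description
pct_change()   Compares each element with the previous one and computes the relative change (returned as a fraction, so 0.1 means 10%)
cov()          Computes the covariance between Series objects; NA values are excluded automatically
corr()         Computes the correlation, which measures the linear relationship between two numeric Series; supports pearson (default), spearman, and kendall
rank()         Produces a rank for each element of the data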
# Examples of the statistical functions
df = pd.DataFrame(np.random.randn(4, 2))
df.pct_change() # down the rows by default; use df.pct_change(axis=1) to go across the columns
          0         1
0       NaN       NaN
1 -1.599659  0.390988
2  1.791461 -0.938851
3 -2.746943  4.785401
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
s1.cov(s2)
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
df['a'].cov(df['b'])
df.cov() # covariance between every pair of columns

-0.12117368605285425
0.02248553586668352

          a         b         c         d         e
a  0.201676  0.022486 -0.053950  0.249407 -0.171799
b  0.022486  0.370753  0.072438  0.362195 -0.123009
c -0.053950  0.072438  1.018886  0.151529  0.325987
d  0.249407  0.362195  0.151529  1.893329 -0.448767
e -0.171799 -0.123009  0.325987 -0.448767  0.951633
df['a'].corr(df['b'])
df.corr()  # correlation coefficient between every pair of columns

0.08223070280618396

          a         b         c         d         e
a  1.000000  0.082231 -0.119014  0.403617 -0.392157
b  0.082231  1.000000  0.117859  0.432302 -0.207091
c -0.119014  0.117859  1.000000  0.109098  0.331056
d  0.403617  0.432302  0.109098  1.000000 -0.334328
e -0.392157 -0.207091  0.331056 -0.334328  1.000000
df.rank()
      a     b     c     d     e
0   3.0   9.0   2.0   6.0   5.0
1   7.0  10.0   9.0   9.0   9.0
2   1.0   3.0   4.0   2.0   4.0
3   4.0   7.0  10.0   7.0   6.0
4  10.0   4.0   8.0   4.0   2.0
5   8.0   5.0   3.0   1.0   7.0
6   9.0   6.0   1.0  10.0   1.0
7   6.0   2.0   7.0   5.0   8.0
8   2.0   1.0   6.0   3.0  10.0
9   5.0   8.0   5.0   8.0   3.0
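
Because the examples above use random numbers, the arithmetic behind them is hard to check by eye. The following sketch repeats pct_change, rank, and corr on small hand-picked values (illustrative assumptions, not data from this post) so each result can be verified manually.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 99.0])

# (current - previous) / previous; the first element has no predecessor
change = prices.pct_change()

# Tied values share the average of the ranks they would occupy
ranks = prices.rank()

# y grows monotonically with x, but not linearly
xy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [1.0, 4.0, 9.0, 16.0]})
pearson = xy.corr()                      # linear correlation (default method)
spearman = xy.corr(method='spearman')    # correlation of the ranks

print(change)
print(ranks)
print(pearson.loc['x', 'y'], spearman.loc['x', 'y'])
```

The two values of 99.0 would take ranks 1 and 2, so both receive 1.5. The Spearman coefficient reaches 1 because the ordering of x and y is identical, while the Pearson coefficient stays below 1 because the relationship is not a straight line.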

6. String and text functions in Pandas

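Function              Description
lower()               Converts the strings in the Series/Index to lowercase
upper()               Converts the strings in the Series/Index to uppercase
len()                 Computes the length of each string
strip()               Removes whitespace (including newlines) from both ends of each string in the Series/Index
split(' ')            Splits each string on the given pattern
cat(sep=' ')          Concatenates the Series/Index elements using the given separator
get_dummies()         Returns a DataFrame of one-hot encoded values
contains(pattern)     Returns True for each element that contains the pattern, otherwise False
replace(a, b)         Replaces the value a with the value b
repeat(value)         Repeats each element the given number of times
count(pattern)        Returns the number of occurrences of the pattern in each element
startswith(pattern)   Returns True if the element in the Series/Index starts with the pattern
endswith(pattern)     Returns True if the element in the Series/Index ends with the pattern
find(pattern)         Returns the position of the first occurrence of the pattern, or -1 if it is not found
findall(pattern)      Returns a list of all occurrences of the pattern
swapcase()            Swaps upper and lower case
islower()             Checks whether all characters of each string in the Series/Index are lowercase; returns booleans
isupper()             Checks whether all characters of each string in the Series/Index are uppercase; returns booleans
isnumeric()           Checks whether all characters of each string in the Series/Index are numeric; returns booleans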
# Examples of the string and text functions
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
     'Age':pd.Series([25,26,25,23,30,29,23]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(d)
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.len()

0 tom
1 james
2 ricky
3 vin
4 steve
5 minsu
6 jack
Name: Name, dtype: object

0 TOM
1 JAMES
2 RICKY
3 VIN
4 STEVE
5 MINSU
6 JACK
Name: Name, dtype: object

0 3
1 5
2 5
3 3
4 5
5 5
6 4
Name: Name, dtype: int64

(df.sort_values(by='Age'))['Name'].str.cat(sep='->')
df['Name'].str.contains('') # an empty pattern matches every string, so every result is True
df['Name'].str.replace('a','#')

'Vin->Jack->Tom->Ricky->James->Minsu->Steve'

0 True
1 True
2 True
3 True
4 True
5 True
6 True
Name: Name, dtype: bool

0 Tom
1 J#mes
2 Ricky
3 Vin
4 Steve
5 Minsu
6 J#ck
Name: Name, dtype: object

df['Name'].str.repeat(2)
df['Name'].str.count('m')
df['Name'].str.startswith('T')

0 TomTom
1 JamesJames
2 RickyRicky
3 VinVin
4 SteveSteve
5 MinsuMinsu
6 JackJack
Name: Name, dtype: object

0 1
1 1
2 0
3 0
4 0
5 0
6 0
Name: Name, dtype: int64

0 True
1 False
2 False
3 False
4 False
5 False
6 False
Name: Name, dtype: bool

df['Name'].str.find('m')
df['Name'].str.findall('m')
df['Name'].str.swapcase()

0 2
1 2
2 -1
3 -1
4 -1
5 -1
6 -1
Name: Name, dtype: int64

0 [m]
1 [m]
2 []
3 []
4 []
5 []
6 []
Name: Name, dtype: object

0 tOM
1 jAMES
2 rICKY
3 vIN
4 sTEVE
5 mINSU
6 jACK
Name: Name, dtype: object

df['Name'].str.isupper()
df['Name'].str.isnumeric()

0 False
1 False
2 False
3 False
4 False
5 False
6 False
Name: Name, dtype: bool

0 False
1 False
2 False
3 False
4 False
5 False
6 False
Name: Name, dtype: bool
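
The table lists strip, split, and get_dummies, but the examples above do not exercise them. Here is a minimal sketch on made-up strings (the names and values are illustrative assumptions) showing how they behave through the .str accessor.

```python
import pandas as pd

raw = pd.Series([' Tom Smith ', 'Ann Lee\n', 'Bob Ray'])

clean = raw.str.strip()         # drop leading/trailing whitespace, including the newline
parts = clean.str.split(' ')    # each element becomes a list of words
first = parts.str[0]            # first word of every list

# One indicator column per distinct value
dummies = pd.Series(['a', 'b', 'a']).str.get_dummies()

print(clean.tolist())
print(first.tolist())
print(dummies)
```

Indexing with .str[0] works on the lists produced by split, which makes it a convenient way to pull out one piece of each string.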

7. Window functions and aggregation in Pandas

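Function                      Description
df.rolling(window=n)          Applies a statistic over a sliding window of n rows; the first n-1 results are NaN, and the nth result is computed from the first n elements
df.expanding(min_periods=n)   Applies a statistic over a window that grows from the first row to the current row; results are NaN until n observations are available. Otherwise similar to rolling
df.aggregate(np.sum)          Applies aggregations: to the whole DataFrame, to a single column, to several columns, several functions to one column, or several functions to several columns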
# Window functions
df = pd.DataFrame(np.random.randn(10, 4),index = pd.date_range('1/1/2020', periods=10),columns = ['A', 'B', 'C', 'D'])
df.rolling(window=4).mean()
                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03       NaN       NaN       NaN       NaN
2020-01-04  0.072847  0.625421 -0.321718 -0.286368
2020-01-05 -0.111957  0.589124 -0.955854 -0.709480
2020-01-06 -0.785411  1.143256 -0.521317 -0.302849
2020-01-07 -0.662634  0.507758 -0.407731 -0.527692
2020-01-08  0.115185  0.596837 -0.579141 -0.239076
2020-01-09  0.245764  0.085004 -0.170510  0.295653
2020-01-10  0.627896 -0.273199 -0.122880  0.188302
df.expanding(min_periods=3).mean()
                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03  0.879951  0.437257 -0.767225 -0.045709
2020-01-04  0.072847  0.625421 -0.321718 -0.286368
2020-01-05  0.269547  0.644470 -0.466433 -0.504114
2020-01-06 -0.051696  0.680689 -0.491277 -0.303636
2020-01-07 -0.001526  0.477544 -0.561800 -0.321128
2020-01-08  0.094016  0.611129 -0.450430 -0.262722
2020-01-09  0.258977  0.395819 -0.334912 -0.148662
2020-01-10  0.220141  0.299134 -0.343918 -0.106861
# Aggregation functions
df1=df.rolling(window=3,min_periods=1)
df1.aggregate(np.sum) # apply an aggregation to the whole DataFrame
                   A         B         C         D
2020-01-01  1.795565  0.865853  1.491254  0.317351
2020-01-02  2.831468 -0.488887 -0.862395 -0.610418
2020-01-03  2.639853  1.311772 -2.301674 -0.137128
2020-01-04 -1.504176  1.635833 -2.778127 -1.462824
2020-01-05 -1.483733  3.711238 -1.469768 -1.910150
2020-01-06 -2.950030  2.772365 -0.645988 -1.684687
2020-01-07 -0.302073  0.841119 -2.645727 -1.102421
2020-01-08 -0.595606  1.666682 -1.271275  0.418790
2020-01-09  2.640971 -0.521768 -0.066542  0.483859
2020-01-10  2.212092 -0.351465  0.493420  1.179287
df1['A'].aggregate(np.sum) # apply an aggregation to a single column
df1[['A','C']].aggregate(np.sum) # apply an aggregation to several columns

2020-01-01 1.795565
2020-01-02 2.831468
2020-01-03 2.639853
2020-01-04 -1.504176
2020-01-05 -1.483733
2020-01-06 -2.950030
2020-01-07 -0.302073
2020-01-08 -0.595606
2020-01-09 2.640971
2020-01-10 2.212092
Freq: D, Name: A, dtype: float64

                   A         C
2020-01-01  1.795565  1.491254
2020-01-02  2.831468 -0.862395
2020-01-03  2.639853 -2.301674
2020-01-04 -1.504176 -2.778127
2020-01-05 -1.483733 -1.469768
2020-01-06 -2.950030 -0.645988
2020-01-07 -0.302073 -2.645727
2020-01-08 -0.595606 -1.271275
2020-01-09  2.640971 -0.066542
2020-01-10  2.212092  0.493420
df1['A'].aggregate([np.sum,np.mean]) # apply several functions to a single column
df1[['A','C']].aggregate([np.sum,np.mean]) # apply several functions to several columns
                 sum      mean
2020-01-01  1.795565  1.795565
2020-01-02  2.831468  1.415734
2020-01-03  2.639853  0.879951
2020-01-04 -1.504176 -0.501392
2020-01-05 -1.483733 -0.494578
2020-01-06 -2.950030 -0.983343
2020-01-07 -0.302073 -0.100691
2020-01-08 -0.595606 -0.198535
2020-01-09  2.640971  0.880324
2020-01-10  2.212092  0.737364

                   A                   C
                 sum      mean       sum      mean
2020-01-01  1.795565  1.795565  1.491254  1.491254
2020-01-02  2.831468  1.415734 -0.862395 -0.431197
2020-01-03  2.639853  0.879951 -2.301674 -0.767225
2020-01-04 -1.504176 -0.501392 -2.778127 -0.926042
2020-01-05 -1.483733 -0.494578 -1.469768 -0.489923
2020-01-06 -2.950030 -0.983343 -0.645988 -0.215329
2020-01-07 -0.302073 -0.100691 -2.645727 -0.881909
2020-01-08 -0.595606 -0.198535 -1.271275 -0.423758
2020-01-09  2.640971  0.880324 -0.066542 -0.022181
2020-01-10  2.212092  0.737364  0.493420  0.164473
df1.aggregate({'A' : np.sum,'B' : np.mean}) # apply a different function to each column
                   A         B
2020-01-01  1.795565  0.865853
2020-01-02  2.831468 -0.244444
2020-01-03  2.639853  0.437257
2020-01-04 -1.504176  0.545278
2020-01-05 -1.483733  1.237079
2020-01-06 -2.950030  0.924122
2020-01-07 -0.302073  0.280373
2020-01-08 -0.595606  0.555561
2020-01-09  2.640971 -0.173923
2020-01-10  2.212092 -0.117155
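
The difference between rolling and expanding is easier to see on a short deterministic series than on random data. The sketch below uses the values 1 through 5 (an illustrative assumption) and passes the aggregation names as strings, which is an alternative to the np.sum and np.mean callables used above.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Fixed window of 3 rows: NaN until 3 values are available, then slides
rolling_mean = s.rolling(window=3).mean()

# Growing window from the start: NaN until 3 values are available, then keeps all of them
expanding_mean = s.expanding(min_periods=3).mean()

# min_periods=1 lets the first windows be shorter than 3, so there are no NaN results
partial_sum = s.rolling(window=3, min_periods=1).sum()

# Several aggregations at once give one column per function
both = s.rolling(window=3, min_periods=1).agg(['sum', 'mean'])

print(rolling_mean.tolist())
print(expanding_mean.tolist())
print(both)
```

At the fourth element the rolling mean covers only 2, 3, 4 and gives 3.0, while the expanding mean covers 1, 2, 3, 4 and gives 2.5.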