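1. What is pandas
1.1 Definition
pandas is a library built on NumPy for data analysis tasks; it supplies the tools needed to work efficiently with large datasets.
It incorporates a number of libraries and some standard data models, and provides many functions and methods for processing data quickly and conveniently.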
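1.2 Data structures
Series: a one-dimensional labeled array, similar to a one-dimensional array in NumPy. Both are also close to Python's basic list. A Series can hold different kinds of data: strings, booleans, numbers, and so on.
Time series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Many of its features resemble data.frame in R. A DataFrame can be thought of as a container of Series. The rest of this text focuses mainly on DataFrame.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in 0.25, so it exists only in older versions.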
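To make the Series and DataFrame descriptions above concrete, here is a minimal sketch. The variable names (mixed, numbers, frame, column) are illustrative only and do not come from the original text.

```python
import pandas as pd

# A Series holding mixed Python objects falls back to the generic "object" dtype.
mixed = pd.Series(["text", True, 3.14])

# A Series of plain integers keeps a compact integer dtype.
numbers = pd.Series([1, 3, 5])

# Selecting one column of a DataFrame returns a Series,
# which is why a DataFrame can be viewed as a container of Series.
frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
column = frame["x"]

print(mixed.dtype)             # object
print(numbers.dtype)
print(type(column).__name__)   # Series
```

Storing mixed types works, but it gives up the speed of a homogeneous numeric column, so it is usually better to keep one type per Series where possible.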
2. Creating objects
2.1 Creating a Series
import numpy as np
import pandas as pd
s = pd.Series([1,3,5,4,6,8])
0 1
1 3
2 5
3 4
4 6
5 8
dtype: int64
2.2 Creating a time series
dates = pd.date_range('20181001',periods=6)
DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
'2018-10-05', '2018-10-06'],
dtype='datetime64[ns]', freq='D')
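The date_range call above only builds the index. A small sketch of how that index turns a Series into a time series follows; the name ts and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Six consecutive days starting 2018-10-01, as in the example above.
dates = pd.date_range("20181001", periods=6)

# Using the DatetimeIndex as the index makes this Series a time series.
ts = pd.Series([1, 3, 5, 4, 6, 8], index=dates)

# Label-based lookup with a date string.
print(ts.loc["2018-10-03"])                          # 5

# Slicing by date strings includes both endpoints.
print(ts.loc["2018-10-02":"2018-10-04"].tolist())    # [3, 5, 4]
```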
2.3 Creating a DataFrame
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
A B C D
2018-10-01 -0.419864 -0.716621 -1.522754 -0.896049
2018-10-02 0.361538 0.312017 0.427167 -1.191305
2018-10-03 -2.019418 -0.895160 0.425233 1.467052
2018-10-04 1.186440 0.251670 -0.995712 1.845424
2018-10-05 0.509954 1.543433 -0.036554 -0.426340
2018-10-06 0.585052 -1.347779 -0.115104 -0.058916
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20181002'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'hello' })
A B C D E F
0 1.0 2018-10-02 1.0 3 test hello
1 1.0 2018-10-02 1.0 3 train hello
2 1.0 2018-10-02 1.0 3 test hello
3 1.0 2018-10-02 1.0 3 train hello
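The dict form above gives each column its own type. The sketch below uses a reduced version of df2 (columns B and F are left out to keep it short) to inspect the resulting dtypes.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, repeated on every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # supplies the row index 0..3
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

print(df2.shape)    # (4, 4)
print(df2.dtypes)   # one dtype per column
```

Scalars such as 1.0 are broadcast to the length set by the other columns, which is why a single value is enough for column A.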
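2.4 Creating a Panel
# Note: pd.Panel was deprecated in pandas 0.20 and removed in 0.25; this call only works on older versions.
data = pd.Panel(np.random.rand(2,4,5))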
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
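On pandas versions without Panel, one common substitute is a DataFrame with a two-level row index. The following is a sketch of that idea, not the original author's method; the level names "item" and "major" are assumptions picked to mirror the Panel axes.

```python
import numpy as np
import pandas as pd

# Same shape as the Panel example: 2 items x 4 major rows x 5 minor columns.
values = np.random.rand(2, 4, 5)

# Flatten the first two axes into a two-level row index (item, major).
index = pd.MultiIndex.from_product(
    [range(2), range(4)], names=["item", "major"]
)
data = pd.DataFrame(values.reshape(2 * 4, 5), index=index)

# Selecting one item gives back an ordinary 4 x 5 DataFrame.
first_item = data.loc[0]
print(data.shape)         # (8, 5)
print(first_item.shape)   # (4, 5)
```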
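3. Getting started (mainly with DataFrame)
df (the values come from np.random.randn, so the outputs below were produced by separate runs and their numbers do not match each other):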
A B C D
2018-10-01 1.097650 0.600073 1.296362 -1.119968
2018-10-02 0.931835 -1.174313 1.247509 0.462275
2018-10-03 -1.481357 1.737640 -0.853880 0.146072
2018-10-04 3.658062 -0.276350 -0.529113 0.389288
2018-10-05 -0.867468 0.153941 1.182122 1.380770
2018-10-06 -0.019635 -1.102165 0.503295 0.403443
df.head()
A B C D
2018-10-01 1.097650 0.600073 1.296362 -1.119968
2018-10-02 0.931835 -1.174313 1.247509 0.462275
2018-10-03 -1.481357 1.737640 -0.853880 0.146072
2018-10-04 3.658062 -0.276350 -0.529113 0.389288
2018-10-05 -0.867468 0.153941 1.182122 1.380770
df.tail(4)
A B C D
2018-10-03 -1.575419 0.408146 -0.727951 0.892273
2018-10-04 0.477188 -1.134372 1.354922 -0.641237
2018-10-05 -1.060002 -0.117877 -0.245136 0.048334
2018-10-06 -0.264465 0.054226 -0.381926 -0.037507
df.index
DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04',
'2018-10-05', '2018-10-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
[[-0.01094233 0.56624115 -0.69607107 -1.41716622]
[ 1.6608257 -0.9003276 -1.21582265 0.09932746]
[ 0.18357827 -1.31061872 0.02711289 -1.25631321]
[ 1.21774381 -0.52218704 -1.00689223 2.86039582]
[-0.85700921 0.15736927 0.88435886 -1.65245358]
[ 0.74485154 0.06683108 0.31701702 -0.0537278 ]]
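df.head() returns the first five rows by default
df.tail(4) returns the last four rows
df.index is the row index
df.columns is the column index
df.values is the underlying data as a NumPy array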
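Because the frames above are random, the effect of each accessor is easier to check on fixed data. Here is a minimal sketch that assumes a frame filled with the integers 0 to 23 instead of random numbers.

```python
import numpy as np
import pandas as pd

# A small deterministic frame (values 0..23) so the results are reproducible.
dates = pd.date_range("20181001", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

print(df.head().shape)    # (5, 4): head() defaults to n=5
print(df.tail(4).shape)   # (4, 4)
print(list(df.columns))   # ['A', 'B', 'C', 'D']
print(df.index[0])        # 2018-10-01 00:00:00
print(df.values[0])       # [0 1 2 3]
```

In pandas 0.24 and later, df.to_numpy() is the recommended way to get the same array as df.values.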
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.325588 -0.381394 -0.756222 0.864435
std 0.743853 1.210628 1.682261 0.536523
min -0.890606 -2.236455 -2.670481 0.120198
25% -0.028596 -1.112915 -2.051519 0.578564
50% 0.539035 -0.068343 -0.686917 0.830762
75% 0.806438 0.422785 -0.046963 1.162358
max 1.088299 0.959387 1.847014 1.637499
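A quick statistical summary:
count: number of non-null values
mean: arithmetic mean
std: standard deviation
min: minimum
25%: first quartile (Q1), also called the lower quartile; the value at the 25% position once the sample is sorted in ascending order.
50%: median
75%: third quartile (Q3), defined in the same way at the 75% position
max: maximum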
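The statistics listed above can be checked by hand on a tiny sample. This sketch assumes a Series of the integers 1 to 6; note that the quartiles are linearly interpolated when the position falls between two values.

```python
import pandas as pd

# Six known values so each statistic can be verified manually.
s = pd.Series([1, 2, 3, 4, 5, 6])
summary = s.describe()

print(summary["count"])   # 6.0
print(summary["mean"])    # 3.5
print(summary["25%"])     # 2.25 (interpolated between 2 and 3)
print(summary["50%"])     # 3.5
print(summary["75%"])     # 4.75 (interpolated between 4 and 5)

# describe() reports the sample standard deviation (divides by n - 1).
print(summary["std"], s.std(ddof=1))
```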
df.T
2018-10-01 2018-10-02 ... 2018-10-05 2018-10-06
A -1.086541 0.806482 ... 1.464247 0.413810
B -0.043731 -0.988297 ... -0.532121 -0.214636
C 0.370472 -1.854053 ... 0.669644 1.455905
D 1.057330 1.665222 ... -0.239430 -0.392194
[4 rows x 6 columns]
Transpose: rows and columns are swapped.
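A short sketch of what the transpose does to the shape and labels, again assuming a deterministic frame rather than the random one used in the text:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181001", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

transposed = df.T
print(df.shape)                  # (6, 4)
print(transposed.shape)          # (4, 6)
print(list(transposed.index))    # ['A', 'B', 'C', 'D']

# Transposing twice gives back the original frame.
print(transposed.T.equals(df))   # True
```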
df.sort_index()
A B C D
2018-10-01 -0.636390 0.270001 1.482772 0.487126
2018-10-02 0.424275 2.079136 0.234442 -1.508086
2018-10-03 -1.719585 -1.404995 0.581046 -0.588827
2018-10-04 0.206164 -0.710080 0.214864 -0.245257
2018-10-05 0.157271 -1.659809 -0.689508 -0.627286
2018-10-06 0.054739 0.081013 -0.845970 0.058534
df.sort_index(ascending=False)
A B C D
2018-10-06 0.050489 1.725587 1.401500 0.109099
2018-10-05 -1.642711 -1.246953 0.031868 -0.742620
2018-10-04 0.448737 0.953092 0.072601 0.277347
2018-10-03 -0.585020 1.247202 -0.677308 -0.082281
2018-10-02 -0.441545 0.502394 -0.823207 0.077199
2018-10-01 0.320901 -0.641877 -0.592185 -0.006243
df.sort_index(axis=1)
A B C D
2018-10-01 -0.080709 -1.268852 -0.029146 1.730056
2018-10-02 0.177026 -0.095608 0.630715 1.251889
2018-10-03 -0.324091 1.135895 1.443761 -1.656423
2018-10-04 1.341659 -0.495091 -0.925834 -0.515833
2018-10-05 0.198719 1.271994 -0.315808 0.285534
2018-10-06 -0.085695 -1.108529 -0.047992 3.094784
Sort by the row index: frame.sort_index()
Sort by the column index: frame.sort_index(axis=1)
The default order is ascending; pass ascending=False for descending order.
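The frames in the text are already in index order, so the sorting is hard to see. The sketch below assumes a small frame whose rows and columns are deliberately shuffled.

```python
import pandas as pd

# Rows and columns are deliberately out of order.
frame = pd.DataFrame(
    {"C": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]},
    index=["y", "z", "x"],
)

by_rows = frame.sort_index()                       # row labels ascending
by_rows_desc = frame.sort_index(ascending=False)   # row labels descending
by_cols = frame.sort_index(axis=1)                 # column labels ascending

print(list(by_rows.index))        # ['x', 'y', 'z']
print(list(by_rows_desc.index))   # ['z', 'y', 'x']
print(list(by_cols.columns))      # ['A', 'B', 'C']
print(list(frame.index))          # ['y', 'z', 'x'] (the original is unchanged)
```

sort_index returns a new object by default, so the result has to be assigned to a name to be kept.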
df.sort_index(by=['B'],ascending=False)  # old API: the by= argument of sort_index is deprecated and no longer accepted by current pandas
df.sort_values('B',ascending=False)
A B C D
2018-10-05 -1.513641 2.605290 0.487593 0.180191
2018-10-04 -0.750031 0.744077 -1.218886 1.500481
2018-10-02 1.312716 0.043054 -0.037574 -0.579429
2018-10-01 -0.206157 -0.360247 0.024310 1.851131
2018-10-03 0.522026 -0.496297 0.465798 -0.942027
2018-10-06 0.303735 -0.860363 -0.800755 1.101880
A B C D
2018-10-05 -1.513641 2.605290 0.487593 0.180191
2018-10-04 -0.750031 0.744077 -1.218886 1.500481
2018-10-02 1.312716 0.043054 -0.037574 -0.579429
2018-10-01 -0.206157 -0.360247 0.024310 1.851131
2018-10-03 0.522026 -0.496297 0.465798 -0.942027
2018-10-06 0.303735 -0.860363 -0.800755 1.101880
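Sorting by value: in the pandas version used here the two calls above are equivalent. Current pandas accepts only the sort_values form.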
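A minimal sort_values sketch on fixed data, including a sort on two columns. The frame and its labels are assumptions made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [0.5, -1.0, 2.0, 0.5]},
    index=["w", "x", "y", "z"],
)

# Sort rows by column B, largest first.
desc = frame.sort_values("B", ascending=False)
print(desc["B"].tolist())    # [2.0, 0.5, 0.5, -1.0]

# Sort by B descending, then break ties on A descending.
multi = frame.sort_values(["B", "A"], ascending=[False, False])
print(list(multi.index))     # ['y', 'z', 'w', 'x']
```

Passing a list to ascending allows a different direction per column.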