一、导入库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
二、创建对象
1、通过传递一个list对象来创建一个 Series,pandas会默认创建整型索引:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s= pd.Series([1,3,5,np.nan,6,8])
print(s)
//输出
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
2、通过传递一个numpy array,时间索引以及列标签来创建一个DataFrame:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s= pd.Series([1,3,5,np.nan,6,8])
# print(s)
dates = pd.date_range('20170801',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
print(df)
//
A B C D
2017-08-01 -0.352464 0.207259 -0.379499 -0.040017
2017-08-02 -1.400304 -1.609741 1.149181 0.353892
2017-08-03 -0.244790 0.343552 -2.269876 -0.832983
2017-08-04 0.817258 -0.256027 0.917888 -0.240781
2017-08-05 0.312557 -1.103468 1.140370 -0.097515
2017-08-06 -0.945031 0.178114 0.099680 -0.993974
3、通过传递一个能够被转换成类似序列结构的字典对象来创建一个DataFrame:
df2 = pd.DataFrame({
'A': 1,
'B':pd.Timestamp('20170801'),
'C':pd.Series(1,index=list(range(4)),dtype='float64'),
'D':np.array([3]*4,dtype='int64'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'fooo'
})
print(df2)
//
A B C D E F
0 1 2017-08-01 1.0 3 test fooo
1 1 2017-08-01 1.0 3 train fooo
2 1 2017-08-01 1.0 3 test fooo
3 1 2017-08-01 1.0 3 train fooo
4、查看不同列的数据类型
print(df2.dtype)
//
A int64
B datetime64[ns]
C float64
D int64
E category
F object
dtype: object
三、查看数据:
查看frame中头部和尾部的行
df.head(3)
df.tail(3)
2、显示索引、列和底层的numpy数据。
print(df.index)
print(df.columns)
print(df.values)
//
DatetimeIndex(['2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04',
'2017-08-05', '2017-08-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[-0.99340137 -0.12548833 -1.08581538 1.17112124]
[ 0.29427792 0.20182606 -0.84427908 -1.42783808]
[-0.06528316 1.55255719 1.15818104 0.05286832]
[ 1.49223138 -1.11517738 2.01386582 0.3496034 ]
[ 0.23697223 0.43241392 -0.97061103 0.79105226]
[ 2.96518919 0.8754193 -1.26479748 0.29930427]]
3、describe()函数对数据快速的统计汇总:
print(describe())
//
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.396478 0.089271 -0.006982 0.281260
std 1.342037 1.158896 0.513877 0.997442
min -1.395445 -1.287098 -1.047989 -1.062651
25% -0.390341 -0.907016 0.116073 -0.073818
50% 0.210131 0.248303 0.193435 0.058509
75% 1.401632 0.856122 0.235431 0.655625
max 2.145753 1.559510 0.279513 1.896458
4、对数据的转置,按轴进行排序,按值进行排序
df.T
df.sort_index(axis=1,ascending=False)
df.sort(columns='B')
四、选择:
虽然标准的python/numpy的选择和设置表达式都能够直接派上用场,但是作为工程使用的代码,我们推荐使用经过优化的pandas数据访问 方式:.at .iat .iloc .loc .ix
获取:
1、选择一个单独的列,这将会返回一个Series,等同于df.A:
print(df['A'])
//
2017-08-01 0.559156
2017-08-02 -1.381263
2017-08-03 0.127286
2017-08-04 0.771317
2017-08-05 0.763690
2017-08-06 -0.123249
Freq: D, Name: A, dtype: float64
2、通过多个标签在多个轴上进行选择 用loc 两个中括号
print(df.loc[:,['A','B']])
//
A B
2017-08-01 0.040603 1.519448
2017-08-02 -2.429148 0.874873
2017-08-03 -0.315154 0.285603
2017-08-04 0.878968 -0.130717
2017-08-05 0.201083 -0.280419
2017-08-06 1.089989 0.754818
3、标签切片,类似于2
print(df.loc['20170802':'20170804',['A','B']])
//
A B
2017-08-02 1.923047 0.324715
2017-08-03 -0.349817 0.085359
2017-08-04 -0.675602 1.166312
4、对于返回的对象进行维度缩减
print(df.loc['20170803',['A','B']])
//
A -1.133357
B -1.008605
Name: 2017-08-03 00:00:00, dtype: float64
5、获取一个标量以及快速访问一个标量。(意思等价)
df.loc(dates[0],'A')
df.at[dates[0],'A']
五、通过位置选择。
1、通过传递数值进行 位置选择(选择的是行)
print(df)
print(df.iloc[3])# 取第四行的数据
//
2017-08-01 0.150731 -0.167577 1.524639 -0.324676
2017-08-02 -0.118888 -0.100348 0.666961 0.618953
2017-08-03 0.634666 -0.937274 0.280727 -2.504798
2017-08-04 0.395685 -0.650699 -0.573055 -2.597323
2017-08-05 1.403470 0.636823 0.289136 1.294045
2017-08-06 -1.337154 1.718215 -0.093115 -1.529378
A 0.395685
B -0.650699
C -0.573055
D -2.597323
Name: 2017-08-04 00:00:00, dtype: float64
2、通过数值进行切片,与numpy /python 中的情况类似
print(df.iloc[3:5,0:2])
//
A B
2017-08-04 1.064645 1.346498
2017-08-05 -0.253172 0.713125
Process finished with exit code 0
3、通过指定一个位置的列表,与numpy/python 中的情况类似
print(df.iloc[[1,2,4],[0,2]])
//
A C
2017-08-02 -0.987798 -0.081209
2017-08-03 0.039192 -0.244556
2017-08-05 -0.043044 -0.337542
Process finished with exit code 0
4、对行和列进行切片
print(df.iloc[1:3,:])#对行进行切片
print(df.iloc[:,1:3]) #对列进行切片
//
A B C D
2017-08-02 0.598266 0.410904 -0.587577 0.979511
2017-08-03 -0.206643 -0.859047 -0.110966 -0.696556
B C
2017-08-01 -0.341683 -0.378477
2017-08-02 0.410904 -0.587577
2017-08-03 -0.859047 -0.110966
2017-08-04 -0.486971 0.248343
2017-08-05 2.641393 -1.229395
2017-08-06 1.023585 0.630160
Process finished with exit code 0
5、获取特定的值
print(df.iloc[1,1])
print(df.iat[1,1])
//
0.426774827739
0.426774827739
Process finished with exit code 0
六 布尔索引
1、使用一个单独列的值来选择数据:
print(df[df.A > 0])
//
A B C D
2017-08-01 1.113008 -1.663732 -0.321016 0.079169
2017-08-03 0.000043 0.217510 0.929183 0.687162
2017-08-04 0.167007 -2.045754 0.318775 -0.168735
2017-08-05 0.721842 0.982646 0.367562 0.468250
2017-08-06 0.853404 -0.344893 2.266556 -0.304485
2、使用where操作来选择数据:
print(df[df>0])
//
A B C D
2017-08-01 0.183377 NaN 0.411767 0.535607
2017-08-02 0.169412 1.331423 0.815747 NaN
2017-08-03 0.807875 NaN NaN NaN
2017-08-04 0.559794 NaN 1.384336 0.011442
2017-08-05 0.250613 NaN NaN NaN
2017-08-06 NaN NaN NaN NaN
3、使用isin()方法来过滤:
df2 = df.copy()
df2['E'] = ['one','one','two','three','four','three']
print(df2)
print(df2[df2['E'].isin(['one','four'])])
//
A B C D E
2017-08-01 0.812330 1.333924 -1.267516 0.536965 one
2017-08-02 0.775528 1.071335 -1.344108 2.075409 one
2017-08-03 -0.826466 2.613913 0.153449 0.475277 two
2017-08-04 -0.190586 -0.613647 -0.044821 -1.224435 three
2017-08-05 -0.025830 -0.432906 0.990122 0.583328 four
2017-08-06 1.258012 -1.513605 2.128874 0.615266 three
A B C D E
2017-08-01 0.812330 1.333924 -1.267516 0.536965 one
2017-08-02 0.775528 1.071335 -1.344108 2.075409 one
2017-08-05 -0.025830 -0.432906 0.990122 0.583328 four
Process finished with exit code 0
七、设置
1、设置新的列:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20170801',periods=6))
print(s1)
//
2017-08-01 1
2017-08-02 2
2017-08-03 3
2017-08-04 4
2017-08-05 5
2017-08-06 6
Freq: D, dtype: int64
2、通过标签设置新的值
pdf.at[dates[0],'A']=0
3、通过位置设置新的值:
df.iat[0,1]=0
4、通过一个numpy数组设置一组新值
df.loc[:'D'] = np.array([5]*len(df))
5、通过where操作来设置新的值:
df2 =df.copy()
df2[df2>0]=-df2
print(df2)
//
A B C D
2017-08-01 -0.994060 -0.915353 -0.103166 -0.400363
2017-08-02 -0.483559 -0.775021 -0.549397 -0.525969
2017-08-03 -1.394320 -1.430371 -1.592913 -0.479944
2017-08-04 -0.164794 -0.728599 -0.187677 -1.950128
2017-08-05 -0.827711 -0.678634 -1.089334 -1.224033
2017-08-06 -0.159199 -0.204388 -0.697539 -0.193927
八、缺失值的处理
在pandas中,使用np.nan来代替缺失值, 这些值将默认不会包含在计算中,
1、reindex()方法可以对指定轴上的索引进行改变/增加、删除操作 这将返回原始数据的一个拷贝
df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df2.loc[0:2,'E'] = 1
print(df2)
//
A B C D E
2017-08-01 -0.631494 -0.427917 -2.263502 0.499206 1.0
2017-08-02 -0.363730 -1.641072 0.251015 1.127523 1.0
2017-08-03 1.284590 -0.249052 -0.410623 0.304096 NaN
2017-08-04 1.166190 1.371307 -0.636533 0.959298 NaN
2、去掉包含缺失值的行
df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df2.loc[0:2,'E'] = 1
print(df2)
print(df2.dropna(how='any'))
//
A B C D E
2017-08-01 0.844367 0.282881 -0.596537 1.106986 1.0
2017-08-02 -0.297101 -0.935122 0.100942 2.362901 1.0
2017-08-03 0.775431 0.187812 1.244166 1.549058 NaN
2017-08-04 0.146374 0.604248 -0.275827 -1.339286 NaN
A B C D E
2017-08-01 0.844367 0.282881 -0.596537 1.106986 1.0
2017-08-02 -0.297101 -0.935122 0.100942 2.362901 1.0
3、对缺失值进行补充:
print(df2.fillna(value=5))
//
A B C D E
2017-08-01 0.134828 -0.409695 0.398769 -0.607453 1.0
2017-08-02 0.461754 -0.555881 1.073785 2.268998 1.0
2017-08-03 2.494864 0.894457 -1.012622 0.225400 5.0
2017-08-04 -0.789087 -0.632234 -1.368574 -1.554802 5.0
4、对数据进行布尔填充:
print(pd.isnull(df2))
//
A B C D E
2017-08-01 False False False False False
2017-08-02 False False False False False
2017-08-03 False False False False True
2017-08-04 False False False False True
九、相关操作
1、执行描述性统计:
print(df)
print(df.mean())
print(df.mean(1))
//
A B C D
2017-08-01 -0.211556 0.151164 0.827327 -1.055808
2017-08-02 -0.181771 -0.388966 2.045035 0.118686
2017-08-03 -1.790196 -1.234819 0.432210 0.390979
2017-08-04 0.240327 0.018650 0.558570 0.044662
2017-08-05 -0.336993 -0.496467 2.512642 0.198416
2017-08-06 0.824033 -0.083081 0.021884 3.237294
A -0.242693
B -0.338920
C 1.066278
D 0.489038
dtype: float64
2017-08-01 -0.072218
2017-08-02 0.398246
2017-08-03 -0.550457
2017-08-04 0.215553
2017-08-05 0.469400
2017-08-06 1.000032
Freq: D, dtype: float64
2、对于拥有不同维度,需要对齐的对象进行操作。pandas会自动的沿着指定的维度进行广播:
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
print(s)
//
2017-08-01 NaN
2017-08-02 NaN
2017-08-03 1.0
2017-08-04 3.0
2017-08-05 5.0
2017-08-06 NaN
Freq: D, dtype: float64
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
print(s)
print(df.sub(s,axis='index'))
//
2017-08-01 NaN
2017-08-02 NaN
2017-08-03 1.0
2017-08-04 3.0
2017-08-05 5.0
2017-08-06 NaN
Freq: D, dtype: float64
A B C D
2017-08-01 NaN NaN NaN NaN
2017-08-02 NaN NaN NaN NaN
2017-08-03 -1.830701 0.504959 -0.709195 -0.847686
2017-08-04 -3.172816 -2.310923 -1.233996 -1.966726
2017-08-05 -4.368654 -4.300136 -3.945498 -3.271250
2017-08-06 NaN NaN NaN NaN
3、对数据应用函数
print(df)
print(df.apply(np.cumsum))# 累计求和
//
A B C D
2017-08-01 2.643719 -1.955916 -0.625852 -0.200130
2017-08-02 0.510827 -1.552670 -0.669396 1.146157
2017-08-03 -0.168740 0.055871 -0.108548 1.198155
2017-08-04 -1.524010 0.844605 -0.157555 1.387963
2017-08-05 0.220436 -1.345371 0.038241 -0.055306
2017-08-06 -1.299420 -1.663129 -2.586310 1.218230
A B C D
2017-08-01 2.643719 -1.955916 -0.625852 -0.200130
2017-08-02 3.154545 -3.508586 -1.295248 0.946027
2017-08-03 2.985805 -3.452715 -1.403796 2.144182
2017-08-04 1.461795 -2.608110 -1.561352 3.532144
2017-08-05 1.682231 -3.953481 -1.523111 3.476838
2017-08-06 0.382811 -5.616610 -4.109421 4.695068
print(df.apply(lambda x:x.max() - x.min()))
//
A 1.510451
B 3.455243
C 1.772878
D 2.703384
dtype: float64
4、value_counts()
s = pd.Series(np.random.randint(0,7,size=10))
print(s)
print(s.value_counts())
print(type(s.value_counts()))
print(pd.value_counts(s))
//
0 1
1 1
2 3
3 5
4 2
5 4
6 2
7 0
8 2
9 6
dtype: int32
2 3
1 2
6 1
5 1
4 1
3 1
0 1
dtype: int64
<class 'pandas.core.series.Series'>
2 3
1 2
6 1
5 1
4 1
3 1
0 1
dtype: int64
Process finished with exit code 0
5、字符串方法
Series对象在其str属性中配备了一组字符串处理方法,可以很容易的应用到数组中的每个元素,如下段代码所示。
s.str.lower()
#将s Series 字符串变成小写。
十、合并
Pandas 提供了大量的方法能够轻松的对Series,DataFrame和Panel对象进行各种符合各种逻辑关系的合并 。
1、concat
df = pd.DataFrame(np.random.randn(10,4))
print("is--",df)
pieces = [df[:3],df[3:7],df[7:]]
df1 = pd.concat(pieces)
print("is==",df1)
//
is-- 0 1 2 3
0 -1.033090 -0.884844 -0.868251 0.504889
1 0.264918 -2.486587 -1.985965 -1.773131
2 -1.056175 -0.647278 -2.123045 0.292224
3 0.664369 -0.185612 -0.628173 0.214924
4 0.740074 0.515744 -2.547639 -0.116405
5 -0.578900 -0.966765 0.455904 -0.623074
6 -2.000682 1.712143 0.532171 0.807520
7 -1.604986 1.015452 2.270513 0.734573
8 -0.817223 0.211418 0.543868 0.930036
9 -0.288155 -1.231245 -1.162904 0.372432
is== 0 1 2 3
0 -1.033090 -0.884844 -0.868251 0.504889
1 0.264918 -2.486587 -1.985965 -1.773131
2 -1.056175 -0.647278 -2.123045 0.292224
3 0.664369 -0.185612 -0.628173 0.214924
4 0.740074 0.515744 -2.547639 -0.116405
5 -0.578900 -0.966765 0.455904 -0.623074
6 -2.000682 1.712143 0.532171 0.807520
7 -1.604986 1.015452 2.270513 0.734573
8 -0.817223 0.211418 0.543868 0.930036
9 -0.288155 -1.231245 -1.162904 0.372432
2、Join类似与SQL类型的合并
left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
print(left)
print(right)
print(pd.merge(left,right,on='key))
//
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
3、append将一行连接到一个DataFrame上。
print(df)
s =df.iloc[3]
print(df.append(s,ignore_index = True))
//
A B C D
2017-08-01 0.040061 -0.552461 -1.442001 -0.443429
2017-08-02 1.850017 -0.526213 0.548660 -0.760185
2017-08-03 1.233839 -0.338561 -1.441615 0.008451
2017-08-04 -0.622002 -1.308038 -0.926464 -0.890123
2017-08-05 -0.271971 0.994797 -0.806631 -0.191609
2017-08-06 0.078366 -0.014979 0.603089 -1.145712
A B C D
0 0.040061 -0.552461 -1.442001 -0.443429
1 1.850017 -0.526213 0.548660 -0.760185
2 1.233839 -0.338561 -1.441615 0.008451
3 -0.622002 -1.308038 -0.926464 -0.890123
4 -0.271971 0.994797 -0.806631 -0.191609
5 0.078366 -0.014979 0.603089 -1.145712
6 -0.622002 -1.308038 -0.926464 -0.890123
4、分组
对于”group by”操作,我们通常是指以下一个或多个操作步骤:
- (Splitting)按照一些规则将数据分为不同的组;
- (Applying)对于每组数据分别执行一个函数;
- (Combining)将结果组合到一个数据结构中;
1、 分组并对每个分组执行sum函数:
2、 通过多个列进行分组形成一个层次索引,然后执行函数: