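[Draft notes from my own study, kept for personal reference]
The most important things in pandas are its two data structures: Series and DataFrame.
Let's start by getting familiar with the basic commands.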
1. import:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Create Series and DataFrame
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
# create Series by passing a list of values
>>>s = pd.Series([1,3,5,np.nan,6,8])
>>>s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
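A small sketch of the "labeled" part of the description above: the index does not have to be 0..n-1. The name `s_labeled` and the labels 'a', 'b', 'c' are made up for illustration. It also shows why the Series above printed as float64: NaN is a float, so the integers next to it are stored as floats.

```python
import numpy as np
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
# (s_labeled and its labels are illustrative names, not from the notes above)
s_labeled = pd.Series([1.0, 3.0, np.nan], index=['a', 'b', 'c'])

value_b = s_labeled['b']          # look up a value by its label
labels = list(s_labeled.index)    # the axis labels live in .index
non_missing = s_labeled.count()   # count() skips NaN

# Mixing integers with NaN forces a float dtype
mixed_dtype = pd.Series([1, 3, 5, np.nan, 6, 8]).dtype
print(value_b, labels, non_missing, mixed_dtype)
```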
# create DataFrame by passing a NumPy array with a datetime index and labeled columns
>>>dates = pd.date_range('20181024',periods=6)
>>>dates
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
'2018-10-28', '2018-10-29'],
dtype='datetime64[ns]', freq='D')
>>>df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list('ABCD'))
>>>df
A B C D
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
# create DataFrame by passing a dict of objects that can be converted to a series-like structure
>>>df2 = pd.DataFrame({'A':1.,
'B':pd.Timestamp('20181024'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
>>>df2
A B C D E F
0 1.0 2018-10-24 1.0 3 test foo
1 1.0 2018-10-24 1.0 3 train foo
2 1.0 2018-10-24 1.0 3 test foo
3 1.0 2018-10-24 1.0 3 train foo
>>>df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
3. Viewing Data
>>>df.head()
A B C D
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
>>>df.tail()
A B C D
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
>>>df.tail(2)
A B C D
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
# Display the index, columns, and the underlying NumPy data:
>>>df.index
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
'2018-10-28', '2018-10-29'],
dtype='datetime64[ns]', freq='D')
>>>df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>>df.values
array([[-0.44857797, -0.43094069, 0.1163714 , 0.02990564],
[ 1.47278109, -1.88029967, 1.62607715, -0.39742327],
[-1.18227842, 0.11046589, -0.36528377, -0.75864786],
[ 0.79296163, 1.97246094, -0.0098655 , 0.58424183],
[ 1.86912973, -1.19133507, -0.90369126, -1.42584953],
[ 0.71844329, 0.99475901, -1.20281179, 0.09639716]])
>>>df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.537077 -0.070815 -0.123201 -0.311896
std 1.155506 1.414410 0.996348 0.711954
min -1.182278 -1.880300 -1.202812 -1.425850
25% -0.156823 -1.001236 -0.769089 -0.668342
50% 0.755702 -0.160237 -0.187575 -0.183759
75% 1.302826 0.773686 0.084812 0.079774
max 1.869130 1.972461 1.626077 0.584242
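To check my reading of the describe() table, here is a rough sketch on a tiny hand-made frame (`df_demo` is a hypothetical name, the numbers are chosen so the statistics are easy to verify by hand). As far as I can tell, each row of the summary matches the corresponding column method, and `std` is the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

# Small deterministic frame so the statistics can be checked by hand
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                        'B': [10.0, 20.0, 30.0, 40.0]})
summary = df_demo.describe()

mean_a = summary.loc['mean', 'A']    # same as df_demo['A'].mean()
median_a = summary.loc['50%', 'A']   # the 50% row is the median
std_a = summary.loc['std', 'A']      # sample standard deviation (ddof=1)
print(summary)
```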
# Transposing the data
>>>df.T
2018-10-24 2018-10-25 ... 2018-10-28 2018-10-29
A -0.448578 1.472781 ... 1.869130 0.718443
B -0.430941 -1.880300 ... -1.191335 0.994759
C 0.116371 1.626077 ... -0.903691 -1.202812
D 0.029906 -0.397423 ... -1.425850 0.096397
# sort by axis
>>>df.sort_index(axis=0,ascending=False)
A B C D
2018-10-29 0.718443 0.994759 -1.202812 0.096397
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
# sort by values
>>>df.sort_values(by='B')
A B C D
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-29 0.718443 0.994759 -1.202812 0.096397
2018-10-27 0.792962 1.972461 -0.009865 0.584242
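A sketch of two sort_values variations I expect to need: descending order, and sorting by several columns with a different direction for each. The frame `df_demo` and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with ties in column A
df_demo = pd.DataFrame(
    {'A': [2, 1, 2, 1], 'B': [0.5, -1.0, 3.0, 2.0]},
    index=['w', 'x', 'y', 'z'],
)

# Largest B first
by_b_desc = df_demo.sort_values(by='B', ascending=False)

# Sort by A ascending, then break ties with B descending
by_a_then_b = df_demo.sort_values(by=['A', 'B'], ascending=[True, False])

# sort_values returns a new object; df_demo keeps its original row order
print(by_b_desc)
print(by_a_then_b)
```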
4. Selection
# Getting a single column
>>>df['A']
2018-10-24 -0.448578
2018-10-25 1.472781
2018-10-26 -1.182278
2018-10-27 0.792962
2018-10-28 1.869130
2018-10-29 0.718443
Freq: D, Name: A, dtype: float64
>>>df.A
2018-10-24 -0.448578
2018-10-25 1.472781
2018-10-26 -1.182278
2018-10-27 0.792962
2018-10-28 1.869130
2018-10-29 0.718443
Freq: D, Name: A, dtype: float64
# Selecting via [], which slices the rows
>>>df
A B C D
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
>>>df[0:3]
A B C D
2018-10-24 -0.448578 -0.430941 0.116371 0.029906
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648
>>>df['2018-10-27':]
A B C D
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
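One detail worth writing down about the two slices above: an integer slice is positional and excludes its end, while a label slice includes its end. A minimal sketch, using a hypothetical `df_demo` filled with np.arange so the rows are easy to recognize:

```python
import numpy as np
import pandas as pd

dates_demo = pd.date_range('20181024', periods=6)
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=dates_demo, columns=list('ABCD'))

# Positional slice: end is exclusive, so this is rows 0, 1 and 2
first_three = df_demo[0:3]

# Label slice: end is inclusive, so the 26th is part of the result
by_label = df_demo['2018-10-24':'2018-10-26']

print(len(first_three), len(by_label))
```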
# Selection by Label
>>>df.loc[:,['A','B']]
A B
2018-10-24 -0.448578 -0.430941
2018-10-25 1.472781 -1.880300
2018-10-26 -1.182278 0.110466
2018-10-27 0.792962 1.972461
2018-10-28 1.869130 -1.191335
2018-10-29 0.718443 0.994759
>>>df.loc['2018-10-24':'2018-10-25',['A','B']]
A B
2018-10-24 -0.448578 -0.430941
2018-10-25 1.472781 -1.880300
>>>df.loc['2018-10-24',['A','B']]
A -0.448578
B -0.430941
Name: 2018-10-24 00:00:00, dtype: float64
# Getting a scalar value:
>>>df.loc[dates[0],'A']
-0.4485779663058197
# Fast access to a scalar (same result as the previous call):
>>>df.at[dates[0],'A']
-0.4485779663058197
# Selection by Position
>>>df.iloc[3]
A 0.792962
B 1.972461
C -0.009865
D 0.584242
Name: 2018-10-27 00:00:00, dtype: float64
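iloc takes more than a single integer. A short sketch of the other positional forms (slices, lists of positions, and `iat` for one cell), again on a hypothetical `df_demo` built from np.arange so the expected values are obvious:

```python
import numpy as np
import pandas as pd

# Row i holds the values 4*i, 4*i+1, 4*i+2, 4*i+3
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

block = df_demo.iloc[3:5, 0:2]             # rows 3-4, columns A-B (ends exclusive)
picked = df_demo.iloc[[1, 2, 4], [0, 2]]   # explicit lists of positions
scalar = df_demo.iat[1, 1]                 # single cell by position

print(block)
print(picked)
print(scalar)
```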
# Boolean Indexing
>>>df[df.A>0]
A B C D
2018-10-25 1.472781 -1.880300 1.626077 -0.397423
2018-10-27 0.792962 1.972461 -0.009865 0.584242
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850
2018-10-29 0.718443 0.994759 -1.202812 0.096397
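Conditions can also be combined. A sketch, assuming a made-up `df_demo`: use `&` and `|` (not `and` / `or`), and wrap each comparison in parentheses because the operators bind tighter than `>` and `<`.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [-1.0, 2.0, 3.0, -4.0],
                        'B': [5.0, -6.0, 7.0, 8.0]})

# Rows where both columns are positive
both_positive = df_demo[(df_demo['A'] > 0) & (df_demo['B'] > 0)]

# Rows where at least one column is negative
either_negative = df_demo[(df_demo['A'] < 0) | (df_demo['B'] < 0)]

print(both_positive)
print(either_negative)
```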
# isin() method for filtering:
>>>df2 = df.copy()
>>>df2['E'] = ['one','two','three','four','five','six']
>>>df2
A B C D E
2018-10-24 -0.448578 -0.430941 0.116371 0.029906 one
2018-10-25 1.472781 -1.880300 1.626077 -0.397423 two
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648 three
2018-10-27 0.792962 1.972461 -0.009865 0.584242 four
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850 five
2018-10-29 0.718443 0.994759 -1.202812 0.096397 six
>>>df2[df2['E'].isin(['one','six'])]
A B C D E
2018-10-24 -0.448578 -0.430941 0.116371 0.029906 one
2018-10-29 0.718443 0.994759 -1.202812 0.096397 six
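The mask returned by isin() is an ordinary boolean Series, so it can be negated with `~` to keep everything that is not in the list. A small sketch with a hypothetical one-column `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'E': ['one', 'two', 'three', 'four']})

mask = df_demo['E'].isin(['one', 'four'])   # True where E is in the list
kept = df_demo[mask]                        # rows in the list
dropped = df_demo[~mask]                    # rows NOT in the list

print(kept)
print(dropped)
```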
# Setting
>>>s1 = pd.Series([1,2,3,4,5,6],index = pd.date_range('20181024',periods=6))
>>>s1
2018-10-24 1
2018-10-25 2
2018-10-26 3
2018-10-27 4
2018-10-28 5
2018-10-29 6
Freq: D, dtype: int64
>>>df['G'] = s1
>>>df
A B C D G
2018-10-24 -0.448578 -0.430941 0.116371 0.029906 1
2018-10-25 1.472781 -1.880300 1.626077 -0.397423 2
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648 3
2018-10-27 0.792962 1.972461 -0.009865 0.584242 4
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850 5
2018-10-29 0.718443 0.994759 -1.202812 0.096397 6
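The assignment above works because s1 has the same dates as df. As a sketch of what happens when the indexes only partly overlap (`df_demo` and `s_shifted` are made-up names): the new column is aligned by index label, dates missing from the Series become NaN, and extra dates in the Series are dropped.

```python
import pandas as pd

dates_demo = pd.date_range('20181024', periods=3)
df_demo = pd.DataFrame({'A': [0.1, 0.2, 0.3]}, index=dates_demo)

# This Series starts one day later, so only two dates overlap with df_demo
s_shifted = pd.Series([10, 20, 30],
                      index=pd.date_range('20181025', periods=3))

# Alignment is by label, not by position
df_demo['G'] = s_shifted
print(df_demo)
```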
# Setting values by label:
>>>df.at[dates[0],'A'] = 0
>>>df
A B C D G
2018-10-24 0.000000 -0.430941 0.116371 0.029906 1
2018-10-25 1.472781 -1.880300 1.626077 -0.397423 2
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648 3
2018-10-27 0.792962 1.972461 -0.009865 0.584242 4
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850 5
2018-10-29 0.718443 0.994759 -1.202812 0.096397 6
# Setting values by position:
>>>df.iat[0,1] = 0
>>>df
A B C D G
2018-10-24 0.000000 0.000000 0.116371 0.029906 1
2018-10-25 1.472781 -1.880300 1.626077 -0.397423 2
2018-10-26 -1.182278 0.110466 -0.365284 -0.758648 3
2018-10-27 0.792962 1.972461 -0.009865 0.584242 4
2018-10-28 1.869130 -1.191335 -0.903691 -1.425850 5
2018-10-29 0.718443 0.994759 -1.202812 0.096397 6
# Setting by assigning with a NumPy array:
>>>df.loc[:,'D'] = np.array([5]*len(df))
>>>df
A B C D G
2018-10-24 0.000000 0.000000 0.116371 5 1
2018-10-25 1.472781 -1.880300 1.626077 5 2
2018-10-26 -1.182278 0.110466 -0.365284 5 3
2018-10-27 0.792962 1.972461 -0.009865 5 4
2018-10-28 1.869130 -1.191335 -0.903691 5 5
2018-10-29 0.718443 0.994759 -1.202812 5 6
# Note: [5]*n repeats the list n times before it is converted to an array:
>>>np.array([5])
array([5])
>>>np.array([5]*8)
array([5, 5, 5, 5, 5, 5, 5, 5])
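A note to self on where the `*` goes, as a quick sketch: multiplying the list repeats it, multiplying the array scales its values. `np.full` is an alternative I could use to build the constant array without the intermediate list.

```python
import numpy as np

repeated = np.array([5] * 8)   # list repetition first, then conversion
scaled = np.array([5]) * 8     # element-wise multiplication: one element, 40
filled = np.full(8, 5)         # constant array without an intermediate list

print(repeated, scaled, filled)
```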
5. Missing Data
The value np.nan represents missing data.
>>>df.loc[dates[0],'G'] = np.nan # .loc is label-based, so address the first row by its date
>>>df
A B C D G
2018-10-24 0.000000 0.000000 0.116371 5 NaN
2018-10-25 1.472781 -1.880300 1.626077 5 2.0
2018-10-26 -1.182278 0.110466 -0.365284 5 3.0
2018-10-27 0.792962 1.972461 -0.009865 5 4.0
2018-10-28 1.869130 -1.191335 -0.903691 5 5.0
2018-10-29 0.718443 0.994759 -1.202812 5 6.0
# Drop any rows that have missing data:
>>>df.dropna(how='any')
A B C D G
2018-10-25 1.472781 -1.880300 1.626077 5 2.0
2018-10-26 -1.182278 0.110466 -0.365284 5 3.0
2018-10-27 0.792962 1.972461 -0.009865 5 4.0
2018-10-28 1.869130 -1.191335 -0.903691 5 5.0
2018-10-29 0.718443 0.994759 -1.202812 5 6.0
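Besides dropping rows, missing values can be located or filled. A minimal sketch on a hypothetical `df_demo` (the fill value 5 is arbitrary); note that dropna and fillna both return new objects and leave the original frame unchanged.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                        'G': [np.nan, 2.0, 3.0]})

missing_mask = df_demo.isna()         # True where a value is NaN
filled = df_demo.fillna(value=5)      # replace NaN with a chosen value
dropped = df_demo.dropna(how='any')   # drop rows that contain any NaN

print(missing_mask)
print(filled)
print(dropped)
```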