Source: the official pandas documentation, 10 minutes to pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
1 Object creation
1.1 Series
- A Series is a one-dimensional labeled array: it holds a single column of values together with an index. Create one by passing a list of values:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print (s)
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
1.2 DataFrame
- A DataFrame is a two-dimensional table. Create one by passing a NumPy array, with dates as the row index and 'ABCD' as the column labels:
dates = pd.date_range('20191214', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print (df)
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
- A DataFrame can also be created by passing a dict; each value is converted into a Series:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print (df2)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
# Each column of the DataFrame can have its own dtype:
print (df2.dtypes)
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
- So a DataFrame can be thought of as a collection of Series, one per column.
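- To see this concretely (a small check, not part of the original tutorial), selecting a single column of df2 returns a Series that keeps its own dtype:
print (type(df2['E']))
print (df2['E'].dtype)
<class 'pandas.core.series.Series'>
category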
2 Viewing data
Use head and tail to view the first and last few rows, index and columns to view the row and column labels, and values to show the underlying data without the indexes:
print (df.head(), '\n')
print (df.tail(3), '\n')
print (df.index, '\n')
print (df.columns, '\n')
print (df.values, '\n')
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
A B C D
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
DatetimeIndex(['2019-12-14', '2019-12-15', '2019-12-16', '2019-12-17',
'2019-12-18', '2019-12-19'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.61656893 0.05140077 1.12140918 0.30726721]
[ 0.48399521 -1.25127827 -1.77202211 1.2826497 ]
[ 0.36688539 -1.61791122 0.43830175 1.12413452]
[ 0.71737246 0.23322127 1.83916745 -1.53579412]
[ 1.75385779 0.26611012 -0.33393125 0.281345 ]
[ 1.04860622 0.98477115 -0.27910636 0.59228493]]
- A NumPy array has a single dtype for the entire array, while a DataFrame has one dtype per column. So when df2 is converted to a NumPy array, every column is cast to the common dtype object:
print (df.to_numpy())
df2.to_numpy()
[[ 0.61656893 0.05140077 1.12140918 0.30726721]
[ 0.48399521 -1.25127827 -1.77202211 1.2826497 ]
[ 0.36688539 -1.61791122 0.43830175 1.12413452]
[ 0.71737246 0.23322127 1.83916745 -1.53579412]
[ 1.75385779 0.26611012 -0.33393125 0.281345 ]
[ 1.04860622 0.98477115 -0.27910636 0.59228493]]
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
- describe shows a quick statistical summary of each column:
print(df.describe())
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.831214 -0.222281 0.168970 0.341981
std 0.508670 0.998449 1.264150 1.008979
min 0.366885 -1.617911 -1.772022 -1.535794
25% 0.517139 -0.925609 -0.320225 0.287826
50% 0.666971 0.142311 0.079598 0.449776
75% 0.965798 0.257888 0.950632 0.991172
max 1.753858 0.984771 1.839167 1.282650
- T transposes the data:
print (df.T)
2019-12-14 2019-12-15 2019-12-16 2019-12-17 2019-12-18 2019-12-19
A 0.616569 0.483995 0.366885 0.717372 1.753858 1.048606
B 0.051401 -1.251278 -1.617911 0.233221 0.266110 0.984771
C 1.121409 -1.772022 0.438302 1.839167 -0.333931 -0.279106
D 0.307267 1.282650 1.124135 -1.535794 0.281345 0.592285
- sort_index sorts by an axis; axis=1 means the second axis, i.e. sort by the column labels:
print (df.sort_index(axis=1, ascending=False))
D C B A
2019-12-14 0.307267 1.121409 0.051401 0.616569
2019-12-15 1.282650 -1.772022 -1.251278 0.483995
2019-12-16 1.124135 0.438302 -1.617911 0.366885
2019-12-17 -1.535794 1.839167 0.233221 0.717372
2019-12-18 0.281345 -0.333931 0.266110 1.753858
2019-12-19 0.592285 -0.279106 0.984771 1.048606
- sort_values sorts by values; by='B' means sort by column B:
print (df.sort_values(by='B'))
A B C D
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
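- As a side note (an extra illustration, not part of the original walkthrough), by can also take a list of columns, and ascending=False reverses the order:
print (df.sort_values(by=['B', 'C'], ascending=False)) # sort by B descending, break ties with C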
3 Selection
3.1 Getting rows and columns
- Similar to NumPy, a DataFrame can be indexed with []:
print (df['A'], '\n')
print (df[0:3], '\n') # slice rows by position
print (df['20191216':'20191219'], '\n')
2019-12-14 0.616569
2019-12-15 0.483995
2019-12-16 0.366885
2019-12-17 0.717372
2019-12-18 1.753858
2019-12-19 1.048606
Freq: D, Name: A, dtype: float64
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
A B C D
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
- Selection by label
print (df.loc[dates[0]], '\n')
print (df.loc[:, ['A', 'B']], '\n')
print (df.loc['20191216':'20191218', ['A', 'B']], '\n')
print (df.at[dates[0], 'A'], '\n') # use at for faster access to a single scalar value
A 0.616569
B 0.051401
C 1.121409
D 0.307267
Name: 2019-12-14 00:00:00, dtype: float64
A B
2019-12-14 0.616569 0.051401
2019-12-15 0.483995 -1.251278
2019-12-16 0.366885 -1.617911
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
2019-12-19 1.048606 0.984771
A B
2019-12-16 0.366885 -1.617911
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
0.6165689271991402
- Selection by integer position
print (df.iloc[3], '\n')
print (df.iloc[3:5, 0:2], '\n')
print (df.iloc[[1,2,4], [0,2]], '\n')
print (df.iloc[1:3, :], '\n')
print (df.iat[1,1], '\n')
A 0.717372
B 0.233221
C 1.839167
D -1.535794
Name: 2019-12-17 00:00:00, dtype: float64
A B
2019-12-17 0.717372 0.233221
2019-12-18 1.753858 0.266110
A C
2019-12-15 0.483995 -1.772022
2019-12-16 0.366885 0.438302
2019-12-18 1.753858 -0.333931
A B C D
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
-1.2512782747247186
- Selection by boolean indexing
print (df[df.A > 0], '\n') # df.A is shorthand attribute access for df['A']
print (df[df > 0], '\n')
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print (df2, '\n')
print (df2[df2['E'].isin(['two', 'four'])]) # filter with isin
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 -1.251278 -1.772022 1.282650
2019-12-16 0.366885 -1.617911 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 -1.535794
2019-12-18 1.753858 0.266110 -0.333931 0.281345
2019-12-19 1.048606 0.984771 -0.279106 0.592285
A B C D
2019-12-14 0.616569 0.051401 1.121409 0.307267
2019-12-15 0.483995 NaN NaN 1.282650
2019-12-16 0.366885 NaN 0.438302 1.124135
2019-12-17 0.717372 0.233221 1.839167 NaN
2019-12-18 1.753858 0.266110 NaN 0.281345
2019-12-19 1.048606 0.984771 NaN 0.592285
A B C D E
2019-12-14 0.616569 0.051401 1.121409 0.307267 one
2019-12-15 0.483995 -1.251278 -1.772022 1.282650 one
2019-12-16 0.366885 -1.617911 0.438302 1.124135 two
2019-12-17 0.717372 0.233221 1.839167 -1.535794 three
2019-12-18 1.753858 0.266110 -0.333931 0.281345 four
2019-12-19 1.048606 0.984771 -0.279106 0.592285 three
A B C D E
2019-12-16 0.366885 -1.617911 0.438302 1.124135 two
2019-12-18 1.753858 0.266110 -0.333931 0.281345 four
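- A boolean mask can also be inverted with ~, for example to keep the rows whose 'E' value is not in the list (an extra illustration, not in the original):
print (df2[~df2['E'].isin(['two', 'four'])]) # the complement of the filter above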
3.2 Setting rows and columns
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191215', periods=6))
print (s1, '\n')
df['F'] = s1
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
print (df)
df2 = df.copy()
df2[df2 > 0] = -df2
print (df2)
2019-12-15 1
2019-12-16 2
2019-12-17 3
2019-12-18 4
2019-12-19 5
2019-12-20 6
Freq: D, dtype: int64
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0
2019-12-18 1.753858 0.266110 -0.333931 5 4.0
2019-12-19 1.048606 0.984771 -0.279106 5 5.0
A B C D F
2019-12-14 0.000000 0.000000 -1.121409 -5 NaN
2019-12-15 -0.483995 -1.251278 -1.772022 -5 -1.0
2019-12-16 -0.366885 -1.617911 -0.438302 -5 -2.0
2019-12-17 -0.717372 -0.233221 -1.839167 -5 -3.0
2019-12-18 -1.753858 -0.266110 -0.333931 -5 -4.0
2019-12-19 -1.048606 -0.984771 -0.279106 -5 -5.0
4 Missing data
- pandas uses np.nan to represent missing data; it is excluded from computations by default.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print (df1, '\n')
print (df1.dropna(how='any'), '\n') # drop any rows that contain missing data
print (df1.fillna(value=5), '\n') # fill all missing values with 5
print (pd.isna(df1)) # boolean mask of where values are missing
A B C D F E
2019-12-14 0.000000 0.000000 1.121409 5 NaN 1.0
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0 NaN
2019-12-17 0.717372 0.233221 1.839167 5 3.0 NaN
A B C D F E
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
A B C D F E
2019-12-14 0.000000 0.000000 1.121409 5 5.0 1.0
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0 5.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0 5.0
A B C D F E
2019-12-14 False False False False True False
2019-12-15 False False False False False False
2019-12-16 False False False False False True
2019-12-17 False False False False False True
5 Operations
5.1 Statistics
- In general, operations exclude missing data.
print (df.mean(), '\n')
print (df.mean(1), '\n') # mean along axis 1, i.e. the mean of each row across the columns
print (pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2), '\n') # shift the values down by two positions
A 0.728453
B -0.230848
C 0.168970
D 5.000000
F 3.000000
dtype: float64
2019-12-14 1.530352
2019-12-15 0.692139
2019-12-16 1.237455
2019-12-17 2.157952
2019-12-18 2.137207
2019-12-19 2.350854
Freq: D, dtype: float64
2019-12-14 NaN
2019-12-15 NaN
2019-12-16 1.0
2019-12-17 3.0
2019-12-18 5.0
2019-12-19 NaN
Freq: D, dtype: float64
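- Operations between objects with different indexes align on the index automatically. As a sketch in the spirit of the official tutorial (not shown in the original text), subtracting the shifted Series from df broadcasts it along the rows and leaves NaN where the Series has no value:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print (df.sub(s, axis='index')) # subtract s from every column, aligned on the row index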
5.2 Apply
- Apply a function to the data:
print (df, '\n')
print (df.apply(np.cumsum), '\n') # cumsum accumulates each column down the rows
print (df.apply(lambda x: x.max()-x.min())) # max minus min of each column
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -1.772022 5 1.0
2019-12-16 0.366885 -1.617911 0.438302 5 2.0
2019-12-17 0.717372 0.233221 1.839167 5 3.0
2019-12-18 1.753858 0.266110 -0.333931 5 4.0
2019-12-19 1.048606 0.984771 -0.279106 5 5.0
A B C D F
2019-12-14 0.000000 0.000000 1.121409 5 NaN
2019-12-15 0.483995 -1.251278 -0.650613 10 1.0
2019-12-16 0.850881 -2.869189 -0.212311 15 3.0
2019-12-17 1.568253 -2.635968 1.626856 20 6.0
2019-12-18 3.322111 -2.369858 1.292925 25 10.0
2019-12-19 4.370717 -1.385087 1.013819 30 15.0
A 1.753858
B 2.602682
C 3.611190
D 0.000000
F 4.000000
dtype: float64
5.3 Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
print (s, '\n')
print (s.value_counts())
0 4
1 3
2 3
3 6
4 4
5 5
6 2
7 2
8 3
9 0
dtype: int32
3 3
4 2
2 2
6 1
5 1
0 1
dtype: int64
5.4 String methods
- Series comes with a set of vectorized string methods in the str attribute:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print (s.str.lower(), '\n')
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
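- Other vectorized string methods follow the same pattern; for example (extra illustrations, not in the original), contains returns a boolean mask and len gives the length of each string, with NaN propagated:
print (s.str.contains('a'), '\n') # True where the string contains a lowercase 'a', NaN for missing values
print (s.str.len()) # length of each string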
6 Merge
6.1 Concat
- Simply concatenate pieces back together row-wise:
df = pd.DataFrame(np.random.randn(10, 4))
print (df, '\n')
#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
print (pd.concat(pieces))
0 1 2 3
0 1.434680 -0.776149 -1.218836 2.397118
1 -1.632868 -0.806080 0.634619 -0.760958
2 1.028005 -0.020240 0.696301 0.090692
3 -1.516807 0.952234 0.758215 -0.272409
4 0.540413 0.258468 -0.563238 -1.380825
5 0.021616 0.097505 -1.052843 0.272892
6 -0.957512 0.088345 -0.972309 -1.224522
7 0.211805 1.674883 -0.919839 -0.159892
8 1.651509 0.908129 -0.416815 -0.629014
9 -0.616352 0.160981 1.059860 -0.521867
0 1 2 3
0 1.434680 -0.776149 -1.218836 2.397118
1 -1.632868 -0.806080 0.634619 -0.760958
2 1.028005 -0.020240 0.696301 0.090692
3 -1.516807 0.952234 0.758215 -0.272409
4 0.540413 0.258468 -0.563238 -1.380825
5 0.021616 0.097505 -1.052843 0.272892
6 -0.957512 0.088345 -0.972309 -1.224522
7 0.211805 1.674883 -0.919839 -0.159892
8 1.651509 0.908129 -0.416815 -0.629014
9 -0.616352 0.160981 1.059860 -0.521867
6.2 Join
- SQL-style merging with merge:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print (left, '\n')
print (right, '\n')
print (pd.merge(left, right, on='key'))
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
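- With unique keys, merge behaves like a one-to-one SQL join (a second example in the same spirit, not in the original text):
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
print (pd.merge(left, right, on='key'))
   key  lval  rval
0  foo     1     4
1  bar     2     5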
6.3 Append
- Append a row to a DataFrame:
df = pd.DataFrame(np.random.randn(8,4), columns=['A', 'B', 'C', 'D'])
print (df, '\n')
s = df.iloc[3]
print (df.append(s, ignore_index=True))
A B C D
0 0.481732 1.270100 0.518214 0.028305
1 0.004151 0.602318 1.274942 -2.683102
2 0.130687 -0.402626 0.701901 -1.182579
3 -1.047072 0.656017 0.468616 -0.694842
4 -0.097726 1.669621 0.901135 1.337648
5 0.493724 -0.865273 -1.390465 1.143148
6 -0.691091 0.460376 0.540177 -0.205331
7 -0.634427 2.625610 0.067031 -2.041728
A B C D
0 0.481732 1.270100 0.518214 0.028305
1 0.004151 0.602318 1.274942 -2.683102
2 0.130687 -0.402626 0.701901 -1.182579
3 -1.047072 0.656017 0.468616 -0.694842
4 -0.097726 1.669621 0.901135 1.337648
5 0.493724 -0.865273 -1.390465 1.143148
6 -0.691091 0.460376 0.540177 -0.205331
7 -0.634427 2.625610 0.067031 -2.041728
8 -1.047072 0.656017 0.468616 -0.694842
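- Note: DataFrame.append is deprecated in recent pandas releases (and removed in pandas 2.0). The same result can be produced with pd.concat, for example by turning the row back into a one-row DataFrame first (a sketch, not part of the original):
print (pd.concat([df, s.to_frame().T], ignore_index=True)) # equivalent to the append above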
7 Grouping
Similar to group by in SQL, grouping involves the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print (df, '\n')
print (df.groupby('A').sum(), '\n')
print (df.groupby(['A', 'B']).sum())
A B C D
0 foo one 0.182951 -0.346016
1 bar one 1.906573 -0.386902
2 foo two 0.102443 -1.087078
3 bar three 1.023196 0.467237
4 foo two -0.940578 -0.414947
5 bar two -1.075379 -0.844441
6 foo one 0.514689 0.038393
7 foo three -2.136912 -1.001738
C D
A
bar 1.854390 -0.764105
foo -2.277406 -2.811385
C D
A B
bar one 1.906573 -0.386902
three 1.023196 0.467237
two -1.075379 -0.844441
foo one 0.697640 -0.307622
three -2.136912 -1.001738
two -0.838134 -1.502025
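- groupby can also apply a different aggregation to each column via agg (an extra illustration, not part of the original walkthrough):
print (df.groupby('A').agg({'C': 'mean', 'D': 'sum'})) # mean of C and sum of D per group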
8 Reshaping
8.1 Stack
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
print (tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print (df2, '\n')
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
A B
first second
bar one 0.220077 1.659987
two -1.504953 0.350697
baz one 1.631244 0.637820
two -0.083256 0.370784
stacked = df2.stack()
print (stacked, '\n')
print(stacked.unstack(), '\n')
print(stacked.unstack(0), '\n')
print(stacked.unstack(1), '\n')
first second
bar one A 0.220077
B 1.659987
two A -1.504953
B 0.350697
baz one A 1.631244
B 0.637820
two A -0.083256
B 0.370784
dtype: float64
A B
first second
bar one 0.220077 1.659987
two -1.504953 0.350697
baz one 1.631244 0.637820
two -0.083256 0.370784
first bar baz
second
one A 0.220077 1.631244
B 1.659987 0.637820
two A -1.504953 -0.083256
B 0.350697 0.370784
second one two
first
bar A 0.220077 -1.504953
B 1.659987 0.350697
baz A 1.631244 -0.083256
B 0.637820 0.370784
8.2 Pivot tables
- A pivot table takes the values corresponding to the specified keys and uses them as the rows (index) and columns of a new table:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
print (df, '\n')
print (pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']), '\n')
A B C D E
0 one A foo -0.647116 -0.383797
1 one B foo -1.009740 -0.149929
2 two C foo -0.000333 0.233550
3 three A bar 0.308752 -0.204858
4 one B bar 0.157966 -0.066707
5 one C bar 1.028976 -1.697499
6 two A foo -0.170791 -0.161074
7 three B foo 1.222346 -0.465335
8 one C foo -0.024742 -1.984697
9 one A bar -1.253050 0.675345
10 two B bar -0.124737 -0.169243
11 three C bar 0.501580 1.049995
C bar foo
A B
one A -1.253050 -0.647116
B 0.157966 -1.009740
C 1.028976 -0.024742
three A 0.308752 NaN
B NaN 1.222346
C 0.501580 NaN
two A NaN -0.170791
B -0.124737 NaN
C NaN -0.000333
9 Time series
rng = pd.date_range('1/1/2020', periods=120, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print (ts, '\n')
print (ts.resample('1Min').sum()) # resample into 1-minute bins and sum the values in each bin
2020-01-01 00:00:00 349
2020-01-01 00:00:01 141
2020-01-01 00:00:02 467
2020-01-01 00:00:03 152
2020-01-01 00:00:04 491
...
2020-01-01 00:01:55 29
2020-01-01 00:01:56 383
2020-01-01 00:01:57 310
2020-01-01 00:01:58 204
2020-01-01 00:01:59 115
Freq: S, Length: 120, dtype: int32
2020-01-01 00:00:00 15490
2020-01-01 00:01:00 13744
Freq: T, dtype: int32
# Time zone representation
rng = pd.date_range('3/6/2020 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
print (ts, '\n')
ts_utc = ts.tz_localize('UTC')
print (ts_utc, '\n')
2020-03-06 -0.264762
2020-03-07 1.358651
2020-03-08 -0.316810
2020-03-09 -0.134225
2020-03-10 -0.960168
Freq: D, dtype: float64
2020-03-06 00:00:00+00:00 -0.264762
2020-03-07 00:00:00+00:00 1.358651
2020-03-08 00:00:00+00:00 -0.316810
2020-03-09 00:00:00+00:00 -0.134225
2020-03-10 00:00:00+00:00 -0.960168
Freq: D, dtype: float64
# Convert to another time zone
print (ts_utc.tz_convert('US/Eastern'), '\n')
2020-03-05 19:00:00-05:00 -0.264762
2020-03-06 19:00:00-05:00 1.358651
2020-03-07 19:00:00-05:00 -0.316810
2020-03-08 20:00:00-04:00 -0.134225
2020-03-09 20:00:00-04:00 -0.960168
Freq: D, dtype: float64
# Converting between time span representations
rng = pd.date_range('1/1/2020', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print (ts, '\n')
ps = ts.to_period()
print (ps, '\n')
print (ps.to_timestamp(),'\n')
2020-01-31 -0.325968
2020-02-29 -0.989123
2020-03-31 -0.323522
2020-04-30 0.491792
2020-05-31 -0.161315
Freq: M, dtype: float64
2020-01 -0.325968
2020-02 -0.989123
2020-03 -0.323522
2020-04 0.491792
2020-05 -0.161315
Freq: M, dtype: float64
2020-01-01 -0.325968
2020-02-01 -0.989123
2020-03-01 -0.323522
2020-04-01 0.491792
2020-05-01 -0.161315
Freq: MS, dtype: float64
# Converting between period and timestamp enables some convenient arithmetic functions; here the quarterly periods are converted to 9am on the first day of the month following each quarter end:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
print (ts[:6], '\n')
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
print (ts.head(), '\n')
1990Q1 -0.694847
1990Q2 1.123105
1990Q3 0.395293
1990Q4 -0.517952
1991Q1 0.689373
1991Q2 0.664599
Freq: Q-NOV, dtype: float64
1990-03-01 09:00 -0.694847
1990-06-01 09:00 1.123105
1990-09-01 09:00 0.395293
1990-12-01 09:00 -0.517952
1991-03-01 09:00 0.689373
Freq: H, dtype: float64
10 Categoricals
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
# Convert raw_grade to a categorical dtype
df["grade"] = df["raw_grade"].astype("category")
print (df["grade"], '\n')
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
df["grade"].cat.categories = ["very good", "good", "very bad"]
print (df["grade"], '\n')
# Reorder the categories and add the missing ones at the same time; existing values are matched against the new category list
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
print (df["grade"], '\n')
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
# Sorting is per the order defined in the categories, not lexical order.
print (df.sort_values(by="grade"), '\n')
print (df.groupby("grade").size())
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
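- Note: assigning to Series.cat.categories directly, as done above to rename the grades, is deprecated in newer pandas versions; rename_categories is the supported equivalent. A sketch of that renaming step, rebuilding the column from raw_grade:
df["grade"] = df["raw_grade"].astype("category")
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"]) # rename categories a/b/e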
11 Plotting
ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
ts = ts.cumsum()
ts.plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best') # 'best' automatically places the legend at the best location
12 Reading and writing data
12.1 CSV
# Writing
df.to_csv('foo.csv')
# Reading
print (pd.read_csv('foo.csv'), '\n')
Unnamed: 0 A B C D
0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462
1 2020-01-02 2.434732 0.361986 1.523058 -0.428913
2 2020-01-03 4.390606 0.680106 0.511064 -0.722390
3 2020-01-04 4.721387 0.118680 0.446910 -0.788997
4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760
.. ... ... ... ... ...
995 2022-09-22 9.467967 9.121625 57.829448 31.631881
996 2022-09-23 8.552319 10.117774 56.513708 30.713588
997 2022-09-24 10.064727 9.115677 56.434259 32.638695
998 2022-09-25 9.974457 7.912468 56.121923 32.281864
999 2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 5 columns]
12.2 HDF5
# Writing
df.to_hdf('foo.h5', 'df')
# Reading
print (pd.read_hdf('foo.h5', 'df'), '\n')
A B C D
2020-01-01 0.674980 0.021207 -0.984289 -0.261462
2020-01-02 2.434732 0.361986 1.523058 -0.428913
2020-01-03 4.390606 0.680106 0.511064 -0.722390
2020-01-04 4.721387 0.118680 0.446910 -0.788997
2020-01-05 5.966886 -1.940727 1.120733 -0.169760
... ... ... ... ...
2022-09-22 9.467967 9.121625 57.829448 31.631881
2022-09-23 8.552319 10.117774 56.513708 30.713588
2022-09-24 10.064727 9.115677 56.434259 32.638695
2022-09-25 9.974457 7.912468 56.121923 32.281864
2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 4 columns]
12.3 Excel
# Writing
df.to_excel('foo.xlsx', sheet_name='Sheet1')
# Reading
print (pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']), '\n')
Unnamed: 0 A B C D
0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462
1 2020-01-02 2.434732 0.361986 1.523058 -0.428913
2 2020-01-03 4.390606 0.680106 0.511064 -0.722390
3 2020-01-04 4.721387 0.118680 0.446910 -0.788997
4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760
.. ... ... ... ... ...
995 2022-09-22 9.467967 9.121625 57.829448 31.631881
996 2022-09-23 8.552319 10.117774 56.513708 30.713588
997 2022-09-24 10.064727 9.115677 56.434259 32.638695
998 2022-09-25 9.974457 7.912468 56.121923 32.281864
999 2022-09-26 8.935891 8.412897 58.160465 32.012264
[1000 rows x 5 columns]
13 Gotchas
- A Series cannot be used directly as a boolean value; trying to do so raises an error:
if pd.Series([False, True, False]):
print("I was true")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-210-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
2 print("I was true")
D:\Applications\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
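- The fix is to state explicitly what you mean, using empty, any(), all(), or item(), as the error message suggests:
if pd.Series([False, True, False]).any():
    print("I was true") # runs because at least one element is True
I was true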