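pandas is one of the most widely used Python libraries for data processing and data cleaning. This article is a short introduction to pandas, aimed mainly at new users.
By convention, the imports look like this: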
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Object Creation
Create a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
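As a small sketch of why the Series above is reported as float64: NaN is a floating-point value, so a list of integers that contains np.nan is upcast to floats. The variable names below (s_int, s_nan, s_labeled) are illustrative only and do not appear in the rest of this article.

```python
import numpy as np
import pandas as pd

# Without missing values, pandas infers an integer dtype.
s_int = pd.Series([1, 3, 5])

# np.nan is a float, so its presence upcasts the whole Series to float64.
s_nan = pd.Series([1, 3, 5, np.nan])

# A custom index can be supplied instead of the default 0..n-1 integers.
s_labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(s_int.dtype, s_nan.dtype)
print(s_labeled["b"])
```

The labeled variant shows that the default integer index is only a fallback; any list of labels of matching length can be used.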
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
print(dates)
Output:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
Output:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
Create a DataFrame by passing a dict of objects that can each be converted to a Series-like structure:
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
print(df2)
Output:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes:
print(df2.dtypes)
Output:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
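The dict-based construction can be sketched on a smaller frame to show two things: scalar values are repeated for every row, and columns can afterwards be filtered by dtype. The name frame and the choice of select_dtypes are assumptions made for illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": 1.0,  # scalar, repeated for every row
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

# The scalar in column A is broadcast to the length implied by the other columns.
print(frame.shape)

# select_dtypes filters columns by dtype; "number" covers integer and float columns.
numeric = frame.select_dtypes(include="number")
print(list(numeric.columns))
```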
Viewing Data
View the first and last rows of the DataFrame (head() returns five rows by default):
print(df.head())
Output:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
print(df.tail(3))
Output:
A B C D
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
Display the index, the columns, and the underlying NumPy data:
print(df.index)
print(df.columns)
print(df.values)
Output:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
[ 1.2121, -0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949, 1.0718],
[ 0.7216, -0.7068, -1.0396, 0.2719],
[-0.425 , 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784, 0.525 ]])
describe() shows a quick statistical summary of your data:
print(df.describe())
Output:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.073711 -0.431125 -0.687758 -0.233103
std 0.843157 0.922818 0.779887 0.973118
min -0.861849 -2.104569 -1.509059 -1.135632
25% -0.611510 -0.600794 -1.368714 -1.076610
50% 0.022070 -0.228039 -0.767252 -0.386188
75% 0.658444 0.041933 -0.034326 0.461706
max 1.212112 0.567020 0.276232 1.071804
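A hedged sketch of two details about describe(): the percentiles it reports can be chosen with the percentiles argument, and its rows agree with the corresponding direct reductions. The frame demo and the fixed seed are assumptions introduced here so that the example is repeatable; they are not the random data shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the numbers are repeatable
demo = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = demo.describe(percentiles=[0.1, 0.9])
print(summary.index.tolist())

# Each row of the summary matches the equivalent direct reduction.
mean_matches = np.isclose(summary.loc["mean", "A"], demo["A"].mean())
print(mean_matches)
```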
Transpose the data:
print(df.T)
Output:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690
B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648
C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427
D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988
Sort by an axis (here the column labels, in descending order):
print(df.sort_index(axis=1, ascending=False))
Output:
D C B A
2013-01-01 -1.135632 -1.509059 -0.282863 0.469112
2013-01-02 -1.044236 0.119209 -0.173215 1.212112
2013-01-03 1.071804 -0.494929 -2.104569 -0.861849
2013-01-04 0.271860 -1.039575 -0.706771 0.721555
2013-01-05 -1.087401 0.276232 0.567020 -0.424972
2013-01-06 0.524988 -1.478427 0.113648 -0.673690
Sort by values (here the values in column B):
print(df.sort_values(by='B'))
Output:
A B C D
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
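Sorting by values also accepts several keys, each with its own direction. The following is a minimal sketch on made-up data; the frame scores and its column names are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2.0, 3.0, 5.0, 1.0],
})

# Sort by group ascending, then by score descending within each group.
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)

# sort_values returns a new object; the original row order is unchanged.
print(scores.index.tolist(), ordered.index.tolist())
```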
Getting
Select a single column, which yields a Series; this is equivalent to df.A:
print(df['A'])
Output:
2013-01-01 0.469112
2013-01-02 1.212112
2013-01-03 -0.861849
2013-01-04 0.721555
2013-01-05 -0.424972
2013-01-06 -0.673690
Freq: D, Name: A, dtype: float64
Selecting via [] with a slice selects rows, either by position or by index label:
print(df[0:3])
print(df['20130102':'20130104'])
Output:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
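The two slices above follow different endpoint rules, which is easy to miss. As a small sketch on a hypothetical frame demo: a positional slice excludes its end, while a label slice includes it.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=6)
demo = pd.DataFrame({"A": np.arange(6)}, index=idx)

by_position = demo[0:3]                  # rows 0, 1, 2 (end position excluded)
by_label = demo["20130102":"20130104"]   # Jan 2, 3 and 4 (end label included)

print(by_position.index[-1], by_label.index[-1])
```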
Selection by Label
Get a cross section using a label:
print(df.loc[dates[0]])
Output:
A 0.469112
B -0.282863
C -1.509059
D -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Select on multiple axes by label:
print(df.loc[:,['A','B']])
Output:
A B
2013-01-01 0.469112 -0.282863
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
2013-01-06 -0.673690 0.113648
With label slicing, both endpoints are included:
print(df.loc['20130102':'20130104',['A','B']])
Output:
A B
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
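Passing a single label reduces the dimensionality of the returned object (here, from a DataFrame to a Series):
print(df.loc['20130102',['A','B']])
Output: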
A 1.212112
B -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
Get a scalar value:
print(df.loc[dates[0],'A'])
Output:
0.46911229990718628
For fast access to a scalar (equivalent to the previous method):
print(df.at[dates[0],'A'])
Output:
0.46911229990718628
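To summarize label-based selection, here is a minimal sketch showing that loc and at return the same scalar, and how the shape of the key decides whether a Series or a DataFrame comes back. The frame demo and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=3)
demo = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=idx, columns=["A", "B"])

via_loc = demo.loc[idx[1], "B"]  # general label-based lookup
via_at = demo.at[idx[1], "B"]    # scalar-only accessor, same result
print(via_loc == via_at)

row = demo.loc[idx[1]]                   # single label -> Series
block = demo.loc[idx[0]:idx[1], ["A"]]   # label slice plus list -> DataFrame
print(type(row).__name__, type(block).__name__)
```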
Selection by Position
Select by position using an integer:
print(df.iloc[3])
Output:
A 0.721555
B -0.706771
C -1.039575
D 0.271860
Name: 2013-01-04 00:00:00, dtype: float64
Integer slices behave like they do in NumPy and Python (the end position is excluded):
print(df.iloc[3:5,0:2])
Output:
A B
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
Get a value explicitly by position:
print(df.iloc[1,1])
Output:
-0.17321464905330858
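Positional selection also accepts lists of integer positions, and iat is the positional counterpart of at for single values. A short sketch on a hypothetical frame demo filled with consecutive integers, so the picked cells are easy to verify by eye:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Pick rows 1, 2, 4 and columns 0, 2 by integer position.
picked = demo.iloc[[1, 2, 4], [0, 2]]
print(picked)

# Slice whole rows, and read one scalar by position.
whole_rows = demo.iloc[1:3, :]
value = demo.iat[1, 1]
print(whole_rows.shape, value)
```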
Boolean Indexing
Use the values of a single column to select data:
print(df[df.A > 0])
Output:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
Use the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
print(df2)
print(df2[df2['E'].isin(['two','four'])])
Output:
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
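Boolean masks can be combined, which the examples above do not show. The sketch below uses & for "and" and ~ for "not" on a made-up frame demo. The parentheses around the comparison are required because & binds more tightly than >.

```python
import pandas as pd

demo = pd.DataFrame({
    "A": [0.5, 1.2, -0.9, 0.7],
    "E": ["one", "two", "two", "three"],
})

# Rows where A is positive AND E is one of the listed values.
both = demo[(demo["A"] > 0) & demo["E"].isin(["two", "three"])]

# Rows where E is NOT "two".
not_two = demo[~demo["E"].isin(["two"])]

print(both.index.tolist(), not_two.index.tolist())
```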
Setting
Setting a new column automatically aligns the data by index:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
print(s1)
Output:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
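# Assign s1 as a new column F; values are aligned on the index, so 2013-01-01 becomes NaN
df['F'] = s1
# Set a value by label
df.at[dates[0],'A'] = 0
# Set a value by position
df.iat[0,1] = 0
# Set a column by assigning a NumPy array
df.loc[:,'D'] = np.array([5] * len(df))
# Print the resulting df
print(df)
Output: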
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 0.119209 5 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0
2013-01-05 -0.424972 0.567020 0.276232 5 4.0
2013-01-06 -0.673690 0.113648 -1.478427 5 5.0
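The NaN in the first row of column F comes from index alignment. A minimal sketch of that behavior, using a hypothetical three-row frame demo and a Series shifted whose index starts one day later:

```python
import pandas as pd

idx = pd.date_range("20130101", periods=3)
demo = pd.DataFrame({"A": [10.0, 20.0, 30.0]}, index=idx)

# This Series starts one day later than the frame's index.
shifted = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))

# Assignment aligns on index labels: 2013-01-01 has no match and becomes NaN,
# and the value for 2013-01-04 is dropped because the frame has no such row.
demo["F"] = shifted
print(demo)
```

Values are matched by label rather than by position, which is why the integer Series ends up as a float column containing NaN.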
Missing Data
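pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)
Output: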
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN
Use the dropna method to drop any rows that contain missing data:
print(df1.dropna(how='any'))
Output:
A B C D F E
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
Use the fillna method to fill in missing values:
print(df1.fillna(value=5))
Output:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 5.0 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 5.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 5.0
Use the isnull function to get a boolean mask that is True where values are NaN:
print(pd.isnull(df1))
Output:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
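The three missing-data tools above can be made more selective. The following sketch, on a hypothetical frame demo, counts missing values per column, drops rows based on one column only, and fills each column with a different value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "F": [np.nan, 1.0, 2.0, 3.0],
    "E": [1.0, 1.0, np.nan, np.nan],
})

# Count missing values per column.
missing_counts = demo.isna().sum()
print(missing_counts)

# Drop rows only when column F is missing, ignoring gaps in E.
only_f = demo.dropna(subset=["F"])

# Fill each column with its own replacement value.
filled = demo.fillna({"F": 0.0, "E": demo["E"].mean()})
print(only_f.shape, filled.isna().any().any())
```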
Operations
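Operations in general exclude missing data.
Perform a descriptive statistic (the mean) along axis=0, giving one value per column:
print(df.mean())
Output: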
A -0.004474
B -0.383981
C -0.687758
D 5.000000
F 3.000000
dtype: float64
Perform the same operation along the other axis (axis=1), giving one value per row:
print(df.mean(axis=1))
Output:
2013-01-01 0.872735
2013-01-02 1.431621
2013-01-03 0.707731
2013-01-04 1.395042
2013-01-05 1.883656
2013-01-06 1.592306
Freq: D, dtype: float64
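A small sketch of how the axis and missing values interact in these reductions. The frame demo is made up; skipna is a standard parameter of mean and defaults to True.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "F": [np.nan, 4.0, 6.0]})

col_means = demo.mean()           # axis=0: one value per column, NaN skipped
row_means = demo.mean(axis=1)     # axis=1: one value per row, NaN skipped
strict = demo.mean(skipna=False)  # NaN propagates instead of being skipped

print(col_means["F"], row_means.tolist(), strict["F"])
```

With skipna=False a single missing value makes the whole column mean NaN, which can be useful when silent gaps would be misleading.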
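Operations between objects that have different dimensionality need alignment. In addition, pandas automatically broadcasts along the specified dimension.
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
print(s)
Output: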
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
print(df.sub(s, axis='index'))
Output:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.861849 -3.104569 -1.494929 4.0 1.0
2013-01-04 -2.278445 -3.706771 -4.039575 2.0 0.0
2013-01-05 -5.424972 -4.432980 -4.723768 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
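The same subtraction can be sketched on smaller, made-up data so that every cell is easy to check by hand. With axis="index" the Series is matched against the row labels and its value is subtracted from every column of that row. The names demo and offsets are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=3)
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx)
offsets = pd.Series([np.nan, 1.0, 2.0], index=idx)

# Each row's offset is broadcast across columns A and B.
# A NaN offset makes the entire row NaN.
result = demo.sub(offsets, axis="index")
print(result)
```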
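String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them).
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print(s.str.lower())
Output: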
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
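To illustrate the remark about regular expressions, here is a hedged sketch using str.contains, which treats its pattern as a regex unless regex=False is passed. The Series words and the patterns are made up for this example.

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", "Baca", np.nan, "CABA", "dog"])

# "^a" is a regex anchored at the start; case=False ignores case,
# and na=False makes missing entries count as "no match".
starts_with_a = words.str.contains(r"^a", case=False, na=False)

# With regex=False the dot is a literal character, not "any character".
literal_dot = pd.Series(["a.b", "axb"]).str.contains(".", regex=False)

# Missing values stay missing in element-wise results such as len().
lengths = words.str.len()

print(starts_with_a.tolist(), literal_dot.tolist())
```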
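That is all for now. pandas offers many more operations, and I will add code for them when I find the time. I hope this helps anyone who needs to work with data. One last thing: if you found this article useful, please give it a like below.
And finally, a poem I once read that I would like to share:
A faint clap of thunder, clouded skies; if only the rain would come and keep you here. (from the Man'yōshū)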