Python Basics - pandas - Quick Start

Study notes and excerpts from Li Jin's Chinese Python notes [https://github.com/lijin-thu/notes-python].

10 Minutes to pandas

pandas is a Python Data Analysis Library.

For installation, see the tutorial on the official site; if you have installed Anaconda, there is no need to install pandas separately.

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating pandas objects

pandas has three basic data structures:

  • Series
    • 1D labeled homogeneously-typed array
  • DataFrame
    • General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
  • Panel
    • General 3D labeled, also size-mutable array (removed in recent pandas versions)

Series

A one-dimensional Series can be initialized from a one-dimensional list:

s = pd.Series([1,3,5,np.nan,6,8])

print(s)
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

By default, a Series gets an integer index (a different index can be supplied with an extra parameter), and all of its values share one dtype.
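For instance, a minimal sketch of passing an explicit index (the values and labels below are made up for illustration):

s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])  # explicit string labels
print(s2['b'])  # label-based access, prints 20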

DataFrame

A DataFrame is a two-dimensional structure. We first build a range of timestamps to use as the row index:

dates = pd.date_range('20130101', periods=6)

print(dates)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Then create a DataFrame:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-06 -2.163453 -0.010279  1.699886  1.291653

By default, if the index and columns parameters are not given, both are filled with integers starting from 0.
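A quick sketch of the default labels (the shape here is arbitrary):

df0 = pd.DataFrame(np.random.randn(3, 2))
print(df0.index)    # integer row labels 0, 1, 2
print(df0.columns)  # integer column labels 0, 1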

Besides passing a two-dimensional array to DataFrame, we can also pass in the data as a dictionary:

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })

df2
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo

Each key of the dictionary becomes a column, and its value can be any object that can be converted to a Series.

A Series requires all of its values to share one type; a DataFrame, by contrast, only requires that each individual column have a uniform dtype:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

Head and tail

The head and tail methods show the first and the last few rows of the data respectively (5 by default):

df.head()
                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

The last 3 rows:

df.tail(3)
                   A         B         C         D
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-06 -2.163453 -0.010279  1.699886  1.291653

Index, columns, and values

The row labels are available through the index attribute:

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

The column labels through the columns attribute:

df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')

The data values through values:

df.values
array([[-0.60593585, -0.86165752, -1.00192387,  1.52858443],
       [-0.16540784,  0.38833783,  1.18718697,  1.81981793],
       [ 0.06525454, -1.60807414, -1.2823306 , -0.28606716],
       [ 1.28930486,  0.49711531, -0.22535143,  0.04023897],
       [ 0.03823179,  0.87505664, -0.0925258 ,  0.93443212],
       [-2.16345271, -0.01027865,  1.69988608,  1.29165337]])

Summary statistics

A quick statistical summary of the data:

df.describe()
               A         B         C         D
count   6.000000  6.000000  6.000000  6.000000
mean   -0.257001 -0.119917  0.047490  0.888110
std     1.126657  0.938705  1.182629  0.841529
min    -2.163453 -1.608074 -1.282331 -0.286067
25%    -0.495804 -0.648813 -0.807781  0.263787
50%    -0.063588  0.189030 -0.158939  1.113043
75%     0.058499  0.469921  0.867259  1.469352
max     1.289305  0.875057  1.699886  1.819818

Transpose

df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.605936   -0.165408    0.065255    1.289305    0.038232   -2.163453
B   -0.861658    0.388338   -1.608074    0.497115    0.875057   -0.010279
C   -1.001924    1.187187   -1.282331   -0.225351   -0.092526    1.699886
D    1.528584    1.819818   -0.286067    0.040239    0.934432    1.291653

Sorting

The sort_index(axis=0, ascending=True) method sorts by the labels along an axis; axis=0 means sorting by the row index.

df.sort_index(ascending=False)
                   A         B         C         D
2013-01-06 -2.163453 -0.010279  1.699886  1.291653
2013-01-05  0.038232  0.875057 -0.092526  0.934432
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01  1.528584 -1.001924 -0.861658 -0.605936
2013-01-02  1.819818  1.187187  0.388338 -0.165408
2013-01-03 -0.286067 -1.282331 -1.608074  0.065255
2013-01-04  0.040239 -0.225351  0.497115  1.289305
2013-01-05  0.934432 -0.092526  0.875057  0.038232
2013-01-06  1.291653  1.699886 -0.010279 -2.163453

The sort_values(by, axis=0, ascending=True) method sorts by the values of by, for example by column B:

df.sort_values(by="B")
                   A         B         C         D
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-06 -2.163453 -0.010279  1.699886  1.291653
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

Indexing

Although DataFrame supports the usual Python/NumPy indexing syntax, the recommended accessors are .at, .iat, .loc and .iloc (.ix, mentioned in older tutorials, has since been deprecated and removed).
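The reason is that plain [] indexing becomes ambiguous between labels and positions once the index is not the default one. A minimal sketch (the frame below is invented for illustration):

tmp = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 0, 1])  # non-default integer index
print(tmp.loc[0, 'x'])   # by label: prints 20
print(tmp.iloc[0, 0])    # by position: prints 10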

Selecting data

Select a single column:

df["A"]
2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64

Equivalently, df.A:

df.A
2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64

Use slicing to select multiple rows:

df[0:3]
                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Slicing also works with index labels:

df["20130101":"20130103"]
                   A         B         C         D
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Indexing by label

loc makes indexing by label convenient:

df.loc[dates[0]]
A   -0.605936
B   -0.861658
C   -1.001924
D    1.528584
Name: 2013-01-01 00:00:00, dtype: float64

Multiple columns:

df.loc[:,['A','B']]
                   A         B
2013-01-01 -0.605936 -0.861658
2013-01-02 -0.165408  0.388338
2013-01-03  0.065255 -1.608074
2013-01-04  1.289305  0.497115
2013-01-05  0.038232  0.875057
2013-01-06 -2.163453 -0.010279

Multiple rows and columns:

df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02 -0.165408  0.388338
2013-01-03  0.065255 -1.608074
2013-01-04  1.289305  0.497115

Reducing the dimension:

df.loc['20130102',['A','B']]
A   -0.165408
B    0.388338
Name: 2013-01-02 00:00:00, dtype: float64

Getting a scalar value:

df.loc[dates[0],'B']
-0.86165751902832299

For a scalar value, however, at is faster:

%timeit -n100 df.loc[dates[0],'B']
%timeit -n100 df.at[dates[0],'B']

print(df.at[dates[0],'B'])
100 loops, best of 3: 329 µs per loop
100 loops, best of 3: 31.1 µs per loop
-0.861657519028

Indexing by position

iloc indexes by position:

df.iloc[3]
A    1.289305
B    0.497115
C   -0.225351
D    0.040239
Name: 2013-01-04 00:00:00, dtype: float64

Contiguous slices:

df.iloc[3:5,0:2]
                   A         B
2013-01-04  1.289305  0.497115
2013-01-05  0.038232  0.875057

Non-contiguous positions:

df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02 -0.165408  1.187187
2013-01-03  0.065255 -1.282331
2013-01-05  0.038232 -0.092526

Whole rows:

df.iloc[1:3,:]
                   A         B         C         D
2013-01-02 -0.165408  0.388338  1.187187  1.819818
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067

Whole columns:

df.iloc[:, 1:3]
                   B         C
2013-01-01 -0.861658 -1.001924
2013-01-02  0.388338  1.187187
2013-01-03 -1.608074 -1.282331
2013-01-04  0.497115 -0.225351
2013-01-05  0.875057 -0.092526
2013-01-06 -0.010279  1.699886

A scalar value:

df.iloc[1,1]
0.3883378290420279

Again, iat is faster for scalar access:

%timeit -n100 df.iloc[1,1]
%timeit -n100 df.iat[1,1]

df.iat[1,1]
100 loops, best of 3: 236 µs per loop
100 loops, best of 3: 14.5 µs per loop
0.3883378290420279

Boolean indexing

All rows where column A is greater than 0:

df[df.A > 0]
                   A         B         C         D
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067
2013-01-04  1.289305  0.497115 -0.225351  0.040239
2013-01-05  0.038232  0.875057 -0.092526  0.934432

Keep only the values that are greater than 0:

df[df > 0]
                   A         B         C         D
2013-01-01       NaN       NaN       NaN  1.528584
2013-01-02       NaN  0.388338  1.187187  1.819818
2013-01-03  0.065255       NaN       NaN       NaN
2013-01-04  1.289305  0.497115       NaN  0.040239
2013-01-05  0.038232  0.875057       NaN  0.934432
2013-01-06       NaN       NaN  1.699886  1.291653

Use the isin method for filtering:

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']

df2
                   A         B         C         D      E
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584    one
2013-01-02 -0.165408  0.388338  1.187187  1.819818    one
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067    two
2013-01-04  1.289305  0.497115 -0.225351  0.040239  three
2013-01-05  0.038232  0.875057 -0.092526  0.934432   four
2013-01-06 -2.163453 -0.010279  1.699886  1.291653  three
df2[df2['E'].isin(['two','four'])]
                   A         B         C         D     E
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   two
2013-01-05  0.038232  0.875057 -0.092526  0.934432  four

Setting values

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

Like a dictionary, we can assign s1 directly as column F; the data are aligned on df's existing index: index entries missing from s1 become NaN, and extra entries are discarded:

df['F'] = s1

df
                   A         B         C         D   F
2013-01-01 -0.605936 -0.861658 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5

Or modify a single value with at or iat:

df.at[dates[0],'A'] = 0

df
                   A         B         C         D   F
2013-01-01  0.000000 -0.861658 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5
df.iat[0, 1] = 0

df
                   A         B         C         D   F
2013-01-01  0.000000  0.000000 -1.001924  1.528584 NaN
2013-01-02 -0.165408  0.388338  1.187187  1.819818   1
2013-01-03  0.065255 -1.608074 -1.282331 -0.286067   2
2013-01-04  1.289305  0.497115 -0.225351  0.040239   3
2013-01-05  0.038232  0.875057 -0.092526  0.934432   4
2013-01-06 -2.163453 -0.010279  1.699886  1.291653   5

Setting an entire column:

df.loc[:,'D'] = np.array([5] * len(df))

df
                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.001924  5 NaN
2013-01-02 -0.165408  0.388338  1.187187  5   1
2013-01-03  0.065255 -1.608074 -1.282331  5   2
2013-01-04  1.289305  0.497115 -0.225351  5   3
2013-01-05  0.038232  0.875057 -0.092526  5   4
2013-01-06 -2.163453 -0.010279  1.699886  5   5

Setting values that satisfy a condition:

df2 = df.copy()

df2[df2 > 0] = -df2

df2
                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.001924 -5 NaN
2013-01-02 -0.165408 -0.388338 -1.187187 -5  -1
2013-01-03 -0.065255 -1.608074 -1.282331 -5  -2
2013-01-04 -1.289305 -0.497115 -0.225351 -5  -3
2013-01-05 -0.038232 -0.875057 -0.092526 -5  -4
2013-01-06 -2.163453 -0.010279 -1.699886 -5  -5

Missing data

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1

df1
                   A         B         C  D   F    E
2013-01-01  0.000000  0.000000 -1.001924  5 NaN    1
2013-01-02 -0.165408  0.388338  1.187187  5   1    1
2013-01-03  0.065255 -1.608074 -1.282331  5   2  NaN
2013-01-04  1.289305  0.497115 -0.225351  5   3  NaN

Drop every row that contains missing data, returning a new frame:

df1.dropna(how='any')
                   A         B         C  D  F  E
2013-01-02 -0.165408  0.388338  1.187187  5  1  1

Fill missing data:

df1.fillna(value=5)
                   A         B         C  D  F  E
2013-01-01  0.000000  0.000000 -1.001924  5  5  1
2013-01-02 -0.165408  0.388338  1.187187  5  1  1
2013-01-03  0.065255 -1.608074 -1.282331  5  2  5
2013-01-04  1.289305  0.497115 -0.225351  5  3  5

Find where the missing values are:

pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations

Statistics

The mean of each column:

df.mean()
A   -0.156012
B    0.023693
C    0.047490
D    5.000000
F    3.000000
dtype: float64

The mean of each row:

df.mean(1)
2013-01-01    0.999519
2013-01-02    1.482023
2013-01-03    0.834970
2013-01-04    1.912214
2013-01-05    1.964153
2013-01-06    1.905231
Freq: D, dtype: float64

When operating between objects whose dimensions do not match, pandas automatically broadcasts along the index:

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

print(s)
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64

Subtract s from df:

df.sub(s, axis='index')
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.934745 -2.608074 -2.282331    4    1
2013-01-04 -1.710695 -2.502885 -3.225351    2    0
2013-01-05 -4.961768 -4.124943 -5.092526    0   -1
2013-01-06       NaN       NaN       NaN  NaN  NaN

apply

Similar to apply in R, it takes a function and by default applies it to each column (a row-wise sketch follows the examples below):

df.apply(np.cumsum)
                   A         B         C   D    F
2013-01-01  0.000000  0.000000 -1.001924   5  NaN
2013-01-02 -0.165408  0.388338  0.185263  10    1
2013-01-03 -0.100153 -1.219736 -1.097067  15    3
2013-01-04  1.189152 -0.722621 -1.322419  20    6
2013-01-05  1.227383  0.152436 -1.414945  25   10
2013-01-06 -0.936069  0.142157  0.284941  30   15

Compute the difference between each column's maximum and minimum:

df.apply(lambda x: x.max() - x.min())
A    3.452758
B    2.483131
C    2.982217
D    0.000000
F    4.000000
dtype: float64
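The calls above work column by column (the default axis=0). As a hedged sketch, passing axis=1 applies the function to each row instead (output omitted here, since the values are random):

df.apply(lambda x: x.max() - x.min(), axis=1)  # row-wise max minus min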

Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
0    2
1    5
2    6
3    6
4    6
5    3
6    5
7    0
8    4
9    4
dtype: int64

Histogram-style counts:

print(s.value_counts())
6    3
5    2
4    2
3    1
2    1
0    1
dtype: int64

Plot the histogram:

h = s.hist()

[figure: histogram of s]

String methods

When a Series or a DataFrame column contains strings, the .str accessor applies basic string operations to the array element-wise:

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

print(s.str.lower())
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merging

Concatenation

df = pd.DataFrame(np.random.randn(10, 4))

df
          0         1         2         3
0 -2.346373  0.105651 -0.048027  0.010637
1 -0.682198  0.943043  0.147312 -0.657871
2  0.515766 -0.768286  0.361570  1.146278
3 -0.607277 -0.003086 -1.499001  1.165728
4 -1.226279 -0.177246 -1.379631 -0.639261
5  0.807364 -1.855060  0.325968  1.898831
6  0.438539 -0.728131 -0.009924  0.398360
7  1.497457 -1.506314 -1.557624  0.869043
8  0.945985 -0.519435 -0.510359 -1.077751
9  1.597679 -0.285955 -1.060736  0.608629

The pd.concat function concatenates multiple pandas objects:

pieces = [df[:2], df[4:5], df[7:]]

pd.concat(pieces)
          0         1         2         3
0 -2.346373  0.105651 -0.048027  0.010637
1 -0.682198  0.943043  0.147312 -0.657871
4 -1.226279 -0.177246 -1.379631 -0.639261
7  1.497457 -1.506314 -1.557624  0.869043
8  0.945985 -0.519435 -0.510359 -1.077751
9  1.597679 -0.285955 -1.060736  0.608629

Database-style join

merge performs database-style join operations:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

print(left)
print(right)
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

append

Add rows to a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

df
          A         B         C         D
0  1.587778 -0.110297  0.602245  1.212597
1 -0.551109  0.337387 -0.220919  0.363332
2  1.207373 -0.128394  0.619937 -0.612694
3 -0.978282 -1.038170  0.048995 -0.788973
4  0.843893 -1.079021  0.092212  0.485422
5 -0.056594  1.831206  1.910864 -1.331739
6 -0.487106 -1.495367  0.853440  0.410854
7  1.830852 -0.014893  0.254025  0.197422

Append the row at position 3 to the end:

s = df.iloc[3]

df.append(s, ignore_index=True)
          A         B         C         D
0  1.587778 -0.110297  0.602245  1.212597
1 -0.551109  0.337387 -0.220919  0.363332
2  1.207373 -0.128394  0.619937 -0.612694
3 -0.978282 -1.038170  0.048995 -0.788973
4  0.843893 -1.079021  0.092212  0.485422
5 -0.056594  1.831206  1.910864 -1.331739
6 -0.487106 -1.495367  0.853440  0.410854
7  1.830852 -0.014893  0.254025  0.197422
8 -0.978282 -1.038170  0.048995 -0.788973
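Note that DataFrame.append has been deprecated and removed in recent pandas releases; a hedged equivalent using pd.concat:

pd.concat([df, s.to_frame().T], ignore_index=True)  # same result as df.append(s, ignore_index=True)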

Grouping

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df
     A      B         C         D
0  foo    one  0.773062  0.206503
1  bar    one  1.414609 -0.346719
2  foo    two  0.964174  0.706623
3  bar  three  0.182239 -1.516509
4  foo    two -0.096255  0.494177
5  bar    two -0.759471 -0.389213
6  foo    one -0.257519 -1.411693
7  foo  three -0.109368  0.241862

Group by the values in column A:

df.groupby('A').sum()
            C         D
A
bar  0.837377 -2.252441
foo  1.274094  0.237472

Group by the values of A and B:

df.groupby(['A', 'B']).sum()
                    C         D
A   B
bar one      1.414609 -0.346719
    three    0.182239 -1.516509
    two     -0.759471 -0.389213
foo one      0.515543 -1.205191
    three   -0.109368  0.241862
    two      0.867919  1.200800

Reshaping

Stack

Create a DataFrame with a MultiIndex:

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df
                      A         B
first second
bar   one     -0.109174  0.958551
      two     -0.254743 -0.975924
baz   one     -0.132039 -0.119009
      two      0.587063 -0.819037
foo   one     -0.754123  0.430747
      two     -0.426544  0.389822
qux   one     -0.382501 -0.562910
      two     -0.529287  0.826337

The stack method moves the columns into a new level of the index:

df2 = df[:4]

stacked = df2.stack()

stacked
first  second   
bar    one     A   -0.109174
               B    0.958551
       two     A   -0.254743
               B   -0.975924
baz    one     A   -0.132039
               B   -0.119009
       two     A    0.587063
               B   -0.819037
dtype: float64

unstack() moves the last level of the index back into the columns:

stacked.unstack()
                      A         B
first second
bar   one     -0.109174  0.958551
      two     -0.254743 -0.975924
baz   one     -0.132039 -0.119009
      two      0.587063 -0.819037

A different level can also be specified:

stacked.unstack(1)
second        one       two
first
bar   A -0.109174 -0.254743
      B  0.958551 -0.975924
baz   A -0.132039  0.587063
      B -0.119009 -0.819037

Time series

Time series data is common in financial analysis:

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)

ts
2012-03-06    1.096788
2012-03-07    0.029678
2012-03-08    0.511461
2012-03-09   -0.332369
2012-03-10    1.720321
Freq: D, dtype: float64

Represent in standard (UTC) time:

ts_utc = ts.tz_localize('UTC')

ts_utc
2012-03-06 00:00:00+00:00    1.096788
2012-03-07 00:00:00+00:00    0.029678
2012-03-08 00:00:00+00:00    0.511461
2012-03-09 00:00:00+00:00   -0.332369
2012-03-10 00:00:00+00:00    1.720321
Freq: D, dtype: float64

Convert to another time zone:

ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00    1.096788
2012-03-06 19:00:00-05:00    0.029678
2012-03-07 19:00:00-05:00    0.511461
2012-03-08 19:00:00-05:00   -0.332369
2012-03-09 19:00:00-05:00    1.720321
Freq: D, dtype: float64

Categoricals

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

Convert grade into a categorical type:

df["grade"] = df["raw_grade"].astype("category")

df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to something more meaningful:

df["grade"].cat.categories = ["very good", "good", "very bad"]

df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
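In recent pandas versions, assigning to .cat.categories directly is no longer allowed; a hedged equivalent uses rename_categories:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])  # returns a renamed copy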

Set the full list of categories, adding the ones that are missing:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Group by grade:

df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

Use the ggplot style:

plt.style.use('ggplot')

Series 绘图:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

p = ts.cumsum().plot()

[figure: cumulative sum of ts]

DataFrame 按照 columns 绘图:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])

df.cumsum().plot()
p = plt.legend(loc="best")

[figure: cumulative sums of the DataFrame columns, with legend]

File I/O

csv

Write to a file:

df.to_csv('foo.csv')

Read from a file:

pd.read_csv('foo.csv').head()
  Unnamed: 0         A         B         C         D
0 2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
1 2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2 2000-01-03 -0.488692  1.709209 -0.602208  1.115456
3 2000-01-04 -0.440243  0.826692  0.321648 -0.351698
4 2000-01-05 -0.165684  1.297303  0.817233  0.174767
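Because to_csv also writes the index, reading the file back yields the extra "Unnamed: 0" column above. A small sketch of restoring the date index on read, assuming the same foo.csv:

pd.read_csv('foo.csv', index_col=0, parse_dates=True).head()  # use the first column as a DatetimeIndex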

hdf5

Write to a file:

df.to_hdf("foo.h5", "df")

Read the file:

pd.read_hdf('foo.h5','df').head()
                   A         B         C         D
2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2000-01-03 -0.488692  1.709209 -0.602208  1.115456
2000-01-04 -0.440243  0.826692  0.321648 -0.351698
2000-01-05 -0.165684  1.297303  0.817233  0.174767

excel

Write to a file:

df.to_excel('foo.xlsx', sheet_name='Sheet1')

Read the file:

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()
                   A         B         C         D
2000-01-01 -1.011554  1.200283 -0.310949 -1.060734
2000-01-02 -1.030894  0.660518 -0.214002 -0.422014
2000-01-03 -0.488692  1.709209 -0.602208  1.115456
2000-01-04 -0.440243  0.826692  0.321648 -0.351698
2000-01-05 -0.165684  1.297303  0.817233  0.174767

Clean up the generated temporary files:

import glob
import os

for f in glob.glob("foo*"):
    os.remove(f)