Pandas学习笔记

本文是通过阅读此处网址的内容以及自己查询而得的Pandas学习笔记。

一.简要介绍

Pandas是基于Numpy开发出来的,它是python的核心数据分析支持库,主要的数据结构是Series(带标签的一维同构数组)继和DataFrame(带标签的、大小可变的、二维异构表格),这些数据结构的值都是可变的;但Series的长度不可变,DataFrame可插入列。

Pandas与NumPy的本质区别是:NumPy数组只有一种数据类型,而DataFrame中每列的数据类型各不相同;当DataFrame的列由多种数据类型组成时,在输出底层数据的NumPy对象时的操作耗费系统资源较大。

部分数据类型

字符串类型:object

整数类型:int8,int16,int32,int64

浮点数类型:float64,float32

二.相关操作

1.导入

import numpy as np
import pandas as pd

2.生成对象

用列表生成Series;

s = pd.Series([1,5,14,np.nan,23])
print(s)

此时默认生成整数索引,结果如下:

0     1.0
1     5.0
2    14.0
3     NaN
4    23.0
dtype: float64

用Numpy数组生成DataFrame

dates = pd.date_range("20220521",periods=6)
print(dates)

其中,date_range()是panda中常见的函数,用于生成固定频率的时间索引,在这次的代码中,制定了开始时间以及时间序列的数量,那么接下来就会生成从“20220521”开始,频率为默认值“D”的6个时间的DatetimeIndex时间索引;

DatetimeIndex(['2022-05-21', '2022-05-22', '2022-05-23', '2022-05-24',
               '2022-05-25', '2022-05-26'],
              dtype='datetime64[ns]', freq='D')

此外,还有这种方式:

dates = pd.date_range("20220521",periods=6)
df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list('ABCD'))
print(df)

其中,np.random.randn(6,4)代表生成一个两维的数据表格,数据部分包括6行4列,表格的行标签是之前生成的dates,列标签是A,B,C,D;

                   A         B         C         D
2022-05-21  0.136886  1.041708 -1.321150  0.451349
2022-05-22  0.320517  0.915637 -0.376431 -1.490658
2022-05-23  0.390247  0.000224  1.161696 -1.622723
2022-05-24 -0.741657  1.345109 -0.656991 -1.895109
2022-05-25 -0.168533  1.175146 -0.099444 -0.803727
2022-05-26 -0.511003 -0.276564  2.746680 -1.134566

还可以采用Series字典对象生成DataFrame:

df2 = pd.DataFrame({
    'A': 1,
    'B': pd.Timestamp('20220524'),
    'C': pd.Series(1, index = list(range(4)),dtype = 'float32'),
    'D': np.array([3]*4, dtype = 'int32'),
    'E': pd.Categorical(["test","trian","test","train"]),
    'F': "foo"
    })
print(df2)

结果如下:

   A          B    C  D      E    F
0  1 2022-05-24  1.0  3   test  foo
1  1 2022-05-24  1.0  3  trian  foo
2  1 2022-05-24  1.0  3   test  foo
3  1 2022-05-24  1.0  3  train  foo

查询每一列的数据类型,可以得到每一列都有不同的数据类型;

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

3.查看数据

采用tail()函数以及head()函数查看之前定义好的df的头部以及尾部数据。

print(df.head(1))
print(df.tail(2))
                   A         B         C         D
2022-05-21 -0.249276  0.591296  0.992551 -1.133971
                   A         B         C         D
2022-05-25 -1.266972 -0.605986  1.588265  0.712359
2022-05-26  0.604126 -1.017234  0.232689 -0.579932

采用index()以及column()函数显示索引以及列名;

print(df.index)
print(df.columns)
DatetimeIndex(['2022-05-21', '2022-05-22', '2022-05-23', '2022-05-24',
               '2022-05-25', '2022-05-26'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')

通过DataFrame.to_numpy()函数输出底层数据的NumPy对象;但输出不包含行索引和列标签。在调用这个函数时,Pandas查找支持DataFrame里所有的数据类型的NumPy数据类型;

print(df.to_numpy())
[[-0.01568141  0.32649045  1.35315451 -0.73029335]
 [ 1.52319239  0.39413554 -1.79469583  1.16339447]
 [-0.38151711 -0.26693644 -0.43819009  0.86742392]
 [-0.62045309 -1.94559426 -0.37615171  1.97012684]
 [-1.5739133   1.66931392 -1.33115334 -0.75294336]
 [ 0.03869211  0.31861187  0.96338101  0.09202964]]

通过describe()函数可以快速查看数据的统计摘要;

print(df.describe())

最终显示的结果里,包括有:计数、平均值、标准差、最小值、最大值以及四分之一分位数、二分之一分位数、四分之三分位数。

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.362490 -0.058584 -0.428353  0.235016
std    1.355071  0.900411  0.588704  1.441720
min   -2.598253 -1.204845 -0.922017 -1.849070
25%   -0.776751 -0.648614 -0.835257 -0.393448
50%   -0.329908 -0.147642 -0.693182  0.140082
75%    0.412253  0.656982 -0.077645  0.896637
max    1.349188  1.039041  0.493345  2.401473

转置数据;

print(df.T)
   2022-05-21  2022-05-22  2022-05-23  2022-05-24  2022-05-25  2022-05-26
A   -0.036124    0.687316   -0.856321    2.429078   -2.038939    0.376852
B    0.359951   -1.019624    0.278458    1.134377   -0.956782   -0.201985
C   -0.176906    0.330358    0.135873   -1.556026    1.406605   -1.092472
D    0.182999    0.576017   -0.141801   -1.459510   -0.488253    1.263315

按轴排序;

print(df.sort_index(axis = 1,ascending = False))

 参数axis 默认值是 0,即指按照行的索引进行排序;若axis 设置为 1,即指按照列的索引进行排序;ascending 默认为 True,即升序;设置为 False 时候为降序。

                   D         C         B         A
2022-05-21 -0.039538  0.102864  1.809570  0.422249
2022-05-22  0.164445  0.128373 -0.488638  1.562228
2022-05-23 -1.019688  1.247957  1.711259  0.374710
2022-05-24  2.933171  0.179653 -0.349013 -0.356169
2022-05-25  0.483066  0.902441  2.752587 -0.858579
2022-05-26 -1.359240 -0.029439  0.315487  0.136530

按值排序;

print(df.sort_values(by = 'B'))

按照B列的数据值排序;

                   A         B         C         D
2022-05-22 -0.003005 -0.510940 -1.850251 -0.602501
2022-05-21  0.199195  0.448159  1.344446 -0.156287
2022-05-25 -1.032422  0.733475 -0.966291  0.494933
2022-05-24  2.097197  0.888254  1.117285 -0.026202
2022-05-26 -1.373749  1.635690 -0.610852  0.726204
2022-05-23  0.476349  1.862359 -0.921950  0.565911

4.选择

获取数据

选取单列,产生Series;

print(df.A)
2022-05-21   -1.569664
2022-05-22   -0.519668
2022-05-23    1.400681
2022-05-24   -1.563633
2022-05-25   -0.864536
2022-05-26    1.321613
Freq: D, Name: A, dtype: float64

将行进行切片;

print(df[0:3])
                   A         B         C         D
2022-05-21 -0.074677  0.511283  1.363140  1.209577
2022-05-22 -1.790185  0.186599  0.386680  0.389305
2022-05-23  0.779339  0.857374 -0.292863  0.324927
print(df['20220521':'20220524'])
                   A         B         C         D
2022-05-21 -0.034967  0.169557  0.496417  0.557630
2022-05-22 -0.363198  1.165939 -0.029852 -0.742354
2022-05-23 -1.065871  0.094265  0.335761  1.295312
2022-05-24  0.173184  1.589024  0.565842  0.973264

按照标签选择

提取一行数据;

print(df.loc[dates[0]])
A   -1.752385
B    2.007287
C   -0.466264
D   -0.059769
Name: 2022-05-21 00:00:00, dtype: float64

用标签选择多列数据;

print(df.loc[:,['A','B']])
                   A         B
2022-05-21  1.832760 -1.234210
2022-05-22  1.060735  0.604725
2022-05-23  0.042338  0.429720
2022-05-24 -0.120698  0.354572
2022-05-25  0.167567 -1.642642
2022-05-26  0.284255  0.106209

用标签切片,包含行与列结束点;

print(df.loc["20220521":'20220523',['C','D']])
                   C         D
2022-05-21 -0.064251  1.404674
2022-05-22 -1.469895  0.604737
2022-05-23 -0.164208  1.140707

返回对象,实现降维效果;

print(df.loc['20220522',['A','C']])
A    0.102988
C    0.702026
Name: 2022-05-22 00:00:00, dtype: float64

提取标量的值;

print(df.loc[dates[0],'D'])
0.3523557087717819

等效方法,可快速访问标量;

print(df.at[dates[0],'C'])
0.2784856822405241

 按位置选择

按照整数位置选择;

print(df.iloc[3])
A    0.504944
B    0.636838
C    0.062362
D   -1.163998
Name: 2022-05-24 00:00:00, dtype: float64

整数切片;

print(df.iloc[3:5,0:3])
                   A         B         C
2022-05-24  0.883995 -0.387231  0.628574
2022-05-25 -0.289374  0.010543  0.199258

采用整数列表按照位置切片;

print(df.iloc[[1,2,5],[0,3]])
                   A         D
2022-05-22  0.763304 -0.387213
2022-05-23  0.894361 -0.342509
2022-05-26 -0.618824  0.701133

显式整行切片;

print(df.iloc[1:3,:])
                   A         B         C         D
2022-05-22  1.479378  0.782299 -0.519816  0.892910
2022-05-23 -0.890134 -1.006557  0.597311 -0.598442

显式整列切片;

print(df.iloc[:,1:3])
                   B         C
2022-05-21 -0.691051 -1.694787
2022-05-22  0.750126 -0.709624
2022-05-23 -1.652578  1.136266
2022-05-24 -2.098970 -0.647041
2022-05-25 -1.143725 -1.053974
2022-05-26  0.846129  0.317639

显式提取值;

print(df.iloc[1,1])
-1.220320008727587

等效方法,也可以快速访问标量;

print(df.iat[1,1])
-1.220320008727587

布尔索引

用单列的值选择数据;

print(df[df.A>0])
                   A         B         C         D
2022-05-23  1.424633 -0.411235 -1.050350  0.073719
2022-05-26  0.216383 -0.602787  0.242099 -0.003117

选择DataFrame里满足条件的值;

print(df[df>0])
                   A         B         C         D
2022-05-21  0.034121       NaN       NaN       NaN
2022-05-22       NaN  0.832257       NaN  0.673357
2022-05-23  2.551887  1.052720       NaN  2.171475
2022-05-24       NaN       NaN  0.724251       NaN
2022-05-25       NaN  2.065618  0.419077  0.231083
2022-05-26  0.034950       NaN       NaN       NaN

采用isin()函数进行筛选;先添加一行“E”,isin()函数接受一个列表,并判断该列中元素是否在列表中;

df2 = df.copy()
df2['E'] = ['one','one','two','three','four','three']
print(df2)
print()
print(df2[df2['E'].isin(['two','four'])])
                   A         B         C         D      E
2022-05-21  1.784070  0.717759  1.173878 -0.543134    one
2022-05-22 -1.160707  0.622619 -0.371278  0.569910    one
2022-05-23  0.726340  0.765890  0.811662 -0.158632    two
2022-05-24  0.309843  0.208132  0.011794 -0.633664  three
2022-05-25  0.445362  1.235948  1.280879 -0.051112   four
2022-05-26 -0.499006  0.404253  1.310874 -0.230863  three

                   A         B         C         D     E
2022-05-23  0.726340  0.765890  0.811662 -0.158632   two
2022-05-25  0.445362  1.235948  1.280879 -0.051112  four

赋值

用索引自动对齐新增列的数据;新建一个值为[1,2,3,4,5,6],索引index为20220521到20220526的Series,并将series赋值给df作为df新增的F列

s1 = pd.Series([1,2,3,4,5,6],index = pd.date_range("20220521",periods = 6))
print(s1)
df['F'] = s1
2022-05-21    1
2022-05-22    2
2022-05-23    3
2022-05-24    4
2022-05-25    5
2022-05-26    6
Freq: D, dtype: int64

按照标签赋值;通过at方法把满足df中dates[0],列A的值修改为0;

df.at[dates[0],'A'] = 0

按照位置赋值;使用iat方法修改df中行下标为0,列下标为1的值等于0;

df.iat[0,1] = 0

按照NumPy数组赋值; 使用loc方法把df的D列值修改为5*len(df)

df.loc[:,'D'] = np.array([5]*len(df))

而后打印赋值结果,可得:

                   A         B         C  D  F
2022-05-21  0.000000  0.000000  1.189929  5  1
2022-05-22  0.720087 -0.733954 -0.227088  5  2
2022-05-23  0.304696  1.125239  2.176908  5  3
2022-05-24  0.418205 -0.414897 -0.365389  5  4
2022-05-25 -1.047534  0.502702 -0.247725  5  5
2022-05-26  1.004945  0.826539 -0.677019  5  6

采用where条件进行赋值;

df2 = df.copy()
df2[df2>0] = -df2
print(df2)
                   A         B         C         D
2022-05-21 -1.395279 -0.998688 -1.105492 -0.077170
2022-05-22 -0.554307 -0.123219 -0.263776 -0.368944
2022-05-23 -0.867589 -0.143677 -1.149439 -0.052096
2022-05-24 -0.316583 -1.197679 -0.244917 -1.372555
2022-05-25 -1.166079 -0.265879 -1.385109 -1.612151
2022-05-26 -0.413096 -0.344400 -0.837535 -0.152082

5.缺失值

Pandas主要使用np.nan表示缺失数据,在计算时,默认情况下不包含空值;

在使用reindex重建索引时,可以更改、添加、删除指定轴的索引,并返回数据副本,并不更改原数据。

df1 = df.reindex(index = dates[0:4],columns= list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)

按照设定的行列重建索引后,修改其中特定行列的值,并打印出来;

                   A         B         C  D  F    E
2022-05-21  0.000000  0.000000 -0.881176  5  1  1.0
2022-05-22  0.408340 -0.195753 -1.627599  5  2  1.0
2022-05-23  0.379201 -0.329105 -0.351990  5  3  NaN
2022-05-24  0.510544  1.734054  0.474862  5  4  NaN

使用dropna()函数删除所有含缺失值的行;

print(df1.dropna(how = 'any'))
                   A         B         C  D  F    E
2022-05-21  0.000000  0.000000 -0.398187  5  1  1.0
2022-05-22 -1.021497 -1.067449 -1.213461  5  2  1.0

使用fillna()函数填充缺失值;

                   A         B         C  D  F    E
2022-05-21  0.000000  0.000000 -0.537663  5  1  1.0
2022-05-22 -0.447872  0.335033  1.338705  5  2  1.0
2022-05-23 -0.868069  0.114179  0.250253  5  3  5.0
2022-05-24  0.496782 -1.299465 -2.181665  5  4  5.0

使用isna()函数提取nan值的布尔掩码;

print(pd.isna(df1))
                A      B      C      D      F      E
2022-05-21  False  False  False  False  False  False
2022-05-22  False  False  False  False  False  False
2022-05-23  False  False  False  False  False   True
2022-05-24  False  False  False  False  False   True

6.运算

统计

一般情况,运算时排除缺失值;

描述性统计;(列)

print(df.mean())
A    0.100793
B    0.377849
C    0.044165
D    5.000000
F    3.500000
dtype: float64

在另一个轴也即行轴上执行同样的操作;

print(df.mean(1))
2022-05-21    1.002418
2022-05-22    1.599843
2022-05-23    1.815638
2022-05-24    2.085704
2022-05-25    1.524031
2022-05-26    1.711351
Freq: D, dtype: float64

在不同维度的对象进行运算时,要先将其对齐(Pandas自动沿指定维度广播)。

s = pd.Series([1,3,5,np.nan,6,8],index = dates).shift(2)
print(s)

shift()函数对数据进行移动;

2022-05-21    NaN
2022-05-22    NaN
2022-05-23    1.0
2022-05-24    3.0
2022-05-25    5.0
2022-05-26    NaN
Freq: D, dtype: float64

元素级相减;

print(df.sub(s,axis = 'index'))
                   A         B         C    D    F
2022-05-21       NaN       NaN       NaN  NaN  NaN
2022-05-22       NaN       NaN       NaN  NaN  NaN
2022-05-23 -1.472729 -1.114310 -1.966175  4.0  2.0
2022-05-24 -2.250075 -3.093553 -3.755624  2.0  1.0
2022-05-25 -6.066469 -3.576010 -5.852282  0.0  0.0
2022-05-26       NaN       NaN       NaN  NaN  NaN

Apply函数

调来一个函数,对数据进行批量处理;

print(df.apply(np.cumsum))
                   A         B         C   D   F
2022-05-21  0.000000  0.000000 -0.117359   5   1
2022-05-22 -1.624370  0.391208  1.857898  10   3
2022-05-23 -1.860172  2.902743  2.030917  15   6
2022-05-24 -2.390905  2.147162  0.670798  20  10
2022-05-25 -2.749462  2.377652 -0.947589  25  15
2022-05-26 -4.870885  0.924509 -2.626332  30  21

通过cumsum()函数返回沿给定轴的元素的累积和;

print(df.apply(np.cumsum))
                   A         B         C   D   F
2022-05-21  0.000000  0.000000 -1.511653   5   1
2022-05-22 -0.064132 -0.430295  0.847677  10   3
2022-05-23 -0.795267 -0.249659  1.122162  15   6
2022-05-24 -1.526372  0.735427  2.770422  20  10
2022-05-25 -0.392146  1.599917  4.034942  25  15
2022-05-26 -0.380017  1.188050  4.987291  30  21

使用lambda()定义每一列最大值减去最小值的函数;

print(df.apply(lambda x : x.max() - x.min()))
A    2.984022
B    1.708451
C    4.276404
D    0.000000
F    5.000000
dtype: float64

直方图

s = pd.Series(np.random.randint(0,7,size = 10))
print(s)
0    4
1    2
2    1
3    3
4    0
5    3
6    1
7    2
8    3
9    5
dtype: int32

 计数;

print(s.value_counts())
6    2
5    2
1    2
3    2
0    1
2    1
dtype: int64

字符串方法

str属性中包含一组字符串处理功能,其模式匹配默认使用正则表达式;

s = pd.Series(['A','B','C','Aaba','Baca',np.nan,'CABA','dog','cat'])
print(s.str.lower())
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

7.合并

结合

concat()函数用于连接Pandas对象;

df = pd.DataFrame(np.random.randn(10,4))
print(df)
          0         1         2         3
0 -1.359922  0.985887 -0.003946  0.988086
1 -0.165162 -0.918188 -0.969180  1.341331
2  1.099910 -0.962338  0.202542  0.250508
3 -0.836569  1.490216  0.404249  0.758221
4  0.377200 -0.329764 -1.031423 -0.921214
5 -1.536218 -0.410249 -1.239609 -1.124185
6 -0.862180  0.415266 -1.721721 -0.236350
7  1.883878 -0.095544 -0.950778 -0.525479
8  0.944001  1.099845  1.922751  0.837286
9 -0.954146  0.325708  0.119797  0.355755

 分解为多组后,连接起来;

pieces = [df[:3],df[3:7],df[7:]]
print(pd.concat(pieces))
                   A         B         C  D  F
2022-05-21  0.000000  0.000000 -1.351565  5  1
2022-05-22 -0.790049  2.016855  0.078462  5  2
2022-05-23  0.846315 -0.592889 -1.361169  5  3
2022-05-24  0.300818  0.174860 -0.410828  5  4
2022-05-25  0.843421  0.682123 -0.095293  5  5
2022-05-26  0.568242 -0.133728  0.459318  5  6

连接

按照SQL风格连接起来;

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'lval':[4,5]})
print(left)
print(right)
print(pd.merge(left,right,on = 'key'))
   key  lval
0  foo     1
1  foo     2
   key  lval
0  foo     4
1  foo     5
   key  lval_x  lval_y
0  foo       1       4
1  foo       1       5
2  foo       2       4
3  foo       2       5

追加

可以使用append()函数为DataFrame追加行;

df = pd.DataFrame(np.random.randn(8,4),columns= ['A','B','C','D'])
print(df)
s = df.iloc[3]
print(s)
print(df.append(s,ignore_index = True))

 忽略索引ignore_index=True

          A         B         C         D
0 -1.544461  0.155749 -0.851200 -0.356837
1  0.011902  1.191635  0.493092 -0.365457
2 -0.566577 -0.142893 -1.565211  1.869204
3  0.988017 -2.153860 -1.026299 -1.091889
4  1.203632  0.368603 -0.835276  0.459193
5  0.507430  0.024230 -0.658526 -1.026438
6  0.087510 -0.725613 -2.053941  0.522262
7  0.114488  0.104691 -0.028875  0.180440
A    0.988017
B   -2.153860
C   -1.026299
D   -1.091889
Name: 3, dtype: float64
          A         B         C         D
0 -1.544461  0.155749 -0.851200 -0.356837
1  0.011902  1.191635  0.493092 -0.365457
2 -0.566577 -0.142893 -1.565211  1.869204
3  0.988017 -2.153860 -1.026299 -1.091889
4  1.203632  0.368603 -0.835276  0.459193
5  0.507430  0.024230 -0.658526 -1.026438
6  0.087510 -0.725613 -2.053941  0.522262
7  0.114488  0.104691 -0.028875  0.180440
8  0.988017 -2.153860 -1.026299 -1.091889

8.分组

group by 主要涵盖一下几个流程:

先按照条件把数据分割成多组;然后将每组单独应用函数;最后将处理结果组合成一个数据结构;

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
     A      B         C         D
0  foo    one  1.167424 -0.934787
1  bar    one  0.780026  1.425194
2  foo    two  1.385424 -0.137203
3  bar  three  0.662306  0.379540
4  foo    two  0.378391  1.542871
5  bar    two -0.008416 -0.325737
6  foo    one -1.546695 -0.413336
7  foo  three -0.202212  0.007095

先分组,然后再用sum()函数计算每组的汇总数据;

print(df.groupby('A').sum())
            C         D
A                      
bar  3.982001  2.100006
foo  0.227701  1.864154

多列分组后,生成多层索引,也可以应用sum()函数;

print(df.groupby(['A','B']).sum())
                  C         D
A   B                        
bar one   -1.689814  0.275729
    three -0.209854  0.808712
    two   -1.027787 -1.969185
foo one    1.417323  0.591048
    three  0.006879 -0.328129
    two   -2.258776 -1.161554

9.重塑

堆叠

MultiIndex表示建立多级索引,下文是按照tuple的方式建立的;

tuples = list(zip(*[['bar','bar','baz','baz',
                     'foo','foo','qux','qux'],
                    ['one','two','one','two',
                     'one','two','one','two']]))
index = pd.MultiIndex.from_tuples(tuples,names = ['first','second'])
df = pd.DataFrame(np.random.randn(8,2),index = index,columns=['A','B'])
df2 = df[:4]
print(df2)
                     A         B
first second                    
bar   one    -0.848519  1.328762
      two    -1.534984 -2.450740
baz   one     0.731656  0.786888
      two     0.200778  0.694727

stack()函数可以把DataFrame列压缩至一层;

stacked = df2.stack()
print(stacked)
first  second   
bar    one     A    0.125944
               B   -1.118293
       two     A    0.667313
               B   -0.307445
baz    one     A   -0.689515
               B    0.279703
       two     A    1.158254
               B    1.359036
dtype: float64

压缩之后的DataFrame或者Series具有多层索引,采用stack()的逆操作unstack()则可进行拆叠,默认情况下拆叠最后一层;

print(stacked.unstack())
                     A         B
first second                    
bar   one    -0.322104 -1.303836
      two    -0.263043 -0.830459
baz   one    -1.517066 -0.042109
      two    -0.476060  0.950878

可以指定拆叠的层数;

print(stacked.unstack(1))
second        one       two
first                      
bar   A  0.383808 -0.072475
      B -0.112976 -0.162860
baz   A -0.741657  1.031196
      B  1.470863 -0.304738
print(stacked.unstack(0))
first          bar       baz
second                      
one    A -1.445239  1.248117
       B  0.193994 -0.823906
two    A  1.529949  0.300386
       B  0.348373 -0.265187

10.数据透视表

df = pd.DataFrame({'A':['one','one','two','three']*3,
                   'B':['A','B','C']*4,
                   'C':['foo','foo','foo','bar','bar','bar']*2,
                   'D':np.random.randn(12),
                   'E':np.random.randn(12)
})
print(df)
        A  B    C         D         E
0     one  A  foo -0.483057  0.039645
1     one  B  foo -0.414430  0.729397
2     two  C  foo -0.144770  0.115875
3   three  A  bar -0.455463  0.513543
4     one  B  bar  0.284159  0.717968
5     one  C  bar -2.001673 -1.027018
6     two  A  foo  1.271966  0.955445
7   three  B  foo  0.800737  0.546006
8     one  C  foo -0.727104 -0.166272
9     one  A  bar  0.746088  0.025253
10    two  B  bar -0.657521  1.935400
11  three  C  bar  1.059533 -0.005286

生成数据透视表;(数据透视表是一种交互式的表,而且可以动态的改变他们的版面布置,以便按照不同方式分析数据,也可以重新安排行号、列标和页字段。)

print(pd.pivot_table(df,values='D',index=['A','B'],columns=['C']))
C             bar       foo
A     B                    
one   A -0.393722 -0.475678
      B -0.358087 -0.493856
      C -0.167854  0.414786
three A  0.072712       NaN
      B       NaN  0.412877
      C -0.726184       NaN
two   A       NaN -0.818590
      B -1.001960       NaN
      C       NaN -0.157245

11.时间序列

Pandas为频率转换时的重采样提供了简单易用又强大高效的功能;下面是将秒级的数据转换为5分钟为频率的数据;使用resample()将其向下采样到5min的容器中,并将落入容器的时间戳的值相加。

rng = pd.date_range('23/5/2022',periods = 100,freq = 'S')
ts = pd.Series(np.random.randint(0,500,len(rng)),index = rng)
print(ts.resample('5Min').sum())
2022-05-23    27528
Freq: 5T, dtype: int32

时区表示;

rng = pd.date_range('24/5/2022 00:00',periods=5,freq = 'D')
ts = pd.Series(np.random.randn(len(rng)),rng)
print(ts)
2022-05-24    0.629037
2022-05-25    0.305843
2022-05-26   -0.146363
2022-05-27   -0.282303
2022-05-28   -0.495081
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
print(ts_utc)
2022-05-24 00:00:00+00:00   -1.146774
2022-05-25 00:00:00+00:00    1.857884
2022-05-26 00:00:00+00:00    0.400149
2022-05-27 00:00:00+00:00   -1.155352
2022-05-28 00:00:00+00:00    1.635605
Freq: D, dtype: float64

转换为其他时区;

print(ts_utc.tz_convert('US/Eastern'))
2022-05-23 20:00:00-04:00    1.528297
2022-05-24 20:00:00-04:00    0.445658
2022-05-25 20:00:00-04:00    1.828093
2022-05-26 20:00:00-04:00    1.692153
2022-05-27 20:00:00-04:00   -1.026198
Freq: D, dtype: float64

转换时间段:

rng = pd.date_range('1/1/2022',periods= 5,freq='M')
ts = pd.Series(np.random.randn(len(rng)),index= rng)
print(ts)
ps = ts.to_period()    #将日期时间离散化
print(ps)
2022-01-31    0.911192
2022-02-28   -0.741799
2022-03-31   -0.433314
2022-04-30   -0.780637
2022-05-31    0.689170
Freq: M, dtype: float64
2022-01    0.911192
2022-02   -0.741799
2022-03   -0.433314
2022-04   -0.780637
2022-05    0.689170
Freq: M, dtype: float64
print(ps.to_timestamp())
2022-01-01   -0.445026
2022-02-01    1.338020
2022-03-01    0.351717
2022-04-01    0.537906
2022-05-01    0.370323
Freq: MS, dtype: float64

Pandas函数也可以很方便的转换时间段与时间戳;

prng = pd.period_range('1999Q1','2000Q4',freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)),prng)
ts.index = (prng.asfreq('M','e')+1).asfreq('H','s')+9
print(ts.head())
1999-03-01 09:00    0.130733
1999-06-01 09:00   -2.153159
1999-09-01 09:00    0.520082
1999-12-01 09:00   -0.965103
2000-03-01 09:00   -0.392060
Freq: H, dtype: float64

12.类别型

将grade的原生数据转换为类别型数据;

df = pd.DataFrame({"id":[1,2,3,4,5,6],
                   "raw_grade":['a','b','b','a','a','e']
})
df["grade"] = df["raw_grade"].astype("category")
print(df["grade"])
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

用有含义的名字重命名不同类型;给categories赋值,可以改变类别标签。赋值的时候是按照顺序进行对应的;改变类别标签集合,操作过后数据的标签不变,但是标签的集合改变;

df["grade"].cat.categories = ["very good","good","very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad","bad","medium","good","very good"])
print(df["grade"])
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

按照生成类别时的顺序排序(并非按照词汇排序);

print(df.sort_values(by = 'grade'))
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

按照类别分组时,及时某个类别为空也会显示出来;

print(df.groupby("grade").size())
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

13.可视化

ts = pd.Series(np.random.randn(1000),
               index = pd.date_range('1/1/2022',periods = 1000))
ts = ts.cumsum()         #各行的累加值
ts.plot()
plt.show()
AxesSubplot(0.125,0.11;0.775x0.77)

 DataFrame()中的plot()函数也可以快速的绘制所有带标签的列;

df = pd.DataFrame(np.random.randn(1000,4),index = ts.index,
                  columns = ['A','B','C','D'])
df = df.cumsum()
print(plt.figure())
print(df.plot())
print(plt.legend(loc = 'best'))
plt.show()

14.数据输入/输出

CSV

将数据写入csv文件;

df.to_csv("foo.csv")

读取csv文件数据;

print(pd.read_csv("foo.csv"))
     Unnamed: 0          A          B         C          D
0    2022-01-01  -0.376720  -1.050965  0.676105  -0.155736
1    2022-01-02  -0.094961  -1.546079 -0.556264  -0.856252
2    2022-01-03  -0.524001  -1.987073 -1.254280   0.691564
3    2022-01-04  -3.139603   0.256445 -1.671734  -0.196728
4    2022-01-05  -2.477721   0.381823 -2.623101  -0.331353
..          ...        ...        ...       ...        ...
995  2024-09-22  21.916246  27.080269 -5.498463  45.427545
996  2024-09-23  21.987481  26.228331 -5.950425  46.577939
997  2024-09-24  22.121124  25.001296 -5.843399  45.658839
998  2024-09-25  22.269591  24.761772 -4.648639  44.929457
999  2024-09-26  22.108488  25.141986 -5.192166  44.995610

[1000 rows x 5 columns]

HDF5

将数据写入HDF5 Store:

df.to_hdf('foo.h5','df')

读取其中的数据;

print(pd.read_hdf("foo.h5","df"))
                    A         B          C         D
2022-01-01  -0.434350 -0.332814   0.892750  1.157850
2022-01-02  -0.840929 -0.683873  -0.275972  2.419774
2022-01-03  -0.225382 -1.585506   0.196728  3.477412
2022-01-04  -0.830408 -1.598484  -0.474299  5.660145
2022-01-05  -0.250991 -1.412912  -0.876879  6.528144
...               ...       ...        ...       ...
2024-09-22  57.132175 -0.205451 -46.591026 -9.453061
2024-09-23  58.581851 -0.007550 -47.742120 -8.684937
2024-09-24  57.213156  0.110778 -47.451625 -6.414811
2024-09-25  58.161486 -0.709191 -46.552347 -6.542223
2024-09-26  56.969844  0.304673 -46.967793 -7.499496

[1000 rows x 4 columns]

Excel

写入Excel文件;

df.to_excel('foo.xlsx',sheet_name='Sheet1')

读取文件信息;

print(pd.read_excel('foo.xlsx','Sheet1',index_col = None,na_values= ['NA']))
    Unnamed: 0          A          B          C          D
0   2022-01-01  -1.417080  -0.361437  -0.606728   0.638914
1   2022-01-02  -0.793184  -0.432722  -0.834653   0.419487
2   2022-01-03   0.030355  -0.136183  -0.702133  -0.615193
3   2022-01-04  -1.417608   0.682946  -1.466145   0.005104
4   2022-01-05  -0.608892   0.604069  -0.862176   0.817751
..         ...        ...        ...        ...        ...
995 2024-09-22  22.628256  25.999521 -48.243060  31.111053
996 2024-09-23  22.017860  26.358772 -47.372658  30.697530
997 2024-09-24  22.003023  24.433845 -46.477812  32.226947
998 2024-09-25  21.668497  27.670671 -45.666538  33.775446
999 2024-09-26  21.691661  27.582134 -43.666923  33.974122

[1000 rows x 5 columns]

15.各种坑

在执行某些操作时,会触发异常;比如下面这种情况:

if pd.Series([False,True,False]):
    print("I was true")

会出现错误提示如下: 

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值