Python: Pandas Study Notes

イザナ二

Published 2022-10-28 00:46:16

Tags: pandas, python, study

Article link: https://blog.csdn.net/Demo_Zero/article/details/127563912

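The Pandas library

pandas and numpy are both built around arrays. The difference is much like that between a dict and a list: the rows and columns of a pandas object carry editable "key" labels, whereas the rows and columns of a numpy array are addressed only by the positions 0, 1, 2, ...

pandas is usually used together with numpy.

Importing pandas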

import pandas as pd
import numpy as np  # needed by the np.arange / np.linspace examples below

Series and DataFrame

pd.Series([ object ])

Unlike a numpy ndarray, a Series is displayed vertically by default: one value per line, next to its index label.

a=pd.Series([1,2,3,4,5])
print(a)

0    1
1    2
2    3
3    4
4    5
dtype: int64
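
The "key" idea from the introduction can be illustrated with a Series that has explicit labels. This is a minimal sketch; the labels 'x', 'y', 'z' and the variable names are made up for illustration.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0, 1, 2, ...
s = pd.Series([1, 2, 3], index=['x', 'y', 'z'])

by_label = s['y']        # look up by label, like a dict key
by_position = s.iloc[1]  # look up by integer position, like a list index

print(s)
print(by_label, by_position)
```

Both lookups return the same element; only the way of addressing it differs.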

Definition and parameters

pd.DataFrame(object, index=, columns= )

object can be an array or a dict.
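
As a rough sketch of the two input forms, the same table can be built from a 2-D array or from a dict. The labels 'r1', 'r2' and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# From a 2-D array: row and column labels are passed explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['A', 'B', 'C'])

# From a dict: each key becomes a column label, each value a column.
from_dict = pd.DataFrame({'A': [0, 3], 'B': [1, 4], 'C': [2, 5]},
                         index=['r1', 'r2'])

print(from_array)
print(from_dict)
```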

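a.columns returns the column labels

a.index returns the row labels

a.values returns all values as a numpy array

a.describe() returns summary statistics for each numeric column

a.sort_index() sorts by the row labels

a.sort_index(axis=1) sorts by the column labels

a.sort_values(by=) sorts the rows by the values in the column named in by (with axis=1, it sorts the columns by the values in the named row)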
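
A small sketch of these attributes and sorting methods on a deliberately unordered DataFrame. The data and names are invented for illustration, and sort_index(axis=1) is used here to order the columns by label.

```python
import pandas as pd

df = pd.DataFrame({'B': [2, 3, 1], 'A': [9, 7, 8]}, index=['z', 'x', 'y'])

print(df.columns)     # column labels
print(df.index)       # row labels
print(df.values)      # underlying values as a numpy array
print(df.describe())  # count, mean, std, min, quartiles, max per column

by_row_label = df.sort_index()         # rows ordered by label: x, y, z
by_col_label = df.sort_index(axis=1)   # columns ordered by label: A, B
by_value = df.sort_values(by='B')      # rows ordered by the values in column B
```

None of the sort methods modify df; each returns a new DataFrame.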

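Slicing a DataFrame

a.iloc[ ] (selection by integer position)

[ a:b, c:d ]: the part before the comma selects a range of rows, and the part after it selects a range of columns. As with ordinary Python slices, the end position after the colon is excluded.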

d=pd.date_range('20221026',periods=3)
a=pd.DataFrame(np.arange(12).reshape(3,4), index=d,columns=['A','B','C','D'])
print(a)
print(a.iloc[0:2,1:3])

            A  B   C   D
2022-10-26  0  1   2   3
2022-10-27  4  5   6   7
2022-10-28  8  9  10  11
            B  C
2022-10-26  1  2
2022-10-27  5  6

a.loc[ ] (selection by label)

Unlike iloc, a label slice in loc includes its end label.

d=pd.date_range('20221026',periods=3)
a=pd.DataFrame(np.arange(12).reshape(3,4), index=d,columns=['A','B','C','D'])
print(a.loc['20221027':,['A','C']])

            A   C
2022-10-27  4   6
2022-10-28  8  10

Both iloc and loc also accept lists, so the selected rows and columns do not have to be contiguous.
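
A minimal sketch of non-contiguous selection, reusing the DataFrame from the examples above; the result variable names are made up.

```python
import numpy as np
import pandas as pd

d = pd.date_range('20221026', periods=3)
a = pd.DataFrame(np.arange(12).reshape(3, 4), index=d, columns=['A', 'B', 'C', 'D'])

# iloc: lists of integer positions (rows 0 and 2, columns 0 and 3)
by_position = a.iloc[[0, 2], [0, 3]]

# loc: lists of labels (the same rows and columns, named explicitly)
by_label = a.loc[[d[0], d[2]], ['A', 'D']]

print(by_position)
print(by_label)
```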

a[ a.index or a.<column>  <comparison>  value ] (boolean filtering)

d=pd.date_range('20221026',periods=3)
a=pd.DataFrame(np.arange(12).reshape(3,4), index=d,columns=['A','B','C','D'])
print(a,'\n')
print(a[a.B>3],'\n')
print(a.C[a.B>3])

            A  B   C   D
2022-10-26  0  1   2   3
2022-10-27  4  5   6   7
2022-10-28  8  9  10  11 

            A  B   C   D
2022-10-27  4  5   6   7
2022-10-28  8  9  10  11 

2022-10-27     6
2022-10-28    10
Freq: D, Name: C, dtype: int32
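
Several conditions can be combined into one filter. The following is a sketch under the same setup as above; mask, filtered and after_first_day are illustrative names.

```python
import numpy as np
import pandas as pd

d = pd.date_range('20221026', periods=3)
a = pd.DataFrame(np.arange(12).reshape(3, 4), index=d, columns=['A', 'B', 'C', 'D'])

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than > and <.
mask = (a.B > 3) & (a.D < 11)
filtered = a[mask]

# Filtering on the row index works the same way.
after_first_day = a[a.index > pd.Timestamp('2022-10-26')]

print(filtered)
print(after_first_day)
```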

Merging DataFrames

Merging with concat

a=pd.DataFrame(np.arange(12).reshape(3,4), columns=['A','B','C','D'])
b=pd.DataFrame(np.linspace(1,23,12,dtype='int64').reshape(3,4), columns=['A','B','C','D'])
c=pd.concat([a,b])
print(a,'\n')
print(b,'\n')
print(c)

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

    A   B   C   D
0   1   3   5   7
1   9  11  13  15
2  17  19  21  23 

    A   B   C   D
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
0   1   3   5   7
1   9  11  13  15
2  17  19  21  23

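As shown, the DataFrames are stacked vertically and the original row labels are carried over unchanged, so 0, 1, 2 appears twice.

The axis and ignore_index parameters change this behavior: axis=1 concatenates horizontally, and ignore_index=True replaces the labels along the concatenation axis with 0, 1, 2, ...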

a=pd.DataFrame(np.arange(12).reshape(3,4), columns=['A','B','C','D'])
b=pd.DataFrame(np.linspace(1,23,12,dtype='int64').reshape(3,4), columns=['A','B','C','D'])
c=pd.concat([a,b],axis=1,ignore_index=True)
print(a,'\n')
print(b,'\n')
print(c)

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

    A   B   C   D
0   1   3   5   7
1   9  11  13  15
2  17  19  21  23 

   0  1   2   3   4   5   6   7
0  0  1   2   3   1   3   5   7
1  4  5   6   7   9  11  13  15
2  8  9  10  11  17  19  21  23
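
The two parameters can also be used separately. A short sketch with the same a and b; stacked and side_by_side are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
b = pd.DataFrame(np.linspace(1, 23, 12, dtype='int64').reshape(3, 4),
                 columns=['A', 'B', 'C', 'D'])

# Vertical concat with the rows renumbered 0..5 instead of 0, 1, 2, 0, 1, 2.
stacked = pd.concat([a, b], ignore_index=True)

# Horizontal concat that keeps the original column labels (A..D twice).
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(side_by_side)
```

Keeping duplicate column labels, as in side_by_side, makes later selection by label ambiguous, which is one reason to pass ignore_index=True or to rename the columns first.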

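When the column labels of the two DataFrames differ, the result looks like this: a column that is missing from one DataFrame is filled with NaN, and that column becomes float.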

a=pd.DataFrame(np.arange(12).reshape(3,4), columns=['A','B','C','D'])
b=pd.DataFrame(np.linspace(1,23,12,dtype='int64').reshape(3,4), columns=['B','C','D','E'])
c=pd.concat([a,b],ignore_index=True)
print(c)

     A   B   C   D     E
0  0.0   1   2   3   NaN
1  4.0   5   6   7   NaN
2  8.0   9  10  11   NaN
3  NaN   1   3   5   7.0
4  NaN   9  11  13  15.0
5  NaN  17  19  21  23.0

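The join parameter controls this. With join='inner', only the column labels common to both DataFrames are kept, which avoids the NaN values; with join='outer' (the default), the result is the one shown above.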

c=pd.concat([a,b],ignore_index=True,join='inner')
print(c)

    B   C   D
0   1   2   3
1   5   6   7
2   9  10  11
3   1   3   5
4   9  11  13
5  17  19  21

The same thing happens when DataFrames with different row labels are concatenated horizontally.

a=pd.DataFrame(np.arange(12).reshape(4,3),index=['A','B','C','D'])
b=pd.DataFrame(np.linspace(1,23,12,dtype='int64').reshape(4,3),index=['B','C','D','E'])
c=pd.concat([a,b],axis=1)
print(c)

     0     1     2     0     1     2
A  0.0   1.0   2.0   NaN   NaN   NaN
B  3.0   4.0   5.0   1.0   3.0   5.0
C  6.0   7.0   8.0   7.0   9.0  11.0
D  9.0  10.0  11.0  13.0  15.0  17.0
E  NaN   NaN   NaN  19.0  21.0  23.0
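
join='inner' applies in the horizontal case as well. A minimal sketch with the same a and b, keeping only the row labels that both DataFrames share:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(4, 3), index=['A', 'B', 'C', 'D'])
b = pd.DataFrame(np.linspace(1, 23, 12, dtype='int64').reshape(4, 3),
                 index=['B', 'C', 'D', 'E'])

# Only rows B, C, D exist in both inputs, so no NaN is introduced
# and the values stay integers.
c = pd.concat([a, b], axis=1, join='inner')
print(c)
```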