Pandas Basics

True Royal

Last modified: 2023-04-03 21:36:27

Tags: pandas, python, machine learning

First published: 2022-07-08 12:06:33

Article link: https://blog.csdn.net/qq_60650533/article/details/125674158

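This article shows how to use the Pandas library together with NumPy in Python. It creates and manipulates Series and DataFrame objects to demonstrate initialization, attribute access, sorting, and data selection. It also covers importing and exporting data and combining data with concat and merge, with examples.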

Pandas is generally used together with NumPy.

1. Basic attributes

(1) pd.Series: a sequence

A Series attaches an index label to each value.

Usage

import numpy as np
import pandas as pd

# Build a Series from a list; np.nan marks a missing value
s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)

Result

0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
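The result above shows two details worth checking directly: the default index runs from 0 to 5, and the single NaN makes the whole Series float64. The following is a minimal sketch of that; the labels "x", "y", "z" are hypothetical and only illustrate passing a custom index.

```python
import numpy as np
import pandas as pd

# Same values as the example above; np.nan marks a missing entry.
s = pd.Series([1, 3, 6, np.nan, 44, 1])

# The default index is 0..5, and the NaN forces the integers to float64.
index_labels = list(s.index)
dtype_name = str(s.dtype)
missing_count = int(s.isna().sum())

# A custom index (hypothetical labels) replaces the default numbering.
labeled = pd.Series([1, 3, 6], index=["x", "y", "z"])
value_at_y = labeled["y"]

print(index_labels, dtype_name, missing_count, value_at_y)
```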

(2) pd.DataFrame

# Six consecutive daily dates starting 2022-01-01, used as row labels
dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
print(df)

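Here pd.date_range generates six consecutive dates.

np.random.randn(6,4) produces an array of standard-normal random values with six rows and four columns.

index sets the row labels.

columns sets the column names.

index and columns can also be omitted, in which case pandas uses the defaults (integers starting from 0).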
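To illustrate the default labels mentioned above, here is a small sketch that builds a DataFrame without index or columns. The name df_default and the use of np.arange in place of random data are choices made here so the result is reproducible.

```python
import numpy as np
import pandas as pd

# No index or columns given, so pandas numbers both from 0.
df_default = pd.DataFrame(np.arange(12).reshape(3, 4))

row_labels = list(df_default.index)       # default row labels
column_labels = list(df_default.columns)  # default column labels
print(df_default)
```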

Output (the values differ on every run because the data is random)

                   a         b         c         d
2022-01-01  1.326867 -0.389798  0.186750  1.986374
2022-01-02  0.664617  2.271001 -0.487486  0.421387
2022-01-03 -1.309948  1.147211 -1.573718  1.434102
2022-01-04  0.700662 -1.795573 -0.043631  2.298165
2022-01-05 -0.285750 -0.473514  1.030131  0.104888
2022-01-06  0.887182  1.187699  1.017360 -1.431845

A DataFrame can also be defined from a dictionary, where each key becomes a column:

df=pd.DataFrame({'A':1.,
                 'B':pd.Timestamp('20220102'),
                 'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                 'D':np.array([3]*4,dtype='int32'),
                 'E':pd.Categorical(["test","train","test","train"]),
                 'F':'foo'})

Result

     A          B    C  D      E    F
0  1.0 2022-01-02  1.0  3   test  foo
1  1.0 2022-01-02  1.0  3  train  foo
2  1.0 2022-01-02  1.0  3   test  foo
3  1.0 2022-01-02  1.0  3  train  foo

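Individual attributes can also be accessed separately:

df.index returns the row labels.

df.columns returns the column names.

df.values returns all elements as a NumPy array.

df.describe() returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.

df.T transposes the DataFrame, swapping rows and columns.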
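The attributes listed above can be tried on a small frame. This is only a sketch: df_demo and its values are illustrative data chosen here, not taken from the article.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (illustrative data, not from the article).
df_demo = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

print(df_demo.index)    # row labels
print(df_demo.columns)  # column names
print(df_demo.values)   # underlying NumPy array

# Summary statistics: count, mean, std, min, quartiles, max.
stats = df_demo.describe()
print(stats)

# Transpose: the two columns become two rows.
transposed = df_demo.T
print(transposed)
```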

Sorting

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20220102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

# Sort the columns by name in descending order (axis=1)
df2 = df.sort_index(axis=1, ascending=False)
print(df2)
# Sort the rows by index label in descending order (axis=0)
df2 = df.sort_index(axis=0, ascending=False)
print(df2)
# Sort the rows by the values in column 'E'
df2 = df.sort_values(by='E')
print(df2)

Result

     F      E  D    C          B    A
0  foo   test  3  1.0 2022-01-02  1.0
1  foo  train  3  1.0 2022-01-02  1.0
2  foo   test  3  1.0 2022-01-02  1.0
3  foo  train  3  1.0 2022-01-02  1.0

     A          B    C  D      E    F
3  1.0 2022-01-02  1.0  3  train  foo
2  1.0 2022-01-02  1.0  3   test  foo
1  1.0 2022-01-02  1.0  3  train  foo
0  1.0 2022-01-02  1.0  3   test  foo

     A          B    C  D      E    F
0  1.0 2022-01-02  1.0  3   test  foo
2  1.0 2022-01-02  1.0  3   test  foo
1  1.0 2022-01-02  1.0  3  train  foo
3  1.0 2022-01-02  1.0  3  train  foo

2. Selecting data

Original data

             A   B   C   D
2022-01-01   0   1   2   3
2022-01-02   4   5   6   7
2022-01-03   8   9  10  11
2022-01-04  12  13  14  15
2022-01-05  16  17  18  19
2022-01-06  20  21  22  23
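The article does not show how this table was built. One plausible construction, offered as an assumption, reshapes np.arange(24) into six rows and four columns and reuses the date index from earlier. The integer dtype is platform dependent and may print as int32 or int64.

```python
import numpy as np
import pandas as pd

# Assumed construction: the source shows only the resulting table.
dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates,
                  columns=['A', 'B', 'C', 'D'])
print(df)
```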

(1) print(df['A'], df.A)

2022-01-01 0
2022-01-02 4
2022-01-03 8
2022-01-04 12
2022-01-05 16
2022-01-06 20
Freq: D, Name: A, dtype: int32 2022-01-01 0
2022-01-02 4
2022-01-03 8
2022-01-04 12
2022-01-05 16
2022-01-06 20
Freq: D, Name: A, dtype: int32
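Both forms return the same column, as the duplicated output suggests. The sketch below checks this and adds one caveat: attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, while brackets always work. The frame construction and the column name 'my col' are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the article's frame.
dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates,
                  columns=['A', 'B', 'C', 'D'])

col_bracket = df['A']  # bracket access
col_attr = df.A        # attribute access
same = col_bracket.equals(col_attr)
print(same)

# A name containing a space (hypothetical) cannot be reached as an attribute,
# so bracket access is the only option.
df_spaces = pd.DataFrame({'my col': [1, 2]})
my_col = df_spaces['my col']
print(my_col)
```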

(2)print(df[0:3],df['20220102':'20220104'])

            A  B   C   D
2022-01-01  0  1   2   3
2022-01-02  4  5   6   7
2022-01-03  8  9  10  11              A   B   C   D
2022-01-02   4   5   6   7
2022-01-03   8   9  10  11
2022-01-04  12  13  14  15
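The two slices in this output behave differently at the end point: the positional slice 0:3 excludes position 3, while the label slice includes the end date 2022-01-04. A small sketch of that difference follows; the frame construction is an assumed reconstruction of the article's data.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the article's frame.
dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates,
                  columns=['A', 'B', 'C', 'D'])

# Positional slice: rows 0, 1, 2 (the end position 3 is excluded).
positional = df[0:3]

# Label slice: both the start and the end date are included.
by_label = df['20220102':'20220104']

print(positional.index[-1], by_label.index[-1])
```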

(3)print(df.loc['20220102'])