Python Pandas Tutorial

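1. Pandas Overview

Pandas is a Python data analysis package created to solve data analysis tasks.

Pandas builds on a number of libraries and provides standard data models, along with the tools needed to work with datasets efficiently.

Pandas provides a large number of functions and methods for processing data quickly and conveniently.

Pandas data structures behave like labeled dictionaries and are built on NumPy, which makes NumPy-centric applications simpler to write.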

2. Installing Pandas

Pandas can be installed with pip, for example: pip install pandas

3. Importing Pandas

import pandas as pd

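4. Pandas Data Structures

4.1 Series

A Series is a one-dimensional labeled array. The example below creates one from a list that contains a missing value (np.nan).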

import numpy as np

import pandas as pd

s=pd.Series([1,2,3,np.nan,5,6])

print(s)

----------Output of the program above----------

0 1.0

1 2.0

2 3.0

3 NaN

4 5.0

5 6.0

dtype: float64
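The output above shows dtype float64 even though the list holds whole numbers. A minimal sketch of why, assuming default pandas settings: NaN is a floating-point value, so a Series that contains it is stored as floats. The variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])              # no missing values: integer dtype
with_nan = pd.Series([1, 2, 3, np.nan])  # NaN forces a floating-point dtype

print(ints.dtype)
print(with_nan.dtype)
print(with_nan.isna().sum())  # number of missing entries
```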

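4.2 DataFrame

A DataFrame is a tabular data structure that holds an ordered set of columns, and each column can have a different value type. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series.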
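To make the "dictionary of Series" view concrete, here is a small sketch. The column names and values are hypothetical sample data, not taken from the tutorial.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
prices = pd.Series([1.5, 2.0, 3.25], index=['x', 'y', 'z'])
counts = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Each dictionary key becomes a column label; the shared index becomes the row index.
df = pd.DataFrame({'price': prices, 'count': counts})

col = df['price']            # selecting a column returns a Series
print(type(col).__name__)
print(df.dtypes)             # each column keeps its own dtype
print(list(df.index), list(df.columns))
```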

import numpy as np

import pandas as pd

dates=pd.date_range('2019-08-01',periods=6)

df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])

print('A table with 6 rows and 4 columns:')

print(df)

print(' ')

print('The second column:')

print(df['B'])

print(' ')

----------Output of the program above----------

A table with 6 rows and 4 columns:

A B C D

2019-08-01 0.796050 -0.383286 -1.465294 -0.272321

2019-08-02 -1.431981 -0.875381 1.371449 0.321703

2019-08-03 -1.497636 1.258925 -1.374210 -0.765626

2019-08-04 2.518305 0.125094 2.647512 -0.024748

2019-08-05 -0.319238 0.395384 -0.582052 -0.396132

2019-08-06 -0.519434 1.873216 1.685524 -1.493000

The second column:

2019-08-01 -0.383286

2019-08-02 -0.875381

2019-08-03 1.258925

2019-08-04 0.125094

2019-08-05 0.395384

2019-08-06 1.873216

Freq: D, Name: B, dtype: float64

-------------------------------------------

import numpy as np

import pandas as pd

from datetime import datetime as dt

print('Creating a DataFrame from a dictionary:')

df_1=pd.DataFrame({'A':1.0,

'B':pd.Timestamp(2019,8,19),

'C':pd.Series(1,index=list(range(4)),dtype='float32'),

'D':np.array([3]*4,dtype='int32'),

'E':pd.Categorical(['test','train','test','train']),

'F':'foo'})

print(df_1)

print(' ')

print('Data type of each column:')

print(df_1.dtypes)

print(' ')

print('Row index:')

print(df_1.index)

print(' ')

print('Column labels:')

print(df_1.columns)

print(' ')

print('All values:')

print(df_1.values)

print(' ')

print('Numeric summary:')

print(df_1.describe())

print(' ')

print('Transposed data:')

print(df_1.T)

print(' ')

print('Sorted by column labels in descending order:')

#axis=1 sorts by column label (A, B, C, ...); ascending=False shows the labels in descending order

print(df_1.sort_index(axis=1,ascending=False))

print(' ')

print('Sorted by the values of a column:')

print(df_1.sort_values('E'))

print(' ')

----------Output of the program above----------

Creating a DataFrame from a dictionary:

A B C D E F

0 1.0 2019-08-19 1.0 3 test foo

1 1.0 2019-08-19 1.0 3 train foo

2 1.0 2019-08-19 1.0 3 test foo

3 1.0 2019-08-19 1.0 3 train foo

Data type of each column:

A float64

B datetime64[ns]

C float32

D int32

E category

F object

dtype: object

Row index:

Int64Index([0, 1, 2, 3], dtype='int64')

Column labels:

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

All values:

[[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'test' 'foo']

[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'train' 'foo']

[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'test' 'foo']

[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'train' 'foo']]

Numeric summary:

A C D

count 4.0 4.0 4.0

mean 1.0 1.0 3.0

std 0.0 0.0 0.0

min 1.0 1.0 3.0

25% 1.0 1.0 3.0

50% 1.0 1.0 3.0

75% 1.0 1.0 3.0

max 1.0 1.0 3.0

Transposed data:

0 1 2 3

A 1 1 1 1

B 2019-08-19 00:00:00 2019-08-19 00:00:00 2019-08-19 00:00:00 2019-08-19 00:00:00

C 1 1 1 1

D 3 3 3 3

E test train test train

F foo foo foo foo

Sorted by column labels in descending order:

F E D C B A

0 foo test 3 1.0 2019-08-19 1.0

1 foo train 3 1.0 2019-08-19 1.0

2 foo test 3 1.0 2019-08-19 1.0

3 foo train 3 1.0 2019-08-19 1.0

Sorted by the values of a column:

A B C D E F

0 1.0 2019-08-19 1.0 3 test foo

2 1.0 2019-08-19 1.0 3 test foo

1 1.0 2019-08-19 1.0 3 train foo

3 1.0 2019-08-19 1.0 3 train foo

5. Selecting Data with Pandas

import numpy as np

import pandas as pd

dates=pd.date_range('2019-08-01',periods=6)

df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])

print('Data with 6 rows and 4 columns:')

print(df)

print('Column B:')

print(df['B'])

----------Output of the program above----------

Data with 6 rows and 4 columns:

A B C D

2019-08-01 -0.856790 -1.968381 -0.590032 -0.511943

2019-08-02 0.032420 0.750065 -1.168060 -1.571403

2019-08-03 0.962793 -2.377613 1.447871 -1.515988

2019-08-04 1.078565 1.780728 -0.060782 1.393749

2019-08-05 -1.785669 1.161425 0.440988 1.233997

2019-08-06 -0.740927 -0.877388 -0.868203 1.395331

Column B:

2019-08-01 -1.968381

2019-08-02 0.750065

2019-08-03 -2.377613

2019-08-04 1.780728

2019-08-05 1.161425

2019-08-06 -0.877388

Freq: D, Name: B, dtype: float64

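Slicing

print('Slicing:')

#positional slices exclude the end position, while label slices include the end label

print(df[0:3],df['20190801':'20190804'])

----------Output of the program above----------

Slicing: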

A B C D

2019-08-01 -0.456445 -1.641900 0.878254 -0.265412

2019-08-02 0.223910 -1.524222 0.428250 0.410542

2019-08-03 -1.248945 0.649155 -1.039407 0.138473

A B C D

2019-08-01 -0.456445 -1.641900 0.878254 -0.265412

2019-08-02 0.223910 -1.524222 0.428250 0.410542

2019-08-03 -1.248945 0.649155 -1.039407 0.138473

2019-08-04 -1.135849 1.404054 -0.771489 -0.685064
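The two results above differ in length because positional slices stop before the end position, while label slices include the end label. A minimal sketch of the same behavior, using iloc and loc explicitly and deterministic sample values instead of random ones:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

by_position = df.iloc[0:3]                    # end position 3 is excluded
by_label = df.loc['2019-08-01':'2019-08-04']  # end label is included

print(len(by_position))
print(len(by_label))
```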

Selecting by label with loc

print('Selecting data by row label:')

print(df.loc['2019-08-01',['A','B']])

----------Output of the program above----------

Selecting data by row label:

A -0.495304

B -0.083505

Name: 2019-08-01 00:00:00, dtype: float64

Selecting by position with iloc

import numpy as np

import pandas as pd

print('Value at row position 3, column position 1 (fourth row, second column):')

print(df.iloc[3,1])

print(' ')

print('Slice selection:')

print(df.iloc[3:5,0:2])

print(' ')

print('Non-contiguous selection:')

print(df.iloc[[1,2,4],[0,2]])

----------Output of the program above----------

Value at row position 3, column position 1 (fourth row, second column):

1.2355112660049548

Slice selection:

A B

2019-08-04 -0.943150 1.235511

2019-08-05 -0.245097 -1.272304

Non-contiguous selection:

A C

2019-08-02 -0.212743 -0.584698

2019-08-03 0.012863 -0.896789

2019-08-05 -0.245097 2.646507

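Mixed position and label selection (ix)

import numpy as np

import pandas as pd

#df.ix was deprecated in pandas 0.20 and removed in 1.0; the intended call was df.ix[:3,['A','C']]

#the equivalent with loc: first three rows by position, columns A and C by label

print(df.loc[df.index[:3],['A','C']])

----------Output of the program above----------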

A C

2019-08-01 1.591064 1.272731

2019-08-02 1.820216 0.657560

2019-08-03 0.358265 -1.197687
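On pandas 1.0 and later, where ix is no longer available, a mixed selection (rows by position, columns by label) can be written with either loc or iloc. The sketch below shows both forms on deterministic sample data; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn the row positions into labels, then use loc.
via_loc = df.loc[df.index[:3], ['A', 'C']]

# Option 2: turn the column labels into positions, then use iloc.
via_iloc = df.iloc[:3, df.columns.get_indexer(['A', 'C'])]

print(via_loc)
print(via_loc.equals(via_iloc))
```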

Filtering by condition

import numpy as np

import pandas as pd

print('Filtering by condition:')

print(df[df.A>0])

----------Output of the program above----------

Filtering by condition:

A B C D

2019-08-01 1.098786 0.261861 1.430775 -1.161001

2019-08-05 0.527853 -0.612058 -0.906565 1.279515
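Conditions can also be combined. The sketch below uses & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. The thresholds are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

both = df[(df.A > 4) & (df.D < 20)]    # rows where both conditions hold
either = df[(df.A < 4) | (df.D > 20)]  # rows where at least one condition holds

print(both)
print(either)
```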

6. Setting Values with Pandas

Setting values with loc and iloc

import numpy as np

import pandas as pd

dates=pd.date_range('2019-08-01',periods=6)

df=pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])

print('Data with 6 rows and 4 columns:')

print(df)

print(' ')

print('Data after the change:')

df.iloc[2,2]=999

df.loc['2019-08-01','D']=999

print(df)

print(' ')

----------Output of the program above----------

Data with 6 rows and 4 columns:

A B C D

2019-08-01 0 1 2 3

2019-08-02 4 5 6 7

2019-08-03 8 9 10 11

2019-08-04 12 13 14 15

2019-08-05 16 17 18 19

2019-08-06 20 21 22 23

Data after the change:

A B C D

2019-08-01 0 1 2 999

2019-08-02 4 5 6 7

2019-08-03 8 9 999 11

2019-08-04 12 13 14 15

2019-08-05 16 17 18 19

2019-08-06 20 21 22 23

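Setting values by condition

import numpy as np

import pandas as pd

print('Setting values by condition:')

#assigns 999 to every column of the rows where A>0

df[df.A>0]=999

print(df)

----------Output of the program above----------

Setting values by condition: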

A B C D

2019-08-01 0 1 2 999

2019-08-02 999 999 999 999

2019-08-03 999 999 999 999

2019-08-04 999 999 999 999

2019-08-05 999 999 999 999

2019-08-06 999 999 999 999
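The assignment above overwrites every column in the matching rows. To change only one column where the condition holds, the condition can be passed to loc together with a column label. A minimal sketch on fresh sample data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Only column B is changed, and only in rows where A > 0.
df.loc[df.A > 0, 'B'] = 999

print(df)
```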

Setting a whole row or column

import numpy as np

import pandas as pd

print('Setting a whole row or column:')

df['C']=np.nan

print(df)

----------Output of the program above----------

Setting a whole row or column:

A B C D

2019-08-01 0 1 NaN 999

2019-08-02 999 999 NaN 999

2019-08-03 999 999 NaN 999

2019-08-04 999 999 NaN 999

2019-08-05 999 999 NaN 999

2019-08-06 999 999 NaN 999

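Adding data

import numpy as np

import pandas as pd

print('Adding data:')

#the new Series is aligned on the index, so dates that it does not cover become NaN

df['E']=pd.Series([1,2,3,4,5,6],index=pd.date_range('2019-08-03',periods=6))

print(df)

----------Output of the program above----------

Adding data: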

A B C D E

2019-08-01 0 1 NaN 999 NaN

2019-08-02 999 999 NaN 999 NaN

2019-08-03 999 999 NaN 999 1.0

2019-08-04 999 999 NaN 999 2.0

2019-08-05 999 999 NaN 999 3.0

2019-08-06 999 999 NaN 999 4.0
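Column E starts with NaN because a Series is matched to the DataFrame by index label, not by position. A plain list, in contrast, is assigned by position. The sketch below contrasts the two; the column names from_list and from_series are hypothetical.

```python
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame({'A': list(range(6))}, index=dates)

# A list has no index, so values are assigned row by row.
df['from_list'] = [1, 2, 3, 4, 5, 6]

# This Series starts two days later; only overlapping dates receive a value.
shifted = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('2019-08-03', periods=6))
df['from_series'] = shifted

print(df)
```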

7. Handling Missing Data with Pandas

Handling NaN values in the data

import numpy as np

import pandas as pd

dates=pd.date_range('2019-08-01',periods=6)

df=pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])

df.iloc[0,1]=np.nan

df.iloc[1,2]=np.nan

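print('Data with 6 rows and 4 columns:')

print(df)

print(' ')

print('Dropping rows or columns that contain NaN with dropna():')

print(df.dropna(axis=0,how='any'))#axis=0 drops rows, axis=1 drops columns; how='any' drops when any NaN is present, how='all' only when every value is NaN

print(' ')

print('Replacing NaN values with fillna():')

print(df.fillna(value=0))#replace NaN values with 0

print(' ')

print('Checking for missing data with isnull():')

print(pd.isnull(df))

----------Output of the program above----------

Data with 6 rows and 4 columns: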

A B C D

2019-08-01 0 NaN 2.0 3

2019-08-02 4 5.0 NaN 7

2019-08-03 8 9.0 10.0 11

2019-08-04 12 13.0 14.0 15

2019-08-05 16 17.0 18.0 19

2019-08-06 20 21.0 22.0 23

Dropping rows or columns that contain NaN with dropna():

A B C D

2019-08-03 8 9.0 10.0 11

2019-08-04 12 13.0 14.0 15

2019-08-05 16 17.0 18.0 19

2019-08-06 20 21.0 22.0 23

Replacing NaN values with fillna():

A B C D

2019-08-01 0 0.0 2.0 3

2019-08-02 4 5.0 0.0 7

2019-08-03 8 9.0 10.0 11

2019-08-04 12 13.0 14.0 15

2019-08-05 16 17.0 18.0 19

2019-08-06 20 21.0 22.0 23

Checking for missing data with isnull():

A B C D

2019-08-01 False True False False

2019-08-02 False False True False

2019-08-03 False False False False

2019-08-04 False False False False

2019-08-05 False False False False

2019-08-06 False False False False
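The other options mentioned in the comments can be sketched as follows: dropping columns instead of rows, dropping only rows that are entirely NaN, and filling each column with a different value. The fill values here (-1 and the column mean) are arbitrary choices for illustration, and float data is used so that NaN can be stored without changing the dtype.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-08-01', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

cols_kept = df.dropna(axis=1, how='any')  # drops columns B and C
rows_kept = df.dropna(axis=0, how='all')  # no row is entirely NaN, so nothing is dropped
filled = df.fillna({'B': -1, 'C': df['C'].mean()})  # a different fill value per column

print(df.isna().sum())  # number of missing values in each column
print(filled)
```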

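8. Importing and Exporting Data with Pandas

Pandas can read and write data in formats such as csv, excel, json, html, and pickle. See the official documentation for details.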

import numpy as np

import pandas as pd

print('Reading a CSV file:')

data=pd.read_csv('test2.csv')

print(data)

print('Saving the data as a pickle file:')

print(data.to_pickle('test3.pickle'))#to_pickle returns None, so None is printed

----------Output of the program above----------

Reading a CSV file:

A B C D

0 1 1 1 1

1 2 2 2 2

2 3 3 3 3

Saving the data as a pickle file:

None
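The example above depends on an existing test2.csv. A self-contained sketch of a full round trip is shown below; it writes to a temporary directory, and the file names and sample values are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'example.csv')     # hypothetical file name
    pkl_path = os.path.join(tmp, 'example.pickle')  # hypothetical file name

    # index=False keeps the row index out of the CSV file.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

print(from_csv.equals(df))
print(from_pickle.equals(df))
```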

9. Concatenating Data with Pandas

Concatenation direction (axis)

import numpy as np

import pandas as pd

df1=pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])

df2=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])

df3=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])

res=pd.concat([df1,df2,df3],axis=0,ignore_index=True)#axis=0 concatenates vertically, axis=1 horizontally; ignore_index=True resets the index to 0 through 8

print(res)

----------Output of the program above----------

a b c d

0 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0

3 1.0 1.0 1.0 1.0

4 1.0 1.0 1.0 1.0

5 1.0 1.0 1.0 1.0

6 2.0 2.0 2.0 2.0

7 2.0 2.0 2.0 2.0

8 2.0 2.0 2.0 2.0
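For comparison, the sketch below concatenates two small frames in both directions. The frames use different column names (an assumption made for this illustration) so that the effect of each axis is visible in the resulting shape.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 2)), columns=['c', 'd'])

# axis=0 stacks rows; columns missing from one frame are filled with NaN.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places the frames side by side, matching rows on the index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```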

Join mode (join)

import numpy as np

import pandas as pd

df1=pd.DataFrame(np.ones((3,4))*0,columns=['A','B','C','D'],index=[1,2,3])

df2=pd.DataFrame(np.ones((3,4))*1,columns=['B','C','D','E'],index=[2,3,4])

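print('The first DataFrame:')

print(df1)

print(' ')

print('The second DataFrame:')

print(df2)

print(' ')

print("join='outer': keep all rows, like a full outer join")

res=pd.concat([df1,df2],axis=1,join='outer')

print(res)

print(' ')

print("join='inner': keep only the shared rows, like an inner join")

res2=pd.concat([df1,df2],axis=1,join='inner')

print(res2)

print(' ')

print('Align on the index of df1, like a left join')

#the join_axes argument was removed in pandas 1.0; reindexing on df1.index gives the same result

res3=pd.concat([df1,df2],axis=1).reindex(df1.index)

print(res3)

----------Output of the program above----------

The first DataFrame: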

A B C D

1 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0

3 0.0 0.0 0.0 0.0

The second DataFrame:

B C D E

2 1.0 1.0 1.0 1.0

3 1.0 1.0 1.0 1.0

4 1.0 1.0 1.0 1.0

join='outer': keep all rows, like a full outer join

A B C D B C D E

1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN

2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0

join='inner': keep only the shared rows, like an inner join

A B C D B C D E

2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

Align on the index of df1, like a left join

A B C D B C D E

1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN

2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
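The join argument always applies to the axis that is not being concatenated. In the examples above (axis=1) it decides which rows are kept; with axis=0 it decides which columns are kept. A minimal sketch of the axis=0 case with the same two frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=['A', 'B', 'C', 'D'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=['B', 'C', 'D', 'E'], index=[2, 3, 4])

# Rows are stacked; join controls how the column labels are combined.
outer = pd.concat([df1, df2], axis=0, join='outer', ignore_index=True)  # union of columns
inner = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)  # shared columns only

print(list(outer.columns))
print(list(inner.columns))
```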

Appending data (append)

import numpy as np

import pandas as pd

df1=pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])

df2=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])

df3=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])

s1=pd.Series([1,2,3,4],index=['a','b','c','d'])

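print('Append df2 below df1 and reset the index')

#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result

res=pd.concat([df1,df2],ignore_index=True)

print(res)

print(' ')

print('Append s1 below df1 and reset the index')

#the Series is first turned into a one-row DataFrame

res2=pd.concat([df1,s1.to_frame().T],ignore_index=True)

print(res2)

----------Output of the program above----------

Append df2 below df1 and reset the index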

a b c d

0 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0

3 1.0 1.0 1.0 1.0

4 1.0 1.0 1.0 1.0

5 1.0 1.0 1.0 1.0

Append s1 below df1 and reset the index

a b c d

0 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0

3 1.0 2.0 3.0 4.0

10. Merging Data with merge

Merging on one key

import pandas as pd

left=pd.DataFrame({'key':['k0','k1','k2','k3'],

'A':['A0','A1','A2','A3'],

'B':['B0','B1','B2','B3']})

print('The first DataFrame:')

print(left)

print(' ')

right=pd.DataFrame({'key':['k0','k1','k2','k3'],

'C':['C0','C1','C2','C3'],

'D':['D0','D1','D2','D3']})

print('The second DataFrame:')

print(right)

print(' ')

print('Merge on key:')

res=pd.merge(left,right,on='key')

print(res)

----------Output of the program above----------

The first DataFrame:

key A B

0 k0 A0 B0

1 k1 A1 B1

2 k2 A2 B2

3 k3 A3 B3

The second DataFrame:

key C D

0 k0 C0 D0

1 k1 C1 D1

2 k2 C2 D2

3 k3 C3 D3

Merge on key:

key A B C D

0 k0 A0 B0 C0 D0

1 k1 A1 B1 C1 D1

2 k2 A2 B2 C2 D2

3 k3 A3 B3 C3 D3

Merging on two keys

import pandas as pd

left=pd.DataFrame({'key1':['k0','k1','k2','k3'],

'key2':['k0','k1','k0','k1'],

'A':['A0','A1','A2','A3'],

'B':['B0','B1','B2','B3']})

print('The first DataFrame:')

print(left)

print(' ')

right=pd.DataFrame({'key1':['k0','k1','k2','k3'],

'key2':['k0','k0','k0','k0'],

'C':['C0','C1','C2','C3'],

'D':['D0','D1','D2','D3']})

print('The second DataFrame:')

print(right)

print(' ')

print('Inner merge')

res=pd.merge(left,right,on=['key1','key2'],how='inner')

print(res)

print(' ')

print('Outer merge')

res2=pd.merge(left,right,on=['key1','key2'],how='outer')

print(res2)

print(' ')

print('Left merge')

res3=pd.merge(left,right,on=['key1','key2'],how='left')

print(res3)

print(' ')

print('Right merge')

res4=pd.merge(left,right,on=['key1','key2'],how='right')

print(res4)

----------Output of the program above----------

The first DataFrame:

key1 key2 A B

0 k0 k0 A0 B0

1 k1 k1 A1 B1

2 k2 k0 A2 B2

3 k3 k1 A3 B3

The second DataFrame:

key1 key2 C D

0 k0 k0 C0 D0

1 k1 k0 C1 D1

2 k2 k0 C2 D2

3 k3 k0 C3 D3

Inner merge

key1 key2 A B C D

0 k0 k0 A0 B0 C0 D0

1 k2 k0 A2 B2 C2 D2

Outer merge

key1 key2 A B C D

0 k0 k0 A0 B0 C0 D0

1 k1 k1 A1 B1 NaN NaN

2 k2 k0 A2 B2 C2 D2

3 k3 k1 A3 B3 NaN NaN

4 k1 k0 NaN NaN C1 D1

5 k3 k0 NaN NaN C3 D3

Left merge

key1 key2 A B C D

0 k0 k0 A0 B0 C0 D0

1 k1 k1 A1 B1 NaN NaN

2 k2 k0 A2 B2 C2 D2

3 k3 k1 A3 B3 NaN NaN

Right merge

key1 key2 A B C D

0 k0 k0 A0 B0 C0 D0

1 k2 k0 A2 B2 C2 D2

2 k1 k0 NaN NaN C1 D1

3 k3 k0 NaN NaN C3 D3
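A quick way to compare the four modes is to count the rows each one returns. The sketch below reuses the same left and right frames; the row_counts dictionary is a helper introduced only for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['k0', 'k1', 'k2', 'k3'],
                     'key2': ['k0', 'k1', 'k0', 'k1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['k0', 'k1', 'k2', 'k3'],
                      'key2': ['k0', 'k0', 'k0', 'k0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Only the key pairs (k0, k0) and (k2, k0) appear in both frames.
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on=['key1', 'key2'], how=how)
    row_counts[how] = len(merged)

print(row_counts)
```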

Merging with an indicator

import pandas as pd

df1=pd.DataFrame({'col1':[0,1],'col_left':['a','b']})

df2=pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})

print('The first DataFrame:')

print(df1)

print(' ')

print('The second DataFrame:')

print(df2)

print(' ')

print('Merge on col1 with indicator=True to show the source of each row:')

res=pd.merge(df1,df2,on='col1',how='outer',indicator=True)

print(res)

print(' ')

----------Output of the program above----------

The first DataFrame:

col1 col_left

0 0 a

1 1 b

The second DataFrame:

col1 col_right

0 1 2

1 2 2

2 2 2

Merge on col1 with indicator=True to show the source of each row:

col1 col_left col_right _merge

0 0 a NaN left_only

1 1 b 2.0 both

2 2 NaN 2.0 right_only

3 2 NaN 2.0 right_only
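One practical use of the _merge column is to find rows that exist in only one of the two frames. The sketch below keeps the rows of df1 that have no match in df2; this filtering step is an addition for illustration and is not part of the example above.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# Keep only rows that came from df1 alone, then drop the helper column.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(left_only)
```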

Merging on the index

import numpy as np

import pandas as pd

left=pd.DataFrame({'A':['A0','A1','A2'],

'B':['B0','B1','B2']},

index=['k0','k1','k2'])

right=pd.DataFrame({'C':['C0','C1','C2'],

'D':['D0','D1','D2']},

index=['k0','k2','k3']

)

print('The first DataFrame:')

print(left)

print(' ')

print('The second DataFrame:')

print(right)

print(' ')

print('Merge on the index with an outer join')

res=pd.merge(left,right,left_index=True,right_index=True,how='outer')

print(res)

print(' ')

print('Merge on the index with an inner join')

res2=pd.merge(left,right,left_index=True,right_index=True,how='inner')

print(res2)

print(' ')

----------Output of the program above----------

The first DataFrame:

A B

k0 A0 B0

k1 A1 B1

k2 A2 B2

The second DataFrame:

C D

k0 C0 D0

k2 C1 D1

k3 C2 D2

Merge on the index with an outer join

A B C D

k0 A0 B0 C0 D0

k1 A1 B1 NaN NaN

k2 A2 B2 C1 D1

k3 NaN NaN C2 D2

Merge on the index with an inner join

A B C D

k0 A0 B0 C0 D0

k2 A2 B2 C1 D1
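For index-based merges, DataFrame.join offers a shorter spelling. The sketch below checks that, for these two frames, it returns the same rows as the merge call with left_index=True and right_index=True; both results are sorted by index before comparing so that row order does not matter.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['k0', 'k1', 'k2'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                      'D': ['D0', 'D1', 'D2']},
                     index=['k0', 'k2', 'k3'])

via_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer')
via_join = left.join(right, how='outer')  # joins on the index by default

print(via_join)
print(via_merge.sort_index().equals(via_join.sort_index()))
```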
