【python】pandas

最新推荐文章于 2024-05-29 16:21:04 发布

bryant_meng

最新推荐文章于 2024-05-29 16:21:04 发布

阅读量502

点赞数 1

分类专栏： Python 文章标签： python 机器学习

本文链接：https://blog.csdn.net/bryant_meng/article/details/108521819

版权

Python 专栏收录该内容

107 篇文章 7 订阅

订阅专栏

在这里插入图片描述

本博客为 Numpy & Pandas 莫烦 python 数据处理的个人学习笔记！

numpy 的相关介绍可以参考【python】numpy

最后一次更新时间为：2018-12-10

0 前言：

如果用 python 的列表和字典来作比较, 那么可以说 Numpy 是列表形式的，没有数值标签，而 Pandas 就是字典形式。Pandas是基于Numpy构建的，让Numpy为中心的应用变得更加简单。

要使用pandas，首先需要了解他主要两个数据结构：Series和DataFrame。

参考：https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/3-1-pd-intro/

1 Series

import pandas as pd
import numpy as np

s = pd.Series([1,3,6,np.nan,44,1])
s

output

0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64

Series 的字符串表现形式为：索引在左边，值在右边。由于我们没有为数据指定索引。于是会自动创建一个0到N-1（N为长度）的整数型索引。

2 DataFrame

DataFrame 是一个表格型的数据结构，它包含有一组有序的列，每列可以是不同的值类型（数值，字符串，布尔值等）。DataFrame 既有行索引也有列索引，它可以被看做由Series组成的大字典。

dates = pd.date_range('20181129',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
df

output

DatetimeIndex(['2018-11-29', '2018-11-30', '2018-12-01', '2018-12-02',
               '2018-12-03', '2018-12-04'],
              dtype='datetime64[ns]', freq='D')

在这里插入图片描述

df = pd.DataFrame(np.random.randn(3,2))
df

output
在这里插入图片描述

2.1 dtypes / index / columns / values

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})
df2

output
在这里插入图片描述

dtypes

df2.dtypes

output

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

index

df2.index # 行索引

output

Int64Index([0, 1, 2, 3], dtype='int64')

columns

df2.columns # 列索引

output

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

values

df2.values # 值

output

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

2.2 describe / T

df2.describe()

在这里插入图片描述

df2.T

在这里插入图片描述

2.3 sort_index

df2.sort_index(axis=1,ascending=False) #列降序

在这里插入图片描述

df2.sort_index(axis=0,ascending=False) # 行降序

在这里插入图片描述

df2.sort_values(by='E') # E列的值排序

在这里插入图片描述

3 Pandas 选择数据

3.1 简单的筛选

import pandas as pd
import numpy as np

dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df

output
在这里插入图片描述
取 A 列，两种等价的写法

print(df['A'],'\n')
print(df.A)

output

2018-11-29     0
2018-11-30     4
2018-12-01     8
2018-12-02    12
2018-12-03    16
2018-12-04    20
Freq: D, Name: A, dtype: int32 

2018-11-29     0
2018-11-30     4
2018-12-01     8
2018-12-02    12
2018-12-03    16
2018-12-04    20
Freq: D, Name: A, dtype: int32

取前 3 行两种等价的写法

print(df[0:3],'\n')
print(df['20181129':'20181201'])

output

            A  B   C   D
2018-11-29  0  1   2   3
2018-11-30  4  5   6   7
2018-12-01  8  9  10  11 

            A  B   C   D
2018-11-29  0  1   2   3
2018-11-30  4  5   6   7
2018-12-01  8  9  10  11

3.2 loc

在这里插入图片描述

# select by label:loc
print(df.loc['20181130'],'\n') #　index
print(df.loc['20181130',['A','B']],'\n') #　index
print(df.loc[:,['A','B']])

output

A    4
B    5
C    6
D    7
Name: 2018-11-30 00:00:00, dtype: int32 

A    4
B    5
Name: 2018-11-30 00:00:00, dtype: int32 

             A   B
2018-11-29   0   1
2018-11-30   4   5
2018-12-01   8   9
2018-12-02  12  13
2018-12-03  16  17
2018-12-04  20  21

3.3 iloc

在这里插入图片描述

# select by positio: iloc
print(df.iloc[3],'\n')
print(df.iloc[3,1],'\n')
print(df.iloc[3:5,1:3],'\n')
print(df.iloc[[1,3,5],1:3])

output

A    12
B    13
C    14
D    15
Name: 2018-12-02 00:00:00, dtype: int32 

13 

             B   C
2018-12-02  13  14
2018-12-03  17  18 

             B   C
2018-11-30   5   6
2018-12-02  13  14
2018-12-04  21  22

3.4 ix

在这里插入图片描述
混合上面两种写法

# mixed selection: ix
print(df.ix[:3,['A','C']],'\n')
print(df.ix['20181129':'20181201',[0,2]])

output

            A   C
2018-11-29  0   2
2018-11-30  4   6
2018-12-01  8  10 

            A   C
2018-11-29  0   2
2018-11-30  4   6
2018-12-01  8  10

总结，跨行的话用 [] 框出来[[X,Y],Z]，索引的话不用框出来， [X,Y] 即可

3.5 Boolean indexing

在这里插入图片描述

# boolean indexing
print(df,'\n')
print(df.A>8,'\n') # 对A列进行选择，返回 A 列的是 True 和 False
print(df[df.A>8]) # 返回 True 的数据

output

             A   B   C   D
2018-11-29   0   1   2   3
2018-11-30   4   5   6   7
2018-12-01   8   9  10  11
2018-12-02  12  13  14  15
2018-12-03  16  17  18  19
2018-12-04  20  21  22  23 

2018-11-29    False
2018-11-30    False
2018-12-01    False
2018-12-02     True
2018-12-03     True
2018-12-04     True
Freq: D, Name: A, dtype: bool 

             A   B   C   D
2018-12-02  12  13  14  15
2018-12-03  16  17  18  19
2018-12-04  20  21  22  23

4 Pandas 设置值

4.1 根据位置设置 loc 和 iloc

import pandas as pd
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df

在这里插入图片描述

df.iloc[2,2] = 1111
df.loc['2018-12-02','B'] = 222
df

在这里插入图片描述

4.2 根据条件设置

import pandas as pd
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
print(df,'\n')
df[df.A>4] = 0
print(df)

output
在这里插入图片描述
把A>4的整行都变成了0

只筛选A的话，用如下的方式

import pandas as pd
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.A[df.A>4] = 0
df

在这里插入图片描述
也可以这样，把B列符合筛选条件的值变为0

import pandas as pd
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.B[df.A>4] = 0 # B列中，A列大于0的都变成0
df

在这里插入图片描述

4.3 按行或列设置

import pandas as pd
import numpy as np

dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.loc['2018-11-29',:] = np.nan # 整行都设置为nan
df.loc[:,'A'] = np.nan # 整列都设置为 nan
df.iloc[2,2] = np.nan # 设置某一个位置的值为nan
df

在这里插入图片描述
新增行列

import pandas as pd
import numpy as np

dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df['F'] = np.nan # 添加列
df.loc['2018-12-05'] = np.nan # 添加行
df

在这里插入图片描述

5 Pandas 处理丢失数据

5.1创建含 NaN 的矩阵

import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df

在这里插入图片描述

5.2 pd.dropna()

如果想直接去掉有 NaN 的行或列, 可以使用 dropna
1）去掉有 nan 的所有行

df.dropna(axis=0)

在这里插入图片描述
2）去掉有 nan 的所有列

df.dropna(axis=1)

在这里插入图片描述
3）how的设置
默认为 any，行列中只要有nan就删掉，也可以换成 all，所有的行或者列为nan才删掉

import pandas as pd
import numpy as np

dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df['F'] = np.nan
df.iloc[0,-1] = 0
print(df)
print(df.dropna(axis=1,how='any'))
print(df.dropna(axis=1,how='all'))

在这里插入图片描述

5.3 pd.fillna()

如果是将 NaN 的值用其他值代替, 比如代替成 0

import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df.fillna(value=0)

在这里插入图片描述

5.4 isnull()

判断是否有缺失数据 NaN, 为 True 表示缺失数据

import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df.isnull()

在这里插入图片描述

结合 np.any() 使用会更好

import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
np.any(df.isnull()==True)

ouput

True

6 Pandas 文件导入导出

很简单便捷，导入都用read_XXX，导出都用to_XXX
http://pandas.pydata.org/pandas-docs/stable/io.html
在这里插入图片描述
新建一个 excel 试验下

1）导入

import pandas as pd
data = pd.read_excel('C://Users/Administrator/Desktop/1.xlsx')
print(data)

output

   StudentID name  age genda
0          1    A   18     男
1          2    B   19     女

会默认给你添加 index
2）导出

data.to_pickle('C://Users/Administrator/Desktop/student.pickle')

在指定目录下会有student.pickle文件生成，方便。

7 Pandas 合并 concat

pandas处理多组数据的时候往往会要用到数据的合并处理,使用 concat是一种基本的合并方式.而且concat中有很多参数可以调整,合并成你想要的数据形式.

7.1 axis (合并方向)

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((2,2))*0, columns=['a','b'])
df2 = pd.DataFrame(np.ones((2,2))*1, columns=['a','b'])
df3 = pd.DataFrame(np.ones((2,2))*2, columns=['a','b'])

在这里插入图片描述
concat 默认 axis = 0

res = pd.concat([df1,df2,df3],axis=0,ignore_index=True) # index 没有变
res

在这里插入图片描述

res = pd.concat([df1,df2,df3],axis=1,ignore_index=True) # index 没有变
res

在这里插入图片描述

7.2 join (合并方式)

join = ['inner','outer']

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

在这里插入图片描述

1）inner 只合并相同的index

res = pd.concat([df1,df2],axis=0,join = 'inner',ignore_index=True)
res

在这里插入图片描述

res = pd.concat([df1,df2],axis=1,join = 'inner',ignore_index=True)
res

在这里插入图片描述
2）outer，无脑合并，没有的补nan

res = pd.concat([df1,df2],axis=0,join = 'outer',ignore_index=True)
res

在这里插入图片描述

res = pd.concat([df1,df2],axis=1,join = 'outer',ignore_index=True)
res

在这里插入图片描述

7.3 join_axes (依照 axes 合并)

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

在这里插入图片描述

res = pd.concat([df1,df2],axis = 1)
res

在这里插入图片描述
依照 df1.index进行横向合并

res = pd.concat([df1,df2],axis = 1,join_axes=[df1.index])
res

在这里插入图片描述

依照 df1.columns进行纵向合并

res = pd.concat([df1,df2],axis = 0,join_axes=[df1.columns])
res

在这里插入图片描述

7.4 append

append 只有纵向合并，没有横向合并。

df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
s1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
s2 = pd.Series([1,2,3,4], index=['a','b','c','d'])

1）合并 df1 和 df2

res = df1.append(df2,ignore_index=True)
res

在这里插入图片描述
2）合并 df1 ,df2和df3

res = df1.append([df2,df3],ignore_index=True)
res

在这里插入图片描述
3）合并一行数据

res = df1.append(s1,ignore_index=True)
res

在这里插入图片描述

8 Pandas 合并 merge

pandas中的merge和concat类似,但主要是用于两组有key column的数据,统一索引的数据. 通常也被用在Database的处理当中.

8.1 根据某一列合并（on）

import pandas as pd
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

在这里插入图片描述
根据 key合并

res = pd.merge(left,right,on = 'key')
res

在这里插入图片描述

8.2 根据某二列合并（on）

import pandas as pd
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})

在这里插入图片描述
依据key1与key2 columns进行合并，并打印出四种结果[‘left’, ‘right’, ‘outer’, ‘inner’]，默认设置的是'inner'

inner
内连接，取交集

res = pd.merge(left,right,on=['key1','key2'],how = 'inner')
res

在这里插入图片描述
注意left frame中的 A2 B2 被匹配了两次

outer
外链接，取并集，并用nan填充

res = pd.merge(left,right,on=['key1','key2'],how = 'outer')
res

在这里插入图片描述
没有的用NaN补充

left
左连接，左侧DataFrame取全部，右侧DataFrame取部分

res = pd.merge(left,right,on=['key1','key2'],how = 'left')
res

在这里插入图片描述

right
右连接，右侧DataFrame取全部，左侧DataFrame取部分

res = pd.merge(left,right,on=['key1','key2'],how = 'right')
res

在这里插入图片描述

8.3 indicator=True

indicator=True会将合并的记录放在新的一列。

import pandas as pd
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})

在这里插入图片描述

res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
res

在这里插入图片描述
DIY 最后一列的名字，默认为_merge

res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
res

在这里插入图片描述

8.4 依据index合并（left_index / right_index）

import pandas as pd
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                     index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

在这里插入图片描述

注意 left_index 和 right_index 必须是 True

outer

res = pd.merge(left, right, left_index=True, right_index=True, how='outer',indicator='indicator_column')
res

在这里插入图片描述

inner

res = pd.merge(left, right, left_index=True, right_index=True, how='inner',indicator='indicator_column')
res

在这里插入图片描述

left

res = pd.merge(left, right, left_index=True, right_index=True, how='left')
res

在这里插入图片描述

right

res = pd.merge(left, right, left_index=True, right_index=True, how='right')
res

在这里插入图片描述

8.5 解决overlapping的问题（suffixes）

import pandas as pd
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})

在这里插入图片描述
有两个age

res = pd.merge(boys, girls, on='k', how='inner')
res

在这里插入图片描述
系统会默认_x,_y，我们用suffixes 来改下名字

res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
res

在这里插入图片描述

9 Pandas plot 出图

padans 画图官方文档
http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# plot data
#Series
data = pd.Series(np.random.randn(1000),index=np.arange(1000))
# 为了方便观看效果, 我们累加这个数据
data = data.cumsum()
data.plot()
plt.show()

在这里插入图片描述

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# plot data
#Series
data = pd.DataFrame(np.random.randn(1000,4),
                    index=np.arange(1000),
                    columns=['A','B','C','D'])
data = data.cumsum()
data.plot()
# plot methods:
# bar, hist,box,kde,area,scatter,hexbin,pie
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class 1')
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class 2',ax = ax)
plt.show()

在这里插入图片描述

10 补充

使用 Lambda 来修改Pandas 数据框中的值

import pandas as pd
data = [[1,2,3],
       [4,5,6],
       [7,8,9]]

df = pd.DataFrame(data, columns=[0,1,2])
print(df)

def add_num(x):
    return f"{x}01"

df.loc[:,0] = df.loc[:,0].apply(add_num)
print(df)

output

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
     0  1  2
0  101  2  3
1  401  5  6
2  701  8  9

简洁一点的写法为

df.loc[:,0] = df.loc[:,0].apply(lambda x:f"{x}01")

修改行则用

df.loc[0,:] = df.loc[0,:].apply(lambda x:f"{x}01")
print(df)

output

     0    1    2
0  101  201  301
1    4    5    6
2    7    8    9

再看一个列子

import pandas as pd
data = [["a",'2','3'],
       ["b",'5','6'],
       ["c",'8','9']]

df = pd.DataFrame(data, columns=[0,1,2])
print(df)

df.loc[:,0] = df.loc[:,0].apply(lambda x:x.title())
print(df)

output

bryant_meng

关注

1
点赞
踩
11

收藏

觉得还不错? 一键收藏
0
评论
【python】pandas

本博客为 Numpy & Pandas 莫烦 python 数据处理的个人学习笔记！numpy 的相关介绍可以参考【python】numpy最后一次更新时间为：2018-12-10文章目录0 前言：1 Series2 DataFrame2.1 dtypes / index / columns / values2.2 describe / T2.3 sort_index3 Pandas 选择数据3.1 简单的筛选3.2 loc3.3 iloc3.4 ix3.5 Boolean index.
复制链接

扫一扫