Part 76: Data Processing and Analysis with Pandas

Latest recommended article published 2024-04-14 15:42:25

Laughing@me

Category: Data Analysis. Tags: python, pandas

Article link: https://blog.csdn.net/qq_45503700/article/details/105776257

Official pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

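I. Introduction

What pandas does:

pandas is a powerful Python toolkit for data analysis.
pandas is built on top of NumPy.

Main features of pandas:

Provides the data structures DataFrame and Series that its functionality is built on
Integrated time series functionality
A rich set of mathematical operations
Flexible handling of missing data

Installation:

pip install pandas

Import: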

import pandas as pd

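II. Using Series

A Series is an object similar to a one-dimensional array, made up of two parts:

values: the data (an ndarray)
index: the index labels associated with the data

1. Creating a Series

From a list or a NumPy array

#The examples below assume these imports
import numpy as np
from pandas import Series, DataFrame

#Create a Series from a list
Series(data=[1,2,3,4],index=['ds','dsa','re','gr'],name='haha')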
ds     1
dsa    2
re     3
gr     4
Name: haha, dtype: int64

#Create a Series from a NumPy array
Series(data=np.arange(10,60,6))
0    10
1    16
2    22
3    28
4    34
5    40
6    46
7    52
8    58
dtype: int32

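From a dictionary: the keys become the index labels, so a separate index is normally not passed (if one is passed, it selects and reorders the keys). Positional (implicit) indexing is still available.
Note: the data source must be one-dimensional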

dic = {
    'math':100,
    'English':50
}
Series(data=dic,name='qimo')
math       100
English     50
Name: qimo, dtype: int64
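
A minimal sketch of how dictionary keys map to index labels. The `index` argument and the extra label 'art' are illustrative additions, not part of the original example; a label that is missing from the dictionary should come out as NaN.

```python
import pandas as pd

scores = {'math': 100, 'English': 50}

# Keys become the index labels.
s = pd.Series(data=scores, name='qimo')

# An explicit index selects and reorders keys; 'art' is not in the dict, so it becomes NaN.
reordered = pd.Series(data=scores, index=['English', 'math', 'art'])
print(reordered)
```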

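2. Indexing and slicing a Series

Explicit (label-based) indexing:

Use the elements of index as the index value
Use s.loc[] (recommended): note that what goes inside the loc brackets must be an explicit index (a label)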

s = Series(np.random.randint(60,100,size=(5,)),index=['a','b','c','d','e'])
a    85
b    65
c    64
d    99
e    93
dtype: int32
s['b']
#Output
65

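Implicit (position-based) indexing:

Use integers as the index value
Use .iloc[] (recommended): what goes inside the iloc brackets must be an implicit index (an integer position)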

s.iloc[1]
#Output
65
s.iloc[[1,2,3]]
#Output
b    65
c    64
d    99
dtype: int32

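Slicing: implicit index slicing and explicit index slicing

Explicit index slicing: index labels and loc. A label slice includes both endpoints.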

s.loc['a':'c']
a    85
b    65
c    64
dtype: int32

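Implicit index slicing: integer positions and iloc. A positional slice excludes the stop position.

#s now holds a new set of random values
s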
a    61
b    73
c    73
d    89
e    80
dtype: int32
s[1:3]
#Output
b    73
c    73
dtype: int32
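
The difference between the two slicing styles is easy to miss, so here is a small sketch using the values from the first example: a label slice with loc keeps the end label, while a positional slice with iloc leaves the stop position out.

```python
import pandas as pd

s = pd.Series([85, 65, 64, 99, 93], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['a':'c']      # label slice: both endpoints included
by_position = s.iloc[0:2]      # positional slice: stop position excluded
print(list(by_label.index))
print(list(by_position.index))
```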

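3. Basic concepts of Series

A Series can be thought of as a fixed-length ordered dictionary

Adding a row to a Series is like adding a key-value pair to a dictionary

s['f'] = 100

Inspecting Series attributes

shape, size, index, values

Removing duplicate elements from a Series

s = Series(data=[1,1,2,2,3,3,4,4,4,5,6,7,8,7,7,66,43,342,6665,444,333,444])
s.unique()   #returns the distinct values of the Series as an array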

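4. Missing values in a Series

When an index label has no corresponding value, the missing data shows up as NaN (not a number).
For example, adding two Series whose indexes only partly overlap: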

s1 = Series(data=[1,2,3,4,5],index=['a','b','c','d','e'])
s2 = Series(data=[1,2,3,4,5],index=['a','b','c','f','g'])
s = s1+s2
s
#Output
a    2.0
b    4.0
c    6.0
d    NaN
e    NaN
f    NaN
g    NaN
dtype: float64

Missing data can be detected with pd.isnull() and pd.notnull(), or with the methods s.isnull() and s.notnull()

s.iloc[[True,False,True,True,False,True,True]]   #a list of True/False values can be used to index a Series
#Output
a    2.0
c    6.0
d    NaN
f    NaN
g    NaN
dtype: float64

s.loc[s.notnull()]   #detect null values and filter them out
#Output
a    2.0
b    4.0
c    6.0
dtype: float64
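
As a complement to filtering after the fact, the sketch below shows two other ways to deal with the NaNs produced by mismatched indexes. It reuses s1 and s2 from above; the choice of 0 as the fill value is an assumption for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'f', 'g'])

# Treat a label that is missing on one side as 0 instead of producing NaN.
filled_sum = s1.add(s2, fill_value=0)

# Plain addition keeps the NaNs; dropna() then removes them.
clean_sum = (s1 + s2).dropna()
print(filled_sum)
print(clean_sum)
```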

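III. Using DataFrame

A DataFrame is a tabular data structure made up of multiple columns of data in a fixed order. It was designed to extend the use cases of Series from one dimension to several. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values

1. Creating a DataFrame

Creating a DataFrame from a NumPy array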

DataFrame(data=np.random.randint(60,100,size=(3,3)),index=['a','b','c'],columns=['A','B','C'])
#Output
	A	B	C
a	99	69	99
b	91	73	75
c	64	94	74

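Creating a DataFrame from a dictionary
A DataFrame uses the dictionary keys as the names of the columns and the dictionary values (each an array) as the column data.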

dic = {
    'java':[60,70,80],
    'python':[100,100,100]
}
DataFrame(data=dic,index=['zhangsan','lisi','wangwu'])
#Output
	java	python
zhangsan	60	100
lisi	70	100
wangwu	80	100

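2. Indexing and slicing a DataFrame

Indexing columns
    Dictionary style: df['q']
    Attribute style: df.q
Indexing rows
    Use .loc[] with an index label
    Use .iloc[] with an integer position
Indexing a single element
    Through the column index (select the column, then the row)
    Through the row index (iloc[3,1] or loc['C','q']): the row index comes first, the column index second
Slicing
    With plain square brackets:
        a single label selects a column
        a slice selects rows
    With loc and iloc, rows and columns can both be sliced: df.loc['B':'C','丙':'丁']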
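
The rules above can be sketched on the small score table from the dictionary example. The variable names are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'java': [60, 70, 80], 'python': [100, 100, 100]},
                  index=['zhangsan', 'lisi', 'wangwu'])

column = df['java']                      # plain brackets with a label: a column
row = df.loc['lisi']                     # loc with a label: a row
cell = df.loc['lisi', 'java']            # row label first, column label second
cell_by_position = df.iloc[1, 0]         # same cell by integer position
first_two_rows = df[0:2]                 # plain brackets with a slice: rows
block = df.loc['zhangsan':'lisi', 'java':'python']   # slice rows and columns together
print(block)
```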

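3. DataFrame arithmetic and missing values

Arithmetic
Data with different indexes is aligned automatically during arithmetic
Where the indexes do not match, the result is filled with NaN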

df2=DataFrame(data=np.random.randint(60,150,size=(3,2)),index=['a','b','c'],columns=['A','B'])
df3=DataFrame(data=np.random.randint(60,150,size=(3,3)),index=['a','b','c'],columns=['A','B','c'])
df3+df2
#Output
	A	B	c
a	220	211	NaN
b	185	162	NaN
c	178	139	NaN

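Handling missing values
pandas treats both None and np.nan as missing data; in a numeric column None is converted to np.nan.
There are two kinds of missing data:
None: Python's built-in null object (type NoneType); an array that holds it has dtype object. None cannot take part in arithmetic.
np.nan (NaN): np.nan is a float, so it can take part in arithmetic, but the result of plain arithmetic with it is always NaN. pandas reductions such as sum() skip NaN by default.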
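
A short sketch of the difference between the two missing-value markers. The sample Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# np.nan is a float, so arithmetic works but propagates NaN.
nan_plus_one = np.nan + 1

# None is not a number; arithmetic on it raises TypeError.
try:
    None + 1
    none_is_addable = True
except TypeError:
    none_is_addable = False

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
s = pd.Series([1, None, 3])
total = s.sum()   # pandas reductions skip NaN by default
print(s.dtype, total)
```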

>>> df
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
1  46  49.0  54   NaN  60  15.0  32
2  39   NaN  50  13.0  16  82.0  64
3  11  84.0  51  67.0  29  61.0  13
4  78  21.0  69  38.0  35   NaN  51

isnull()
Combine with any() to find the columns that contain at least one missing value

>>> df.isnull()
       0      1      2      3      4      5      6
0  False  False  False  False  False  False  False
1  False  False  False   True  False  False  False
2  False   True  False  False  False  False  False
3  False  False  False  False  False  False  False
4  False  False  False  False  False   True  False
>>> df.isnull().any()
0    False
1     True
2    False
3     True
4    False
5     True
6    False
dtype: bool

notnull()
Combine with all() to find the columns that contain no missing values

>>> df.notnull()
      0      1     2      3     4      5     6
0  True   True  True   True  True   True  True
1  True   True  True  False  True   True  True
2  True  False  True   True  True   True  True
3  True   True  True   True  True   True  True
4  True   True  True   True  True  False  True

>>> df.notnull().all()
0     True
1    False
2     True
3    False
4     True
5    False
6     True

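dropna(): filter out missing data
Unlike reductions, where axis=0 works down each column, in dropna() the axis names what gets dropped: axis=0 (the default) drops rows that contain missing values, axis=1 drops columns.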

>>> df.dropna()
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
3  11  84.0  51  67.0  29  61.0  13
>>> df.dropna(axis=0) #drop rows that contain missing values
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
3  11  84.0  51  67.0  29  61.0  13
>>> df.dropna(axis=1) #drop columns that contain missing values
    0   2   4   6
0  73  91  41  51
1  46  54  60  32
2  39  50  16  64
3  11  51  29  13
4  78  69  35  51

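fillna(): fill in missing data
fillna(value='A', method='bfill' or 'ffill', axis=0 or 1). Note that the method argument is deprecated since pandas 2.1 in favor of df.ffill() and df.bfill().

Filling with a fixed value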

>>> df.fillna(value=99)
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
1  46  49.0  54  99.0  60  15.0  32
2  39  99.0  50  13.0  16  82.0  64
3  11  84.0  51  67.0  29  61.0  13
4  78  21.0  69  38.0  35  99.0  51

Forward and backward filling

#backward fill: use the next valid value in the column
>>> df.fillna(method='bfill')
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
1  46  49.0  54  13.0  60  15.0  32
2  39  84.0  50  13.0  16  82.0  64
3  11  84.0  51  67.0  29  61.0  13
4  78  21.0  69  38.0  35   NaN  51
#forward fill: use the previous valid value in the column
>>> df.fillna(method='ffill')
    0     1   2     3   4     5   6
0  73  98.0  91  73.0  41  73.0  51
1  46  49.0  54  73.0  60  15.0  32
2  39  49.0  50  13.0  16  82.0  64
3  11  84.0  51  67.0  29  61.0  13
4  78  21.0  69  38.0  35  61.0  51
#forward fill along each row
>>> df.fillna(method='ffill',axis=1)
      0     1     2     3     4     5     6
0  73.0  98.0  91.0  73.0  41.0  73.0  51.0
1  46.0  49.0  54.0  54.0  60.0  15.0  32.0
2  39.0  39.0  50.0  13.0  16.0  82.0  64.0
3  11.0  84.0  51.0  67.0  29.0  61.0  13.0
4  78.0  21.0  69.0  38.0  35.0  35.0  51.0
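
On recent pandas versions the same fills can be written with the dedicated ffill() and bfill() methods instead of fillna(method=...). A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 5.0, 6.0]})

forward = df.ffill()          # copy the previous valid value down each column
backward = df.bfill()         # copy the next valid value up each column
across = df.ffill(axis=1)     # copy the value from the column to the left
print(forward)
print(backward)
print(across)
```

A value with no valid neighbour in the fill direction stays NaN, as in the first row of column y under forward fill.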

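4. Multi-level indexes and slicing

Summary:
Access one or more columns: plain square brackets, [columnname] or [[columnname1,columnname2...]]
Access one or more rows:    .loc[indexname]
Access a single element:    .loc[indexname,columnname]   e.g. Li Si's midterm php score
Row slice:                  .loc[index1:index2]          e.g. the midterm scores of Zhang San and Li Si
Column slice:               .loc[:,column1:column2]      e.g. the midterm php and c++ scores of Zhang San and Li Si

Multi-level indexes
The most common approach is to pass two or more arrays to the index or columns parameter of the DataFrame constructor

Creation, method 1: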

>>> df1=DataFrame(data=np.random.randint(10,50,size=(2,4)),index=['A','B'],columns=[['a','b','e','f'],['c','d','a','d']])
>>> df1
    a   b   e   f
    c   d   a   d
A  48  10  12  15
B  41  38  26  20

Creation, method 2:

col=pd.MultiIndex.from_product([['qizhong','qimo'],
                                ['chinese','math']])                   
#Create the DataFrame object
df = DataFrame(data=np.random.randint(60,120,size=(2,4)),index=['tom','jay'],
         columns=col)
>>> df
    qizhong         qimo
    chinese math chinese math
tom     117   72     115  102
jay      86   64     115  101

#Usage
>>> df['qizhong']
     chinese  math
tom      117    72
jay       86    64
>>> df.loc['tom']
qizhong  chinese    117
         math        72
qimo     chinese    115
         math       102
Name: tom, dtype: int32

Aggregate functions

>>> df.max()
qizhong  chinese    117
         math        72
qimo     chinese    115
         math       102
dtype: int32
>>> df.loc['jay'].max()
115
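
To reach a single value under multi-level columns, the column can be given as a tuple of (outer level, inner level). The sketch below rebuilds the frame above with its printed values fixed in place of random ones; the xs() call is one possible way to pick the same inner label across all outer labels.

```python
import pandas as pd

col = pd.MultiIndex.from_product([['qizhong', 'qimo'], ['chinese', 'math']])
df = pd.DataFrame([[117, 72, 115, 102],
                   [86, 64, 115, 101]],
                  index=['tom', 'jay'], columns=col)

# One element: row label first, then the column as a (level0, level1) tuple.
tom_mid_math = df.loc['tom', ('qizhong', 'math')]

# Every 'math' column across the outer level, via a cross-section on level 1.
math_only = df.xs('math', axis=1, level=1)
print(tom_mid_math)
print(math_only)
```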

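IV. Combining data in pandas

pandas offers two kinds of combining operations:

Concatenation: pd.concat, DataFrame.append
Merging: pd.merge, DataFrame.join

1. Concatenating with pd.concat()

axis=0: the axis to concatenate along (0 stacks rows, 1 places frames side by side)
join='outer' / 'inner': how the other axis is handled. outer keeps every label, matching or not, and fills the gaps with NaN; inner keeps only the labels that appear in every input
ignore_index=False: if True, the labels along the concatenation axis are discarded and replaced with 0..n-1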

>>> df2
    a   b   c
A  17  11  42
B  38  11  31
C  12  44  19
>>> df1
    a   b   c   d
A  44  40  37  47
B  14  28  12  29
>>> pd.concat([df1,df2])
    a   b   c     d
A  44  40  37  47.0
B  14  28  12  29.0
A  17  11  42   NaN
B  38  11  31   NaN
C  12  44  19   NaN
>>> pd.concat([df1,df2],join="inner")
    b   a   c
A  40  44  37
B  28  14  12
A  11  17  42
B  11  38  31
C  44  12  19

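Another way is df1.append(df2). DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer choice on current versions.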

>>> df1.append(df2)
    a   b   c     d
A  44  40  37  47.0
B  14  28  12  29.0
A  17  11  42   NaN
B  38  11  31   NaN
C  12  44  19   NaN
>>>
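
A sketch of the concat options using the two frames printed above. It should run on both older and current pandas versions because it does not rely on DataFrame.append.

```python
import pandas as pd

df1 = pd.DataFrame([[44, 40, 37, 47], [14, 28, 12, 29]],
                   index=['A', 'B'], columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame([[17, 11, 42], [38, 11, 31], [12, 44, 19]],
                   index=['A', 'B', 'C'], columns=['a', 'b', 'c'])

stacked = pd.concat([df1, df2])                         # outer join on columns, labels kept
renumbered = pd.concat([df1, df2], ignore_index=True)   # row labels replaced with 0..4
common = pd.concat([df1, df2], join='inner')            # only columns present in both
print(stacked)
print(renumbered)
print(common)
```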

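2. Merging with pd.merge()

merge differs from concat in that merge combines rows based on one or more shared columns
By default, pd.merge() uses the columns whose names appear in both frames as the key.
Parameters:

left/right: the left and right DataFrame.
how: the type of merge. left: keep every key from the left DataFrame; right: keep every key from the right DataFrame; outer: keep the union of keys; inner: keep the intersection of keys. The default is 'inner'.
on: the column name(s) to merge on; the name must exist in both DataFrames.
left_on: the key column in the left DataFrame, needed when the two frames have no column name in common
right_on: the key column in the right DataFrame
Where a key has no match in the other frame, that frame's columns are set to NaN.
Intersection: how='inner'

Union: how='outer'

With how='left', the left table is the reference and the matching values are looked up in the right table row by row; matches can be one-to-one or one-to-many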

df5 = pd.merge(df1,df2,how='left',on='alpha')
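
The original frames behind the 'alpha' example are not shown, so the sketch below uses two small hypothetical frames with an 'alpha' key column to compare the how options and the left_on/right_on pair. The column name 'letter' is also made up.

```python
import pandas as pd

# Hypothetical data: both frames share the key column 'alpha'.
left = pd.DataFrame({'alpha': ['A', 'B', 'C'], 'x': [1, 2, 3]})
right = pd.DataFrame({'alpha': ['A', 'B', 'D'], 'y': [10, 20, 40]})

inner = pd.merge(left, right, on='alpha')                   # keys in both: A, B
outer = pd.merge(left, right, how='outer', on='alpha')      # all keys: A, B, C, D
left_join = pd.merge(left, right, how='left', on='alpha')   # all left keys: A, B, C

# When the key columns have different names, name each side explicitly.
right_renamed = right.rename(columns={'alpha': 'letter'})
by_name = pd.merge(left, right_renamed, left_on='alpha', right_on='letter')
print(left_join)
```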

V. Common functions

1. Querying data by condition

abb_pop.query('year==2010 & ages=="total"')

2. Dropping columns

With inplace=True the change is applied to the original data. labels: the index or column labels to drop

abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
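
The abb_pop frame itself is not shown in the article, so the sketch below builds a hypothetical stand-in with made-up numbers to show what the query and drop calls return.

```python
import pandas as pd

# Hypothetical data shaped like the abb_pop frame used above.
abb_pop = pd.DataFrame({
    'abbreviation': ['AL', 'AL', 'AK', 'AK'],
    'ages': ['total', 'under18', 'total', 'total'],
    'year': [2010, 2010, 2010, 2011],
    'population': [100, 20, 50, 51],   # made-up numbers
})

selected = abb_pop.query('year == 2010 & ages == "total"')

# Without inplace=True, drop() returns a new frame and leaves abb_pop unchanged.
trimmed = abb_pop.drop(labels='abbreviation', axis=1)
print(selected)
print(trimmed.columns.tolist())
```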

3. Adding a new column

>>> df3
  employee        group  hire_date
0     Lisa   Accounting       2004
1     Jake  Engineering       2016
>>> df4
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2  Engineering      Steve
>>>
>>> df3['new']=df4['group']
>>> df3
  employee        group  hire_date          new
0     Lisa   Accounting       2004   Accounting
1     Jake  Engineering       2016  Engineering
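
Assigning a Series as a new column aligns on the index rather than on position, which is why df4's third row is silently ignored above. A minimal sketch with a hypothetical Series whose index only partly overlaps:

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Lisa', 'Jake']})                 # index 0, 1
groups = pd.Series(['Accounting', 'Engineering'], index=[1, 2])    # index 1, 2 (hypothetical)

df3['by_label'] = groups                 # aligned on index: row 0 has no match
df3['by_position'] = groups.to_numpy()   # plain array: assigned in order
print(df3)
```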

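4. set_index() and reset_index()

set_index turns one or more existing columns of a DataFrame into a single-level or multi-level index
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
append=True adds the new keys to the existing index instead of replacing it; drop=False keeps the key columns as ordinary columns too; inplace=True modifies the DataFrame itself instead of returning a new one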

In [307]: data
Out[307]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
 
In [308]: indexed1 = data.set_index('c')
 
In [309]: indexed1
Out[309]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0
 
In [310]: indexed2 = data.set_index(['a', 'b'])
 
In [311]: indexed2
Out[311]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

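reset_index restores the default integer index
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
level selects which index level(s) to reset
With drop=False the index is turned back into ordinary columns; with drop=True it is discarded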

In [318]: data
Out[318]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0
 
In [319]: data.reset_index()
Out[319]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
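
A round-trip sketch using the same data as above, to show that reset_index undoes set_index and how drop and level change the result. The variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1.0, 2.0, 3.0, 4.0]})

indexed = data.set_index(['a', 'b'])        # 'a' and 'b' leave the columns
kept = data.set_index('c', drop=False)      # 'c' is both the index and a column
restored = indexed.reset_index()            # back to a default 0..3 index
partial = indexed.reset_index(level='a')    # move only level 'a' back to a column
print(partial)
```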

5. rename()

Renames index labels, i.e. row or column labels
Usage: DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)

>>> a
          0         1         2         3
0  0.102037  0.304021  0.172873  0.514103
1  0.718055  0.281710  0.155927  0.051859
2  0.819120  0.614601  0.474220  0.107108
>>> a.rename(columns={0:"tian"})
       tian         1         2         3
0  0.102037  0.304021  0.172873  0.514103
1  0.718055  0.281710  0.155927  0.051859
2  0.819120  0.614601  0.474220  0.107108

Column names can also be replaced by direct assignment; the list must name every column

test.columns = ['c','b']

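6. cumsum()

cumsum is short for cumulative sum: it returns a running total. Metrics such as month-to-date or year-to-date totals come up often, and this function makes them easy to compute.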

>>> a
          0         1         2         3
0  0.102037  0.304021  0.172873  0.514103
1  0.718055  0.281710  0.155927  0.051859
2  0.819120  0.614601  0.474220  0.107108
>>> a[3].cumsum()
0    0.514103
1    0.565961
2    0.673070
Name: 3, dtype: float64

cumsum skips NaN values by default; the skipna parameter controls this

skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
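
A small sketch of what skipna changes, on a made-up Series. With the default, the NaN position stays NaN but the running total continues after it; with skipna=False everything after the first NaN is NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

default = s.cumsum()               # NaN is skipped, the total carries on
strict = s.cumsum(skipna=False)    # NaN propagates to every later position
print(default)
print(strict)
```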

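7. shift()

Shifts the data up or down, by one position by default; positions left without data are filled with NaN

The parameters are:
periods: the number of positions to shift; it can be positive or negative and defaults to 1. Positions that have no corresponding value after the shift are set to NaN.
freq: DateOffset, timedelta, or time rule string; optional, defaults to None, and only applies to time series
axis: 0 shifts along the rows (up and down), 1 shifts along the columns (left and right)
Difference between periods alone and periods with freq:
without freq, only the data moves and the row and column indexes stay in place;
with freq, only the index moves and the data stays unchanged; this works only when the index is datetime-like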

>>> a.shift()
          0         1         2         3
0       NaN       NaN       NaN       NaN
1  0.102037  0.304021  0.172873  0.514103
2  0.718055  0.281710  0.155927  0.051859
>>> a.shift(-1)
          0         1         2         3
0  0.718055  0.281710  0.155927  0.051859
1  0.819120  0.614601  0.474220  0.107108
2       NaN       NaN       NaN       NaN
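
The difference between shifting with and without freq is easier to see on a time-indexed Series. The dates below are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=pd.date_range('2020-01-01', periods=3, freq='D'))

moved_data = s.shift(1)              # values move down one row, index unchanged
moved_index = s.shift(1, freq='D')   # index moves forward one day, values unchanged
print(moved_data)
print(moved_index)
```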

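8. cut()

What pd.cut() does:
It sorts the values of a list into the bins we define and labels each value with its bin. Each bin is open on the left and closed on the right by default, so 59 falls in (0, 59].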

>>> socere_list = np.random.randint(25,100,size=20)
>>> socere_list
array([47, 38, 30, 25, 45, 83, 62, 28, 60, 99, 99, 66, 31, 68, 98, 35, 51,
       52, 46, 74])
>>> bins = [0,59,70,80,100]
>>> score_cut = pd.cut(socere_list,bins)
>>> score_cut
[(0, 59], (0, 59], (0, 59], (0, 59], (0, 59], ..., (0, 59], (0, 59], (0, 59], (0, 59], (70, 80]]
Length: 20
Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
>>> pd.value_counts(score_cut)
(0, 59]      11
(80, 100]     4
(59, 70]      4
(70, 80]      1
dtype: int64
>>>

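_get_codes() returns the integer code of the bin for each value; with 4 bins the codes are 0, 1, 2, 3. It is a private method; the public attribute score_cut.codes holds the same array.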
>>> score_cut._get_codes()
array([0, 0, 0, 0, 0, 3, 1, 0, 1, 3, 3, 1, 0, 1, 3, 0, 0, 0, 0, 2],
      dtype=int8)
>>> score_cut._get_codes().tolist()
[0, 0, 0, 0, 0, 3, 1, 0, 1, 3, 3, 1, 0, 1, 3, 0, 0, 0, 0, 2]
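
pd.cut() also accepts a labels argument, which replaces the interval notation with readable names. The sketch below reuses the bins from above; the scores and the four label names are hypothetical.

```python
import pandas as pd

scores = [47, 62, 74, 99, 59, 60]
bins = [0, 59, 70, 80, 100]
labels = ['fail', 'pass', 'good', 'excellent']   # hypothetical names, one per bin

graded = pd.cut(scores, bins, labels=labels)
codes = graded.codes.tolist()    # public counterpart of _get_codes()
print(list(graded))
print(codes)
```

Because bins are closed on the right, 59 lands in the first bin and 60 in the second.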

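9. diff()

Computes the difference between each element of a DataFrame and another element of the same DataFrame (by default the element in the previous row).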

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 1, 2, 3, 5, 8],
                   'c': [1, 4, 9, 16, 25, 36]})
df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
df.diff()
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0
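
diff() can be read as "subtract a shifted copy", and the periods argument sets how far back to look. A minimal sketch on the squares from column c:

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25])

step1 = s.diff()             # difference from the previous row
step2 = s.diff(periods=2)    # difference from two rows back
manual = s - s.shift(1)      # the same result as step1, written out by hand
print(step1)
print(step2)
```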

10. agg()

Aggregates data using one or more operations. If a function is given, it must work either when passed a DataFrame or when passed to DataFrame.apply.

>>> a
          0         1         2         3
0  0.102037  0.304021  0.172873  0.514103
1  0.718055  0.281710  0.155927  0.051859
2  0.819120  0.614601  0.474220  0.107108
>>> a.agg(['sum','count'])
              0         1        2        3
sum    1.639213  1.200333  0.80302  0.67307
count  3.000000  3.000000  3.00000  3.00000
>>> a.agg(['sum','count'],axis=0)
              0         1        2        3
sum    1.639213  1.200333  0.80302  0.67307
count  3.000000  3.000000  3.00000  3.00000
>>> a.agg(['sum','count'],axis=1)
        sum  count
0  1.093035    4.0
1  1.207551    4.0
2  2.015050    4.0
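
Besides a list of function names, agg() also takes a dictionary that maps each column to its own function, or a callable. The sketch below uses a small made-up frame; the range lambda is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# A different aggregation for each column.
per_column = df.agg({'a': 'sum', 'b': 'max'})

# A custom function, applied to each column in turn.
value_range = df.agg(lambda col: col.max() - col.min())
print(per_column)
print(value_range)
```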

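11. groupby

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and compute on those groups.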

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10cfc5a90>
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
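
groupby combines naturally with agg() when several statistics per group are needed. A sketch on the same animal data:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# Split by animal, then compute several statistics for the speed column.
summary = df.groupby('Animal')['Max Speed'].agg(['mean', 'max', 'count'])
print(summary)
```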