pandas

Latest recommended article published on 2022-10-24 16:32:41

寂ღ᭄秋࿐

Article link: https://blog.csdn.net/qq_51167531/article/details/122116309

The two main data structures

Series and DataFrame

Import pandas as follows

import pandas as pd

Print the version of the imported pandas

pd.__version__

Print the version information of pandas and all of its dependencies

pd.show_versions()

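Reading files

pandas can read many file formats; this section covers csv, Excel, txt, and JSON files.

1. Reading a csv file

pd.read_csv('data.csv')

2. Reading a txt file

pd.read_table('data.txt')

3. Reading an Excel file

pd.read_excel('data.xlsx')

4. Reading a JSON file

pd.read_json('data.json')

5. Main parameters

(1) header = None

The first row is not used as the column labels.

(2) index_col

Use one or more columns as the row labels.

(3) usecols

Read only the listed columns; by default all columns are read.

(4) nrows

Number of rows to read; by default all rows are read.

(5) parse_dates

Parse the listed columns from date strings into datetime values.

(6) sep

The separator. A separator longer than one character (other than '\s+') is interpreted as a regular expression.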
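The parameters above can be combined in a single call. The following is a minimal sketch rather than a recipe: the in-memory text, the column names date, key, and val, and the semicolon separator are all assumptions made up for illustration, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical file content: no header row, semicolon-separated.
raw = "2021-01-01;a;1\n2021-01-02;b;2\n2021-01-03;c;3\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                        # single-character separator
    header=None,                    # the first row is data, not column labels
    names=["date", "key", "val"],   # supply the column labels ourselves
    usecols=["date", "val"],        # keep only two of the three columns
    nrows=2,                        # read only the first two data rows
    parse_dates=["date"],           # convert the date strings to datetimes
    index_col="date",               # use the date column as the row index
)
print(df)
```

Only two rows and one data column survive, and the row index is a DatetimeIndex.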

6. Quickly converting a table to Markdown

a.to_markdown()  # requires the optional tabulate package

Series

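A Series is a one-dimensional, array-like object. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index). In other words, it is a one-dimensional data structure made up of an index and values.

It can be imported directly

from pandas import Series as si

1. Creating a Series

(1) From a list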

a = pd.Series([1,2,3],index = ['a','b','c'])

a    1
b    2
c    3

(2) From a scalar value

a = pd.Series(25,index = ['a','b','c'])

a    25
b    25
c    25

(3) From a Python dict

a = {'a':3500, 'b':7100, 'c':5000}
b = pd.Series(a)

a    3500
b    7100
c    5000

(4) From NumPy arrays

import numpy as np
a = pd.Series(np.arange(5),index = np.arange(10,5,-1))

10    0
9     1
8     2
7     3
6     4

2. Indexing

(1) Selecting a single value or a group of values by index label

e = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])
e['a']
e['d']
e[['a', 'c']]

(2) Testing whether a label is in the index

e = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])
'a' in e
'z' in e

True
False

(3) Assigning to index relabels the values by position; the values themselves are not reordered

e = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])
e.index = ['b', 'd', 'a', 'c']
e

b    1
d    2
a    3
c    4
dtype: int64

(4) Unlike plain arrays, arithmetic between Series automatically aligns the data on index labels

x = pd.Series([1,2,3,4], index = ['a','c','e','f'])
y = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])
x + y

a    2.0
b    NaN
c    5.0
d    NaN
e    NaN
f    NaN
dtype: float64

NaN marks a missing value; a result is non-missing only when the label exists in both Series.
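When NaN for unmatched labels is not wanted, the add() method accepts a fill_value. A minimal sketch using the same two Series; treating a missing label as 0 is an assumption that suits sums but not every use case.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4], index=['a', 'c', 'e', 'f'])
y = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# A label missing on one side is treated as 0 instead of producing NaN.
total = x.add(y, fill_value=0)
print(total)
```

The result covers the union of both indexes (a to f), and no entry is NaN.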

(5) Replacing the index by assignment

a = pd.Series([1,2], index = ['a', 'b'])
a.index = ['c', 'd']
a

(6) Getting the labels and the data through index and values, with slicing

x = pd.Series([1,2,3,4], index = ['a','c','e','f'])
x.index
x.values
x.index[0:2]
x.values[0:2]

Index(['a', 'c', 'e', 'f'], dtype='object')
[1 2 3 4]
Index(['a', 'c'], dtype='object')
[1 2]

(7) Label-based indexing with loc[] and position-based indexing with iloc[]

x = pd.Series([1,2,3,4], index = ['a','c','e','f'])
x.loc['a']
x.iloc[0]
x.iloc[0:2]

1
1
a    1
c    2
dtype: int64

3. Sorting

(1) sort_index()

a = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
a.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

(2) sort_values() (missing values, i.e. NaN, are placed at the end by default)

a = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
a.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

(3) Computing ranks

a = pd.Series([7, -5, 8, 4, 2, 0, 3])
a.rank()

0    6.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    4.0
dtype: float64
# Tied values share the average of the ranks they span, which can give fractional ranks
a = pd.Series([7, -5, 7, 4, 2, 0, 2])
a.rank()
0    6.5
1    1.0
2    6.5
3    5.0
4    3.5
5    2.0
6    3.5
dtype: float64
# With method='first', tied values are ranked in the order in which they appear
a = pd.Series([7, -5, 7, 4, 2, 0, 2])
a.rank(method='first')

0    6.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    4.0
dtype: float64

4. Unique values

unique() returns an array of the unique values

a = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
a.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

nunique() returns the number of unique values

a = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
a.nunique()

4

value_counts() returns each unique value together with its frequency

a = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
a.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

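(DataFrame) To inspect the unique combinations across several columns, use drop_duplicates(keep=...) and duplicated()

The difference is that duplicated() returns a boolean Series marking the duplicated rows, while drop_duplicates() returns the de-duplicated data. Both accept the same keep parameter, which takes one of three values: 'first', 'last', or False.

'first': keep the first occurrence (row) of each combination

'last': keep the last occurrence (row) of each combination

False: drop every row whose combination is duplicated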
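The three keep options can be compared side by side. A minimal sketch on a made-up table; the names df_dup, city, and year are hypothetical.

```python
import pandas as pd

df_dup = pd.DataFrame({'city': ['A', 'A', 'B', 'A'],
                       'year': [2020, 2020, 2020, 2021]})
cols = ['city', 'year']

first = df_dup.drop_duplicates(subset=cols, keep='first')  # keeps rows 0, 2, 3
last = df_dup.drop_duplicates(subset=cols, keep='last')    # keeps rows 1, 2, 3
none = df_dup.drop_duplicates(subset=cols, keep=False)     # keeps rows 2, 3
flags = df_dup.duplicated(subset=cols, keep='first')       # boolean Series

print(first, last, none, flags, sep='\n\n')
```

Rows 0 and 1 hold the same (city, year) pair, so they are the only rows affected by the choice of keep.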

5. Replacement functions

Mapping replacement (replace())

(1) Replacing with a dict or with a pair of lists (two approaches)

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s.replace({1:'a',2:'b'})

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s.replace([1,2],['a','b'])


0    a
1    a
2    b
3    b
4    a
5    a
6    a
dtype: object

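(2) Directional replacement (fill from the previous or from the next value). Note that the method parameter of replace() is deprecated as of pandas 2.1.

Fill from the previous value (ffill)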

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s.replace([1,2],method  = 'ffill')

0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object

Fill from the next value (bfill)

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s.replace([1,2],method  = 'bfill')

0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object

(3) Regex replacement (str.replace)
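Regular-expression replacement goes through the .str accessor of a string Series. A minimal sketch with made-up strings; recent pandas versions treat the pattern as a literal unless regex=True is passed.

```python
import pandas as pd

s = pd.Series(['room-101', 'room-22', 'hall'])

# Remove a hyphen followed by one or more digits.
cleaned = s.str.replace(r'-\d+', '', regex=True)
print(cleaned)
```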

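Logical replacement (where, mask)

where() keeps the values for which the condition is True and replaces the rest: with NaN when no replacement value is given, otherwise with the given value.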

s = pd.Series([-1, 1,-2, 100, 6, 10, -3])
s.where(s<10)
s.where(s<10,False)

0   -1.0
1    1.0
2   -2.0
3    NaN
4    6.0
5    NaN
6   -3.0
dtype: float64

0       -1
1        1
2       -2
3    False
4        6
5    False
6       -3
dtype: object

mask() is the opposite: it replaces the values for which the condition is True, with NaN when no replacement value is given, otherwise with the given value.

s = pd.Series([-1, 1,-2, 100, 6, 10, -3])
print(s.mask(s<10))
print(s.mask(s<10,True))

0      NaN
1      NaN
2      NaN
3    100.0
4      NaN
5     10.0
6      NaN
dtype: float64
    
0    True
1    True
2    True
3     100
4    True
5      10
6    True
dtype: object

Numeric replacement (round, abs, clip)

round() rounds the values; the argument is the number of decimal places

s = pd.Series([-1.5, 1.234,-2.2654, 100, 6.2334, 10.5555, -3.6666])
s.round(2)

0     -1.50
1      1.23
2     -2.27
3    100.00
4      6.23
5     10.56
6     -3.67
dtype: float64

abs() takes the absolute value

s = pd.Series([-1.5, 1.234,-2.2654, 100, 6.2334, 10.5555, -3.6666])
s.abs()

0      1.5000
1      1.2340
2      2.2654
3    100.0000
4      6.2334
5     10.5555
6      3.6666
dtype: float64

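clip(lower, upper) leaves values inside [lower, upper] unchanged. Any value below the lower bound is replaced by the lower bound, and any value above the upper bound is replaced by the upper bound.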

s = pd.Series([-1.5, 1.234,-2.2654, 100, 6.2334, 10.5555, -3.6666])
s.clip(1,6)

0    1.000
1    1.234
2    1.000
3    6.000
4    6.000
5    6.000
6    1.000
dtype: float64

DataFrame

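A DataFrame is a two-dimensional tabular data structure with both a row index and a column index. The row index labels the rows and the column index labels the columns. It can also be viewed as a dict of Series that share the same index.

It can be imported directly

from pandas import DataFrame as df

1. Creating a DataFrame

(1) From a two-dimensional array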

a = df([[1,2,3],[4,5,6],[7,8,9]],index = list(['第一行','第二行','第三行']),columns = list(['第一列','第二列','第三列']))
a

     第一列 第二列 第三列
第一行  1      2    3
第二行  4      5    6
第三行  7      8    9

(2) From a dict

a = pd.DataFrame({'第一列':[1,4,7],'第二列':[2,5,8],'第三列':[3,6,9]},index = ['第一行','第二行','第三行'])
a

     第一列 第二列 第三列
第一行  1      2    3
第二行  4      5    6
第三行  7      8    9

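2. DataFrame attributes and operations

1 Basic attributes

(1)a.shape # number of rows and columns

(2)a.dtypes # data type of each column

(3)a.ndim # number of dimensions

(4)a.index # row index

(5)a.columns # column index

(6)a.values # the underlying values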

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
a = pd.DataFrame(data)
a.shape
a.dtypes
a.ndim
a.index
a.columns
a.values


(6, 3)

state     object
year       int64
pop      float64
dtype: object
 
2

RangeIndex(start=0, stop=6, step=1)

Index(['state', 'year', 'pop'], dtype='object')

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)

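2 Basic operations

(1)a.head(3) # show the first 3 rows

(2)a.tail(3) # show the last 3 rows

(3)a.info() # overview: number of rows and columns, index, non-null count per column, column types, etc.

(4)a.describe() # summary statistics: mean, maximum, minimum, standard deviation, etc.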

a.head(3)
a.tail(3)
a.info() 
a.describe() 


   state    year    pop
0   Ohio    2000    1.5
1   Ohio    2001    1.7
2   Ohio    2002    3.6

   state    year    pop
3   Nevada  2001    2.4
4   Nevada  2002    2.9
5   Nevada  2003    3.2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   state   6 non-null      object 
 1   year    6 non-null      int64  
 2   pop     6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 272.0+ bytes

            year    pop
count   6.000000    6.000000
mean    2001.500000 2.550000
std     1.048809    0.836062
min     2000.000000 1.500000
25%     2001.000000 1.875000
50%     2001.500000 2.650000
75%     2002.000000 3.125000
max     2003.000000 3.600000

3. Indexing

(1) Getting data by column label and assigning to it

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
a.index
a.values
a.columns
a['name']
a['name'] = '姓名'


RangeIndex(start=0, stop=3, step=1)

array([['小王', 22, 'man'],
       ['晓东', 21, 'man'],
       ['小敏', 20, 'woman']], dtype=object)

Index(['name', 'sage', 'ssex'], dtype='object')

0    小王
1    晓东
2    小敏
Name: name, dtype: object
        
    name    sage    ssex
0   姓名      22     man
1   姓名      21     man
2   姓名      20     woman

(2) Label-based indexing with loc[]

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
# select the name column
a.loc[:,['name']]
# select the first row
a.loc[0]
# select name and sage
a.loc[0:,['name','sage']]   # equivalent to a.loc[:,['name','sage']] and a[['name','sage']]

    name
0   小王
1   晓东
2   小敏

name     小王
sage     22
ssex    man
Name: 0, dtype: object
        
    name    sage
0   小王     22
1   晓东     21
2   小敏     20

(3) Position-based indexing with iloc[]

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
# name and age of students 0 and 1
a.iloc[[0, 1] , [0, 1]]
# name and sex of every student
a.iloc[0:,[0,2]]

    name    sage
0   小王     22
1   晓东     21

    name    ssex
0   小王     man
1   晓东     man
2   小敏     woman

(5) Boolean indexing

student={'name':['小王','晓东','小敏','小明'],
        'sage':[22,21,20,np.nan],
        'ssex':['man','man','woman','man']}
a =  pd.DataFrame(student)
# True where the age is less than or equal to 21, False otherwise (NaN compares as False)
a['sage'] <= 21
# rows that satisfy the condition
a[a['sage']<=21]

0    False
1     True
2     True
3    False
Name: sage, dtype: bool
        
    name    sage    ssex
1   晓东     21.0     man
2   小敏     20.0     woman



# select the rows where the age is missing
a[a['sage'].isnull()]

    name    sage    ssex
3   小明     NaN     man



# select the rows where the sex is man and the age is greater than 21
a[(a['ssex'] == 'man') & (a['sage']>21)]

    name    sage    ssex
0   小王     22.0    man


# select the people aged 21 to 22 (both ends inclusive)
a[a['sage'].between(21,22)]

    name    sage    ssex
0   小王     22.0     man
1   晓东     21.0     man

(6) Getting a single element

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
With loc[]:

a.loc[0,'name']
a.loc[0,'ssex']

'小王'
'man'
With iloc[]:

a.iloc[0,0]
a.iloc[0,2]

'小王'
'man'

(7) Deleting data

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)

With del (columns only)

del a['sage']

    name    ssex
0   小王     man
1   晓东     man
2   小敏     woman

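With drop() (removes the listed rows or columns and returns a new DataFrame; a itself is unchanged unless inplace=True. The outputs below assume the original three-column a.)

# drop a column
a.drop(['name'],axis = 1)
# drop a row
a.drop([0])
# values left after dropping the column (the result is a NumPy array)
a.drop(['name'],axis = 1).values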


   sage ssex
0   22  man
1   21  man
2   20  woman

   name sage ssex
1   晓东  21  man
2   小敏  20  woman

array([[22, 'man'],
       [21, 'man'],
       [20, 'woman']], dtype=object)

(8) Adding a row

a.loc[3] = ['小王',22,'man']
a

    name    sage    ssex
0   姓名     22   man
1   姓名     21   man
2   姓名     20   woman
3   小王     22   man

(9) Adding a column (direct assignment)

raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }
a = pd.DataFrame(raw_data)
a['place'] =['park','street','lake','forest']
a


      name     evolution    type    hp  pokedex place
0   Bulbasaur   Ivysaur    grass    45    yes   park
1   Charmander  Charmeleon  fire    39    no    street
2   Squirtle    Wartortle   water   44    yes   lake
3   Caterpie    Metapod      bug    45    no    forest

(9) Changing column labels

With rename

inplace: whether to modify in place, False by default. True modifies the original DataFrame; False returns the modified DataFrame as a new object

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
a.rename(columns = {'name':"姓名",'sage':"年龄",'ssex':"性别"},inplace = True)
a

    姓名  年龄   性别
0   小王   22   man
1   晓东   21   man
2   小敏   20   woman


Direct assignment (1): assign a new list to columns

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman']}
a =  pd.DataFrame(student)
a.columns = ['姓名','年龄','性别']
a

    姓名  年龄   性别
0   小王   22   man
1   晓东   21   man
2   小敏   20   woman

Direct assignment (2): reorder the columns by selecting them in the desired order

raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }
a = pd.DataFrame(raw_data)
a = a[['name','type','hp','evolution','pokedex']]
a

        name        type    hp    evolution     pokedex
0     Bulbasaur     grass   45     Ivysaur        yes
1     Charmander    fire    39    Charmeleon       no
2     Squirtle      water   44    Wartortle        yes
3     Caterpie       bug    45     Metapod         no

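(10) Changing row labels

reset_index()

Its main parameters are:

drop: whether to discard the old index instead of inserting it into the DataFrame as a new column. Defaults to False, i.e. the old index is kept as a column.

inplace: whether to modify the original DataFrame, False by default. True modifies the original; False returns the modified DataFrame as a new object.

level: if the index has multiple levels, remove only the given levels from the index; by default all levels are removed.

col_level: if the columns have multiple levels, determines which level the removed index labels are inserted into; defaults to the first level.

col_fill: if the columns have multiple levels, determines how the other levels are named.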
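The effect of drop is easiest to see on a small frame. A minimal sketch; the frame and its column name v are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'v': [10, 20, 30]}, index=['x', 'y', 'z'])

kept = frame.reset_index()              # old index becomes a column named 'index'
dropped = frame.reset_index(drop=True)  # old index is discarded

print(kept)
print(dropped)
```

In both results the new row index is 0, 1, 2; only kept still carries the old labels.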

a = pd.Series(np.random.randint(1,5,100))
b = pd.Series(np.random.randint(1,4,100))
c = pd.Series(np.random.randint(1000,3001,100))
e = pd.concat([a,b,c],axis = 0)
e.reset_index(drop = True,inplace = True)
e

0         3
1         2
2         2
3         3
4         2
       ... 
295    2715
296    2659
297    2979
298    2527
299    2469
Length: 300, dtype: int32

With set_index(),

use one of the columns as the row labels

student={'name':['小王','晓东','小敏'],
        'sage':[22,21,20],
        'ssex':['man','man','woman'],
        'zimu':['a','b','c']}
a = pd.DataFrame(student)
a.set_index('zimu',inplace = True)
a


          name   sage   ssex
zimu            
a         小王    22     man
b         晓东    21     man
c         小敏    20     woman

(11) The query() method

This method accepts a query expression passed directly as a string
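A minimal sketch of query() on the student table used in the boolean-indexing examples. The variable min_age is hypothetical; the @ prefix lets the expression refer to a Python variable.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'name': ['小王', '晓东', '小敏', '小明'],
                  'sage': [22, 21, 20, np.nan],
                  'ssex': ['man', 'man', 'woman', 'man']})

min_age = 21
# Same result as a[(a['ssex'] == 'man') & (a['sage'] >= min_age)]
res = a.query("ssex == 'man' and sage >= @min_age")
print(res)
```

The row with a missing age is excluded because comparisons against NaN evaluate to False.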

4. Merging

(1) With concat (several objects can be combined at once)

# concatenate along columns

a1 = pd.DataFrame([[1,2],[5,6]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第一列','第二列']))  # column index

a2 = pd.DataFrame([[9,10],[13,14]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第三列','第四列']))  # column index; renaming these to 第一列 and 第二列 would still append them on the right
s = pd.concat([a1, a2], axis=1) 
print(s)


第一列  第二列  第三列  第四列
第一行    1    2    9   10
第二行    5    6   13   14

# concatenate along rows
a1 = pd.DataFrame([[1,2],[5,6]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第一列','第二列']))  # column index

a2 = pd.DataFrame([[9,10],[13,14]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第一列','第二列']))  # column index; with 第三列 and 第四列 here the columns would not align (second output below)
s = pd.concat([a1, a2], axis=0) 
print(s)

    第一列  第二列
第一行    1    2
第二行    5    6
第一行    9   10
第二行   13   14

 第一列  第二列   第三列   第四列
第一行  1.0  2.0   NaN   NaN
第二行  5.0  6.0   NaN   NaN
第一行  NaN  NaN   9.0  10.0
第二行  NaN  NaN  13.0  14.0

(2) With merge

a1 = pd.DataFrame([[1,2],[5,6]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第一列','第二列']))  # column index

a2 = pd.DataFrame([[2,5],[3,8]], # values
    index=list(['第一行','第二行']),    # row index
    columns=list(['第三列','第四列']))  # column index

#(1) the default is the intersection, inner

a1.merge(a2,left_on="第二列",right_on='第三列',how= 'inner')
# equivalent to pd.merge(a1,a2,left_on="第二列",right_on='第三列',how= 'inner')

(2,6) & (2,3) = 2

Output:
   第一列  第二列  第三列  第四列
0     1       2       2       5

#(2) union, outer; missing entries are filled with NaN
a1.merge(a2,left_on="第二列",right_on='第三列',how= 'outer')

(2,6) | (2,3) = (2,3,6)

Output:
 第一列  第二列  第三列  第四列
0   1.0    2.0     2.0     5.0
1   5.0    6.0     NaN     NaN
2   NaN    NaN     3.0     8.0

# (3) left: every row of the left table is output; matching rows of the right table are filled in, otherwise NaN
a1.merge(a2,left_on="第二列",right_on='第三列',how= 'left')

   第一列  第二列  第三列  第四列
0    1        2      2.0     5.0
1    5        6      NaN     NaN

# (4) right: every row of the right table is output; matching rows of the left table are filled in, otherwise NaN
a1.merge(a2,left_on="第二列",right_on='第三列',how= 'right')

   第一列  第二列  第三列  第四列
0    1.0     2.0      2       5
1    NaN     NaN      3       8

#(5) if the two tables share column names other than the key, use the suffixes parameter to tell them apart.
f1 = pd.DataFrame({'Name':['San Zhang'],'Grade':[70]})
f2 = pd.DataFrame({'Name':['San Zhang'],'Grade':[80]})
f1.merge(f2, on='Name', how='left', suffixes=['_Chinese','_Math'])

         Name         Grade_Chinese    Grade_Math
0      San Zhang           70              80

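(3) To add a sequence as a new row or as a new column of a table, use append() and assign() respectively

append() (removed in pandas 2.0; pd.concat is the replacement)

f1 = pd.DataFrame({'Name':['San Zhang','Si Li'], 'Age':[20,21]})
s = pd.Series(['Wu Wang', 21], index = f1.columns)
f1.append(s, ignore_index=True)   # pandas < 2.0
# pandas >= 2.0: pd.concat([f1, s.to_frame().T], ignore_index=True)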


        Name         Age
0     San Zhang       20
1       Si Li         21
2      Wu Wang        21

assign()

s = pd.Series([80, 90])
f1.assign(Grade=s)


      Name       Age    Grade
0   San Zhang    20      80
1    Si Li       21      90

(4) Comparing two tables and showing only the differences with compare()

a1 = pd.DataFrame({'Name':['San Zhang', 'Si Li', 'Wu Wang'],
                    'Age':[20, 21 ,21],
                    'Class':['one', 'two', 'three']})
a2 = pd.DataFrame({'Name':['San Zhang', 'Li Si', 'Wu Wang'],
                    'Age':[20, 21 ,21],
                    'Class':['one', 'two', 'Three']})
a1.compare(a2)

        Name            Class
    self    other   self    other
1   Si Li   Li Si    NaN      NaN
2   NaN      NaN    three   Three

To show the comparison for every element of the table, set keep_shape = True

a1.compare(a2,keep_shape = True)


         Name                Age                Class
    self    other       self    other       self    other
0   NaN      NaN        NaN      NaN         NaN    NaN
1   Si Li   Li Si       NaN      NaN         NaN    NaN
2   NaN      NaN        NaN      NaN       three    Three

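(5) Combining

5. Statistics

(1) Common statistical functions

Method	Description
.quantile()	Quantiles
.sum()	Sum of the data, computed along axis 0 (likewise below)
.count()	Number of non-NaN (non-missing) values
.mean() .median()	Arithmetic mean and median
.var() .std()	Variance and standard deviation
.min() .max()	Minimum and maximum
.describe()	Summary statistics for the numeric columns
.describe(include='all')	Summary for all columns
.info()	Overview including non-null counts, useful for spotting missing values
.isnull()	Boolean mask of the missing values
percentiles	Parameter of describe() that selects which percentiles to report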

Counting occurrences

# count how many times each animal appears
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df =pd.DataFrame(data,index = labels)
df['animal'].value_counts()

dog      4
cat      4
snake    2
Name: animal, dtype: int64

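(2) Correlation analysis

• covariance > 0: X and Y are positively correlated

• covariance < 0: X and Y are negatively correlated

• covariance = 0: X and Y are linearly uncorrelated (this alone does not imply that they are independent)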
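Zero covariance only rules out a linear relationship. The sketch below, with made-up numbers, builds a y that is completely determined by x and still has zero covariance with it.

```python
import pandas as pd

x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2               # y depends entirely on x, but not linearly

cov_xy = x.cov(y)        # sample covariance of x and y
cov_xx = x.cov(x)        # covariance of x with itself is its variance
print(cov_xy, cov_xx)
```

Here cov_xy is 0 even though y is a function of x, while cov_xx equals x.var().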

Method	Description
cov()	Computes the covariance matrix

df = pd.DataFrame({  "A":[5, 3, 6, 4],  
         "B":[11, 2, 4, 3], 
         "C":[4, 3, 8, 5], 
         "D":[5, 4, 2, 8]})
df.cov()

 A          B         C          D
A  1.666667   2.333333    2.333333    -1.500000
B  2.333333   16.666667   -1.000000   0.000000
C  2.333333   -1.000000   4.666667    -2.333333
D -1.500000    0.000000   -2.333333    6.250000

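6. The apply method

apply calls a function (often a lambda) on each column or each row, which is passed in as a one-dimensional Series. It is a convenient way to process data quickly.

import numpy as np
import pandas as pd

a=pd.DataFrame(np.random.randint(0,5,(5,5)),columns=list('abcde'))
# difference between the maximum and minimum of each column
x = a.apply(lambda x:x.max()-x.min())
# difference between the maximum and minimum of each row
y = a.apply(lambda x:x.max()-x.min(), axis=1)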
print(x,y)

a    4
b    3
c    4
d    4
e    4
dtype: int64 0    1
1    4
2    3
3    2
4    4
dtype: int64

(1) A function that capitalizes the first letter of a string; pass it directly to apply()

capitalizer = lambda x: x.capitalize()
#apply(capitalizer )
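A minimal sketch of the capitalizer in use on a made-up Series of names. The vectorized .str.capitalize() shown alongside is an alternative, not part of the original example.

```python
import pandas as pd

capitalizer = lambda x: x.capitalize()

names = pd.Series(['bulbasaur', 'charmander', 'squirtle'])
via_apply = names.apply(capitalizer)   # calls the lambda on each element
via_str = names.str.capitalize()       # vectorized string method

print(via_apply)
```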

(2) applymap

applymap() applies a function to every cell of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map). Usage:

def c(x):
    if type(x) is int:
        return x*10
    else:
        return x
        
a.applymap(c).head(10)

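7. Grouping (with groupby())

The general pattern is: df.groupby(grouping key)[data columns].operation

GroupBy is a technique for computing on groups of data and combining the per-group results. It usually has three steps:

(1) Split: divide the data into groups

(2) Apply: call a function on each group

(3) Combine: assemble the per-group results into one result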
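The three steps can be written out by hand and compared with the built-in call. A minimal sketch with made-up numbers; the names df_g, pieces, and means are hypothetical, and the manual version is only for illustration since groupby does all of this internally.

```python
import pandas as pd

df_g = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# (1) Split: one piece of data1 per key
pieces = {key: group['data1'] for key, group in df_g.groupby('key1')}
# (2) Apply: compute the mean of each piece
means = {key: values.mean() for key, values in pieces.items()}
# (3) Combine: assemble the results into one Series
manual = pd.Series(means, name='data1')

builtin = df_g.groupby('key1')['data1'].mean()
print(manual)
print(builtin)
```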

Typical usage:

Take the data1 column and group it by key1

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                    'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
a = df['data1'].groupby(df['key1'])
a.mean()
# the above is equivalent to either of the following
#df['data1'].groupby(df['key1']).mean()
#df.groupby('key1')['data1'].mean()

key1
a   -0.608341
b   -0.287105
Name: data1, dtype: float64


student = pd.DataFrame(  [['A', 'male',95,79],
                ['A', 'female',96,90],
                ['B', 'female',85,85],
                ['C', 'male',93,92],
                ['B', 'female',84,90],
                ['B', 'male',88,70],
                ['C', 'male',59,89],
                ['A', 'male', 89,86],
                ['B', 'male',89,74]],    
    columns=list(['班级','性别','数学','语文']))  # column index
a = student
a.groupby(['班级','性别']).mean()


            数学  语文

班级  性别      
A   female  96.0    90.0
    male    92.0    82.5
B   female  84.5    87.5
    male    88.5    72.0
C   male    76.0    90.5

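8. Function usage

(1) agg is the aggregation function; after grouping it can apply several aggregations at once, for example the mean, the maximum, and the minimum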

a = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],

                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],

                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

a.groupby('Country').Age.agg(['mean','max','min'])


  
            mean             max    min
Country         
America  250.000000      250    250
China    4607.000000     5000   4321
India    3188.333333     4321   1234
Japan    250.000000      250    250

(2) The transform method
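Unlike agg, transform returns a result with the same length as the input, so each group's statistic is broadcast back onto its rows. A minimal sketch on a cut-down version of the Country/Income data; the column names Country_mean and Diff are hypothetical.

```python
import pandas as pd

a = pd.DataFrame({'Country': ['China', 'China', 'India', 'India'],
                  'Income': [10000, 8000, 5000, 5002]})

# Mean income of each row's country, aligned with the original rows
a['Country_mean'] = a.groupby('Country')['Income'].transform('mean')
# Deviation of each row from its country's mean
a['Diff'] = a['Income'] - a['Country_mean']
print(a)
```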

9. Sorting data

(1) sort_index(): sorts rows or columns by their labels

a = pd.DataFrame([[1,2,3,4],
                  [5,1,1,1],
                  [2,3,7,8],
                  [7,6,8,5]],
                index = ['b','d','a','c'],
                columns = ['x','z','y','m'])
# sort by row labels
a.sort_index()
# sort by column labels
a.sort_index(axis = 1)
# sort by column labels in descending order
a.sort_index(axis=1,ascending=False)


    x   z   y   m
a   2   3   7   8
b   1   2   3   4
c   7   6   8   5
d   5   1   1   1

    m   x   y   z
b   4   1   3   2
d   1   5   1   1
a   8   2   7   3
c   5   7   8   6

    z   y   x   m
b   2   3   1   4
d   1   1   5   1
a   3   7   2   8
c   6   8   7   5

(2) sort_values(): sorts by values, usually by one or more columns

a = pd.DataFrame([[1,2,3,4],
                  [5,1,1,1],
                  [2,3,7,8],
                  [7,6,8,5]],
                index = ['b','d','a','c'],
                columns = ['x','z','y','m'])
# the two calls give the same result here; extra keys only break ties in the first key
a.sort_values(by='z')
a.sort_values(by=['z','x'])

    x   z   y   m
d   5   1   1   1
b   1   2   3   4
a   2   3   7   8
c   7   6   8   5

    x   z   y   m
d   5   1   1   1
b   1   2   3   4
a   2   3   7   8
c   7   6   8   5

(3) Both use ascending to set the sort direction

# ascending order
ascending = True
# descending order
ascending = False
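ascending also accepts a list with one flag per sort key, which allows mixed directions. A minimal sketch on a made-up frame with ties in column z.

```python
import pandas as pd

a = pd.DataFrame({'z': [2, 1, 2, 1], 'x': [1, 5, 3, 7]},
                 index=['b', 'd', 'a', 'c'])

# z ascending first; within equal z, x descending
mixed = a.sort_values(by=['z', 'x'], ascending=[True, False])
print(mixed)
```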

10. String operations

Example:

a = pd.DataFrame(     [['101', '东','16:30'],
                ['102', '南','16:30'],
                ['103', '南','16:30'],
                ['104', '北','16:30'],
                ['105', '东','16:30'],
                ['106', '西','16:30'],
                ['107', '西','16:30'],
                ['108', '南', '16:30'],
                ['109', '东','16:30']],    
                columns=list(['房号','房屋朝向','时间']))
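# select the rooms facing east (东) or south (南); avoid naming the list str, which shadows the built-in
directions = ['东','南']
a1 = a[a['房屋朝向'].isin(directions)]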
a1

    房号  房屋朝向  时间
0   101    东    16:30
1   102    南    16:30
2   103    南    16:30
4   105    东    16:30
7   108    南    16:30
8   109    东    16:30
    
    
# replace the four directions with 1, 2, 3, 4
fx = {"东":1,"西":2,"南":3,"北":4}
a['房屋朝向'] = a.房屋朝向.map(fx)
a


    房号   房屋朝向   时间
0   101     1      16:30
1   102     3      16:30
2   103     3      16:30
3   104     4      16:30
4   105     1      16:30
5   106     2      16:30
6   107     2      16:30
7   108     3      16:30
8   109     1     16:30

Converting time strings to datetime:

(1) to_datetime converts strings to datetime

a['时间'] =pd.to_datetime(a['时间'],format='%H:%M') 


    房号  房屋朝向    时间
0   101     1   1900-01-01 16:30:00
1   102     3   1900-01-01 16:30:00
2   103     3   1900-01-01 16:30:00
3   104     4   1900-01-01 16:30:00
4   105     1   1900-01-01 16:30:00
5   106     2   1900-01-01 16:30:00
6   107     2   1900-01-01 16:30:00
7   108     3   1900-01-01 16:30:00
8   109     1   1900-01-01 16:30:00

(2) strptime converts strings to datetime (an alternative to (1), applied to the original string column)

from datetime import datetime

a['时间']= a['时间'].apply(lambda x:datetime.strptime(x,'%H:%M')) 

    房号  房屋朝向    时间
0   101     1   1900-01-01 16:30:00
1   102     3   1900-01-01 16:30:00
2   103     3   1900-01-01 16:30:00
3   104     4   1900-01-01 16:30:00
4   105     1   1900-01-01 16:30:00
5   106     2   1900-01-01 16:30:00
6   107     2   1900-01-01 16:30:00
7   108     3   1900-01-01 16:30:00
8   109     1   1900-01-01 16:30:00

(3) strftime converts datetime back to strings

from datetime import datetime

a['时间']= a['时间'].apply(lambda x:datetime.strftime(x,'%H-%M'))

    房号  房屋朝向    时间
0   101     1      16-30
1   102     3      16-30
2   103     3      16-30
3   104     4      16-30
4   105     1      16-30
5   106     2      16-30
6   107     2      16-30
7   108     3      16-30
8   109     1      16-30

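11. File operations

(1) Reading files

pd.read_csv(filepath, sep=',', delimiter=None, header='infer', names=None, index_col=None, prefix=None, nrows=None, encoding=None, skiprows=0)

Common parameters:

(1)filepath: path to the file

(2)sep: the separator, ',' by default

(3)delimiter: str, default None. An alias for sep; specify only one of the two

(4)header: row number to use as the column labels. The default 'infer' behaves like header=0 (the first row is the header) when names is not given; if the file has no header, set header=None

(5)names: list of column names. Especially useful when the file has no header, i.e. with header=None

(6)index_col: the column or columns to use as the row index. With several columns the result has a hierarchical index

(7)prefix: prefix for the auto-numbered column names when there is no header, e.g. prefix="x" gives "x0", "x1", "x2" (this parameter was removed in pandas 2.0)

(8)nrows: number of rows to read (counted from the start of the file)

(9)encoding: the text encoding of the file (e.g. 'utf-8' or 'gbk'); set it when the text comes out garbled

(10)skiprows: number of lines to skip at the start of the file, or a list of 0-based line numbers to skip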

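(2) Saving files

to_csv(path,sep,na_rep,columns,header,index)

Parameters:

(1)path: str, a file name, relative path, file object, etc.

(2)sep: str, the separator, with the same meaning as in read_csv().

(3)na_rep: str, the text written in place of NaN.

(4)columns: list, the columns to write.

(5)header: defaults to True (write the column names); header=False or None writes no header row.

(6)index: defaults to True (write the row index); False omits it.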

For example:

import numpy as np
import pandas as pd

path = 'out.csv'  # output file name (placeholder)

df = pd.DataFrame({"a":[1,2,3],
                   "b":[6,np.nan,6],
                   "c":[3,4,np.nan]})

df.to_csv(path)
df.to_csv(path,header=None)
df.to_csv(path, columns=["a","c"],index=False)
df.to_csv(path, na_rep='0')
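To see what these options actually write without touching the disk, the output can go to an in-memory buffer. A minimal sketch; io.StringIO stands in for a real path, and the marker "NA" is an arbitrary choice for na_rep.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [6, np.nan, 6],
                   "c": [3, 4, np.nan]})

buf = io.StringIO()
# Write only columns a and c, no row index, NaN written as "NA"
df.to_csv(buf, columns=["a", "c"], index=False, na_rep="NA")
text = buf.getvalue()
print(text)

# Reading the text back turns "NA" into NaN again
back = pd.read_csv(io.StringIO(text))
print(back)
```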

(3) Exporting in another format

df.to_html("test.html")

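12. Swapping rows and columns

stack() "stacks" the data: it pivots the columns into the rows.

unstack() is the inverse of stack(): it pivots the rows into the columns.

Usage: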

a= pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado']),
                    columns=pd.Index(['one', 'two', 'three']))
print(a)
print()
print()
print(a.stack())
print()
print()
print(a.unstack())



          one  two  three
Ohio        0    1      2
Colorado    3    4      5


Ohio      one      0
          two      1
          three    2
Colorado  one      3
          two      4
          three    5
dtype: int32


one    Ohio        0
       Colorado    3
two    Ohio        1
       Colorado    4
three  Ohio        2
       Colorado    5
dtype: int32

Swapping rows and columns directly with .T (the matrix transpose)

df = pd.DataFrame(data = {'col_0': [1,2,3],
                          'col_1':list('abc'),
                          'col_2': [1.2, 2.2, 3.2]},
                  index = ['row_%d'%i for i in range(3)])
df

df.T


       col_0    col_1   col_2
row_0    1        a      1.2
row_1    2        b      2.2
row_2    3        c      3.2

       row_0    row_1   row_2
col_0      1       2    3
col_1      a       b    c
col_2     1.2     2.2   3.2

13. Long table to wide table (pivot)

df = pd.DataFrame({'Class':[1,1,2,2],
                   'Name':['San Zhang','San Zhang','Si Li','Si Li'],
                   'Subject':['Chinese','Math','Chinese','Math'],
                   'Grade':[80,75,90,85]})
df
df.pivot(index = 'Name',columns = 'Subject',values = 'Grade')


   Class    Name       Subject   Grade
0   1      San Zhang    Chinese   80
1   1      San Zhang    Math      75
2   2      Si Li       Chinese    90
3   2      Si Li        Math      85

Subject   Chinese   Math
Name        
San Zhang   80        75
Si Li       90        85
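The reverse direction, wide back to long, is not covered in this section; one way to do it is melt(). A minimal sketch on the same grade data, offered as an assumption about what a reader might want next rather than as part of the original walkthrough.

```python
import pandas as pd

df_long = pd.DataFrame({'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
                        'Subject': ['Chinese', 'Math', 'Chinese', 'Math'],
                        'Grade': [80, 75, 90, 85]})

# Long to wide: one row per Name, one column per Subject
wide = df_long.pivot(index='Name', columns='Subject', values='Grade')

# Wide to long again: one row per (Name, Subject) pair
long_again = wide.reset_index().melt(id_vars='Name',
                                     var_name='Subject',
                                     value_name='Grade')
print(wide)
print(long_again)
```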

14. Missing values

(1) Detecting missing values with isna() and isnull()

The two methods return the same result (isnull() is an alias of isna())

raw_data = {"name": ['Bulbasaur', np.nan,'Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, np.nan, 44, 45],
            "pokedex": ['yes', 'no',np.nan,'no']                        
            }
a = pd.DataFrame(raw_data)
a.isna()
a.isnull()

    name    evolution   type     hp     pokedex
0   False   False      False    False    False
1   True    False      False    True     False
2   False   False      False    False    True
3   False   False      False    False    False

(2) Selecting rows by combining all(axis=1) and any(axis=1) with isna() and notna()

An example

raw_data = {"name": ['Bulbasaur', np.nan,'Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [np.nan, np.nan, np.nan, 15],
            "pokedex": ['yes', 'no',np.nan,'no']                        
            }
a = pd.DataFrame(raw_data)
a[a.notna().all(axis=1)]

    name       evolution    type      hp    pokedex
3   Caterpie    Metapod      bug     15.0     no
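The any(axis=1) counterpart selects rows with at least one missing value. A minimal sketch on a smaller made-up table; the variable names any_missing, all_missing, and complete are hypothetical.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'name': ['Bulbasaur', np.nan, 'Squirtle'],
                  'hp': [45, np.nan, 44],
                  'pokedex': ['yes', 'no', np.nan]})

any_missing = a[a.isna().any(axis=1)]   # rows with at least one NaN
all_missing = a[a.isna().all(axis=1)]   # rows where every value is NaN
complete = a[a.notna().all(axis=1)]     # rows with no NaN at all

print(any_missing)
print(complete)
```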

These are the pandas operations I have run into while doing data analysis; I will keep adding to this list.