Pandas: Advanced DataFrame Functions (Part 1)

Latest recommended article published on 2023-08-06 08:57:38

蓬莱道人

Category: Python

Article link: https://blog.csdn.net/MOU_IT/article/details/78758755

Previous post: Pandas DataFrame reading, saving, insertion, deletion, lookup, and modification

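1. Sorting a DataFrame

2. Element-wise computation on a DataFrame

3. Merging DataFrame objects

4. Setting and resetting the DataFrame index

5. Dropping duplicate rows based on specific columns

6. Converting a categorical variable into a dummy matrix (indicator matrix)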

1. Sorting a DataFrame

(1) Sorting by a single column or by multiple columns

df
   col1  col2  col3
b     4     6     5
a     1     3     2
c     7     9     8

# Sort by the first column, ascending
df.sort_values(by='col1', axis=0, ascending=True, inplace=True)
print(df)
   col1  col2  col3
a     1     3     2
b     4     6     5
c     7     9     8

# Sort by the first column, descending
df.sort_values(by='col1', axis=0, ascending=False, inplace=True)
print(df)
   col1  col2  col3
c     7     9     8
b     4     6     5
a     1     3     2

df
   a  b  c
0  9  4  6
1  2  7  5
2  5 -3  8
3  1  2  3
4  0  2  4
5  7  2  4

# Sort ascending by b, then c, then a (later columns break ties in earlier ones)
df.sort_values(by=['b', 'c', 'a'], axis=0, ascending=True, inplace=True)
print(df)
   a  b  c
2  5 -3  8
3  1  2  3
4  0  2  4
5  7  2  4
0  9  4  6
1  2  7  5
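
The multi-column sort above can be reproduced end to end with the sketch below. The dict-based construction of df is an assumption, since the post only shows the printed frame. The sort call is the same one used above, minus inplace, so the sorted frame is returned.

```python
import pandas as pd

# Assumed construction of the frame printed above (the post only shows its output).
df = pd.DataFrame({
    "a": [9, 2, 5, 1, 0, 7],
    "b": [4, 7, -3, 2, 2, 2],
    "c": [6, 5, 8, 3, 4, 4],
})

# Sort by b first; rows with equal b are ordered by c, and remaining ties by a.
result = df.sort_values(by=["b", "c", "a"], ascending=True)
print(result)
```

Rows 3, 4, and 5 all have b equal to 2, so their relative order is decided by c, and rows 4 and 5 (both with c equal to 4) are finally ordered by a.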

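Note: with inplace=True, the data in df is sorted directly and the function returns None. If inplace is not set to True (the default is False), df is left unmodified and a new, sorted DataFrame is returned.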
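
A minimal sketch of the difference between the two modes. The frame is rebuilt here from the values printed earlier; how it is constructed is an assumption.

```python
import pandas as pd

# Assumed construction of the three-row frame used in the examples above.
df = pd.DataFrame(
    {"col1": [4, 1, 7], "col2": [6, 3, 9], "col3": [5, 2, 8]},
    index=["b", "a", "c"],
)

# Default (inplace=False): df is untouched and a new sorted frame is returned.
sorted_df = df.sort_values(by="col1")
print(list(df.index))         # original row order is preserved
print(list(sorted_df.index))  # row order of the sorted copy

# inplace=True: df itself is reordered and the call returns None.
ret = df.sort_values(by="col1", ascending=False, inplace=True)
print(ret)
print(list(df.index))
```

A common mistake follows from this: writing df = df.sort_values(..., inplace=True) replaces df with None.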

(2) Sorting along the row direction (that is, reordering the columns by their labels)

df
   col2  col1  col3
b     4     6     5
a     1     3     2
c     7     9     8

# Reorder the columns by label, ascending
df.sort_index(axis=1, ascending=True, inplace=True)
print(df)
   col1  col2  col3
b     6     4     5
a     3     1     2
c     9     7     8

# Reorder the columns by label, descending
df.sort_index(axis=1, ascending=False, inplace=True)
print(df)
   col3  col2  col1
b     5     4     6
a     2     1     3
c     8     7     9
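
The following sketch rebuilds the frame above (its construction is an assumption) and contrasts the two axes of sort_index: axis=1 reorders columns by label, as in this section, while axis=0 reorders rows by index label. Only the labels are compared; the cell values play no part.

```python
import pandas as pd

# Assumed construction of the frame printed above, with columns out of order.
df = pd.DataFrame(
    {"col2": [4, 1, 7], "col1": [6, 3, 9], "col3": [5, 2, 8]},
    index=["b", "a", "c"],
)

# axis=1: reorder the columns by their labels; the row order stays b, a, c.
by_column_label = df.sort_index(axis=1)
print(by_column_label)

# axis=0 (the default): reorder the rows by their index labels instead.
by_row_label = df.sort_index(axis=0)
print(by_row_label)
```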

2. Element-wise computation on a DataFrame

df.sum()                    # sums each column by default
df.sum(axis=1)              # sums each row
df.apply(lambda x: x * 2)   # multiplies every element by 2
df ** 2                     # squares every element

df
   col2  col1  col3
b     4     6     5
a     1     3     2
c     7     9     8

# Sum each column (the default)
print(df.sum())
col2    12
col1    18
col3    15
dtype: int64

# Sum each row
print(df.sum(axis=1))
b    15
a     6
c    24
dtype: int64

# Multiply every element by 2
print(df.apply(lambda x: x * 2))
   col2  col1  col3
b     8    12    10
a     2     6     4
c    14    18    16

# Square every element
print(df ** 2)
   col2  col1  col3
b    16    36    25
a     1     9     4
c    49    81    64
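
The four operations above, collected into one runnable sketch. The construction of df is an assumption based on the printed frame. Note that apply passes each column to the lambda as a Series, so the doubling is done column by column; plain arithmetic on the frame gives the same result.

```python
import pandas as pd

# Assumed construction of the frame printed above.
df = pd.DataFrame(
    {"col2": [4, 1, 7], "col1": [6, 3, 9], "col3": [5, 2, 8]},
    index=["b", "a", "c"],
)

col_sums = df.sum()                      # axis=0 by default: one total per column
row_sums = df.sum(axis=1)                # one total per row
doubled = df.apply(lambda col: col * 2)  # the lambda receives a whole column (Series)
squared = df ** 2                        # element-wise power

print(col_sums)
print(row_sums)
print(doubled.equals(df * 2))            # apply and plain arithmetic agree here
print(squared)
```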

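3. Merging DataFrame objects

(1) DataFrame.join(): joins the columns of another DataFrame onto this one, matching either on the index (the index acts as the key) or on a key column (that column acts as the key).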

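Function signature:
  DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Parameters:
  other : a DataFrame, a named Series, or a list of DataFrames. Its index is matched against the caller's index (or against the column given by on). If a Series is passed, its name attribute must be set.
  on : a column name, a list/tuple of column names, or an array-like. The column(s) of the caller to match against the index of other; by default the join is index-on-index.
  how : the join type: inner, left (left outer join), right (right outer join), or outer (full outer join). The default is left.
  lsuffix : string; the suffix appended to the left frame's column names when both frames share a column name.
  rsuffix : string; the suffix appended to the right frame's column names when both frames share a column name.
  sort : boolean, default False. If True, the result is sorted by the join key; leaving it False skips that sort and can be faster.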
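
A small sketch of how these parameters interact. The frames left and right, their contents, and the suffix strings "_l" and "_r" are all hypothetical and chosen only for illustration. Both frames have a column named value, so a suffix must be supplied for the overlap.

```python
import pandas as pd

# Hypothetical frames: "left" carries its key in a column, "right" in its index.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "value": [1, 2, 3]})
right = pd.DataFrame({"value": [10, 20]}, index=["k0", "k2"])

# Index-on-index join: move "key" into the index first, then do a left join.
# Overlapping "value" columns are disambiguated with lsuffix / rsuffix.
by_index = left.set_index("key").join(right, how="left", lsuffix="_l", rsuffix="_r")
print(by_index)

# Key-column join: on="key" matches left's "key" column against right's index.
# how="inner" keeps only the keys present in both frames.
by_column = left.join(right, on="key", how="inner", lsuffix="_l", rsuffix="_r")
print(by_column)
```

In the left join, the key k1 has no match in right, so its value_r entry is NaN; in the inner join that row is dropped instead.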