[Data Science] 03 Common Operations in the pandas Library

春杪无蜩

Last modified 2022-04-14 16:52:16

Category: Data Science. Tags: python, data analysis

First published 2022-04-14 15:15:37

Article link: https://blog.csdn.net/weixin_47575631/article/details/124170105

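Contents

1. Arithmetic Operations and Data Alignment
2. Function Application and Mapping
3. Rebuilding and Replacing Indexes
4. Other Common Operations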

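1. Arithmetic Operations and Data Alignment

pandas can perform arithmetic on two objects that have different indexes. The operation automatically aligns the data by index label, and the index of the result is the union of the two objects' indexes.

1.1 Arithmetic on Series/DataFrame

(1) Taking addition as the example, the two objects can be added directly with +.

Series addition
Elements with the same index label are added; a label that appears in only one of the two objects gets NA (NaN) in the result.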

# [Example 1]
import pandas as pd
mark1 = pd.Series([90,88,78,95,80],index=['1850001','1850005','1850003','1850007','1850002'])
mark2 = pd.Series([5,7,9],index=['1850001','1850005','1850010'])
mark3 = mark1+mark2

>>> mark3
	1850001    95.0
	1850002     NaN
	1850003     NaN
	1850005    95.0
	1850007     NaN
	1850010     NaN
	dtype: float64

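DataFrame addition
Elements that share both the row label and the column label are added; any cell whose row label or column label is missing from one of the two objects gets NA (NaN).

# [Example 2]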
import pandas as pd
mark = pd.DataFrame({'Math':[90,85,88],'English':[85,92,87],'Computer':[90,80,82]},index=['1850001','1850003','1850005'])
bonus = pd.DataFrame({'Math':[5,5,2],'Physics':[4,3,2]},index=['1850001','1850002','1850003'])
gross_mark = mark+bonus

>>> mark
	         Math  English  Computer
	1850001    90       85        90
	1850003    85       92        80
	1850005    88       87        82
>>> bonus
	         Math  Physics
	1850001     5        4
	1850002     5        3
	1850003     2        2
>>> gross_mark
	         Computer  English  Math  Physics
	1850001       NaN      NaN  95.0      NaN
	1850002       NaN      NaN   NaN      NaN
	1850003       NaN      NaN  87.0      NaN
	1850005       NaN      NaN   NaN      NaN

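(2) Arithmetic through object methods

Use the add, sub, mul, and div methods of the object to perform arithmetic.
The fill_value parameter specifies a value that stands in for missing entries during the operation.

Continuing Example 2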

>>> mark.add(bonus,fill_value=0)
	         Computer  English  Math  Physics
	1850001      90.0     85.0  95.0      4.0
	1850002       NaN      NaN   5.0      3.0
	1850003      80.0     92.0  87.0      2.0
	1850005      82.0     87.0  88.0      NaN
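# After alignment, a cell that exists in only one of mark and bonus has its missing side replaced with 0 before the addition; a cell that exists in neither DataFrame stays NaN.
# Here the NaN cells are English and Computer for 1850002, and Physics for 1850005.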
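
The same fill_value behavior applies to the other arithmetic methods. The sketch below rebuilds the data of Example 2 and uses sub; the names diff and diff_filled are illustrative and do not appear in the original post.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)
bonus = pd.DataFrame(
    {"Math": [5, 5, 2], "Physics": [4, 3, 2]},
    index=["1850001", "1850002", "1850003"],
)

# mark - bonus; a cell missing on one side only is treated as 0.
diff = mark.sub(bonus, fill_value=0)

# Cells missing on both sides are still NaN; fillna can replace them afterwards.
diff_filled = diff.fillna(0)

print(diff)
print(diff_filled)
```

If a NaN-free result is needed, calling fillna after the operation is one option, because fill_value alone never fills a cell that is absent from both operands.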

1.2 Series/DataFrame with a Scalar

When a Series/DataFrame is combined with a scalar, the operation is applied to every element.
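
A minimal sketch of scalar arithmetic, reusing the mark data of Example 2. The names curved, doubled_math, and fraction are illustrative only.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# DataFrame with a scalar: every cell gets 5 added.
curved = mark + 5

# Series with a scalar: every element is doubled.
doubled_math = mark["Math"] * 2

# Division works the same way and produces floats.
fraction = mark / 100

print(curved)
print(doubled_math)
print(fraction)
```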

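1.3 DataFrame with a Series

The operation is carried out by broadcasting.
The axis parameter of the arithmetic methods selects broadcasting across rows or down columns.
With axis=1 the Series index is matched against the DataFrame's column labels and the Series is broadcast across every row; with axis=0 it is matched against the row index and broadcast down every column. The default is axis=1.
Note: broadcasting down columns requires the arithmetic methods (add, sub, mul, div), because the operators always match on column labels.

Continuing Example 2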

bonus_row = pd.Series([5,4,3],index=['Math','English','Computer'])
bonus_col = pd.Series([5,4,3],index=['1850001','1850003','1850005'])
>>> mark # original data
	         Math  English  Computer
	1850001    90       85        90
	1850003    85       92        80
	1850005    88       87        82
>>> mark+bonus_row # broadcast across rows
	         Math  English  Computer
	1850001    95       89        93
	1850003    90       96        83
	1850005    93       91        85
>>> mark.add(bonus_col,axis=0) # broadcast down columns
	         Math  English  Computer
	1850001    95       90        95
	1850003    89       96        84
	1850005    91       90        85
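
To see why the method form is needed for column-wise broadcasting, the sketch below compares the + operator with add(axis=0) on the same data. The names misaligned, aligned, and centered are illustrative, and the row-centering step is an extra use of sub that is not in the original post.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)
bonus_col = pd.Series([5, 4, 3], index=["1850001", "1850003", "1850005"])

# The + operator matches the Series index against the column labels.
# No student ID equals a column name, so every cell of the result is NaN.
misaligned = mark + bonus_col

# The method form with axis=0 matches on the row index instead.
aligned = mark.add(bonus_col, axis=0)

# Same idea with sub: subtract each student's own mean from that student's row.
centered = mark.sub(mark.mean(axis=1), axis=0)

print(misaligned)
print(aligned)
print(centered)
```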

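2. Function Application and Mapping

NumPy ufuncs (element-wise array functions) can be applied to pandas objects, for example np.abs(df).
The apply method of a DataFrame applies a function to the one-dimensional arrays formed by each column or each row.
With axis=1 the function receives each row; with axis=0 it receives each column. The default is axis=0.

Continuing Example 2

# Apply a function to each column of the DataFrame
function_max = lambda x:x.max() # highest score in each course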
>>> mark.apply(function_max)
	Math        90
	English     92
	Computer    90
	dtype: int64
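
The example above covers the default axis=0. As a complement, here is a small sketch of a ufunc applied to the whole DataFrame and of apply with axis=1. The reference score 88 and the names deviation and spread are made up for illustration.

```python
import numpy as np
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# A NumPy ufunc works element-wise and returns a DataFrame with the same labels.
deviation = np.abs(mark - 88)

# axis=1: the function receives one row (one student) at a time.
spread = mark.apply(lambda row: row.max() - row.min(), axis=1)

print(deviation)
print(spread)
```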

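The applymap method of a DataFrame applies a function to every element of the DataFrame. (pandas 2.1 and later deprecate applymap in favor of DataFrame.map, which does the same thing.)

Continuing Example 2

# Apply a function to every element of the DataFrame
function_add = lambda x:x+5 # raise every score by 5 points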
>>> mark.applymap(function_add)
	         Math  English  Computer
	1850001    95       90        95
	1850003    90       97        85
	1850005    93       92        87
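
Because the element-wise method is named applymap in older pandas and map from pandas 2.1 on, a version-tolerant sketch can pick whichever one exists. The hasattr check and the names elementwise and raised are illustrative choices, not part of the original post.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# Use DataFrame.map when this pandas version has it, else fall back to applymap.
elementwise = mark.map if hasattr(mark, "map") else mark.applymap

# Raise every score by 5 points, one element at a time.
raised = elementwise(lambda x: x + 5)

# For simple arithmetic the vectorized form gives the same values and is faster.
vectorized = mark + 5

print(raised)
print((raised.values == vectorized.values).all())
```

Element-wise Python functions are most useful for logic that has no vectorized equivalent, such as formatting each value as a string.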

3. Rebuilding and Replacing Indexes

3.1 Reordering and specifying the index with reindex()

Continuing Example 2

>>> mark
	         Math  English  Computer
	1850001    90       85        90
	1850003    85       92        80
	1850005    88       87        82

>>> mark.reindex(index=['1850005','1850003','1850001'],columns=['Computer','English','Math'])
	         Computer  English  Math
	1850005        82       87    88
	1850003        80       92    85
	1850001        90       85    90

>>> mark.reindex(index=['1850005','1850003','1850001','1850000'],columns=['Computer','English','Math'],fill_value=0) # fill_value replaces the missing values
	         Computer  English  Math
	1850005        82       87    88
	1850003        80       92    85
	1850001        90       85    90
	1850000         0        0     0
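
For comparison, a short sketch of what reindex does when fill_value is omitted: labels that are not in the original index get NaN, and the original object is left unchanged. The name extended is illustrative.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# '1850000' is not in mark, so its row is filled with NaN.
extended = mark.reindex(index=["1850001", "1850000"])

# NaN forces the integer columns to become float64.
print(extended)
print(extended.dtypes)

# reindex returns a new object; mark itself still has its three rows.
print(mark.index.tolist())
```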

3.2 Relabeling the index with rename()

The rename method takes a dictionary that maps old labels to new labels.

>>> mark.rename(columns={'Math':'M','English':'E','Computer':'C'}) # relabel the columns
	          M   E   C
	1850001  90  85  90
	1850003  85  92  80
	1850005  88  87  82
>>> mark.rename(index={'1850001':'01','1850003':'03','1850005':'05'}) # relabel the rows
	    Math  English  Computer
	01    90       85        90
	03    85       92        80
	05    88       87        82
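
Besides a dictionary, rename also accepts a function that is called on each label. The sketch below reproduces the row relabeling above with a function; the name short is illustrative.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# Keep the last two characters of each row label, upper-case each column label.
short = mark.rename(index=lambda label: label[-2:], columns=str.upper)

print(short)

# rename returns a new DataFrame; the labels of mark are unchanged.
print(mark.columns.tolist())
```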

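3.3 Setting a data column as the index with set_index()

set_index(column_name,drop)

The set_index method of a DataFrame sets a new index by turning a data column into the index. Its drop parameter defaults to True, which removes that column from the data; with drop=False the column is kept as a data column as well.
The reset_index method restores the default integer index.
For reset_index, drop defaults to False and the old index is kept as a data column; with drop=True the old index is discarded.

Continuing Example 2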

>>> mark.set_index('Math')
	      English  Computer
	Math                   
	90         85        90
	85         92        80
	88         87        82

>>> mark.reset_index() # drop defaults to False
	     index  Math  English  Computer
	0  1850001    90       85        90
	1  1850003    85       92        80
	2  1850005    88       87        82

>>> mark.reset_index(drop=True) 
	   Math  English  Computer
	0    90       85        90
	1    85       92        80
	2    88       87        82
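
The drop parameter has opposite defaults in the two methods, which the following sketch makes visible. The names by_math, kept, restored, and round_trip are illustrative.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# set_index: drop defaults to True, so Math leaves the data columns.
by_math = mark.set_index("Math")

# drop=False keeps Math both as the index and as a data column.
kept = mark.set_index("Math", drop=False)

# reset_index: drop defaults to False, so the old index becomes a column named 'index'.
restored = mark.reset_index()

# Setting that column as the index again brings back the original row labels.
round_trip = restored.set_index("index")

print(by_math.columns.tolist())
print(kept.columns.tolist())
print(restored.columns.tolist())
print(round_trip.index.tolist())
```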

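4. Other Common Operations

4.1 Viewing and changing data types

Data type of a Series: Series.dtype
Data types of all columns of a DataFrame: DataFrame.dtypes
Converting the type of a column: Series.astype(), whose argument is the target data type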
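
A minimal sketch of the three calls on the mark data of Example 2. The names math_dtype, all_dtypes, as_float, and as_text are illustrative.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# dtype of a single column (a Series).
math_dtype = mark["Math"].dtype

# dtypes of all columns, returned as a Series indexed by column name.
all_dtypes = mark.dtypes

# astype returns a converted copy; the original column is unchanged.
as_float = mark["Math"].astype("float64")
as_text = mark["Math"].astype(str)

print(math_dtype)
print(all_dtypes)
print(as_float)
print(as_text.tolist())
```

To make a conversion permanent, assign the result back, for example mark['Math'] = mark['Math'].astype('float64').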

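4.2 Unique values

The unique method returns an array of the unique values of a Series or of one DataFrame column.
The nunique method counts the number of distinct values.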
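
The scores in Example 2 contain few repeats, so this sketch uses a small made-up column of letter grades (hypothetical data, not from the original post) to show both methods.

```python
import pandas as pd

# Hypothetical letter grades with repeated values.
grades = pd.DataFrame({"Grade": ["A", "B", "A", "C", "B", "A"]})

# unique returns the distinct values in order of first appearance.
distinct = grades["Grade"].unique()

# nunique returns how many distinct values there are.
n_distinct = grades["Grade"].nunique()

print(distinct)
print(n_distinct)
```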

4.3 Value counts

The value_counts method counts how many times each value occurs in a DataFrame column.
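
A short sketch on a made-up column of letter grades (hypothetical data, not from the original post). It also shows normalize=True, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical letter grades with repeated values.
grades = pd.DataFrame({"Grade": ["A", "B", "A", "C", "B", "A"]})

# Occurrences of each value, sorted from most to least frequent.
counts = grades["Grade"].value_counts()

# Relative frequencies that sum to 1.
shares = grades["Grade"].value_counts(normalize=True)

print(counts)
print(shares)
```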

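4.4 Membership

The isin method tests whether each value of a DataFrame column belongs to a given set of values; the resulting boolean mask can be used to select a subset of the rows.

Continuing Example 2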

>>> mark['Math'].isin([85,88])
	1850001    False
	1850003     True
	1850005     True
	Name: Math, dtype: bool

>>> mark.loc[mark['Math'].isin([85,88])]
	         Math  English  Computer
	1850003    85       92        80
	1850005    88       87        82
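
The boolean mask can also be inverted with ~ to select the rows whose value is not in the list. A minimal sketch on the same data; the names mask and others are illustrative.

```python
import pandas as pd

mark = pd.DataFrame(
    {"Math": [90, 85, 88], "English": [85, 92, 87], "Computer": [90, 80, 82]},
    index=["1850001", "1850003", "1850005"],
)

# True where the Math score is 85 or 88.
mask = mark["Math"].isin([85, 88])

# ~ flips the mask, keeping only the rows that did not match.
others = mark.loc[~mask]

print(others)
```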