The pandas library (2): data preprocessing

Article link: https://blog.csdn.net/weixin_43824414/article/details/105372148

Sample file:

off_date	off_time	card_id	card_type	device_num	off_station	on_date	on_time	on_station
2016-04-22	08:25:36	0000990771990514	102	22023703	0000037	201604	22075032	0000025
2016-04-22	12:32:43	0000990772079197	101	22011011	0000010	201604	22122620	0000008
2016-04-22	21:11:31	0000990772083185	101	22011315	0000013	201604	22210546	0000012
2016-04-22	18:05:24	0000990772083509	101	22023701	0000037	201604	22175633	0000035
2016-04-22	17:03:04	0000990772083471	101	22040210	0000002	201604	22163715	0000044
2016-04-22	08:07:11	0000997169251538	007	22035506	0000055	201604	22071314	0000011
2016-04-22	10:46:41	0000997169251538	007	22011216	0000012	201604	22094757	0000055
2016-04-22	18:04:25	0000997169148602	007	22023802	0000038	201604	22172720	0000009
2016-04-22	12:38:31	0000997169148602	007	22010814	0000008	201604	22115952	0000037
2016-04-22	09:36:15	0000997169237105	007	22022121	0000021	201604	22091548	0000017
2016-04-22	08:11:36	0000997169396171	007	22079406	0000094	201604	22075540	0000090
2016-04-22	19:28:03	0000997169396171	007	22079006	0000090	201604	22190423	0000095
2016-04-22	08:34:06	0000990172167480	101	29070903	0000109	201604	22080705	0000100
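
The code in the following sections uses a DataFrame named subway_info, but the post does not show how it is loaded. A minimal sketch of one way to do it, assuming the file is tab-separated as it appears above. The first three data rows are inlined through io.StringIO so the sketch runs on its own; with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# First rows of the sample file, inlined so the sketch is self-contained.
sample = (
    "off_date\toff_time\tcard_id\tcard_type\tdevice_num\toff_station\ton_date\ton_time\ton_station\n"
    "2016-04-22\t08:25:36\t0000990771990514\t102\t22023703\t0000037\t201604\t22075032\t0000025\n"
    "2016-04-22\t12:32:43\t0000990772079197\t101\t22011011\t0000010\t201604\t22122620\t0000008\n"
    "2016-04-22\t21:11:31\t0000990772083185\t101\t22011315\t0000013\t201604\t22210546\t0000012\n"
)

# sep="\t" because the columns are assumed to be tab-separated.
subway_info = pd.read_csv(io.StringIO(sample), sep="\t")

# Digit-only fields are parsed as integers, so "0000025" becomes 25.
print(subway_info["on_station"].tolist())   # [25, 8, 12]
print(subway_info["on_station"].dtype)      # int64
```

Because numeric parsing drops leading zeros, identifier-like columns such as card_id can be kept as text by passing dtype={"card_id": str} to pd.read_csv.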

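Sample code:

(1) Multiplying and dividing data

Code:

# Assumes subway_info is a DataFrame already loaded from the sample file with pandas (imported as pd)
div_1000 = subway_info["on_station"] / 1000   # Divide every value in the on_station column by 1000 (for illustration only; the result has no practical meaning)
print(div_1000)

Output: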

 0     0.025
 1     0.008
 2     0.012
 3     0.035
 4     0.044
 5     0.011
 6     0.055
 7     0.009
 8     0.037
 9     0.017
 10    0.090
 11    0.095
 12    0.100
 Name: on_station, dtype: float64
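
The section title mentions multiplication as well, which works the same way. A small sketch using a stand-in DataFrame with the first three on_station and off_station values from the sample file; the variable names doubled and stations_apart are only illustrative.

```python
import pandas as pd

# Small stand-in for two columns of the sample file.
subway_info = pd.DataFrame({
    "on_station": [25, 8, 12],
    "off_station": [37, 10, 13],
})

# Multiplying by a scalar applies to every element, just like division.
doubled = subway_info["on_station"] * 2
print(doubled.tolist())          # [50, 16, 24]

# Arithmetic between two columns is aligned row by row on the index.
stations_apart = subway_info["off_station"] - subway_info["on_station"]
print(stations_apart.tolist())   # [12, 2, 1]
```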

(2) Maximum value

Code:

max_on_station = subway_info["on_station"].max()   # Find the maximum value of the on_station column
print(max_on_station)

Output:
100
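
Related reductions follow the same pattern. A brief sketch on a stand-in Series holding the first five on_station values, showing min() and idxmax() next to max().

```python
import pandas as pd

# First five on_station values from the sample file.
on_station = pd.Series([25, 8, 12, 35, 44], name="on_station")

print(on_station.max())      # 44
print(on_station.min())      # 8
print(on_station.idxmax())   # 4  (index label of the row holding the maximum)
```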

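(3) Sorting data

Code:

subway_info.sort_values("on_station", inplace=True)   # Sort by the on_station column, ascending by default. inplace=True reorders subway_info itself instead of returning a new DataFrame; the source file on disk is not changed
print(subway_info["on_station"])

Output: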

 1       8
 7       9
 5      11
 2      12
 9      17
 0      25
 3      35
 8      37
 4      44
 6      55
 10     90
 11     95
 12    100
 Name: on_station, dtype: int64

Code:

subway_info.sort_values("on_station", inplace=True, ascending=False)   # Add ascending=False to sort in descending order
print(subway_info["on_station"])

Output:

 12    100
 11     95
 10     90
 6      55
 4      44
 8      37
 3      35
 0      25
 9      17
 2      12
 5      11
 7       9
 1       8
 Name: on_station, dtype: int64
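
To make the effect of inplace concrete, here is a short sketch on a three-row stand-in DataFrame. It contrasts the default behavior (a new sorted DataFrame is returned) with inplace=True (the DataFrame itself is reordered and the call returns None). The reset_index step is an optional extra, not something the examples above use.

```python
import pandas as pd

subway_info = pd.DataFrame({"on_station": [25, 8, 12]})

# Without inplace, sort_values returns a new sorted DataFrame
# and leaves subway_info untouched.
sorted_copy = subway_info.sort_values("on_station")
print(subway_info["on_station"].tolist())   # [25, 8, 12]
print(sorted_copy["on_station"].tolist())   # [8, 12, 25]
print(sorted_copy.index.tolist())           # [1, 2, 0]  (original row labels are kept)

# With inplace=True the DataFrame itself is reordered and the call returns None.
result = subway_info.sort_values("on_station", inplace=True)
print(result)                               # None
print(subway_info["on_station"].tolist())   # [8, 12, 25]

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting.
renumbered = sorted_copy.reset_index(drop=True)
print(renumbered.index.tolist())            # [0, 1, 2]
```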

(4) Handling missing data

Now remove the on_station values from the rows with index 4 and 7. The sample file becomes:

off_date	off_time	card_id	card_type	device_num	off_station	on_date	on_time	on_station
2016-04-22	08:25:36	0000990771990514	102	22023703	0000037	201604	22075032	0000025
2016-04-22	12:32:43	0000990772079197	101	22011011	0000010	201604	22122620	0000008
2016-04-22	21:11:31	0000990772083185	101	22011315	0000013	201604	22210546	0000012
2016-04-22	18:05:24	0000990772083509	101	22023701	0000037	201604	22175633	0000035
2016-04-22	17:03:04	0000990772083471	101	22040210	0000002	201604	22163715
2016-04-22	08:07:11	0000997169251538	007	22035506	0000055	201604	22071314	0000011
2016-04-22	10:46:41	0000997169251538	007	22011216	0000012	201604	22094757	0000055
2016-04-22	18:04:25	0000997169148602	007	22023802	0000038	201604	22172720
2016-04-22	12:38:31	0000997169148602	007	22010814	0000008	201604	22115952	0000037
2016-04-22	09:36:15	0000997169237105	007	22022121	0000021	201604	22091548	0000017
2016-04-22	08:11:36	0000997169396171	007	22079406	0000094	201604	22075540	0000090
2016-04-22	19:28:03	0000997169396171	007	22079006	0000090	201604	22190423	0000095
2016-04-22	08:34:06	0000990172167480	101	29070903	0000109	201604	22080705	0000100

Code:

station = subway_info["on_station"]    # Pull out the on_station column on its own
station_is_null = pd.isnull(station)   # Use pd.isnull() to check whether each on_station value is missing
print(station_is_null)

Output:

 0     False
 1     False
 2     False
 3     False
 4      True                           # station_is_null is True where the on_station value is missing
 5     False
 6     False
 7      True 
 8     False
 9     False
 10    False
 11    False
 12    False
 Name: on_station, dtype: bool         # The new Series station_is_null has dtype bool
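
The same mask can also be produced with the method form Series.isna(), and notna() gives the inverse. A small sketch on a stand-in Series with one missing value:

```python
import numpy as np
import pandas as pd

station = pd.Series([25, 8, np.nan, 35], name="on_station")

# Series.isna() is the method form of pd.isnull(series); both give the same mask.
mask = station.isna()
print(mask.tolist())                       # [False, False, True, False]
print(mask.equals(pd.isnull(station)))     # True

# notna() is the inverse mask and selects the rows that do have a value.
print(station[station.notna()].tolist())   # [25.0, 8.0, 35.0]
```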

Code:

station = subway_info["on_station"]
station_is_null = pd.isnull(station)
station_null_true = station[station_is_null]   # Use station_is_null as a boolean mask on station to see which rows are missing a value
print(station_null_true)
station_null_count = len(station_null_true)    # Count the missing on_station values
print(station_null_count)

Output:

 4   NaN
 7   NaN                               # on_station is missing in the rows with index 4 and 7
 Name: on_station, dtype: float64
 2                                     # 2 on_station values are missing in total
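
Once the missing values are located, they usually have to be dropped or filled. The sketch below is not from the original post; it uses a five-row stand-in DataFrame, and the fill value 0 is an arbitrary placeholder chosen for illustration. It also shows why the output above has dtype float64 even though the station numbers are integers.

```python
import numpy as np
import pandas as pd

subway_info = pd.DataFrame({
    "card_type": [102, 101, 101, 101, 101],
    "on_station": [25, 8, 12, 35, np.nan],
})

# True counts as 1, so summing the mask counts the missing values directly.
print(subway_info["on_station"].isna().sum())   # 1

# NaN is a float, so an integer column that contains it is stored as float64.
print(subway_info["on_station"].dtype)          # float64

# Option 1: drop the rows whose on_station is missing.
complete_rows = subway_info.dropna(subset=["on_station"])
print(len(complete_rows))                       # 4

# Option 2: fill the gaps with a placeholder (0 is an arbitrary choice here).
filled = subway_info["on_station"].fillna(0)
print(filled.tolist())                          # [25.0, 8.0, 12.0, 35.0, 0.0]

# Reductions such as max() skip NaN by default.
print(subway_info["on_station"].max())          # 35.0
```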