Sample file:
off_date | off_time | card_id | card_type | device_num | off_station | on_date | on_time | on_station |
---|---|---|---|---|---|---|---|---|
2016-04-22 | 08:25:36 | 0000990771990514 | 102 | 22023703 | 0000037 | 201604 | 22075032 | 0000025 |
2016-04-22 | 12:32:43 | 0000990772079197 | 101 | 22011011 | 0000010 | 201604 | 22122620 | 0000008 |
2016-04-22 | 21:11:31 | 0000990772083185 | 101 | 22011315 | 0000013 | 201604 | 22210546 | 0000012 |
2016-04-22 | 18:05:24 | 0000990772083509 | 101 | 22023701 | 0000037 | 201604 | 22175633 | 0000035 |
2016-04-22 | 17:03:04 | 0000990772083471 | 101 | 22040210 | 0000002 | 201604 | 22163715 | 0000044 |
2016-04-22 | 08:07:11 | 0000997169251538 | 007 | 22035506 | 0000055 | 201604 | 22071314 | 0000011 |
2016-04-22 | 10:46:41 | 0000997169251538 | 007 | 22011216 | 0000012 | 201604 | 22094757 | 0000055 |
2016-04-22 | 18:04:25 | 0000997169148602 | 007 | 22023802 | 0000038 | 201604 | 22172720 | 0000009 |
2016-04-22 | 12:38:31 | 0000997169148602 | 007 | 22010814 | 0000008 | 201604 | 22115952 | 0000037 |
2016-04-22 | 09:36:15 | 0000997169237105 | 007 | 22022121 | 0000021 | 201604 | 22091548 | 0000017 |
2016-04-22 | 08:11:36 | 0000997169396171 | 007 | 22079406 | 0000094 | 201604 | 22075540 | 0000090 |
2016-04-22 | 19:28:03 | 0000997169396171 | 007 | 22079006 | 0000090 | 201604 | 22190423 | 0000095 |
2016-04-22 | 08:34:06 | 0000990172167480 | 101 | 29070903 | 0000109 | 201604 | 22080705 | 0000100 |
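The examples that follow use a DataFrame named subway_info, but the loading step is not shown. A minimal sketch of that step, assuming the sample is stored as comma-separated text (the real file name and delimiter are not given, so the in-memory csv_text below is a stand-in built from the first three rows of the table):

```python
import io
import pandas as pd

# Hypothetical CSV version of the first three sample rows; the real file format is an assumption.
csv_text = """off_date,off_time,card_id,card_type,device_num,off_station,on_date,on_time,on_station
2016-04-22,08:25:36,0000990771990514,102,22023703,0000037,201604,22075032,0000025
2016-04-22,12:32:43,0000990772079197,101,22011011,0000010,201604,22122620,0000008
2016-04-22,21:11:31,0000990772083185,101,22011315,0000013,201604,22210546,0000012
"""

# read_csv parses zero-padded digit strings such as 0000025 as integers by default.
subway_info = pd.read_csv(io.StringIO(csv_text))
print(subway_info["on_station"].tolist())
```

With the default settings the leading zeros are dropped, which is why on_station shows up as 25, 8, 12 in the outputs. Passing dtype=str to read_csv would keep the padded strings instead, at the cost of no longer being able to do arithmetic on them directly.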
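Sample code:
(1) Multiplying and dividing data
Code:
# Assumes pandas is imported as pd and subway_info is the DataFrame read from the sample file.
div_1000 = subway_info["on_station"] / 1000  # divide every value in the on_station column by 1000 (illustration only, the result has no practical meaning)
print(div_1000)
Output: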
0 0.025
1 0.008
2 0.012
3 0.035
4 0.044
5 0.011
6 0.055
7 0.009
8 0.037
9 0.017
10 0.090
11 0.095
12 0.100
Name: on_station, dtype: float64
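The heading mentions multiplication as well, which works the same way: the operator is applied to every element of the column. A small sketch, using a hand-built Series with the first three on_station values in place of the full DataFrame (the name times_2 is only illustrative):

```python
import pandas as pd

# Stand-in for subway_info["on_station"], using the first three values of the sample.
on_station = pd.Series([25, 8, 12], name="on_station")

# Element-wise multiplication; integer times integer keeps the int64 dtype.
times_2 = on_station * 2
print(times_2.tolist())
```

Unlike division, which always yields float64, multiplying an integer column by an integer leaves the dtype as int64.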
(2) Maximum value
Code:
max_on_station = subway_info["on_station"].max()  # find the largest value in the on_station column
print(max_on_station)
Output:
100
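Related aggregations follow the same pattern. As a sketch, the snippet below rebuilds the on_station column by hand from the sample table and also looks up the minimum and the row label of the maximum (min() and idxmax() are standard Series methods; the variable names are illustrative):

```python
import pandas as pd

# The 13 on_station values from the sample file, in row order.
on_station = pd.Series(
    [25, 8, 12, 35, 44, 11, 55, 9, 37, 17, 90, 95, 100], name="on_station"
)

max_value = on_station.max()     # largest value
min_value = on_station.min()     # smallest value
max_row = on_station.idxmax()    # index label of the row holding the largest value
print(max_value, min_value, max_row)
```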
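(3) Sorting data
Code:
# Sort by the on_station column, ascending by default.
# inplace=True sorts subway_info itself instead of returning a new DataFrame; the source file on disk is not changed.
subway_info.sort_values("on_station", inplace=True)
print(subway_info["on_station"])
Output: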
1 8
7 9
5 11
2 12
9 17
0 25
3 35
8 37
4 44
6 55
10 90
11 95
12 100
Name: on_station, dtype: int64
Code:
subway_info.sort_values("on_station", inplace=True, ascending=False)  # for descending order, add ascending=False
print(subway_info["on_station"])
Output:
12 100
11 95
10 90
6 55
4 44
8 37
3 35
0 25
9 17
2 12
5 11
7 9
1 8
Name: on_station, dtype: int64
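To make the effect of inplace concrete, here is a small sketch of the alternative: leaving inplace at its default (False) returns a sorted copy and keeps the original row order. The four-row DataFrame is a stand-in for subway_info, and sorted_info is an illustrative name:

```python
import pandas as pd

# Small stand-in for subway_info with the first four on_station values.
subway_info = pd.DataFrame({"on_station": [25, 8, 12, 35]})

# Without inplace=True, sort_values returns a new, sorted DataFrame.
sorted_info = subway_info.sort_values("on_station")

print(sorted_info["on_station"].tolist())   # sorted copy
print(subway_info["on_station"].tolist())   # original order is untouched
```

The sorted copy keeps the original index labels, which is why the outputs above show the labels out of order after sorting.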
(4) Handling missing data
Now delete the on_station values of rows 4 and 7, so that the sample file becomes:
off_date | off_time | card_id | card_type | device_num | off_station | on_date | on_time | on_station |
---|---|---|---|---|---|---|---|---|
2016-04-22 | 08:25:36 | 0000990771990514 | 102 | 22023703 | 0000037 | 201604 | 22075032 | 0000025 |
2016-04-22 | 12:32:43 | 0000990772079197 | 101 | 22011011 | 0000010 | 201604 | 22122620 | 0000008 |
2016-04-22 | 21:11:31 | 0000990772083185 | 101 | 22011315 | 0000013 | 201604 | 22210546 | 0000012 |
2016-04-22 | 18:05:24 | 0000990772083509 | 101 | 22023701 | 0000037 | 201604 | 22175633 | 0000035 |
2016-04-22 | 17:03:04 | 0000990772083471 | 101 | 22040210 | 0000002 | 201604 | 22163715 | |
2016-04-22 | 08:07:11 | 0000997169251538 | 007 | 22035506 | 0000055 | 201604 | 22071314 | 0000011 |
2016-04-22 | 10:46:41 | 0000997169251538 | 007 | 22011216 | 0000012 | 201604 | 22094757 | 0000055 |
2016-04-22 | 18:04:25 | 0000997169148602 | 007 | 22023802 | 0000038 | 201604 | 22172720 | |
2016-04-22 | 12:38:31 | 0000997169148602 | 007 | 22010814 | 0000008 | 201604 | 22115952 | 0000037 |
2016-04-22 | 09:36:15 | 0000997169237105 | 007 | 22022121 | 0000021 | 201604 | 22091548 | 0000017 |
2016-04-22 | 08:11:36 | 0000997169396171 | 007 | 22079406 | 0000094 | 201604 | 22075540 | 0000090 |
2016-04-22 | 19:28:03 | 0000997169396171 | 007 | 22079006 | 0000090 | 201604 | 22190423 | 0000095 |
2016-04-22 | 08:34:06 | 0000990172167480 | 101 | 29070903 | 0000109 | 201604 | 22080705 | 0000100 |
Code:
station = subway_info["on_station"]  # pull out the on_station column on its own
station_is_null = pd.isnull(station)  # use pd.isnull() to test whether each on_station value is missing
print(station_is_null)
Output:
0 False
1 False
2 False
3 False
4 True #if the on_station value is missing, station_is_null is True
5 False
6 False
7 True
8 False
9 False
10 False
11 False
12 False
Name: on_station, dtype: bool #the values of station_is_null are of type bool
Code:
station = subway_info["on_station"]
station_is_null = pd.isnull(station)
station_null_true = station[station_is_null]  # use station_is_null as a mask on station to see which rows are missing a value
print(station_null_true)
station_null_count = len(station_null_true)  # count the missing on_station values
print(station_null_count)
Output:
4 NaN
7 NaN #rows 4 and 7 are missing their on_station value
Name: on_station, dtype: float64
2 #on_station has 2 missing values in total
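The same count can be obtained more directly, and the missing rows can be dropped once they have been identified. A sketch under the assumption that the column looks like the first eight rows of the modified sample (the Series is built by hand; missing_count and station_clean are illustrative names):

```python
import pandas as pd

# First eight on_station values of the modified sample; rows 4 and 7 are missing.
station = pd.Series([25, 8, 12, 35, None, 11, 55, None], name="on_station")

# True counts as 1 when summed, so this is the number of missing values.
missing_count = int(station.isnull().sum())

# Index labels of the rows whose value is missing.
missing_rows = station[station.isnull()].index.tolist()

# dropna() returns a copy without the missing rows.
station_clean = station.dropna()

print(missing_count, missing_rows, len(station_clean))
```

Whether to drop the rows or fill them with a substitute value (for example with fillna) depends on the analysis; dropping is shown here only because it is the simplest option.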