The pandas library (2): data preprocessing

Sample file:

off_date    off_time  card_id           card_type  device_num  off_station  on_date   on_time  on_station
2016-04-22  08:25:36  0000990771990514  102        22023703    0000037      20160422  075032   0000025
2016-04-22  12:32:43  0000990772079197  101        22011011    0000010      20160422  122620   0000008
2016-04-22  21:11:31  0000990772083185  101        22011315    0000013      20160422  210546   0000012
2016-04-22  18:05:24  0000990772083509  101        22023701    0000037      20160422  175633   0000035
2016-04-22  17:03:04  0000990772083471  101        22040210    0000002      20160422  163715   0000044
2016-04-22  08:07:11  0000997169251538  007        22035506    0000055      20160422  071314   0000011
2016-04-22  10:46:41  0000997169251538  007        22011216    0000012      20160422  094757   0000055
2016-04-22  18:04:25  0000997169148602  007        22023802    0000038      20160422  172720   0000009
2016-04-22  12:38:31  0000997169148602  007        22010814    0000008      20160422  115952   0000037
2016-04-22  09:36:15  0000997169237105  007        22022121    0000021      20160422  091548   0000017
2016-04-22  08:11:36  0000997169396171  007        22079406    0000094      20160422  075540   0000090
2016-04-22  19:28:03  0000997169396171  007        22079006    0000090      20160422  190423   0000095
2016-04-22  08:34:06  0000990172167480  101        29070903    0000109      20160422  080705   0000100
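
The snippets in this post assume the sample file has already been loaded into a DataFrame named subway_info. The file name and format are not given here, so the following is only a minimal sketch: it assumes a comma-separated layout, assumes where the boundaries between card_id, card_type and device_num fall, and loads the first three records from an in-memory string instead of a real file.

```python
import io

import pandas as pd

# Assumption: the sample file is comma-separated. Only the first three
# records are reproduced, and the field boundaries are assumed.
sample_csv = io.StringIO(
    "off_date,off_time,card_id,card_type,device_num,"
    "off_station,on_date,on_time,on_station\n"
    "2016-04-22,08:25:36,0000990771990514,102,22023703,0000037,20160422,075032,0000025\n"
    "2016-04-22,12:32:43,0000990772079197,101,22011011,0000010,20160422,122620,0000008\n"
    "2016-04-22,21:11:31,0000990772083185,101,22011315,0000013,20160422,210546,0000012\n"
)

subway_info = pd.read_csv(sample_csv)

# Zero-padded numeric fields are parsed as integers, so 0000025 becomes 25
print(subway_info["on_station"].tolist())   # [25, 8, 12]
```

This is also why the outputs below show on_station as 25, 8, 12 and so on rather than as zero-padded strings.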

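Sample code:

(1) Multiplying and dividing data

Code:

# subway_info is the DataFrame loaded from the sample file above
div_1000 = subway_info["on_station"] / 1000   # divide every value in the on_station column by 1000; for illustration only, the result has no practical meaning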
print(div_1000)

Output:

 0     0.025
 1     0.008
 2     0.012
 3     0.035
 4     0.044
 5     0.011
 6     0.055
 7     0.009
 8     0.037
 9     0.017
 10    0.090
 11    0.095
 12    0.100
 Name: on_station, dtype: float64
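
The heading mentions multiplication as well, which works the same way. As a small sketch (the three-row DataFrame here is a hypothetical stand-in for subway_info), arithmetic operators are applied to every element and return a new Series, leaving the original column untouched:

```python
import pandas as pd

# Hypothetical stand-in for subway_info: the first three on_station values
subway_info = pd.DataFrame({"on_station": [25, 8, 12]})

# Each operator is applied element by element and returns a new Series
times_2 = subway_info["on_station"] * 2
div_1000 = subway_info["on_station"] / 1000

print(times_2.tolist())                     # [50, 16, 24]
print(div_1000.tolist())                    # [0.025, 0.008, 0.012]
print(subway_info["on_station"].tolist())   # [25, 8, 12], unchanged
```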

(2) Maximum value

Code:

max_on_station = subway_info["on_station"].max()   # find the maximum value of the on_station column
print(max_on_station)

Output:
100
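
To see which record holds the maximum, idxmax() can be used alongside max(). The sketch below rebuilds only the on_station column from the sample file as a stand-in for subway_info; min() is shown for comparison.

```python
import pandas as pd

# Stand-in for subway_info: the on_station values from the sample file
subway_info = pd.DataFrame(
    {"on_station": [25, 8, 12, 35, 44, 11, 55, 9, 37, 17, 90, 95, 100]}
)

max_on_station = subway_info["on_station"].max()   # largest value
max_row = subway_info["on_station"].idxmax()       # index label of that value
min_on_station = subway_info["on_station"].min()   # smallest value

print(max_on_station, max_row, min_on_station)     # 100 12 8
```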

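(3) Sorting

Code:

# Sort by on_station (ascending by default). inplace=True sorts subway_info itself
# rather than returning a new DataFrame; the source file on disk is not changed.
subway_info.sort_values("on_station", inplace=True)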
print(subway_info["on_station"])

Output:

 1       8
 7       9
 5      11
 2      12
 9      17
 0      25
 3      35
 8      37
 4      44
 6      55
 10     90
 11     95
 12    100
 Name: on_station, dtype: int64

Code:

subway_info.sort_values("on_station", inplace=True, ascending=False)   # add ascending=False to sort in descending order
print(subway_info["on_station"])

Output:

 12    100
 11     95
 10     90
 6      55
 4      44
 8      37
 3      35
 0      25
 9      17
 2      12
 5      11
 7       9
 1       8
 Name: on_station, dtype: int64
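
For comparison, here is a sketch of the same sort without inplace=True, using a hypothetical four-row stand-in for subway_info. In that case sort_values returns a new sorted DataFrame and leaves the original as it was. The sorted result keeps the original index labels, as in the outputs above; reset_index(drop=True) renumbers the rows if that is preferred.

```python
import pandas as pd

# Hypothetical stand-in for subway_info
subway_info = pd.DataFrame({"on_station": [25, 8, 12, 35]})

# Without inplace=True a new, sorted DataFrame is returned
sorted_info = subway_info.sort_values("on_station", ascending=False)

print(sorted_info["on_station"].tolist())   # [35, 25, 12, 8]
print(sorted_info.index.tolist())           # [3, 0, 2, 1], original labels kept
print(subway_info["on_station"].tolist())   # [25, 8, 12, 35], unchanged

# Renumber the rows 0..n-1 after sorting
renumbered = sorted_info.reset_index(drop=True)
print(renumbered.index.tolist())            # [0, 1, 2, 3]
```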

(4) Handling missing data

Now delete the on_station values of the records with index 4 and 7, so that the sample file becomes:

off_date    off_time  card_id           card_type  device_num  off_station  on_date   on_time  on_station
2016-04-22  08:25:36  0000990771990514  102        22023703    0000037      20160422  075032   0000025
2016-04-22  12:32:43  0000990772079197  101        22011011    0000010      20160422  122620   0000008
2016-04-22  21:11:31  0000990772083185  101        22011315    0000013      20160422  210546   0000012
2016-04-22  18:05:24  0000990772083509  101        22023701    0000037      20160422  175633   0000035
2016-04-22  17:03:04  0000990772083471  101        22040210    0000002      20160422  163715
2016-04-22  08:07:11  0000997169251538  007        22035506    0000055      20160422  071314   0000011
2016-04-22  10:46:41  0000997169251538  007        22011216    0000012      20160422  094757   0000055
2016-04-22  18:04:25  0000997169148602  007        22023802    0000038      20160422  172720
2016-04-22  12:38:31  0000997169148602  007        22010814    0000008      20160422  115952   0000037
2016-04-22  09:36:15  0000997169237105  007        22022121    0000021      20160422  091548   0000017
2016-04-22  08:11:36  0000997169396171  007        22079406    0000094      20160422  075540   0000090
2016-04-22  19:28:03  0000997169396171  007        22079006    0000090      20160422  190423   0000095
2016-04-22  08:34:06  0000990172167480  101        29070903    0000109      20160422  080705   0000100

Code:

import pandas as pd

station = subway_info["on_station"]    # pull out the on_station column on its own
station_is_null = pd.isnull(station)   # use pd.isnull() to test whether each on_station value is missing
print(station_is_null)

Output:

 0     False
 1     False
 2     False
 3     False
 4      True                           # station_is_null is True where the on_station value is missing
 5     False
 6     False
 7      True
 8     False
 9     False
 10    False
 11    False
 12    False
 Name: on_station, dtype: bool         # the values of the new Series station_is_null are of type bool

Code:

station = subway_info["on_station"]
station_is_null = pd.isnull(station)
station_null_true = station[station_is_null]   # use station_is_null as a boolean mask on station to see which records are missing a value
print(station_null_true)
station_null_count = len(station_null_true)    # count the missing on_station values
print(station_null_count)

Output:

 4   NaN
 7   NaN                               # the on_station value is missing in the records with index 4 and 7
 Name: on_station, dtype: float64
 2                                     # two on_station values are missing in total
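
The count can also be taken straight from the boolean mask, since True counts as 1 when summed. The sketch below uses a hypothetical eight-value stand-in for the on_station column, with missing values at positions 4 and 7, and also shows notnull() for selecting the records that do have a value.

```python
import pandas as pd

# Hypothetical stand-in for subway_info["on_station"], missing at index 4 and 7
nan = float("nan")
station = pd.Series([25, 8, 12, 35, nan, 11, 55, nan], name="on_station")

station_is_null = station.isnull()                    # same result as pd.isnull(station)
station_null_count = station_is_null.sum()            # True counts as 1
missing_rows = station[station_is_null].index.tolist()
present = station[station.notnull()]                  # records that have a value

print(station_null_count)   # 2
print(missing_rows)         # [4, 7]
print(len(present))         # 6
```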