python pandas模块计算时间差(排除非工作日)
所需调用以下包
import numpy as np
import pandas as pd
import csv
import datetime
from dateutil.parser import parse
1.读取csv文件
shixiao_data = pd.read_csv(’/soft/data/case_data/shixiao_test.csv’,encoding=‘gbk’)
2.创建非工作日日期集合
nonworkday = [‘2020-01-01’, ‘2020-01-04’, ‘2020-01-05’, ‘2020-01-11’, ‘2020-01-12’, ‘2020-01-18’, ‘2020-01-24’, ‘2020-01-25’, ‘2020-01-26’, ‘2020-01-27’, ‘2020-01-28’, ‘2020-01-29’, ‘2020-01-30’, ‘2020-02-02’, ‘2020-02-08’, ‘2020-02-09’, ‘2020-02-15’, ‘2020-02-16’, ‘2020-02-22’, ‘2020-02-23’, ‘2020-02-29’, ‘2020-03-01’, ‘2020-03-07’, ‘2020-03-08’, ‘2020-03-14’, ‘2020-03-15’, ‘2020-03-21’, ‘2020-03-22’, ‘2020-03-28’, ‘2020-03-29’, ‘2020-04-04’, ‘2020-04-05’, ‘2020-04-06’, ‘2020-04-11’, ‘2020-04-12’, ‘2020-04-18’, ‘2020-04-19’, ‘2020-04-25’, ‘2020-05-01’]
3.计算某一段时间非工作日个数自定义函数:
输入t1,t2,对非工作日集合进行遍历判断
def nonworkday_cnt(created_time, last_second_audit_time):
nonworkday_set = []
create_day = created_time[:10]
last_second_audit_day = last_second_audit_time[:10]
for day in nonworkday:
if ((day >= create_day) & (day <= last_second_audit_day)):
nonworkday_set.append(day)
return(len(nonworkday_set))
#到这一步可以注册udf函数,在spark中调用(spark.udf.register(‘nonworkday_cnt’,nonworkday_cnt,LongType()))
SELECT *, nonworkday_cnt(t1,t2) AS nonworkday_cnt
FROM ***
此处今天介绍纯python 处理数据
4.pandas读取csv文件
(1) lambda函数调用计算某一段时间非工作日个数自定义函数:
shixiao_data['nonworkday_cnt'] = shixiao_data['level_0'].apply(lambda x: nonworkday_cnt(shixiao_data.created_time[x], shixiao_data.last_second_audit_time[x]))
(2)计算t1,t2时间差,运用total_seconds函数实现:
shixiao_data['total_timediff'] = shixiao_data['level_0'].apply(lambda x: (parse(shixiao_data.last_second_audit_time[x]) - parse(shixiao_data.created_time[x])).total_seconds())
print(shixiao_data)
(3)排除非工作日时间差 =(总时间差 - 非工作日天数*
24*
60 )
shixiao_data['real_timediff'] = shixiao_data['level_0'].apply(lambda x: (shixiao_data.total_timediff[x] - shixiao_data.nonworkday_cnt[x]*24*60*60)/60/60)
print(shixiao_data)
以某一列为维度计算平均值:
shixiao_data["real_timediff_avg"] = shixiao_data['real_timediff'].mean()
print('总时效')
print(shixiao_data.real_timediff_avg[1])
That’s all,thanks!!!