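2023 WeChat Big Data Challenge: Competition Recap
Summary
Competition URL: https://www.heywhale.com/org/2023bdc/competition/area/647d4732f1a027ece3126fef/content
Task: Fault detection for IT systems based on multi-source data (logs, metrics, traces), framed as a multi-dimensional, multi-granularity, multi-label machine learning classification task. Counting NO_FAULT, there are 9 fault labels in the preliminary round and 24 in the semifinal. The evaluation metric is the mean AUC of the predicted probabilities over all labels.
Data processing: Fully decompressed, the preliminary dataset is about 300 GB and the semifinal dataset about 1.7 TB, so the data volume puts real demands on hardware. With multiprocessing and multithreading the processing time dropped substantially, and my local machine was just barely able to keep up with the competition.
Feature engineering: In the preliminary round the metric data was incomplete and contained junk records that did not improve training results, so features were built only from the log and trace data. In the semifinal the metric data became the key feature source, and the champion team used only the metric data. This data is also complex and very large: a single pass reading it took about 8 hours on my machine (even with multiprocessing and multithreading), and building features on top of it took longer still. Feature selection and construction therefore largely determined the final ranking.
Feature selection: Dropped features whose missing ratio exceeds 95% and features with only one unique value (nunique equal to 1).
Model selection: Model 1 is a LightGBM + one-vs-rest (OvR) multi-label classifier. Model 2 trains a separate LightGBM binary classifier for each label.
Score: SAUC of 84+ in the preliminary round and 94+ in the semifinal.
Final result: 47th of 2176 on the preliminary private leaderboard and 41st of 80 in the semifinal.
Conclusion: I reached the semifinal again and learned a lot this time, though I was still only along for the ride. I hope to place higher in next year's competition.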
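Task
Data processing
In this step, the scattered label files are first merged into a single CSV file. Some data-level statistics are also collected in preparation for feature engineering, in particular all enumerated values of the key fields, because some values may appear only in the training set and never in the test set. These results are used during feature engineering to align the training and test data before building features, which reduces model bias.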
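The alignment idea described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing code: the helpers `collect_values` and `align_rows` and the toy rows are hypothetical, and plain Python is used to keep the sketch dependency-free. Only the idea of intersecting the train and test value sets (as `train_data_map[col] & test_data_map[col]` does in the feature code) comes from the post.

```python
def collect_values(rows, columns):
    """Map each column name to the set of distinct values seen in rows."""
    value_map = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col) is not None:
                value_map[col].add(row[col])
    return value_map


def align_rows(rows, column, train_map, test_map):
    """Keep only rows whose value in `column` occurs in both train and test."""
    shared = train_map[column] & test_map[column]
    return [row for row in rows if row.get(column) in shared]


# Toy data (hypothetical): "cart" appears only in the training set.
train_rows = [{"service": "order"}, {"service": "cart"}, {"service": "pay"}]
test_rows = [{"service": "order"}, {"service": "pay"}, {"service": "pay"}]

train_data_map = collect_values(train_rows, ["service"])
test_data_map = collect_values(test_rows, ["service"])
aligned_train = align_rows(train_rows, "service", train_data_map, test_data_map)
```

Rows with the train-only value are dropped, so every per-category feature built afterwards exists for both training and test samples.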
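Feature engineering
- File size, file-missing indicator, row count, and row count after deduplication;
- For categorical fields, nunique features, plus statistics such as mean, std, ptp, skew, and kurt of the numeric features grouped by category;
- For message and endpoint_name, keywords are extracted and treated as a categorical feature, followed by nunique features and the same grouped statistics (mean, std, ptp, skew, kurt, and so on);
- First-order differences of timestamp and time_sub, used as numeric values and aggregated per categorical field;
- First-order differences of timestamp and time_sub computed within each category group, then aggregated again per group.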
import os

import numpy as np
import pandas as pd

# train_data_map / test_data_map are module-level dicts built in the data
# processing step: column name -> set of values seen in that split.


def processing_feature(now_path, user_id):
    try:
        log, trace, metric = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
        log_file = f"{now_path}/log/{user_id}_log.csv"
        trace_file = f"{now_path}/trace/{user_id}_trace.csv"
        metric_file = f"{now_path}/metric/{user_id}_metric.csv"
        if os.path.exists(log_file):
            log = pd.read_csv(log_file).sort_values(by=['timestamp'])
            log_start_num = log.shape[0]
            # keep only services that occur in both train and test
            col = 'service'
            log = log[log[col].isin(train_data_map[col] & test_data_map[col])].reset_index(drop=True)
        if os.path.exists(trace_file):
            trace = pd.read_csv(trace_file).sort_values(by=['timestamp'])
            trace_start_num = trace.shape[0]
            col = 'service_name'
            trace = trace[trace[col].isin(train_data_map[col] & test_data_map[col])]
            col = 'host_ip'
            trace = trace[trace[col].isin(train_data_map[col] & test_data_map[col])].reset_index(drop=True)
        if os.path.exists(metric_file):
            # metric content is not read here; only the file's presence is used
            metric = [1]
            # metric = pd.read_csv(metric_file).sort_values(by=['timestamp'])
            # col = 'tags'
            # metric = metric[metric[col].isin(train_data_map[col]&test_data_map[col])].reset_index(drop=True)
        feats = {"id": user_id}
        # log
        if len(log) > 0:
            feats['log_sub_num'] = len(log) - log_start_num
            feats['log_length'] = len(log)
            feats['log_file_size'] = os.stat(log_file).st_size
            for col in ['message', 'service']:
                feats[f'log_{col}_nunique'] = log[col].fillna('').astype(str).nunique()
            # keyword flags extracted from the message text
            text_list = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'None']
            log['extract_name'] = 'None'
            for text in text_list:
                if text == 'None':
                    log[text] = ~(log['message'].astype(str).str.contains('|'.join(text_list[:-1]), na=False))
                else:
                    # build the NaN-safe flag first, then reuse it as the mask
                    log[text] = log['message'].astype(str).str.contains(text, na=False)
                    log.loc[log[text], 'extract_name'] = text
                for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                    feats[f'log_message_{text}_{stats_func}'] = log[text].apply(stats_func)
            log['extract_name_timestamp_diff1'] = log.groupby('extract_name')['timestamp'].diff(1)
            log['service_timestamp_diff1'] = log.groupby('service')['timestamp'].diff(1)
            log['message_len'] = log['message'].fillna("").astype(str).str.len()
            # per-category statistics
            for col_name in ['service', 'extract_name']:
                for i, tmp in log.groupby(col_name):
                    feats[f'log_{col_name}_{i}_num'] = len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'log_{col_name}_{i}_timestamp_diff1_{stats_func}'] = tmp[f'{col_name}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'log_{col_name}_{i}_message_len_{stats_func}'] = tmp['message_len'].apply(stats_func)
                    if col_name == 'extract_name':
                        continue
                    for text in text_list:
                        for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                            feats[f'log_{col_name}_{i}_{text}_{stats_func}'] = tmp[text].apply(stats_func)
            # statistics per pair of categories
            log['service&extract_timestamp_diff1'] = log.groupby(['service', 'extract_name'])['timestamp'].diff(1)
            for g_col, (n1, n2) in zip(
                [['service', 'extract_name']],
                [['service', 'extract']]
            ):
                for (col1, col2), tmp in log.groupby(g_col):
                    feats[f'log_{n1}&{n2}_{col1}_{col2}_num'] = len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'log_{n1}&{n2}_{col1}_{col2}_timestamp_diff1_{stats_func}'] = tmp[f'{n1}&{n2}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'log_{n1}&{n2}_{col1}_{col2}_message_len_{stats_func}'] = tmp['message_len'].apply(stats_func)
        # trace
        if len(trace) > 0:
            feats['trace_sub_num'] = len(trace) - trace_start_num
            feats['trace_length'] = len(trace)
            feats['trace_file_size'] = os.stat(trace_file).st_size
            for stats_func in ['mean', 'std']:
                feats[f"trace_status_code_{stats_func}"] = trace['status_code'].apply(stats_func)
            for stats_func in ['nunique']:
                for i in ['host_ip', 'service_name', 'endpoint_name', 'trace_id', 'span_id', 'start_time', 'end_time', 'status_code']:
                    feats[f"trace_{i}_{stats_func}"] = trace[i].apply(stats_func)
            # keyword flags extracted from the endpoint name
            text_list = ['GET', 'POST', 'DELETE', 'Mysql', 'None']
            trace['extract_name'] = 'None'
            for text in text_list:
                if text == 'None':
                    trace[text] = ~(trace['endpoint_name'].astype(str).str.contains('|'.join(text_list[:-1]), na=False))
                else:
                    # build the NaN-safe flag first, then reuse it as the mask
                    trace[text] = trace['endpoint_name'].astype(str).str.contains(text, na=False)
                    trace.loc[trace[text], 'extract_name'] = text
                for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                    feats[f'trace_endpoint_name_{text}_{stats_func}'] = trace[text].apply(stats_func)
            trace['extract_name_timestamp_diff1'] = trace.groupby('extract_name')['timestamp'].diff(1)
            trace['status_code_timestamp_diff1'] = trace.groupby('status_code')['timestamp'].diff(1)
            trace['host_ip_timestamp_diff1'] = trace.groupby('host_ip')['timestamp'].diff(1)
            trace['service_name_timestamp_diff1'] = trace.groupby('service_name')['timestamp'].diff(1)
            trace['endpoint_name_timestamp_diff1'] = trace.groupby('endpoint_name')['timestamp'].diff(1)
            # span duration
            trace['time_sub'] = trace['end_time'].clip(lower=0) - trace['start_time'].clip(lower=0)
            # per-category statistics
            for col_name in ['service_name', 'host_ip', 'extract_name', 'status_code', 'endpoint_name']:
                for i, tmp in trace.groupby(col_name):
                    if (col_name == 'endpoint_name') and (i not in (train_data_map[col_name] & test_data_map[col_name])):
                        continue
                    feats[f'trace_{col_name}_{i}_num'] = len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'trace_{col_name}_{i}_timestamp_diff1_{stats_func}'] = tmp[f'{col_name}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'trace_{col_name}_{i}_time_sub_{stats_func}'] = tmp['time_sub'].apply(stats_func)
                    if col_name in ['extract_name', 'endpoint_name']:
                        continue
                    for text in text_list:
                        for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                            feats[f'trace_{col_name}_{i}_{text}_{stats_func}'] = tmp[text].apply(stats_func)
            # statistics per pair of categories
            trace['service&extract_timestamp_diff1'] = trace.groupby(['service_name', 'extract_name'])['timestamp'].diff(1)
            trace['service&ip_timestamp_diff1'] = trace.groupby(['service_name', 'host_ip'])['timestamp'].diff(1)
            trace['ip&extract_timestamp_diff1'] = trace.groupby(['host_ip', 'extract_name'])['timestamp'].diff(1)
            trace['endpoint&ip_timestamp_diff1'] = trace.groupby(['endpoint_name', 'host_ip'])['timestamp'].diff(1)
            trace['endpoint&service_timestamp_diff1'] = trace.groupby(['endpoint_name', 'service_name'])['timestamp'].diff(1)
            for g_col, (n1, n2) in zip(
                [['service_name', 'extract_name'], ['service_name', 'host_ip'], ['host_ip', 'extract_name'], ['endpoint_name', 'host_ip'], ['endpoint_name', 'service_name']],
                [['service', 'extract'], ['service', 'ip'], ['ip', 'extract'], ['endpoint', 'ip'], ['endpoint', 'service']]
            ):
                for (col1, col2), tmp in trace.groupby(g_col):
                    # skip pairs that do not occur in both train and test
                    if (n1 == 'service' and n2 == 'ip') and (col1 not in (train_data_map['host_ip_to_service_name'][col2] & test_data_map['host_ip_to_service_name'][col2])):
                        continue
                    if (n1 == 'endpoint' and n2 == 'ip') and (col1 not in (train_data_map['host_ip_to_endpoint_name'][col2] & test_data_map['host_ip_to_endpoint_name'][col2])):
                        continue
                    if (n1 == 'endpoint' and n2 == 'service') and (col1 not in (train_data_map['service_name_to_endpoint_name'][col2] & test_data_map['service_name_to_endpoint_name'][col2])):
                        continue
                    feats[f'trace_{n1}&{n2}_{col1}_{col2}_num'] = len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'trace_{n1}&{n2}_{col1}_{col2}_time_sub_{stats_func}'] = tmp['time_sub'].apply(stats_func)
                        feats[f'trace_{n1}&{n2}_{col1}_{col2}_timestamp_diff1_{stats_func}'] = tmp[f'{n1}&{n2}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
        # metric (the detailed metric features below are disabled)
        if len(metric) > 0:
            # use_cols=[
            #     'metric_name','service_name',
            #     'container','instance','job','kubernetes_io_hostname',
            #     'namespace','cpu','interface','kubernetes_pod_name','mode','minor','broadcast','duplex',
            #     'operstate'
            # ]
            # def process(x):
            #     item=eval(x['tags'])
            #     output=[]
            #     for col in use_cols:
            #         tmp=item[col] if col in item else '无'
            #         output.append(tmp)
            #     return output
            # metric[use_cols]=metric.apply(process, result_type='expand', axis=1)
            feats['metric_has_file'] = 1
            # feats['metric_length'] = len(metric)
            # feats['metric_file_size'] = os.stat(metric_file).st_size
            # for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #     feats[f'metric_value_{stats_func}'] = metric['value'].apply(stats_func)
            # for col in ['timestamp']+use_cols:
            #     for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #         feats[f'metric_{col}_value_sum_{stats_func}'] = metric[[col,'value']].groupby(col)['value'].sum().dropna().apply(stats_func)
            #         feats[f'metric_{col}_value_mean_{stats_func}'] = metric[[col,'value']].groupby(col)['value'].mean().dropna().apply(stats_func)
            #         tmp=metric[[col,'value']].groupby(col)['value'].std().dropna()
            #         feats[f'metric_{col}_value_std_{stats_func}'] = tmp.apply(stats_func) if len(tmp) > 0 else np.nan
            #         tmp=metric[[col,'value']].groupby(col)['value'].skew().dropna()
            #         feats[f'metric_{col}_value_skew_{stats_func}'] = tmp.apply(stats_func) if len(tmp) > 0 else np.nan
            # for col in ['timestamp']+use_cols:
            #     for i,tmp in metric[[col,'value']].groupby(col):
            #         if (col == 'kubernetes_pod_name' and i == 'node-exporter-52ljc'):
            #             continue
            #         if (col == 'container' and i == 'init-mysql'):
            #             continue
            #         feats[f'metric_{col}_{i}_num']=len(tmp)
            #         for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #             feats[f'metric_{col}_{i}_value_{stats_func}']=tmp['value'].apply(stats_func)
        else:
            feats['metric_has_file'] = 0
    except Exception as e:
        # report which sample failed, then re-raise
        print(f'{user_id=} {e}')
        raise
    return feats
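The summary mentions that multiprocessing and multithreading cut processing time considerably, but the post does not show how the per-sample function was dispatched. Below is a minimal sketch of one way to do it with the standard library, assuming one call per sample id. `extract_features` is a hypothetical stand-in for `processing_feature`, and `run_parallel` is an illustrative helper, not the author's code.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(sample_id):
    """Hypothetical stand-in for a per-sample feature function."""
    return {"id": sample_id, "id_length": len(sample_id)}


def run_parallel(func, sample_ids, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Apply func to every sample id concurrently, keeping the input order."""
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as sample_ids
        return list(executor.map(func, sample_ids))


feature_rows = run_parallel(extract_features, ["a1", "b22", "c333"])
```

For CPU-heavy pandas work, passing `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) is the usual choice. In that case the function must be picklable, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. A fixed first argument such as `now_path` could be bound with `functools.partial`.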
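Feature selection
Features with a missing ratio above 95% and features with only one unique value were excluded.
# columns to exclude from the model inputs
filter_cols = ['id', 'label']
for i in feature.columns:
    if feature[i].isna().sum() / len(feature) > 0.95:
        filter_cols.append(i)
    elif feature[i].nunique() == 1:
        filter_cols.append(i)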
Model selection
- Model 1 is a LightGBM + OvR multi-label classifier.
import joblib
import lightgbm as lgb
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.multiclass import OneVsRestClassifier

# args is an attribute-style config dict; sScore is the competition scoring
# helper (per-label AUC). Neither is defined in this excerpt.


def baseline(args, train_X, test_X, y):
    kf = MultilabelStratifiedKFold(n_splits=args.n_splits, random_state=args.seed, shuffle=True)
    ovr_oof = np.zeros((len(train_X), args.num_classes))
    ovr_preds = np.zeros((len(test_X), args.num_classes))
    for fold, (train_index, valid_index) in enumerate(kf.split(train_X, y)):
        X_train, X_valid = train_X[train_index], train_X[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        clf = OneVsRestClassifier(lgb.LGBMClassifier(**dict(args)))
        clf.fit(X_train, y_train)
        joblib.dump(clf, f'{args.model_path}/{args.model_type}_fold{fold}.joblib')
        # out-of-fold predictions, plus the fold-averaged test prediction
        ovr_oof[valid_index] = clf.predict_proba(X_valid)
        ovr_preds += clf.predict_proba(test_X) / args.n_splits
        score = sScore(y_valid, ovr_oof[valid_index])
        print(f"fold{fold+1}: Score = {np.mean(score)}")
    return ovr_preds, ovr_oof
- Model 2 trains a separate LightGBM binary classifier for each label.
from copy import deepcopy

import joblib
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def tree_train_binary(args, feature_names, train_scaler_X, test_scaler_X, Y, label_i):
    # one AUC per fold (sized by n_splits rather than a hard-coded 5)
    score = np.zeros(args.n_splits)
    oof = np.zeros(len(train_scaler_X))
    preds = np.zeros(len(test_scaler_X))
    feature_importance_df = pd.DataFrame()
    kf = StratifiedKFold(n_splits=args.n_splits, shuffle=True, random_state=args.seed)
    y = deepcopy(Y)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train_scaler_X, y)):
        tra = lgb.Dataset(train_scaler_X[train_idx], y[train_idx])
        val = lgb.Dataset(train_scaler_X[val_idx], y[val_idx])
        model = lgb.train(dict(args), tra, valid_sets=[val], num_boost_round=args.num_boost_round,
                          callbacks=[lgb.early_stopping(args.early_stopping), lgb.log_evaluation(args.num_boost_round)])
        joblib.dump(model, f'{args.model_path}/{args.model_type}_label{label_i}_fold{fold}.joblib')
        # requires 'auc' among the metrics in the training params
        auc = model.best_score['valid_0']['auc']
        score[fold] = auc
        print(f'{fold=} {auc=}')
        oof[val_idx] = model.predict(train_scaler_X[val_idx], num_iteration=model.best_iteration)
        preds += model.predict(test_scaler_X, num_iteration=model.best_iteration) / args.n_splits
        # record the feature importance of this fold
        fold_importance_df = pd.DataFrame()
        fold_importance_df["Feature"] = feature_names
        fold_importance_df["importance"] = model.feature_importance()
        fold_importance_df["fold"] = fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df])
    print('auc mean', np.mean(score))
    return preds, oof, feature_importance_df
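- Model blending: after submitting each model's predictions separately, the two sets of predictions were blended with weights of 0.65:0.35, chosen according to their online scores, which further improved SAUC.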
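The weighted blend can be illustrated with a short sketch. The post does not say which model received the 0.65 weight, so assigning it to the first argument here is an assumption. The `blend` helper and the toy probabilities are hypothetical, and plain lists are used instead of NumPy arrays to keep the example dependency-free.

```python
def blend(preds_a, preds_b, weight_a=0.65):
    """Weighted average of two equally shaped probability matrices."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction matrices must have the same number of rows")
    weight_b = 1.0 - weight_a
    return [
        [weight_a * a + weight_b * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(preds_a, preds_b)
    ]


# Toy per-label probabilities (hypothetical): 2 samples x 2 labels.
ovr_preds = [[0.9, 0.1], [0.2, 0.8]]
binary_preds = [[0.7, 0.3], [0.4, 0.6]]
blended = blend(ovr_preds, binary_preds)
```

With NumPy arrays such as the `ovr_preds` returned by `baseline`, the same blend reduces to `0.65 * preds_a + 0.35 * preds_b`.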
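Score
SAUC was 84+ in the preliminary round and 94+ in the semifinal. The final placement was 47th of 2176 on the preliminary private leaderboard and 41st of 80 in the semifinal.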
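The summary describes the metric as the mean AUC of the predicted probabilities over all labels. A minimal sketch of that computation follows. It is not the organizers' official scorer or the `sScore` helper used in the training code: the function names are hypothetical, and skipping labels that contain only one class is an assumption.

```python
def binary_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when only one class is present.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_label_auc(y_true, y_prob):
    """Average the per-label AUC over all labels that have both classes."""
    scores = []
    for j in range(len(y_true[0])):
        auc = binary_auc([row[j] for row in y_true], [row[j] for row in y_prob])
        if auc is not None:
            scores.append(auc)
    if not scores:
        raise ValueError("no label contains both classes")
    return sum(scores) / len(scores)


# Toy example (hypothetical): 4 samples, 2 labels.
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_prob = [[0.9, 0.2], [0.1, 0.7], [0.4, 0.6], [0.5, 0.8]]
sauc = mean_label_auc(y_true, y_prob)
```

The pairwise comparison is quadratic in the number of samples and is only meant to make the definition concrete. On real data, `sklearn.metrics.roc_auc_score` applied per label is the usual choice.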
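Recommended reading
A write-up from a top-ranked team: modeling with metric data only
That is all for this post. If you would like the full source code, follow the WeChat official account 《Python王者之路》 and reply with the keyword 20231010 to get it.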
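Closing thoughts
Watching the live stream of the final and hearing the organizers' vision for the competition, I was genuinely impressed. Fault detection in AI-driven operations is not yet mature in industry, and as machines become increasingly interconnected, an Internet of Everything may arrive in the not-too-distant future. The competition is therefore significant: it showed me the latest research and developments in this field, as well as some of the challenges it currently faces.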