2023 WeChat Big Data Challenge: Competition Summary

七里香还是稻香

Published 2023-10-10 13:53:52

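Summary

Competition page: https://www.heywhale.com/org/2023bdc/competition/area/647d4732f1a027ece3126fef/content

Task: in short, fault detection for IT systems from multi-source data (logs, metrics, traces), framed as a multi-dimensional, multi-granularity, multi-label classification problem. Counting NO_FAULT, there are 9 fault labels in the preliminary round and 24 in the semi-final. The final score is the AUC of the predicted probabilities, averaged over all labels.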

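Data processing: the preliminary dataset is about 300 GB once fully decompressed, and the semi-final dataset is about 1.7 TB, so the data volume alone puts real demands on hardware. With multiprocessing and multithreading the processing time dropped substantially, and my local machine was just barely enough to take part.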

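Feature engineering: in the preliminary round the metric data was incomplete and contained junk records that did not improve training, so features were built only from the log and trace data. In the semi-final the metric data became the key feature source, and the champion team used only the metric data. Processing it is complex and heavy, though: a single read pass took about 8 hours on my machine (even with multiprocessing and multithreading), and building features on top took longer still. Feature selection and construction therefore largely decided the final ranking.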

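Feature selection: dropped features with more than 95% missing values, as well as features with only one unique value (nunique of 1).

Model selection: model 1 is a LightGBM + OvR (one-vs-rest) multi-label classifier; model 2 trains a separate LightGBM binary classifier for each label.

Score: SAUC of 84+ in the preliminary round and 94+ in the semi-final.

Final result: 47th of 2176 on the preliminary private leaderboard, 41st of 80 in the semi-final.

Conclusion: I made it to the semi-final again and learned a lot this time, though I was still only an also-ran. Hopefully next year's ranking will be a step higher, haha~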

Task
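
The scoring rule described in the summary (AUC averaged over all labels) can be illustrated with a small dependency-free sketch. It assumes a plain unweighted mean of per-label AUCs; the organizers' exact SAUC definition is not given in this post, and the function names below are hypothetical.

```python
def binary_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_label_auc(y_true, y_prob):
    """Average the per-label AUC over the columns of two row-major matrices."""
    n_labels = len(y_true[0])
    aucs = [
        binary_auc([row[j] for row in y_true], [row[j] for row in y_prob])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels


# Toy example: 4 samples, 2 labels (assumed data, for illustration only).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_prob = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
score = mean_label_auc(y_true, y_prob)  # label 0 -> 1.0, label 1 -> 0.75
```

The pairwise formulation is quadratic in the number of samples, so it only suits small checks; a library routine would be the practical choice on the real data.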

Data processing

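In this step, the scattered label files are first merged into a single CSV file. Some data-level statistics are also collected in preparation for feature engineering: all enumerated values of the key fields are identified, because some values may appear only in the training set and never in the test set. The result is reused in feature engineering to align the training and test data before building features, which reduces model bias.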
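
A minimal sketch of that alignment idea, assuming rows are plain dictionaries: collect the set of values seen per field in each split, then keep only rows whose value appears in both. The names train_data_map and test_data_map match the feature code later in this post, but how the author actually built them is not shown, so the helpers collect_values and keep_shared are hypothetical.

```python
from collections import defaultdict


def collect_values(rows, columns):
    """Map each column name to the set of distinct values seen in rows."""
    seen = defaultdict(set)
    for row in rows:
        for col in columns:
            if col in row:
                seen[col].add(row[col])
    return dict(seen)


def keep_shared(rows, col, train_map, test_map):
    """Keep only rows whose value for col occurs in both train and test."""
    shared = train_map.get(col, set()) & test_map.get(col, set())
    return [row for row in rows if row.get(col) in shared]


# Toy data (assumed): 'legacy' exists only in train, 'beta' only in test.
train_rows = [{"service": "cart"}, {"service": "pay"}, {"service": "legacy"}]
test_rows = [{"service": "cart"}, {"service": "pay"}, {"service": "beta"}]

train_data_map = collect_values(train_rows, ["service"])
test_data_map = collect_values(test_rows, ["service"])
aligned_train = keep_shared(train_rows, "service", train_data_map, test_data_map)
```

Filtering on the intersection means no feature column is created for a category that one split can never populate.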

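Feature engineering

File-size features, file-missing indicator features, row-count features, and deduplicated row-count features;
For categorical fields, nunique features, plus statistics such as mean, std, ptp, skew, and kurt of the numeric fields grouped by category;
For message and endpoint_name, keywords are extracted and used as a categorical field, from which nunique features are taken, plus the same grouped statistics (mean, std, ptp, skew, kurt) of the numeric fields;
For timestamp and time_sub (end_time minus start_time), first-order differences are taken and used as numeric values, then aggregated per categorical field;
The first-order differences of timestamp and time_sub are also computed within each category group, and the statistics are taken per group again.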
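
To make the grouped first-order-difference items concrete, here is a dependency-free sketch under simplifying assumptions: records are dictionaries, only mean, std, and ptp are computed, and the function name grouped_diff_stats is hypothetical. The author's implementation that follows uses pandas groupby and covers more statistics.

```python
from collections import defaultdict
from statistics import mean, stdev


def grouped_diff_stats(records, key, value="timestamp"):
    """Per category: row count plus mean/std/ptp of first-order differences."""
    groups = defaultdict(list)
    # Sort by the value first so differences are taken between consecutive events.
    for rec in sorted(records, key=lambda r: r[value]):
        groups[rec[key]].append(rec[value])

    nan = float("nan")
    feats = {}
    for name, values in groups.items():
        diffs = [b - a for a, b in zip(values, values[1:])]
        feats[f"{key}_{name}_num"] = len(values)
        feats[f"{key}_{name}_diff1_mean"] = mean(diffs) if diffs else nan
        # Sample standard deviation (ddof=1), which is also the pandas default.
        feats[f"{key}_{name}_diff1_std"] = stdev(diffs) if len(diffs) > 1 else nan
        feats[f"{key}_{name}_diff1_ptp"] = max(diffs) - min(diffs) if diffs else nan
    return feats


# Toy log rows (assumed): 'cart' has three events, 'pay' only one.
logs = [
    {"timestamp": 10, "service": "cart"},
    {"timestamp": 13, "service": "cart"},
    {"timestamp": 11, "service": "pay"},
    {"timestamp": 20, "service": "cart"},
]
feats = grouped_diff_stats(logs, key="service")
```

Groups with a single row have no differences, so their statistics are left as NaN rather than zero, which lets the tree model treat them as missing.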

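import os

import numpy as np
import pandas as pd

# train_data_map / test_data_map hold the per-field value sets collected in the
# data processing step; this function reads them as module-level globals.
def processing_feature(now_path,user_id):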
    try:
        log, trace, metric = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
        log_file = f"{now_path}/log/{user_id}_log.csv"
        trace_file = f"{now_path}/trace/{user_id}_trace.csv"
        metric_file = f"{now_path}/metric/{user_id}_metric.csv"
        if os.path.exists(log_file):
            log = pd.read_csv(log_file).sort_values(by=['timestamp'])
            log_start_num = log.shape[0]
            col = 'service'
            log = log[log[col].isin(train_data_map[col]&test_data_map[col])].reset_index(drop=True)
        
        if os.path.exists(trace_file):
            trace = pd.read_csv(trace_file).sort_values(by=['timestamp'])
            trace_start_num = trace.shape[0]
            col = 'service_name'
            trace = trace[trace[col].isin(train_data_map[col]&test_data_map[col])]
            col = 'host_ip'
            trace = trace[trace[col].isin(train_data_map[col]&test_data_map[col])].reset_index(drop=True)
            
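        if os.path.exists(metric_file):
            # Metric features are disabled in this version; the non-empty placeholder
            # only records that the metric file exists (see metric_has_file below).
            metric = [1]
            # metric = pd.read_csv(metric_file).sort_values(by=['timestamp'])
            # col = 'tags'
            # metric = metric[metric[col].isin(train_data_map[col]&test_data_map[col])].reset_index(drop=True)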
        
        feats = {"id" : user_id}
        # log
        if len(log) > 0:
            feats['log_sub_num'] = len(log) - log_start_num
            feats['log_length'] = len(log)
            feats['log_file_size'] = os.stat(log_file).st_size

            for col in ['message','service']:
                feats[f'log_{col}_nunique'] = log[col].fillna('').astype(str).nunique()
            
            # Flag each row by the log-level keyword found in its message;
            # 'None' marks rows that contain none of the keywords.
            text_list=['DEBUG','INFO','WARNING','ERROR','None']
            log['extract_name']='None'
            # Normalize once so missing messages cannot yield NaN masks in .loc below.
            message = log['message'].fillna('').astype(str)
            for text in text_list:
                if text == 'None':
                    log[text] = ~message.str.contains('|'.join(text_list[:-1]))
                else:
                    log[text] = message.str.contains(text)
                    log.loc[log[text],'extract_name']=text

                for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                    feats[f'log_message_{text}_{stats_func}'] = log[text].apply(stats_func)

            log['extract_name_timestamp_diff1'] = log.groupby('extract_name')['timestamp'].diff(1)
            log['service_timestamp_diff1'] = log.groupby('service')['timestamp'].diff(1)
            log['message_len'] = log['message'].fillna("").astype(str).str.len()
            for col_name in ['service','extract_name']:
                for i,tmp in log.groupby(col_name):
                    feats[f'log_{col_name}_{i}_num']=len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'log_{col_name}_{i}_timestamp_diff1_{stats_func}']=tmp[f'{col_name}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'log_{col_name}_{i}_message_len_{stats_func}']=tmp['message_len'].apply(stats_func)
                    if col_name == 'extract_name':
                        continue
                    for text in text_list:
                        for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                            feats[f'log_{col_name}_{i}_{text}_{stats_func}'] = tmp[text].apply(stats_func)

            log['service&extract_timestamp_diff1'] = log.groupby(['service','extract_name'])['timestamp'].diff(1)
            for g_col,(n1,n2) in zip(
                        [['service','extract_name']],
                        [['service','extract']]
                    ):
                for (col1,col2),tmp in log.groupby(g_col):
                    feats[f'log_{n1}&{n2}_{col1}_{col2}_num']=len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'log_{n1}&{n2}_{col1}_{col2}_timestamp_diff1_{stats_func}']=tmp[f'{n1}&{n2}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'log_{n1}&{n2}_{col1}_{col2}_message_len_{stats_func}']=tmp['message_len'].apply(stats_func)
            
        # trace
        if len(trace) > 0:
            feats['trace_sub_num'] = len(trace) - trace_start_num
            feats['trace_length'] = len(trace)
            feats['trace_file_size'] = os.stat(trace_file).st_size

            for stats_func in ['mean','std']:
                feats[f"trace_status_code_{stats_func}"] = trace['status_code'].apply(stats_func)

            for stats_func in ['nunique']:
                for i in ['host_ip', 'service_name', 'endpoint_name', 'trace_id', 'span_id', 'start_time', 'end_time','status_code']:
                    feats[f"trace_{i}_{stats_func}"] = trace[i].apply(stats_func)

            # Flag each span by the request keyword found in its endpoint name;
            # 'None' marks spans that contain none of the keywords.
            text_list=['GET','POST','DELETE','Mysql','None']
            trace['extract_name']='None'
            # Normalize once so missing endpoint names cannot yield NaN masks in .loc below.
            endpoint = trace['endpoint_name'].fillna('').astype(str)
            for text in text_list:
                if text == 'None':
                    trace[text] = ~endpoint.str.contains('|'.join(text_list[:-1]))
                else:
                    trace[text] = endpoint.str.contains(text)
                    trace.loc[trace[text],'extract_name']=text

                for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                    feats[f'trace_endpoint_name_{text}_{stats_func}'] = trace[text].apply(stats_func)

            trace['extract_name_timestamp_diff1'] = trace.groupby('extract_name')['timestamp'].diff(1)
            trace['status_code_timestamp_diff1'] = trace.groupby('status_code')['timestamp'].diff(1)
            trace['host_ip_timestamp_diff1'] = trace.groupby('host_ip')['timestamp'].diff(1)
            trace['service_name_timestamp_diff1'] = trace.groupby('service_name')['timestamp'].diff(1)
            trace['endpoint_name_timestamp_diff1'] = trace.groupby('endpoint_name')['timestamp'].diff(1)
            trace['time_sub']=trace['end_time'].clip(lower=0)-trace['start_time'].clip(lower=0)
            for col_name in ['service_name','host_ip','extract_name','status_code','endpoint_name']:
                for i,tmp in trace.groupby(col_name):
                    if (col_name == 'endpoint_name') and (i not in train_data_map[col_name] & test_data_map[col_name]):
                        continue
                    
                    feats[f'trace_{col_name}_{i}_num']=len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'trace_{col_name}_{i}_timestamp_diff1_{stats_func}']=tmp[f'{col_name}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan
                        feats[f'trace_{col_name}_{i}_time_sub_{stats_func}']=tmp['time_sub'].apply(stats_func)
                    
                    if col_name in ['extract_name','endpoint_name']:
                        continue
                    for text in text_list:
                        for stats_func in ['sum', 'mean', 'std', 'skew', 'kurt']:
                            feats[f'trace_{col_name}_{i}_{text}_{stats_func}'] = tmp[text].apply(stats_func)

            trace['service&extract_timestamp_diff1'] = trace.groupby(['service_name','extract_name'])['timestamp'].diff(1)
            trace['service&ip_timestamp_diff1'] = trace.groupby(['service_name','host_ip'])['timestamp'].diff(1)
            trace['ip&extract_timestamp_diff1'] = trace.groupby(['host_ip','extract_name'])['timestamp'].diff(1)
            trace['endpoint&ip_timestamp_diff1'] = trace.groupby(['endpoint_name','host_ip'])['timestamp'].diff(1)
            trace['endpoint&service_timestamp_diff1'] = trace.groupby(['endpoint_name','service_name'])['timestamp'].diff(1)
            for g_col,(n1,n2) in zip(
                        [['service_name','extract_name'],['service_name','host_ip'],['host_ip','extract_name'],['endpoint_name','host_ip'],['endpoint_name','service_name']],
                        [['service','extract'],          ['service','ip'],          ['ip','extract'],          ['endpoint','ip'],          ['endpoint','service']]
                    ):
                for (col1,col2),tmp in trace.groupby(g_col):
                    if (n1 == 'service' and n2 == 'ip') and (col1 not in train_data_map['host_ip_to_service_name'][col2] & test_data_map['host_ip_to_service_name'][col2]):
                        continue
                    if (n1 == 'endpoint' and n2 == 'ip') and (col1 not in train_data_map['host_ip_to_endpoint_name'][col2] & test_data_map['host_ip_to_endpoint_name'][col2]):
                        continue
                    if (n1 == 'endpoint' and n2 == 'service') and (col1 not in train_data_map['service_name_to_endpoint_name'][col2] & test_data_map['service_name_to_endpoint_name'][col2]):
                        continue
                    feats[f'trace_{n1}&{n2}_{col1}_{col2}_num']=len(tmp)
                    for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
                        feats[f'trace_{n1}&{n2}_{col1}_{col2}_time_sub_{stats_func}']=tmp['time_sub'].apply(stats_func)
                        feats[f'trace_{n1}&{n2}_{col1}_{col2}_timestamp_diff1_{stats_func}']=tmp[f'{n1}&{n2}_timestamp_diff1'].dropna().apply(stats_func) if len(tmp) > 1 else np.nan

        # metric
        if len(metric) > 0:
            # use_cols=[
            #     'metric_name','service_name',
            #     'container','instance','job','kubernetes_io_hostname',
            #     'namespace','cpu','interface','kubernetes_pod_name','mode','minor','broadcast','duplex',
            #     'operstate'
            # ]
            # def process(x):
            #     item=eval(x['tags'])
            #     output=[]
            #     for col in use_cols:
            #         tmp=item[col] if col in item else '无'
            #         output.append(tmp)
            #     return output
            # metric[use_cols]=metric.apply(process, result_type='expand', axis=1)

            feats['metric_has_file'] = 1
            # feats['metric_length'] = len(metric)
            # feats['metric_file_size'] = os.stat(metric_file).st_size

            # for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #     feats[f'metric_value_{stats_func}'] = metric['value'].apply(stats_func)

            # for col in ['timestamp']+use_cols:
            #     for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #         feats[f'metric_{col}_value_sum_{stats_func}'] = metric[[col,'value']].groupby(col)['value'].sum().dropna().apply(stats_func)
            #         feats[f'metric_{col}_value_mean_{stats_func}'] = metric[[col,'value']].groupby(col)['value'].mean().dropna().apply(stats_func)
            #         tmp=metric[[col,'value']].groupby(col)['value'].std().dropna()
            #         feats[f'metric_{col}_value_std_{stats_func}'] = tmp.apply(stats_func) if len(tmp) > 0 else np.nan
            #         tmp=metric[[col,'value']].groupby(col)['value'].skew().dropna()
            #         feats[f'metric_{col}_value_skew_{stats_func}'] = tmp.apply(stats_func) if len(tmp) > 0 else np.nan

            # for col in ['timestamp']+use_cols:
            #     for i,tmp in metric[[col,'value']].groupby(col):
            #         if (col == 'kubernetes_pod_name' and i == 'node-exporter-52ljc'):
            #             continue
            #         if (col == 'container' and i == 'init-mysql'):
            #             continue
            #         feats[f'metric_{col}_{i}_num']=len(tmp)
            #         for stats_func in ['mean', 'std', 'ptp', 'skew', 'kurt']:
            #             feats[f'metric_{col}_{i}_value_{stats_func}']=tmp['value'].apply(stats_func)
        else:
            feats['metric_has_file'] = 0
    
    except Exception as e:
        print(f'{user_id=}  {str(e)}')
        raise

    return feats
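
The summary mentions that multiprocessing and multithreading cut the processing time. A per-sample function such as processing_feature fits that pattern, since each sample is handled independently. The sketch below shows one way to fan such a function out over a worker pool with the standard library; run_parallel and count_lines are hypothetical stand-ins, and the author's actual parallel driver is not shown in this post.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_parallel(func, items, executor_cls=ProcessPoolExecutor, max_workers=4):
    """Apply func to every item with a worker pool, preserving input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def count_lines(text):
    """Stand-in for per-sample work such as parsing one log or trace file."""
    return len(text.splitlines())


if __name__ == "__main__":
    samples = ["a\nb\nc", "d", "e\nf"]
    # Threads suit I/O-bound reads; ProcessPoolExecutor suits CPU-bound parsing.
    print(run_parallel(count_lines, samples, executor_cls=ThreadPoolExecutor))
```

With a process pool, the worker function must be defined at module top level so it can be pickled, and the pool should be started under the `if __name__ == "__main__":` guard.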

Feature selection

Features with more than 95% missing values were dropped, as were features with only one unique value (nunique of 1).

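# Columns excluded from the model inputs: the id, the label, and uninformative features.
filter_cols=['id', 'label']
for i in feature.columns:
    if feature[i].isna().sum() / len(feature) > 0.95:
        # More than 95% of the values are missing.
        filter_cols.append(i)
    elif feature[i].nunique() == 1:
        # Constant column: carries no signal.
        filter_cols.append(i)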

Model selection

Model 1 is a LightGBM + OvR (one-vs-rest) multi-label classifier.

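import joblib
import lightgbm as lgb
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.multiclass import OneVsRestClassifier

# args is assumed to be an attribute-style dict (hyperparameters plus run settings);
# sScore is the author's per-label scoring helper, which is not shown in this post.
def baseline(args,train_X,test_X,y):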
    kf = MultilabelStratifiedKFold(n_splits=args.n_splits, random_state=args.seed, shuffle=True)

    ovr_oof = np.zeros((len(train_X), args.num_classes))
    ovr_preds = np.zeros((len(test_X), args.num_classes))

    for fold,(train_index, valid_index) in enumerate(kf.split(train_X, y)):
        X_train, X_valid = train_X[train_index], train_X[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        clf = OneVsRestClassifier(lgb.LGBMClassifier(**dict(args)))
        clf.fit(X_train, y_train)
        joblib.dump(clf, f'{args.model_path}/{args.model_type}_fold{fold}.joblib')
        ovr_oof[valid_index] = clf.predict_proba(X_valid)
        ovr_preds += clf.predict_proba(test_X) / args.n_splits
        score = sScore(y_valid, ovr_oof[valid_index])
        print(f"fold{fold+1}: Score = {np.mean(score)}")

    return ovr_preds,ovr_oof

Model 2 trains a separate LightGBM binary classifier for each label.

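from copy import deepcopy

import joblib
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# The params in args must include metric='auc' for best_score['valid_0']['auc'] to exist.
def tree_train_binary(args,feature_names,train_scaler_X,test_scaler_X,Y,label_i):
    score = np.zeros(args.n_splits)  # one AUC per fold, instead of a hard-coded 5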
    oof = np.zeros(len(train_scaler_X))
    preds = np.zeros(len(test_scaler_X))
    feature_importance_df = pd.DataFrame()
    kf = StratifiedKFold(n_splits=args.n_splits, shuffle=True, random_state=args.seed)
    y=deepcopy(Y)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train_scaler_X, y)):
        tra = lgb.Dataset(train_scaler_X[train_idx],y[train_idx])
        # reference=tra makes the validation set reuse the training set's feature binning
        val = lgb.Dataset(train_scaler_X[val_idx],y[val_idx],reference=tra)
        model = lgb.train(dict(args), tra, valid_sets=[val], num_boost_round=args.num_boost_round,
                        callbacks=[lgb.early_stopping(args.early_stopping), lgb.log_evaluation(args.num_boost_round)])

        joblib.dump(model, f'{args.model_path}/{args.model_type}_label{label_i}_fold{fold}.joblib')

        auc = model.best_score['valid_0']['auc']
        score[fold] = auc
        print(f'{fold=} {auc=}')

        oof[val_idx] = model.predict(train_scaler_X[val_idx], num_iteration=model.best_iteration)
        preds += model.predict(test_scaler_X, num_iteration=model.best_iteration) / args.n_splits

        fold_importance_df = pd.DataFrame()
        fold_importance_df["Feature"] = feature_names
        fold_importance_df["importance"] = model.feature_importance()
        fold_importance_df["fold"] = fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df])

    print('auc mean',np.mean(score))
    
    return preds,oof,feature_importance_df