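Follow my WeChat official account YueTan to exchange ideas and discuss.
You are welcome to follow the data competition solutions repository: https://github.com/hongyingyue/Competition-solutions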
Features
features = ['user_questions', 'user_mean', 'content_questions', 'content_mean', 'prior_question_elapsed_time']
user_df = train[train.answered_correctly != -1].groupby('user_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
user_df.columns = ['user_id', 'user_questions', 'user_mean']
content_df = train[train.answered_correctly != -1].groupby('content_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
content_df.columns = ['content_id', 'content_questions', 'content_mean']
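# Assumption: mean_prior is the column mean; the original snippet did not define it
mean_prior = train['prior_question_elapsed_time'].mean()
train['prior_question_elapsed_time'] = train['prior_question_elapsed_time'].fillna(mean_prior)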
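Score: 0.75
The five features are:
- Number of questions the user has answered
- The user's answer accuracy
- Number of times the question has been answered
- Answer accuracy on the question
- Average time the user took per question in the previous question bundle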
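To make the aggregation concrete, here is a minimal sketch on a toy interaction log. The data values are hypothetical, and the left-join step is an assumption about how the aggregate tables are attached to the training rows; the notes above do not show it.

```python
import pandas as pd

# Toy interaction log (hypothetical data); -1 marks lecture rows.
train = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2],
    "content_id":         [10, 11, 10, 10, 11],
    "answered_correctly": [1, 0, -1, 1, 1],
})

# Only question rows contribute to the statistics.
questions = train[train.answered_correctly != -1]

user_df = questions.groupby("user_id").agg({"answered_correctly": ["count", "mean"]}).reset_index()
user_df.columns = ["user_id", "user_questions", "user_mean"]

content_df = questions.groupby("content_id").agg({"answered_correctly": ["count", "mean"]}).reset_index()
content_df.columns = ["content_id", "content_questions", "content_mean"]

# Assumed join step: left joins keep every training row and attach the aggregates.
merged = (
    train.merge(user_df, on="user_id", how="left")
         .merge(content_df, on="content_id", how="left")
)
print(merged)
```

Each row of `merged` then carries the four aggregate columns next to the raw columns, ready to be selected with the `features` list.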
Features 2
FEATS = ['answered_correctly_avg_u', 'answered_correctly_sum_u', 'count_u', 'answered_correctly_avg_c', 'part', 'prior_question_had_explanation', 'prior_question_elapsed_time']
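import numpy as np
import pandas as pd
from tqdm import tqdm

def add_user_feats(df, answered_correctly_sum_u_dict, count_u_dict):
    # Both dicts must default to 0 for unseen users (e.g. collections.defaultdict(int)).
    acsu = np.zeros(len(df), dtype=np.int32)
    cu = np.zeros(len(df), dtype=np.int32)
    for cnt, row in enumerate(tqdm(df[['user_id', 'answered_correctly']].values)):
        # Read the running totals before updating them, so the current answer is excluded.
        acsu[cnt] = answered_correctly_sum_u_dict[row[0]]
        cu[cnt] = count_u_dict[row[0]]
        answered_correctly_sum_u_dict[row[0]] += row[1]
        count_u_dict[row[0]] += 1
    user_feats_df = pd.DataFrame({'answered_correctly_sum_u': acsu, 'count_u': cu})
    # count_u is 0 on a user's first row, so the average is NaN there.
    user_feats_df['answered_correctly_avg_u'] = user_feats_df['answered_correctly_sum_u'] / user_feats_df['count_u']
    # concat aligns on the index, so df needs a default 0..n-1 index.
    df = pd.concat([df, user_feats_df], axis=1)
    return df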
# answered correctly average for each content
content_df = train[['content_id','answered_correctly']].groupby(['content_id']).agg(['mean']).reset_index()
content_df.columns = ['content_id', 'answered_correctly_avg_c']
# changing dtype to avoid lightgbm error
train['prior_question_had_explanation'] = train.prior_question_had_explanation.fillna(False).astype('int8')
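Score: 0.76
The seven features are listed below. The user features are incremental: they are computed only from the interactions that precede the current one.
- The user's answer accuracy so far
- Number of questions the user has answered correctly so far
- Total number of questions the user has answered so far
- Accuracy on the content across all answers
- part
- Whether the prior question had an explanation
- Average time spent (prior_question_elapsed_time)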
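The incremental user features can also be computed without an explicit Python loop. The sketch below is a vectorized alternative, not the code used above; the function name `add_user_feats_vectorized` is hypothetical, and it assumes the rows are already in chronological order per user and that no state is carried over from earlier data.

```python
import pandas as pd

def add_user_feats_vectorized(df):
    """Add per-user running statistics that exclude the current row."""
    out = df.copy()
    grouped = out.groupby("user_id")["answered_correctly"]
    # cumsum() includes the current row, so subtract it to keep only earlier answers.
    out["answered_correctly_sum_u"] = grouped.cumsum() - out["answered_correctly"]
    # cumcount() numbers rows 0, 1, 2, ... within each user, i.e. the number of earlier rows.
    out["count_u"] = out.groupby("user_id").cumcount()
    # 0 / 0 gives NaN on a user's first interaction.
    out["answered_correctly_avg_u"] = out["answered_correctly_sum_u"] / out["count_u"]
    return out

# Toy data (hypothetical), already ordered in time within each user.
toy = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 0, 1],
})
result = add_user_feats_vectorized(toy)
print(result)
```

Because each feature uses only earlier rows, the value for a row never depends on that row's own label.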
Features 3
FEATURES = ['prior_question_elapsed_time', 'prior_question_had_explanation', 'part',
'answered_correctly_u_avg', 'elapsed_time_u_avg', 'explanation_u_avg',
'answered_correctly_q_avg', 'elapsed_time_q_avg', 'explanation_q_avg',
'answered_correctly_uq_count', 'timestamp_u_recency_1', 'timestamp_u_recency_2', 'timestamp_u_recency_3',
'timestamp_u_incorrect_recency']
14 features are used.
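The notes do not show how the recency and repeat-count features are built. The sketch below is one plausible reading, stated as an assumption: `timestamp_u_recency_k` is taken to be the gap between the current timestamp and the user's k-th previous interaction, and `answered_correctly_uq_count` the number of earlier attempts by the same user on the same question. The toy data is hypothetical.

```python
import pandas as pd

# Toy log (hypothetical), sorted so that shift() looks back in time per user.
df = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2],
    "content_id": [10, 11, 10, 12, 10],
    "timestamp":  [0, 1000, 5000, 6000, 0],
}).sort_values(["user_id", "timestamp"]).reset_index(drop=True)

by_user = df.groupby("user_id")["timestamp"]
for k in (1, 2, 3):
    # Assumed definition: time since the user's k-th previous interaction (NaN if none).
    df[f"timestamp_u_recency_{k}"] = df["timestamp"] - by_user.shift(k)

# Assumed definition: how many times this user has already seen this question.
df["answered_correctly_uq_count"] = df.groupby(["user_id", "content_id"]).cumcount()
print(df)
```

A feature such as `timestamp_u_incorrect_recency` could be derived the same way by restricting the look-back to rows answered incorrectly.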
Features 4
https://www.kaggle.com/code/a763337092/lgb1215
features_dict = {
#'user_id',
'timestamp':'float16',#
'user_interaction_count':'int16',
'user_interaction_timestamp_mean':'float32',
'lagtime':'float32',#
'lagtime2':'float32',
'lagtime3':'float32',
#'lagtime_mean':'int32',
'content_id':'int16',
'task_container_id':'int16',
'user_lecture_sum':'int16',#
'user_lecture_lv':'float16',##
'prior_question_elapsed_time':'float32',#
'delta_prior_question_elapsed_time':'int32',#
'user_correctness':'float16',#
'user_uncorrect_count':'int16',#
'user_correct_count':'int16',#
#'content_correctness':'float16',
'content_correctness_std':'float16',
'content_correct_count':'int32',
'content_uncorrect_count':'int32',#
'content_elapsed_time_mean':'float16',
'content_had_explanation_mean':'float16',
'content_explation_false_mean':'float16',
'content_explation_true_mean':'float16',
'task_container_correctness':'float16',
'task_container_std':'float16',
'task_container_cor_count':'int32',#
'task_container_uncor_count':'int32',#
'attempt_no':'int8',#
'part':'int8',
'part_correctness_mean':'float16',
'part_correctness_std':'float16',
'part_uncor_count':'int32',
'part_cor_count':'int32',
'tags0': 'int8',
'tags1': 'int8',
'tags2': 'int8',
'tags3': 'int8',
'tags4': 'int8',
'tags5': 'int8',
# 'tags6': 'int8',
# 'tags7': 'int8',
# 'tags0_correctness_mean':'float16',
# 'tags1_correctness_mean':'float16',
# 'tags2_correctness_mean':'float16',
# 'tags4_correctness_mean':'float16',
# 'bundle_id':'int16',
# 'bundle_correctness_mean':'float16',
# 'bundle_uncor_count':'int32',
# 'bundle_cor_count':'int32',
'part_bundle_id':'int32',
'content_sub_bundle':'int8',
'prior_question_had_explanation':'int8',
'explanation_mean':'float16', #
#'explanation_var',#
'explanation_false_count':'int16',#
'explanation_true_count':'int16',#
# 'community':'int8',
# 'part_1',
# 'part_2',
# 'part_3',
# 'part_4',
# 'part_5',
# 'part_6',
# 'part_7',
# 'type_of_concept',
# 'type_of_intention',
# 'type_of_solving_question',
# 'type_of_starter'
}
categorical_columns= [
#'user_id',
'content_id',
'task_container_id',
'part',
# 'community',
'tags0',
'tags1',
'tags2',
'tags3',
'tags4',
'tags5',
#'tags6',
#'tags7',
#'bundle_id',
'part_bundle_id',
'content_sub_bundle',
'prior_question_had_explanation',
# 'part_1',
# 'part_2',
# 'part_3',
# 'part_4',
# 'part_5',
# 'part_6',
# 'part_7',
# 'type_of_concept',
# 'type_of_intention',
# 'type_of_solving_question',
# 'type_of_starter'
]
features=list(features_dict.keys())
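A mapping from feature name to dtype like the one above is typically used to shrink the feature table before training. The sketch below shows that idea on a small subset of the mapping. The DataFrame values are hypothetical, and the way the linked notebook actually applies the mapping is not shown here, so treat this as an assumption.

```python
import pandas as pd

# A small subset of the mapping above, for illustration only.
features_dict = {
    "user_interaction_count": "int16",
    "content_id": "int16",
    "part": "int8",
    "user_correctness": "float16",
}
categorical_columns = ["content_id", "part"]
features = list(features_dict.keys())

# Hypothetical feature table with default 64-bit dtypes.
df = pd.DataFrame({
    "user_interaction_count": [3, 120],
    "content_id": [5692, 128],
    "part": [5, 1],
    "user_correctness": [0.66, 0.5],
    "answered_correctly": [1, 0],
})

# Select the model inputs and downcast each column to its declared dtype.
X = df[features].astype(features_dict)
y = df["answered_correctly"]
print(X.dtypes)
```

The `categorical_columns` list would then usually be handed to the model, for example through the `categorical_feature` argument of LightGBM's `Dataset`, so that those integer columns are treated as categories rather than ordered numbers.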
Top solution
https://www.kaggle.com/competitions/riiid-test-answer-prediction/discussion/209597
- First, the data are sorted by ['user_id', 'timestamp', 'content_id'].
- Features are created in separate arrays via custom rolling functions or custom cumulative functions.
- The model is CatBoost.
['content_id',
'prior_question_elapsed_time',
'prior_question_had_explanation',
'correct_answer',
'user_count',
'user_sum',
'user_mean',
'item_count',
'item_sum',
'item_mean',
'answer_ratio_0',
'answer_ratio_1',
'answer_ratio_2',
'bundle_id',
'part',
'le_tag',
'question_correct_user_ablility_mean',
'question_correct_user_ablility_median',
'question_wrong_user_ablility_mean',
'question_wrong_user_ablility_median',
'word2vec_0',
'word2vec_1',
'word2vec_2',
'word2vec_3',
'word2vec_4',
'svd_0',
'svd_1',
'svd_2',
'svd_3',
'svd_4',
'tags_w2v_correct_mean_0',
'tags_w2v_wrong_mean_0',
'tags_w2v_correct_mean_1',
'tags_w2v_wrong_mean_1',
'tags_w2v_correct_mean_2',
'tags_w2v_wrong_mean_2',
'tags_w2v_correct_mean_3',
'tags_w2v_wrong_mean_3',
'tags_w2v_correct_mean_4',
'tags_w2v_wrong_mean_4',
'real_time_wrong_mean',
'real_time_wrong_median',
'real_time_correct_mean',
'real_time_correct_median',
'task_set_distance_wrong_mean',
'task_set_distance_wrong_median',
'task_set_distance_correct_mean',
'task_set_distance_correct_median',
'mean_0_ratio',
'mean_1_ratio',
'mean_3_ratio',
'mean_4_ratio',
'mean_5_ratio',
'mean_6_ratio',
'mean_7_ratio',
'mean_8_ratio',
'mean_9_ratio',
'mean_10_ratio',
'user_d1',
'user_d2',
'task_set_distance',
'user_diff_mean',
'user_diff_std',
'user_diff_min',
'user_diff_max',
'task_set_item_mean',
'task_set_item_min',
'task_set_item_max',
'task_set_distance2',
'task_distance_shift',
'task_set_distance_diff',
'task_distance_diff_shift',
'container_mean_1',
'container_mean_5',
'container_std_5',
'container_mean_10',
'container_std_10',
'container_mean_20',
'container_std_20',
'container_mean_30',
'container_std_30',
'container_mean_40',
'container_std_40',
'prior_question_elapsed_time_mean_1',
'prior_question_elapsed_time_mean_5',
'prior_question_elapsed_time_mean_10',
'prior_question_elapsed_time_mean_20',
'prior_question_elapsed_time_mean_30',
'prior_question_elapsed_time_mean_40',
'item_mean_mean_30',
'item_mean_mean_40',
'task_set_distance_mean_1',
'task_set_distance_mean_5',
'task_set_distance_mean_10',
'task_set_distance_mean_20',
'task_set_distance_mean_30',
'begin_time_diff',
'end_time_diff',
'part_time_diff_mean',
'part_session_mean',
'part_session_sum',
'part_session_count',
'full_group0_item_mean_mean',
'full_group0_item_mean_median',
'full_group0_task_set_distance_median',
'full_group0_timestamp_mean',
'full_group0_timestamp_median',
'full_group1_item_mean_mean',
'full_group1_item_mean_median',
'full_group1_task_set_distance_median',
'full_group1_timestamp_median',
'part_sum',
'part_count',
'part_mean',
'part_sum_global_ratio',
'part_sum_1',
'part_sum_5',
'part_mean_5',
'part_sum_10',
'part_mean_10',
'cum_answer0_mean_item_mean',
'cum_answer0_median_item_mean',
'cum_answer0_median_task_set_distance',
'cum_answer1_mean_item_mean',
'cum_answer1_median_item_mean',
'cum_answer1_mean_task_set_distance',
'cum_answer1_median_task_set_distance',
'cum_answer0_time_diff',
'cum_answer1_time_diff',
'global_task_set_shift1',
'global_task_set_shift2',
'global_task_set_shift4',
'global_task_set_shift5',
'cum_answer0_mean_wrong_time_diff',
'cum_answer0_median_wrong_time_diff',
'cum_answer1_mean_right_time_diff',
'content_correct_mean',
'content_correct_sum',
'content_correct_count',
'hard_answer0_time',
'hard_answer1_time',
'full_bundle_item_mean_mean',
'full_bundle_item_mean_median',
'full_bundle_task_set_distance_mean',
'full_bundle_task_set_distance_median',
'full_bundle_timestamp_mean',
'full_bundle_timestamp_median',
'bundle_sum',
'bundle_mean',
'bundle_count',
'user_trend_mean',
'user_trend_median',
'user_trend_roll_user_ans_sum',
'user_trend_roll_user_ans_mean',
'user_trend_roll_user_ans_count',
'user_trend_roll_item_ans_mean',
'user_trend_roll_item_ans_count',
'div_ratio1',
'div_ratio2',
'div_ratio3',
'new_Feat0',
'new_Feat1',
'new_Feat2',
'new_Feat3',
'part_time_wrong_div',
'part_time_right_div',
'diff_lag_median_div',
'diff_item_median_div',
'diff_time_median_div',
'diff_item_mean_div',
'diff_task_set_mean_div',
'diff_timestamp_mean_div',
'last_20_frequent_answer',
'last_20_frequent_answer_count',
'last_20_frequent_answer_mean',
'last_20_frequent_answer_sum',
'last_user_same_answer_tf',
'last_item_same_answer_tf',
'last_right_time_diff',
'last_wrong_time_diff',
'last_5_part_time_div',
'last_10_part_time_div',
'last_20_part_time_div']
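The sort-then-scan approach described for the top solution can be sketched as follows. The author's own rolling and cumulative functions are not shown in these notes, so `prior_cumulative_mean`, `prior_rolling_mean`, and the output column names are hypothetical stand-ins that only illustrate the pattern on toy data.

```python
import numpy as np
import pandas as pd

def prior_cumulative_mean(user_ids, values):
    """Mean of all earlier values of the same user; NaN on the user's first row."""
    out = np.full(len(values), np.nan)
    total, count = 0.0, 0
    for i in range(len(values)):
        if i == 0 or user_ids[i] != user_ids[i - 1]:
            total, count = 0.0, 0           # new user: reset the running state
        if count > 0:
            out[i] = total / count
        total += values[i]
        count += 1
    return out

def prior_rolling_mean(user_ids, values, window):
    """Mean of up to `window` earlier values of the same user; NaN on the first row."""
    out = np.full(len(values), np.nan)
    start = 0                               # index where the current user's rows begin
    for i in range(len(values)):
        if i > 0 and user_ids[i] != user_ids[i - 1]:
            start = i
        lo = max(start, i - window)
        if i > lo:
            out[i] = values[lo:i].mean()
    return out

# Toy log (hypothetical), deliberately unsorted.
df = pd.DataFrame({
    "user_id":            [2, 1, 1, 1, 2, 1],
    "timestamp":          [0, 0, 10, 20, 5, 30],
    "content_id":         [7, 3, 4, 5, 8, 6],
    "answered_correctly": [1, 1, 0, 1, 0, 1],
})
# Sorting groups each user's rows together in time order, so one pass is enough.
df = df.sort_values(["user_id", "timestamp", "content_id"]).reset_index(drop=True)
users = df["user_id"].to_numpy()
answers = df["answered_correctly"].to_numpy()
df["user_mean_prior"] = prior_cumulative_mean(users, answers)
df["user_mean_last2"] = prior_rolling_mean(users, answers, window=2)
print(df)
```

Working on plain NumPy arrays after a single sort keeps each feature in its own array, which fits the description above; window sizes such as 5, 10, or 20 would give columns in the style of `container_mean_5` or `part_mean_10`.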
https://www.kaggle.com/code/tomooinubushi/62nd-solution-lightgbm-single-model-lb-0-801