Knowledge Tracing Resource Post 1

This post collects common datasets, code, blogs, and other resources for knowledge tracing. I am just a diligent curator; have a good read.

Datasets

Knowledge Tracing Benchmark Dataset

Several datasets are suitable for this task:

  • KDD Cup 2010  https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
  • ASSISTments (google.com)
  • OLI Engineering Statics 2011  https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507
  • JunyiAcademy Math Practicing Log [Annotation]
  • slepemapy.cz  https://www.fi.muni.cz/adaptivelearning/?a=data
  • synthetic (github.com)
  • math2015
  • EdNet
  • pisa2015math
  • workbankr
  • critlangacq

The following datasets are provided by EduData ktbd:

| Dataset Name | Description |
| --- | --- |
| synthetic | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub |
| assistment_2009_2010 | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub |
| junyi | Part of the preprocessed junyi dataset, which only includes the 1000 most active student interaction sequences |

For details, see EduData/ktbd.md at master · bigdata-ustc/EduData (github.com)

Data Format

In the knowledge tracing task, a popular format (which we call the triple-line (tl) format) is used to represent interaction sequence records:

5

419,419,419,665,665

1,1,1,0,0

This format can be found in Deep Knowledge Tracing. In this format, three lines make up one interaction sequence. The first line gives the length of the sequence, the second line lists the exercise IDs, and the third line indicates, for each position, whether the answer was correct (1) or incorrect (0).

To handle the problem that some special symbols are hard to store in the above format, we provide another format, called the json sequence, to represent interaction sequence records:

[[419, 1], [419, 1], [419, 1], [665, 0], [665, 0]]

Each item in the sequence represents one interaction. The first element of an item is the exercise ID (in some works an exercise ID does not map one-to-one to a knowledge unit (ku)/concept, but in junyi one exercise corresponds to one ku), and the second element indicates whether the learner answered the exercise correctly: 0 for wrong, 1 for right. Each line is one json record, corresponding to one learner's interaction sequence.

We provide tools for converting between the two formats:

# convert tl sequence to json sequence, by default, the exercise tag and answer will be converted into int type
edudata tl2json $src $tar
# convert tl sequence to json sequence without converting
edudata tl2json $src $tar False
# convert json sequence to tl sequence
edudata json2tl $src $tar
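
For illustration, here is a minimal pure-Python sketch of the same tl-to-json conversion (a hypothetical stand-alone script, not the EduData implementation); it assumes the tl file is well formed, with exactly three lines per sequence:

import json

def tl2json(src, tar):
    # Read a triple-line (tl) file and write one json record per learner sequence.
    with open(src) as fin, open(tar, "w") as fout:
        lines = [line.strip() for line in fin if line.strip()]
        for i in range(0, len(lines), 3):
            length = int(lines[i])
            exercises = [int(x) for x in lines[i + 1].split(",")[:length]]
            answers = [int(x) for x in lines[i + 2].split(",")[:length]]
            fout.write(json.dumps([list(pair) for pair in zip(exercises, answers)]) + "\n")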

Dataset Preprocessing

https://github.com/ckyeungac/deepknowledgetracing/blob/master/notebooks/ProcessSkillBuilder0910.ipynb

EduData/ASSISTments2015.ipynb at master · bigdata-ustc/EduData (github.com)

ASSISTments2015 Data Analysis

Data Description

Column Description

| Field | Annotation |
| --- | --- |
| user id | ID of the student |
| log id | Unique ID of the logged actions |
| sequence id | ID of the problem set |
| correct | Correct on the first attempt, incorrect on the first attempt, or asked for help |

import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

pd.set_option('display.max_columns', 500)
data.head()
|   | user_id | log_id | sequence_id | correct |
| --- | --- | --- | --- | --- |
| 0 | 50121 | 167478035 | 7014 | 0.0 |
| 1 | 50121 | 167478043 | 7014 | 1.0 |
| 2 | 50121 | 167478053 | 7014 | 1.0 |
| 3 | 50121 | 167478069 | 7014 | 1.0 |
| 4 | 50964 | 167478041 | 7014 | 1.0 |

General features

data.describe()
|   | user_id | log_id | sequence_id | correct |
| --- | --- | --- | --- | --- |
| count | 708631.000000 | 7.086310e+05 | 708631.000000 | 708631.000000 |
| mean | 296232.978276 | 1.695323e+08 | 22683.474821 | 0.725502 |
| std | 48018.650247 | 3.608096e+06 | 41593.028018 | 0.437467 |
| min | 50121.000000 | 1.509145e+08 | 5898.000000 | 0.000000 |
| 25% | 279113.000000 | 1.660355e+08 | 7020.000000 | 0.000000 |
| 50% | 299168.000000 | 1.704579e+08 | 9424.000000 | 1.000000 |
| 75% | 335647.000000 | 1.723789e+08 | 14442.000000 | 1.000000 |
| max | 362374.000000 | 1.754827e+08 | 236309.000000 | 1.000000 |
print("The number of records: "+ str(len(data['log_id'].unique())))
The number of records: 708631

print('Part of missing values for every column')
print(data.isnull().sum() / len(data))
Part of missing values for every column
user_id        0.0
log_id         0.0
sequence_id    0.0
correct        0.0
dtype: float64
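
As a bridge back to the interaction-sequence formats above, a minimal sketch of turning this table into per-student sequences could look like the following; treating log_id as the time order and sequence_id as the exercise tag are assumptions made for illustration, not the official preprocessing:

# Build one [exercise, correct] sequence per student, ordered by log_id.
sequences = (
    data.sort_values("log_id")
        .groupby("user_id")
        .apply(lambda df: df[["sequence_id", "correct"]].astype(int).values.tolist())
)
print(sequences.iloc[0][:5])  # first five interactions of one student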

Collection of implementation code:

https://github.com/seewoo5/KT

DKT (Deep Knowledge Tracing)

| Dataset | ACC (%) | AUC (%) | Hyper Parameters |
| --- | --- | --- | --- |
| ASSISTments2009 | 77.02 ± 0.07 | 81.81 ± 0.10 | input_dim=100, hidden_dim=100 |
| ASSISTments2015 | 74.94 ± 0.04 | 72.94 ± 0.05 | input_dim=100, hidden_dim=100 |
| ASSISTmentsChall | 68.67 ± 0.09 | 72.29 ± 0.06 | input_dim=100, hidden_dim=100 |
| STATICS | 81.27 ± 0.06 | 82.87 ± 0.10 | input_dim=100, hidden_dim=100 |
| Junyi Academy | 85.4 | 80.58 | input_dim=100, hidden_dim=100 |
| EdNet-KT1 | 72.72 | 76.99 | input_dim=100, hidden_dim=100 |
  • All models are trained with batch size 2048 and sequence size 200.

DKVMN (Dynamic Key-Value Memory Network)

| Dataset | ACC (%) | AUC (%) | Hyper Parameters |
| --- | --- | --- | --- |
| ASSISTments2009 | 75.61 ± 0.21 | 79.56 ± 0.29 | key_dim = 50, value_dim = 200, summary_dim = 50, concept_num = 20, batch_size = 1024 |
| ASSISTments2015 | 74.71 ± 0.02 | 71.57 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048 |
| ASSISTmentsChall | 67.16 ± 0.05 | 67.38 ± 0.07 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048 |
| STATICS | 80.66 ± 0.09 | 81.16 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 1024 |
| Junyi Academy | 85.04 | 79.68 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 512 |
| EdNet-KT1 | 72.32 | 76.48 | key_dim = 100, value_dim = 100, summary_dim = 100, concept_num = 100, batch_size = 256 |
  • Due to memory issues, not all models are trained with batch size 2048.

NPA (Neural Pedagogical Agency)

| Dataset | ACC (%) | AUC (%) | Hyper Parameters |
| --- | --- | --- | --- |
| ASSISTments2009 | 77.11 ± 0.08 | 81.82 ± 0.13 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
| ASSISTments2015 | 75.02 ± 0.05 | 72.94 ± 0.08 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
| ASSISTmentsChall | 69.34 ± 0.03 | 73.26 ± 0.03 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
| STATICS | 81.38 ± 0.14 | 83.1 ± 0.25 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
| Junyi Academy | 85.57 | 81.10 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
| EdNet-KT1 | 73.05 | 77.58 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
  • All models are trained with batch size 2048 and sequence size 200.

SAKT (Self-Attentive Knowledge Tracing)

| Dataset | ACC (%) | AUC (%) | Hyper Parameters |
| --- | --- | --- | --- |
| ASSISTments2009 | 76.36 ± 0.15 | 80.78 ± 0.10 | hidden_dim=100, seq_size=100, batch_size=512 |
| ASSISTments2015 | 74.57 ± 0.07 | 71.49 ± 0.03 | hidden_dim=100, seq_size=50, batch_size=512 |
| ASSISTmentsChall | 67.53 ± 0.06 | 69.70 ± 0.32 | hidden_dim=100, seq_size=200, batch_size=512 |
| STATICS | 80.45 ± 0.13 | 80.30 ± 0.31 | hidden_dim=100, seq_size=500, batch_size=128 |
| Junyi Academy | 85.27 | 80.36 | hidden_dim=100, seq_size=200, batch_size=512 |
| EdNet-KT1 | 72.44 | 76.60 | hidden_dim=200, seq_size=200, batch_size=512 |

https://github.com/bigdata-ustc/TKT

Knowledge Tracing models implemented with MXNet-Gluon. For convenient dataset downloading and preprocessing for the knowledge tracing task, visit EduData for a handy API.

Visit https://base.ustc.edu.cn for more of our works.

Performance on Well-Known Datasets

With EduData, we tested the models' performance; the AUC results are listed as follows:

| model name | synthetic | assistment_2009_2010 | junyi |
| --- | --- | --- | --- |
| DKT | 0.6438748958881487 | 0.7442573465541942 | 0.8305416859735839 |
| DKT+ | 0.8062221383790489 | 0.7483424087919035 | 0.8497422607539136 |
| EmbedDKT | 0.4858168704660636 | 0.7285572301977586 | 0.8194401881889697 |
| EmbedDKT+ | 0.7340996181876187 | 0.7490900876356051 | 0.8405445812109871 |
| DKVMN | TBA | TBA | TBA |

The F1 scores are listed as follows:

| model name | synthetic | assistment_2009_2010 | junyi |
| --- | --- | --- | --- |
| DKT | 0.5813237474584396 | 0.7134380508024369 | 0.7732850122818582 |
| DKT+ | 0.7041804463370387 | 0.7137627713343819 | 0.7928075377114897 |
| EmbedDKT | 0.4716821311199386 | 0.7095025134079656 | 0.7681817174082963 |
| EmbedDKT+ | 0.6316953625658291 | 0.7101790604990228 | 0.7903592922756097 |
| DKVMN | TBA | TBA | TBA |

The information of the benchmark datasets can be found in EduData docs.

In addition, all models are trained for 20 epochs with batch_size=16, and the best result is reported. We use Adam with learning_rate=1e-3. We also apply bucketing to accelerate training. Moreover, each sample length is limited to 200. The hyper-parameters are listed as follows:

| model name | synthetic - 50 | assistment_2009_2010 - 124 | junyi - 835 |
| --- | --- | --- | --- |
| DKT | hidden_num=int(100); dropout=float(0.5) | hidden_num=int(200); dropout=float(0.5) | hidden_num=int(900); dropout=float(0.5) |
| DKT+ | lr=float(0.2); lw1=float(0.001); lw2=float(10.0) | lr=float(0.1); lw1=float(0.003); lw2=float(3.0) | lr=float(0.01); lw1=float(0.001); lw2=float(1.0) |
| EmbedDKT | hidden_num=int(100); latent_dim=int(35); dropout=float(0.5) | hidden_num=int(200); latent_dim=int(75); dropout=float(0.5) | hidden_num=int(900); latent_dim=int(600); dropout=float(0.5) |
| EmbedDKT+ | lr=float(0.2); lw1=float(0.001); lw2=float(10.0) | lr=float(0.1); lw1=float(0.003); lw2=float(3.0) | lr=float(0.01); lw1=float(0.001); lw2=float(1.0) |
| DKVMN | hidden_num=int(50); key_embedding_dim=int(10); value_embedding_dim=int(10); key_memory_size=int(5); key_memory_state_dim=int(10); value_memory_size=int(5); value_memory_state_dim=int(10); dropout=float(0.5) | hidden_num=int(50); key_embedding_dim=int(50); value_embedding_dim=int(200); key_memory_size=int(50); key_memory_state_dim=int(50); value_memory_size=int(50); value_memory_state_dim=int(200); dropout=float(0.5) | hidden_num=int(600); key_embedding_dim=int(50); value_embedding_dim=int(200); key_memory_size=int(20); key_memory_state_dim=int(50); value_memory_size=int(20); value_memory_state_dim=int(200); dropout=float(0.5) |

The number after "-" in the first row indicates the number of knowledge units in the dataset. The datasets we used can either be found in basedata-ktbd or be downloaded by:

pip install EduData
edudata download ktbd

Trick

  • DKT: hidden_num is usually set to the nearest hundred to ku_num
  • EmbedDKT: latent_dim is usually set to a value less than or equal to sqrt(hidden_num * ku_num) (see the sketch below)
  • DKVMN: key_embedding_dim = key_memory_state_dim and value_embedding_dim = value_memory_state_dim
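
A small illustrative sketch of these heuristics (hypothetical helper name; it restates the rules of thumb above rather than reproducing the exact values in the hyper-parameter table):

import math

def suggest_hyperparams(ku_num):
    # hidden_num: nearest hundred to ku_num (at least 100); latent_dim: upper bound sqrt(hidden_num * ku_num).
    hidden_num = max(100, round(ku_num / 100) * 100)
    latent_dim = int(math.sqrt(hidden_num * ku_num))
    return {"hidden_num": hidden_num, "latent_dim": latent_dim}

print(suggest_hyperparams(124))  # {'hidden_num': 100, 'latent_dim': 111}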

Notice

Some PyTorch interfaces may change across versions, for example:

import torch
torch.nn.functional.one_hot
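
For instance, torch.nn.functional.one_hot is missing in very old PyTorch releases; a compatibility sketch (hypothetical helper, assuming a tensor of integer class labels) could fall back to scatter_:

import torch

def one_hot_compat(labels, num_classes):
    # Prefer the built-in when it exists; otherwise build the encoding with scatter_.
    if hasattr(torch.nn.functional, "one_hot"):
        return torch.nn.functional.one_hot(labels, num_classes)
    out = torch.zeros(*labels.shape, num_classes, dtype=torch.long)
    return out.scatter_(-1, labels.unsqueeze(-1), 1)

print(one_hot_compat(torch.tensor([0, 2, 1]), 3))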