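This post introduces common datasets, code, blog posts, and other resources for knowledge tracing. I am just a diligent curator of other people's material, so read carefully.
Datasets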
Knowledge Tracing Benchmark Dataset
Several datasets are suitable for this task:
KDD Cup 2010 https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
ASSISTments ASSISTments (google.com)
OLI Engineering Statics 2011 https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507
JunyiAcademy Math Practicing Log [Annotation]
slepemapy.cz https://www.fi.muni.cz/adaptivelearning/?a=data
synthetic synthetic (github.com)
math2015
EdNet
pisa2015math
workbankr
critlangacq
The following datasets are provided by EduData ktbd:
Dataset Name | Description |
---|---|
synthetic | The dataset used in Deep Knowledge Tracing, original dataset can be found in github |
assistment_2009_2010 | The dataset used in Deep Knowledge Tracing, original dataset can be found in github |
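junyi | Part of the preprocessed junyi dataset, which only includes the interaction sequences of the 1000 most active students. |
For details, see EduData/ktbd.md at master · bigdata-ustc/EduData (github.com)
Data Format
In knowledge tracing tasks, a popular format (which we call the triple line (tl) format) is used to represent interaction sequence records: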
5
419,419,419,665,665
1,1,1,0,0
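This format can be found in Deep Knowledge Tracing. In it, every three lines make up one interaction sequence. The first line gives the length of the interaction sequence, the second line lists the exercise IDs, and the third line records, for each exercise, whether the answer was correct (1) or wrong (0).
To handle special symbols that are hard to store in the format above, we provide another format, named json sequence, to represent interaction sequence records:
[[419, 1], [419, 1], [419, 1], [665, 0], [665, 0]]
Each item in the sequence represents one interaction. The first element of the item is the exercise ID (in some works the exercise ID does not map one-to-one to a knowledge unit (ku)/concept, but in junyi each exercise contains one ku). The second element indicates whether the learner answered the exercise correctly: 0 for wrong, 1 for correct. Each line holds one JSON record, which corresponds to one learner's interaction sequence.
We provide tools for converting between the two formats: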
# convert tl sequence to json sequence, by default, the exercise tag and answer will be converted into int type
edudata tl2json $src $tar
# convert tl sequence to json sequence without converting
edudata tl2json $src $tar False
# convert json sequence to tl sequence
edudata json2tl $src $tar
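To make the relationship between the two formats concrete, here is a minimal pure-Python sketch of the conversion. It is not the EduData implementation: the function names `tl_to_json_lines` and `json_lines_to_tl` are hypothetical and were chosen for illustration only, and the `to_int` flag merely mimics the int-conversion behaviour described in the comments above.

```python
import json


def tl_to_json_lines(tl_text, to_int=True):
    """Convert triple-line (tl) text into a list of JSON sequence strings.

    Hypothetical helper for illustration; not part of EduData.
    """
    lines = [ln.strip() for ln in tl_text.strip().splitlines() if ln.strip()]
    records = []
    # Every three lines form one sequence: length, exercise IDs, answers.
    for i in range(0, len(lines), 3):
        length = int(lines[i])
        exercises = lines[i + 1].split(",")
        answers = lines[i + 2].split(",")
        if to_int:
            exercises = [int(e) for e in exercises]
            answers = [int(a) for a in answers]
        if not (length == len(exercises) == len(answers)):
            raise ValueError("malformed tl record starting at line %d" % (i + 1))
        pairs = [list(pair) for pair in zip(exercises, answers)]
        records.append(json.dumps(pairs))
    return records


def json_lines_to_tl(json_lines):
    """Convert JSON sequence strings back into triple-line (tl) text."""
    out = []
    for line in json_lines:
        seq = json.loads(line)
        out.append(str(len(seq)))
        out.append(",".join(str(ex) for ex, _ in seq))
        out.append(",".join(str(ans) for _, ans in seq))
    return "\n".join(out)


# The example sequence from the text above.
sample_tl = "5\n419,419,419,665,665\n1,1,1,0,0"
json_records = tl_to_json_lines(sample_tl)
print(json_records[0])
print(json_lines_to_tl(json_records))
```

For real data, the `edudata tl2json` and `edudata json2tl` commands above remain the intended tools; the sketch only shows what the transformation amounts to.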
Dataset Preprocess
EduData/ASSISTments2015.ipynb at master · bigdata-ustc/EduData (github.com)
ASSISTments2015 Data Analysis
Data Description
Column Description
Field | Annotation |
---|---|
user id | Id of the student |
log id | Unique ID of the logged actions |
sequence id | Id of the problem set |
correct | Correct on the first attempt, incorrect on the first attempt, or asked for help |
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)
Record Examples
pd.set_option('display.max_columns', 500)
data.head()
| | user_id | log_id | sequence_id | correct |
---|---|---|---|---|
0 | 50121 | 167478035 | 7014 | 0.0 |
1 | 50121 | 167478043 | 7014 | 1.0 |
2 | 50121 | 167478053 | 7014 | 1.0 |
3 | 50121 | 167478069 | 7014 | 1.0 |
4 | 50964 | 167478041 | 7014 | 1.0 |
General features
data.describe()
| | user_id | log_id | sequence_id | correct |
---|---|---|---|---|
count | 708631.000000 | 7.086310e+05 | 708631.000000 | 708631.000000 |
mean | 296232.978276 | 1.695323e+08 | 22683.474821 | 0.725502 |
std | 48018.650247 | 3.608096e+06 | 41593.028018 | 0.437467 |
min | 50121.000000 | 1.509145e+08 | 5898.000000 | 0.000000 |
25% | 279113.000000 | 1.660355e+08 | 7020.000000 | 0.000000 |
50% | 299168.000000 | 1.704579e+08 | 9424.000000 | 1.000000 |
75% | 335647.000000 | 1.723789e+08 | 14442.000000 | 1.000000 |
max | 362374.000000 | 1.754827e+08 | 236309.000000 | 1.000000 |
print("The number of records: "+ str(len(data['log_id'].unique())))
The number of records: 708631
print('Part of missing values for every column')
print(data.isnull().sum() / len(data))
Part of missing values for every column
user_id 0.0
log_id 0.0
sequence_id 0.0
correct 0.0
dtype: float64
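As one possible next step, records like the ones above can be grouped into per-student interaction sequences and written in the tl format described earlier. The sketch below uses only the standard library and the five example rows from `data.head()`. It rests on assumptions that are not confirmed by the notebook: that `sequence_id` serves as the exercise tag, that ascending `log_id` reflects chronological order, and that `correct` is exactly 0 or 1 for the rows used. The helper names `build_sequences` and `to_tl` are hypothetical.

```python
from collections import defaultdict

# The five example rows shown above: (user_id, log_id, sequence_id, correct)
rows = [
    (50121, 167478035, 7014, 0.0),
    (50121, 167478043, 7014, 1.0),
    (50121, 167478053, 7014, 1.0),
    (50121, 167478069, 7014, 1.0),
    (50964, 167478041, 7014, 1.0),
]


def build_sequences(rows):
    """Group records by student and order each group by log_id."""
    per_user = defaultdict(list)
    for user_id, log_id, sequence_id, correct in rows:
        # Assumption: correct is exactly 0.0 or 1.0, so int() loses nothing.
        per_user[user_id].append((log_id, sequence_id, int(correct)))
    sequences = {}
    for user_id, logs in per_user.items():
        logs.sort()  # Assumption: ascending log_id is chronological order.
        sequences[user_id] = [[seq_id, ans] for _, seq_id, ans in logs]
    return sequences


def to_tl(sequence):
    """Render one [[tag, answer], ...] sequence as three tl lines."""
    return "\n".join([
        str(len(sequence)),
        ",".join(str(tag) for tag, _ in sequence),
        ",".join(str(ans) for _, ans in sequence),
    ])


sequences = build_sequences(rows)
for user_id, seq in sequences.items():
    print(user_id)
    print(to_tl(seq))
```

With pandas, the same grouping would typically be done with `data.sort_values("log_id").groupby("user_id")`; the standard-library version is used here only to keep the sketch dependency-free.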
Collected implementations:
https://github.com/seewoo5/KT
DKT (Deep Knowledge Tracing)
- Paper: https://web.stanford.edu/~cpiech/bio/papers/deepKnowledgeTracing.pdf
- Model: RNN, LSTM (only LSTM is implemented)
- GitHub: https://github.com/chrispiech/DeepKnowledgeTracing (Lua)
- Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters |
---|---|---|---|
ASSISTments2009 | 77.02 ± 0.07 | 81.81 ± 0.10 | input_dim=100, hidden_dim=100 |
ASSISTments2015 | 74.94 ± 0.04 | 72.94 ± 0.05 | input_dim=100, hidden_dim=100 |
ASSISTmentsChall | 68.67 ± 0.09 | 72.29 ± 0.06 | input_dim=100, hidden_dim=100 |
STATICS | 81.27 ± 0.06 | 82.87 ± 0.10 | input_dim=100, hidden_dim=100 |
Junyi Academy | 85.4 | 80.58 | input_dim=100, hidden_dim=100 |
EdNet-KT1 | 72.72 | 76.99 | input_dim=100, hidden_dim=100 |
- All models are trained with batch size 2048 and sequence size 200.
DKVMN (Dynamic Key-Value Memory Network)
- Paper: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p765.pdf
- Model: Extension of Memory-Augmented Neural Network (MANN)
- GitHub: https://github.com/jennyzhang0215/DKVMN (MxNet)
- Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters |
---|---|---|---|
ASSISTments2009 | 75.61 ± 0.21 | 79.56 ± 0.29 | key_dim = 50, value_dim = 200, summary_dim = 50, concept_num = 20, batch_size = 1024 |
ASSISTments2015 | 74.71 ± 0.02 | 71.57 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048 |
ASSISTmentsChall | 67.16 ± 0.05 | 67.38 ± 0.07 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048 |
STATICS | 80.66 ± 0.09 | 81.16 ± 0.08 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 1024 |
Junyi Academy | 85.04 | 79.68 | key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 512 |
EdNet-KT1 | 72.32 | 76.48 | key_dim = 100, value_dim = 100, summary_dim = 100, concept_num = 100, batch_size = 256 |
- Due to memory issues, not all models are trained with batch size 2048.
NPA (Neural Pedagogical Agency)
- Paper: https://arxiv.org/abs/1906.10910
- Model: Bi-LSTM + Attention
- Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters |
---|---|---|---|
ASSISTments2009 | 77.11 ± 0.08 | 81.82 ± 0.13 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
ASSISTments2015 | 75.02 ± 0.05 | 72.94 ± 0.08 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
ASSISTmentsChall | 69.34 ± 0.03 | 73.26 ± 0.03 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
STATICS | 81.38 ± 0.14 | 83.1 ± 0.25 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
Junyi Academy | 85.57 | 81.10 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
EdNet-KT1 | 73.05 | 77.58 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200 |
- All models are trained with batch size 2048 and sequence size 200.
SAKT (Self-Attentive Knowledge Tracing)
- Paper: https://files.eric.ed.gov/fulltext/ED599186.pdf
- Model: Transformer (1-layer, only encoder with subsequent mask)
- GitHub: https://github.com/shalini1194/SAKT (Tensorflow)
- Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters |
---|---|---|---|
ASSISTments2009 | 76.36 ± 0.15 | 80.78 ± 0.10 | hidden_dim=100, seq_size=100, batch_size=512 |
ASSISTments2015 | 74.57 ± 0.07 | 71.49 ± 0.03 | hidden_dim=100, seq_size=50, batch_size=512 |
ASSISTmentsChall | 67.53 ± 0.06 | 69.70 ± 0.32 | hidden_dim=100, seq_size=200, batch_size=512 |
STATICS | 80.45 ± 0.13 | 80.30 ± 0.31 | hidden_dim=100, seq_size=500, batch_size=128 |
Junyi Academy | 85.27 | 80.36 | hidden_dim=100, seq_size=200, batch_size=512 |
EdNet-KT1 | 72.44 | 76.60 | hidden_dim=200, seq_size=200, batch_size=512 |
https://github.com/bigdata-ustc/TKT
Knowledge tracing models implemented with mxnet-gluon. For convenient dataset downloading and preprocessing for knowledge tracing tasks, see EduData, which provides a handy API.
Visit https://base.ustc.edu.cn for more of our work.
Performance on well-known datasets
With EduData, we tested the models' performance. The AUC results are listed as follows:
model name | synthetic | assistment_2009_2010 | junyi |
---|---|---|---|
DKT | 0.6438748958881487 | 0.7442573465541942 | 0.8305416859735839 |
DKT+ | 0.8062221383790489 | 0.7483424087919035 | 0.8497422607539136 |
EmbedDKT | 0.4858168704660636 | 0.7285572301977586 | 0.8194401881889697 |
EmbedDKT+ | 0.7340996181876187 | 0.7490900876356051 | 0.8405445812109871 |
DKVMN | TBA | TBA | TBA |
The F1 scores are listed as follows:
model name | synthetic | assistment_2009_2010 | junyi |
---|---|---|---|
DKT | 0.5813237474584396 | 0.7134380508024369 | 0.7732850122818582 |
DKT+ | 0.7041804463370387 | 0.7137627713343819 | 0.7928075377114897 |
EmbedDKT | 0.4716821311199386 | 0.7095025134079656 | 0.7681817174082963 |
EmbedDKT+ | 0.6316953625658291 | 0.7101790604990228 | 0.7903592922756097 |
DKVMN | TBA | TBA | TBA |
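For readers unfamiliar with the two metrics in the tables above, the following is a minimal pure-Python sketch of how AUC and F1 can be computed from predicted probabilities. It is not the evaluation code of the repository. In particular, the binary F1 with a 0.5 decision threshold is an assumption, since the averaging mode and threshold behind the reported figures are not stated here. The labels and scores are made-up toy values.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """Binary F1 for the positive class; the threshold is an assumption."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = answered correctly, 0 = answered wrongly.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.6]
print(auc_score(labels, scores))
print(f1_score(labels, scores))
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for illustration; library implementations typically use a sort-based method for large evaluation sets.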
Information about the benchmark datasets can be found in the EduData docs.
In addition, all models are trained for 20 epochs with batch_size=16, and the best result is reported. We use adam with learning_rate=1e-3. We also apply bucketing to speed up training. Moreover, each sample length is limited to 200. The hyper-parameters are listed as follows:
model name | synthetic - 50 | assistment_2009_2010 - 124 | junyi-835 |
---|---|---|---|
DKT | hidden_num=int(100);dropout=float(0.5) | hidden_num=int(200);dropout=float(0.5) | hidden_num=int(900);dropout=float(0.5) |
DKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0) |
EmbedDKT | hidden_num=int(100);latent_dim=int(35);dropout=float(0.5) | hidden_num=int(200);latent_dim=int(75);dropout=float(0.5) | hidden_num=int(900);latent_dim=int(600);dropout=float(0.5) |
EmbedDKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0) |
DKVMN | hidden_num=int(50);key_embedding_dim=int(10);value_embedding_dim=int(10);key_memory_size=int(5);key_memory_state_dim=int(10);value_memory_size=int(5);value_memory_state_dim=int(10);dropout=float(0.5) | hidden_num=int(50);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(50);key_memory_state_dim=int(50);value_memory_size=int(50);value_memory_state_dim=int(200);dropout=float(0.5) | hidden_num=int(600);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(20);key_memory_state_dim=int(50);value_memory_size=int(20);value_memory_state_dim=int(200);dropout=float(0.5) |
The number after - in the first row indicates the number of knowledge units in the dataset. The datasets we used can either be found in basedata-ktbd or be downloaded with:
pip install EduData
edudata download ktbd
Trick
- DKT: hidden_num is usually set to the nearest hundred to ku_num
- EmbedDKT: latent_dim is usually set to a value less than or equal to \sqrt(hidden_num * ku_num)
- DKVMN: key_embedding_dim = key_memory_state_dim and value_embedding_dim = value_memory_state_dim
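The first two rules of thumb can be checked against the hyper-parameter table above. The sketch below is only an illustration: the helper names are hypothetical, and reading "nearest hundred" as "round up to the next multiple of 100" is an interpretation, chosen because it reproduces the hidden_num values in the table (50 → 100, 124 → 200, 835 → 900).

```python
import math


def suggest_hidden_num(ku_num):
    """Round ku_num up to the next multiple of 100.

    Assumption: this is one reading of 'nearest hundred' that matches
    the hidden_num values in the hyper-parameter table.
    """
    return math.ceil(ku_num / 100) * 100


def latent_dim_upper_bound(hidden_num, ku_num):
    """Upper bound for EmbedDKT's latent_dim: sqrt(hidden_num * ku_num)."""
    return math.sqrt(hidden_num * ku_num)


# (ku_num, hidden_num, latent_dim) taken from the table above.
table = {
    "synthetic": (50, 100, 35),
    "assistment_2009_2010": (124, 200, 75),
    "junyi": (835, 900, 600),
}

for name, (ku_num, hidden_num, latent_dim) in table.items():
    bound = latent_dim_upper_bound(hidden_num, ku_num)
    print(name, suggest_hidden_num(ku_num), round(bound, 1), latent_dim <= bound)
```

All three datasets satisfy both rules under this reading, though the chosen latent_dim values sit well below the bound, so the bound alone does not pin down a value.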
Notice
Some PyTorch interfaces may change between versions, such as
import torch
torch.nn.functional.one_hot