Knowledge Tracing Resource Post 1

sereasuesue

Article link: https://blog.csdn.net/sereasuesue/article/details/110549345

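This post collects common datasets, code, and blog posts for knowledge tracing. I am only a diligent curator of other people's work here, so read it carefully.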

Datasets

Knowledge Tracing Benchmark Dataset

Several datasets are suitable for this task:

KDD Cup 2010 https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp

ASSISTments ASSISTments (google.com)

OLI Engineering Statics 2011 https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507

JunyiAcademy Math Practicing Log [Annotation]

slepemapy.cz https://www.fi.muni.cz/adaptivelearning/?a=data

synthetic synthetic (github.com)

math2015

EdNet

pisa2015math

workbankr

critlangacq

The following datasets are provided by EduData ktbd:

Dataset Name	Description
synthetic	The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
assistment_2009_2010	The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
junyi	A preprocessed subset of junyi that includes only the interaction sequences of the 1000 most active students

For details, see EduData/ktbd.md at master · bigdata-ustc/EduData (github.com)

Data Format

In knowledge tracing tasks, a popular format, which we call the triple line (tl) format, is used to represent interaction sequence records:

5

419,419,419,665,665

1,1,1,0,0

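This format can be found in Deep Knowledge Tracing. In this format, every three lines make up one interaction sequence. The first line gives the length of the interaction sequence, the second line lists the exercise IDs, and the third line lists the responses, where each element is either a correct answer (1) or an incorrect answer (0).

To handle cases where some special symbols are hard to store in the format above, we provide another format, called json sequence, to represent interaction sequence records:

[[419, 1], [419, 1], [419, 1], [665, 0], [665, 0]]

Each item in the sequence represents one interaction. The first element of an item is the exercise ID (in some works the exercise ID does not map one-to-one to a knowledge unit (ku)/concept, but in junyi each exercise contains exactly one ku). The second element indicates whether the learner answered the exercise correctly: 0 means wrong and 1 means right. Each line holds one json record, which corresponds to one learner's interaction sequence.

We provide tools for converting between the two formats: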

# convert tl sequence to json sequence, by default, the exercise tag and answer will be converted into int type
edudata tl2json $src $tar
# convert tl sequence to json sequence without converting the values to int
edudata tl2json $src $tar False
# convert json sequence to tl sequence
edudata json2tl $src $tar
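To make the relationship between the two formats concrete, here is a minimal pure-Python sketch of the conversion. It is not the EduData implementation; the function names `tl_to_json_lines` and `json_lines_to_tl` and the `to_int` flag are hypothetical and only mirror the behaviour described above.

```python
import json


def tl_to_json_lines(tl_text, to_int=True):
    """Convert triple-line (tl) text into a list of json-sequence strings."""
    # Drop blank lines so the records can be read in groups of three.
    lines = [line.strip() for line in tl_text.splitlines() if line.strip()]
    records = []
    for i in range(0, len(lines), 3):
        length = int(lines[i])
        exercises = lines[i + 1].split(",")
        answers = lines[i + 2].split(",")
        if not (len(exercises) == len(answers) == length):
            raise ValueError("record %d has inconsistent lengths" % (i // 3))
        if to_int:
            # Assumed default: exercise IDs and answers become integers.
            exercises = [int(e) for e in exercises]
            answers = [int(a) for a in answers]
        # One json record per learner: [[exercise_id, answer], ...]
        records.append(json.dumps([[e, a] for e, a in zip(exercises, answers)]))
    return records


def json_lines_to_tl(json_lines):
    """Convert json-sequence strings back into triple-line (tl) text."""
    out = []
    for line in json_lines:
        seq = json.loads(line)
        out.append(str(len(seq)))
        out.append(",".join(str(e) for e, _ in seq))
        out.append(",".join(str(a) for _, a in seq))
    return "\n".join(out)


tl_example = "5\n419,419,419,665,665\n1,1,1,0,0"
json_records = tl_to_json_lines(tl_example)
print(json_records[0])
print(json_lines_to_tl(json_records) == tl_example)
```

Passing `to_int=False` keeps the IDs as strings, which is useful when exercise IDs are not numeric.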

Dataset Preprocessing

https://github.com/ckyeungac/deepknowledgetracing/blob/master/notebooks/ProcessSkillBuilder0910.ipynb

EduData/ASSISTments2015.ipynb at master · bigdata-ustc/EduData (github.com)

ASSISTments2015 Data Analysis

Data Description

Column Description

Field	Annotation
user id	Id of the student
log id	Unique ID of the logged actions
sequence id	Id of the problem set
correct	Correct on the first attempt, incorrect on the first attempt, or asked for help

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# Load the ASSISTments2015 skill-builder log, read with the ISO-8859-15 encoding
path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

pd.set_option('display.max_columns', 500)
data.head()

	user_id	log_id	sequence_id	correct
0	50121	167478035	7014	0.0
1	50121	167478043	7014	1.0
2	50121	167478053	7014	1.0
3	50121	167478069	7014	1.0
4	50964	167478041	7014	1.0

General features

data.describe()

	user_id	log_id	sequence_id	correct
count	708631.000000	7.086310e+05	708631.000000	708631.000000
mean	296232.978276	1.695323e+08	22683.474821	0.725502
std	48018.650247	3.608096e+06	41593.028018	0.437467
min	50121.000000	1.509145e+08	5898.000000	0.000000
25%	279113.000000	1.660355e+08	7020.000000	0.000000
50%	299168.000000	1.704579e+08	9424.000000	1.000000
75%	335647.000000	1.723789e+08	14442.000000	1.000000
max	362374.000000	1.754827e+08	236309.000000	1.000000

print("The number of records: " + str(data['log_id'].nunique()))

The number of records: 708631

print('Fraction of missing values for every column')
print(data.isnull().sum() / len(data))

Fraction of missing values for every column
user_id        0.0
log_id         0.0
sequence_id    0.0
correct        0.0
dtype: float64
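Before a table like this can be fed to a knowledge tracing model, the flat log has to be grouped into one interaction sequence per student. The sketch below shows one way to do that in plain Python on the five sample rows from "Record Examples". It rests on several assumptions that are not stated in the source: that `log_id` order reflects chronological order, that `sequence_id` serves as the exercise ID, and that `correct` is thresholded at 0.5 to get a 0/1 label. The helper names `build_sequences` and `to_tl` are hypothetical.

```python
from collections import defaultdict

# The five sample rows shown above: (user_id, log_id, sequence_id, correct)
rows = [
    (50121, 167478035, 7014, 0.0),
    (50121, 167478043, 7014, 1.0),
    (50121, 167478053, 7014, 1.0),
    (50121, 167478069, 7014, 1.0),
    (50964, 167478041, 7014, 1.0),
]


def build_sequences(rows):
    """Group interactions by student and order them by log_id."""
    by_user = defaultdict(list)
    for user_id, log_id, sequence_id, correct in rows:
        by_user[user_id].append((log_id, sequence_id, correct))
    sequences = {}
    for user_id, items in by_user.items():
        items.sort()  # tuples sort by log_id first (assumed chronological)
        # Assumption: threshold `correct` at 0.5 to obtain a binary label.
        sequences[user_id] = [
            (seq_id, 1 if c >= 0.5 else 0) for _, seq_id, c in items
        ]
    return sequences


def to_tl(sequence):
    """Render one student's sequence in the triple-line (tl) format."""
    exercises = ",".join(str(e) for e, _ in sequence)
    answers = ",".join(str(a) for _, a in sequence)
    return "\n".join([str(len(sequence)), exercises, answers])


sequences = build_sequences(rows)
for user_id, seq in sequences.items():
    print(to_tl(seq))
```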

Collected implementations:

https://github.com/seewoo5/KT

DKT (Deep Knowledge Tracing)

Paper: https://web.stanford.edu/~cpiech/bio/papers/deepKnowledgeTracing.pdf
Model: RNN, LSTM (only LSTM is implemented)
GitHub: https://github.com/chrispiech/DeepKnowledgeTracing (Lua)
Performances:

Dataset	ACC (%)	AUC (%)	Hyper Parameters
ASSISTments2009	77.02 ± 0.07	81.81 ± 0.10	input_dim=100, hidden_dim=100
ASSISTments2015	74.94 ± 0.04	72.94 ± 0.05	input_dim=100, hidden_dim=100
ASSISTmentsChall	68.67 ± 0.09	72.29 ± 0.06	input_dim=100, hidden_dim=100
STATICS	81.27 ± 0.06	82.87 ± 0.10	input_dim=100, hidden_dim=100
Junyi Academy	85.4	80.58	input_dim=100, hidden_dim=100
EdNet-KT1	72.72	76.99	input_dim=100, hidden_dim=100

All models are trained with batch size 2048 and sequence size 200.

DKVMN (Dynamic Key-Value Memory Network)

Paper: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p765.pdf
Model: Extension of Memory-Augmented Neural Network (MANN)
Github: https://github.com/jennyzhang0215/DKVMN (MxNet)
Performances:

Dataset	ACC (%)	AUC (%)	Hyper Parameters
ASSISTments2009	75.61 ± 0.21	79.56 ± 0.29	key_dim = 50, value_dim = 200, summary_dim = 50, concept_num = 20, batch_size = 1024
ASSISTments2015	74.71 ± 0.02	71.57 ± 0.08	key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048
ASSISTmentsChall	67.16 ± 0.05	67.38 ± 0.07	key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 20, batch_size = 2048
STATICS	80.66 ± 0.09	81.16 ± 0.08	key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 1024
Junyi Academy	85.04	79.68	key_dim = 50, value_dim = 100, summary_dim = 50, concept_num = 50, batch_size = 512
EdNet-KT1	72.32	76.48	key_dim = 100, value_dim = 100, summary_dim = 100, concept_num = 100, batch_size = 256

Due to memory issues, not all models are trained with batch size 2048.

NPA (Neural Pedagogical Agency)

Paper: https://arxiv.org/abs/1906.10910
Model: Bi-LSTM + Attention
Performances:

Dataset	ACC (%)	AUC (%)	Hyper Parameters
ASSISTments2009	77.11 ± 0.08	81.82 ± 0.13	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTments2015	75.02 ± 0.05	72.94 ± 0.08	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTmentsChall	69.34 ± 0.03	73.26 ± 0.03	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
STATICS	81.38 ± 0.14	83.1 ± 0.25	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
Junyi Academy	85.57	81.10	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
EdNet-KT1	73.05	77.58	input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200

All models are trained with batch size 2048 and sequence size 200.

SAKT (Self-Attentive Knowledge Tracing)

Paper: https://files.eric.ed.gov/fulltext/ED599186.pdf
Model: Transformer (1-layer, only encoder with subsequent mask)
Github: https://github.com/shalini1194/SAKT (Tensorflow)
Performances:

Dataset	ACC (%)	AUC (%)	Hyper Parameters
ASSISTments2009	76.36 ± 0.15	80.78 ± 0.10	hidden_dim=100, seq_size=100, batch_size=512
ASSISTments2015	74.57 ± 0.07	71.49 ± 0.03	hidden_dim=100, seq_size=50, batch_size=512
ASSISTmentsChall	67.53 ± 0.06	69.70 ± 0.32	hidden_dim=100, seq_size=200, batch_size=512
STATICS	80.45 ± 0.13	80.30 ± 0.31	hidden_dim=100, seq_size=500, batch_size=128
Junyi Academy	85.27	80.36	hidden_dim=100, seq_size=200, batch_size=512
EdNet-KT1	72.44	76.60	hidden_dim=200, seq_size=200, batch_size=512
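The "subsequent mask" in the SAKT model description keeps each position from attending to later positions, so a prediction never sees future interactions. The following is a generic pure-Python sketch of such a causal mask and a masked softmax. It is not code from the SAKT repository, and the function names are hypothetical.

```python
import math


def subsequent_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        top = max(s for s, m in zip(score_row, mask_row) if m)
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


mask = subsequent_mask(3)
scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform attention scores
weights = masked_softmax(scores, mask)
for row in weights:
    print(row)
```

With uniform scores, the first position puts all of its weight on itself, while the last position spreads its weight evenly over all three positions.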

https://github.com/bigdata-ustc/TKT

Knowledge Tracing models implemented with mxnet-gluon. For convenient downloading and preprocessing of knowledge tracing datasets, see EduData, which provides a handy API.

Visit https://base.ustc.edu.cn for more of our works.

Performance on Well-Known Datasets

Using EduData, we tested the models' performance. The AUC results are listed as follows:

model name	synthetic	assistment_2009_2010	junyi
DKT	0.6438748958881487	0.7442573465541942	0.8305416859735839
DKT+	0.8062221383790489	0.7483424087919035	0.8497422607539136
EmbedDKT	0.4858168704660636	0.7285572301977586	0.8194401881889697
EmbedDKT+	0.7340996181876187	0.7490900876356051	0.8405445812109871
DKVMN	TBA	TBA	TBA

The F1 scores are listed as follows:

model name	synthetic	assistment_2009_2010	junyi
DKT	0.5813237474584396	0.7134380508024369	0.7732850122818582
DKT+	0.7041804463370387	0.7137627713343819	0.7928075377114897
EmbedDKT	0.4716821311199386	0.7095025134079656	0.7681817174082963
EmbedDKT+	0.6316953625658291	0.7101790604990228	0.7903592922756097
DKVMN	TBA	TBA	TBA

The information of the benchmark datasets can be found in EduData docs.

In addition, all models are trained for 20 epochs with batch_size=16, and the best result is reported. We use Adam with learning_rate=1e-3 and apply bucketing to speed up training. Each sample is limited to a length of 200. The hyper-parameters are listed as follows:

model name	synthetic - 50	assistment_2009_2010 - 124	junyi - 835
DKT	`hidden_num=int(100);dropout=float(0.5)`	`hidden_num=int(200);dropout=float(0.5)`	`hidden_num=int(900);dropout=float(0.5)`
DKT+	`lr=float(0.2);lw1=float(0.001);lw2=float(10.0)`	`lr=float(0.1);lw1=float(0.003);lw2=float(3.0)`	`lr=float(0.01);lw1=float(0.001);lw2=float(1.0)`
EmbedDKT	`hidden_num=int(100);latent_dim=int(35);dropout=float(0.5)`	`hidden_num=int(200);latent_dim=int(75);dropout=float(0.5)`	`hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)`
EmbedDKT+	`lr=float(0.2);lw1=float(0.001);lw2=float(10.0)`	`lr=float(0.1);lw1=float(0.003);lw2=float(3.0)`	`lr=float(0.01);lw1=float(0.001);lw2=float(1.0)`
DKVMN	`hidden_num=int(50);key_embedding_dim=int(10);value_embedding_dim=int(10);key_memory_size=int(5);key_memory_state_dim=int(10);value_memory_size=int(5);value_memory_state_dim=int(10);dropout=float(0.5)`	`hidden_num=int(50);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(50);key_memory_state_dim=int(50);value_memory_size=int(50);value_memory_state_dim=int(200);dropout=float(0.5)`	`hidden_num=int(600);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(20);key_memory_state_dim=int(50);value_memory_size=int(20);value_memory_state_dim=int(200);dropout=float(0.5)`
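The length limit and bucketing mentioned in the training setup above can be illustrated with a small pure-Python sketch. It assumes that over-long sequences are simply truncated and that bucketing means batching sequences of similar length to reduce padding; neither detail is confirmed by the source, and this is not the mxnet-gluon code used in TKT. The names `limit_length` and `bucket_batches` are hypothetical.

```python
def limit_length(seq, max_len=200):
    """Keep at most max_len interactions (assumption: plain truncation)."""
    return seq[:max_len]


def bucket_batches(sequences, batch_size=16, pad_value=0):
    """Batch sequences of similar length and pad each batch to its longest."""
    # Sort by length so each batch holds sequences of similar length.
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        width = len(batch[-1])  # longest in this batch, since sorted ascending
        batches.append([s + [pad_value] * (width - len(s)) for s in batch])
    return batches


# Toy sequences of lengths 3, 250, 1 and 2 (values stand in for exercise IDs).
raw = [[1] * 3, [1] * 250, [1] * 1, [1] * 2]
limited = [limit_length(s) for s in raw]
batches = bucket_batches(limited, batch_size=2)
print([len(batch[0]) for batch in batches])
```

Without bucketing, one long sequence would force every short sequence in the same batch to be padded to its length.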

The number after the - in the header row is the number of knowledge units in the dataset. The datasets we used can either be found in basedata-ktbd or be downloaded with:

pip install EduData
edudata download ktbd

Tricks

DKT: hidden_num is usually set to the hundred nearest to ku_num
EmbedDKT: latent_dim is usually set to a value less than or equal to \sqrt(hidden_num * ku_num)
DKVMN: key_embedding_dim = key_memory_state_dim and value_embedding_dim = value_memory_state_dim
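These rules of thumb can be written as small helper functions. This is only a sketch with hypothetical names. One assumption is read off the hyper-parameter table rather than the text: the table pairs ku_num values of 50, 124 and 835 with hidden_num values of 100, 200 and 900, so the helper rounds up to the next hundred.

```python
import math


def suggest_hidden_num(ku_num):
    """Round ku_num up to the next multiple of 100 (assumption from the table)."""
    return math.ceil(ku_num / 100) * 100


def max_latent_dim(hidden_num, ku_num):
    """Largest integer latent_dim with latent_dim <= sqrt(hidden_num * ku_num)."""
    return math.isqrt(hidden_num * ku_num)


def dkvmn_dims_consistent(params):
    """Check the DKVMN rule: embedding dims equal the memory state dims."""
    return (params["key_embedding_dim"] == params["key_memory_state_dim"]
            and params["value_embedding_dim"] == params["value_memory_state_dim"])


for ku_num in (50, 124, 835):
    hidden_num = suggest_hidden_num(ku_num)
    print(ku_num, hidden_num, max_latent_dim(hidden_num, ku_num))

# DKVMN settings for assistment_2009_2010, copied from the table above.
print(dkvmn_dims_consistent({
    "key_embedding_dim": 50, "key_memory_state_dim": 50,
    "value_embedding_dim": 200, "value_memory_state_dim": 200,
}))
```

The latent_dim values in the table (35, 75 and 600) all fall below the corresponding bounds computed here.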

Notice

Some PyTorch interfaces may change between versions, for example:

import torch
torch.nn.functional.one_hot