Code Reproduction: Exercise-Aware Knowledge Tracing for Student Performance Prediction, Part 1: Data Preprocessing


Author: 胡歌爱亦菲


Category: Knowledge Tracing. Tags: deep learning, natural language processing, python

Article link: https://blog.csdn.net/A_ACM/article/details/120155754


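For my research I needed the code for this paper, but extensive searching turned up nothing, so as a beginner I decided to reproduce it myself.

This post focuses on the model details; the full code is on GitHub.

After reading the paper and several explanations of it, I have a working understanding of the approach.

First comes the exercise embedding, which breaks down into the following steps: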

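1. word2vec (the explainer I followed covers only the principles, with no derivations, and is clear; its author also wrote a hands-on word2vec training post and provides a dataset). word2vec converts each word w in exercise e_i into a pretrained word vector.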

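2. Text generation (this part mainly follows a text-generation tutorial). Why call the second step text generation? Because the paper does not state explicitly what the labels of the bidirectional LSTM are. My guess is that the label for word w1 is word w2, the label for w2 is w3, and so on. If that is right, the setup closely resembles text generation.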

3. Extracting the intermediate (hidden) layer and applying max-pooling to turn the 2-D matrix into a vector. We need to pull out the hidden-layer outputs and max-pool over them.
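The next-word labelling guessed in step 2 can be sketched in plain Python. This is a minimal illustration rather than the paper's method or this post's exact code: the function name `make_next_word_pairs` and the parameter `window` are hypothetical, and the string tokens stand in for the word vectors that would replace them in practice.

```python
def make_next_word_pairs(tokens, window=1):
    """Return (context, target) pairs where the target is the word after the context."""
    pairs = []
    for i in range(len(tokens) - window):
        context = tokens[i:i + window]   # `window` consecutive words
        target = tokens[i + window]      # the word that follows them
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs(["w1", "w2", "w3", "w4"], window=1)
for context, target in pairs:
    print(context, "->", target)
```

With `window=1` each word is paired with its successor, which is the w1 to w2, w2 to w3 scheme described in step 2; a sentence no longer than the window yields no pairs at all.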

1. word2vec code:

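### 2.2. Word2vec training

# Import assumed (the original post omits it): gensim's word2vec module
from gensim.models import word2vec


# Read the sentences in the file lazily with a generator.
# Suitable for large files, since nothing is loaded into memory at once.
class MySentences(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # Close the file when iteration ends; UTF-8 is an assumed encoding
        with open(self.fname, 'r', encoding='utf-8') as f:
            for line in f:
                yield line.split()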

# Train the model and export the word vectors
def w2vTrain(Config):
    sentences = MySentences(Config.poetry_file)
    w2v_model = word2vec.Word2Vec(sentences,
                                  min_count=Config.MIN_COUNT,
                                  workers=Config.CPU_NUM,
                                  vector_size=Config.VEC_SIZE,
                                  window=Config.CONTEXT_WINDOW
                                  )
    w2v_model.save(Config.ModelDir + Config.model_output)
    # print(w2v_model.wv.index_to_key)
    # Map each vocabulary word to its vector; tolist() yields plain Python floats,
    # so the saved text does not depend on how NumPy prints its scalar types
    word_vector_dict = {}
    for word in w2v_model.wv.index_to_key:
        word_vector_dict[word] = w2v_model.wv[word].tolist()
        # print(word_vector_dict[word])
    vector_file = "./ipynb_garbage_files/word_vector.txt"
    with open(vector_file, 'w', encoding='utf-8') as f:
        f.write(str(word_vector_dict))
        
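class Config(object):
    '''
    Model configuration: predefined hyperparameters, the corpus to load,
    and the names used when saving the model
    '''
    poetry_file = "./bioCorpus_5000.txt"
    model_output = "test_w2v_model"
    ModelDir = "./ipynb_garbage_files/"
    file_txt = "./ipynb_garbage_files/file_data.txt"

    MIN_COUNT = 4
    CPU_NUM = 2  # Cython must be installed beforehand for parallel training
    VEC_SIZE = 20
    CONTEXT_WINDOW = 5  # consider context words up to 5 positions from the target word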
In the output, each word maps to a 20-dimensional vector.
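Because the vectors are saved as the string form of a Python dict, they can be read back with `ast.literal_eval`, provided the values were written as plain Python floats. The sketch below assumes that file layout; the helper name `load_word_vectors` and the toy vectors are hypothetical, and a temporary file stands in for the real `word_vector.txt`.

```python
import ast
import os
import tempfile


def load_word_vectors(path):
    """Parse a file containing the str() of a {word: [floats]} dict."""
    with open(path, "r", encoding="utf-8") as f:
        return ast.literal_eval(f.read())


# Toy stand-in for the real word_vector.txt (the values are made up).
toy_vectors = {"cell": [0.1, -0.2, 0.3], "gene": [0.0, 0.5, -0.5]}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "word_vector.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write(str(toy_vectors))
    loaded = load_word_vectors(toy_path)

print(loaded["cell"])
```

For anything beyond inspection, reloading the saved model with `Word2Vec.load`, as the later code in this post does, is probably the more robust route, since it does not depend on how the dict was printed.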

2. Text generation code:

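# Imports assumed (the original post omits them); tensorflow.keras is one option
import numpy as np
from gensim.models import word2vec
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Flatten
from tensorflow.keras.models import Model, Sequential

# NOTE: `corpus` (an iterable of tokenized sentences) and `sw_steps` (the
# sliding-window length) are not defined in this excerpt and must exist at module level.


def preprocess_data():
    # Filter the text data once against the word2vec vocabulary
    w2v_model = word2vec.Word2Vec.load(Config.ModelDir + Config.model_output)

    # Counter used to limit the number of loop iterations while testing
    test_num=0
    with open(Config.file_txt, 'w') as f:
        for sent in corpus:
            if test_num!=5001:
                # New data after word2vec filtering
                text_stream = []  # filtered words
                text_len = 0
                for word in sent:
                    if word in w2v_model.wv.key_to_index:  # is the word in the word2vec vocabulary? (dict lookup)
                        text_stream.append(word)
                        text_len += 1
                if text_len < 4: continue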
                # Build the dataset

                # Training data
                x = []
                y = []
                for i in range(0, len(text_stream) - sw_steps):
                    given = text_stream[i:i+sw_steps]  # window slides by 1; each window is paired with the word that follows it
                    predict = text_stream[i + sw_steps]
                    x.append(w2v_model.wv[given].tolist())
                    y.append(w2v_model.wv[predict].tolist())
                x = np.array(x)
                y = np.array(y)
                # print("!!!!!!!!!!!!")
                # print(y)

                # Build the model
                model = Sequential()
                # model.add(Embedding(3800,32,input_length=380))
                # model.add(Dropout(0.5))
                model.add(Bidirectional(LSTM(40, input_shape=(x.shape[1], x.shape[2]), return_sequences=True),
                                        merge_mode='concat'))  # bidirectional LSTM layer
                model.add(Dropout(0.5))  # Dropout layer
                model.add(Flatten())  # Flatten layer
                model.add(Dense(Config.VEC_SIZE, activation='sigmoid'))  # Dense layer

                # Train the model
                # Monitor val_loss: with loss="mse" this is a regression, and the
                # 'val_acc' key is not reported by recent Keras versions
                es = EarlyStopping(monitor='val_loss', patience=5)
                model.compile(loss="mse", optimizer="adam", metrics=['accuracy'])

                batch_size = 64
                epochs = 20

                model.fit(x, y,
                          validation_split=0.1,
                          batch_size=batch_size,
                          epochs=epochs,
                          callbacks=[es],
                          shuffle=True)  # train the model
                # The hidden-layer extraction and max-pooling code follows below

3. Hidden-layer extraction and max-pooling code:

                # Get the model's hidden-layer states and max-pool them; the result is the exercise representation
                layer_model = Model(inputs=model.input, outputs=model.layers[0].output)  # output of the intermediate layer
                feature = layer_model.predict(x)

                # Max-pool the hidden states over windows (axis 0) and time steps (axis 1),
                # leaving one value per hidden unit; to be combined with the answer results later
                pooled = feature.max(axis=(0, 1))
                f.write(' '.join(str(v) for v in pooled) + '\n')
            test_num+=1

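Why is layer 0 the intermediate (hidden) layer? The model has four layers (Bidirectional LSTM, Dropout, Flatten, Dense), so `model.layers[0]` is the bidirectional LSTM.
To output the result of the Flatten layer instead, treat the first three layers as a new model and take its output. Since I only want the LSTM output, I treat the first layer alone as the model and take its output.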

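4. Final result

Each exercise has gone from its original Chinese text to an 80-dimensional vector: the forward and backward LSTMs have 40 units each, and their outputs are concatenated. Because the data is just an arbitrary text I found, there is no student answer data, so I will generate that randomly.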
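Where the 80 dimensions come from can be illustrated with NumPy alone. This is a simplified sketch, not the post's model: random arrays stand in for real LSTM outputs, `NUM_STEPS` is a hypothetical sentence length, and pooling runs over a single sequence of time steps rather than over all sliding windows.

```python
import numpy as np

HIDDEN_UNITS = 40   # LSTM units per direction, as in the model above
NUM_STEPS = 6       # hypothetical number of words in one exercise

rng = np.random.default_rng(0)
forward_states = rng.normal(size=(NUM_STEPS, HIDDEN_UNITS))    # stand-in for forward LSTM outputs
backward_states = rng.normal(size=(NUM_STEPS, HIDDEN_UNITS))   # stand-in for backward LSTM outputs

# merge_mode='concat': each time step gets a 2 * 40 = 80-dimensional state
hidden_states = np.concatenate([forward_states, backward_states], axis=1)

# Element-wise max-pooling over time steps collapses the matrix to one vector
exercise_vector = hidden_states.max(axis=0)

print(hidden_states.shape, exercise_vector.shape)
```

Each entry of `exercise_vector` is the largest value that hidden unit took at any time step, so the vector length depends only on the number of hidden units and not on the length of the exercise text.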

The next post covers the student embedding.
