Deep Learning and Natural Language Processing: Assignment 3

Shezzaaaa

Article link: https://blog.csdn.net/qq_41537299/article/details/116203394

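Task: use a topic model for unsupervised learning on the given corpus and learn the topic distribution. Randomly select K novels from the corpus; from each novel, randomly draw M paragraphs as training data and N paragraphs as test data. Then use the topic model together with another classifier to predict which novel a given paragraph comes from. K must be at least 3.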

Principles

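The LDA model
In the LDA model, a document is generated as follows:
Sample the topic distribution of document i from a Dirichlet distribution.
Sample the topic of the j-th word of document i from that multinomial topic distribution.
Sample the word distribution of that topic from a Dirichlet distribution.
Sample the final word from that multinomial word distribution.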
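
The four steps above can be simulated in a few lines of NumPy. This is only an illustrative sketch and not part of the experiment: the number of topics, the vocabulary size, and the Dirichlet hyperparameters alpha and beta are made-up values, and generate_document is a hypothetical helper name.

```python
import numpy as np

def generate_document(n_words, alpha, topic_word, rng):
    """Sample one document from the LDA generative process."""
    # Topic distribution of this document, drawn from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    n_topics, vocab_size = topic_word.shape
    topics, words = [], []
    for _ in range(n_words):
        # Topic of the j-th word, drawn from the document's topic distribution.
        z = rng.choice(n_topics, p=theta)
        # The word itself, drawn from the word distribution of topic z.
        w = rng.choice(vocab_size, p=topic_word[z])
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

rng = np.random.default_rng(0)
n_topics, vocab_size = 3, 8        # hypothetical sizes
alpha = np.full(n_topics, 0.5)     # hypothetical document-topic prior
beta = np.full(vocab_size, 0.5)    # hypothetical topic-word prior
# Word distribution of every topic, each drawn from a Dirichlet prior.
topic_word = rng.dirichlet(beta, size=n_topics)

theta, topics, words = generate_document(20, alpha, topic_word, rng)
```

Fitting an LDA model runs this process in reverse: given only the observed words, it estimates the document-topic and topic-word distributions.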
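LDA topic modelling with sklearn
This experiment uses the sklearn machine-learning library in Python to build the LDA topic model.
The main parameters of the scikit-learn LDA topic model used in this experiment are:
(1) n_components (called n_topics in older scikit-learn releases): the number of latent topics K, which has to be tuned. The right K depends on how fine a topic split is needed. For a coarse split, such as telling animals, plants, and non-living things apart, K can be very small (a single digit). For a fine split, such as distinguishing individual kinds of animals, plants, and non-living things, K has to be very large, in the thousands or tens of thousands.
(2) max_iter: the maximum number of iterations of the EM-style (variational Bayes) fitting procedure, that is, passes over the training data.
(3) random_state: the random seed. If it is left unset, the random numbers differ on every run; a fixed integer such as 0 makes the results reproducible.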
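
A minimal sketch of how these three parameters are passed, on a tiny made-up count matrix rather than the novel corpus (the matrix, the choice of 2 topics, and the helper fit_doc_topics are assumptions for illustration). With the same integer random_state, two separate fits should return the same document-topic matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term counts: 4 documents, 6 vocabulary terms (made up).
X = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 1, 0],
    [0, 0, 1, 5, 4, 3],
    [0, 1, 0, 4, 5, 4],
])

def fit_doc_topics(seed):
    """Fit a 2-topic LDA model and return the document-topic matrix."""
    lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=seed)
    return lda.fit_transform(X)

doc_topics_a = fit_doc_topics(0)
doc_topics_b = fit_doc_topics(0)   # same seed, so the same result is expected
```

Each row of the returned matrix is a probability distribution over the topics, so it sums to 1.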

Procedure

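Corpus preprocessing:
Read each file, randomly pick passages from it as the test set, and use the remaining text as the training set. Remove line breaks and other whitespace as well as all non-Chinese characters, and collect the results in dict objects.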

import os
import re

import numpy as np


def clean_line(line):
    # Remove whitespace (including line breaks), then every character outside U+4E00-U+9FA5.
    line = re.sub(r'\s', '', line)
    line = re.sub(r'[\u0000-\u4DFF]', '', line)
    return re.sub(r'[\u9FA6-\uFFFF]', '', line)


def read_data():
    path = "./data"
    # Raw string, so the backslashes in the Windows path are not read as escape sequences.
    catalog = r"E:\文档\第二学期学校文件\coursework\coursework1\jyxstxtqj_downcc.com\inf.txt"
    with open(catalog, "r", encoding='utf-8') as f:
        # The first line of the catalog is a comma-separated list of novel names.
        all_files = f.readline().strip().split(",")
        print(all_files)

    train_files_dict = dict()
    test_files_dict = dict()

    test_num = 10      # number of test passages per novel
    test_length = 20   # number of consecutive lines in each test passage

    for name in all_files:
        # 'ANSI' resolves to the system code page and is only accepted on Windows.
        with open(os.path.join(path, name + ".txt"), "r", encoding='ANSI') as f:
            file_read = f.readlines()

        train_num = len(file_read) - test_num
        # Random permutation of the line indices: the first train_num go to training, the rest to testing.
        choice_index = np.random.choice(len(file_read), test_num + train_num, replace=False)

        # One long training string per novel.
        train_files_dict[name] = "".join(clean_line(file_read[i]) for i in choice_index[:train_num])

        for test in choice_index[train_num:]:
            # Skip start lines that are too close to the end of the file.
            if test + test_length >= len(file_read):
                continue
            test_line = "".join(clean_line(file_read[i]) for i in range(test, test + test_length))
            test_files_dict.setdefault(name, []).append(test_line)
    return train_files_dict, test_files_dict
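
To see what the three regular-expression substitutions in read_data do, the sketch below applies them to a made-up line (the sample string and the helper name keep_chinese are assumptions, not taken from the corpus). Only characters in the range U+4E00 to U+9FA5 survive, so digits, Latin letters, and punctuation, including full-width Chinese punctuation, are removed.

```python
import re

def keep_chinese(line):
    """Apply the same three substitutions used in read_data."""
    line = re.sub(r'\s', '', line)                # drop whitespace and line breaks
    line = re.sub(r'[\u0000-\u4DFF]', '', line)   # drop everything below U+4E00
    line = re.sub(r'[\u9FA6-\uFFFF]', '', line)   # drop everything above U+9FA5
    return line

sample = "第1回 abc，风雪惊变！\n"
cleaned = keep_chinese(sample)
print(cleaned)  # 第回风雪惊变
```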

Segment the corpus into words with jieba:

import jieba

train_texts_dict, test_texts_dict = read_data()
train_terms_list = []
train_terms_dict = dict()
name_list = []

for name, text in train_texts_dict.items():
    # Precise mode (cut_all=False); tokens are joined with spaces for CountVectorizer.
    terms_string = " ".join(jieba.cut(text, cut_all=False))
    train_terms_dict[name] = terms_string
    train_terms_list.append(terms_string)
    name_list.append(name)
    print("finished segmenting the training text of " + name)

test_terms_dict = dict()
for name, text_list in test_texts_dict.items():
    for text in text_list:
        terms_string = " ".join(jieba.cut(text, cut_all=False))  # precise mode
        test_terms_dict.setdefault(name, []).append(terms_string)
    print("finished segmenting the test text of " + name)

Because LDA works on word counts, the segmented text has to be converted into term-frequency vectors:

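from sklearn.feature_extraction.text import CountVectorizer
cnt_vector = CountVectorizer(max_features=500)  # keep only the 500 most frequent terms
cnt_tf_train = cnt_vector.fit_transform(train_terms_list)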
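
A small sketch of what CountVectorizer does with space-separated jieba output, using two made-up segmented strings (assumptions, not corpus text). Note that the default token_pattern only keeps tokens of two or more characters, so single-character Chinese words are dropped unless the pattern is changed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, already segmented and joined with spaces.
docs = [
    "郭靖 和 黄蓉 来到 桃花岛",
    "黄蓉 笑 道 郭靖 哥哥",
]

vectorizer = CountVectorizer(max_features=500)
counts = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary
vocab = sorted(vectorizer.vocabulary_)    # terms that made it into the vocabulary

print(vocab)
print(counts.toarray())
```

If single-character words matter for the analysis, a custom token_pattern can be passed to CountVectorizer; whether that helps on this corpus was not tested here.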

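The output is the term-frequency vector of every document. The number of topics is set to 20 and the maximum number of iterations to 3000. The code is as follows: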

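from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=20,  # number of topics (the output below has 20 columns)
                                max_iter=3000,    # maximum number of iterations
                                random_state=0)   # fixed random seed
# Each row of target is the topic distribution of one training document.
target = lda.fit_transform(cnt_tf_train)
print("target: ", target)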
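
The task also asks for classifying held-out paragraphs, a step the code above does not show. One possible approach, sketched here on made-up toy strings standing in for the segmented novels, is to project the test paragraphs into the fitted topic space with transform and assign each one to the training document whose topic distribution is closest. The toy texts, the label names, and the nearest-neighbour rule are all assumptions for illustration, not the author's method.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for name_list, train_terms_list, and the test paragraphs.
name_list = ["book_a", "book_b", "book_c"]
train_terms_list = [
    "sword master sword duel mountain sword master",
    "palace emperor court minister palace emperor",
    "desert horse archer horse desert eagle archer",
]
test_paragraphs = ["sword duel master", "emperor palace court", "horse desert eagle"]

cnt_vector = CountVectorizer(max_features=500)
train_counts = cnt_vector.fit_transform(train_terms_list)

lda = LatentDirichletAllocation(n_components=3, max_iter=100, random_state=0)
train_topics = lda.fit_transform(train_counts)

# Reuse the fitted vocabulary and topics; the models are never refitted on test data.
test_topics = lda.transform(cnt_vector.transform(test_paragraphs))

# Nearest training document in topic space (Euclidean distance).
dists = np.linalg.norm(test_topics[:, None, :] - train_topics[None, :, :], axis=2)
predicted = [name_list[i] for i in dists.argmin(axis=1)]
print(predicted)
```

Any other classifier that accepts the topic vectors as features could take the place of the nearest-neighbour rule.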

Results

target: [[9.32286039e-01 9.14411139e-06 9.14411139e-06 9.14411119e-06
9.14411119e-06 9.14411119e-06 9.14411139e-06 9.14411119e-06
9.14411140e-06 9.14411139e-06 9.14411119e-06 6.75493669e-02
9.14411140e-06 9.14411140e-06 9.14411119e-06 9.14411119e-06
9.14411138e-06 9.14411119e-06 9.14411139e-06 9.14411140e-06] This shows that document 1 most likely belongs to topic 1.
[8.64594129e-04 1.21232697e-06 1.21232697e-06 1.21232694e-06
1.21232694e-06 1.21232694e-06 1.21232697e-06 1.21232694e-06
9.98666889e-01 1.21232697e-06 1.21232694e-06 4.47906876e-04
1.21232697e-06 1.21232697e-06 1.21232694e-06 1.21232694e-06
1.21232697e-06 1.21232694e-06 1.21232697e-06 1.21232697e-06] Document 2 most likely belongs to topic 9.
[2.72406746e-03 1.25269332e-06 1.25269332e-06 1.25269329e-06
1.25269329e-06 1.25269329e-06 1.25269332e-06 1.25269329e-06
1.25269332e-06 1.25269332e-06 1.25269329e-06 7.77215134e-04
4.17995737e-03 1.25269332e-06 1.25269329e-06 1.25269329e-06
1.25269332e-06 1.25269329e-06 1.25269332e-06 9.92298717e-01] Document 3 most likely belongs to topic 20.
[9.99954676e-01 2.38549624e-06 2.38549623e-06 2.38549618e-06
2.38549618e-06 2.38549618e-06 2.38549624e-06 2.38549618e-06
2.38549624e-06 2.38549623e-06 2.38549618e-06 2.38549624e-06
2.38549624e-06 2.38549667e-06 2.38549618e-06 2.38549618e-06
2.38549624e-06 2.38549618e-06 2.38549624e-06 2.38549624e-06] Document 4 most likely belongs to topic 1.
[1.10450900e-03 4.31618666e-07 9.98623120e-01 4.31618656e-07
4.31618656e-07 4.31618656e-07 4.31618666e-07 4.31618656e-07
1.70096485e-05 4.31618666e-07 4.31618656e-07 2.48455533e-04
4.31618666e-07 4.31618667e-07 4.31618656e-07 4.31618656e-07
4.31618667e-07 4.31618656e-07 4.31618666e-07 4.31618666e-07] Document 5 most likely belongs to topic 3.
[8.36934486e-02 2.14961312e-05 4.37053996e-02 2.14961307e-05
2.14961307e-05 2.14961307e-05 3.66118436e-03 2.14961307e-05
8.20073883e-01 2.14961312e-05 2.14961307e-05 2.07435604e-02
2.14961312e-05 1.39571204e-02 2.14961307e-05 2.14961307e-05
2.14961313e-05 2.14961307e-05 2.14961312e-05 1.38859538e-02] Document 6 most likely belongs to topic 9.
[1.61972699e-03 9.97945371e-01 6.43566920e-07 6.43566905e-07
6.43566905e-07 6.43566905e-07 6.43566920e-07 6.43566905e-07
6.43566920e-07 6.43566921e-07 6.43566905e-07 4.23960970e-04
6.43566920e-07 6.43566920e-07 6.43566905e-07 6.43566905e-07
6.43566920e-07 6.43566905e-07 6.43566920e-07 6.43566920e-07] Document 7 most likely belongs to topic 2.
[1.03358122e-03 1.65970796e-01 5.74659814e-07 5.74659801e-07
5.74659801e-07 5.74659801e-07 5.74659815e-07 5.74659801e-07
5.74659815e-07 8.31812207e-01 5.74659801e-07 1.17422070e-03
5.74659814e-07 5.74659815e-07 5.74659801e-07 5.74659801e-07
5.74659815e-07 5.74659801e-07 5.74659814e-07 5.74659815e-07] Document 8 most likely belongs to topic 10.
[9.63704013e-04 1.18396441e-06 1.18396441e-06 1.18396439e-06
1.18396439e-06 1.18396439e-06 1.18396441e-06 1.18396439e-06
1.18396441e-06 1.18396441e-06 1.18396439e-06 6.05326798e-04
9.98410842e-01 1.18396442e-06 1.18396439e-06 1.18396439e-06
1.18396441e-06 1.18396439e-06 1.18396441e-06 1.18396441e-06] Document 9 most likely belongs to topic 13.
[3.09426530e-03 4.61450441e-07 4.61450441e-07 4.61450431e-07
4.61450431e-07 4.61450431e-07 9.96063366e-01 4.61450431e-07
1.49302181e-04 4.61450441e-07 4.61450431e-07 4.79969008e-04
4.61450441e-07 1.53266389e-04 4.61450431e-07 4.61450431e-07
4.61450441e-07 4.61450431e-07 4.61450441e-07 5.33705792e-05] Document 10 most likely belongs to topic 7.
[3.28970058e-03 1.41282851e-06 1.41282851e-06 1.41282848e-06
1.41282848e-06 1.41282848e-06 1.41282852e-06 1.41282848e-06
1.41282851e-06 1.41282851e-06 1.41282848e-06 1.37075272e-04
1.41282851e-06 9.96549206e-01 1.41282848e-06 1.41282848e-06
1.41282851e-06 1.41282848e-06 1.41282851e-06 1.41282852e-06] Document 11 most likely belongs to topic 14.
[1.73857597e-03 5.03109227e-07 5.03109226e-07 5.03109215e-07
5.03109215e-07 5.03109215e-07 5.03109227e-07 5.03109215e-07
5.03109226e-07 5.03109226e-07 5.03109215e-07 5.27873523e-05
5.03109226e-07 3.06365829e-05 5.03109215e-07 5.03109215e-07
9.98169950e-01 5.03109215e-07 5.03109226e-07 5.03109227e-07] Document 12 most likely belongs to topic 17.
[3.05150115e-01 4.45751994e-06 6.66448821e-04 4.45751984e-06
4.45751984e-06 4.45751984e-06 8.50225638e-04 4.45751984e-06
4.45751994e-06 4.45751994e-06 4.45751984e-06 2.20764374e-01
4.45751994e-06 4.63942036e-04 4.45751984e-06 4.45751984e-06
4.45751994e-06 4.45751984e-06 4.45751994e-06 4.72042489e-01] Document 13 most likely belongs to topic 20, although with a probability of only about 0.47.
[1.67677645e-03 1.38370835e-04 5.74653210e-07 5.74653197e-07
5.74653197e-07 5.74653197e-07 5.74653210e-07 5.74653197e-07
5.74653210e-07 2.70924194e-04 5.74653197e-07 5.75194891e-04
5.74653210e-07 3.14435026e-05 5.74653197e-07 5.74653197e-07
5.74653210e-07 5.74653197e-07 9.97263987e-01 3.58328477e-05] Document 14 most likely belongs to topic 19.
[3.86665265e-01 1.81752094e-05 1.81752094e-05 1.81752090e-05
1.81752090e-05 1.81752090e-05 1.81752094e-05 1.81752090e-05
1.81752094e-05 1.81752094e-05 1.81752090e-05 4.50815078e-01
1.81752094e-05 1.25743269e-03 1.81752090e-05 1.81752090e-05
1.81752094e-05 1.81752090e-05 1.81752094e-05 1.60971421e-01] Document 15 most likely belongs to topic 12, although with a probability of only about 0.45.
[6.73588881e-01 4.18060210e-05 4.18060210e-05 4.18060201e-05
4.18060201e-05 4.18060201e-05 4.18060211e-05 4.18060201e-05
6.76059584e-02 4.18060211e-05 4.18060201e-05 4.18060212e-05
3.48345868e-02 4.18060211e-05 4.18060201e-05 4.18060201e-05
4.18060211e-05 4.18060201e-05 4.18060211e-05 2.23301678e-01]] Document 16 most likely belongs to topic 1.
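
The per-document annotations above can also be read off programmatically by taking the index of the largest entry in each row. A minimal sketch, using a made-up 3x4 matrix in place of target (the values are assumptions for illustration only):

```python
import numpy as np

# Stand-in for the document-topic matrix returned by fit_transform.
target = np.array([
    [0.93, 0.01, 0.01, 0.05],
    [0.02, 0.02, 0.95, 0.01],
    [0.30, 0.22, 0.01, 0.47],
])

dominant = target.argmax(axis=1) + 1   # 1-based topic numbers, as in the text
confidence = target.max(axis=1)        # probability of the dominant topic

for doc_id, (topic, prob) in enumerate(zip(dominant, confidence), start=1):
    print(f"document {doc_id}: topic {topic} (p = {prob:.2f})")
```

Printing the probability next to the topic number makes weak assignments, such as the third row here, easy to spot.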