Natural Language Processing, Experiment 1: Language Models

在下不狂

Last modified: 2024-03-16 23:14:13

First published: 2024-03-16 22:51:15

Article link: https://blog.csdn.net/m0_62131485/article/details/136771890

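Objective:

Understand and practice language modeling.

Tasks:

1. Implement unigram and bigram language models in Python, with smoothing.
2. Compute the perplexity (PPL) of the sentences in test.txt and compare the unigram and bigram language models.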

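Problems encountered and how they were solved:

Problem 1: What preprocessing does the data need?
Solution 1: After consulting references, I settled on simple preprocessing: convert all English text to lowercase, remove all punctuation except the apostrophe ('), and tokenize with nltk.tokenize.word_tokenize, as the assignment requires.

Problem 2: Punctuation is part of the text. Should it be removed?
Solution 2: Punctuation carries little semantic information, and removing it reduces noise and speeds up processing, so it is removed.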

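Problem 3: How exactly is add-one smoothing applied? Is the vocabulary built from the training set alone, with every word's count incremented by one, or are the out-of-vocabulary (OOV) words of the test set added to the vocabulary first and the counts incremented afterwards?
Solution 3: The vocabulary is built from the training set alone. Each word's count is incremented by one and its probability is computed from the smoothed count. OOV words in the test set are assigned a small probability, $\frac{1}{N+V}$, where $N$ is the total number of tokens in the training set and $V$ is the vocabulary size; this is the add-one formula with a count of 0. Note that the smoothed probabilities of the $V$ training words already sum to 1, so this adds a small amount of extra mass. Reserving one slot for unknown words (denominator $N+V+1$) would keep the distribution normalized.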
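
To make the numbers concrete, here is a minimal sketch of Solution 3 on a toy corpus. The word list and the helper name `unigram_prob` are made up for illustration and are not part of the experiment's data or code.

```python
from collections import Counter

# Toy training corpus (hypothetical data, for illustration only)
train_words = ["the", "cat", "sat", "on", "the", "mat"]

vocab = Counter(train_words)   # word -> count
N = len(train_words)           # total tokens in the training data: 6
V = len(vocab)                 # distinct words: 5


def unigram_prob(word):
    """Add-one smoothed probability; a Counter returns 0 for unseen words."""
    return (vocab[word] + 1) / (N + V)


p_the = unigram_prob("the")    # (2 + 1) / (6 + 5)
p_dog = unigram_prob("dog")    # (0 + 1) / (6 + 5), an out-of-vocabulary word
seen_mass = sum(unigram_prob(w) for w in vocab)  # mass of the seen words only

print(p_the, p_dog, seen_mass)
```

The seen words alone already account for all of the probability mass, so the OOV probability comes on top of it. For a lab exercise this is usually tolerable, but it is worth keeping in mind when comparing perplexities across models.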

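Problem 4: Multiplying many word probabilities to get a sentence probability causes floating-point underflow.
Solution 4: Replace the product with a sum of logarithms. Perplexity is computed as $PP=2^{-\frac{1}{N}\sum_{i=1}^{N}\log_{2}P(w_{i})}$, where $P(w_{i})$ is the probability of a word, $w_{i}$ is the $i$-th word of the sentence, and $N$ is the sentence length.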
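
The underflow in Problem 4 is easy to reproduce. The sketch below uses made-up per-word probabilities (100 words at 1e-5 each) rather than values from the trained model.

```python
from math import log2

# Hypothetical per-word probabilities for one long sentence
probs = [1e-5] * 100

# Direct product: 1e-500 is far below the smallest positive double
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the intermediate value in a comfortable range
log_sum = sum(log2(p) for p in probs)
perplexity = 2 ** (-log_sum / len(probs))

print(product)      # the product has collapsed to 0.0
print(perplexity)   # approximately 1e5
```

Once the product reaches 0.0 its logarithm is undefined, so perplexity cannot be recovered from it, while the log-sum version gives the expected value.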

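Problem 5: How should perplexity be evaluated for a text that contains several sentences?
Solution 5: This experiment reports the mean perplexity of the text, i.e. the average of the per-sentence perplexities. A common alternative is corpus-level perplexity, which sums the log probabilities of all tokens in the text and normalizes by the total token count; the two values generally differ.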
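
The difference between the two ways of aggregating perplexity can be seen in a small sketch. The probabilities below are invented, and `sentence_ppl` is a hypothetical helper, not a function from the experiment code.

```python
from math import log2

# Hypothetical per-word probabilities for two test sentences
sentences = [
    [0.5, 0.5],                      # short, easy sentence
    [0.125, 0.125, 0.125, 0.125],    # longer, harder sentence
]


def sentence_ppl(word_probs):
    """Perplexity of one sentence: 2^(-mean log2 probability)."""
    return 2 ** (-sum(log2(p) for p in word_probs) / len(word_probs))


# Method used in this article: mean of the per-sentence perplexities
mean_ppl = sum(sentence_ppl(s) for s in sentences) / len(sentences)

# Alternative: pool all tokens, then exponentiate once
total_log = sum(log2(p) for s in sentences for p in s)
total_tokens = sum(len(s) for s in sentences)
corpus_ppl = 2 ** (-total_log / total_tokens)

print(mean_ppl, corpus_ppl)
```

The sentence average weights every sentence equally, while the corpus-level figure weights every token equally, so longer sentences influence it more.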

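Problem 6: How is add-one smoothing applied to the bigram model?
Solution 6: The smoothed bigram probability is $P(w_{i}|w_{i-1})=\frac{C(w_{i-1},w_{i})+1}{C(w_{i-1})+V}$, where $C(\cdot)$ is the number of times a word or word pair occurs in the training data and $V$ is the number of words in the vocabulary. If the preceding word $w_{i-1}$ does not occur in the training set, use $\frac{1}{V}$ (one over the number of distinct words that can serve as a preceding word), which is what the formula gives when both counts are zero.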
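
As a worked example of the formula in Solution 6, the sketch below applies it to two invented training sentences. The data and the helper name `bigram_prob` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Toy training sentences with boundary markers (hypothetical data)
train = [
    ["<beg>", "i", "like", "tea", "</end>"],
    ["<beg>", "i", "like", "coffee", "</end>"],
]

unigram_counts = Counter(w for s in train for w in s)
bigram_counts = defaultdict(Counter)
for s in train:
    for prev, nxt in zip(s, s[1:]):
        bigram_counts[prev][nxt] += 1

V = len(unigram_counts)  # 6 distinct tokens, markers included


def bigram_prob(prev, nxt):
    """Add-one smoothed P(nxt | prev); unseen counts are treated as 0."""
    pair = bigram_counts[prev][nxt] if prev in bigram_counts else 0
    return (pair + 1) / (unigram_counts[prev] + V)


p_seen = bigram_prob("i", "like")          # (2 + 1) / (2 + 6)
p_unseen_pair = bigram_prob("like", "milk")  # (0 + 1) / (2 + 6)
p_unseen_prev = bigram_prob("you", "like")   # (0 + 1) / (0 + 6), i.e. 1 / V

print(p_seen, p_unseen_pair, p_unseen_prev)
```

All three cases discussed in Solution 6 (seen pair, unseen pair after a known word, unknown preceding word) fall out of the same expression once missing counts default to zero.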

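Procedure:

I. uni_gram:

1. Data preprocessing. Remove punctuation (except the apostrophe, which is often part of an English word), convert every sentence to lowercase, and split the sentences into words. The input is the training-set text and the output is a list of words.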

# Preprocess the text into a flat list of words
def preprocess_text(text):
    sentences = text.split("__eou__")  # split into sentences
    sentences.pop()  # drop the fragment after the last "__eou__"
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()  # strip punctuation, lowercase
        words += word_tokenize(sentence)  # tokenize
    return words
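
To see what the splitting and the regular expression do, here is a small sketch on an invented two-utterance string. It uses `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK; that substitution is an assumption of this example, not what the experiment code does.

```python
import re

# Hypothetical raw text in the "__eou__"-delimited format
raw = "Hello , how are you ? __eou__ I'm fine , thanks ! __eou__"

utterances = raw.split("__eou__")
utterances.pop()  # drop the empty fragment after the final "__eou__"

# Keep word characters, whitespace and apostrophes; drop everything else
cleaned = [re.sub(r"[^\w\s']", "", u).lower() for u in utterances]
tokens = [u.split() for u in cleaned]  # stand-in for word_tokenize

print(tokens)
```

With the real `word_tokenize`, contractions are typically split further (for example `i'm` into `i` and `'m`), so the token lists in the experiment will differ slightly from this whitespace-based version.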

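2. Build the vocabulary: count the occurrences of each distinct word in the word list. The output is a dictionary whose entries have the form {word: count}.

# Build the vocabulary; vocab is a dictionary with entries of the form {word: count}
def build_vocab(words):
    vocab = Counter(words)
    return vocab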

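3. Compute probabilities with add-one smoothing: for every word in the vocabulary, compute its probability in the corpus. With add-one smoothing each word's count is incremented by one, so the probability is (count of the word + 1) / (total number of words in the corpus + number of words in the vocabulary).

# Compute unigram probabilities (add-one smoothing)
def calculate_unigram_probs(vocab, total_words):
    unigram_probs = {}
    for word, count in vocab.items():
        unigram_probs[word] = (count + 1) / (total_words + len(vocab))
    return unigram_probs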

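4. Process the test text. Remove punctuation (except the apostrophe), convert every sentence to lowercase, and split the sentences into words. The input is the test-set text and the output is a two-dimensional list: the words of each sentence are stored in a list, and the sentence lists are stored in the text list.

# Preprocess the text into a 2-D list: text holds sentence lists, each sentence holds its words
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()  # drop the fragment after the last "__eou__"
    text = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        text.append(word_tokenize(sentence))
    return text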

5. Compute perplexity. The perplexity formula is given in Solution 4 and the OOV probability in Solution 3.

# Compute the perplexity of each sentence
def sentence_perplexity(text, unigram_probs, vocab, total_words):
    perplexity = []
    for sentence in text:
        prob = 0  # running sum of log2 probabilities
        for word in sentence:
            if word in unigram_probs:
                prob += log2(unigram_probs[word])
            else:
                prob += log2(1 / (len(vocab) + total_words))  # probability of an unknown word
        perplexity.append(pow(2, -(prob / len(sentence))))
    return perplexity

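6. Evaluate the perplexity of the text, using the method in Solution 5.

# Evaluate the perplexity of the whole text (mean of the sentence perplexities)
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)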

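II. bi_gram:
1. Preprocess the data and build the vocabulary, adding "<beg>" in front of every sentence and "</end>" after it.
2. Convert the training set into a two-dimensional list and count bigram frequencies. The output is a nested dictionary: the first-level key is the preceding word and the second-level key is the following word.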

# Count bigram frequencies
def calculate_bigram(text):
    bigram_counts = defaultdict(dict)
    for sentence in text:
        for i in range(len(sentence) - 1):
            if sentence[i + 1] not in bigram_counts[sentence[i]]:
                bigram_counts[sentence[i]][sentence[i + 1]] = 1
            else:
                bigram_counts[sentence[i]][sentence[i + 1]] += 1
    return bigram_counts
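
The same nested counts can be written more compactly with `defaultdict(Counter)` and `zip`. This is an alternative sketch, not the code used in the experiment; the function name `count_bigrams` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict


def count_bigrams(text):
    """Return {prev_word: Counter({next_word: count})} for tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in text:
        # Pair every word with its successor
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts


sample = [["<beg>", "a", "b", "</end>"], ["<beg>", "a", "c", "</end>"]]
counts = count_bigrams(sample)

# A Counter returns 0 for pairs that were never seen
print(counts["<beg>"]["a"], counts["a"]["b"], counts["a"]["z"])
```

Because a `Counter` yields 0 for missing keys, the explicit membership test in the inner loop is no longer needed.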

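3. Compute probabilities with add-one smoothing. The probability of each word pair is computed as described in Solution 6.

# Compute bigram probabilities (add-one smoothing)
def calculate_bigram_probs(bigram_counts, vocab):
    bigram_probs = defaultdict(dict)
    for prev_word, followers in bigram_counts.items():  # avoid shadowing the built-in list
        for back_word, count in followers.items():
            bigram_probs[prev_word][back_word] = (count + 1) / (
                vocab[prev_word] + len(vocab))
    return bigram_probs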

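4. Process the test text.
5. Compute sentence perplexity. Unseen words are handled as described in Solution 6.

# Compute the perplexity of each sentence
def sentence_perplexity(text, bigram_probs, vocab, bigram_counts):
    perplexity = []
    for sentence in text:
        prob = 0  # running sum of log2 probabilities
        for i in range(len(sentence) - 1):
            if sentence[i] not in vocab:  # preceding word is out of vocabulary
                prob += log2(1 / len(vocab))  # add a log probability, not the raw vocabulary size
            elif sentence[i + 1] not in bigram_probs[sentence[i]]:
                # preceding word is known, but this bigram was never seen
                prob += log2(1 / (vocab[sentence[i]] + len(vocab)))
            else:
                # the bigram was seen in training
                prob += log2(bigram_probs[sentence[i]][sentence[i + 1]])
        perplexity.append(pow(2, -(prob / (len(sentence) - 1))))
    return perplexity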

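6. Evaluate the perplexity of the text.

Results:

1. Perplexity with the unigram model.
[Image: screenshot of the unigram perplexity output]

2. Perplexity with the bigram model.
[Image: screenshot of the bigram perplexity output]

3. The bigram model performs better than the unigram model.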

Appendix: source code

uni_gram.py

from nltk.tokenize import word_tokenize
from collections import Counter
from math import log2
import re


# Preprocess the text into a flat list of words
def preprocess_text(text):
    sentences = text.split("__eou__")  # split into sentences
    sentences.pop()  # drop the fragment after the last "__eou__"
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()  # strip punctuation, lowercase
        words += word_tokenize(sentence)  # tokenize
    return words


# Preprocess the text into a 2-D list: text holds sentence lists, each sentence holds its words
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    text = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        text.append(word_tokenize(sentence))
    return text


# Build the vocabulary; vocab is a dictionary with entries of the form {word: count}
def build_vocab(words):
    vocab = Counter(words)
    return vocab


# Compute unigram probabilities (add-one smoothing)
def calculate_unigram_probs(vocab, total_words):
    unigram_probs = {}
    for word, count in vocab.items():
        unigram_probs[word] = (count + 1) / (total_words + len(vocab))
    return unigram_probs


# Compute the perplexity of each sentence
def sentence_perplexity(text, unigram_probs, vocab, total_words):
    perplexity = []
    for sentence in text:
        prob = 0  # running sum of log2 probabilities
        for word in sentence:
            if word in unigram_probs:
                prob += log2(unigram_probs[word])
            else:
                prob += log2(1 / (len(vocab) + total_words))  # probability of an unknown word
        perplexity.append(pow(2, -(prob / len(sentence))))
    return perplexity


# Evaluate the perplexity of the whole text (mean of the sentence perplexities)
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)


# Load the data
with open("train_LM.txt", "r", encoding="utf-8") as file:
    train_text = file.read()
with open("test_LM.txt", "r", encoding="utf-8") as file:
    test_text = file.read()

words = preprocess_text(train_text)  # word list
vocab = build_vocab(words)  # vocabulary
unigram_probs = calculate_unigram_probs(vocab, len(words))  # unigram probabilities
test_text = preprocess_text2(test_text)  # 2-D list of test sentences
perplexity = sentence_perplexity(test_text, unigram_probs, vocab, len(words))  # per-sentence perplexities
test_perplexity = text_perplexity(perplexity)  # perplexity of the text

print(test_perplexity)

bi_gram.py

from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from math import log2
import re


# Preprocess the text into a flat word list used to build the vocabulary
def preprocess_text(text):
    sentences = text.split("__eou__")
    sentences.pop()  # drop the fragment after the last "__eou__"
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        # count both boundary markers so that each is part of the vocabulary
        words += ["<beg>"] + word_tokenize(sentence) + ["</end>"]
    return words


# Preprocess the text into a 2-D list of sentences wrapped in boundary markers
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        words.append(["<beg>"] + word_tokenize(sentence) + ["</end>"])
    return words


# Build the vocabulary
def build_vocab(words):
    vocab = Counter(words)
    return vocab


# Count bigram frequencies
def calculate_bigram(text):
    bigram_counts = defaultdict(dict)
    for sentence in text:
        for i in range(len(sentence) - 1):
            if sentence[i + 1] not in bigram_counts[sentence[i]]:
                bigram_counts[sentence[i]][sentence[i + 1]] = 1
            else:
                bigram_counts[sentence[i]][sentence[i + 1]] += 1
    return bigram_counts


# Compute bigram probabilities (add-one smoothing)
def calculate_bigram_probs(bigram_counts, vocab):
    bigram_probs = defaultdict(dict)
    for prev_word, followers in bigram_counts.items():  # avoid shadowing the built-in list
        for back_word, count in followers.items():
            bigram_probs[prev_word][back_word] = (count + 1) / (
                vocab[prev_word] + len(vocab))
    return bigram_probs


# Compute the perplexity of each sentence
def sentence_perplexity(text, bigram_probs, vocab, bigram_counts):
    perplexity = []
    for sentence in text:
        prob = 0  # running sum of log2 probabilities
        for i in range(len(sentence) - 1):
            if sentence[i] not in vocab:  # preceding word is out of vocabulary
                prob += log2(1 / len(vocab))  # add a log probability, not the raw vocabulary size
            elif sentence[i + 1] not in bigram_probs[sentence[i]]:
                # preceding word is known, but this bigram was never seen
                prob += log2(1 / (vocab[sentence[i]] + len(vocab)))
            else:
                # the bigram was seen in training
                prob += log2(bigram_probs[sentence[i]][sentence[i + 1]])
        perplexity.append(pow(2, -(prob / (len(sentence) - 1))))
    return perplexity


# Evaluate the perplexity of the whole text (mean of the sentence perplexities)
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)


# Load the data
with open("train_LM.txt", "r", encoding="utf-8") as file:
    text = file.read()
with open("test_LM.txt", "r", encoding="utf-8") as file:
    test_text = file.read()

words = preprocess_text(text)
vocab = build_vocab(words)

train_text = preprocess_text2(text)
bigram_counts = calculate_bigram(train_text)
bigram_probs = calculate_bigram_probs(bigram_counts, vocab)
test_text = preprocess_text2(test_text)
perplexity = sentence_perplexity(test_text, bigram_probs, vocab, bigram_counts)
test_perplexity = text_perplexity(perplexity)

print(test_perplexity)