NLP Lab 2: Bigram Language Model

-meteor-

Category: Natural Language Processing Course Labs. Tags: natural language processing, artificial intelligence, C++, NLP

Article link: https://blog.csdn.net/weixin_51502608/article/details/125562953

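Objectives

Become proficient with the basic concepts of language models and gain a solid understanding of the n-gram model.
Become proficient with parameter estimation: count the word frequencies in a corpus for the words of a given sentence, and output the probability of that sentence.
Extra: implement data smoothing in code.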

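Task

Using a freely available segmented Chinese corpus, such as the PKU People's Daily corpus, compose a sentence from words that are common in the corpus. Using a bigram model (each word depends only on the word immediately before it), count the relevant word frequencies in the corpus and output the probability of the sentence.
Example:
Suppose the corpus is:
$\langle\text{BOS}\rangle$ 商品 和 服务 $\langle\text{EOS}\rangle$
$\langle\text{BOS}\rangle$ 商品 和服 物美价廉 $\langle\text{EOS}\rangle$
$\langle\text{BOS}\rangle$ 服务 和 货币 $\langle\text{EOS}\rangle$
(Glosses: 商品 "goods", 和 "and", 服务 "services", 和服 "kimono", 物美价廉 "good and inexpensive", 货币 "currency".)
Frequency counts:
$P(\text{商品} \mid \text{BOS}) = 2/3$
$P(\text{和} \mid \text{商品}) = 1/2$
$P(\text{服务} \mid \text{和}) = 1/2$
$P(\text{EOS} \mid \text{服务}) = 1/2$
$P(\langle\text{BOS}\rangle\ \text{商品 和 服务}\ \langle\text{EOS}\rangle) = \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{12}$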
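
The arithmetic in the example above can be checked with a short sketch. This is only an illustration under assumptions: the names `BigramCounts`, `tokenize`, `countBigrams`, `Fraction`, `gcdLL`, and `sentenceProbability` are hypothetical and not part of the lab's program, and plain `long long` fractions are enough only because the toy corpus is tiny.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the two count tables of a bigram model.
struct BigramCounts {
    std::map<std::pair<std::string, std::string>, long long> pairCount; // c(w_{i-1} w_i)
    std::map<std::string, long long> contextCount;                      // c(w_{i-1})
};

// Split a whitespace-segmented line and wrap it in BOS/EOS markers.
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> tokens{"BOS"};
    std::istringstream in(line);
    std::string word;
    while (in >> word) tokens.push_back(word);
    tokens.push_back("EOS");
    return tokens;
}

// Count every adjacent word pair and every context word in the corpus.
BigramCounts countBigrams(const std::vector<std::string>& corpus) {
    BigramCounts counts;
    for (const std::string& line : corpus) {
        const std::vector<std::string> t = tokenize(line);
        for (std::size_t i = 0; i + 1 < t.size(); ++i) {
            ++counts.contextCount[t[i]];
            ++counts.pairCount[std::make_pair(t[i], t[i + 1])];
        }
    }
    return counts;
}

// A probability kept as a reduced fraction (toy-sized values only).
struct Fraction {
    long long num;
    long long den;
};

long long gcdLL(long long a, long long b) { return b ? gcdLL(b, a % b) : a; }

// Unsmoothed (maximum-likelihood) sentence probability.
// Returns 0/1 as soon as a context or a bigram was never seen.
Fraction sentenceProbability(const BigramCounts& counts, const std::string& sentence) {
    Fraction p{1, 1};
    const std::vector<std::string> t = tokenize(sentence);
    for (std::size_t i = 0; i + 1 < t.size(); ++i) {
        auto ctx = counts.contextCount.find(t[i]);
        auto big = counts.pairCount.find(std::make_pair(t[i], t[i + 1]));
        if (ctx == counts.contextCount.end() || big == counts.pairCount.end()) return {0, 1};
        p.num *= big->second;
        p.den *= ctx->second;
        const long long g = gcdLL(p.num, p.den);
        p.num /= g;
        p.den /= g;
    }
    return p;
}
```

Calling `sentenceProbability(countBigrams({"商品 和 服务", "商品 和服 物美价廉", "服务 和 货币"}), "商品 和 服务")` reproduces the fraction 1/12 derived by hand.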

Environment

Operating system: macOS Monterey 12.4
IDE: CLion
Chinese text encoding: GB 18030

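Procedure

	Variable definitions
In C++, std::map is an ordered associative container, typically implemented as a red-black tree, so insertion and lookup take O(log n) time; the hash-table container with average O(1) insertion and lookup is std::unordered_map. I use std::map here. oneCount records how many times the word pair w_{i-1} w_i occurs in the corpus. allCount records how many times w_{i-1} occurs in the corpus as a context, regardless of which w_i follows it. fz and fm store the numerator and denominator of the answer. On a large dataset, when the sentence being scored is long, the numerator and denominator become very large (far beyond 10^18), so I store them as arrays and use arbitrary-precision arithmetic.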

typedef long long LL;
map<pair<string,string>,LL> oneCount;
map<string,LL> allCount;
vector<LL> fz,fm;
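
If average O(1) lookups are wanted, one option is `std::unordered_map`, which needs a hash for the pair key because the standard library does not provide `std::hash` for `std::pair`. The sketch below is an assumption-laden illustration, not the lab's code: `PairHash`, `BigramTable`, and `lookup` are hypothetical names, and the mixing constant follows the widely used hash-combine idiom.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical hash functor for a (previous word, current word) key.
struct PairHash {
    std::size_t operator()(const std::pair<std::string, std::string>& p) const {
        const std::size_t h1 = std::hash<std::string>{}(p.first);
        const std::size_t h2 = std::hash<std::string>{}(p.second);
        // Mix the two hashes so that (a, b) and (b, a) usually differ.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// Hash-table counterpart of oneCount.
using BigramTable =
    std::unordered_map<std::pair<std::string, std::string>, long long, PairHash>;

// Read a count without inserting a new entry for unseen pairs.
long long lookup(const BigramTable& table, const std::string& prev, const std::string& cur) {
    auto it = table.find(std::make_pair(prev, cur));
    return it == table.end() ? 0 : it->second;
}
```

Using `find` instead of `operator[]` for reads keeps the table size unchanged, which matters when the size is later used as the vocabulary size.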

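	Program design
First, readFile() reads the segmented Chinese corpus. It creates an ifstream object ifs and uses getline to read one line at a time into the string read. It then constructs a stringstream object str from read, so that the words of the sentence can be extracted by splitting on whitespace and stored in sentence. Remember to add BOS and EOS before and after the words in sentence.
Then iterate over sentence. Because this is a 2-gram model, each sentence[i] is counted in allCount, and each pair (sentence[i], sentence[i+1]) is counted in oneCount.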


void readFile(){
    ifstream ifs("/Users/a26012/Desktop/大二下/NLP/实验/实验2/实验二——二元文法模型/pku_training.txt"); // absolute path
    ifs.unsetf(ios_base::skipws);

    string read,word;
    vector<string> sentence;

    // One corpus line is one sentence whose words are separated by whitespace
    while(getline(ifs, read))
    {
        sentence.clear();
        sentence.emplace_back("BOS");
        stringstream str(read);
        while(str>>word){
            sentence.push_back(word);
        }
        sentence.emplace_back("EOS");
        // Count each context word and each adjacent word pair
        for(size_t i=0;i+1<sentence.size();i++){
            allCount[sentence[i]]++;
            oneCount[{sentence[i],sentence[i+1]}]++;
        }
    }
}
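
The counting step can also be written against a generic input stream, which makes it possible to try it on an in-memory corpus and to detect a file that fails to open. This is a sketch under assumptions: `PairCounts`, `ContextCounts`, `countStream`, and `countFile` are hypothetical names, and returning -1 for an unopenable file is an arbitrary convention chosen for the example.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using PairCounts = std::map<std::pair<std::string, std::string>, long long>;
using ContextCounts = std::map<std::string, long long>;

// Accumulate bigram counts from any input stream; returns the number of lines read.
// An empty line still contributes the pair (BOS, EOS), as in readFile().
long long countStream(std::istream& in, PairCounts& pairCount, ContextCounts& contextCount) {
    long long lines = 0;
    std::string line, word;
    while (std::getline(in, line)) {
        std::vector<std::string> sentence{"BOS"};
        std::istringstream words(line);
        while (words >> word) sentence.push_back(word);
        sentence.push_back("EOS");
        for (std::size_t i = 0; i + 1 < sentence.size(); ++i) {
            ++contextCount[sentence[i]];
            ++pairCount[std::make_pair(sentence[i], sentence[i + 1])];
        }
        ++lines;
    }
    return lines;
}

// Open a corpus file and count it; returns -1 if the file cannot be opened.
long long countFile(const std::string& path, PairCounts& pairCount, ContextCounts& contextCount) {
    std::ifstream ifs(path);
    if (!ifs.is_open()) return -1;
    return countStream(ifs, pairCount, contextCount);
}
```

Passing a `std::istringstream` holding the three example sentences to `countStream` gives the same counts as reading them from a file, without depending on an absolute path.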

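Next, readSentence reads one sentence from the terminal and outputs the frequency counts and the probability of the sentence. In the same way as before, the sentence is split into words that are stored in the sentence array, and the numerator and denominator are initialized to 1. The function then walks through the sentence word by word. Here I use add-one smoothing (a form of additive smoothing), given by $p(w_i \mid w_{i-1}) = \frac{1 + c(w_{i-1} w_i)}{|V| + \sum_{w_i} c(w_{i-1} w_i)}$, where $|V|$ is the vocabulary size of the corpus under consideration (the number of all possible units); in the program it is allCount.size(). During the traversal each $p(w_i \mid w_{i-1})$ is printed, the numerator and denominator of the final probability are multiplied by the corresponding values, and at the end the probability of the whole sentence is printed.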

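void readSentence(){
    string read,word;
    getline(cin, read);
    stringstream str(read);
    fz.clear(),fm.clear();
    fz.push_back(1);
    fm.push_back(1);

    vector<string> sentence;
    sentence.emplace_back("BOS");
    while(str>>word){
        sentence.push_back(word);
    }
    sentence.emplace_back("EOS");

    // Fix |V| before the loop and read counts with find(): operator[] would
    // insert unseen words and change allCount.size() part-way through.
    const LL V=allCount.size();
    for(size_t i=0;i+1<sentence.size();i++){
        auto pairIt=oneCount.find(make_pair(sentence[i],sentence[i+1]));
        auto ctxIt=allCount.find(sentence[i]);
        LL up=(pairIt==oneCount.end())?0:pairIt->second;
        LL down=(ctxIt==allCount.end())?0:ctxIt->second;

        // Add-one smoothing
        up++;
        down+=V;

        // P(current word | previous word)
        printf("P(%s|%s) = %lld/%lld\n",sentence[i+1].c_str(),
               sentence[i].c_str(),up,down);

        fz=mul(fz,up);
        fm=mul(fm,down);
    }

    printf("P(<BOS>%s<EOS>) = ",read.c_str());
    // The fraction is reduced only while both parts still fit in a single limb
    if(fz.size()==1&&fm.size()==1){
        LL a=fz[0],b=fm[0];
        LL tt=gcd(a,b);
        fz[0]=a/tt;
        fm[0]=b/tt;
    }

    // mul() stores limbs least-significant first; reverse them for printing
    reverse(fm.begin(),fm.end());
    reverse(fz.begin(),fz.end());
    printNumber(fz);
    printf("/");
    printNumber(fm);
}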
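
The add-one estimate for a single word pair can be isolated in a small helper, which makes the effect on unseen words easy to see. This is a sketch under assumptions: `Ratio` and `addOne` are hypothetical names, and $|V|$ is taken to be the number of distinct context words, matching the use of allCount.size() described in the text.

```cpp
#include <map>
#include <string>
#include <utility>

using PairCounts = std::map<std::pair<std::string, std::string>, long long>;
using ContextCounts = std::map<std::string, long long>;

// An unreduced fraction num/den.
struct Ratio {
    long long num;
    long long den;
};

// Add-one estimate p(cur | prev) = (1 + c(prev cur)) / (|V| + c(prev)).
// c(prev) equals the sum over all followers w of c(prev w).
// Both maps are only read, so their sizes never change.
Ratio addOne(const PairCounts& pairCount, const ContextCounts& contextCount,
             const std::string& prev, const std::string& cur) {
    auto p = pairCount.find(std::make_pair(prev, cur));
    auto c = contextCount.find(prev);
    const long long up = (p == pairCount.end()) ? 0 : p->second;
    const long long down = (c == contextCount.end()) ? 0 : c->second;
    return {up + 1, down + static_cast<long long>(contextCount.size())};
}
```

With the three-sentence example corpus there are 7 distinct context words, so the smoothed estimate for (BOS, 商品) is (1 + 2)/(7 + 3) = 3/10, and a pair whose context never occurs gets 1/7 instead of 0/0.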


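The arbitrary-precision multiplication and output routines are shown below.
// Multiply a big number A (base 10^6, least-significant limb first) by B
vector<LL> mul(vector<LL> &A, LL B)
{
    vector<LL> C;
    LL t=0;
    for(size_t i=0;i<A.size()||t;i++){
        if(i < A.size()) t+=A[i]*B;
        C.push_back(t%1000000);
        t/=1000000;
    }
    while(C.size()>1&&C.back()==0) C.pop_back();
    return C;
}
// Print a big number given most-significant limb first (the caller reverses it)
void printNumber(const vector<LL>& c){
    for(size_t i=0;i<c.size();i++){
        // Every limb after the first must be zero-padded to six digits
        if(i==0) printf("%lld",c[i]);
        else printf("%06lld",c[i]);
    }
}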
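
The limb representation is easiest to check on a value whose inner limbs have leading zeros. The following sketch assumes the same base-10^6, least-significant-first layout as mul; the names `BASE`, `mulSmall`, and `toDecimal` are hypothetical, and `toDecimal` expects a non-empty vector.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

typedef long long LL;
const LL BASE = 1000000; // six decimal digits per limb

// Multiply a little-endian base-10^6 number by a small non-negative factor.
std::vector<LL> mulSmall(const std::vector<LL>& a, LL b) {
    std::vector<LL> c;
    LL carry = 0;
    for (std::size_t i = 0; i < a.size() || carry; ++i) {
        if (i < a.size()) carry += a[i] * b;
        c.push_back(carry % BASE);
        carry /= BASE;
    }
    while (c.size() > 1 && c.back() == 0) c.pop_back(); // drop leading zero limbs
    return c;
}

// Convert to decimal text: the most significant limb is printed as is,
// every lower limb is zero-padded to exactly six digits.
std::string toDecimal(const std::vector<LL>& a) {
    std::string out = std::to_string(a.back());
    for (std::size_t i = a.size() - 1; i-- > 0;) {
        char buf[8];
        std::snprintf(buf, sizeof buf, "%06lld", a[i]);
        out += buf;
    }
    return out;
}
```

Starting from `{1}` and multiplying by 1000001 twice gives the limbs 1, 2, 1 (least significant first), and `toDecimal` renders them as 1000002000001; without the zero padding the same limbs would print as 121.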

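Results

The results on the small dataset are as follows.
The corpus is:
$\langle\text{BOS}\rangle$ 商品 和 服务 $\langle\text{EOS}\rangle$
$\langle\text{BOS}\rangle$ 商品 和服 物美价廉 $\langle\text{EOS}\rangle$
$\langle\text{BOS}\rangle$ 服务 和 货币 $\langle\text{EOS}\rangle$

Figure 1: Results on the small dataset

The results on pku_training are as follows.

Figure 2: Results on pku_training

Before data smoothing:
For the input "我的名字叫王梓懿" ("My name is Wang Ziyi"), the probability comes out as 0/0, because the dataset does not contain the relevant words.

Figure 3: Before data smoothing

After data smoothing:
With data smoothing applied, the same input "我的名字叫王梓懿" yields a sentence probability of 677950/78610983356647820720528172000.

Figure 4: After data smoothing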

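Code

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>
using namespace std;
typedef long long LL;

map<pair<string,string>,LL> oneCount;       // count of the word pair (a, b)
map<string,LL> allCount;   // count of a followed by any word
vector<LL> fz,fm;          // numerator and denominator, base 10^6 limbs

// Multiply a big number A (least-significant limb first) by B
vector<LL> mul(vector<LL> &A, LL B)
{
    vector<LL> C;
    LL t=0;
    for(size_t i=0;i<A.size()||t;i++){
        if(i < A.size()) t+=A[i]*B;
        C.push_back(t%1000000);
        t/=1000000;
    }
    while(C.size()>1&&C.back()==0) C.pop_back();
    return C;
}

long long gcd(long long a,long long b){
    return b?gcd(b,a%b):a;
}
// Print a big number given most-significant limb first
void printNumber(const vector<LL>& c){
    for(size_t i=0;i<c.size();i++){
        // Every limb after the first must be zero-padded to six digits
        if(i==0) printf("%lld",c[i]);
        else printf("%06lld",c[i]);
    }
}
void readFile(){
    ifstream ifs("/Users/a26012/Desktop/大二下/NLP/实验/实验2/实验二——二元文法模型/pku_training.txt"); // absolute path
    ifs.unsetf(ios_base::skipws);

    string read,word;
    vector<string> sentence;
    while(getline(ifs, read))
    {
        sentence.clear();
        sentence.emplace_back("BOS");
        stringstream str(read);
        while(str>>word){
            sentence.push_back(word);
        }
        sentence.emplace_back("EOS");
        for(size_t i=0;i+1<sentence.size();i++){
            allCount[sentence[i]]++;
            oneCount[{sentence[i],sentence[i+1]}]++;
        }
    }
}
void readSentence(){
    string read,word;
    getline(cin, read);
    stringstream str(read);
    fz.clear(),fm.clear();
    fz.push_back(1);
    fm.push_back(1);

    vector<string> sentence;
    sentence.emplace_back("BOS");
    while(str>>word){
        sentence.push_back(word);
    }
    sentence.emplace_back("EOS");

    // Fix |V| before the loop and read counts with find(), so that unseen
    // words are not inserted into the maps
    const LL V=allCount.size();
    for(size_t i=0;i+1<sentence.size();i++){
        auto pairIt=oneCount.find(make_pair(sentence[i],sentence[i+1]));
        auto ctxIt=allCount.find(sentence[i]);
        LL up=(pairIt==oneCount.end())?0:pairIt->second;
        LL down=(ctxIt==allCount.end())?0:ctxIt->second;

        // Add-one smoothing; remove these two lines for the unsmoothed estimate
        up++;
        down+=V;

        // P(current word | previous word)
        printf("P(%s|%s) = %lld/%lld\n",sentence[i+1].c_str(),
               sentence[i].c_str(),up,down);

        fz=mul(fz,up);
        fm=mul(fm,down);
    }
    printf("P(<BOS>%s<EOS>) = ",read.c_str());
    // The fraction is reduced only while both parts still fit in a single limb
    if(fz.size()==1&&fm.size()==1){
        LL a=fz[0],b=fm[0];
        LL tt=gcd(a,b);
        fz[0]=a/tt;
        fm[0]=b/tt;
    }
    // mul() stores limbs least-significant first; reverse them for printing
    reverse(fm.begin(),fm.end());
    reverse(fz.begin(),fz.end());
    printNumber(fz);
    printf("/");
    printNumber(fm);
}
int main()
{
    readFile();
    readSentence();

    return 0;
}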
