Implementing the Chi-square Feature-Term Selection Algorithm in C++

Author: finallyliuyu (when reposting, please credit the original author and source)

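Text classification cannot do without a feature-term selection module. Feature selection is a key step in reducing the dimensionality of the feature space.

First, here is pseudocode for a generic feature-term selection module:

(Figure taken from C.D. Manning, Introduction to Information Retrieval, p. 251 of the original edition, or p. 188 of Wang Bin's Chinese translation.)

I will make only two points here; for everything else, please consult the book.

1. The pseudocode above selects the top k feature terms for one particular class according to some measure (such as IG or chi-square). ComputeFeatureUtility(D,t,c) in the pseudocode is the step that computes that measure.

2. For a classification problem as a whole, how do we choose the complete set of feature terms?

There are many ways; I point out just one. Suppose there are N classes and K feature terms are needed in total. Then K/N feature terms are selected for each class.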
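
The two points above can be sketched roughly as follows. This is a minimal sketch rather than the book's code: `selectTopK`, `selectForAllClasses`, and the `Utility` callback are names assumed here for illustration, with the callback standing in for ComputeFeatureUtility(D,t,c).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ComputeFeatureUtility(D, t, c): returns the score
// of a term for a class (for example IG or chi-square).
using Utility = std::function<double(const std::string& term, const std::string& cls)>;

// Score every vocabulary term for one class and keep the k highest-scoring terms.
std::vector<std::string> selectTopK(const std::vector<std::string>& vocabulary,
                                    const std::string& cls,
                                    std::size_t k,
                                    const Utility& utility) {
    std::vector<std::pair<double, std::string>> scored;
    for (const std::string& term : vocabulary) {
        scored.emplace_back(utility(term, cls), term);
    }
    // Highest score first; stable_sort keeps vocabulary order among ties.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<double, std::string>& a,
                        const std::pair<double, std::string>& b) {
                         return a.first > b.first;
                     });
    std::vector<std::string> top;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i) {
        top.push_back(scored[i].second);
    }
    return top;
}

// Give each of the N classes a budget of K / N terms and take the union.
std::set<std::string> selectForAllClasses(const std::vector<std::string>& vocabulary,
                                          const std::vector<std::string>& classes,
                                          std::size_t totalK,
                                          const Utility& utility) {
    std::set<std::string> selected;
    if (classes.empty()) return selected;
    const std::size_t perClass = totalK / classes.size();
    for (const std::string& cls : classes) {
        for (const std::string& term : selectTopK(vocabulary, cls, perClass, utility)) {
            selected.insert(term);
        }
    }
    return selected;
}
```

Because the same term can rank highly for several classes, the union may contain fewer than K distinct terms.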

Next is the formula for chi-square (same source: p. 256 of the original edition, p. 192 of Wang Bin's translation):
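
The formula figures are not reproduced in this copy. For reference, the standard statement of the two forms is given here, closed form first. The closed form is the expression that the `CalChiSquareValue` function later in this post evaluates. The symbol $E_{e_t e_c}$, the count expected in a cell if term and class were independent, is notation assumed here.

```latex
\chi^2(D,t,c) =
  \frac{(N_{11}+N_{10}+N_{01}+N_{00})\,(N_{11}N_{00}-N_{10}N_{01})^2}
       {(N_{11}+N_{01})\,(N_{11}+N_{10})\,(N_{10}+N_{00})\,(N_{01}+N_{00})}

\chi^2(D,t,c) =
  \sum_{e_t\in\{0,1\}}\ \sum_{e_c\in\{0,1\}}
  \frac{\left(N_{e_t e_c}-E_{e_t e_c}\right)^2}{E_{e_t e_c}},
\qquad
E_{11} = \frac{(N_{11}+N_{10})\,(N_{11}+N_{01})}{N_{11}+N_{10}+N_{01}+N_{00}}
```

The other three expected counts follow the same pattern: the row total times the column total, divided by the grand total.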

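The two formulas are equivalent: the upper one can be derived from the lower one. In a program we usually use the upper one.

Both formulas construct the test statistic of a chi-square test. (In mathematical statistics, chi-square is commonly used to test whether two events are independent. When the observed counts match exactly what independence predicts, the statistic is 0. For background, see the chapter on hypothesis testing in a mathematical statistics textbook.)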

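Now for the concrete implementation of chi-square feature-term selection.

The standard definition of the contingency table.

For a given term t and class c:

N11: the number of articles in the class that contain the term;

N10: the number of articles that contain the term but are not in the class;

N01: the number of articles in the class that do not contain the term;

N00: the number of articles in the training corpus that neither contain the term nor belong to the class.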
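
As an illustration of these four definitions, the sketch below counts the cells directly from a tiny labeled corpus. The `Document` and `Contingency` structs and the `countCells` function are assumptions made for this example and are not the data layout used in the implementation later in this post.

```cpp
#include <set>
#include <string>
#include <vector>

// A toy document: its class label and the set of distinct terms it contains.
struct Document {
    std::string classLabel;
    std::set<std::string> terms;
};

struct Contingency {
    int n11 = 0;  // contains the term, in the class
    int n10 = 0;  // contains the term, not in the class
    int n01 = 0;  // lacks the term, in the class
    int n00 = 0;  // lacks the term, not in the class
};

// Every document falls into exactly one of the four cells.
Contingency countCells(const std::vector<Document>& corpus,
                       const std::string& term,
                       const std::string& classLabel) {
    Contingency table;
    for (const Document& doc : corpus) {
        const bool hasTerm = doc.terms.count(term) > 0;
        const bool inClass = doc.classLabel == classLabel;
        if (hasTerm && inClass) ++table.n11;
        else if (hasTerm) ++table.n10;
        else if (inClass) ++table.n01;
        else ++table.n00;
    }
    return table;
}
```

The four cells always sum to the number of documents in the corpus, which gives a quick sanity check on any implementation.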

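Before giving the implementation, here is a passage that is a useful hint for how to write the program:

(same source, p. 257)

This passage suggests a data structure that records, for each term, how often it is present and absent in each class. With n classes, each row of the structure stores N11 and N01 for one class. In my code I also call this structure a contingency table. It differs slightly from the standard definition, but given N11 and N01, the standard table is easy to recover from the other data structures in the program.

The implementation follows. (For any function whose code I do not give here, see the corresponding functions in my series 《K-means文本聚类系列（已经完成）》, the completed K-means text clustering series.)

The main data structures used:

1. Dictionary: stores, for each term, how many times it occurs in each article of the training corpus. Type: map<string,vector<pair<int,int>> >. As the code below uses it, the first int of each pair is the article id.

2. Contingency table (described above). Type: map<pair<string,string>,pair<int,int> >

The map key has two parts: the first string is the term and the second string is the class. In the value, the first int is N11 and the second int is N01.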
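
The sketch below shows how the standard four cells might be recovered from these two structures. The typedef names, the `FullCells` struct, and `expandCells` are assumptions for illustration. It also assumes that the dictionary holds exactly one entry per article containing the term, so that the length of a term's list is its document frequency.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Dictionary: term -> list of (article id, occurrences of the term in that article).
typedef std::map<std::string, std::vector<std::pair<int, int> > > Dictionary;
// Table: (term, class label) -> (N11, N01).
typedef std::map<std::pair<std::string, std::string>, std::pair<int, int> > ContingencyTable;

struct FullCells { double n11, n10, n01, n00; };

// Recover all four cells for (term, classLabel). totalDocs is the size of the
// training corpus. Returns all zeros if the term or the pair is unknown.
FullCells expandCells(const Dictionary& dict, const ContingencyTable& table,
                      const std::string& term, const std::string& classLabel,
                      int totalDocs) {
    FullCells cells = {0.0, 0.0, 0.0, 0.0};
    Dictionary::const_iterator d = dict.find(term);
    ContingencyTable::const_iterator c = table.find(std::make_pair(term, classLabel));
    if (d == dict.end() || c == table.end()) return cells;
    const double docFreq = static_cast<double>(d->second.size());  // articles containing the term
    cells.n11 = c->second.first;
    cells.n01 = c->second.second;
    cells.n10 = docFreq - cells.n11;              // term present, outside the class
    cells.n00 = totalDocs - docFreq - cells.n01;  // term absent, outside the class
    return cells;
}
```

Using `find` instead of `operator[]` keeps the lookup from inserting empty entries into either map.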

The function that builds the contingency table:

 
    // Requires <algorithm>, <ctime>, <iostream>, <map>, <string>, <utility>, <vector>
    // and assumes using namespace std.
    /************************************************************************/
    /* Build the contingency table for every term.
     * Key:   (term text, class label).
     * Value: pair<int,int>. The first int is the number of articles in class c
     *        that contain term t; the second int is the number of articles in
     *        class c that do not contain term t.
     */
    /************************************************************************/
    map<pair<string,string>,pair<int,int> > Preprocess::GetContingencyTable(map<string,vector<pair<int,int> > > &mymap, vector<string> classLabels)
    {
        clock_t start,finish;
        double totaltime;
        start=clock();
        map<string,vector<int> > articleIdsEachClass=GetArticleIdinEachClass(classLabels);
        map<pair<string,string>,pair<int,int> > EntireContigencytable;
        // For every term in the bag-of-words model
        for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
        {
            // Skip empty and blank terms
            if(it->first!=""&&it->first!=" ")
            {
                // For every class
                for(map<string,vector<int> >::iterator it1=articleIdsEachClass.begin();it1!=articleIdsEachClass.end();it1++)
                {
                    int cntTheClass=(it1->second).size(); // number of articles in this class
                    int termInTheClass=0; // number of articles in this class that contain the term
                    for(vector<pair<int,int> >::iterator it2=(it->second).begin();it2!=(it->second).end();it2++)
                    {
                        // it2->first is an article id; count() adds 1 if that article belongs to this class
                        termInTheClass+=count((it1->second).begin(),it1->second.end(),it2->first);
                    }
                    int termAbsentInTheClass=cntTheClass-termInTheClass;
                    pair<string,string> compoundKey=make_pair(it->first,it1->first);
                    pair<int,int> valueInfo=make_pair(termInTheClass,termAbsentInTheClass);
                    EntireContigencytable[compoundKey]=valueInfo;
                }
            }
        }
        finish=clock();
        totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
        cout<<"Time to build the contingency table: "<<totaltime<<endl;
        return EntireContigencytable;
    }
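
Most of the running time of a function like this goes into the linear `count()` scan over a class's article ids, repeated for every posting of every term. One possible alternative, sketched here as a free function with assumed names (`buildContingencyTable`, `classOfArticle`), inverts the class-to-ids map once and then walks each term's posting list a single time. It assumes every article belongs to exactly one class and appears at most once in a term's list.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, std::vector<std::pair<int, int> > > Dictionary;
typedef std::map<std::pair<std::string, std::string>, std::pair<int, int> > ContingencyTable;

ContingencyTable buildContingencyTable(
        const Dictionary& dict,
        const std::map<std::string, std::vector<int> >& articleIdsEachClass) {
    // Invert class -> ids into id -> class once, so each lookup is O(log n).
    std::map<int, std::string> classOfArticle;
    for (const auto& entry : articleIdsEachClass) {
        for (int id : entry.second) classOfArticle[id] = entry.first;
    }

    ContingencyTable table;
    for (const auto& termEntry : dict) {
        // Count, per class, the articles that contain this term.
        std::map<std::string, int> present;
        for (const auto& posting : termEntry.second) {
            std::map<int, std::string>::const_iterator it = classOfArticle.find(posting.first);
            if (it != classOfArticle.end()) ++present[it->second];
        }
        for (const auto& entry : articleIdsEachClass) {
            const int n11 = present[entry.first];  // 0 if the term never occurs in the class
            const int n01 = static_cast<int>(entry.second.size()) - n11;
            table[std::make_pair(termEntry.first, entry.first)] = std::make_pair(n11, n01);
        }
    }
    return table;
}
```

Under the stated assumptions this produces the same (N11, N01) values as the member function above; I have not timed it against the original corpus.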

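Building the contingency table takes far longer than writing a finished table to disk and reading it back into memory when needed. On my machine, building the table took 233.41 sec, while loading it from disk into memory took 0.954 sec. So here are functions that serialize and deserialize the contingency table.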

 
    /************************************************************************/
    /* Save the contingency table to the local disk                         */
    /************************************************************************/
    void Preprocess::SaveContingencyTable(map<pair<string,string>,pair<int,int> > &contingencyTable)
    {
        ofstream outfile("F:\\Cluster\\contingency.dat",ios::binary);
        // One record per line: term, class label, N11, N01, separated by spaces
        for(map<pair<string,string>,pair<int,int> >::iterator it=contingencyTable.begin();it!=contingencyTable.end();it++)
        {
            outfile<<(it->first).first<<" "<<(it->first).second<<" "<<(it->second).first<<" "<<(it->second).second<<endl;
        }
        outfile.close();
    }

    /************************************************************************/
    /* Load the contingency table from disk into memory                     */
    /************************************************************************/
    void Preprocess::LoadContingencyTable(map<pair<string,string>,pair<int,int> > &contingencyTable)
    {
        clock_t start,finish;
        double totaltime;
        start=clock();
        ifstream infile("F:\\Cluster\\contingency.dat",ios::binary);
        string termtext="";
        string classLabel="";
        int presentNum=0; // number of articles under classLabel that contain the term (each article counted once)
        int absentNum=0; // number of articles under classLabel that do not contain the term
        // Loop on the extraction itself: testing eof() before reading would
        // process one extra, failed read after the last record
        while(infile>>termtext>>classLabel>>presentNum>>absentNum)
        {
            pair<string,string> compoundKey=make_pair(termtext,classLabel);
            pair<int,int> valinfo=make_pair(presentNum,absentNum);
            contingencyTable[compoundKey]=valinfo;
        }
        infile.close();
        finish=clock();
        totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
        cout<<"Time to load the contingency table into memory: "<<totaltime<<endl;
    }
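
The record format can be exercised without touching the disk by writing to and reading from generic streams. The following is a minimal sketch with assumed names (`writeTable`, `readTable`, `roundTrip`). Like the file format above, it assumes that terms and class labels contain no whitespace.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, std::pair<int, int> > ContingencyTable;

// One record per line: term, class label, N11, N01, separated by spaces.
void writeTable(std::ostream& out, const ContingencyTable& table) {
    for (ContingencyTable::const_iterator it = table.begin(); it != table.end(); ++it) {
        out << it->first.first << ' ' << it->first.second << ' '
            << it->second.first << ' ' << it->second.second << '\n';
    }
}

// Returns the number of records read. The loop stops on the first failed
// extraction, so a trailing newline does not produce a phantom record.
int readTable(std::istream& in, ContingencyTable& table) {
    std::string term, classLabel;
    int present = 0, absent = 0;
    int records = 0;
    while (in >> term >> classLabel >> present >> absent) {
        table[std::make_pair(term, classLabel)] = std::make_pair(present, absent);
        ++records;
    }
    return records;
}

// Convenience check: serialize to memory and parse back.
ContingencyTable roundTrip(const ContingencyTable& table) {
    std::stringstream buffer;
    writeTable(buffer, table);
    ContingencyTable copy;
    readTable(buffer, copy);
    return copy;
}
```

Passing an `std::ofstream` or `std::ifstream` to the same functions gives the on-disk behavior.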

The function that computes the chi-square value:

 
    /************************************************************************/
    /* Compute the chi-square value                                         */
    /************************************************************************/
    double Preprocess::CalChiSquareValue(double N11,double N10,double N01,double N00)
    {
        double chiSquare=0;
        double denominator=(N11+N01)*(N11+N10)*(N10+N00)*(N01+N00);
        // A zero margin (for example a term that occurs in every article) would
        // divide by zero; score such a term as 0
        if(denominator==0)
        {
            return chiSquare;
        }
        chiSquare=(N11+N10+N01+N00)*pow((N11*N00-N10*N01),2)/denominator;
        return chiSquare;
    }

For each class, compute the chi-square value of every term and sort the terms by chi-square value from highest to lowest:

 
    /************************************************************************/
    /* Compute the chi-square value of every term in the bag of words for one class */
    /************************************************************************/
    vector<pair<string,double> > Preprocess::ChiSquareFeatureSelectionForPerclass(map<string,vector<pair<int,int> > > &mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,string classLabel)
    {
        int N=endIndex-beginIndex+1; // total number of articles
        vector<string> tempvector; // all terms in the bag of words
        vector<pair<string,double> > chisquareInfo;
        for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
        {
            tempvector.push_back(it->first);
        }
        // Compute the chi-square values
        for(vector<string>::iterator ittmp=tempvector.begin();ittmp!=tempvector.end();ittmp++)
        {
            int N1=mymap[*ittmp].size(); // number of articles that contain the term
            pair<string,string> compoundKey=make_pair(*ittmp,classLabel);
            double N11=double(contingencyTable[compoundKey].first);
            double N01=double(contingencyTable[compoundKey].second);
            double N10=double(N1-N11); // contain the term, outside the class
            double N00=double(N-N1-N01); // lack the term, outside the class
            double chiValue=CalChiSquareValue(N11,N10,N01,N00);
            chisquareInfo.push_back(make_pair(*ittmp,chiValue));
        }
        // Sort the terms by chi-square value, from largest to smallest
        stable_sort(chisquareInfo.begin(),chisquareInfo.end(),isLarger);
        /*ofstream outfile("F:\\Cluster\\other.dat");
        int finalKeyWordsCount=0;
        for(vector<pair<string,double> >::size_type j=0;j<chisquareInfo.size();j++)
        {
            outfile<<chisquareInfo[j].first<<";"<<chisquareInfo[j].second<<endl;
            finalKeyWordsCount++;
        }
        outfile.close();*/
        return chisquareInfo;
    }
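
The comparator `isLarger` is one of the helpers not listed in this post. A plausible definition, given how it is used with `stable_sort` on (term, chi-square) pairs, might look like the sketch below. This is an assumption about its shape, not the original code, and `sortByScoreDescending` is a name invented for the example.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical definition of the comparator: true when a has the larger score.
bool isLarger(const std::pair<std::string, double>& a,
              const std::pair<std::string, double>& b) {
    return a.second > b.second;
}

// Sort (term, chi-square) pairs from highest to lowest score.
void sortByScoreDescending(std::vector<std::pair<std::string, double> >& scores) {
    std::stable_sort(scores.begin(), scores.end(), isLarger);
}
```

The comparison has to be strict (`>` rather than `>=`) to be a valid ordering for the standard sort algorithms. With `stable_sort`, terms with equal scores keep their original relative order, which here is the alphabetical order of the dictionary map.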

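Chi-square feature-term selection for the whole classification problem. In this example there are three classes. In the code below, each class's quota is its share of the N requested terms, in proportion to the hard-coded values N1, N2, and N3.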

 
    /************************************************************************/
    /* Chi-square feature-term selection                                    */
    /************************************************************************/
    void Preprocess::ChiSquareFeatureSelection(map<string,vector<pair<int,int> > > &mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,int N)
    {
        clock_t start,finish;
        double totaltime;
        start=clock();
        // Hard-coded per-class figures for this corpus; each class receives a
        // share of the N feature terms in proportion to them
        int N1=18693;
        int N2=23822;
        int N3=15717;
        int threshold1=N1*N/(N1+N2+N3);
        int threshold2=N2*N/(N1+N2+N3);
        int threshold3=N3*N/(N1+N2+N3);
        string classlabel1="xxxx";
        string classlabel2="yyyy";
        string classlabel3="zzzz";
        vector<string> classLabels;
        classLabels.push_back("xxxx");
        classLabels.push_back("yyyy");
        classLabels.push_back("zzzz");
        vector<pair<string,double> > chisquareInfo1;
        vector<pair<string,double> > chisquareInfo2;
        vector<pair<string,double> > chisquareInfo3;
        // Each result is already sorted from highest to lowest chi-square value
        chisquareInfo1=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel1);
        chisquareInfo2=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel2);
        chisquareInfo3=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel3);
        cout<<"finish ChiSquare Calculation"<<endl;
        // Take the top terms of each class; the set removes duplicates
        set<string> finalKeywords;
        for(size_t j=0;j<(size_t)threshold1&&j<chisquareInfo1.size();j++)
        {
            finalKeywords.insert(chisquareInfo1[j].first);
        }
        for(size_t j=0;j<(size_t)threshold2&&j<chisquareInfo2.size();j++)
        {
            finalKeywords.insert(chisquareInfo2[j].first);
        }
        for(size_t j=0;j<(size_t)threshold3&&j<chisquareInfo3.size();j++)
        {
            finalKeywords.insert(chisquareInfo3[j].first);
        }
        ofstream outfile(featurewordsAddress);
        int finalKeyWordsCount=finalKeywords.size();
        for(set<string>::iterator it=finalKeywords.begin();it!=finalKeywords.end();it++)
        {
            outfile<<*it<<endl;
        }
        outfile.close();
        cout<<"Number of feature terms selected: "<<finalKeyWordsCount<<endl;
        finish=clock();
        totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
        cout<<"Time spent selecting feature terms: "<<totaltime<<endl;
    }
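
The three-class logic can be generalized to any number of classes. The following is a rough sketch under assumptions of my own: `mergeTopTerms`, `RankedTerms`, and the `classWeight` map are hypothetical names, and the weights play the role that N1, N2, and N3 play above.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, double> > RankedTerms;  // sorted, best first

// rankedPerClass: class label -> terms ranked by chi-square for that class.
// classWeight:    class label -> weight (for example the class's article count).
// totalBudget:    overall number of feature terms requested.
std::set<std::string> mergeTopTerms(const std::map<std::string, RankedTerms>& rankedPerClass,
                                    const std::map<std::string, long long>& classWeight,
                                    long long totalBudget) {
    long long weightSum = 0;
    for (const auto& w : classWeight) weightSum += w.second;

    std::set<std::string> selected;
    if (weightSum <= 0) return selected;
    for (const auto& entry : rankedPerClass) {
        std::map<std::string, long long>::const_iterator w = classWeight.find(entry.first);
        if (w == classWeight.end()) continue;
        // Proportional share, clamped to the number of ranked terms available.
        const long long share = w->second * totalBudget / weightSum;
        const std::size_t limit = std::min<std::size_t>(
            entry.second.size(), static_cast<std::size_t>(std::max<long long>(share, 0)));
        for (std::size_t j = 0; j < limit; ++j) selected.insert(entry.second[j].first);
    }
    return selected;
}
```

Using `long long` for the weights and the budget avoids overflow in the product `weight * totalBudget` when both are large, and the clamp keeps the loop inside each ranked list.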