Data Mining: A Java Implementation of a newsgroup18828 Text Classifier Based on Naive Bayes and KNN (Part 1)


小飞侠-2


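(Update 2012.12.28: for common questions about downloading and running this project, see the FAQ post "Java implementation of the newsgroup18828 text classifier, text clusterer, and frequent-pattern mining for association analysis: project download and run FAQ".)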

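This post covers the following:
Preprocessing the newsgroup document collection and extracting 30095 feature words

Computing the TF*IDF value of each feature word in every document to turn documents into vectors, which the KNN algorithm uses

Implementing newsgroup text classifiers in Java with the KNN algorithm and the naive Bayes algorithm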

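1. Introduction to the Newsgroup document collection

Newsgroups was first collected by Lang in 1995 and used in [Lang 1995]. It contains about 20000 Usenet documents spread almost evenly over 20 different newsgroups. Apart from the 4.5% of documents that belong to two or more newsgroups, every document belongs to exactly one newsgroup, so the collection is usually treated as a single-label classification problem. Newsgroups has become a standard collection for text classification and clustering. Jason Rennie of MIT processed the collection so that each document belongs to exactly one newsgroup, producing Newsgroups-18828.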

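2. Preprocessing the Newsgroup documents

Text classification starts with text preprocessing. The main steps are:

STEP ONE: English lexical analysis. Remove digits, hyphens, punctuation, and special characters, and convert all upper-case letters to lower case. A regular expression can do the splitting: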

String res[] = line.split("[^a-zA-Z]");

STEP TWO: Remove stop words, filtering out words that carry no value for classification.

STEP THREE: Stemming, based on the Porter algorithm.
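Steps one and two can be tried on a single line of text. The following is a minimal sketch, not the project's code: the class name `TokenizeSketch`, the method `tokenize`, and the three-word stop list are all made up for illustration, and stemming (step three) is left out because it needs a Porter stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the original project.
class TokenizeSketch {

    // Split on every non-letter character, lower-case each piece,
    // and drop empty pieces and stop words.
    static List<String> tokenize(String line, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : line.split("[^a-zA-Z]")) {
            String word = raw.toLowerCase();
            if (!word.isEmpty() && !stopWords.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A tiny stop list assumed for the demo; the project reads its list from stopwords.txt.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "a"));
        // prints [x, window, system, years, old]
        System.out.println(tokenize("The X-Window system is 20 years old!", stopWords));
    }
}
```

Note how "X-Window" is cut into two tokens and "20" disappears entirely, which is the behavior the regular expression above implies.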

The document preprocessing class DataPreProcess.java is as follows:

    package com.pku.yangliu;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.ArrayList;

    /**
     * Preprocessing class for the Newsgroups document collection.
     */
    public class DataPreProcess {
  
        /**
         * Preprocesses every file under the given directory, recursing into subdirectories.
         * @param strDir absolute path of the newsgroup directory
         * @throws IOException
         */
        public void doProcess(String strDir) throws IOException {
            File fileDir = new File(strDir);
            if (!fileDir.exists()) {
                System.out.println("File not exist: " + strDir);
                return;
            }
            String subStrDir = strDir.substring(strDir.lastIndexOf('/'));
            String dirTarget = strDir + "/../../processedSample_includeNotSpecial" + subStrDir;
            File fileTarget = new File(dirTarget);
            if (!fileTarget.exists()) {
                // Note: the processedSample directory must be created by hand beforehand;
                // mkdir() fails when the parent directory does not exist
                fileTarget.mkdir();
            }
            File[] srcFiles = fileDir.listFiles();
            String[] stemFileNames = new String[srcFiles.length];
            for (int i = 0; i < srcFiles.length; i++) {
                String fileFullName = srcFiles[i].getCanonicalPath();
                String fileShortName = srcFiles[i].getName();
                if (!new File(fileFullName).isDirectory()) { // a regular file: preprocess it
                    System.out.println("Begin preprocess: " + fileFullName);
                    StringBuilder stringBuilder = new StringBuilder();
                    stringBuilder.append(dirTarget + "/" + fileShortName);
                    createProcessFile(fileFullName, stringBuilder.toString());
                    stemFileNames[i] = stringBuilder.toString();
                } else { // a subdirectory: recurse into it
                    fileFullName = fileFullName.replace("\\", "/");
                    doProcess(fileFullName);
                }
            }
            // Run the stemming algorithm over the files written above
            if (stemFileNames.length > 0 && stemFileNames[0] != null) {
                Stemmer.porterMain(stemFileNames);
            }
        }
  
        /**
         * Preprocesses one source file and writes the result to the target file.
         * @param srcDir absolute path of the source file
         * @param targetDir absolute path of the target file to generate
         * @throws IOException
         */
        private static void createProcessFile(String srcDir, String targetDir) throws IOException {
            FileReader srcFileReader = new FileReader(srcDir);
            FileReader stopWordsReader = new FileReader("F:/DataMiningSample/stopwords.txt");
            FileWriter targetFileWriter = new FileWriter(targetDir);
            BufferedReader srcFileBR = new BufferedReader(srcFileReader); // decorator pattern
            BufferedReader stopWordsBR = new BufferedReader(stopWordsReader);
            String line, resLine, stopWordsLine;
            // Build the stop-word ArrayList from stopWordsBR
            ArrayList<String> stopWordsArray = new ArrayList<String>();
            while ((stopWordsLine = stopWordsBR.readLine()) != null) {
                if (!stopWordsLine.isEmpty()) {
                    stopWordsArray.add(stopWordsLine);
                }
            }
            while ((line = srcFileBR.readLine()) != null) {
                resLine = lineProcess(line, stopWordsArray);
                if (!resLine.isEmpty()) {
                    // Write one word per line; lineProcess returns words separated by spaces
                    String[] tempStr = resLine.split(" ");
                    for (int i = 0; i < tempStr.length; i++) {
                        if (!tempStr[i].isEmpty()) {
                            targetFileWriter.append(tempStr[i] + "\n");
                        }
                    }
                }
            }
            targetFileWriter.flush();
            targetFileWriter.close();
            // Closing a BufferedReader also closes the FileReader it wraps
            srcFileBR.close();
            stopWordsBR.close();
        }
  
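        /**
         * Processes one line: lexical analysis and stop-word removal (stemming is applied afterwards).
         * @param line the line to process
         * @param stopWordsArray the stop-word list
         * @return the processed line, rebuilt from the kept words with spaces as separators
         * @throws IOException
         */
        private static String lineProcess(String line, ArrayList<String> stopWordsArray) throws IOException {
            // Step 1: English lexical analysis. Splitting on non-letters removes digits, hyphens,
            // punctuation, and special characters; lower-casing happens in the loop below.
            String res[] = line.split("[^a-zA-Z]");
            // Careful: words with digits or hyphens in the middle get cut apart here, but that is acceptable

            String resString = new String();
            // Step 2: remove stop words
            // Step 3: stemming is done for the whole file after this method returns
            for (int i = 0; i < res.length; i++) {
                if (!res[i].isEmpty() && !stopWordsArray.contains(res[i].toLowerCase())) {
                    resString += " " + res[i].toLowerCase() + " ";
                }
            }
            return resString;
        }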
  
        /**
         * @param args
         * @throws IOException
         */
        public void BPPMain(String[] args) throws IOException {
            DataPreProcess dataPrePro = new DataPreProcess();
            dataPrePro.doProcess("F:/DataMiningSample/orginSample");
        }

    }

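Implementations of the Porter stemming algorithm in both C and Java can be found via Google; the Java version of the Porter algorithm is available for download.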

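2. Selecting feature words

After preprocessing, the documents contain 87554 distinct words in total. Counting their occurrences gives:
Words occurring at least 1 time: 87554
Words occurring at least 3 times: 36456
Words occurring at least 4 times: 30095
Candidate strategies for selecting feature words:
Strategy 1: keep all 87554 words as feature words
Strategy 2: keep the 30095 words that occur at least 4 times as feature words
Strategy chosen: strategy 1. The computation time and average accuracy of the two strategies are compared later.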
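To make the two strategies concrete, here is a small sketch of frequency-threshold selection. It is an illustration under assumptions rather than the project's ComputeWordsVector code: `FeatureSelectionSketch`, `countWords`, and `selectFeatures` are hypothetical names, and the word list is toy data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the original project.
class FeatureSelectionSketch {

    // Count how many times each word occurs in the corpus.
    static SortedMap<String, Integer> countWords(List<String> words) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Keep only the words whose count is at least minCount.
    // minCount = 1 corresponds to strategy 1, minCount = 4 to strategy 2.
    static SortedMap<String, Integer> selectFeatures(Map<String, Integer> counts, int minCount) {
        SortedMap<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "space", "space", "space", "space", "nasa", "nasa", "orbit");
        SortedMap<String, Integer> counts = countWords(words);
        System.out.println(selectFeatures(counts, 1)); // prints {nasa=2, orbit=1, space=4}
        System.out.println(selectFeatures(counts, 4)); // prints {space=4}
    }
}
```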

The class CreateTrainAndTestSample.java, which creates the training and test sets, is as follows:

    package com.pku.yangliu;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /**
     * Creates the training sample set and the test sample set.
     */
    public class CreateTrainAndTestSample {
  
        /**
         * Copies the preprocessed files, keeping only the words that appear in the feature-word map.
         * ComputeWordsVector is another class of the project and is not listed in this part.
         */
        void filterSpecialWords() throws IOException {
            String word;
            ComputeWordsVector cwv = new ComputeWordsVector();
            String fileDir = "F:/DataMiningSample/processedSample_includeNotSpecial";
            SortedMap<String, Double> wordMap = new TreeMap<String, Double>();
            wordMap = cwv.countWords(fileDir, wordMap);
            cwv.printWordMap(wordMap); // write wordMap to a file
            File[] sampleDir = new File(fileDir).listFiles();
            for (int i = 0; i < sampleDir.length; i++) {
                File[] sample = sampleDir[i].listFiles();
                String targetDir = "F:/DataMiningSample/processedSampleOnlySpecial/" + sampleDir[i].getName();
                File targetDirFile = new File(targetDir);
                if (!targetDirFile.exists()) {
                    targetDirFile.mkdirs(); // unlike mkdir(), also creates processedSampleOnlySpecial if it is missing
                }
                for (int j = 0; j < sample.length; j++) {
                    String fileShortName = sample[j].getName();
                    if (fileShortName.contains("stemed")) {
                        targetDir = "F:/DataMiningSample/processedSampleOnlySpecial/" + sampleDir[i].getName() + "/" + fileShortName.substring(0, 5);
                        FileWriter tgWriter = new FileWriter(targetDir);
                        FileReader samReader = new FileReader(sample[j]);
                        BufferedReader samBR = new BufferedReader(samReader);
                        while ((word = samBR.readLine()) != null) {
                            if (wordMap.containsKey(word)) {
                                tgWriter.append(word + "\n");
                            }
                        }
                        samBR.close(); // the original listing left this reader open
                        tgWriter.flush();
                        tgWriter.close();
                    }
                }
            }
        }
  
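        /**
         * Splits the samples of every category into a training set and a test set for one experiment.
         * @param fileDir directory holding one subdirectory per category
         * @param trainSamplePercent fraction of each category used for training
         * @param indexOfSample index of the current experiment
         * @param classifyResultFile file recording the correct category of every test sample
         * @throws IOException
         */
        void createTestSamples(String fileDir, double trainSamplePercent, int indexOfSample, String classifyResultFile) throws IOException {
            String word, targetDir;
            FileWriter crWriter = new FileWriter(classifyResultFile); // records the correct category of every test sample
            File[] sampleDir = new File(fileDir).listFiles();
            for (int i = 0; i < sampleDir.length; i++) {
                File[] sample = sampleDir[i].listFiles();
                double testBeginIndex = indexOfSample * (sample.length * (1 - trainSamplePercent)); // file index where the test samples begin
                double testEndIndex = (indexOfSample + 1) * (sample.length * (1 - trainSamplePercent)); // file index where the test samples end
                for (int j = 0; j < sample.length; j++) {
                    FileReader samReader = new FileReader(sample[j]);
                    BufferedReader samBR = new BufferedReader(samReader);
                    String fileShortName = sample[j].getName();
                    String subFileName = fileShortName;
                    if (j > testBeginIndex && j < testEndIndex) {
                        // Files whose index lies strictly inside the interval become test samples. Each one gets a
                        // "file name, category" line in the record file, one line per file, so accuracy is easy to compute.
                        targetDir = "F:/DataMiningSample/TestSample" + indexOfSample + "/" + sampleDir[i].getName();
                        crWriter.append(subFileName + " " + sampleDir[i].getName() + "\n");
                    } else { // everything else becomes a training sample
                        targetDir = "F:/DataMiningSample/TrainSample" + indexOfSample + "/" + sampleDir[i].getName();
                    }
                    targetDir = targetDir.replace("\\", "/");
                    File trainSamFile = new File(targetDir);
                    if (!trainSamFile.exists()) {
                        trainSamFile.mkdirs(); // unlike mkdir(), also creates the TestSample/TrainSample parent if missing
                    }
                    targetDir += "/" + subFileName;
                    FileWriter tsWriter = new FileWriter(new File(targetDir));
                    while ((word = samBR.readLine()) != null) {
                        tsWriter.append(word + "\n");
                    }
                    samBR.close(); // the original listing left this reader open
                    tsWriter.flush();
                    tsWriter.close();
                }
            }
            crWriter.flush();
            crWriter.close();
        }
    }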
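
The index arithmetic in createTestSamples can be checked in isolation. The sketch below copies the two boundary formulas and the strict comparison into a hypothetical `FoldBoundsSketch` class; the category size of 1000 files is an assumption chosen for the demo, not a figure from the data set.

```java
// Hypothetical helper for illustration only; it mirrors the boundary test in createTestSamples.
class FoldBoundsSketch {

    // True when file index j falls into the test slice of the given experiment.
    static boolean isTest(int j, int sampleCount, double trainSamplePercent, int indexOfSample) {
        double testBeginIndex = indexOfSample * (sampleCount * (1 - trainSamplePercent));
        double testEndIndex = (indexOfSample + 1) * (sampleCount * (1 - trainSamplePercent));
        return j > testBeginIndex && j < testEndIndex; // strict on both sides, as in the listing
    }

    // Number of files of one category that end up in the test set.
    static int countTest(int sampleCount, double trainSamplePercent, int indexOfSample) {
        int count = 0;
        for (int j = 0; j < sampleCount; j++) {
            if (isTest(j, sampleCount, trainSamplePercent, indexOfSample)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int fold = 0; fold < 10; fold++) {
            System.out.println("experiment " + fold + ": " + countTest(1000, 0.9, fold) + " test files");
        }
    }
}
```

Because the lower bound is strict, index 0 is never a test file: in experiment 0 the condition `0 > 0` is false, and in every later experiment the lower bound is already positive.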

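3. The naive Bayes algorithm and its implementation

By the naive Bayes formula, the probability that a test sample belongs to a class c = the product of the class-conditional probabilities P(tk|c) of all feature words contained in that test sample * the prior probability P(c)

When computing the class-conditional probabilities and the prior, a naive Bayes classifier can use one of two models:

(1) Multinomial model: works at the granularity of words, that is, it takes into account words that occur several times within a file. The multinomial distribution is a generalization of the binomial distribution. Under this model the variable for each word is no longer binary (present/absent) but the number of times the word occurs in the file.
Class-conditional probability P(tk|c) = (total number of occurrences of word tk across the documents of class c + 1) / (total number of words in class c + number of distinct feature words in the training samples)
Prior probability P(c) = total number of words in class c / total number of words in the whole training set
(2) Bernoulli model: works at the granularity of files, in other words it uses the binomial distribution. A Bernoulli experiment consists of N independent repeated random trials in which only the occurrence or non-occurrence of an event matters, so the variable for each word is boolean.
Class-conditional probability P(tk|c) = (number of files in class c that contain word tk + 1) / (total number of files in class c + 2) (Note: this originally said "words" by mistake and was corrected thanks to a reader's reminder.)
Prior probability P(c) = number of files in class c / total number of files in the whole training set
This classifier uses the multinomial model. According to "Introduction to Information Retrieval", the multinomial model gives higher accuracy.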
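
As a quick numeric check of the two smoothed estimates above, the following sketch evaluates both formulas on made-up counts. The class `NaiveBayesEstimatesSketch`, its method names, and all numbers are assumptions for illustration; the code follows the formulas as written in the text, not the listing further down.

```java
// Hypothetical helper for illustration only; it encodes the two formulas stated above.
class NaiveBayesEstimatesSketch {

    // Multinomial model:
    // (occurrences of tk in class c + 1) / (total words in class c + number of distinct feature words)
    static double multinomial(double countInClass, double wordsInClass, double vocabularySize) {
        return (countInClass + 1.0) / (wordsInClass + vocabularySize);
    }

    // Bernoulli model:
    // (files in class c containing tk + 1) / (files in class c + 2)
    static double bernoulli(double filesWithWord, double filesInClass) {
        return (filesWithWord + 1.0) / (filesInClass + 2.0);
    }

    public static void main(String[] args) {
        // Toy class: 96 words in total, 4 distinct feature words, the word occurs 3 times.
        System.out.println(multinomial(3, 96, 4)); // prints 0.04
        // A word never seen in the class still gets a non-zero probability.
        System.out.println(multinomial(0, 96, 4)); // prints 0.01
        // Toy class: 2 files, 1 of which contains the word.
        System.out.println(bernoulli(1, 2));       // prints 0.5
    }
}
```

The "+1" in the numerator is what keeps a single unseen word from driving the whole product to zero.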

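Points to note in the naive Bayes implementation:

(1) Probabilities are computed with the BigDecimal class, which provides arbitrary-precision arithmetic
(2) Ten classification experiments are run with cross-validation, and the accuracies are averaged
(3) The confusion matrix is computed from the correct-category file and the classification-result file and then printed
(4) In Map<String,Double> cateWordsProb the key is "category_word" and the value is the number of occurrences of that word in that category, which avoids repeated computation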
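
Point (1) matters because a product of several hundred small probabilities underflows a `double`. The sketch below shows the effect with an assumed per-word probability of 0.0001 and an assumed document length of 500 words. It also shows summing logarithms, a common alternative that is not what this post's code does; it is included only for comparison. `UnderflowSketch` and its methods are hypothetical names.

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration only; not part of the original project.
class UnderflowSketch {

    // Multiply p by itself n times using double arithmetic.
    static double productAsDouble(double p, int n) {
        double result = 1.0;
        for (int i = 0; i < n; i++) {
            result *= p;
        }
        return result;
    }

    // The same product with BigDecimal, which keeps every digit.
    static BigDecimal productAsBigDecimal(String p, int n) {
        BigDecimal factor = new BigDecimal(p);
        BigDecimal result = BigDecimal.ONE;
        for (int i = 0; i < n; i++) {
            result = result.multiply(factor);
        }
        return result;
    }

    // Alternative: add logarithms instead of multiplying probabilities.
    static double logSum(double p, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(productAsDouble(0.0001, 500));              // prints 0.0 (underflow)
        System.out.println(productAsBigDecimal("0.0001", 500).signum()); // prints 1 (still positive)
        System.out.println(logSum(0.0001, 500));                       // roughly -4605.17
    }
}
```

With `double` every class would score 0.0 and the comparison between classes would be meaningless, whereas both BigDecimal products and log sums still rank the classes.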

The naive Bayes implementation class, NaiveBayesianClassifier.java, is as follows:

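    package com.pku.yangliu;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.math.BigDecimal;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;
    import java.util.SortedSet;
    import java.util.TreeMap;
    import java.util.TreeSet;
    import java.util.Vector;

    /**
     * Classifies the newsgroup document collection with naive Bayes, averaging over ten cross-validation runs.
     * Uses the multinomial model; the Stanford "Introduction to Information Retrieval" course slides state
     * that the multinomial model is more accurate than the Bernoulli model.
     * Class-conditional probability P(tk|c) = (total occurrences of word tk across the documents of class c + 1) / (total number of words in class c + |V|)
     */
    public class NaiveBayesianClassifier {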
  
        /**
         * Classifies the test documents with naive Bayes.
         * @param trainDir directory of the training documents
         * @param testDir directory of the test documents
         * @param classifyResultFileNew path of the classification-result file
         * @throws Exception
         */
        private void doProcess(String trainDir, String testDir,
                String classifyResultFileNew) throws Exception {
            // Occurrence count of every feature word within every category of the training samples
            Map<String, Double> cateWordsProb = getCateWordsProb(trainDir);
            // Total word count of every category of the training set
            Map<String, Double> cateWordsNum = getCateWordsNum(trainDir);
            double totalWordsNum = 0.0; // total word count of the whole training set
            Set<Map.Entry<String, Double>> cateWordsNumSet = cateWordsNum.entrySet();
            for (Iterator<Map.Entry<String, Double>> it = cateWordsNumSet.iterator(); it.hasNext();) {
                Map.Entry<String, Double> me = it.next();
                totalWordsNum += me.getValue();
            }
            // Read the test samples and classify them
            Vector<String> testFileWords = new Vector<String>();
            String word;
            File[] testDirFiles = new File(testDir).listFiles();
            FileWriter crWriter = new FileWriter(classifyResultFileNew);
            for (int i = 0; i < testDirFiles.length; i++) {
                File[] testSample = testDirFiles[i].listFiles();
                for (int j = 0; j < testSample.length; j++) {
                    testFileWords.clear();
                    FileReader spReader = new FileReader(testSample[j]);
                    BufferedReader spBR = new BufferedReader(spReader);
                    while ((word = spBR.readLine()) != null) {
                        testFileWords.add(word);
                    }
                    spBR.close(); // the original listing left this reader open
                    // Compute the probability of this test sample for each of the twenty categories
                    File[] trainDirFiles = new File(trainDir).listFiles();
                    BigDecimal maxP = new BigDecimal(0);
                    String bestCate = null;
                    for (int k = 0; k < trainDirFiles.length; k++) {
                        BigDecimal p = computeCateProb(trainDirFiles[k], testFileWords, cateWordsNum, totalWordsNum, cateWordsProb);
                        if (k == 0) {
                            maxP = p;
                            bestCate = trainDirFiles[k].getName();
                            continue;
                        }
                        if (p.compareTo(maxP) > 0) {
                            maxP = p;
                            bestCate = trainDirFiles[k].getName();
                        }
                    }
                    // One "file name, predicted category" line per test sample
                    crWriter.append(testSample[j].getName() + " " + bestCate + "\n");
                    crWriter.flush();
                }
            }
            crWriter.close();
        }
  
        /**
         * Counts the occurrences of every word within every category of the training samples.
         * @param strDir directory of the training samples
         * @return a map keyed by "category_word" whose value is the number of occurrences of that word in that category
         * @throws IOException
         */
        public Map<String, Double> getCateWordsProb(String strDir) throws IOException {
            Map<String, Double> cateWordsProb = new TreeMap<String, Double>();
            File sampleFile = new File(strDir);
            File[] sampleDir = sampleFile.listFiles();
            String word;
            for (int i = 0; i < sampleDir.length; i++) {
                File[] sample = sampleDir[i].listFiles();
                for (int j = 0; j < sample.length; j++) {
                    FileReader samReader = new FileReader(sample[j]);
                    BufferedReader samBR = new BufferedReader(samReader);
                    while ((word = samBR.readLine()) != null) {
                        String key = sampleDir[i].getName() + "_" + word;
                        if (cateWordsProb.containsKey(key)) {
                            double count = cateWordsProb.get(key) + 1.0;
                            cateWordsProb.put(key, count);
                        } else {
                            cateWordsProb.put(key, 1.0);
                        }
                    }
                    samBR.close(); // the original listing left this reader open
                }
            }
            return cateWordsProb;
        }
  
        /**
         * Computes the probability that one test sample belongs to one category.
         * @param trainFile directory holding all training samples of the category
         * @param testFileWords all words of the test sample
         * @param cateWordsNum total word count of every category
         * @param totalWordsNum total word count of all training samples
         * @param cateWordsProb words seen in every category and their occurrence counts
         * @return the probability of the test sample under the category
         * @throws Exception
         */
        private BigDecimal computeCateProb(File trainFile, Vector<String> testFileWords, Map<String, Double> cateWordsNum, double totalWordsNum, Map<String, Double> cateWordsProb) throws Exception {
            BigDecimal probability = new BigDecimal(1);
            double wordNumInCate = cateWordsNum.get(trainFile.getName());
            BigDecimal wordNumInCateBD = new BigDecimal(wordNumInCate);
            BigDecimal totalWordsNumBD = new BigDecimal(totalWordsNum);
            for (Iterator<String> it = testFileWords.iterator(); it.hasNext();) {
                String me = it.next();
                String key = trainFile.getName() + "_" + me;
                double testFileWordNumInCate;
                if (cateWordsProb.containsKey(key)) {
                    testFileWordNumInCate = cateWordsProb.get(key);
                } else {
                    testFileWordNumInCate = 0.0;
                }
                BigDecimal testFileWordNumInCateBD = new BigDecimal(testFileWordNumInCate);
                // Smoothed class-conditional probability. Note that this line adds 0.0001 to the count and divides by
                // (total words in the training set + words in the category); that differs from the
                // "+1 over (words in the category + |V|)" formula stated for the multinomial model.
                BigDecimal xcProb = (testFileWordNumInCateBD.add(new BigDecimal(0.0001))).divide(totalWordsNumBD.add(wordNumInCateBD), 10, BigDecimal.ROUND_CEILING);
                probability = probability.multiply(xcProb);
            }
            // Multiply by the prior P(c) = words in the category / words in the whole training set
            BigDecimal res = probability.multiply(wordNumInCateBD.divide(totalWordsNumBD, 10, BigDecimal.ROUND_CEILING));
            return res;
        }
  
        /**
         * Gets the total word count of every category.
         * @param trainDir directory of the training documents
         * @return a map from category directory name to total word count
         * @throws IOException
         */
        private Map<String, Double> getCateWordsNum(String trainDir) throws IOException {
            Map<String, Double> cateWordsNum = new TreeMap<String, Double>();
            File[] sampleDir = new File(trainDir).listFiles();
            for (int i = 0; i < sampleDir.length; i++) {
                double count = 0;
                File[] sample = sampleDir[i].listFiles();
                for (int j = 0; j < sample.length; j++) {
                    FileReader spReader = new FileReader(sample[j]);
                    BufferedReader spBR = new BufferedReader(spReader);
                    while (spBR.readLine() != null) { // each line holds one word
                        count++;
                    }
                    spBR.close(); // the original listing left this reader open
                }
                cateWordsNum.put(sampleDir[i].getName(), count);
            }
            return cateWordsNum;
        }
  
        /**
         * Computes the accuracy from the correct-category file and the classification-result file.
         * @param classifyResultFile the correct-category file
         * @param classifyResultFileNew the classification-result file
         * @return the classification accuracy
         * @throws IOException
         */
        double computeAccuracy(String classifyResultFile,
                String classifyResultFileNew) throws IOException {
            Map<String, String> rightCate = getMapFromResultFile(classifyResultFile);
            Map<String, String> resultCate = getMapFromResultFile(classifyResultFileNew);
            Set<Map.Entry<String, String>> resCateSet = resultCate.entrySet();
            double rightCount = 0.0;
            for (Iterator<Map.Entry<String, String>> it = resCateSet.iterator(); it.hasNext();) {
                Map.Entry<String, String> me = it.next();
                if (me.getValue().equals(rightCate.get(me.getKey()))) {
                    rightCount++;
                }
            }
            computerConfusionMatrix(rightCate, resultCate);
            return rightCount / resultCate.size();
        }
  
        /**
         * Computes the confusion matrix from the correct categories and the classification results, then prints it.
         * @param rightCate map from file name to correct category
         * @param resultCate map from file name to predicted category
         */
        private void computerConfusionMatrix(Map<String, String> rightCate,
                Map<String, String> resultCate) {
            int[][] confusionMatrix = new int[20][20];
            // First work out the array index of every category
            SortedSet<String> cateNames = new TreeSet<String>();
            Set<Map.Entry<String, String>> rightCateSet = rightCate.entrySet();
            for (Iterator<Map.Entry<String, String>> it = rightCateSet.iterator(); it.hasNext();) {
                Map.Entry<String, String> me = it.next();
                cateNames.add(me.getValue());
            }
            cateNames.add("rec.sport.baseball"); // guards against the set coming up one category short
            String[] cateNamesArray = cateNames.toArray(new String[0]);
            Map<String, Integer> cateNamesToIndex = new TreeMap<String, Integer>();
            for (int i = 0; i < cateNamesArray.length; i++) {
                cateNamesToIndex.put(cateNamesArray[i], i);
            }
            // Rows are correct categories, columns are predicted categories
            for (Iterator<Map.Entry<String, String>> it = rightCateSet.iterator(); it.hasNext();) {
                Map.Entry<String, String> me = it.next();
                confusionMatrix[cateNamesToIndex.get(me.getValue())][cateNamesToIndex.get(resultCate.get(me.getKey()))]++;
            }
            // Print the confusion matrix
            double[] hangSum = new double[20]; // row sums
            System.out.print("    ");
            for (int i = 0; i < 20; i++) {
                System.out.print(i + "    ");
            }
            System.out.println();
            for (int i = 0; i < 20; i++) {
                System.out.print(i + "    ");
                for (int j = 0; j < 20; j++) {
                    System.out.print(confusionMatrix[i][j] + "    ");
                    hangSum[i] += confusionMatrix[i][j];
                }
                // Fraction of row i that was classified correctly
                System.out.println(confusionMatrix[i][i] / hangSum[i]);
            }
            System.out.println();
        }
  
        /**
         * Reads a map from a classification-result file.
         * @param classifyResultFileNew the category file
         * @return a map from file name to category name
         * @throws IOException
         */
        private Map<String, String> getMapFromResultFile(
                String classifyResultFileNew) throws IOException {
            File crFile = new File(classifyResultFileNew);
            FileReader crReader = new FileReader(crFile);
            BufferedReader crBR = new BufferedReader(crReader);
            Map<String, String> res = new TreeMap<String, String>();
            String[] s;
            String line;
            while ((line = crBR.readLine()) != null) {
                s = line.split(" "); // each line is "file name, space, category name"
                res.put(s[0], s[1]);
            }
            crBR.close(); // the original listing left this reader open
            return res;
        }
  
        /**
         * @param args
         * @throws Exception
         */
        public void NaiveBayesianClassifierMain(String[] args) throws Exception {
            // First create the training set and the test set
            CreateTrainAndTestSample ctt = new CreateTrainAndTestSample();
            NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();
            // From the documents that still contain non-feature words, generate documents that contain
            // only feature words under the processedSampleOnlySpecial directory
            ctt.filterSpecialWords();
            double[] accuracyOfEveryExp = new double[10];
            double accuracyAvg, sum = 0;
            for (int i = 0; i < 10; i++) { // ten cross-validation experiments; the accuracies are averaged below
                String TrainDir = "F:/DataMiningSample/TrainSample" + i;
                String TestDir = "F:/DataMiningSample/TestSample" + i;
                String classifyRightCate = "F:/DataMiningSample/classifyRightCate" + i + ".txt";
                String classifyResultFileNew = "F:/DataMiningSample/classifyResultNew" + i + ".txt";
                ctt.createTestSamples("F:/DataMiningSample/processedSampleOnlySpecial", 0.9, i, classifyRightCate);
                nbClassifier.doProcess(TrainDir, TestDir, classifyResultFileNew);
                accuracyOfEveryExp[i] = nbClassifier.computeAccuracy(classifyRightCate, classifyResultFileNew);
                System.out.println("The accuracy for Naive Bayesian Classifier in " + i + "th Exp is: " + accuracyOfEveryExp[i]);
            }
            for (int i = 0; i < 10; i++) {
                sum += accuracyOfEveryExp[i];
            }
            accuracyAvg = sum / 10;
            System.out.println("The average accuracy for Naive Bayesian Classifier in all Exps is: " + accuracyAvg);
        }
    }

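4. Results of naive Bayes classification on the newsgroup document collection

To make the confusion matrix easier to compute, the categories are numbered as follows: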

0 alt.atheism
1 comp.graphics
2 comp.os.ms-windows.misc
3 comp.sys.ibm.pc.hardware
4 comp.sys.mac.hardware
5 comp.windows.x
6 misc.forsale
7 rec.autos
8 rec.motorcycles
9 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc

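Naive Bayes classification results shown as a confusion matrix, taking the 6th cross-validation experiment as an example; its classification accuracy reached 80.47%.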
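
For readers who want to see how such a matrix is read, here is a small self-contained sketch with three classes and six made-up predictions. `ConfusionMatrixSketch` and its methods are hypothetical names, and the data is invented; rows are correct categories and columns are predicted categories, matching the listing above.

```java
// Hypothetical helper for illustration only; not part of the original project.
class ConfusionMatrixSketch {

    // m[a][p] counts the samples whose correct class is a and whose predicted class is p.
    static int[][] build(int[] actual, int[] predicted, int numClasses) {
        int[][] m = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            m[actual[i]][predicted[i]]++;
        }
        return m;
    }

    // Overall accuracy: the diagonal sum divided by the number of samples.
    static double accuracy(int[][] m) {
        double correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return correct / total;
    }

    // Fraction of class c that was classified correctly (diagonal entry over row sum).
    static double rowAccuracy(int[][] m, int c) {
        double rowSum = 0;
        for (int j = 0; j < m.length; j++) {
            rowSum += m[c][j];
        }
        return m[c][c] / rowSum;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1, 2, 2};
        int[] predicted = {0, 1, 1, 1, 2, 0};
        int[][] m = build(actual, predicted, 3);
        System.out.println(accuracy(m));       // 4 of 6 correct
        System.out.println(rowAccuracy(m, 0)); // prints 0.5
        System.out.println(rowAccuracy(m, 1)); // prints 1.0
    }
}
```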

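Hardware used to run the program: Intel Core 2 Duo CPU T5750 2 GHz, 2 GB RAM. The experimental results are as follows:
Using all 87554 words as feature words: average accuracy over the 10 cross-validation experiments 78.19%, running time 23 min, accuracy range 75.65%-80.47%; the 6th experiment exceeds 80%
Using the 30095 words that occur at least 4 times as feature words: average accuracy over the 10 cross-validation experiments 77.91%, running time 22 min, accuracy range 75.51%-80.26%; the 6th experiment exceeds 80%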