(Update 2012.12.28: for frequently asked questions about downloading and running this project, see the FAQ post "newsgroup18828文本分类器、文本聚类器、关联分析频繁模式挖掘算法的Java实现工程下载及运行FAQ".)
This post covers the following:
Preprocess the newsgroup document collection and extract 30,095 feature words.
Compute the TF*IDF value of every feature word in each document to vectorize the documents; these vectors are used by the KNN algorithm.
Implement newsgroup text classifiers in Java based on the KNN algorithm and the naive Bayes algorithm.
1. Introduction to the Newsgroup document collection
The 20 Newsgroups collection was first gathered by Lang in 1995 and used in [Lang 1995]. It contains roughly 20,000 Usenet documents spread almost evenly over 20 different newsgroups. Apart from the 4.5% of documents that belong to two or more newsgroups, every document belongs to exactly one newsgroup, so the collection is usually treated as a single-label classification problem. It has become a standard document collection for text classification and clustering. Jason Rennie of MIT processed the collection so that every document belongs to only one newsgroup, producing the Newsgroups-18828 version.
2. Preprocessing the Newsgroup documents
Text classification first requires preprocessing the text; the main preprocessing steps (tokenization, stop-word removal and stemming) are implemented in the following class:
package com.pku.yangliu;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;

/**
 * Preprocessing class for the Newsgroups document collection
 */
public class DataPreProcess {

    /** Process the input files
     * @param strDir absolute path of the newsgroup file directory
     * @throws IOException
     */
    public void doProcess(String strDir) throws IOException {
        File fileDir = new File(strDir);
        if (!fileDir.exists()) {
            System.out.println("File not exist: " + strDir);
            return;
        }
        String subStrDir = strDir.substring(strDir.lastIndexOf('/'));
        String dirTarget = strDir + "/../../processedSample_includeNotSpecial" + subStrDir;
        File fileTarget = new File(dirTarget);
        if (!fileTarget.exists()) { // note: the processedSample directory must be created beforehand, otherwise this fails because the parent directory does not exist
            fileTarget.mkdir();
        }
        File[] srcFiles = fileDir.listFiles();
        String[] stemFileNames = new String[srcFiles.length];
        for (int i = 0; i < srcFiles.length; i++) {
            String fileFullName = srcFiles[i].getCanonicalPath();
            String fileShortName = srcFiles[i].getName();
            if (!new File(fileFullName).isDirectory()) { // make sure the child is not a directory; if it is, recurse into it
                System.out.println("Begin preprocess: " + fileFullName);
                StringBuilder stringBuilder = new StringBuilder();
                stringBuilder.append(dirTarget + "/" + fileShortName);
                createProcessFile(fileFullName, stringBuilder.toString());
                stemFileNames[i] = stringBuilder.toString();
            } else {
                fileFullName = fileFullName.replace("\\", "/");
                doProcess(fileFullName);
            }
        }
        // call the stemming algorithm
        if (stemFileNames.length > 0 && stemFileNames[0] != null) {
            Stemmer.porterMain(stemFileNames);
        }
    }

    /** Preprocess one source file and write the target file
     * @param srcDir absolute path of the source file
     * @param targetDir absolute path of the generated target file
     * @throws IOException
     */
    private static void createProcessFile(String srcDir, String targetDir) throws IOException {
        FileReader srcFileReader = new FileReader(srcDir);
        FileReader stopWordsReader = new FileReader("F:/DataMiningSample/stopwords.txt");
        FileWriter targetFileWriter = new FileWriter(targetDir);
        BufferedReader srcFileBR = new BufferedReader(srcFileReader); // decorator pattern
        BufferedReader stopWordsBR = new BufferedReader(stopWordsReader);
        String line, resLine, stopWordsLine;
        // build the stop-word ArrayList from stopWordsBR
        ArrayList<String> stopWordsArray = new ArrayList<String>();
        while ((stopWordsLine = stopWordsBR.readLine()) != null) {
            if (!stopWordsLine.isEmpty()) {
                stopWordsArray.add(stopWordsLine);
            }
        }
        while ((line = srcFileBR.readLine()) != null) {
            resLine = lineProcess(line, stopWordsArray);
            if (!resLine.isEmpty()) {
                // write line by line, one word per line
                String[] tempStr = resLine.split(" ");
                for (int i = 0; i < tempStr.length; i++) {
                    if (!tempStr[i].isEmpty()) {
                        targetFileWriter.append(tempStr[i] + "\n");
                    }
                }
            }
        }
        targetFileWriter.flush();
        targetFileWriter.close();
        srcFileReader.close();
        stopWordsReader.close();
        srcFileBR.close();
        stopWordsBR.close();
    }

    /** Process one line of text: tokenization, stop-word removal and stemming
     * @param line the line to process
     * @param stopWordsArray the stop-word list
     * @return String the processed line, rebuilt from the kept words separated by spaces
     * @throws IOException
     */
    private static String lineProcess(String line, ArrayList<String> stopWordsArray) throws IOException {
        // step 1: English lexical analysis - remove digits, hyphens, punctuation and special characters,
        // and convert all upper-case letters to lower case; a regular expression is used here
        String res[] = line.split("[^a-zA-Z]");
        // be careful: words that contain digits or hyphens in the middle get cut here, but cutting them is acceptable
        String resString = new String();
        // step 2: remove stop words
        // step 3: stemming is done later, after returning
        for (int i = 0; i < res.length; i++) {
            if (!res[i].isEmpty() && !stopWordsArray.contains(res[i].toLowerCase())) {
                resString += " " + res[i].toLowerCase() + " ";
            }
        }
        return resString;
    }

    /**
     * @param args
     * @throws IOException
     */
    public void BPPMain(String[] args) throws IOException {
        DataPreProcess dataPrePro = new DataPreProcess();
        dataPrePro.doProcess("F:/DataMiningSample/orginSample");
    }
}
There are 87,554 words that occur at least once.
There are 36,456 words that occur at least 3 times.
There are 30,095 words that occur at least 4 times.
Feature word selection strategies:
Strategy 1: keep all words as feature words, 87,554 in total.
Strategy 2: keep only the words that occur at least 4 times, 30,095 in total.
Strategy 1 is adopted here; the computation time and average accuracy of the two strategies are compared later.
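The word map used for this filtering is produced by the project's ComputeWordsVector.countWords (called from filterSpecialWords below, code not shown in this section). As a minimal, stand-alone sketch of how the frequency threshold of strategy 2 could be applied (the class FeatureWordFilter and its methods are hypothetical and only for illustration; the directory path follows the layout used elsewhere in this post, one word per line in each preprocessed file):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class FeatureWordFilter {

    /** Recursively count how often each word occurs; preprocessed files hold one word per line. */
    static void countWords(File dir, SortedMap<String, Integer> freq) throws IOException {
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) {
                countWords(f, freq);
            } else {
                BufferedReader br = new BufferedReader(new FileReader(f));
                String word;
                while ((word = br.readLine()) != null) {
                    Integer c = freq.get(word);
                    freq.put(word, c == null ? 1 : c + 1);
                }
                br.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        SortedMap<String, Integer> freq = new TreeMap<String, Integer>();
        countWords(new File("F:/DataMiningSample/processedSample_includeNotSpecial"), freq);
        int kept = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= 4) { // strategy 2: keep only words occurring at least 4 times
                kept++;
            }
        }
        System.out.println("total words: " + freq.size() + ", feature words (>= 4 occurrences): " + kept);
    }
}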
package com.pku.yangliu;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.SortedMap;
import java.util.TreeMap;

/** Creates the training sample set and the test sample set
 *
 */
public class CreateTrainAndTestSample {

    void filterSpecialWords() throws IOException {
        String word;
        ComputeWordsVector cwv = new ComputeWordsVector();
        String fileDir = "F:/DataMiningSample/processedSample_includeNotSpecial";
        SortedMap<String, Double> wordMap = new TreeMap<String, Double>();
        wordMap = cwv.countWords(fileDir, wordMap);
        cwv.printWordMap(wordMap); // write wordMap to a file
        File[] sampleDir = new File(fileDir).listFiles();
        for (int i = 0; i < sampleDir.length; i++) {
            File[] sample = sampleDir[i].listFiles();
            String targetDir = "F:/DataMiningSample/processedSampleOnlySpecial/" + sampleDir[i].getName();
            File targetDirFile = new File(targetDir);
            if (!targetDirFile.exists()) {
                targetDirFile.mkdir();
            }
            for (int j = 0; j < sample.length; j++) {
                String fileShortName = sample[j].getName();
                if (fileShortName.contains("stemed")) {
                    targetDir = "F:/DataMiningSample/processedSampleOnlySpecial/" + sampleDir[i].getName() + "/" + fileShortName.substring(0, 5);
                    FileWriter tgWriter = new FileWriter(targetDir);
                    FileReader samReader = new FileReader(sample[j]);
                    BufferedReader samBR = new BufferedReader(samReader);
                    while ((word = samBR.readLine()) != null) {
                        if (wordMap.containsKey(word)) {
                            tgWriter.append(word + "\n");
                        }
                    }
                    tgWriter.flush();
                    tgWriter.close();
                }
            }
        }
    }

    void createTestSamples(String fileDir, double trainSamplePercent, int indexOfSample, String classifyResultFile) throws IOException {
        String word, targetDir;
        FileWriter crWriter = new FileWriter(classifyResultFile); // file recording the correct category of each test sample
        File[] sampleDir = new File(fileDir).listFiles();
        for (int i = 0; i < sampleDir.length; i++) {
            File[] sample = sampleDir[i].listFiles();
            double testBeginIndex = indexOfSample * (sample.length * (1 - trainSamplePercent)); // index of the first test-sample file
            double testEndIndex = (indexOfSample + 1) * (sample.length * (1 - trainSamplePercent)); // index just past the last test-sample file
            for (int j = 0; j < sample.length; j++) {
                FileReader samReader = new FileReader(sample[j]);
                BufferedReader samBR = new BufferedReader(samReader);
                String fileShortName = sample[j].getName();
                String subFileName = fileShortName;
                if (j > testBeginIndex && j < testEndIndex) {
                    // files whose index falls in this interval become test samples; a "file name category" line
                    // is written per test sample so the accuracy can be computed later
                    targetDir = "F:/DataMiningSample/TestSample" + indexOfSample + "/" + sampleDir[i].getName();
                    crWriter.append(subFileName + " " + sampleDir[i].getName() + "\n");
                } else { // the remaining files become training samples
                    targetDir = "F:/DataMiningSample/TrainSample" + indexOfSample + "/" + sampleDir[i].getName();
                }
                targetDir = targetDir.replace("\\", "/");
                File trainSamFile = new File(targetDir);
                if (!trainSamFile.exists()) {
                    trainSamFile.mkdir();
                }
                targetDir += "/" + subFileName;
                FileWriter tsWriter = new FileWriter(new File(targetDir));
                while ((word = samBR.readLine()) != null) {
                    tsWriter.append(word + "\n");
                }
                tsWriter.flush();
                tsWriter.close();
            }
        }
        crWriter.flush();
        crWriter.close();
    }
}
3. Description and implementation of the naive Bayes algorithm
(1) Multinomial model: word granularity.
Class-conditional probability P(tk|c) = (sum of occurrences of word tk across the documents of class c + 1) / (total number of words in class c + number of distinct feature words in the training set)
Prior probability P(c) = total number of words in class c / total number of words in the whole training set
(2) Bernoulli model: document granularity, i.e. a binomial model. A Bernoulli experiment consists of N independent repeated random trials that only consider whether an event occurs or not, so each word is represented by a Boolean variable.
Class-conditional probability P(tk|c) = (number of documents of class c that contain word tk + 1) / (number of documents of class c + 2) (note: this originally said "words" by mistake and was corrected thanks to a reader's reminder)
Prior probability P(c) = number of documents of class c / total number of documents in the training set
This classifier uses the multinomial model; according to Introduction to Information Retrieval, the multinomial model gives higher accuracy.
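As a minimal illustration of the multinomial formulas above (this is a sketch, not the project's computeCateProb shown below; the class, method and parameter names are hypothetical, and it scores in log space to avoid underflow, whereas the project code multiplies BigDecimal values with a 0.0001 smoothing term):

import java.util.List;
import java.util.Map;

public class MultinomialNBSketch {

    /**
     * Log-probability score of a test document for one class, using the formulas above:
     * P(tk|c) = (count of tk in class c + 1) / (total words in class c + |V|)
     * P(c)    = total words in class c / total words in the training set
     * wordCountInClass maps a word to its occurrence count inside class c.
     */
    static double scoreForClass(List<String> testDocWords,
                                Map<String, Integer> wordCountInClass,
                                long totalWordsInClass,
                                long totalWordsInTrainingSet,
                                long vocabularySize) {
        // log prior P(c)
        double score = Math.log((double) totalWordsInClass / totalWordsInTrainingSet);
        // add log class-conditional probability of every word in the test document
        for (String word : testDocWords) {
            Integer count = wordCountInClass.get(word);
            double tkCount = (count == null) ? 0.0 : count;
            score += Math.log((tkCount + 1.0) / (totalWordsInClass + vocabularySize));
        }
        return score; // the class with the highest score is chosen
    }
}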
Additional implementation details:
(2) Ten classification experiments are run with cross-validation and the accuracy is averaged (see the fold-split sketch after this list).
(3) The confusion matrix is computed from the correct-category file and the classification result file and printed.
(4) A Map<String,Double> cateWordsProb, whose key is "category_word" and whose value is the occurrence count of that word in that category, is used to avoid repeated counting.
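For the cross-validation split in item (2), createTestSamples (shown earlier) selects the test files of fold i purely by file index. A small worked sketch of that index arithmetic (the numbers are hypothetical; with 1,000 files in a category and trainSamplePercent = 0.9, fold 3 covers indices of roughly 300 to 400):

public class FoldSplitSketch {
    public static void main(String[] args) {
        int sampleCount = 1000;          // hypothetical number of files in one category
        double trainSamplePercent = 0.9; // 90% training, 10% test, as in NaiveBayesianClassifierMain below
        for (int fold = 0; fold < 10; fold++) {
            // same formulas as in createTestSamples above
            double testBeginIndex = fold * (sampleCount * (1 - trainSamplePercent));
            double testEndIndex = (fold + 1) * (sampleCount * (1 - trainSamplePercent));
            System.out.println("fold " + fold + ": test file indices in (" + testBeginIndex + ", " + testEndIndex + ")");
        }
    }
}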
package com.pku.yangliu;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.math.BigDecimal;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.Vector;

/** Classifies the newsgroup document set with the naive Bayes algorithm,
 *  averaging the accuracy over ten cross-validation runs.
 *  The multinomial model is used; according to the Stanford "Introduction to Information Retrieval"
 *  slides it is more accurate than the Bernoulli model.
 *  Class-conditional probability P(tk|c) = (occurrences of word tk in the documents of class c + 1) / (total words in class c + |V|)
 */
public class NaiveBayesianClassifier {

    /** Classify the test documents with naive Bayes
     * @param trainDir training document directory
     * @param testDir test document directory
     * @param classifyResultFileNew path of the classification result file
     * @throws Exception
     */
    private void doProcess(String trainDir, String testDir,
            String classifyResultFileNew) throws Exception {
        Map<String, Double> cateWordsNum = new TreeMap<String, Double>(); // total number of words per category in the training set
        Map<String, Double> cateWordsProb = new TreeMap<String, Double>(); // occurrence count of each word in each category of the training set
        cateWordsProb = getCateWordsProb(trainDir);
        cateWordsNum = getCateWordsNum(trainDir);
        double totalWordsNum = 0.0; // total number of words in the whole training set
        Set<Map.Entry<String, Double>> cateWordsNumSet = cateWordsNum.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = cateWordsNumSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            totalWordsNum += me.getValue();
        }
        // read the test samples and classify them
        Vector<String> testFileWords = new Vector<String>();
        String word;
        File[] testDirFiles = new File(testDir).listFiles();
        FileWriter crWriter = new FileWriter(classifyResultFileNew);
        for (int i = 0; i < testDirFiles.length; i++) {
            File[] testSample = testDirFiles[i].listFiles();
            for (int j = 0; j < testSample.length; j++) {
                testFileWords.clear();
                FileReader spReader = new FileReader(testSample[j]);
                BufferedReader spBR = new BufferedReader(spReader);
                while ((word = spBR.readLine()) != null) {
                    testFileWords.add(word);
                }
                // compute the probability of this test sample for each of the twenty categories
                File[] trainDirFiles = new File(trainDir).listFiles();
                BigDecimal maxP = new BigDecimal(0);
                String bestCate = null;
                for (int k = 0; k < trainDirFiles.length; k++) {
                    BigDecimal p = computeCateProb(trainDirFiles[k], testFileWords, cateWordsNum, totalWordsNum, cateWordsProb);
                    if (k == 0) {
                        maxP = p;
                        bestCate = trainDirFiles[k].getName();
                        continue;
                    }
                    if (p.compareTo(maxP) == 1) {
                        maxP = p;
                        bestCate = trainDirFiles[k].getName();
                    }
                }
                crWriter.append(testSample[j].getName() + " " + bestCate + "\n");
                crWriter.flush();
            }
        }
        crWriter.close();
    }

    /** Count the occurrences of every word in every category of the training set
     * @param strDir training sample directory
     * @return Map<String,Double> cateWordsProb, indexed by "category_word" keys; the value is the occurrence count of that word in that category
     * @throws IOException
     */
    public Map<String, Double> getCateWordsProb(String strDir) throws IOException {
        Map<String, Double> cateWordsProb = new TreeMap<String, Double>();
        File sampleFile = new File(strDir);
        File[] sampleDir = sampleFile.listFiles();
        String word;
        for (int i = 0; i < sampleDir.length; i++) {
            File[] sample = sampleDir[i].listFiles();
            for (int j = 0; j < sample.length; j++) {
                FileReader samReader = new FileReader(sample[j]);
                BufferedReader samBR = new BufferedReader(samReader);
                while ((word = samBR.readLine()) != null) {
                    String key = sampleDir[i].getName() + "_" + word;
                    if (cateWordsProb.containsKey(key)) {
                        double count = cateWordsProb.get(key) + 1.0;
                        cateWordsProb.put(key, count);
                    } else {
                        cateWordsProb.put(key, 1.0);
                    }
                }
            }
        }
        return cateWordsProb;
    }

    /** Compute the probability that a test sample belongs to a given category
     * @param trainFile directory that holds all training samples of this category
     * @param testFileWords container with all words of this test sample
     * @param cateWordsNum total number of words per category
     * @param totalWordsNum total number of words in all training samples
     * @param cateWordsProb occurrence count of each word in each category
     * @return BigDecimal the probability of this test sample under this category
     * @throws Exception
     */
    private BigDecimal computeCateProb(File trainFile, Vector<String> testFileWords, Map<String, Double> cateWordsNum, double totalWordsNum, Map<String, Double> cateWordsProb) throws Exception {
        BigDecimal probability = new BigDecimal(1);
        double wordNumInCate = cateWordsNum.get(trainFile.getName());
        BigDecimal wordNumInCateBD = new BigDecimal(wordNumInCate);
        BigDecimal totalWordsNumBD = new BigDecimal(totalWordsNum);
        for (Iterator<String> it = testFileWords.iterator(); it.hasNext();) {
            String me = it.next();
            String key = trainFile.getName() + "_" + me;
            double testFileWordNumInCate;
            if (cateWordsProb.containsKey(key)) {
                testFileWordNumInCate = cateWordsProb.get(key);
            } else {
                testFileWordNumInCate = 0.0;
            }
            BigDecimal testFileWordNumInCateBD = new BigDecimal(testFileWordNumInCate);
            BigDecimal xcProb = (testFileWordNumInCateBD.add(new BigDecimal(0.0001))).divide(totalWordsNumBD.add(wordNumInCateBD), 10, BigDecimal.ROUND_CEILING);
            probability = probability.multiply(xcProb);
        }
        BigDecimal res = probability.multiply(wordNumInCateBD.divide(totalWordsNumBD, 10, BigDecimal.ROUND_CEILING));
        return res;
    }

    /** Get the total number of words in each category
     * @param trainDir training document directory
     * @return Map<String,Double> map of <category name, total word count>
     * @throws IOException
     */
    private Map<String, Double> getCateWordsNum(String trainDir) throws IOException {
        Map<String, Double> cateWordsNum = new TreeMap<String, Double>();
        File[] sampleDir = new File(trainDir).listFiles();
        for (int i = 0; i < sampleDir.length; i++) {
            double count = 0;
            File[] sample = sampleDir[i].listFiles();
            for (int j = 0; j < sample.length; j++) {
                FileReader spReader = new FileReader(sample[j]);
                BufferedReader spBR = new BufferedReader(spReader);
                while (spBR.readLine() != null) {
                    count++;
                }
            }
            cateWordsNum.put(sampleDir[i].getName(), count);
        }
        return cateWordsNum;
    }

    /** Compute the accuracy from the correct-category file and the classification result file
     * @param classifyResultFile correct-category file
     * @param classifyResultFileNew classification result file
     * @return double classification accuracy
     * @throws IOException
     */
    double computeAccuracy(String classifyResultFile,
            String classifyResultFileNew) throws IOException {
        Map<String, String> rightCate = new TreeMap<String, String>();
        Map<String, String> resultCate = new TreeMap<String, String>();
        rightCate = getMapFromResultFile(classifyResultFile);
        resultCate = getMapFromResultFile(classifyResultFileNew);
        Set<Map.Entry<String, String>> resCateSet = resultCate.entrySet();
        double rightCount = 0.0;
        for (Iterator<Map.Entry<String, String>> it = resCateSet.iterator(); it.hasNext();) {
            Map.Entry<String, String> me = it.next();
            if (me.getValue().equals(rightCate.get(me.getKey()))) {
                rightCount++;
            }
        }
        computerConfusionMatrix(rightCate, resultCate);
        return rightCount / resultCate.size();
    }

    /** Compute and print the confusion matrix from the correct categories and the classification results
     * @param rightCate map of correct categories
     * @param resultCate map of classification results
     */
    private void computerConfusionMatrix(Map<String, String> rightCate,
            Map<String, String> resultCate) {
        int[][] confusionMatrix = new int[20][20];
        // first map category names to array indices
        SortedSet<String> cateNames = new TreeSet<String>();
        Set<Map.Entry<String, String>> rightCateSet = rightCate.entrySet();
        for (Iterator<Map.Entry<String, String>> it = rightCateSet.iterator(); it.hasNext();) {
            Map.Entry<String, String> me = it.next();
            cateNames.add(me.getValue());
        }
        cateNames.add("rec.sport.baseball"); // guard against one category being missing
        String[] cateNamesArray = cateNames.toArray(new String[0]);
        Map<String, Integer> cateNamesToIndex = new TreeMap<String, Integer>();
        for (int i = 0; i < cateNamesArray.length; i++) {
            cateNamesToIndex.put(cateNamesArray[i], i);
        }
        for (Iterator<Map.Entry<String, String>> it = rightCateSet.iterator(); it.hasNext();) {
            Map.Entry<String, String> me = it.next();
            confusionMatrix[cateNamesToIndex.get(me.getValue())][cateNamesToIndex.get(resultCate.get(me.getKey()))]++;
        }
        // print the confusion matrix
        double[] hangSum = new double[20];
        System.out.print(" ");
        for (int i = 0; i < 20; i++) {
            System.out.print(i + " ");
        }
        System.out.println();
        for (int i = 0; i < 20; i++) {
            System.out.print(i + " ");
            for (int j = 0; j < 20; j++) {
                System.out.print(confusionMatrix[i][j] + " ");
                hangSum[i] += confusionMatrix[i][j];
            }
            System.out.println(confusionMatrix[i][i] / hangSum[i]);
        }
        System.out.println();
    }

    /** Read a map from a classification result file
     * @param classifyResultFileNew category file
     * @return Map<String,String> map of <file name, category name>
     * @throws IOException
     */
    private Map<String, String> getMapFromResultFile(
            String classifyResultFileNew) throws IOException {
        File crFile = new File(classifyResultFileNew);
        FileReader crReader = new FileReader(crFile);
        BufferedReader crBR = new BufferedReader(crReader);
        Map<String, String> res = new TreeMap<String, String>();
        String[] s;
        String line;
        while ((line = crBR.readLine()) != null) {
            s = line.split(" ");
            res.put(s[0], s[1]);
        }
        return res;
    }

    /**
     * @param args
     * @throws Exception
     */
    public void NaiveBayesianClassifierMain(String[] args) throws Exception {
        // first create the training set and the test set
        CreateTrainAndTestSample ctt = new CreateTrainAndTestSample();
        NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();
        ctt.filterSpecialWords(); // from the document set that still contains non-feature words, generate a document set containing only feature words under processedSampleOnlySpecial
        double[] accuracyOfEveryExp = new double[10];
        double accuracyAvg, sum = 0;
        for (int i = 0; i < 10; i++) { // run ten classification experiments with cross-validation and average the accuracy
            String TrainDir = "F:/DataMiningSample/TrainSample" + i;
            String TestDir = "F:/DataMiningSample/TestSample" + i;
            String classifyRightCate = "F:/DataMiningSample/classifyRightCate" + i + ".txt";
            String classifyResultFileNew = "F:/DataMiningSample/classifyResultNew" + i + ".txt";
            ctt.createTestSamples("F:/DataMiningSample/processedSampleOnlySpecial", 0.9, i, classifyRightCate);
            nbClassifier.doProcess(TrainDir, TestDir, classifyResultFileNew);
            accuracyOfEveryExp[i] = nbClassifier.computeAccuracy(classifyRightCate, classifyResultFileNew);
            System.out.println("The accuracy for Naive Bayesian Classifier in " + i + "th Exp is: " + accuracyOfEveryExp[i]);
        }
        for (int i = 0; i < 10; i++) {
            sum += accuracyOfEveryExp[i];
        }
        accuracyAvg = sum / 10;
        System.out.println("The average accuracy for Naive Bayesian Classifier in all Exps is: " + accuracyAvg);
    }
}
4. Results of classifying the newsgroup document set with the naive Bayes algorithm
To make the confusion matrix easier to read, the categories are numbered as follows:
0 alt.atheism
1 comp.graphics
2 comp.os.ms-windows.misc
3 comp.sys.ibm.pc.hardware
4 comp.sys.mac.hardware
5 comp.windows.x
6 misc.forsale
7 rec.autos
8 rec.motorcycles
9 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc
Using all 87,554 words as feature words: the average accuracy over the ten cross-validation runs is 78.19%, taking 23 min; accuracy ranges from 75.65% to 80.47%, with the 6th experiment exceeding 80%.
Using the 30,095 words that occur at least 4 times as feature words: the average accuracy over the ten cross-validation runs is 77.91%, taking 22 min; accuracy ranges from 75.51% to 80.26%, with the 6th experiment exceeding 80%.