Text Mining: Implementing a KNN Classifier Based on TF-IDF

I. Project Background

This project is part of a big-data effort for the infrastructure-construction sector. Our crawler team has already collected a large number of texts from public websites: tender and bid-award announcements for infrastructure projects. Each text contains words that, explicitly or implicitly, indicate which engineering category the project belongs to, such as highway engineering, municipal engineering, building construction, rail transit, and so on.

Given these texts, our task is to classify them by engineering category so that they can be delivered to clients in the corresponding industries. The figure below shows a sample of the collected data; we need to classify each project based on its name and description.

Figure 1: Sample of the collected text data

II. Project Implementation

We use the KNN machine-learning algorithm for training and classification. KNN itself is covered in detail in another blog post, "Introduction to the KNN Classification Algorithm." Since KNN operates on numeric vectors, the texts must first be converted into vectors before the algorithm can be trained. The overall approach is outlined below.

First, segment the texts into words and extract the useful terms to build an attribute dictionary. Next, prepare training samples (with known categories) and test samples (to be classified), and compute each document's TF vector against the attribute dictionary to obtain its vector representation. Finally, run the KNN algorithm and measure the classification accuracy.

1. Building the Attribute Dictionary

We take 50,000 collected tender/bid-award web snippets, strip the HTML tags and special characters, and write them into a TXT file. The text is then processed with Hadoop MapReduce: read the file, segment it with the IK Analyzer (backed by a Sogou dictionary), count the frequency of each resulting word, and sort the counts. Words with very high or very low frequency are removed, because they contribute little to the classifier and only add noise and computational cost. The Hadoop Java code is shown below:
(1) Counting word frequencies

import com.rednum.hadoopwork.tools.OperHDFS;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;


//Read the TXT file, segment the text and count word frequencies

public class WordCountJob extends Configured implements Tool {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IKTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringReader reader = new StringReader(value.toString());
            IKSegmentation ik = new IKSegmentation(reader, true);// true: use maximum word-length (coarse-grained) segmentation
            Lexeme lexeme = null;

            while ((lexeme = ik.next()) != null) {
                word.set(lexeme.getLexemeText() + ":"); // append ":" as a separator so the sort job can split word and count
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    @Override
    public int run(String[] strings) throws Exception {
        OperHDFS hdfs = new OperHDFS();

        hdfs.deleteFile("hdfs://192.168.1.108:9001/user/hadoop/hotwords/", "hdfs://192.168.1.108:9001/user/hadoop/hotwords/output");
        hdfs.deleteFile("hdfs://192.168.1.108:9001/user/hadoop/hotwords/", "hdfs://192.168.1.108:9001/user/hadoop/hotwords/sort");
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(IKTokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/train.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/output"));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new WordCountJob(), args);
        ToolRunner.run(new SortDscWordCountMRJob(), args);
    }
}

(2) Sorting by word frequency

import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class SortDscWordCountMRJob extends Configured implements Tool {

    public static class SortDscWordCountMapper extends Mapper<LongWritable, Text, INTDoWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] contents = value.toString().split(":");
            String wc = contents[1].trim();
            String wd = contents[0].trim();
            INTDoWritable iw = new INTDoWritable();
            try {
                iw.num = new IntWritable(Integer.parseInt(wc));
                context.write(iw, new Text(wd));
            } catch (Exception e) {
                System.out.println(e);
            }

        }
    }

    public static class SortDscWordCountReducer extends Reducer<INTDoWritable, Text, NullWritable, Text> {
        public void reduce(INTDoWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text text : values) {
                // emit "word: count" lines with no key
                context.write(NullWritable.get(), new Text(text.toString() + ": " + key.num.get()));
            }
        }

    }


    @Override
    public int run(String[] allArgs) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(SortDscWordCountMRJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(INTDoWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(SortDscWordCountMapper.class);
        job.setReducerClass(SortDscWordCountReducer.class);


        String[] args = new GenericOptionsParser(getConf(), allArgs).getRemainingArgs();
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/output"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/sort"));
        job.waitForCompletion(true);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SortDscWordCountMRJob(), args);
    }
}
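
The sort job keys its map output on INTDoWritable, a small custom writable whose source is not shown above. A minimal sketch of it, assuming it simply wraps the count and reverses the comparison so that the shuffle phase sorts counts in descending order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

//Hypothetical reconstruction of INTDoWritable: wraps the word count and
//sorts in descending order so the most frequent words come first.
public class INTDoWritable implements WritableComparable<INTDoWritable> {

    public IntWritable num = new IntWritable();

    @Override
    public void write(DataOutput out) throws IOException {
        num.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        num.readFields(in);
    }

    @Override
    public int compareTo(INTDoWritable other) {
        // reversed comparison => descending sort in the shuffle phase
        return other.num.compareTo(this.num);
    }

    @Override
    public int hashCode() {
        return num.get();
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof INTDoWritable) && num.equals(((INTDoWritable) o).num);
    }
}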

The word-frequency counting and sorting results are shown below:

Figure 2: Word-frequency statistics

Based on the sorted word-frequency file, we remove hot words, stop words, and rarely occurring words; the remaining terms form the attribute dictionary. In this project I kept a little over 3,000 words.
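
A minimal sketch of this filtering step, assuming the sorted output lines look like "word: count" and using purely illustrative frequency thresholds and hypothetical file names (stopwords.txt, sorted_wordcount.txt):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

//Hypothetical dictionary-building step: keep words whose frequency falls
//between two thresholds and that are not stop words.
public class BuildDictionary {

    public static void main(String[] args) throws IOException {
        int minCount = 10;      // illustrative lower threshold
        int maxCount = 20000;   // illustrative upper threshold (drops "hot" words)

        Set<String> stopWords = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("D:\\DataMining\\Title\\stopwords.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                stopWords.add(line.trim());
            }
        }

        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("D:\\DataMining\\Title\\sorted_wordcount.txt"), StandardCharsets.UTF_8));
             FileWriter fw = new FileWriter("D:\\DataMining\\Title\\labeldict.txt")) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(":");   // "word: count"
                if (parts.length != 2) {
                    continue;
                }
                String word = parts[0].trim();
                int count = Integer.parseInt(parts[1].trim());
                if (count >= minCount && count <= maxCount && !stopWords.contains(word)) {
                    fw.write(word + "\n");
                }
            }
        }
    }
}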

2. Computing the TF-IDF Text Vectors

To run KNN we first need a vector representation of each document.
For every feature word we compute TF*IDF; a document's vector consists of the TF*IDF values of all feature words, one dimension per feature word.
TF (term frequency) and IDF (inverse document frequency) are computed as follows:

TF_ij = n_ij / N_j, where n_ij is the number of occurrences of feature word i in document j and N_j is the total number of dictionary words in document j;
IDF_i = log10(N / df_i), where N is the number of documents in the corpus and df_i is the number of documents containing word i;
and the weight of word i in document j is W_ij = TF_ij × IDF_i.
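
As a quick sanity check on the formulas (the numbers are purely illustrative): if a dictionary word occurs 3 times in a document containing 100 dictionary words, and appears in 500 of 20,000 documents, then TF = 3 / 100 = 0.03, IDF = log10(20000 / 500) = log10(40) ≈ 1.60, and W = 0.03 × 1.60 ≈ 0.048.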

To obtain W_ij we therefore need both TF and IDF. Below is the Java code that computes the IDF values:

//Compute the IDF values
Map<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
IDFPerWordMap = computeIDF(text, wordMap);

Note: computeIDF takes the text data and the attribute dictionary as parameters; text and wordMap are obtained by the following code:

            //Build the feature-word dictionary wordMap
            Map<String, Double> wordMap = new TreeMap<>();
            String path = "D:\\DataMining\\Title\\labeldict.txt";
            wordMap = countWords(path, wordMap);

            //Get the file to read
            File readFile = new File("D:\\DataMining\\Title\\train.txt");
            //Input stream declarations
            InputStream in = null;
            InputStreamReader ir = null;
            BufferedReader br = null;
            in = new BufferedInputStream(new FileInputStream(readFile));
            //Read as UTF-8 (the file's encoding), otherwise Chinese characters will come out garbled
            ir = new InputStreamReader(in, "utf-8");
            //Wrap in a BufferedReader so the text can be read line by line
            br = new BufferedReader(ir);
            String line = "";
            List<HashMap<String, Object>> text = new ArrayList<>();
            //Read line by line
            while ((line = br.readLine()) != null) {
                HashMap<String, Object> map = new HashMap<>();
                String[] words = line.split("@&");
                String pro = "";
                String info = "";
                if (words.length == 2) {
                    pro = words[0];
                    info = words[1];
                    StringReader reader = new StringReader(info);
                    IKSegmentation ik = new IKSegmentation(reader, true);// true: use maximum word-length (coarse-grained) segmentation
                    Lexeme lexeme = null;
                    List<String> word = new ArrayList<>();
                    while ((lexeme = ik.next()) != null) {
                        String key = lexeme.getLexemeText();
                        word.add(key);
                    }
                    map.put(pro, word);
                    text.add(map);
                }
            }

            //Compute the IDF values
            Map<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
            IDFPerWordMap = computeIDF(text, wordMap);
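
The helper countWords simply loads the attribute dictionary labeldict.txt into wordMap. A minimal sketch, assuming the dictionary file holds one word per line (optionally followed by ": count") and that the method lives in the same class as the snippet above, reusing its imports:

    /**
     * Hypothetical sketch of countWords: load the attribute dictionary, one word per line,
     * into a map whose value (initialised to 0.0) can later hold per-word statistics.
     */
    public static Map<String, Double> countWords(String dictPath, Map<String, Double> wordMap) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(dictPath), "utf-8"));
        String line;
        while ((line = br.readLine()) != null) {
            // assume "word" or "word: count" per line; keep only the word itself
            String word = line.split(":")[0].trim();
            if (!word.isEmpty()) {
                wordMap.put(word, 0.0);
            }
        }
        br.close();
        return wordMap;
    }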

Below is the implementation of computeIDF:

 /**
     * Compute IDF, i.e. count in how many documents each word of the attribute dictionary appears.
     *
     * @param words all samples (each map holds a category label mapped to the document's word list)
     * @param wordMap the attribute dictionary
     * @return a map from each dictionary word to its IDF value
     * @throws IOException
     */
    public static SortedMap<String, Double> computeIDF(List<HashMap<String, Object>> words, Map<String, Double> wordMap) throws IOException {

        SortedMap<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
        Set<Map.Entry<String, Double>> wordMapSet = wordMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> pt = wordMapSet.iterator(); pt.hasNext();) {
            Map.Entry<String, Double> pe = pt.next();
            Double coutDoc = 0.0;
            String dicWord = pe.getKey();

            for (HashMap<String, Object> word : words) {
                // check whether the dictionary word appears in this document
                boolean isExited = false;
                Object[] partword = word.values().toArray();
                for (Object keyword : partword) {
                    List<String> list = (List) keyword;
                    for (String key : list) {
                        if (!key.isEmpty() && key.equals(dicWord)) {
                            isExited = true;
                            break;
                        }
                    }
                    if (isExited) {
                        break;
                    }
                }
                if (isExited) {
                    coutDoc++;
                }
            }
            if (coutDoc == 0.0) {
                coutDoc = 1.0;
            }
            //Compute the word's IDF: log10(N / df), with N = 20000 documents here
            Double IDF = Math.log(20000 / coutDoc) / Math.log(10);
            IDFPerWordMap.put(dicWord, IDF);
        }
        return IDFPerWordMap;
    }

With the IDF values in hand, we compute TF and obtain the text vectors.
The function interface is:

computeTFMultiIDF(text, 0.9, IDFPerWordMap, wordMap);

Note: text is the list of all segmented samples, and 0.9 is the fraction of the data used for training, i.e. 90% of the samples are used for training and 10% for testing. The resulting TF-IDF vectors are written to Train.txt and Test.txt respectively.

 /**
     * Compute each document's TF-IDF attribute vector. A plain loop over the documents
     * is sufficient; no recursion is needed.
     *
     * @param words all segmented samples
     * @param trainSamplePercent fraction of the samples used as the training set
     * @param iDFPerWordMap IDF value of each dictionary word
     * @param wordMap the attribute dictionary
     * @throws IOException
     */
    public static void computeTFMultiIDF(List<HashMap<String, Object>> words, double trainSamplePercent, Map<String, Double> iDFPerWordMap, Map<String, Double> wordMap) throws IOException {
        SortedMap<String, Double> TFPerDocMap = new TreeMap<String, Double>();
        //Use two writers: one for the training samples and one for the test samples
        String trainFileDir = "D:\\DataMining\\Title\\Train.txt";
        String testFileDir = "D:\\DataMining\\Title\\Test.txt";
        FileWriter tsTrainWriter = new FileWriter(new File(trainFileDir));
        FileWriter tsTestWrtier = new FileWriter(new File(testFileDir));
        FileWriter tsWriter = tsTrainWriter;
        int index = 0;
        for (HashMap<String, Object> word : words) {
            index++;
            TFPerDocMap.clear();//reset the per-document term-frequency map
            Object[] partword = word.values().toArray();
            Double wordSumPerDoc = 0.0;//total number of dictionary words in this document
            for (Object keyword : partword) {
                List<String> list = (List) keyword;
                for (String key : list) {
                    if (!key.isEmpty() && wordMap.containsKey(key)) {//必须是属性词典里面的词,去掉的词不考虑  
                        wordSumPerDoc++;
                        if (TFPerDocMap.containsKey(key)) {
                            Double count = TFPerDocMap.get(key);
                            TFPerDocMap.put(key, count + 1);
                        } else {
                            TFPerDocMap.put(key, 1.0);
                        }
                    }

                }
            }

            if (index >= 1 && index <= trainSamplePercent * words.size()) {
                tsWriter = tsTrainWriter;
            } else {
                tsWriter = tsTestWrtier;
            }

            Double wordWeight;
            Set<Map.Entry<String, Double>> tempTF = TFPerDocMap.entrySet();
            for (Iterator<Map.Entry<String, Double>> mt = tempTF.iterator(); mt.hasNext();) {
                Map.Entry<String, Double> me = mt.next();
                wordWeight = (me.getValue() / wordSumPerDoc) * iDFPerWordMap.get(me.getKey());
                //Alternatively IDF can be fixed to 1 here; an improved IDF computation is covered in my k-means clustering post
                //wordWeight = (me.getValue() / wordSumPerDoc) * 1.0;
                TFPerDocMap.put(me.getKey(), wordWeight);
            }
            Set<String> keyWord = word.keySet();
            for (String label : keyWord) {
                tsWriter.append(label + " ");
            }

            Set<Map.Entry<String, Double>> tempTF2 = TFPerDocMap.entrySet();
            for (Iterator<Map.Entry<String, Double>> mt = tempTF2.iterator(); mt.hasNext();) {
                Map.Entry<String, Double> ne = mt.next();
                tsWriter.append(ne.getKey() + " " + ne.getValue() + " ");
            }
            tsWriter.append("\n");
            tsWriter.flush();
        }

        tsTrainWriter.close();
        tsTestWrtier.close();
        tsWriter.close();
    }
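
Each line written to Train.txt / Test.txt therefore begins with the category label, followed by word/weight pairs. Purely as an illustration (the words and weights below are made up), a line might look like:

公路工程 路面 0.0123 桥梁 0.0087 沥青 0.0045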

At this point the training and test samples are ready and have been converted into text vectors, so more than half of the work is done. Next we train with the KNN algorithm.

3. Training the Classifier, Classifying the Test Texts, and Computing the Accuracy

KNN algorithm workflow:

Step 1: From the training set, select the K texts most similar to the new text. Similarity is measured by the cosine of the angle between the two document vectors:

sim(d_i, d_j) = cosθ = ( Σ_k W_ik · W_jk ) / ( sqrt(Σ_k W_ik²) · sqrt(Σ_k W_jk²) )

The value of K is usually determined by starting from an initial guess and adjusting it according to the test results; in this project K = 15.

Step 2: Among the K neighbors of the new text, compute a weight for each category, equal to the sum of the similarities between the test sample and the neighbors belonging to that category.

Step 3: Compare the category weights and assign the text to the category with the largest weight.

Below is the Java code implementing the KNN step:

        String train = "D:\\DataMining\\Title\\Train.txt";
        String test = "D:\\DataMining\\Title\\Test.txt";
        String result = "D:\\DataMining\\Title\\result.txt";
        double classify = doProcess(train, test, result);
        System.out.print(classify);

Note: the data are read from the training and test files, the classifier is run, and the final classification results are saved to result.txt. doProcess is implemented as follows:

 public static double doProcess(String trainFiles, String testFiles,
            String kNNResultFile) throws IOException {
        //First read the training and test samples into map<String, map<word, TF>> structures; the training samples' category labels must be kept as well.
        //Then, for each test sample, compute its similarity to every training sample and store the results in a sorted map<String, double>.
        //Take the top K samples, accumulate a weight score per category over those K samples, and assign the test sample to the category with the highest total score.
        //K can be tuned by repeated tests to find the value with the best accuracy.
        //!Use "category_filename" as each file's key to avoid clashes between files with the same name but different content.
        //!Remember to set the JVM parameters, otherwise a Java heap overflow error may occur.
        //!This program measures similarity with the cosine of the vector angle.
        System.out.println("开始训练模型:");
        File trainSamples = new File(trainFiles);
        BufferedReader trainSamplesBR = new BufferedReader(new FileReader(trainSamples));
        String line;
        String[] lineSplitBlock;
        Map<String, TreeMap<String, Double>> trainFileNameWordTFMap = new TreeMap<String, TreeMap<String, Double>>();
        TreeMap<String, Double> trainWordTFMap = new TreeMap<String, Double>();
        int index1 = 0;
        while ((line = trainSamplesBR.readLine()) != null) {
            index1++;
            lineSplitBlock = line.split(" ");
            trainWordTFMap.clear();
            for (int i = 1; i < lineSplitBlock.length; i = i + 2) {
                trainWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            tempMap.putAll(trainWordTFMap);
            trainFileNameWordTFMap.put(lineSplitBlock[0] + "_" + index1, tempMap);
        }
        trainSamplesBR.close();

        File testSamples = new File(testFiles);
        BufferedReader testSamplesBR = new BufferedReader(new FileReader(testSamples));
        Map<String, Map<String, Double>> testFileNameWordTFMap = new TreeMap<String, Map<String, Double>>();
        Map<String, String> testClassifyCateMap = new TreeMap<String, String>();//<filename, category> pairs produced by classification
        Map<String, Double> testWordTFMap = new TreeMap<String, Double>();
        int index = 0;
        while ((line = testSamplesBR.readLine()) != null) {
            index++;
            lineSplitBlock = line.split(" ");
            testWordTFMap.clear();
            for (int i = 1; i < lineSplitBlock.length; i = i + 2) {
                testWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            tempMap.putAll(testWordTFMap);
            testFileNameWordTFMap.put(lineSplitBlock[0] + "_" + index, tempMap);
        }
        testSamplesBR.close();
        //Iterate over each test sample, compute its similarity to all training samples, and classify it
        String classifyResult;
        FileWriter testYangliuWriter = new FileWriter(new File("D:\\DataMining\\Title\\yangliuTest.txt"));
        FileWriter KNNClassifyResWriter = new FileWriter(kNNResultFile);
        Set<Map.Entry<String, Map<String, Double>>> testFileNameWordTFMapSet = testFileNameWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Map<String, Double>>> it = testFileNameWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Map<String, Double>> me = it.next();
            classifyResult = KNNComputeCate(me.getKey(), me.getValue(), trainFileNameWordTFMap, testYangliuWriter);
            System.out.println("分类结果为:"+ classifyResult+";正确结果为:"+me.getKey());
            KNNClassifyResWriter.append(me.getKey() + " " + classifyResult + "\n");
            KNNClassifyResWriter.flush();
            testClassifyCateMap.put(me.getKey(), classifyResult);
        }
        KNNClassifyResWriter.close();
        //Compute the classification accuracy
        double righteCount = 0;
        Set<Map.Entry<String, String>> testClassifyCateMapSet = testClassifyCateMap.entrySet();
        for (Iterator<Map.Entry<String, String>> it = testClassifyCateMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, String> me = it.next();
            String rightCate = me.getKey().split("_")[0];
            if (me.getValue().equals(rightCate)) {
                righteCount++;
            }
        }
        testYangliuWriter.close();
        return righteCount / testClassifyCateMap.size();
    }

    /**
     * For each test sample, compute the cosine similarity to every training sample and
     * store the results in a sorted map<String, double>. Take the top K samples, accumulate
     * a weight score for each of their categories, and return the category with the highest
     * total score. K can be tuned by repeated tests to find the value with the best accuracy.
     *
     * @param testFileName key ("category_filename") of the current test file
     * @param testWordTFMap the <word, weight> vector of the current test file
     * @param trainFileNameWordTFMap the training samples as a <category_filename, vector> map
     * @param testYangliuWriter debug output writer
     * @return String the category with the highest weight score among the K neighbors
     * @throws IOException
     */
    public static String KNNComputeCate(
            String testFileName,
            Map<String, Double> testWordTFMap,
            Map<String, TreeMap<String, Double>> trainFileNameWordTFMap, FileWriter testYangliuWriter) throws IOException {
        HashMap<String, Double> simMap = new HashMap<String, Double>();//<category_filename, similarity>; sorted by value below
        double similarity;
        Set<Map.Entry<String, TreeMap<String, Double>>> trainFileNameWordTFMapSet = trainFileNameWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, TreeMap<String, Double>>> it = trainFileNameWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, TreeMap<String, Double>> me = it.next();
            similarity = computeSim(testWordTFMap, me.getValue());
            simMap.put(me.getKey(), similarity);
        }
        //Sort simMap by value (descending similarity)
        ByValueComparator bvc = new ByValueComparator(simMap);
        TreeMap<String, Double> sortedSimMap = new TreeMap<String, Double>(bvc);
        sortedSimMap.putAll(simMap);

        //Take the K nearest training samples from the sorted map and sum the similarities per category; K was chosen by repeated experiments
        Map<String, Double> cateSimMap = new TreeMap<String, Double>();//per-category sum of similarities over the K nearest training samples
        int K = 15;
        int count = 0;
        double tempSim;

        Set<Map.Entry<String, Double>> simMapSet = sortedSimMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = simMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            count++;
            String categoryName = me.getKey().split("_")[0];
            if (cateSimMap.containsKey(categoryName)) {
                tempSim = cateSimMap.get(categoryName);
                cateSimMap.put(categoryName, tempSim + me.getValue());
            } else {
                cateSimMap.put(categoryName, me.getValue());
            }
            if (count >= K) {
                break;
            }
        }
        //Find the category with the largest summed similarity in cateSimMap
        //testYangliuWriter.flush();  
        //testYangliuWriter.close();  
        double maxSim = 0;
        String bestCate = null;
        Set<Map.Entry<String, Double>> cateSimMapSet = cateSimMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = cateSimMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            if (me.getValue() > maxSim) {
                bestCate = me.getKey();
                maxSim = me.getValue();
            }
        }
        return bestCate;
    }

    /**
     * Compute the similarity between a test sample vector and a training sample vector.
     *
     * @param testWordTFMap the <word, weight> vector of the current test file
     * @param trainWordTFMap the <word, weight> vector of the current training sample
     * @return Double the similarity between the two vectors, measured as the cosine of their angle
     * @throws IOException
     */
    public static double computeSim(Map<String, Double> testWordTFMap,
            Map<String, Double> trainWordTFMap) {
        double mul = 0, testAbs = 0, trainAbs = 0;
        Set<Map.Entry<String, Double>> testWordTFMapSet = testWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = testWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            if (trainWordTFMap.containsKey(me.getKey())) {
                mul += me.getValue() * trainWordTFMap.get(me.getKey());
            }
            testAbs += me.getValue() * me.getValue();
        }
        testAbs = Math.sqrt(testAbs);

        Set<Map.Entry<String, Double>> trainWordTFMapSet = trainWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = trainWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            trainAbs += me.getValue() * me.getValue();
        }
        trainAbs = Math.sqrt(trainAbs);
        return mul / (testAbs * trainAbs);
    }
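
ByValueComparator, used above to sort simMap, is not shown either. A minimal sketch, assuming it simply orders keys by their similarity value in descending order:

import java.util.Comparator;
import java.util.HashMap;

//Hypothetical reconstruction of ByValueComparator: orders map keys by their
//similarity value, largest first, so sortedSimMap puts the nearest samples first.
public class ByValueComparator implements Comparator<String> {

    private final HashMap<String, Double> baseMap;

    public ByValueComparator(HashMap<String, Double> baseMap) {
        this.baseMap = baseMap;
    }

    @Override
    public int compare(String key1, String key2) {
        //descending by value; never return 0 for distinct keys, otherwise the
        //TreeMap would silently drop entries that happen to have equal similarity
        if (baseMap.get(key1) >= baseMap.get(key2)) {
            return -1;
        } else {
            return 1;
        }
    }
}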

III. Project Summary

Because there are quite a few categories (13 in total), the KNN classification accuracy suffers. In addition, this part of the pipeline is not distributed and therefore rather slow; a next step is to move it onto the Hadoop framework. This is only a first attempt: a sample of the classification results is shown below, the final accuracy is around 70%, and the algorithm still needs improvement. This is my first step into text mining, and corrections and discussion from interested readers are very welcome.
