I've recently been building a Weibo sentiment classifier based on the Naive Bayes algorithm; I won't go through the basic derivation of Naive Bayes here. Classification generally involves the following steps:
1 Segment (tokenize) the corpus (the training text)
2 Compute TF-IDF over the segmented text (for an introduction to TF-IDF, see http://blog.csdn.net/yqlakers/article/details/70888897)
3 Use the computed TF-IDF scores to decide the sentiment class (a rough sketch of this step follows the list)
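Before getting to the project code, here is a minimal sketch of what steps 2 and 3 can look like, assuming one training corpus per sentiment class: each class gets a TF-IDF weight per term, and a new post is scored by summing the weights of its tokens for each class, keeping the class with the largest score. The class name TfidfSentimentSketch, the addDocument/classify methods, and the scoring rule are illustrative assumptions, not the project's actual code:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfidfSentimentSketch {

    // per-class term counts: label -> (term -> raw count in that class's corpus)
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    // document frequency: term -> number of training documents containing it
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int totalDocs = 0;

    // Step 1 is assumed to have happened already: each training document
    // arrives as a list of tokens produced by the tokenizer.
    public void addDocument(String label, List<String> tokens) {
        totalDocs++;
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        // document frequency counts each term at most once per document
        for (String t : new HashSet<>(tokens)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    // Step 2: TF-IDF weight of a term inside one class's corpus.
    private double tfidf(String label, String term) {
        Map<String, Integer> counts = termCounts.getOrDefault(label, new HashMap<>());
        int tf = counts.getOrDefault(term, 0);
        int df = docFreq.getOrDefault(term, 0);
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) totalDocs / df);
        return tf * idf;
    }

    // Step 3: score an already-tokenized post against every class and
    // return the label with the highest summed TF-IDF weight.
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : termCounts.keySet()) {
            double score = 0.0;
            for (String t : tokens) {
                score += tfidf(label, t);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}

A full Naive Bayes formulation would replace the summed TF-IDF weights with log class priors plus log term likelihoods, but the "score every class, keep the maximum" structure stays the same.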
Below I go straight into the code for this process:
The tokenizer:
package tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class MMAnalyzer {

    public MMAnalyzer() {
        // TODO Auto-generated constructor stub
    }

    public String segment(String splitText, String str) throws IOException {
        // Load the custom stop-word list, one word per line.
        BufferedReader reader = new BufferedReader(new FileReader(
                new File("F:/Porject/Hadoop project/TDIDF/TFIDF/TFIDF/stop_words.txt")));
        String line = reader.readLine();
        String wordString = "";
        while (line != null) {
            //System.out.println(line);
            wordString = wordString + " " + line;
            line = reader.readLine();
        }
        reader.close();

        String[] self_stop_words = wordString.split(" ");
        //String[] self_stop_words = { "我","你","他","它","她","的", "了", "呢", ",", "0", ":", ",", "是", "流","!"};
        CharArraySet cas = new CharArraySet(Version.LUCENE_46, 0, true);
        for (int i = 0; i < self_stop_words.length; i++) {
            cas.add(self_stop_words[i]);
        }

        @SuppressWarnings("resource")
        SmartChineseAnalyzer sca = new SmartChineseAnalyzer(Version.LUCENE_46, cas);

        // The analyzer produces a stream that holds the segmentation result;
        // the individual token units are read back through the TokenStream.
        TokenStream ts = sca.tokenStream("field", splitText);
        // The text of each token unit.
        //CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
        // Resets this stream to the beginning.
        ts.reset();

        // Walk through all token units. Consumers (e.g. IndexWriter) use
        // incrementToken() to advance the stream to the next token.
        String words = "";
        while (ts.incrementToken()) {
            String word = ts.getAttribute(CharTermAttribute.class).toString();
            System.out.println(word);
            // Completion of the truncated original line: the tokens are assumed
            // to be joined with the separator passed in as `str`.
            words = words + word + str;
        }
        ts.end();
        ts.close();
        return words;
    }
}
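For a quick check of the tokenizer, a hypothetical caller could look like the following; the SegmentDemo class, the sample sentence, and the " | " separator are made up for illustration, and they rely on the assumption above that `str` is used as the token separator:

public class SegmentDemo {
    public static void main(String[] args) throws java.io.IOException {
        MMAnalyzer analyzer = new MMAnalyzer();
        // Segment a sample Weibo post and join the tokens with " | "
        // so the segmentation result is easy to inspect.
        String result = analyzer.segment("今天天气真好,心情很不错!", " | ");
        System.out.println(result);
    }
}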