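LDA (Latent Dirichlet Allocation) is a model for extracting topics from documents, also described as a three-layer Bayesian model over words, topics, and documents. It infers latent categories from the word -> topic -> document membership probabilities and belongs to unsupervised machine learning. In practice the results depend heavily on the quality of the training samples, and the target categories should be neither too many nor too fine-grained, otherwise quality degrades far faster than expected. An open-source supervised topic-extraction model from Facebook gives better results, but that is a topic for a later post.
The core implementation borrows an existing Gibbs sampling algorithm; what is left to do yourself is the input, the output, and preparing the training samples. Loading the samples takes two parameters: cps, a two-dimensional array in which each row is a document, and size, an integer giving the vocabulary size. The number of target topics can be set as needed, the topic-probability and word-probability settings can stay at their defaults, and once training finishes the topic words are extracted from the model.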
Building the vocabulary:
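import java.io.*;
import java.util.*;

// Reads every file in directory f as one document of space-separated tokens.
public static XxCorpus loadSample(String f) throws IOException {
    File[] fds = new File(f).listFiles();
    if (null == fds) throw new IOException("not a readable directory: " + f);
    XxCorpus cps = new XxCorpus();
    for (File v : fds) {
        List<String> lst = new LinkedList<String>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader brd = new BufferedReader(new InputStreamReader(new FileInputStream(v), "UTF-8"))) {
            String buf;
            while (null != (buf = brd.readLine())) {
                for (String w : buf.split(" ")) {
                    String t = w.trim();
                    if (t.length() < 2) continue; // skip empty and single-character tokens
                    lst.add(t);
                }
            }
        }
        cps.addDoc(lst);
    }
    return cps;
}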
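// In XxCorpus: converts a token list to word ids and stores it as one document.
public int[] addDoc(List<String> l) {
    int idx = 0;
    int[] doc = new int[l.size()];
    for (String v : l) {
        doc[idx++] = voc.getId(v, true); // true: register unseen words in the vocabulary
    }
    lst.add(doc);
    return doc;
}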
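// In the vocabulary class: returns the id of w. When b is true, an unseen word
// gets the next free id; when b is false, an unseen word yields null.
public Integer getId(String w, boolean b) {
    Integer id = w2i.get(w);
    if (b && null == id) {
        id = w2i.size();
        w2i.put(w, id);
        if (i2w.length <= id) { // grow the id -> word table
            String[] arr = new String[w2i.size() * 2];
            System.arraycopy(i2w, 0, arr, 0, i2w.length);
            i2w = arr;
        }
        i2w[id] = w;
    }
    return id;
}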
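To make the word-to-id mapping concrete, here is a small self-contained sketch of the same idea. It is not the class used in this post: VocabSketch and encode are hypothetical names, and an ArrayList replaces the hand-grown array for the id -> word table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative word <-> id table (hypothetical class, not the post's vocabulary class). */
class VocabSketch {
    private final Map<String, Integer> w2i = new HashMap<>();
    private final List<String> i2w = new ArrayList<>();

    /** Returns the id of w; assigns the next free id if create is true, else null when unknown. */
    Integer getId(String w, boolean create) {
        Integer id = w2i.get(w);
        if (id == null && create) {
            id = i2w.size();
            w2i.put(w, id);
            i2w.add(w);
        }
        return id;
    }

    String getWord(int id) { return i2w.get(id); }

    int size() { return i2w.size(); }

    /** Turns one tokenized document into an array of word ids, registering new words. */
    int[] encode(List<String> tokens) {
        int[] doc = new int[tokens.size()];
        for (int i = 0; i < doc.length; i++) doc[i] = getId(tokens.get(i), true);
        return doc;
    }
}
```

Each row of the two-dimensional cps array is one such id array, and size() is the vocabulary size passed alongside it.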
Training (the argument to gibbs is the number of target topics): new LdaGibbsSampler(cps.getDoc(), cps.getSize()).gibbs(10);
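For readers curious what a sampler like this does internally, here is a minimal sketch of collapsed Gibbs sampling for LDA. It is not the code of the LdaGibbsSampler used above: the class name LdaSketch, its fields, and the alpha, beta, and seed parameters are all assumptions made for illustration.

```java
import java.util.Random;

/** Minimal collapsed Gibbs sampler for LDA (illustrative sketch only). */
class LdaSketch {
    final int[][] docs;        // docs[m][n]: word id of the n-th token of document m
    final int V, K;            // vocabulary size, number of topics
    final double alpha, beta;  // symmetric Dirichlet priors (assumed parameters)
    final int[][] z;           // z[m][n]: topic currently assigned to that token
    final int[][] nw;          // nw[w][k]: how often word w is assigned to topic k
    final int[][] nd;          // nd[m][k]: tokens of document m assigned to topic k
    final int[] nwsum;         // nwsum[k]: total tokens assigned to topic k
    final Random rnd;

    LdaSketch(int[][] docs, int V, int K, double alpha, double beta, long seed) {
        this.docs = docs; this.V = V; this.K = K; this.alpha = alpha; this.beta = beta;
        rnd = new Random(seed);
        z = new int[docs.length][];
        nw = new int[V][K];
        nd = new int[docs.length][K];
        nwsum = new int[K];
        // Start from a random topic for every token.
        for (int m = 0; m < docs.length; m++) {
            z[m] = new int[docs[m].length];
            for (int n = 0; n < docs[m].length; n++) {
                int k = rnd.nextInt(K);
                z[m][n] = k;
                nw[docs[m][n]][k]++; nd[m][k]++; nwsum[k]++;
            }
        }
    }

    /** One sweep: resample the topic of every token from its full conditional. */
    void sweep() {
        double[] p = new double[K];
        for (int m = 0; m < docs.length; m++) {
            for (int n = 0; n < docs[m].length; n++) {
                int w = docs[m][n], old = z[m][n];
                // Remove the token from the counts before resampling it.
                nw[w][old]--; nd[m][old]--; nwsum[old]--;
                double sum = 0;
                for (int k = 0; k < K; k++) {
                    // Proportional to P(word | topic) * P(topic | document).
                    p[k] = (nw[w][k] + beta) / (nwsum[k] + V * beta) * (nd[m][k] + alpha);
                    sum += p[k];
                }
                // Draw a topic from the unnormalized distribution p.
                double u = rnd.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (acc < u && k < K - 1) { k++; acc += p[k]; }
                z[m][n] = k;
                nw[w][k]++; nd[m][k]++; nwsum[k]++;
            }
        }
    }

    /** phi[k][w]: estimated probability of word w under topic k. */
    double[][] phi() {
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++)
            for (int w = 0; w < V; w++)
                phi[k][w] = (nw[w][k] + beta) / (nwsum[k] + V * beta);
        return phi;
    }
}
```

After a number of sweeps, phi() yields a topic-by-word matrix, which is the shape the topic-word extraction step expects for its probability argument.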
Extracting topic words:
// Returns, for each topic k, its l most probable words with their probabilities,
// in descending order. Requires java.util.Arrays, LinkedHashMap and Map.
@SuppressWarnings("unchecked")
public static Map<String, Double>[] translate(double[][] p, CmVocab v, int l) {
    l = Math.min(l, p[0].length);
    Map<String, Double>[] rst = new Map[p.length];
    for (int k = 0; k < p.length; k++) {
        final double[] row = p[k];
        // Sort word ids by descending probability. Keying a TreeMap on the probability
        // would drop words that share a value, and iterator.next() could then run out.
        Integer[] ids = new Integer[row.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, (a, b) -> Double.compare(row[b], row[a]));
        rst[k] = new LinkedHashMap<String, Double>();
        for (int i = 0; i < l; i++) {
            rst[k].put(v.getWord(ids[i]), row[ids[i]]);
        }
    }
    return rst;
}
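Topic prediction: take any article, remove its stop words and tokenize it, then use the vocabulary to turn the tokens into a one-dimensional array of word ids. Pass that array together with the trained model to the prediction step; the result is an array of doubles, one value per topic.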
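// Returns the top-l words of the topic with the highest value in t.
public static Map<String, Double> translate(double[] t /* double vector, one value per topic */, double[][] p, CmVocab v, int l) {
    int n = -1;
    double d = -1.0;
    Map<String, Double>[] tpc = translate(p, v, l);
    for (int k = 0; k < t.length; k++) { // argmax over topics
        if (t[k] > d) {
            d = t[k];
            n = k;
        }
    }
    return tpc[n]; // assumes t is non-empty
}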
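The preparation step for prediction, mapping a tokenized article to word ids, is not shown above. A minimal sketch follows, assuming that words missing from the training vocabulary are simply skipped. PredictPrep, toIds, and argMax are hypothetical names, and a plain Map stands in for the vocabulary class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative helpers for the prediction step (hypothetical, not part of the post's code). */
class PredictPrep {
    /** Maps tokens to known word ids; words absent from the vocabulary are skipped. */
    static int[] toIds(List<String> tokens, Map<String, Integer> w2i) {
        List<Integer> ids = new ArrayList<>();
        for (String t : tokens) {
            Integer id = w2i.get(t);
            if (id != null) ids.add(id);
        }
        int[] out = new int[ids.size()];
        for (int i = 0; i < out.length; i++) out[i] = ids.get(i);
        return out;
    }

    /** Index of the largest value in t, or -1 when t is empty. */
    static int argMax(double[] t) {
        int best = -1;
        for (int k = 0; k < t.length; k++) {
            if (best < 0 || t[k] > t[best]) best = k;
        }
        return best;
    }
}
```

Skipping unknown words keeps the id array within the range the trained model knows about; returning -1 for an empty result lets the caller handle an article with no usable words instead of hitting an array index error.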
Print the results and take a look. How do they seem to you?
Coming next: "Machine Learning Let's Go, Move Five: Speeding Up", on a TF-vector-based similarity algorithm and its implementation.