Machine Learning Kickoff, Move Four: Takeoff

weixin_34405332

Original link: https://my.oschina.net/gonglibin/blog/1506946

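LDA (Latent Dirichlet Allocation) is a model for extracting topics from documents, also known as a three-level word-topic-document Bayesian model. It implicitly infers categories from the probability distributions that tie words to topics and topics to documents, and it belongs to unsupervised machine learning. The results depend heavily on the quality of the training samples, and the target categories should be neither too many nor too fine-grained, otherwise quality degrades far faster than you would expect. Facebook also has an open-source supervised topic extraction model that works better; I'll dig that hole today and fill it in some other time. The core implementation borrows a Gibbs sampling algorithm, so what is left to do yourself is the input, the output, and the preparation of training samples. Loading the samples takes two parameters: cps, a two-dimensional array in which each row is one document, and size, an integer giving the vocabulary size. The target number of topics can be set as needed, the topic-probability and word-probability settings can be left at their defaults, and once training finishes the topic words are extracted from the model.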
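To make the Gibbs sampling step less abstract, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not the LdaGibbsSampler used in this post: the class name TinyLda, its fields, and the constructor signature are assumptions made up for illustration, and it leaves out burn-in, averaging over samples, and the document-topic estimate.

```java
import java.util.Random;

// Illustrative collapsed Gibbs sampler for LDA (hypothetical class, not the post's LdaGibbsSampler).
class TinyLda {
    final int[][] docs;       // docs[m][n] = word id of the n-th token in document m
    final int V, K;           // vocabulary size, number of topics
    final double alpha, beta; // document-topic and topic-word smoothing
    final int[][] z;          // z[m][n] = topic currently assigned to that token
    final int[][] nw;         // nw[w][k] = times word w is assigned to topic k
    final int[][] nd;         // nd[m][k] = tokens of document m assigned to topic k
    final int[] nwsum;        // nwsum[k] = total tokens assigned to topic k
    final Random rnd;

    TinyLda(int[][] docs, int V, int K, double alpha, double beta, long seed) {
        this.docs = docs; this.V = V; this.K = K;
        this.alpha = alpha; this.beta = beta;
        this.rnd = new Random(seed);
        z = new int[docs.length][];
        nw = new int[V][K];
        nd = new int[docs.length][K];
        nwsum = new int[K];
        // start from a random topic for every token
        for (int m = 0; m < docs.length; m++) {
            z[m] = new int[docs[m].length];
            for (int n = 0; n < docs[m].length; n++) {
                int k = rnd.nextInt(K);
                z[m][n] = k;
                nw[docs[m][n]][k]++; nd[m][k]++; nwsum[k]++;
            }
        }
    }

    // One sweep: resample the topic of every token from its full conditional.
    void sweep() {
        double[] p = new double[K];
        for (int m = 0; m < docs.length; m++) {
            for (int n = 0; n < docs[m].length; n++) {
                int w = docs[m][n], old = z[m][n];
                // remove the token from the counts before resampling it
                nw[w][old]--; nd[m][old]--; nwsum[old]--;
                double sum = 0.0;
                for (int k = 0; k < K; k++) {
                    // (word given topic) * (topic given document), up to a constant factor
                    p[k] = (nw[w][k] + beta) / (nwsum[k] + V * beta) * (nd[m][k] + alpha);
                    sum += p[k];
                }
                // draw a topic in proportion to p
                double u = rnd.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (acc < u && k < K - 1) { k++; acc += p[k]; }
                z[m][n] = k;
                nw[w][k]++; nd[m][k]++; nwsum[k]++;
            }
        }
    }

    // phi[k][w]: estimated probability of word w under topic k.
    double[][] phi() {
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++)
            for (int w = 0; w < V; w++)
                phi[k][w] = (nw[w][k] + beta) / (nwsum[k] + V * beta);
        return phi;
    }
}
```

A call such as new TinyLda(docs, size, 10, 2.0, 0.5, 42L) followed by a few hundred sweep() calls would yield a topic-word matrix from phi(); the alpha and beta values here are only placeholders, not the defaults of the library used in this post.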

Building the vocabulary:

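	// Reads every file in directory f as one document of space-separated words.
	public static XxCorpus loadSample(String f) throws Exception {
		File[] fds = new File(f).listFiles();
		// listFiles() returns null when f is not a readable directory
		if (null == fds) throw new IOException("not a readable directory: " + f);
		XxCorpus cps = new XxCorpus();

		for (File v : fds) {
			List<String> lst = new LinkedList<String>();
			// try-with-resources closes the reader even if reading fails
			try (BufferedReader brd = new BufferedReader(new InputStreamReader(new FileInputStream(v), "UTF-8"))) {
				String buf;
				while (null != (buf = brd.readLine())) {
					for (String w : buf.split(" ")) {
						// skip empty tokens and single-character words
						String t = w.trim();
						if (t.length() < 2) continue;
						lst.add(t);
					}
				}
			}

			cps.addDoc(lst);
		}

		return cps;
	}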

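	// Converts a list of words into word ids (growing the vocabulary as needed)
	// and appends the resulting document to the corpus.
	public int[] addDoc(List<String> l) {
		int idx = 0;
		int[] doc = new int[l.size()];

		for (String v : l) {
			doc[idx++] = voc.getId(v, true);
		}
		lst.add(doc);

		return doc;
	}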

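	// Returns the id of word w. When b is true, an unseen word is assigned the next free id;
	// when b is false, an unseen word yields null.
	public Integer getId(String w, boolean b) {
		Integer id = w2i.get(w);
		if (b && null == id) {
			id = w2i.size();
			w2i.put(w, id);

			// grow the id -> word array when it is full
			if (i2w.length - 1 < id) {
				String[] arr = new String[w2i.size() * 2];
				System.arraycopy(i2w, 0, arr, 0, i2w.length);
				i2w = arr;
			}
			i2w[id] = w;
		}

		return id;
	}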

Training: new LdaGibbsSampler(cps.getDoc(), cps.getSize()).gibbs(10);

Extracting topic words:

	// For each topic k, returns its top l words mapped to their probabilities, highest first.
	@SuppressWarnings("unchecked")
	public static Map<String, Double>[] translate(double[][] p, CmVocab v, int l) {
		Map<String, Double>[] rst = new Map[p.length];

		for (int k = 0; k < p.length; k++) {
			final double[] row = p[k];
			// Sort word ids by probability rather than keying a TreeMap on the probability:
			// words sharing the same value would overwrite each other, and the
			// iterator could then run out before l entries were read.
			Integer[] ids = new Integer[row.length];
			for (int i = 0; i < ids.length; i++) ids[i] = i;
			Arrays.sort(ids, (a, b) -> Double.compare(row[b], row[a]));

			rst[k] = new LinkedHashMap<String, Double>();
			for (int i = 0; i < Math.min(l, ids.length); ++i) {
				rst[k].put(v.getWord(ids[i]), row[ids[i]]);
			}
		}

		return rst;
	}

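Topic prediction: take any article, remove its stop words and tokenize it, then use the vocabulary to turn the tokens into a one-dimensional vector of word ids. Pass that vector together with the trained model to the prediction step; the result is an array of doubles, one probability per topic.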
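The vocabulary lookup for a new article can be sketched as follows. This assumes the word-to-id map is available as a plain Map<String, Integer>; DocEncoder and encode are hypothetical names, not part of the code in this post. Words that never appeared in training are dropped, which mirrors calling getId(w, false) and skipping null results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a tokenized article into the int[] of word ids used for prediction.
class DocEncoder {
    static int[] encode(List<String> tokens, Map<String, Integer> w2i) {
        List<Integer> ids = new ArrayList<>();
        for (String t : tokens) {
            Integer id = w2i.get(t);
            // the trained model has no counts for unseen words, so skip them
            if (id != null) ids.add(id);
        }
        int[] doc = new int[ids.size()];
        for (int i = 0; i < doc.length; i++) doc[i] = ids.get(i);
        return doc;
    }
}
```

Note that the vocabulary must not grow at prediction time: assigning fresh ids to unseen words would point past the columns of the trained topic-word matrix.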

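	// Returns the top l words of the most probable topic for one document.
	public static Map<String, Double> translate(double[] t/* per-topic probabilities */, double[][] p, CmVocab v, int l) {
		if (t.length == 0) throw new IllegalArgumentException("empty topic distribution");
		int n = 0;
		for (int k = 1; k < t.length; k++) {
			// keep the index of the largest probability
			if (t[k] > t[n]) n = k;
		}

		return translate(p, v, l)[n];
	}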

Print the results and take a look. So, what do you think?
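For the printing itself, a small formatter along these lines is enough. TopicPrinter and format are hypothetical names; the input is assumed to be a Map<String, Double> like the one returned by translate, ordered from the most to the least probable word.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: renders one topic as "word=probability" lines in map order.
class TopicPrinter {
    static String format(Map<String, Double> topic) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : topic.entrySet()) {
            // Locale.ROOT keeps the decimal point stable across machines
            sb.append(String.format(Locale.ROOT, "%s=%.4f\n", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }
}
```

Calling System.out.print(TopicPrinter.format(topic)) then lists the words of the predicted topic, one per line.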

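Coming next: "Machine Learning Kickoff, Move Five: Acceleration", on a similarity algorithm based on TF vectors and its implementation.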
