Mahout Series: Core Functionality in Practice

        The previous post covered Mahout's math module, mahout-math, which bundles many commonly used mathematical and statistical utilities; since so much else builds on them, a solid understanding of these basics is worthwhile. Mahout also exposes many of its tools as command-line programs. The full list of commands is given below (it changes between versions, and each command takes its own set of parameters). Many of the commands resemble one another, and becoming fluent with all of them takes real effort, but even a glimpse gives a good sense of what Mahout can do and which tools it offers ready to use, for reference:

Command                    Detail
arff.vector                Generate Vectors from an ARFF file or directory
baumwelch                  Baum-Welch algorithm for unsupervised HMM training
buildforest                Build the random forest classifier
canopy                     Canopy clustering
cat                        Print a file or resource as the logistic regression models would see it
cleansvd                   Cleanup and verification of SVD output
clusterdump                Dump cluster output to text
clusterpp                  Groups clustering output in clusters
cmdump                     Dump confusion matrix in HTML or text formats
concatmatrices             Concatenates 2 matrices of same cardinality into a single matrix
cvb                        LDA via Collapsed Variational Bayes (0th deriv. approx)
cvb0_local                 LDA via Collapsed Variational Bayes, in memory locally
describe                   Describe the fields and target variable in a data set
evaluateFactorization      Compute RMSE and MAE of a rating matrix factorization against probes
fkmeans                    Fuzzy K-means clustering
hmmpredict                 Generate random sequence of observations by given HMM
itemsimilarity             Compute the item-item similarities for item-based collaborative filtering
kmeans                     K-means clustering
lucene.vector              Generate Vectors from a Lucene index
lucene2seq                 Generate Text SequenceFiles from a Lucene index
matrixdump                 Dump matrix in CSV format
matrixmult                 Take the product of two matrices
parallelALS                ALS-WR factorization of a rating matrix
qualcluster                Runs clustering experiments and summarizes results in a CSV
recommendfactorized        Compute recommendations using the factorization of a rating matrix
recommenditembased         Compute recommendations using item-based collaborative filtering
regexconverter             Convert text files on a per-line basis based on regular expressions
resplit                    Splits a set of SequenceFiles into a number of equal splits
rowid                      Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity              Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic        Score new production data using a probably trained and validated AdaptiveLogisticRegression model
runlogistic                Run a logistic regression model against CSV data
seq2encoded                Encoded sparse vector generation from text sequence files
seq2sparse                 Sparse vector generation from text sequence files
seqdirectory               Generate sequence files (of Text) from a directory
seqdumper                  Generic SequenceFile dumper
seqmailarchives            Creates SequenceFile from a directory containing gzipped mail archives
seqwiki                    Wikipedia XML dump to sequence file
spectralkmeans             Spectral k-means clustering
split                      Split input data into test and train sets
splitDataset               Split a rating dataset into training and probe parts
ssvd                       Stochastic SVD
streamingkmeans            Streaming k-means clustering
svd                        Lanczos Singular Value Decomposition
testforest                 Test the random forest classifier
testnb                     Test the vector-based Bayes classifier
trainAdaptiveLogistic      Train an AdaptiveLogisticRegression model
trainlogistic              Train a logistic regression using stochastic gradient descent
trainnb                    Train the vector-based Bayes classifier
transpose                  Take the transpose of a matrix
validateAdaptiveLogistic   Validate an AdaptiveLogisticRegression model against a hold-out data set
vecdist                    Compute the distances between a set of vectors (or Cluster or Canopy; they must fit in memory) and a list of vectors
vectordump                 Dump vectors from a sequence file to text
viterbi                    Viterbi decoding of hidden states from a given output-state sequence

            The descriptions above are only rough summaries; I have not tried every command myself, and each one has many usage details of its own.

           Mahout provides many algorithms for clustering, classification, and recommendation (collaborative filtering), which are genuinely helpful for data analysis. Recommendation is currently the most mature of the three: it has been deployed in many production systems with good results. Clustering and classification, by comparison, are used in fewer scenarios so far and deserve further study.

The previous posts already covered recommendation, from theory to hands-on practice; below is an example of building a logistic regression model.
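Before the example, a quick reminder of what the model computes. For K classes (here K = 3 species) and a feature vector x, multinomial logistic regression gives each class its own weight vector and converts the scores into probabilities with the softmax function; stochastic gradient descent then adjusts the weights after every example. This is the standard textbook formulation, with learning rate η and L2 regularization strength λ (Mahout's OnlineLogisticRegression implements a more elaborate variant of the same update):

```latex
P(y = k \mid x) = \frac{\exp(\beta_k^{\top} x)}{\sum_{j=1}^{K} \exp(\beta_j^{\top} x)},
\qquad
\beta_k \leftarrow \beta_k + \eta \,\bigl(\mathbf{1}[y = k] - P(y = k \mid x)\bigr)\, x - \eta \lambda \beta_k
```

The two knobs η and λ correspond to setLearningRate and setLambda in the code below.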


1. Data preparation

    The example uses the iris data set, one of the most widely used benchmark data sets in data analysis, so it needs no further introduction.

    Open R and type iris to see what the data looks like, then export it with the following command:

    write.csv(iris,file="D:/work_doc/Doc/iris.csv")

    The exported data looks like this:

"ID","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"

2. Working through it in Java code

import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.Locale;

import org.apache.commons.io.FileUtils;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;


import com.google.common.base.Charsets;
import com.google.common.collect.Lists;  
  
public class IrisLRTest {  
  
    private static LogisticModelParameters lmp;  
    private static PrintWriter output;  
    
    public static void main(String[] args) throws IOException {  
        // Initialize the model parameters
        lmp = new LogisticModelParameters();
        output = new PrintWriter(new OutputStreamWriter(System.out,
                Charsets.UTF_8), true);
        lmp.setLambda(0.001);
        lmp.setLearningRate(50);
        lmp.setMaxTargetCategories(3);
        lmp.setNumFeatures(4);
        List<String> targetCategories = Lists.newArrayList("setosa", "versicolor", "virginica"); // the three classes of the Species attribute
        lmp.setTargetCategories(targetCategories);
        lmp.setTargetVariable("Species"); // the target variable to predict
        List<String> typeList = Lists.newArrayList("numeric", "numeric", "numeric", "numeric"); // type of each predictor
        List<String> predictorList = Lists.newArrayList("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"); // predictor names
        lmp.setTypeMap(predictorList, typeList);
  
        // Read the data
        List<String> raw = FileUtils.readLines(new File(
                "D:\\work_doc\\Doc\\iris.csv"), "UTF-8");
        String header = raw.get(0);
        List<String> content = raw.subList(1, raw.size());
        CsvRecordFactory csv = lmp.getCsvRecordFactory();
        csv.firstLine(header);
  
        // Train
        OnlineLogisticRegression lr = lmp.createRegression();
        for (int i = 0; i < 100; i++) { // number of passes over the training data
            for (String line : content) {
                Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
                int targetValue = csv.processLine(line, input);
                lr.train(targetValue, input);
            }
        }
  
        // Evaluate the classifier (here on the training data itself)
        double correctRate = 0;
        double sampleCount = content.size();

        for (String line : content) {
            Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
            int target = csv.processLine(line, v);
            int score = lr.classifyFull(v).maxValueIndex();
            //System.out.println("Target:" + target + "\tPredicted:" + score);
            if (score == target) {
                correctRate++;
            }
        }
        output.printf(Locale.ENGLISH, "Rate = %.2f%n", correctRate / sampleCount);
    }
}

        The comments in the code explain each step, and the overall flow is easy to follow. This structure is not specific to this model: many other algorithms follow the same prepare-train-evaluate process, differing only in the concrete training method or algorithm.

Of course, the code here is based on Mahout; the same kinds of models can also be built in R, with essentially the same steps.
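To make the "same process, different algorithm" point concrete without any Mahout dependency, here is a toy binary logistic regression trained by stochastic gradient descent in plain Java. It is a hand-rolled sketch for illustration only: the tiny data set, learning rate, and pass count are made up, and Mahout's OnlineLogisticRegression is considerably more sophisticated. Note how the prepare / train-in-passes / evaluate structure mirrors the Mahout code above:

```java
public class SgdSketch {
    double[] w = new double[3];     // weights, with the bias at index 0
    double eta = 0.5;               // learning rate (made-up value)

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // P(y = 1 | x) for a 2-feature input
    double predict(double[] x) {
        return sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]);
    }

    // One SGD step on a single (label, features) example
    void train(int y, double[] x) {
        double err = y - predict(x);
        w[0] += eta * err;
        w[1] += eta * err * x[0];
        w[2] += eta * err * x[1];
    }

    public static void main(String[] args) {
        // Tiny linearly separable toy set: class 1 iff x0 + x1 > 1
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {0.2, 0.1}, {0.9, 0.9}};
        int[] ys = {0, 0, 0, 1, 0, 1};
        SgdSketch lr = new SgdSketch();
        for (int pass = 0; pass < 200; pass++)   // multiple passes, as in the Mahout code
            for (int i = 0; i < xs.length; i++)
                lr.train(ys[i], xs[i]);
        int correct = 0;                         // evaluate with the same loop structure
        for (int i = 0; i < xs.length; i++)
            if ((lr.predict(xs[i]) > 0.5 ? 1 : 0) == ys[i]) correct++;
        System.out.println("Rate = " + (double) correct / xs.length);  // prints the training accuracy
    }
}
```

Swapping in a different train method (naive Bayes counts, a decision-tree split search, and so on) changes the algorithm but not the surrounding skeleton, which is why so many of the command-line tools above come in train/test pairs.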



