1.什么是MLBase
MLBase是Spark生态圈的一部分,专注于机器学习,包含三个组件:MLlib、MLI、ML Optimizer。
ML Optimizer: This layer aims to automating the task of ML pipeline construction. The optimizer solves a search problem over feature extractors and ML algorithms included inMLI and MLlib. The ML Optimizer is currently under active development.
MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. A prototype of MLI has been implemented against Spark, and serves as a testbed for MLlib.
MLlib: Apache Spark's distributed ML library. MLlib was initially developed as part of the MLbase project, and the library is currently supported by the Spark community. Many features in MLlib have been borrowed from ML Optimizer and MLI, e.g., the model and algorithm APIs, multimodel training, sparse data support, design of local / distributed matrices, etc.
2.MLbase机器学习算法的流程
用户可以容易地使用MLbase这个工具来处理自己的数据。大部分的机器学习算法都包含训练以及预测两个部分,训练出模型,然后对未知样本进行预测。Spark中的机器学习包也是如此。
Spark将机器学习算法都分成了两个模块:
训练模块:通过训练样本输出模型参数
预测模块:利用模型参数初始化,预测测试样本,输出与测值。
MLbase提供了函数式编程语言Scala,利用MLlib可以很方便的实现机器学习的常用算法。
比如说,我们要做分类,只需要写如下scala代码:
1 var X = load("some_data", 2 to 10)2 var y = load("some_data", 1)3 var (fn-model, summary) = doClassify(X, y)
代码解释:X是需要分类的数据集,y