MOA Data Mining Framework


TingCole

Introduction to MOA (translated from the paper [1])

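Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA (the Waikato Environment for Knowledge Analysis) and is released under the GNU GPL license.
The data stream environment has different requirements from the traditional batch learning setting. The most significant are the following:

Requirement 1 Process an example at a time, and inspect it only once (at most)
Requirement 2 Use a limited amount of memory
Requirement 3 Work in a limited amount of time
Requirement 4 Be ready to predict at any time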

Figure 1 illustrates the typical use of a data stream classification algorithm, and how the requirements fit into a repeating cycle:

The algorithm is passed the next available example from the stream (Requirement 1).
The algorithm processes the example, updating its data structures. It does so without exceeding the memory bounds set on it (Requirement 2), and as quickly as possible (Requirement 3).
The algorithm is ready to accept the next example. On request it is able to predict the class of unseen examples (Requirement 4).
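
The cycle above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not MOA code: `NearestMeanLearner`, `learn_one`, and `predict_one` are hypothetical names, and the learner is a deliberately simple nearest-class-mean classifier chosen because its memory use depends only on the number of classes and attributes, not on the length of the stream.

```python
class NearestMeanLearner:
    """Hypothetical toy learner: keeps one running mean vector per class."""

    def __init__(self):
        self.counts = {}  # class label -> number of examples seen
        self.means = {}   # class label -> running mean of the attributes

    def learn_one(self, x, y):
        # Requirements 1-3: look at the example once, update in place,
        # and keep only a fixed-size summary per class.
        n = self.counts.get(y, 0) + 1
        mean = self.means.get(y, [0.0] * len(x))
        self.means[y] = [m + (v - m) / n for m, v in zip(mean, x)]
        self.counts[y] = n

    def predict_one(self, x):
        # Requirement 4: a prediction is available at any time.
        if not self.means:
            return None  # nothing has been learned yet

        def squared_distance(mean):
            return sum((v - m) ** 2 for v, m in zip(x, mean))

        return min(self.means, key=lambda label: squared_distance(self.means[label]))


# A tiny made-up stream of (attributes, label) pairs.
stream = [
    ((1.0, 1.0), "low"),
    ((9.0, 9.0), "high"),
    ((2.0, 2.0), "low"),
    ((8.0, 8.0), "high"),
]

learner = NearestMeanLearner()
for x, y in stream:          # step 1: take the next available example
    learner.learn_one(x, y)  # step 2: update the model within bounded memory

print(learner.predict_one((0.0, 0.0)))  # step 3: predict on request
```

The example is never stored after `learn_one` returns, which is what distinguishes this loop from a batch learner that needs the whole training set in memory.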

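In traditional batch learning, the problem of limited data is overcome by analyzing and averaging multiple models produced with different random arrangements of training and test data. In the stream setting, the problem of (effectively) unlimited data poses different challenges. One solution is to take snapshots at different times during the induction of a model to see how much the model improves.
The evaluation procedure of a learning algorithm determines which examples are used for training the algorithm and which are used to test the model it outputs. When considering what procedure to use in the data stream setting, one of the unique concerns is how to build a picture of accuracy over time. Two main approaches arise: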

Holdout: When traditional batch learning reaches a scale where cross-validation is too time-consuming, it is often accepted to instead measure performance on a single holdout set. This is most useful when the division between train and test sets has been pre-defined, so that results from different studies can be directly compared.
Interleaved Test-Then-Train or Prequential: Each individual example can be used to test the model before it is used for training, and from this the accuracy can be incrementally updated. When intentionally performed in this order, the model is always being tested on examples it has not seen. This scheme has the advantage that no holdout set is needed for testing, making maximum use of the available data. It also ensures a smooth plot of accuracy over time, as each individual example will become increasingly less significant to the overall average [2].
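
To make the difference between the two procedures concrete, here is a minimal Python sketch. It is not MOA's evaluation code: `MajorityClassLearner`, `prequential_accuracy`, and `holdout_accuracy` are hypothetical names, and the learner simply predicts the most frequent class seen so far so that the example stays short.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical toy learner: predicts the most frequent class seen so far."""

    def __init__(self):
        self.class_counts = Counter()

    def learn_one(self, x, y):
        self.class_counts[y] += 1

    def predict_one(self, x):
        if not self.class_counts:
            return None  # nothing learned yet
        return self.class_counts.most_common(1)[0][0]


def prequential_accuracy(learner, stream):
    """Interleaved test-then-train: test on each example, then train on it."""
    correct = 0
    seen = 0
    curve = []  # running accuracy after each example
    for x, y in stream:
        if learner.predict_one(x) == y:  # test first, on an unseen example
            correct += 1
        seen += 1
        curve.append(correct / seen)     # incrementally updated accuracy
        learner.learn_one(x, y)          # only then use it for training
    return curve


def holdout_accuracy(learner, holdout_set):
    """Holdout: measure accuracy on a fixed set that is never trained on."""
    correct = sum(1 for x, y in holdout_set if learner.predict_one(x) == y)
    return correct / len(holdout_set)


stream = [((0,), "a"), ((1,), "a"), ((2,), "b"), ((3,), "a")]
learner = MajorityClassLearner()
curve = prequential_accuracy(learner, stream)
holdout = holdout_accuracy(learner, [((4,), "a"), ((5,), "b")])
print(curve)
print(holdout)
```

In the prequential loop every example contributes to both testing and training, whereas the holdout set is set aside purely for measurement. The running average in `curve` also shows why the plot becomes smoother over time: each new example changes `correct / seen` by less and less as `seen` grows.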

Figure 2: The MOA graphical user interface
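Because data stream classification is a relatively new field, such evaluation practices are not nearly as well researched and established as they are in the traditional batch setting. The majority of experimental evaluations use fewer than one million training examples, which is disappointing in the context of data streams: to be truly useful for data stream classification, algorithms need to be capable of handling very large (potentially infinite) streams of examples. Demonstrating systems only on small amounts of data does not build a convincing case for their capacity to solve more demanding data stream applications [3].
MOA permits the evaluation of data stream classification algorithms on large streams, on the order of tens of millions of examples where possible, and under explicit memory limits. Anything less than this does not actually test algorithms in a realistically challenging setting.
MOA is written in Java. The main benefits of Java are portability, since applications can be run on any platform with an appropriate Java virtual machine, and its strong and well-developed support libraries. The language is widely used, and features such as automatic garbage collection help to reduce programmer burden and error.
MOA contains stream generators, classifiers, and evaluation methods. Figure 2 shows the MOA graphical user interface. A command-line interface is also available.
Considering data streams as data generated from pure distributions, MOA models a concept drift event as a weighted combination of two pure distributions that characterize the target concepts before and after the drift. Within this framework, one can define the probability that a stream instance belongs to the new concept after the drift. MOA uses the sigmoid function for this, as an elegant and practical solution [4,5].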
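
The following Python sketch shows how such a sigmoid-weighted combination could work. It is an illustration only, not MOA's implementation: it assumes the form p(t) = 1 / (1 + e^(-4(t - t0)/W)), which is my reading of [4,5], where t0 is the center of the drift and W its width. The function names and the two toy concepts are hypothetical.

```python
import math
import random


def new_concept_probability(t, t0, width):
    """Probability that the instance at time t comes from the new concept.

    Assumed form: 1 / (1 + exp(-4 * (t - t0) / width)), computed in a
    numerically stable way so that times far from t0 do not overflow.
    """
    z = 4.0 * (t - t0) / width
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def drifting_stream(old_concept, new_concept, t0, width, n, seed=0):
    """Yield n instances, each drawn from the old or the new concept."""
    rng = random.Random(seed)
    for t in range(n):
        use_new = rng.random() < new_concept_probability(t, t0, width)
        source = new_concept if use_new else old_concept
        yield source(rng)


# Two hypothetical "pure" concepts that differ only in how they label x.
def old_concept(rng):
    x = rng.random()
    return x, "old" if x < 0.5 else "OLD"


def new_concept(rng):
    x = rng.random()
    return x, "new" if x < 0.5 else "NEW"


labels = [y.lower() for _, y in drifting_stream(old_concept, new_concept,
                                                t0=500, width=100, n=1000)]
print(labels[:300].count("new"), labels[700:].count("new"))
```

A small `width` gives an almost abrupt change at `t0`, while a large `width` gives a gradual drift in which instances of both concepts are mixed for a long time.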
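MOA contains the data generators most commonly found in the literature. MOA streams can be built using generators, by reading ARFF files, by joining several streams, or by filtering streams. They allow the simulation of a potentially infinite sequence of data. The generators currently available are: Random Tree Generator, SEA Concepts Generator, STAGGER Concepts Generator, Rotating Hyperplane, Random RBF Generator, LED Generator, Waveform Generator, and Function Generator.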
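
As a rough illustration of what a stream generator does, the Python sketch below imitates the SEA concepts as I understand them from the literature: three numeric attributes drawn uniformly from [0, 10], of which only the first two matter, and a positive class when their sum does not exceed a threshold. The function name `sea_stream`, its parameters, and the threshold values are assumptions made for this example and are not MOA's API.

```python
import itertools
import random


def sea_stream(threshold=8.0, seed=1):
    """Endless SEA-like stream of (attributes, label) pairs (a sketch)."""
    rng = random.Random(seed)
    while True:  # potentially infinite: the consumer decides when to stop
        x = (rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10))
        y = 1 if x[0] + x[1] <= threshold else 0  # third attribute is irrelevant
        yield x, y


# Take a finite prefix of the infinite stream.
first_five = list(itertools.islice(sea_stream(), 5))

# Join two streams: 1000 examples of one concept followed by another concept.
joined = itertools.chain(
    itertools.islice(sea_stream(threshold=8.0, seed=1), 1000),
    sea_stream(threshold=9.0, seed=2),
)
prefix = list(itertools.islice(joined, 1500))

# Filter a stream: keep only the two relevant attributes.
filtered = ((x[:2], y) for x, y in sea_stream())
print(first_five[0], next(filtered))
```

Because the generator is lazy, memory use stays constant no matter how many examples are consumed, which is what makes experiments with tens of millions of examples practical.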
MOA contains several classifier methods, such as Naive Bayes, Decision Stump, Hoeffding Tree, Hoeffding Option Tree (Pfahringer et al., 2008), Bagging, Boosting, Bagging using ADWIN, and Bagging using Adaptive-Size Hoeffding Trees [5].

The MOA framework and its usage

 http://moa.cs.waikato.ac.nz/.     
 https://github.com/Waikato/moa

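The MOA open-source code can be obtained from either of these two links.
Following the paper and the official website, MOA can be run from the command line. The sizeofag.jar name used in the command must be changed to match the sizeofag file shipped with the downloaded release.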

java -cp .:moa.jar:weka.jar -javaagent:sizeofag-1.0.4.jar moa.DoTask 
or
java -cp moa.jar -javaagent:sizeofag-1.0.4.jar moa.gui.GUI

or
double-click moa.jar under \moa-release-2019.05.0-bin\moa-release-2019.05.0\lib

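The MOA GUI window then opens.
[Figure: MOA GUI main window]
Click Configure to adjust the parameters.

Choose a path for saving the results, which can be saved as .txt or .csv, then click Run to start the task.

You can also, for example, select the Clustering tab to work with clustering.
More features will be covered in a later update.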

References

[1] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research, 11:1601-1604, 2010.
[2] Joao Gama, Raquel Sebastiao, and Pedro Pereira Rodrigues. Issues in evaluation of stream learning algorithms. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[3] Richard Kirkby. Improving Hoeffding Trees. PhD thesis, University of Waikato, November 2007.
[4] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavalda. Improving adaptive bagging methods for evolving data streams. In First Asian Conference on Machine Learning, ACML 2009.
[5] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavalda. New ensemble methods for evolving data streams. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[6] MOA framework: http://moa.cs.waikato.ac.nz/
[7] MOA source code: https://github.com/Waikato/moa