Data Stream Mining

1- Introduction

With the pace of technological change at its peak, Silicon Valley keeps introducing new challenges that must be tackled in new and efficient ways. Continuous research is carried out to improve existing tools, techniques, and algorithms and to maximize their efficiency. Streaming data has remained a challenge for the last few decades; plenty of stream-based algorithms have been introduced, yet researchers are still struggling to achieve better results and performance. As the saying goes, once water from a fire hose starts hitting your face, your chances of measuring it diminish rapidly. This is due to the torrent-like nature of streams, and it introduces new challenges in analyzing and mining them efficiently. Stream analysis has been made easier, to some extent, by a few tools recently introduced to the market. These tools follow different approaches and algorithms that are being improved continuously. However, when it comes to mining data streams, it is not possible to store the data and iterate over it as traditional mining algorithms do, because of its continuous, high-speed, and unbounded nature.

Due to the irregularity of and variation in the arriving data, memory management has become the main challenge to deal with. Applications such as sensor networks cannot afford mining algorithms with a high memory cost. Similarly, time management, data preprocessing techniques, and the choice of data structure are also considered among the main challenges in stream mining algorithms. Summarization techniques derived from statistics are therefore used to deal with the memory limitation, and techniques from computational theory are used to design time- and space-efficient algorithms. Another challenge is the consumption of available resources; to cope with it, resource-aware mining has been introduced, which ensures that the algorithm consumes the available resources with due consideration.

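One classic summarization technique of this kind is reservoir sampling; the sketch below is a minimal illustration of the idea, not an algorithm evaluated in this article:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from an unbounded stream,
        using O(k) memory no matter how many items arrive."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Replace an existing slot with probability k / (i + 1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Example: summarize a stream of one million readings with only 100 slots.
    sample = reservoir_sample(range(1_000_000), 100)
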
Because a data stream is seen only once, it must be mined in a single pass, which requires an extremely fast algorithm to avoid resorting to techniques such as data sampling and load shedding. Such algorithms should also be able to run over data streams in parallel settings, partitioned across many distributed processing units. Infinite, high-volume data streams are produced by many online and offline real-time applications and systems, and their update rate is time-dependent. Because of this volume and speed, a special mechanism is required to extract knowledge from streaming data.

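To make the single-pass constraint concrete, the following minimal sketch (illustrative only) maintains a statistic incrementally, so each element is processed once and never stored:

    def running_mean(stream):
        """Compute the mean of a stream in one pass with O(1) memory:
        each element updates the running state and is then discarded."""
        count, mean = 0, 0.0
        for x in stream:
            count += 1
            mean += (x - mean) / count  # incremental update, no buffering
        return mean

    print(running_mean(iter([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])))  # 18.0
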
Many stream mining algorithms have been developed and proposed by the machine learning, statistics, and theoretical computer science communities. The question is: how do we know which algorithm deals best with the challenges mentioned above, and what is still needed in the market? This document intends to answer these questions. Because this research topic is quite vast, deciding on the best algorithm is not straightforward. We have compared the most recently published versions of stream mining algorithms in three categories: classification, clustering, and frequent itemset mining. Frequent itemset mining is a category of algorithms used to compute statistics about streaming data.

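As a flavour of the statistics that the frequent itemset mining category produces, the sketch below solves the simpler, related problem of finding frequent single items with the Misra-Gries summary; it is an illustration only, not one of the algorithms compared later:

    def misra_gries(stream, k):
        """Approximate the items occurring more than n/k times using at most
        k - 1 counters, in a single pass over the stream."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter and drop those that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    print(misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3))  # 'a' dominates
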
2- Classification

The classification task is to decide the proper label for any given record in a dataset. It is a part of supervised learning. The learning works by having the algorithm learn patterns and important features from a set of labeled data, or ground truths, resulting in a model. This model is then used in classification tasks. Various metrics are used to rate the performance of a model. For example, accuracy focuses on maximizing the number of correct labels, while specificity focuses on minimizing the mislabelling of the negative class. A few factors are crucial in deciding which metrics to use in a classification task, such as the label distribution and the purpose of the task itself.

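For concreteness, both metrics can be computed from the counts of a confusion matrix; the helper below is a small illustration, not code from the article:

    def accuracy_and_specificity(tp, tn, fp, fn):
        """Accuracy rewards correct labels overall; specificity measures how well
        the negative class avoids being mislabelled as positive."""
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        specificity = tn / (tn + fp)
        return accuracy, specificity

    # Example: 90 true positives, 50 true negatives, 10 false positives, 5 false negatives.
    print(accuracy_and_specificity(tp=90, tn=50, fp=10, fn=5))  # (~0.90, ~0.83)
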
There are also several types of classification algorithms, such as decision trees, logistic regression, neural networks, and Naive Bayes. In this work, we decide to focus on the decision tree.

In a decision tree, the learning algorithm constructs a tree-like model in which each internal node is a splitting attribute and each leaf is a predicted label. For every item, the decision tree sorts the item down the tree according to the splitting attributes until it reaches the leaf containing the predicted label.

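A minimal sketch of that sorting step is shown below; the attribute and label names are invented for illustration:

    # Each internal node tests one splitting attribute; each leaf holds a predicted label.
    tree = {
        "attribute": "outlook",
        "children": {
            "sunny": {"label": "no"},
            "rainy": {"label": "yes"},
        },
    }

    def classify(node, record):
        """Sort a record down the tree until a leaf with a predicted label is reached."""
        while "label" not in node:
            value = record[node["attribute"]]
            node = node["children"][value]
        return node["label"]

    print(classify(tree, {"outlook": "sunny"}))  # -> "no"
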
2.1 Hoeffding Trees

Currently, decision tree algorithms such as ID3 and C4.5 build trees from large amounts of data by recursively selecting the best attribute to split on, using metrics such as entropy-based information gain and the Gini index. However, these existing algorithms are not suitable when the training data cannot fit in memory.

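As a reminder of what those split metrics compute, here is a small illustrative sketch (not taken from ID3 or C4.5 themselves) of entropy and the Gini index for a set of labels:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a label distribution, the basis of information gain."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """Gini impurity of a label distribution, used by CART-style splits."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    labels = ["yes", "yes", "yes", "no"]
    print(entropy(labels))  # ~0.811 bits
    print(gini(labels))     # 0.375
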
There exist a few incremental learning methods in which the learning system, instead of fitting the entire dataset in memory at once, learns continuously from the stream of data. However, such models have been found to lack a correctness guarantee compared to batch learning on the same amount of data.

Domingos and Hulten [1] formulated a decision tree algorithm called the Hoeffding Tree. With a Hoeffding Tree, the records or training instances themselves are not kept in memory; only the tree nodes and their statistics are stored. Furthermore, the most interesting property of this tree is that its correctness converges to that of a tree built with a batch learning algorithm, given sufficiently massive data.

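The statistical tool behind this guarantee is the Hoeffding bound: after seeing n independent samples of a random variable with range R, the observed mean lies within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean with probability 1 - delta. A small illustrative helper to evaluate it:

    from math import log, sqrt

    def hoeffding_bound(value_range, delta, n):
        """Epsilon such that, with probability 1 - delta, the observed mean of n
        samples of a variable with the given range lies within epsilon of the true mean."""
        return sqrt((value_range ** 2) * log(1.0 / delta) / (2.0 * n))

    # Example: information gain for binary labels lies in [0, 1] bits, so range = 1.
    print(hoeffding_bound(value_range=1.0, delta=1e-7, n=10_000))  # ~0.028
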
The training method for this tree is simple: for each sample, sort it down to the corresponding leaf and update that leaf's statistics.

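A schematic of that training loop is sketched below; it is heavily simplified (a single leaf, no actual split), but it shows the key property that only sufficient statistics are kept, never the instances themselves:

    from collections import defaultdict
    from math import log, sqrt

    def train_hoeffding_stub(stream, delta=1e-7, grace_period=200):
        """Illustrative single-pass loop: each example updates the statistics of the
        leaf it reaches and is then discarded."""
        class_counts = defaultdict(int)                        # per-leaf class counts
        value_counts = defaultdict(lambda: defaultdict(int))   # (attr, value) -> class -> count

        for n, (record, label) in enumerate(stream, start=1):
            class_counts[label] += 1
            for attr, value in record.items():
                value_counts[(attr, value)][label] += 1

            if n % grace_period == 0:
                epsilon = sqrt(log(1.0 / delta) / (2.0 * n))
                # A full Hoeffding Tree would compare the split gains of the best and
                # second-best attributes here and turn this leaf into an internal node
                # whenever their difference exceeds epsilon.

        return class_counts, value_counts

    stream = [({"outlook": "sunny"}, "no")] * 300 + [({"outlook": "rainy"}, "yes")] * 700
    counts, _ = train_hoeffding_stub(iter(stream))
    print(max(counts, key=counts.get))  # "yes" is the majority label at this leaf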