Sorting out the relationship between big data, ML, and AI

Original post, 2016-06-01 11:46:15

How are big data and machine learning related?

Big data and machine learning are not directly related, but when used together they can do real wonders.

Machine Learning & Big Data: In most cases, the learning comes from extensive calculations over existing datasets that produce a learned model. An ordinary single system cannot handle calculations over a very large dataset, and data size grows day by day, so the resulting model has to be updated as the data changes. To achieve this, we have to use distributed computing with big data technologies such as Apache Mahout, Spark, or R-Hadoop, or do the initial analytics processing in projects such as Hive or Pig and feed the output to machine learning algorithms that generate the model.

You can apply machine learning algorithms to big data and/or you can apply big data processing techniques to machine learning. (The two technologies can feed into each other.)

An example of the first case would be training a neural network or a logistic regression model on a large dataset using online gradient descent.
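As a rough illustration of the first case, here is a minimal sketch of logistic regression trained by online gradient descent, assuming a hypothetical two-feature data stream where the label is 1 when the features sum to a positive number. The function names, the learning rate, and the synthetic data are all assumptions made for the example, not something taken from the answers quoted here.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, bias, features):
    """Probability that the label is 1 under the current model."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def online_logistic_regression(stream, n_features, learning_rate=0.1):
    """Update the model once per example, so the dataset never has to fit in memory."""
    weights = [0.0] * n_features
    bias = 0.0
    for features, label in stream:
        # Gradient of the log loss with respect to the linear score.
        error = predict(weights, bias, features) - label
        for i, x in enumerate(features):
            weights[i] -= learning_rate * error * x
        bias -= learning_rate * error
    return weights, bias


def synthetic_stream(n_examples, seed=0):
    """Hypothetical data source: label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 1 if x[0] + x[1] > 0 else 0


weights, bias = online_logistic_regression(synthetic_stream(20000), n_features=2)
```

Because the stream is a generator, only one example is held at a time, which is the property that makes this style of training attractive for large datasets.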

An example of the second case would be parallelizing gradient descent to run in a Map-Reduce environment.
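The second case can be sketched in the same spirit. The example below imitates the Map-Reduce pattern in a single process: the map step computes a partial gradient on each data partition, and the reduce step adds the partial results together. It is only a sketch under stated assumptions: the data (points on the line y = 2x + 1), the partitioning, and every function name are hypothetical, and on a real cluster the map step would run on separate machines rather than through Python's built-in `map`.

```python
from functools import reduce


def partial_gradient(params, partition):
    """Map step: summed squared-error gradient over one partition."""
    w, b = params
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def combine(left, right):
    """Reduce step: add partial sums. Addition is associative, so order does not matter."""
    return tuple(a + b for a, b in zip(left, right))


def map_reduce_gradient_descent(partitions, learning_rate=0.5, n_iterations=2000):
    """Batch gradient descent where each iteration is one map pass and one reduce pass."""
    w, b = 0.0, 0.0
    for _ in range(n_iterations):
        partials = map(lambda part: partial_gradient((w, b), part), partitions)
        grad_w, grad_b, count = reduce(combine, partials)
        w -= learning_rate * grad_w / count
        b -= learning_rate * grad_b / count
    return w, b


# Hypothetical data: points on y = 2x + 1, split into 4 partitions.
data = [(i / 100, 2 * (i / 100) + 1) for i in range(100)]
partitions = [data[i::4] for i in range(4)]
w, b = map_reduce_gradient_descent(partitions)
```

The design point is that each partition only ever returns three numbers, so the amount of data moved between workers stays constant no matter how large the partitions are.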

In machine learning, large datasets usually mean you need to use simpler algorithms, and those algorithms perform much better than they do on smaller datasets.

There are two types of insights anyone can get from a dataset:
Q1. Direct (group by / join / sum / max / average)
Q2. Inductive, that is, inferred (if something is ..., then something else is ..., else ...)

Note that insights of the first type are always exact, so you compute them with tools such as Excel for small data and Hadoop for big data.
Inductive insights, on the other hand, are approximations drawn from looking at the data. For a small amount of data, a human can try to infer things from charts, graphs, and so on. However, when the data is huge, it is beyond human capacity to infer rules from it. This is exactly where machine learning comes in.
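To make the two kinds of insight concrete, here is a small sketch on a hypothetical list of orders (the records, field meanings, and function names are invented for illustration). The first half computes direct insights with a group-by followed by sum, max, and average. The second half infers a rule of the form "if amount > t then returned" by picking the threshold that agrees with the most records, which is a toy stand-in for what a learning algorithm does at scale.

```python
from collections import defaultdict

# Hypothetical records: (region, order_amount, was_returned)
orders = [
    ("north", 20, False),
    ("north", 35, False),
    ("south", 80, True),
    ("south", 95, True),
    ("north", 60, True),
    ("south", 15, False),
]

# Direct insight: group by region, then sum / max / average. The answers are exact.
amounts_by_region = defaultdict(list)
for region, amount, _ in orders:
    amounts_by_region[region].append(amount)

summary = {
    region: {
        "sum": sum(amounts),
        "max": max(amounts),
        "average": sum(amounts) / len(amounts),
    }
    for region, amounts in amounts_by_region.items()
}


# Inductive insight: search for the rule "if amount > t then returned"
# that agrees with the most records. The rule is an approximation of the data.
def best_threshold(records):
    candidates = sorted({amount for _, amount, _ in records})

    def accuracy(t):
        hits = sum((amount > t) == returned for _, amount, returned in records)
        return hits / len(records)

    return max(candidates, key=accuracy)


threshold = best_threshold(orders)
```

With six records a person could spot the rule by eye; the point of the passage above is that this stops being possible once the data is large.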

One of the biggest reasons we use big data is to extract some meaning from it so that we can make better decisions. And that is what machine learning does! It is the science of training systems to learn from data and produce an appropriate response without being explicitly programmed for it. On the flip side, without big data, machine learning would be far less useful, because to learn anything from data you need a large number of training examples, both so that as many scenarios as possible are covered and so that a few erroneous records do not lead to faulty training.
So, they are deeply interconnected. (In short, large datasets keep a learned model from being biased.)

I have often found these terms used in an interchangeable way, which is totally wrong.
Big data has more to do with high-performance computing, while machine learning is a part of data science. In big data, large volumes of data that could not otherwise be processed in a reasonable amount of time are processed quickly using various techniques and tools. In machine learning, a system learns from past experience and builds a model that is likely to handle future instances correctly.
One of the main reasons big data and machine learning are used together is that big data processing is often a preprocessing step for machine learning.

Machine Learning is a science of studying patterns in the data. These patterns explain how the data is correlated. This correlated data is used to make future predictions.

Big Data is the art of working with large amounts of data. Machine learning can be done on a smaller set of data, but the larger the data, the better the predictions.

So if I were to give a short answer: when you have a lot of structured or unstructured data that you want to study for patterns, you use big data tools to handle it and run your machine learning algorithms on it to find patterns that support a business use case.

Machine Learning - Builds models. When people hear the term “machine learning”, they picture robots that walk, climb, or clean houses. In reality, machine learning starts a lot closer to home. When you open your email, spam has been filtered out from your important messages by an algorithm that has learnt to classify “spam” and “not spam”. Your Facebook news feed features posts from your closest friends because an algorithm has examined your likes, tags, and photos to work out who you connect with most. When you upload a photo and the website identifies your face, that is driven by a facial recognition algorithm. When you use a search engine, you see the best and most relevant content first because of a sophisticated search ranking algorithm. In short, machine learning permeates our lives; that is, it builds models for self-learning algorithms.
Data Mining - An analytic process designed to explore data and find patterns in it. It is the practice of applying algorithms (mostly machine learning algorithms) to find patterns in data.
Artificial Intelligence - Behaves and reasons. The science of developing a system or software that mimics how a human responds and behaves in a given circumstance. As a field with an extremely broad scope, AI has split its goal into multiple chunks. Each chunk later became a separate field of study with its own problem to solve.
Major AI goals:
Knowledge Representation
Computer Vision
Machine Learning
Natural Language
General intelligence, or strong AI
Machine learning is a field that emerged from one of the AI goals: helping a machine learn on its own to solve the problems it comes across.

Natural language processing is another field that emerged from an AI goal: helping a machine communicate with real humans.

Computer vision is a field that emerged from an AI goal: identifying and distinguishing the objects a machine can see.

Robotics is a field that emerged from an AI goal: giving a machine a physical body so that it can perform physical actions.



