Very Brief Introduction to Machine Learning for AI

Intelligence

The notion of intelligence can be defined in many ways. Here we define it as the ability to take the right decisions, according to some criterion (e.g. survival and reproduction, for most animals). To take better decisions requires knowledge, in a form that is operational, i.e., can be used to interpret sensory data and use that information to take decisions.

Artificial Intelligence

Computers already possess some intelligence thanks to all the programs that humans have crafted, which allow them to “do things” that we consider useful (and that is basically what we mean by a computer taking the right decisions). But there are many tasks which animals and humans are able to do rather easily, yet which remain out of reach of computers at the beginning of the 21st century. Many of these tasks fall under the label of Artificial Intelligence, and include many perception and control tasks. Why is it that we have failed to write programs for these tasks? I believe that it is mostly because we do not know explicitly (formally) how to do these tasks, even though our brain (coupled with a body) can do them. Doing those tasks involves knowledge that is currently implicit, but we have information about those tasks through data and examples (e.g. observations of what a human would do given a particular request or input). How do we get machines to acquire that kind of intelligence? Using data and examples to build operational knowledge is what learning is about.

Machine Learning

Machine learning has a long history, and numerous textbooks have been written that do a good job of covering its main principles.

Here we focus on a few concepts that are most relevant to this course.

Formalization of Learning

First, let us formalize the most common mathematical framework for learning. We are given training examples

{\cal D} = \{z_1, z_2, \ldots, z_n\}

with the z_i being examples sampled from an unknown process P(Z). We are also given a loss functional L which takes as argument a decision function f and an example z, and returns a real-valued scalar. We want to minimize the expected value of L(f,Z) under the unknown generating process P(Z).
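Since P(Z) is unknown, the standard approach in practice is to minimize the average loss over the training set (the empirical risk) as a proxy for this expectation:

\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f, z_i) \approx E_{Z \sim P(Z)}[L(f,Z)]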

Supervised Learning

In supervised learning, each example is an (input, target) pair: Z=(X,Y), and f takes an X as argument. The most common examples are

  • regression: Y is a real-valued scalar or vector, the output of f is in the same set of values as Y, and we often take as loss functional the squared error

L(f,(X,Y)) = ||f(X) - Y||^2

  • classification: Y is a finite integer (e.g. a symbol) corresponding to a class index, and we often take as loss function the negative conditional log-likelihood, with the interpretation that f_i(X) estimates P(Y=i|X) (a small numerical sketch of both losses follows this list):

    L(f,(X,Y)) = -\log f_Y(X)

    where we have the constraints

    f_i(X) \geq 0 \;\;,\;\; \sum_i f_i(X) = 1
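As a concrete illustration of both losses, here is a minimal NumPy sketch (the array values and variable names are made up for the example):

    import numpy as np

    # --- Regression: squared error ---
    f_x = np.array([0.8, 1.9, 3.2])          # model prediction f(X)
    y = np.array([1.0, 2.0, 3.0])            # target Y
    squared_error = np.sum((f_x - y) ** 2)   # ||f(X) - Y||^2

    # --- Classification: negative conditional log-likelihood ---
    probs = np.array([0.1, 0.7, 0.2])   # f(X): estimated P(Y=i|X) over 3 classes
    target_class = 1                    # observed Y
    nll = -np.log(probs[target_class])  # -log f_Y(X)

    print(squared_error, nll)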

Unsupervised Learning

In unsupervised learning we are learning a function f which helps to characterize the unknown distribution P(Z). Sometimes f is directly an estimator of P(Z) itself (this is called density estimation). In many other cases f is an attempt to characterize where the density concentrates. Clustering algorithms divide up the input space in regions (often centered around a prototype example or centroid). Some clustering algorithms create a hard partition (e.g. the k-means algorithm) while others construct a soft partition (e.g. a Gaussian mixture model) which assigns to each Z a probability of belonging to each cluster. Another kind of unsupervised learning algorithms are those that construct a new representation for Z. Many deep learning algorithms fall in this category, and so does Principal Components Analysis.
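To make the hard/soft distinction concrete, here is a small sketch using scikit-learn (assuming it is available; the toy data and settings are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    # Toy 2-D data drawn from two well-separated blobs.
    Z = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4.0])

    # Hard partition: each point is assigned to exactly one cluster.
    hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

    # Soft partition: each point gets a probability of belonging to each cluster.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(Z)
    soft_probs = gmm.predict_proba(Z)   # shape (200, 2), each row sums to 1

    print(hard_labels[:5])
    print(soft_probs[:5].round(3))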

Local Generalization

The vast majority of learning algorithms exploit a single principle for achieving generalization: local generalization. It assumes that if input example x_i is close to input example x_j, then the corresponding outputs f(x_i) and f(x_j) should also be close. This is basically the principle used to perform local interpolation. This principle is very powerful, but it has limitations: what if we have to extrapolate? Or equivalently, what if the target unknown function has many more variations than the number of training examples? In that case there is no way that local generalization will work, because we need at least as many examples as there are ups and downs of the target function, in order to cover those variations and be able to generalize by this principle.

This issue is deeply connected to the so-called curse of dimensionality for the following reason. When the input space is high-dimensional, it is easy for it to have a number of variations of interest that is exponential in the number of input dimensions. For example, imagine that we want to distinguish between 10 different values of each input variable (each element of the input vector), and that we care about all the 10^n configurations of these n variables. Using only local generalization, we need to see at least one example of each of these 10^n configurations in order to be able to generalize to all of them.
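As a minimal illustration (the target function, the sample size, and the 1-nearest-neighbour predictor are made-up choices for this sketch), local interpolation breaks down when the target varies more than the training examples can cover, and the required coverage grows exponentially with dimension:

    import numpy as np

    # A target with several ups and downs over [0, 1].
    def target(x):
        return np.sin(20 * x)

    rng = np.random.RandomState(0)
    x_train = np.sort(rng.uniform(0, 1, 10))   # only 10 training examples
    y_train = target(x_train)

    # 1-nearest-neighbour prediction: a pure form of local generalization.
    def predict(x_query):
        nearest = np.argmin(np.abs(x_train[:, None] - x_query[None, :]), axis=0)
        return y_train[nearest]

    x_test = np.linspace(0, 1, 1000)
    mse = np.mean((predict(x_test) - target(x_test)) ** 2)
    print("test MSE:", mse)   # large: too few examples to cover the target's variations

    # Counting argument: with 10 distinguishable values per variable,
    # covering all configurations needs 10**n examples.
    for n in (1, 2, 5, 10):
        print(n, "variables ->", 10 ** n, "configurations")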

Distributed versus Local Representation and Non-Local Generalization

A simple-minded binary local representation of integer N is a sequence of B bits such that N<B, and all bits are 0 except the N-th one. A simple-minded binary distributed representation of integer N is a sequence of log_2 B bits with the usual binary encoding for N. In this example we see that distributed representations can be exponentially more efficient than local ones. In general, for learning algorithms, distributed representations have the potential to capture exponentially more variations than local ones for the same number of free parameters. They hence offer the potential for better generalization because learning theory shows that the number of examples needed (to achieve a desired degree of generalization performance) to tune O(B) effective degrees of freedom is O(B).
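A tiny Python sketch of the two encodings from the example above (the function names are illustrative):

    import math

    def local_one_hot(n, num_symbols):
        """Local representation: num_symbols bits, only the n-th bit is 1."""
        assert 0 <= n < num_symbols
        return [1 if i == n else 0 for i in range(num_symbols)]

    def distributed_binary(n, num_symbols):
        """Distributed representation: ceil(log2(num_symbols)) bits, usual binary code."""
        width = max(1, math.ceil(math.log2(num_symbols)))
        return [int(b) for b in format(n, "0{}b".format(width))]

    print(local_one_hot(5, 16))        # 16 bits, a single one set
    print(distributed_binary(5, 16))   # only 4 bits: [0, 1, 0, 1]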

Another illustration of the difference between distributed and local representation (and corresponding local and non-local generalization) is with (traditional) clustering versus Principal Component Analysis (PCA) or Restricted Boltzmann Machines (RBMs). The former is local while the latter are distributed. With k-means clustering we maintain a vector of parameters for each prototype, i.e., one for each of the regions distinguishable by the learner. With PCA we represent the distribution by keeping track of its major directions of variation. Now imagine a simplified interpretation of PCA in which we care mostly, for each direction of variation, whether the projection of the data in that direction is above or below a threshold. With d directions, we can thus distinguish between 2^d regions. RBMs are similar in that they define d hyper-planes and associate a bit to an indicator of being on one side or the other of each hyper-plane. An RBM therefore associates one input region to each configuration of the representation bits (these bits are called the hidden units, in neural network parlance). The number of parameters of the RBM is roughly equal to the number of these bits times the input dimension.

Again, we see that the number of regions representable by an RBM or a PCA (distributed representation) can grow exponentially in the number of parameters, whereas the number of regions representable by traditional clustering (e.g. k-means or Gaussian mixture, local representation) grows only linearly with the number of parameters. Another way to look at this is to realize that an RBM can generalize to a new region corresponding to a configuration of its hidden unit bits for which no example was seen, something not possible for clustering algorithms (except in the trivial sense of locally generalizing to that new region what has been learned for the nearby regions for which examples have been seen).
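To make the region-counting argument concrete, here is a small NumPy sketch in which random projection directions stand in for thresholded PCA directions or RBM hyper-planes (all names and values are illustrative):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(5000, 20)        # toy data: 5000 points in 20 dimensions
    d = 10                         # number of directions / hidden bits

    # Distributed code: one bit per direction, the sign of the projection.
    W = rng.randn(20, d)
    codes = (X @ W > 0).astype(int)               # one d-bit code per point
    n_distributed = len({tuple(c) for c in codes})

    # Local code: k prototypes can only carve out k regions.
    k = d
    print("distributed: %d distinct regions observed (up to 2^%d = %d)" %
          (n_distributed, d, 2 ** d))
    print("local (k prototypes): %d regions" % k)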
