Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning - Notes

Original post, 2018-04-17 13:12:17

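  An active learning (AL) algorithm picks the smallest set of items that must be labeled by hand, trains a classifier on them, and uses that classifier to label the remaining unlabeled data.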

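  To serve as the default optimization strategy of a crowd-sourced database, an AL algorithm must meet the following requirements:

  1. Generality: the algorithm must apply to arbitrary classification and labeling tasks, because crowd-sourced systems are used for a wide range of them.
  2. Black-box treatment of the classifier: the algorithm must work automatically, without tuning the classifier's internals, because not every user is an expert.
  3. Batching: the algorithm must support batches, i.e. handle several items at once.
  4. Parallelism: the algorithm must be able to run in parallel on modern multi-core processors and distributed clusters.
  5. Noise management: crowd-provided labels are very noisy, for example because of mistakes or a lack of domain expertise.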

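  According to the paper, it presents the first AL algorithm that meets all of the requirements above. Its main contributions are two AL algorithms, MinExpError and Uncertainty, plus a noise-management technique, partitioning-based allocation (PBA). These notes focus on the two AL algorithms.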
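  MinExpError and Uncertainty decide which items are sent to the crowd. The next step is to deal with the inherent noise in crowd-provided labels (PBA, which is based on integer linear programming) and to decide which of the labels returned by the crowd to use.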
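  To make the "which returned label to use" step concrete, here is a minimal sketch that asks several workers about each item and keeps the majority answer. This is plain majority voting used as a stand-in, not the paper's PBA; the names `majority_label` and `aggregate` are hypothetical, and the tie-breaking rule (first answer seen wins) is an assumption made only to keep the result deterministic.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def majority_label(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the earliest such label."""
    if not labels:
        raise ValueError("need at least one crowd answer")
    counts = Counter(labels)
    best = max(counts.values())
    # Scan in arrival order so that ties are broken deterministically.
    for label in labels:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label always has the top count")


def aggregate(crowd_answers: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, Hashable]:
    """Collapse the redundant answers collected for each item into one label."""
    return {item: majority_label(answers) for item, answers in crowd_answers.items()}


if __name__ == "__main__":
    answers = {"img_1": [1, 1, 0], "img_2": [0, 1, 0]}
    print(aggregate(answers))
```

  A fixed number of answers per item is the simplest choice; as these notes describe it, PBA instead formulates the noise-handling step as an integer linear program.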
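  A key novelty of the approach is its use of bootstrap theory. The main advantages are: (1) the bootstrap yields stable estimates for a wide range of estimators; (2) bootstrap-based estimates can be obtained while treating the classifier as a black box; (3) the computations the bootstrap needs can be carried out independently, which suits distributed systems.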
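  The three advantages can be illustrated with a small sketch: resample the labeled set with replacement, retrain a black-box classifier on each resample, and score each unlabeled item by how much the resulting predictions disagree. This is an illustration under assumptions, not the paper's exact algorithm: the toy learner `train_nearest_mean`, the score p(1 - p) for binary labels, and all function names are made up for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[float, int]          # (feature, label in {0, 1})
Classifier = Callable[[float], int]  # a trained model, used only via calls


def train_nearest_mean(data: Sequence[Labeled]) -> Classifier:
    """Toy learner (an assumption): predict the class with the nearer mean."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    if not zeros or not ones:
        # A resample can contain a single class; then always predict it.
        only = 1 if ones else 0
        return lambda x: only
    m0, m1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
    return lambda x: 1 if abs(x - m1) < abs(x - m0) else 0


def bootstrap_uncertainty(train: Callable[[Sequence[Labeled]], Classifier],
                          labeled: Sequence[Labeled],
                          unlabeled: Sequence[float],
                          k: int = 50, seed: int = 0) -> List[float]:
    """Score each unlabeled item by p * (1 - p) over k bootstrap replicas."""
    rng = random.Random(seed)
    votes = [0] * len(unlabeled)
    for _ in range(k):  # replicas are independent, so this loop parallelizes
        resample = [rng.choice(labeled) for _ in labeled]
        clf = train(resample)  # the classifier is only trained and called
        for i, u in enumerate(unlabeled):
            votes[i] += clf(u)
    return [(v / k) * (1 - v / k) for v in votes]


def select_batch(scores: Sequence[float], batch_size: int) -> List[int]:
    """Indices of the batch_size highest-scoring items, to send to the crowd."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


if __name__ == "__main__":
    labeled = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
    unlabeled = [0.5, 5.1, 9.5]
    scores = bootstrap_uncertainty(train_nearest_mean, labeled, unlabeled)
    print(scores, select_batch(scores, batch_size=1))
```

  Because `bootstrap_uncertainty` touches the learner only through `train(...)` and the returned callable, any classifier can be swapped in without changing the selection logic.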
Active Learning (AL)
Ranker-AL
Bootstrap

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/sunyao_123/article/details/79973291
