TigerYang414 - CSDN Blog

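Original: Implementing Top-K Algorithms

1. Selection sort: select the largest remaining element each time, K times in total. Complexity: n + (n-1) + ... + (n-k+1) = (2n-k+1)*k/2 = O(kn). Bubble sort and merge sort are similar. 2. Quicksort: partition with the quicksort method each time, then iteratively process only the part that contains the Top K boundary until it is found. Complexity: best case n + n/2 + ... = 2n; worst case O(kn). Bucket sort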
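The two approaches in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the post's own code: the function names `top_k_selection` and `top_k_quickselect`, the random pivot choice, and the descending Hoare-style partition are all assumptions made for the example.

```python
import random


def top_k_selection(values, k):
    """Selection-sort style: pick the largest remaining element, k times."""
    items = list(values)
    k = min(k, len(items))
    for i in range(k):
        best = i
        for j in range(i + 1, len(items)):
            if items[j] > items[best]:
                best = j
        items[i], items[best] = items[best], items[i]
    return items[:k]  # already in descending order


def top_k_quickselect(values, k):
    """Partition like quicksort, then keep working only on the side
    that contains the boundary between the top k and the rest."""
    items = list(values)
    k = min(k, len(items))
    if k <= 0:
        return []
    lo, hi = 0, len(items) - 1
    while lo < hi:
        pivot = items[random.randint(lo, hi)]  # assumed pivot strategy
        i, j = lo, hi
        # Hoare-style partition, larger elements to the left.
        while i <= j:
            while items[i] > pivot:
                i += 1
            while items[j] < pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Now items[lo..j] >= pivot and items[i..hi] <= pivot.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break  # index k-1 sits between j and i, so it is in place
    return sorted(items[:k], reverse=True)
```

For example, `top_k_selection([5, 1, 9, 3, 7], 2)` and `top_k_quickselect([5, 1, 9, 3, 7], 2)` both return the two largest values in descending order.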

2012-06-10 21:41:03 706

Original: Hadoop MapReduce

What happens if the value list for one key is too large for one reduce task? Before the reduce phase, the framework sorts it first. It should be small enough to be processed in memory; otherwise, external sorting is needed.
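As a rough illustration of what external sorting means here, the sketch below sorts fixed-size chunks in memory, writes each as a sorted run to a temporary file, and then streams a k-way merge over the runs. It is not how Hadoop implements its shuffle; the function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line run format are assumptions made for the example.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the integers in `values` in ascending order while holding
    at most `chunk_size` of them in memory during the run-building phase."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(p) for p in run_paths]
    try:
        # Each run is read lazily, so the merge keeps one item per run in memory.
        runs = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Calling `list(external_sort([5, 3, 8, 1, 9, 2, 7], 3))` builds three runs on disk and merges them into one sorted sequence.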

2012-06-09 16:06:12 257

Original: Hadoop in Action Notes

MapReduce vs. relational database: scale out vs. scale up; key/value pairs vs. relational tables; functional programming vs. declarative queries; offline batch processing vs. online transactions (MapReduce i

2012-06-09 15:31:44 377

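Original: Useful IT Sites

1. InfoQ: InfoQ (Information Queue) is an independent online community that tracks change and innovation in enterprise software development. Its readers are mainly technical architects, technical team leads (senior developers), and project managers. Through news, technical articles, video interviews, video talks, and mini-books provided by experts in each technical field, InfoQ delivers first-rate coverage for seven communities: Java, .NET, Ruby, SOA, Agile, Architecture, and Operations. Hadoop Topic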

2012-05-26 10:09:44 307

Original: Linux Admin Roundup

Configuring passwordless OpenSSH login on Linux (CentOS)

2012-04-03 21:47:49 387

Original: Relationships Among Hadoop Releases

0.20->0.20.1->0.20.2-+security and user limits+->0.20.203.x->0.20.204.x->0.20.205.x | /

2012-03-15 00:05:03 420

Original: Hadoop Configuration

Config files: core-default.xml, hdfs-default.xml, and mapred-default.xml are located in the root directory of the corresponding jar file (in the folder share/hadoop for 0.23.1), which is added to the classpath by the bin files. core-cit

2012-03-13 23:16:02 233

Original: The Google File System Notes

Assumptions: The system is built from many commodity machines and must be able to detect and recover from failures. It stores a modest number of large files and is optimized for them. There are primarily two kinds of reads: large streaming

2012-02-26 17:13:08 587

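Original: Exploiting Locality in Programs

Focus on the inner loops, where most of the computation and memory accesses happen. Reading data in the order it is stored in memory maximizes spatial locality. Once a piece of data has been read in, use it as much as possible to maximize temporal locality. Cache hit rate is only one important factor affecting performance; the number of memory accesses also matters, and the two have to be traded off. Excerpted from Chapter 6 of Computer Systems: A Programmer's Perspective.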
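The access-order advice can be shown with a small sketch. It assumes a 2D array stored as a flat, row-major list; the function names are hypothetical. Both functions compute the same sum, but the first walks memory with stride 1 in its inner loop while the second jumps by `cols` elements each step. In CPython the timing difference is muted because list elements are boxed objects, so treat this as an illustration of the access pattern that matters in a language such as C.

```python
def sum_row_major(flat, rows, cols):
    """Inner loop visits consecutive indices (stride 1): good spatial locality."""
    total = 0
    for r in range(rows):
        base = r * cols  # computed once per row and reused
        for c in range(cols):
            total += flat[base + c]
    return total


def sum_column_major(flat, rows, cols):
    """Inner loop jumps by `cols` each step (stride cols): poor spatial locality."""
    total = 0
    for c in range(cols):
        for r in range(rows):
            total += flat[r * cols + c]
    return total
```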

2012-02-25 23:40:59 590

Original: Distributed Sort via MapReduce vs. K-Way Merge + Quicksort

Distributed Sort via MapReduce: The map function just outputs key + record. The intermediate keys are partitioned into R pieces, and these R pieces are ordered partitions of the key domain. This functions as a bucket sort
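The scheme described in this excerpt can be simulated on a single machine as a rough sketch. The names `map_phase`, `partition`, and `distributed_sort` and the explicit `boundaries` list of split keys are assumptions for the example; a real job would normally derive the split points by sampling the keys.

```python
from bisect import bisect_left


def map_phase(records, key_of):
    """Map: emit (key, record) for every input record."""
    return [(key_of(r), r) for r in records]


def partition(pairs, boundaries):
    """Range-partition pairs into R = len(boundaries) + 1 buckets.
    `boundaries` must be sorted; bucket i holds keys no greater than
    boundaries[i], and the last bucket holds everything larger."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, rec in pairs:
        parts[bisect_left(boundaries, key)].append((key, rec))
    return parts


def distributed_sort(records, key_of, boundaries):
    """Sort each bucket independently (one per reducer) and concatenate.
    Because the buckets cover ordered key ranges, the result is globally sorted."""
    out = []
    for part in partition(map_phase(records, key_of), boundaries):
        part.sort(key=lambda kv: kv[0])
        out.extend(rec for _, rec in part)
    return out
```

With `boundaries = [20, 60]` there are three buckets, and `distributed_sort([42, 7, 19, 88, 3, 55], lambda r: r, [20, 60])` yields the records in ascending key order.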

2012-02-23 09:54:09 2248

Original: Google: MapReduce in a Week Notes

1. Failure is the number one concern in distributed system design: hardware failure and software failure. Heisenbug: a bug that seems to disappear or alter its characteristics when it is observe

2012-02-21 22:37:59 371

Original: Hadoop Study Notes

1. Quick Start on MapReduce: Google: MapReduce in a Week; MapReduce paper notes; The Google File System notes
2. Hadoop: Relationships Among Hadoop Releases; Hadoop Configuration
3. MapReduce use cases: MapReduce Patterns, Algorith

2012-02-21 22:16:37 291

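Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in some useful ways that make Spark perform better on certain workloads. In other words, Spark enables in-memory distributed datasets, which, besides supporting interactive queries, can also optimize iterative workloads. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.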

2019-03-06

The Google File System

Paper for The Google File System

2012-03-06


TigerYang414's Column