Hadoop - Jack_F's Blog - CSDN Blog

Hadoop

Average article quality score: 76

Followers: (not shown) · Articles: 26 · Views: 78148 · Favorites: 3

Author: Jack_F

Ah, life. Ah, socializing.

[Hadoop Source Code Walkthrough] (1) MapReduce: InputFormat

Source: http://blog.csdn.net/posa88/article/details/7897963 Contents: InputSplit, InputFormat, FileInputFormat, TextInputFormat, NLineInputFormat. Usually, when we write a MapReduce program and set the input format...

Repost 2013-09-21 10:08:21 · 1076 views · 0 comments
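Learning ZooKeeper

Reposted from: http://agapple.iteye.com/blog/1111377 Background: A while ago I looked at the S4 stream-computing engine, which uses ZooKeeper for cluster management, so I spent some time studying ZooKeeper. The goal was not to read all of the source code, but to understand its implementation mechanisms and principles and to be clear on its basic usage. This is also preparation for later work with distributed computing products such as Hadoop and GridGain. Learning: The first step was to collect study materials and summaries written by others...

Repost 2013-11-27 21:26:21 · 1112 views · 0 comments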
Data-Intensive Text Processing with MapReduce: EM Algorithms for Text Processing

EM Algorithms for Text Processing

Original 2013-11-16 20:21:10 · 1016 views · 0 comments
Data-Intensive Text Processing with MapReduce: Graph Algorithms

Graph Algorithms

Original 2013-11-16 20:20:31 · 3108 views · 0 comments
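Building and Installing Hadoop 2.2.0 on Ubuntu

This post is a repost; the patching section covers a problem I ran into myself. Reposted from: http://blog.changecong.com/2013/10/ubuntu-%E7%BC%96%E8%AF%91%E5%AE%89%E8%A3%85-hadoop-2-2-0/ Build environment: OS: Ubuntu 12.04 64-bit. Hadoop version: 2.2.0. Java: Jdk1.7.0...

Repost 2013-12-11 20:49:33 · 2876 views · 0 comments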
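My First Spark on YARN Program

Environment: Hadoop 2.2.0 + Scala 2.10.3 + Spark 0.9 + IDEA 13, on a single-machine pseudo-distributed YARN. Using the IDEA SBT plugin: create an SBT project, then in Settings enable SBT auto-import and automatic creation of the directory structure. build.sbt: name := "WordCount" version := "1.0" scalaVersion :...

Original 2014-02-22 14:42:52 · 13080 views · 5 comments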
Spark PageRank

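If nodes with out-degree 0 are ignored, the method is easy; see the official code. Once nodes with out-degree 0 are taken into account, however, all sorts of problems appear. Code first, explanation after: package myclass import org.apache.spark.SparkContext import SparkContext._ import scala.collection.mutable.ArrayBuffer import scala.c...

Original 2014-02-23 16:23:48 · 10094 views · 0 comments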
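
The PageRank excerpt above is cut off before it shows how nodes with out-degree 0 are handled, so the following is only a plain-Python sketch of one common fix, not the post's Spark code: the rank held by such nodes is spread evenly over all nodes in each iteration. The function name `pagerank`, the `links` dictionary layout, and the default parameters are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it points to (hypothetical layout)."""
    nodes = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links would otherwise be lost.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        received = {node: 0.0 for node in nodes}
        for src, dsts in links.items():
            for dst in dsts:
                received[dst] += rank[src] / len(dsts)
        # Assumed fix: spread the dangling rank evenly over all nodes.
        rank = {
            node: (1 - damping) / n + damping * (received[node] + dangling / n)
            for node in nodes
        }
    return rank


# Node "c" has out-degree 0; the ranks still sum to 1 after every iteration.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": []})
```

Without the `dangling / n` term the total rank would shrink on every iteration, which is one of the problems that nodes with out-degree 0 can cause.
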
KMeans on Spark

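Approach: 1. Randomly generate the data. 2. Randomly generate K cluster centers. 3. Compute the cluster each point belongs to. 4. Compute the new cluster centers. 5. Compare how much the cluster centers changed: if the change is above the threshold, go back to step 3; if it is below the threshold, stop. package myclass import java.util.Random import org.apache.spark.SparkContext import SparkContext._ import org.ap...

Original 2014-02-27 11:33:53 · 5726 views · 4 comments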
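
The five KMeans steps listed in the entry above can be sketched in plain Python. This is not the post's Spark code, which is truncated. All names (`generate_points`, `kmeans`, `squared_distance`, `mean`), the choice to sample the initial centers from the data, and the `max_iter` safety cap are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def generate_points(n, dims=2, seed=0):
    # Step 1: randomly generate the data.
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dims)) for _ in range(n)]


def kmeans(points, k, threshold=1e-9, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: pick K initial centers at random (here: sampled from the data).
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 5: stop once the centers have moved less than the threshold.
        shift = sum(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers


centers = kmeans(generate_points(200), k=3)
```

In a Spark version, step 3 would typically be a map over the points and step 4 a per-cluster reduction, but that mapping is an assumption about the truncated code.
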
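SparkTC: Transitive Closure on a Graph (Node Reachability in a Graph)

Approach: 1. Generate the data as (from, to) pairs; these are the initial reachable node pairs (and also the basic rules for moving between nodes). 2. Perform a join on the data (similar to one matrix multiplication). 3. Extract the join result into (from, to) form and union it with the current reachable node pairs to get the updated set of reachable node pairs. 4. Compare the number of current reachable node pairs with the number from the previous round. 5. If it has not increased, stop; otherwise go back to step 2 and continue. This may still be a bit confusing, so look at the experimental data...

Original 2014-02-27 20:50:47 · 3770 views · 0 comments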
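
The transitive-closure loop described in the entry above can be sketched in plain Python, with sets standing in for the distributed collections. This is an illustrative sketch rather than the post's Spark code; the function name `transitive_closure` and the dictionary-based join are assumptions.

```python
def transitive_closure(edges):
    """Return every (from, to) pair reachable by following one or more edges."""
    edges = set(edges)
    # Index the edges by source node so the join in step 2 is a lookup.
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, set()).add(dst)

    # Step 1: the initial reachable pairs are the edges themselves.
    closure = set(edges)
    while True:
        previous_count = len(closure)
        # Steps 2-3: join reachable pairs (x, y) with edges (y, z) on y,
        # keep the result as (x, z), and union it into the reachable pairs.
        closure |= {(x, z) for x, y in closure for z in by_source.get(y, ())}
        # Steps 4-5: stop when the round added no new pairs.
        if len(closure) == previous_count:
            return closure


pairs = transitive_closure([(1, 2), (2, 3), (3, 4)])
```

Each round extends every known path by one edge, so the loop runs at most as many rounds as the longest shortest path in the graph, plus one final round that detects no growth.
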
Spark with Hadoop InputFormat

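Based on YARN and using the new API. The following must be added to SBT, because the 1.0.4 client is used by default: libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" The code is as follows: package myclass import org.apache.spark.SparkContext import org.apache.hadoop....

Original 2014-02-28 10:51:23 · 6979 views · 2 comments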
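Single-Source Shortest Paths with Pregel in GraphX

The single-source shortest paths example in GraphX uses a Pregel-like approach. The core consists of three functions: 1. The function a vertex uses to process messages, vprog: (VertexId, VD, A) => VD, i.e. (vertex id, vertex attribute, message) => vertex attribute. 2. The function a vertex uses to send messages, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId,A)], i.e. (edge triplet) => Iterato...

Original 2014-03-04 21:54:27 · 7159 views · 1 comment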
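
The excerpt above is cut off before the third function; in GraphX's Pregel API that function is `mergeMsg: (A, A) => A`, which combines messages addressed to the same vertex. The following plain-Python simulation is a sketch of how the three functions cooperate for single-source shortest paths. It is not GraphX code: the driver `pregel_sssp`, the flattened triplet arguments, and the rule that only vertices that just received a message send in the next round are simplifying assumptions.

```python
import math


def vprog(vertex_id, dist, message):
    # Vertex program: keep the shorter of the current and the received distance.
    return min(dist, message)


def send_msg(src_id, src_dist, dst_id, dst_dist, weight):
    # Send along an edge only if it would shorten the destination's distance.
    if src_dist + weight < dst_dist:
        yield dst_id, src_dist + weight


def merge_msg(a, b):
    # Combine two messages bound for the same vertex.
    return min(a, b)


def pregel_sssp(vertices, edges, source):
    """edges is a list of (src, dst, weight) with non-negative weights."""
    dist = {v: 0.0 if v == source else math.inf for v in vertices}
    active = set(vertices)
    while active:
        inbox = {}
        for src, dst, weight in edges:
            if src not in active:
                continue
            for target, message in send_msg(src, dist[src], dst, dist[dst], weight):
                inbox[target] = (
                    merge_msg(inbox[target], message) if target in inbox else message
                )
        for v, message in inbox.items():
            dist[v] = vprog(v, dist[v], message)
        # The loop ends when a round produces no messages.
        active = set(inbox)
    return dist


distances = pregel_sssp([1, 2, 3, 4], [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0)], source=1)
```

Unreachable vertices keep the distance `math.inf`, because `inf + weight < inf` is never true and so they never send or receive a message.
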
Data-Intensive Text Processing with MapReduce: Inverted Indexing for Text Retrieval

Inverted Indexing for Text Retrieval

Original 2013-11-16 20:19:33 · 1179 views · 0 comments
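ZooKeeper Cluster Installation (Successful Standalone and Distributed Installs): Excerpts

Source: http://www.blogjava.net/hello-yun/archive/2012/05/03/377250.html ZooKeeper is a distributed open-source framework that provides basic services for coordinating distributed applications. It exposes a set of common services to external applications: Distributed Synchronization, Naming Service, Group Maintenan...

Repost 2013-11-27 22:25:23 · 878 views · 0 comments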
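Zookeeper, a Distributed Service Framework: Managing Data in a Distributed Environment

Source: http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/ Introduction: The Zookeeper distributed service framework is a subproject of Apache Hadoop. It is mainly used to solve data management problems that distributed applications commonly run into, such as unified naming, state synchronization, cluster management, and management of distributed application configuration items. This article describes Zookeeper in detail from a user's perspective...

Repost 2013-11-27 21:48:19 · 719 views · 0 comments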
MapReduce Design Patterns-chapter 2

Ah

Original 2013-09-21 09:56:28 · 1361 views · 1 comment
MapReduce Design Patterns-chapter 3

CHAPTER 3: Filtering Patterns. There are a couple of reasons why map-only jobs are efficient. • Since no reducers are needed, data never has to be transmitted between the map and reduce phase. Most o...

Original 2013-09-22 10:02:00 · 1391 views · 0 comments
MapReduce Design Patterns-chapter 4

CHAPTER 4: Data Organization Patterns. Structured to Hierarchical. Problem: Given a list of posts and comments, create a structured XML hierarchy to nest comments with their related post. public...

Original 2013-09-22 17:08:42 · 1705 views · 0 comments
MapReduce Design Patterns-chapter 5

CHAPTER 5: Join Patterns. A Refresher on Joins. INNER JOIN: With this type of join, records from both A and B that contain identical values for a given foreign key f are brought together, such that all...

Original 2013-09-23 18:38:44 · 972 views · 0 comments
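
The inner join described in the entry above pairs records from A and B that share the same value of the foreign key f. A minimal in-memory sketch in plain Python follows; it is a single-machine illustration, not a MapReduce job from the book, and the function name `inner_join`, the dictionary records, and the sample data are assumptions.

```python
from collections import defaultdict


def inner_join(a_records, b_records, key):
    """Pair every record of A with every record of B that has the same key value."""
    # Group B by the foreign key so each A record needs only one lookup.
    b_by_key = defaultdict(list)
    for b in b_records:
        b_by_key[b[key]].append(b)
    # Records whose key value has no match on the other side are dropped.
    return [(a, b) for a in a_records for b in b_by_key.get(a[key], [])]


users = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
comments = [{"id": 1, "text": "hi"}, {"id": 1, "text": "yo"}, {"id": 3, "text": "z"}]
joined = inner_join(users, comments, "id")
```

Grouping one side by key is also what a reduce-side join relies on, where the shuffle brings records with the same key to the same reducer.
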
MapReduce Design Patterns-chapter 6

CHAPTER 6: Metapatterns. Oozie. Job Chaining. CombineFileInputFormat takes smaller blocks and lumps them together to make a larger input split before being processed by the mapper. You...

Original 2013-09-25 09:17:24 · 1497 views · 0 comments
MapReduce Design Patterns-chapter 7

CHAPTER 7: Input and Output Patterns. Customizing Input and Output in Hadoop. Hadoop allows you to modify the way data is loaded on disk in two major ways: configuring how contiguous chunks of input ar...

Original 2013-09-25 23:30:42 · 1094 views · 0 comments
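Hadoop Secondary Sort

Reposted from: http://blog.csdn.net/heyutao007/article/details/5890103 This is the SecondarySort source from the examples bundled with MapReduce; I rewrote it, largely unchanged. The map and reduce defined in this example are shown below; the key point is how they define the input and output types (Java generics): public static class Map extends Mapper publ...

Repost 2013-10-10 00:04:47 · 5575 views · 2 comments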
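Data-Intensive Text Processing with MapReduce: MapReduce Algorithm Design

MapReduce Algorithm Design. In-mapper combining. Main idea: aggregate manually by using a Map, so that the Combiner is implemented inside the Mapper. Example: WordCount. Reason: 1. Hadoop's Combiner mechanism runs combine regardless of how the keys are distributed; if many keys each have only a single corresponding value, then the Combi...

Original 2013-11-10 21:43:06 · 1337 views · 0 comments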
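
The in-mapper combining idea from the entry above, using WordCount as the example, can be sketched in plain Python. This is not Hadoop code: the class `InMapperCombiningWordCount` and the driver `run_mapper` are hypothetical, and the `setup`/`map`/`cleanup` methods only mirror the lifecycle of a Hadoop Mapper.

```python
from collections import defaultdict


class InMapperCombiningWordCount:
    """WordCount mapper that aggregates in memory instead of emitting (word, 1)."""

    def setup(self):
        # The in-memory map that takes over the Combiner's job.
        self.counts = defaultdict(int)

    def map(self, line):
        # Nothing is emitted here; counts only accumulate.
        for word in line.split():
            self.counts[word] += 1

    def cleanup(self):
        # Emit one (word, partial_count) pair per distinct word seen by this mapper.
        yield from self.counts.items()


def run_mapper(lines):
    # Hypothetical driver standing in for the framework's task lifecycle.
    mapper = InMapperCombiningWordCount()
    mapper.setup()
    for line in lines:
        mapper.map(line)
    return list(mapper.cleanup())


emitted = run_mapper(["a b a", "b a"])
```

The trade-off is memory: the map must hold every distinct key the mapper sees, so a real job would normally flush it once it grows past some size limit.
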
Hadoop Tricks (Personal Notes)

Configuring core-site.xml: name: fs.default.name; value: hdfs://192.168.0.1:9000; description: The name of the default file system. Either the literal string "local" or a host:port for NDFS.

Original 2013-11-12 10:25:32 · 664 views · 0 comments
Hadoop Command Summary (Personal Notes)

Administration. Execution.

Original 2013-11-12 10:07:01 · 696 views · 0 comments
Apache Hama Configuration

hama-site.xml: name: bsp.master.address; value: 192.168.0.1:40000; description: The address of the bsp master server. Either the literal string "local" or a host[:port] (where host is a name or IP address...

Original 2013-11-27 11:08:29 · 1564 views · 0 comments
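Job Interview Memo

Data Structure. Java: 1. How Java HashMap works. 2. Memory leaks and memory management in Java applications. 3. Essentials of Java garbage collection. Hadoop

Original 2014-04-08 16:54:33 · 1520 views · 0 comments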
