Hadoop_puffsun's Blog - CSDN Blog

Hadoop

Average article quality score: 74

Followers: Articles: 57 Article views: 33583 Article bookmarks: 0

Author: puffsun

Availability and Reliability with HBase

Availability: Availability in the context of HBase can be defined as the ability of the system to handle failures. The most common failures cause one or more nodes in the HBase cluster to fall off t...

Original 2013-08-25 10:53:19 · 95 views · 0 comments
Moving Data in/out of Hadoop Filesystem

Hadoop has a number of built-in mechanisms that can facilitate ingress and egress operations, to name a few: the embedded NameNode HTTP server; WebHDFS and Hadoop interfaces; the HBase built-in API, be sp...

Original 2013-07-18 23:11:51 · 84 views · 0 comments
Enabling Oozie Web Console in CDH3, CDH4 with/without Cloudera Manager

To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows. If you use CDH3, you must do the following: Download th...

Original 2013-07-16 23:36:37 · 81 views · 0 comments
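Specifying the Flume Log Classification Level

When receiving syslog-format logs over UDP or TCP, for example: flume dump 'syslogUdp(5140)'. This command receives logs over UDP on port 5140. Suppose you then want to test from the command line whether logs are received successfully: echo '<37>Hello from cmd.' |nc -u localhost 5140. Be sure to prefix the test text with <37>, which is used to classify the log; otherwise flum...

Original 2013-07-16 08:41:14 · 758 views · 0 comments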
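The `<37>` prefix in the test command above is a syslog priority (PRI) value, which in standard syslog encodes `facility * 8 + severity`. The following is a minimal Python sketch of that decoding, assuming the RFC 3164 numbering; the function names are hypothetical and are not part of Flume.

```python
# Standard syslog names (RFC 3164 numbering); only the first 12 facilities are listed.
FACILITIES = ["kern", "user", "mail", "daemon", "auth", "syslog",
              "lpr", "news", "uucp", "cron", "authpriv", "ftp"]
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_pri(message):
    """Return the integer inside the leading '<N>' header of a syslog message."""
    if not message.startswith("<"):
        raise ValueError("message has no <PRI> header")
    end = message.index(">")
    return int(message[1:end])


def decode_pri(pri):
    """Split a PRI value into (facility, severity), since PRI = facility * 8 + severity."""
    return divmod(pri, 8)


pri = parse_pri("<37>Hello from cmd.")
facility, severity = decode_pri(pri)
print(pri, FACILITIES[facility], SEVERITIES[severity])
```

Under this numbering, 37 decodes to facility 4 (auth) and severity 5 (notice).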
PageRank Algorithm in MapReduce

Chapter 5 of Data-Intensive Text Processing with MapReduce introduces how to implement the PageRank algorithm in the MapReduce way. I am not going to say more about PageRank itself here; please refe...

Original 2013-07-14 12:12:29 · 150 views · 0 comments
Breadth-first Graph Search in MapReduce

Chapter 5 of the book "Data-Intensive Text Processing with MapReduce" introduces how to parallelize breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's ...

Original 2013-07-13 20:44:43 · 136 views · 0 comments
Homework - How to Configure Hadoop Task Scheduler

To configure the MapReduce or YARN task scheduler, go to Services -> mapreduce1/yarn1 -> Configuration. Then click the 'view and edit' tab and search for the property 'mapred.jobtracker.taskSchedu...

Original 2013-07-13 01:00:11 · 102 views · 0 comments
Homework - NASA Access Log Processing

Hadoop workshop homework. For privacy, this blog post does not show any source code, only the job output logs and counters. Copy the packaged jar file into the Hadoop cluster: [root@n1 hadoop-...

Original 2013-07-13 00:36:11 · 168 views · 0 comments
Homework - Running Hadoop WordCount Examples

Hadoop workshop homework. I am an IntelliJ IDEA guy now (I switched from Eclipse several months ago because IntelliJ IDEA is now much better than Eclipse). Currently I...

Original 2013-07-12 23:44:17 · 71 views · 0 comments
Homework - Benchmarking Hadoop Cluster

In this blog post I introduce some of the benchmarking and testing tools in the Apache Hadoop distribution. Namely, I'll look at TeraSort, NNBench and MRBench. These are popular choices to benchmar...

Original 2013-07-12 22:20:29 · 119 views · 0 comments
Commissioning and Decommissioning Nodes from Hadoop Cluster

Nodes in a Hadoop cluster run both a datanode and a tasktracker, and both are typically commissioned or decommissioned in tandem. Commissioning new nodes: Commissioning a new node can be as simple...

Original 2013-07-11 23:34:02 · 170 views · 0 comments
Overview of MapReduce Algorithm Design

Although the programming model of the MapReduce framework forces one to express algorithms in terms of a small set of rigidly defined components, there are many tools at one's disposal to shape the flow ...

Original 2013-07-11 09:32:02 · 83 views · 0 comments
MapReduce Algorithm - Reduce-side Join

Reduce-side join is also known as repartition join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key, and the tuple itself as the intermediate value....

Original 2013-07-11 08:46:26 · 71 views · 0 comments
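The reduce-side join idea described in the excerpt above can be sketched in plain Python by simulating the map, shuffle, and reduce steps in memory. This is only an illustration under stated assumptions: the dataset names, the sample records, and the tagging of each tuple with its source dataset are hypothetical additions, not code from the post.

```python
from collections import defaultdict


def map_phase(tag, records, key_index):
    """Emit (join key, (source tag, tuple)) for every record of one dataset."""
    for record in records:
        yield record[key_index], (tag, record)


def reduce_phase(key, tagged_values):
    """Pair up tuples from the two datasets that share the same join key."""
    left = [rec for tag, rec in tagged_values if tag == "users"]
    right = [rec for tag, rec in tagged_values if tag == "orders"]
    for l in left:
        for r in right:
            yield key, l, r


def reduce_side_join(users, orders):
    # Simulated shuffle: group all intermediate values by join key.
    groups = defaultdict(list)
    for key, value in map_phase("users", users, 0):
        groups[key].append(value)
    for key, value in map_phase("orders", orders, 0):
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(key, groups[key]))
    return joined


users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
for row in reduce_side_join(users, orders):
    print(row)
```

Keys that appear in only one dataset produce no output here, so the sketch behaves like an inner join.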
Adding HBase Library into Java Classpath

Suppose you write some Java code to operate on HBase via the HBase Java client interface, then compile and package the Java source code into a jar called examples.jar. In a Hadoop cluster you can use "hbase c...

Original 2013-07-20 14:17:36 · 85 views · 0 comments
Running MapReduce Job with HBase

Generally there are three different ways of interacting with HBase from a MapReduce application. HBase can be used as a data source at the beginning of a job, as a data sink at the end of a job or as ...

Original 2013-07-21 01:50:23 · 93 views · 0 comments
Failed to Run Pig Script with Macro

Pig version: [root@n8 examples]# pig -version Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21 Hadoop version: [root@n8 examples]# hadoop version Hadoop 2.0.0-cd...

Original 2013-08-16 19:44:29 · 150 views · 0 comments
Solution to Hive Thrift Client Hang without Any Return

Env: Cloudera Manager 4.6.1 with CDH4.3; Hadoop 2.0.0-CDH4.3; Hive 0.10.0-CDH4.3; CentOS 6.4 X86_64. Hive started successfully: [root@n8 hive]# netstat -anlp | grep 10000 tcp 0 0 0.0.0.0:...

Original 2013-08-12 19:38:33 · 108 views · 0 comments
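How to Create Hive Data Files

A problem I often run into while learning Hive is the lack of suitable data files; for example, while reading "Programming Hive" I was frustrated that no sample data is provided for the Employees table. Since Hive by default uses '\001' (Ctrl+A) as the field separator, '\002' (Ctrl+B) as the collection items separator, and '\003' as the key/value separator for Map types, when ed...

Original 2013-08-10 12:05:04 · 137 views · 0 comments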
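One way to produce a data file with the default Hive delimiters named above is to build each row programmatically instead of typing control characters in an editor. The sketch below is an assumption about how that could look: the column layout (name, salary, an array of subordinates, a map of deductions), the sample values, and the output file name are hypothetical.

```python
FIELD_SEP = "\x01"    # '\001' (Ctrl+A): separates fields
ITEM_SEP = "\x02"     # '\002' (Ctrl+B): separates collection items
MAP_KV_SEP = "\x03"   # '\003': separates a map key from its value


def encode_row(name, salary, subordinates, deductions):
    """Serialize one row using Hive's default text delimiters."""
    subs = ITEM_SEP.join(subordinates)
    # Map entries are collection items; each entry is key + '\003' + value.
    deds = ITEM_SEP.join(f"{k}{MAP_KV_SEP}{v}" for k, v in deductions.items())
    return FIELD_SEP.join([name, str(salary), subs, deds])


row = encode_row(
    "John Doe",
    100000.0,
    ["Mary Smith", "Todd Jones"],
    {"Federal Taxes": 0.2, "State Taxes": 0.05},
)

# Hypothetical output file name; one encoded row per line.
with open("employees.txt", "w", encoding="utf-8") as out:
    out.write(row + "\n")
```

The resulting file can then be inspected with a tool that shows control characters, such as `cat -v`.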
Hive - Index Creation Failed, Cause Not Yet Known

Environment: Cloudera Hive 0.10-CDH4. The Hive installation on my machine has the following table: hive (human_resources)> describe formatted employees; OK col_name data_type comment # col_name data_type comment ...

Original 2013-08-10 00:08:46 · 986 views · 0 comments
Cascading Terminology and Concepts

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a ...

Original 2013-08-02 23:17:37 · 117 views · 1 comment
Cascading Kick Start: Word Counting

If you know Hadoop, you have undoubtedly seen WordCount before; WordCount serves as a hello world for Hadoop apps. This simple program provides a great test case for parallel processing: It req...

Original 2013-07-31 19:36:29 · 95 views · 0 comments
Joins with Apache Crunch

Apache Crunch is a Java library for creating MapReduce pipelines that is based on Google's FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig...

Original 2013-07-30 19:46:21 · 102 views · 0 comments
Getting Started with Apache Crunch

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to wr...

Original 2013-07-29 23:10:34 · 102 views · 0 comments
Accelerating Comparison by Providing RawComparator

When a job is in the sorting or merging phase, Hadoop leverages the RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementations that are...

Original 2013-07-27 21:25:07 · 78 views · 0 comments
MapReduce Algorithm - Secondary Sort

Secondary sort is used to allow some records to arrive at a reducer ahead of other records. It requires an understanding of both data arrangement and data flow (partitioning, sorting and gro...

Original 2013-07-25 19:34:46 · 147 views · 0 comments
MapReduce Algorithm - Semi-joins

In the relational world, a semi-join can be defined as a join between two tables that returns rows from the first table where one or more matches are found in the second table. The difference between a semi-jo...

Original 2013-07-25 18:15:04 · 81 views · 0 comments
MapReduce Algorithm - Another Way to Do Map-side Join

Map-side join is also known as replicated join, and gets its name from the fact that the smallest of the datasets is replicated to all the map hosts. You can find an implementation in Hadoop in Action...

Original 2013-07-25 17:51:08 · 134 views · 0 comments
MapReduce Algorithm - in Map Combining

In this blog post, I will demonstrate several techniques for local aggregation using the sample word count example. The original WordCount example comes from Hadoop examples. But I'll try to improv...

Original 2013-07-10 22:38:12 · 134 views · 0 comments
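As a rough illustration of local aggregation for word count, the Python sketch below contrasts a plain mapper, which emits one pair per word occurrence, with in-mapper combining, which accumulates counts in memory and emits one pair per distinct word. It simulates the idea outside Hadoop; the function names are hypothetical and this is not the post's own code.

```python
from collections import Counter


def plain_mapper(lines):
    """Emit (word, 1) for every occurrence, leaving all aggregation to later stages."""
    for line in lines:
        for word in line.split():
            yield word, 1


def in_mapper_combining(lines):
    """Aggregate counts inside the mapper and emit each distinct word once."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Emitted once at the end, analogous to emitting from a cleanup step.
    yield from counts.items()


lines = ["a b a", "b c"]
print(list(plain_mapper(lines)))
print(dict(in_mapper_combining(lines)))
```

The trade-off is memory: the in-mapper version must hold every distinct word seen by that mapper until it finishes.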
High Order Functions, Roots of MapReduce

Hadoop has its roots in functional programming, which is exemplified in languages such as Lisp and ML. A key feature of functional programming languages is the concept of higher-order functions, or f...

Original 2013-07-09 23:29:50 · 78 views · 0 comments
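A higher-order function is one that takes another function as an argument. The small Python sketch below shows the two classic examples, map and fold (`functools.reduce` in Python), which are commonly cited as the functional-programming analogues of the map and reduce phases; the sample functions and data are hypothetical.

```python
from functools import reduce


def square(x):
    """Function applied independently to each element by map."""
    return x * x


def add(accumulator, x):
    """Function used by the fold to combine elements into one result."""
    return accumulator + x


numbers = [1, 2, 3, 4]
squares = list(map(square, numbers))   # map: apply square to every element
total = reduce(add, squares, 0)        # fold: combine the results, starting from 0
print(squares, total)
```

Because `square` is applied to each element independently, the map step can run in parallel, which is the property MapReduce exploits.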
JMX Port Monitoring for Cloudera CDH4

By default, JMX info is only accessible via the JMX JSON servlet (https://issues.apache.org/jira/browse/HDFS-2083). For example, you can query the NameNode status via: curl -i http://n1.example.com:...

Original 2013-07-08 23:04:18 · 109 views · 0 comments
Classic MapReduce - Shuffle and Sort

Hadoop guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuf...

Original 2013-06-30 11:06:06 · 103 views · 0 comments
Classic MapReduce (MapReduce 1) - Task execution

First, the tasktracker localizes the job jar by copying it from the shared filesystem to its own filesystem. It also copies any files the application needs from the distributed cache to the local di...

Original 2013-06-29 15:42:36 · 105 views · 0 comments
Classic MapReduce (MapReduce 1) - Job assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. /** * Main service loop. Will stay in this loop forever. */ State offerService() throw...

Original 2013-06-29 14:45:20 · 125 views · 0 comments
Classic MapReduce (MapReduce 1) - Job initialization

When the JobTracker receives a call to its submitJob(...) method, it first checks whether the JobTracker is in SafeMode: private void checkSafeMode() throws SafeModeException { if (isInSafeMode()) { ...

Original 2013-06-29 13:56:51 · 93 views · 0 comments
Classic MapReduce (MapReduce 1) - Job submission

Note that the old and new MapReduce APIs are not the same thing as the classic and YARN-based MapReduce implementations (MapReduce 1 and MapReduce 2 respectively). The APIs are user-facing client-...

Original 2013-06-29 11:59:49 · 373 views · 0 comments
Note of Oozie 3.3.2 LocalOozie service start error.

Following the Oozie Official Example (http://oozie.apache.org/docs/3.3.2/DG_Examples.html), I wrote a LocalOozie example: import org.apache.oozie.client.OozieClient; import org.apache.oozie.client.WorkflowAction; import...

Original 2013-06-29 00:09:51 · 80 views · 0 comments
By default, HDFS trash is disabled

Hadoop is a hierarchical file system, so the old-fashioned 'rm deathstar' (DON'T RUN THIS! 'rm -rf /') is the greatest fear of people who worry about stuff for a living (system admins). Hadoop has ...

2013-06-28 20:20:01 · 80 views · 0 comments
Hadoop jobtracker.jsp 404

After Hadoop starts, accessing the Hadoop Administration page and the jobtracker page returns a 404 error. jps shows that everything is normal. Possible cause: if [ -d "$HADOOP_HOME/build/webapps" ]; then CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build fi Action: cd $HADOOP_HO...

Original 2013-06-28 16:19:26 · 184 views · 0 comments
Install oozie-3.3.2 on Hadoop 1.1.1

After a few hours of tweaking and googling, I managed to install Apache Oozie 3.3.2 on Hadoop 1.1.1. The documentation provided with Apache Oozie 3.3.2 is not very clear. After some googling, I found this bl...

2013-06-27 22:13:12 · 95 views · 0 comments
JDK 1.6.0_29 Mac OS X profiling issue

Today I tried to run Hadoop with hprof to profile Hadoop map tasks. Unfortunately the task failed with the output below: MacBookPro:hadoop-guide gsun$ hadoop -agentlib:hprof=cpu=samples,heap=sites,...

Original 2013-06-26 23:50:22 · 101 views · 0 comments
