Hadoop
puffsun
Availability and Reliability with HBase
Availability: Availability in the context of HBase can be defined as the ability of the system to handle failures. The most common failures cause one or more nodes in the HBase cluster to fall off t... Original · 2013-08-25 10:53:19 · 95 views · 0 comments
Moving Data in/out of Hadoop Filesystem
Hadoop has a number of built-in mechanisms that can facilitate ingress and egress operations, to name a few: embedded NameNode HTTP server; WebHDFS and Hadoop interfaces; HBase built-in API, be sp... Original · 2013-07-18 23:11:51 · 84 views · 0 comments
Enabling Oozie Web Console in CDH3, CDH4 with/without Cloudera Manager
To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows. If you use CDH3, you must do: Download th... Original · 2013-07-16 23:36:37 · 81 views · 0 comments
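Specifying the Flume Log Classification Level
When receiving syslog-format logs over UDP or TCP, for example: flume dump 'syslogUdp(5140)', this command receives logs over UDP on port 5140. If you then want to test from the command line whether logs are received successfully: echo '<37>Hello from cmd.' |nc -u localhost 5140. You must prefix the test text with <37> to classify the log, otherwise flum... Original · 2013-07-16 08:41:14 · 758 views · 0 comments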
PageRank Algorithm in MapReduce
Chapter 5 of Data-Intensive Text Processing with MapReduce introduces how to implement the PageRank algorithm in MapReduce. I am not going to say more about PageRank itself here; please refe... Original · 2013-07-14 12:12:29 · 150 views · 0 comments
Breadth-first Graph Search in MapReduce
Chapter 5 of the book "Data-Intensive Text Processing with MapReduce" introduces how to parallelize breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's ... Original · 2013-07-13 20:44:43 · 136 views · 0 comments
Homework - How to Configure Hadoop Task Scheduler
To configure the MapReduce or YARN task scheduler, go to Services -> mapreduce1/yarn1 -> Configuration. Then click the 'view and edit' tab and search for property 'mapred.jobtracker.taskSchedu... Original · 2013-07-13 01:00:11 · 102 views · 0 comments
Homework - NASA Access Log Processing
Hadoop workshop homework. For privacy, the blog post will not show source code at all, only the job output logs and counters. Copy the packaged jar file into the Hadoop cluster: [root@n1 hadoop-... Original · 2013-07-13 00:36:11 · 168 views · 0 comments
Homework - Running Hadoop WordCount Examples
Hadoop workshop homework. I am an IntelliJ IDEA guy now (I switched from Eclipse several months ago because IntelliJ IDEA is now much better than Eclipse). Currently I... Original · 2013-07-12 23:44:17 · 71 views · 0 comments
Homework - Benchmarking Hadoop Cluster
In this blog post I introduce some of the benchmarking and testing tools in the Apache Hadoop distribution. Namely, I'll look at TeraSort, NNBench and MRBench. These are popular choices to benchmar... Original · 2013-07-12 22:20:29 · 119 views · 0 comments
Commissioning and Decommissioning Nodes from Hadoop Cluster
Nodes in a Hadoop cluster run both a datanode and a tasktracker, and both are typically commissioned or decommissioned in tandem. Commissioning new nodes: Commissioning a new node can be as simple... Original · 2013-07-11 23:34:02 · 170 views · 0 comments
Overview of MapReduce Algorithm Design
Although the programming model of the MapReduce framework forces one to express algorithms in terms of a small set of rigidly defined components, there are many tools at one's disposal to shape the flow ... Original · 2013-07-11 09:32:02 · 83 views · 0 comments
MapReduce Algorithm - Reduce-side Join
Reduce-side join is also known as repartition join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key, and the tuple itself as the intermediate value.... Original · 2013-07-11 08:46:26 · 71 views · 0 comments
Adding HBase Library into Java Classpath
Suppose you write some Java code that operates on HBase via the HBase Java client interface, and you compile and package the Java source code into a jar called examples.jar. In a Hadoop cluster you can use "hbase c... Original · 2013-07-20 14:17:36 · 85 views · 0 comments
Running MapReduce Job with HBase
Generally there are three different ways of interacting with HBase from a MapReduce application. HBase can be used as a data source at the beginning of a job, as a data sink at the end of a job or as ... Original · 2013-07-21 01:50:23 · 93 views · 0 comments
Failed to Run Pig Script with Macro
Pig version: [root@n8 examples]# pig -version Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21 Hadoop version: [root@n8 examples]# hadoop version Hadoop 2.0.0-cd... Original · 2013-08-16 19:44:29 · 150 views · 0 comments
Solution to Hive Thrift Client Hang without Any Return
Env: Cloudera Manager 4.6.1 with CDH4.3; Hadoop 2.0.0-CDH4.3; Hive 0.10.0-CDH4.3; CentOS 6.4 X86_64. Hive started successfully: [root@n8 hive]# netstat -anlp | grep 10000 tcp 0 0 0.0.0.0:... Original · 2013-08-12 19:38:33 · 108 views · 0 comments
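How to Create Hive Data Files
While learning Hive, a problem I often run into is the lack of suitable data files. For example, while reading Programming Hive I was frustrated that no sample data is provided for the Employees table. The reason is that Hive by default uses '\001' (Ctrl+A) as the field separator, '\002' (Ctrl+B) as the collection item separator, and '\003' as the key/value separator for the Map type. When... Original · 2013-08-10 12:05:04 · 137 views · 0 comments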
Hive - Index Creation Failed, Cause Unknown for Now
Environment: Cloudera Hive 0.10-CDH4. The Hive installed on my machine has the following table: hive (human_resources)> describe formatted employees; OK col_name data_type comment # col_name data_type comment ... Original · 2013-08-10 00:08:46 · 986 views · 0 comments
Cascading Terminology and Concepts
Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a ... Original · 2013-08-02 23:17:37 · 117 views · 1 comment
Cascading Kick Start: Word Counting
If you know Hadoop, you have undoubtedly seen WordCount before; WordCount serves as a hello world for Hadoop apps. This simple program provides a great test case for parallel processing: It req... Original · 2013-07-31 19:36:29 · 95 views · 0 comments
Joins with Apache Crunch
Apache Crunch is a Java library for creating MapReduce pipelines that is based on Google's FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig... Original · 2013-07-30 19:46:21 · 102 views · 0 comments
Getting Started with Apache Crunch
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to wr... Original · 2013-07-29 23:10:34 · 102 views · 0 comments
Accelerating Comparison by Providing RawComparator
When a job is in the sorting or merging phase, Hadoop leverages the RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementations that are... Original · 2013-07-27 21:25:07 · 78 views · 0 comments
MapReduce Algorithm - Secondary Sort
Secondary sort is used to allow some records to arrive at a reducer ahead of other records; it requires an understanding of both data arrangement and data flow (partitioning, sorting and gro... Original · 2013-07-25 19:34:46 · 147 views · 0 comments
MapReduce Algorithm - Semi-joins
In the relational world, a semi-join can be defined as a join between two tables that returns rows from the first table where one or more matches are found in the second table. The difference between a semi-jo... Original · 2013-07-25 18:15:04 · 81 views · 0 comments
MapReduce Algorithm - Another Way to Do Map-side Join
Map-side join is also known as replicated join, and gets its name from the fact that the smallest of the datasets is replicated to all the map hosts. You can find an implementation in Hadoop in Action... Original · 2013-07-25 17:51:08 · 134 views · 0 comments
MapReduce Algorithm - in Map Combining
In this blog post, I will demonstrate several techniques for local aggregation using the sample word count example. The original WordCount example comes from Hadoop examples. But I'll try to improv... Original · 2013-07-10 22:38:12 · 134 views · 0 comments
High Order Functions, Roots of MapReduce
Hadoop has its roots in functional programming, which is exemplified in languages such as Lisp and ML. A key feature of functional programming languages is the concept of higher-order functions, or f... Original · 2013-07-09 23:29:50 · 78 views · 0 comments
JMX Port Monitoring for Cloudera CDH4
By default, JMX info is only accessible via the JMX JSON servlet (https://issues.apache.org/jira/browse/HDFS-2083). For example, you can query the NameNode status via: curl -i http://n1.example.com:... Original · 2013-07-08 23:04:18 · 109 views · 0 comments
Classic MapReduce - Shuffle and Sort
Hadoop guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuf... Original · 2013-06-30 11:06:06 · 103 views · 0 comments
Classic MapReduce (MapReduce 1) - Task execution
First, the tasktracker localizes the job jar by copying it from the shared filesystem to its own filesystem. It also copies any files needed by the application from the distributed cache to the local di... Original · 2013-06-29 15:42:36 · 105 views · 0 comments
Classic MapReduce (MapReduce 1) - Job assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. /** * Main service loop. Will stay in this loop forever. */ State offerService() throw... Original · 2013-06-29 14:45:20 · 125 views · 0 comments
Classic MapReduce (MapReduce 1) - Job initialization
When the JobTracker receives a call to its submitJob(...) method, it first checks if the JobTracker is in SafeMode: private void checkSafeMode() throws SafeModeException { if (isInSafeMode()) { ... Original · 2013-06-29 13:56:51 · 93 views · 0 comments
Classic MapReduce (MapReduce 1) - Job submission
Note that the old and new MapReduce APIs are not the same thing as the classic and YARN-based MapReduce implementations (MapReduce 1 and MapReduce 2 respectively). The APIs are user-facing client-... Original · 2013-06-29 11:59:49 · 373 views · 0 comments
Note of Oozie 3.3.2 LocalOozie service start error.
Following the Oozie Official Example at http://oozie.apache.org/docs/3.3.2/DG_Examples.html, I wrote a LocalOozie example: import org.apache.oozie.client.OozieClient; import org.apache.oozie.client.WorkflowAction; import... Original · 2013-06-29 00:09:51 · 80 views · 0 comments
By default, HDFS trash is disabled
Hadoop is a hierarchical file system, so the old-fashioned 'rm deathstar' (DON'T RUN THIS! 'rm -rf /') is the greatest fear of people who worry about stuff for a living (system admins). Hadoop has ... 2013-06-28 20:20:01 · 80 views · 0 comments
Hadoop jobtracker.jsp 404
After Hadoop starts, the Hadoop Administration page and the jobtracker page return 404 errors. jps shows everything is normal. Possible cause: if [ -d "$HADOOP_HOME/build/webapps" ]; then CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build; fi Action: cd $HADOOP_HO... Original · 2013-06-28 16:19:26 · 184 views · 0 comments
Install oozie-3.3.2 on Hadoop 1.1.1
After a few hours of tweaking and googling, I managed to install Apache Oozie 3.3.2 on Hadoop 1.1.1. The documentation provided with Apache Oozie 3.3.2 is not very clear. After some googling, I found this bl... 2013-06-27 22:13:12 · 95 views · 0 comments
JDK 1.6.0_29 Mac OS X profiling issue
Today I tried to run Hadoop with hprof to profile Hadoop map tasks. Unfortunately the task failed with the output below: MacBookPro:hadoop-guide gsun$ hadoop -agentlib:hprof=cpu=samples,heap=sites,... Original · 2013-06-26 23:50:22 · 101 views · 0 comments