Cascading Kick Start: Word Counting

If you know Hadoop, you've undoubtedly seen WordCount before; it serves as the "Hello World" of Hadoop applications. This simple program provides a great test case for parallel processing:

  • It requires a minimal amount of code.
  • It demonstrates the use of both symbolic and numeric values.
  • It shows a dependency graph of tuples as an abstraction.
  • It is not many steps away from useful search indexing

When a distributed computing framework can run WordCount in parallel at scale, it can handle much larger and more interesting algorithms as well. Along the way, we'll show you how to use a few more Cascading operations, plus how to generate a flow diagram as a visualization. The code is shown below:

/*
 * Copyright (c) 2007-2013 Concurrent, Inc. All Rights Reserved.
 *
 * Project and contact information: http://www.cascading.org/
 *
 * This file is part of the Cascading project.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package impatient;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

import java.util.Properties;


public class Main {
    public static void main(String[] args) {
        String docPath = args[0];
        String wcPath = args[1];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, Main.class);
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

        // create source and sink taps
        Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

        // specify a regex operation to split the "document" text lines into a token stream
        Fields token = new Fields("token");
        Fields text = new Fields("text");
        RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
        // only returns "token"
        Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

        // determine the word counts
        Pipe wcPipe = new Pipe("wc", docPipe);
        wcPipe = new GroupBy(wcPipe, token);
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
                .setName("wc")
                .addSource(docPipe, docTap)
                .addTailSink(wcPipe, wcTap);

        // write a DOT file and run the flow
        Flow wcFlow = flowConnector.connect(flowDef);
        wcFlow.writeDOT("dot/wc.dot");
        wcFlow.complete();
    }
}

Let's go through the source code step by step.

  1. Define docTap as the source (input) tap and wcTap as the sink (output) tap:
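    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);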
  2. Configure a HadoopFlowConnector, which will be used to connect the pipe assembly between the source tap and the sink tap (we will talk about pipes in more detail later):
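    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);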
  3. Use a generator inside an Each object to split the document text into a token stream. The generator uses a regex pattern to split the input text on word boundaries: space, [, ], (, ), comma, and period.
    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
    Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
     
  4. Out of that pipe, we get a tuple stream of token values. One benefit of using a regex is that it's simple to change, so we can handle more complex cases of splitting tokens without having to rewrite the generator; see the variant sketched below.
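    For example, a hypothetical variant (an assumption for illustration, not code from this example) could split on any run of non-word characters, which also covers delimiters such as semicolons or quotes without listing each one:

    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[^\\w]+");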
  5. Next, we use a GroupBy to count the occurrences of each token:
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
    Note that we have used Each and Every to perform operations within the pipe assembly. The difference between the two is that an Each operates on individual tuples, so it takes Function operations, while an Every operates on groups of tuples, so it takes Aggregator or Buffer operations; here, the Count aggregator runs over each group of tokens produced by the GroupBy. These different ways of inserting operations serve to categorize the different built-in operations in Cascading.
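    As a small sketch of how an Each with a Function composes into the same assembly (a hypothetical addition, not part of this example, and it would need an extra import of cascading.operation.expression.ExpressionFunction), we could lowercase each token before the GroupBy so that "The" and "the" fall into the same group:

    // hypothetical: normalize token case, inserted before wcPipe is built
    docPipe = new Each(docPipe, token,
        new ExpressionFunction(token, "token.toLowerCase()", String.class), Fields.RESULTS);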
  6. From that wcPipe we get a resulting tuple stream of token and count for the output. Again, we connect the plumbing with a FlowDef:
    FlowDef flowDef = FlowDef.flowDef()
            .setName("wc")
            .addSource(docPipe, docTap)
            .addTailSink(wcPipe, wcTap);
    Flow wcFlow = flowConnector.connect(flowDef);
     
  7. Finally, we generate a DOT file to depict the Cascading flow graphically; these diagrams are really helpful for troubleshooting workflows in Cascading:
    // Generate a dot file to depict the flow.
    wcFlow.writeDOT("dot/wc.dot");
    wcFlow.complete();
    Below is what the diagram looks like in OmniGraffle.

    [flow diagram: the "wc" flow rendered from dot/wc.dot]
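    To reproduce the diagram without OmniGraffle, Graphviz can render the DOT file directly. The commands below are a hypothetical invocation; the jar name and the input/output paths are assumptions about the project layout, not taken from this example:

    # run the flow on Hadoop (assumed jar name and paths)
    hadoop jar impatient.jar data/docs.txt output/wc
    # render the flow diagram written by writeDOT (assumed output file name)
    dot -Tpng dot/wc.dot -o dot/wc.png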