
Spark Streaming on HDFS: A Hands-On Example


Reference

DT大数据梦工厂

Scenario

Spark Streaming monitors a directory on HDFS and prints a word count of the contents of the files in that directory. New files copied into the directory (for example with hdfs dfs -put) are picked up on each batch interval.

Experiment

package cool.pengych.spark.streaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

import scala.Tuple2;

/**
 * Spark Streaming: monitor a directory on HDFS and count the words in the
 * files that appear under it.
 * @author pengyucheng
 */
public class SparkStreamingOnHDFS 
{
    public static void main(String[] args)
    {
        /*
         * Step 1: configure the SparkConf for the application.
         */
        final SparkConf  conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOnHDFS");

        final String checkpointDirectory = "hdfs://112.74.21.122:9000/library/SparkStreaming/data";

        JavaStreamingContextFactory factory = new JavaStreamingContextFactory()
        {
            @Override
            public JavaStreamingContext create()
            {
                return createContext(checkpointDirectory,conf);
            }
        };
        /*
         * Step 2: recover the StreamingContext from the checkpoint directory if one
         * exists, otherwise create a fresh one via the factory. This allows the Driver
         * to recover from failure, provided the Driver runs in the cluster and the
         * application is submitted with --supervise.
         */
        JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDirectory, factory);

        // Step 3: create a DStream over the monitored HDFS directory. Note that this
        // points at the checkpoint directory itself, so Spark's binary checkpoint
        // files are also picked up as input (see the summary below).
        JavaDStream<String> lines = jsc.textFileStream(checkpointDirectory);

        // Step 4: the classic word count: split, pair, reduce.
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String line) throws Exception {
                // Split each line on commas to produce individual words.
                return Arrays.asList(line.split(","));
            }
        });

        JavaPairDStream<String, Integer>  pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                // Pair each word with an initial count of 1.
                return new Tuple2<String,Integer>(word,1);
            }
        });

        JavaPairDStream<String, Integer>  wordsCount  = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                // Sum the per-word counts.
                return v1 + v2;
            }
        });

        wordsCount.print();

        /*
         * Step 5: start the StreamingContext and block until the computation terminates.
         */
        jsc.start();

        jsc.awaitTermination();
    }

    /**
     * Create a JavaStreamingContext with a 5-second batch interval and
     * checkpointing enabled.
     * @param checkpointDirectory HDFS directory in which to store checkpoint data
     * @param conf the SparkConf for the application
     * @return a new JavaStreamingContext
     */
    private static JavaStreamingContext createContext(String checkpointDirectory, SparkConf conf)
    {
        JavaStreamingContext ssc = new JavaStreamingContext(conf,Durations.seconds(5));
        ssc.checkpoint(checkpointDirectory);
        return ssc;
    }
}
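
One thing worth noting: the stream above watches the same HDFS path that is used for checkpointing, so Spark's binary checkpoint files become input to textFileStream. A minimal sketch of a cleaner layout, assuming a hypothetical separate input directory (the .../input path below is illustrative, not from the original experiment); only the affected lines of main change:

final SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOnHDFS");

// Hypothetical layout: keep the monitored input directory separate from the
// checkpoint directory so checkpoint files are never read as input.
final String checkpointDirectory = "hdfs://112.74.21.122:9000/library/SparkStreaming/data";
final String inputDirectory = "hdfs://112.74.21.122:9000/library/SparkStreaming/input";

JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDirectory,
        new JavaStreamingContextFactory() {
            @Override
            public JavaStreamingContext create() {
                JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
                ssc.checkpoint(checkpointDirectory);
                return ssc;
            }
        });

// Only plain text files dropped into inputDirectory are streamed as lines.
JavaDStream<String> lines = jsc.textFileStream(inputDirectory);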

Execution result

16/06/01 16:49:10 INFO DAGScheduler: Job 5 finished: print at SparkStreamingOnHDFS.java:76, took 0.303949 s
-------------------------------------------
Time: 1464770950000 ms
-------------------------------------------
(1��|~

Summary

I ran into a few problems in my recent studies:
1. Spark SQL reports that a database does not exist when accessing data in Hive, even though the database clearly exists on the Hive side (see the sketch after this list).
2. Today's Spark Streaming lesson (lecture 85) felt disorganized and hard to follow. That is an opportunity: first spend 7 days working through the whole Spark Streaming course to build a rough overall picture and pin down the technical points, then do a second, deeper pass. Keep at it! I love pains.
3. The garbled experiment output above is hard to interpret; most likely the job read Spark's binary checkpoint files as text, because textFileStream is pointed at the checkpoint directory itself (see the sketch after the code listing). Recording it here for future reference.
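
For problem 1, a frequent cause (an assumption here, not something confirmed above) is that hive-site.xml is not on Spark's classpath, so Spark SQL silently falls back to a local embedded Derby metastore that knows nothing about the databases in the real Hive metastore. A minimal Spark 1.x smoke test, assuming hive-site.xml has been copied into $SPARK_HOME/conf (the class name HiveSmokeTest is hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class HiveSmokeTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("HiveSmokeTest");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // HiveContext reads hive-site.xml from the classpath; without it,
        // Spark SQL uses an embedded Derby metastore instead of the real one.
        HiveContext hiveContext = new HiveContext(sc.sc());

        // List the databases the metastore actually exposes; if the expected
        // database is missing here, the wrong metastore is being used.
        hiveContext.sql("SHOW DATABASES").show();
    }
}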
