How a Spark Program Executes

wuxizhi777

Category: Big Data

Article link: https://blog.csdn.net/wuxizhi777/article/details/90017578

These blog posts explain the topic well:

https://blog.csdn.net/liuxiangke0210/article/details/79687240

https://www.jianshu.com/p/b9ec3c2ff8dd

cluster client

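After a Spark job is submitted:

The program is first divided into stages, and each stage generates a set of tasks (task0, task1, task2, task3, task4, ...). The results are aggregated at the end.

Which node (machine) a task runs on is decided mainly by where the data is physically stored, so that tasks run close to their input blocks.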
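The placement idea above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `LocalityDemo`, the method `assign`, and the host names are hypothetical, and Spark's real scheduler also considers locality levels, wait timeouts, and free executor slots. Each task simply goes to the least-loaded host that stores a replica of its input block.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of data-locality placement (hypothetical, not Spark's scheduler).
class LocalityDemo {

    // blockHosts maps a task name to the hosts holding a replica of its input block.
    // Assumes every task has at least one replica host.
    static Map<String, String> assign(Map<String, List<String>> blockHosts) {
        Map<String, Integer> load = new HashMap<>();
        Map<String, String> placement = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : blockHosts.entrySet()) {
            String best = null;
            for (String host : entry.getValue()) {
                // Prefer the replica host that currently has the fewest tasks.
                if (best == null || load.getOrDefault(host, 0) < load.getOrDefault(best, 0)) {
                    best = host;
                }
            }
            placement.put(entry.getKey(), best);
            load.merge(best, 1, Integer::sum);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        blocks.put("task0", Arrays.asList("node1", "node2"));
        blocks.put("task1", Arrays.asList("node1", "node3"));
        blocks.put("task2", Arrays.asList("node2", "node3"));
        System.out.println(assign(blocks));
    }
}
```

Every task ends up on a host that already holds its data, which is the behavior the paragraph describes.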

A real example

This demo takes a BSON file on HDFS that was exported from MongoDB and splits it into sequence files partitioned by time (by day).

The BSON data totals 12 TB.

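import com.mongodb.DBObject;
import com.mongodb.hadoop.BSONFileInputFormat;
import com.mongodb.util.JSON;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.bson.BSONObject;
import org.slf4j.Logger;
import scala.Tuple2;

import java.text.SimpleDateFormat;
import java.util.List;

public class Recommender {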
    static class KeyasNameMSFOutputFormat extends
            MultipleSequenceFileOutputFormat<LongWritable, BytesWritable>
    {
 
        // Derive the output path from the record key (a timestamp, presumably
        // epoch milliseconds): <yyyyMMdd>/<HH>/result.seq
        @Override
        protected String generateFileNameForKeyValue(LongWritable key,
                                                     BytesWritable value, String name)
        {
            SimpleDateFormat format =  new SimpleDateFormat( "yyyyMMdd/HH/" );
            Long time = key.get();
            System.out.println("time : " + time);
            String dateFileName = format.format(time);
            System.out.println("Format To String(Date):"+dateFileName);
            dateFileName = dateFileName + "result.seq";
            return dateFileName;
        }
    }
 
 
 
    public static void main(String[] args) {
       // String SourceUri =   "result/result.bson";
        String SourceUri =   "hdfs://insight//user/work/transfer/aios_result/243/result.bson";
        //String SourceUri =   "hdfs://insight//user/flume/aios/result/result.bson";
       // String desUri =   "/home/xizhiwu/codeFromGitlab/bson/res13/";
        String desUri =   "hdfs://insight//user/flume/aios/result/test/";
 
       //SparkConf conf = new SparkConf().setAppName("SparkRecommender");
        SparkConf conf = new SparkConf().setAppName("BsonToSequence");
 
        JavaSparkContext sc = new JavaSparkContext(conf);
        Logger log = sc.sc().log();
 
        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
 
 
        log.warn("start ! dazhi");
        JavaRDD<Object> userData = sc.newAPIHadoopFile(SourceUri,
                BSONFileInputFormat.class, Object.class, BSONObject.class, bsonDataConfig).map(
                new Function<Tuple2<Object, BSONObject>, Object>() {
                    @Override
                    public BSONObject call(Tuple2<Object, BSONObject> doc) throws Exception {
                        return doc._2;
                    }
                }
        );
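        // count() is an action: it triggers a full pass over the input just for this log line.
        log.warn("the num  of " + SourceUri +"is:"+userData.count());
        // WARNING: collect() pulls the entire RDD into driver memory. On the 12 TB input
        // this is the line that failed with OOM. resList is never used afterwards.
         List<Object> resList = userData.collect();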
 
        JavaPairRDD<LongWritable, BytesWritable> resTimeRDD = userData.mapToPair(new PairFunction<Object, LongWritable, BytesWritable>() {
            @Override
            public Tuple2<LongWritable, BytesWritable> call(Object obj) throws Exception {
 
                String json = JSON.serialize(obj);
                DBObject bson = (DBObject)JSON.parse(json);
                // Keep only the first 13 characters of serverTime (changed from 16 to 13).
                String millis = bson.get("serverTime").toString().substring(0,13);
                log.info(millis);
                long time = Long.parseLong(millis);
                log.info("JSON : "+json);
                log.info("BSON STRING IS :  "+bson.toString());
                return new Tuple2<>(new LongWritable(time),
                        new BytesWritable(bson.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8)));
            }
        });
 
 
        resTimeRDD.saveAsHadoopFile(desUri, LongWritable.class,BytesWritable.class,KeyasNameMSFOutputFormat.class);
    }
 
}
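The file-naming logic in the demo can be tried in isolation with only the JDK. The following is a minimal sketch, not the author's code: the class `SeqPathDemo`, its method names, and the sample 16-digit `serverTime` value are hypothetical, and the time zone is pinned to UTC as an assumption so the result is reproducible (the demo itself uses the JVM default zone).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Standalone sketch of the timestamp-to-path mapping used by the demo.
class SeqPathDemo {

    // Truncate a longer timestamp string to its first 13 characters,
    // mirroring substring(0, 13) in the demo.
    static long toMillis(String serverTime) {
        return Long.parseLong(serverTime.substring(0, 13));
    }

    // Build the relative output path <yyyyMMdd>/<HH>/result.seq.
    static String pathFor(long epochMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd/HH/");
        // Assumption: UTC. The demo relies on the JVM default time zone instead.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date(epochMillis)) + "result.seq";
    }

    public static void main(String[] args) {
        long millis = toMillis("1557406800000123"); // hypothetical sample value
        System.out.println(millis);          // prints 1557406800000
        System.out.println(pathFor(millis)); // prints 20190509/13/result.seq
    }
}
```

Because the pattern contains both the day and the hour, records land in one directory per hour nested under one directory per day.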

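At the line List<Object> resList = userData.collect(); the program reported an OOM error and aborted.

If you need the collect operator to pull all of an RDD's data to the driver for processing, you must make sure the driver has enough memory; otherwise an out-of-memory error occurs.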

What the RDD consists of

The size of the initial RDD equals the size of the BSON files on HDFS.

Tuning points

https://spark.apache.org/docs/latest/running-on-yarn.html

What was actually done: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files

Reading with newAPIHadoopFile invokes BSONFileInputFormat.class. If the HDFS directory being read contains a .splits file, the input is split according to the contents of that file. This lets the program control the size of the RDD partitions, which can then be tuned together with the parameters set in YARN.

The .splits file is generated by the following hadoop command:

    hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter \
      file:///home/mongodb/backups/really-big-bson-file.bson \
      -c org.apache.hadoop.io.compress.BZip2Codec \
      -o hdfs://localhost:8020/user/mongodb/input/bson

The main parameters are the following:

1. numberExecutor

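A worker can host multiple executors. How many executors each physical machine (datanode) receives is decided by YARN. Here 15 executors were requested, and each physical machine received exactly one executor (as seen in the logs).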

2. executorMemory

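YARN allocates 10G to each container, and other components also use part of that, so 8G was requested here. In practice, only about 4G of this 8G is usually usable (this can be adjusted through the spark.storage.memoryFraction parameter, a legacy setting; since Spark 1.6 unified memory management uses spark.memory.fraction instead).

In other words, with executorCores set to 1, one executor can handle a partition of at most about 4G; anything larger tends to fail with insufficient-resource errors. With executorCores set to 2, the executor can handle partitions of about 2G, because each partition then has only about 2G of memory available.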

3. executorCores

The number of cores determines how many tasks run concurrently inside an executor. These tasks share the executorMemory evenly.
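The arithmetic behind the 4G and 2G figures can be written out as a small calculation. This is a rough sketch, not a Spark API: the class `ExecutorMemoryDemo` and the method `memoryPerTaskGb` are hypothetical, and the usable fraction of 0.5 is an assumption chosen to match the "about 4G out of 8G" observation above.

```java
// Rough per-task memory budget: executor memory, scaled by the usable
// fraction, divided evenly among concurrently running tasks.
class ExecutorMemoryDemo {

    static double memoryPerTaskGb(double executorMemoryGb, double usableFraction, int executorCores) {
        if (executorCores <= 0) {
            throw new IllegalArgumentException("executorCores must be positive");
        }
        return executorMemoryGb * usableFraction / executorCores;
    }

    public static void main(String[] args) {
        // 8G executor, assumed usable fraction 0.5
        System.out.println(memoryPerTaskGb(8.0, 0.5, 1)); // prints 4.0
        System.out.println(memoryPerTaskGb(8.0, 0.5, 2)); // prints 2.0
    }
}
```

A partition larger than this budget is a hint to either create smaller splits or lower executorCores.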

4. Wide dependencies

https://www.cnblogs.com/arachis/p/Spark_Shuffle.html

A wide dependency means a shuffle operation, and each shuffle produces a new stage. Common operations that create wide dependencies: reduceByKey(), join()
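The rule that a shuffle starts a new stage can be sketched as a simple count over a job's operations. This is a simplified model under stated assumptions: `StageCountDemo` and `countStages` are hypothetical names, the lineage is treated as a single linear chain, and only the two operations named above are treated as wide. Spark's actual DAG scheduler works on the full dependency graph.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stage counter: one stage to start with, plus one per wide (shuffle) operation.
class StageCountDemo {

    static final Set<String> WIDE_OPS = new HashSet<>(Arrays.asList("reduceByKey", "join"));

    static int countStages(List<String> operations) {
        int stages = 1;
        for (String op : operations) {
            if (WIDE_OPS.contains(op)) {
                stages++; // a shuffle boundary closes the current stage
            }
        }
        return stages;
    }

    public static void main(String[] args) {
        // The BSON demo uses only narrow operations.
        System.out.println(countStages(Arrays.asList("map", "mapToPair")));                  // prints 1
        System.out.println(countStages(Arrays.asList("map", "filter", "reduceByKey", "map"))); // prints 2
    }
}
```

Under this model the BSON-to-sequence-file demo runs as a single stage, since map and mapToPair are both narrow.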
