Spark Operators: the wholeTextFiles Operator


1. Introduction

Overview of wholeTextFiles
	SparkContext.wholeTextFiles can process multiple files at once;
	It takes the directory containing the files as its argument and builds an RDD from it;
	Each element of the RDD is a (filename, content) key-value pair for one file, where filename is the file path and content is the file's contents;
	It is typically used to build an RDD from many small files;
	The minimum number of partitions of the returned RDD can be specified via a parameter;
wholeTextFiles vs. textFile
	Similarities
		Both can build an RDD from multiple files;
	Differences
		wholeTextFiles takes the path of the directory containing the small files, whereas to read multiple files with textFile the argument is a comma-separated string of file paths;
		wholeTextFiles is typically used to build an RDD from many small files, whereas textFile is typically used to build an RDD from a single file;
		wholeTextFiles returns each file as a (filename, content) pair, whereas textFile returns one record per line of each file (see the usage sketch after this list);
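
A minimal usage sketch of the two operators (the paths here are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wholeTextFilesDemo").setMaster("local[2]"))

// wholeTextFiles: one (path, content) pair per file under the directory
val files = sc.wholeTextFiles("hdfs:///data/small-files")         // RDD[(String, String)]
files.take(1).foreach { case (path, content) => println(path) }

// textFile: one record per line; multiple files are passed as a comma-separated string
val lines = sc.textFile("hdfs:///data/a.txt,hdfs:///data/b.txt")  // RDD[String]
println(lines.count())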

2. Source Code Walkthrough

2.1. wholeTextFiles Source

For the defaultMinPartitions logic, see the companion article "spark算子-textFile算子" (on the textFile operator).

Parameters
	path: the path of the directory containing the small files
	minPartitions:
		The minimum number of partitions; it can be passed explicitly when calling wholeTextFiles, or omitted;
		When omitted, it defaults to defaultMinPartitions, computed as follows (the one-line Spark source appears after this list):
			If spark.default.parallelism is set to some value p:
				when p > 2, defaultMinPartitions = 2, i.e. the default minimum number of partitions is 2;
				when p <= 2, defaultMinPartitions = p, i.e. the default minimum number of partitions is p;
			If spark.default.parallelism is not set:
				in cluster mode, defaultMinPartitions = 2;
				in local mode, defaultMinPartitions = min(total CPU cores, 2);
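
Both cases collapse into a single line in SparkContext (Spark source):

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)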
What the code does
	Determine the minimum number of RDD partitions
	Set the input directory path
	Build a WholeTextFileRDD
	Transform the RDD: file path as key, file content as value
	Set the RDD name
def wholeTextFiles(
      path: String,
      //this parameter specifies the minimum number of partitions
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
    assertNotStopped()
    val job = NewHadoopJob.getInstance(hadoopConfiguration)
    
    //set the input directory path
    NewFileInputFormat.setInputPaths(job, path)
    val updateConf = job.getConfiguration
    //build the WholeTextFileRDD
    new WholeTextFileRDD(
      this,
      classOf[WholeTextFileInputFormat],
      classOf[Text],
      classOf[Text],
      updateConf,
      minPartitions
      ).map(
      	//transform the RDD: file path as key, file content as value
      	record => (record._1.toString, record._2.toString)
      ).setName(path)
  }
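
For example (hypothetical path), passing minPartitions explicitly; note that setName(path) makes the input path show up as the RDD's name in the Spark UI:

val rdd = sc.wholeTextFiles("/data/small-files", minPartitions = 4)  // RDD[(String, String)]
println(rdd.name)                                                    // "/data/small-files"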

2.2. Partitioning Source in WholeTextFileRDD

Summary
	The number of file splits determines the number of RDD partitions (a quick check follows the source below)
Partitioning steps
	Based on the minimum number of partitions, set maxSplitSize (maximum split size), minSplitSizeRack (minimum split size per rack) and minSplitSizeNode (minimum split size per node)
	Compute the file splits
	Build one partition per file split
override def getPartitions: Array[Partition] = {
    val conf = getConf
    
    conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS,
      Runtime.getRuntime.availableProcessors().toString)
    val inputFormat = inputFormatClass.newInstance
    inputFormat match {
      case configurable: Configurable =>
        configurable.setConf(conf)
      case _ =>
    }
    val jobContext = new JobContextImpl(conf, jobId)
    //set maxSplitSize (maximum split size), minSplitSizeRack (minimum split size per rack), minSplitSizeNode (minimum split size per node)
    inputFormat.setMinPartitions(jobContext, minPartitions)
    //compute the file splits
    val rawSplits = inputFormat.getSplits(jobContext).toArray
    //the number of file splits becomes the number of RDD partitions
    val result = new Array[Partition](rawSplits.size)
    //build one RDD partition per file split
    for (i <- 0 until rawSplits.size) {
      result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
    }
    result
  }
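
A quick check (hypothetical path): the partition count of the returned RDD equals the number of CombineFileSplits produced here, which is typically close to minPartitions but not guaranteed equal to it:

val rdd = sc.wholeTextFiles("/data/small-files", minPartitions = 3)
println(rdd.getNumPartitions)  // one partition per CombineFileSplit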

2.3. setMinPartitions: Setting the Split Sizes

Maximum split size maxSplitSize (a worked example follows the source below):
	the smallest integer not less than totalBytes / minPartitions, i.e. the ceiling
The minimum split size per node/rack must not exceed maxSplitSize
def setMinPartitions(context: JobContext, minPartitions: Int) {
    val files = listStatus(context).asScala
    //total bytes across all the small files
    val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
    //maximum split size in bytes: ceil(totalBytes / minPartitions)
    val maxSplitSize = Math.ceil(totalLen * 1.0 /
      (if (minPartitions == 0) 1 else minPartitions)).toLong

    val config = context.getConfiguration
    //minimum split size per node from the configuration
    val minSplitSizePerNode = config.getLong(CombineFileInputFormat.SPLIT_MINSIZE_PERNODE, 0L)
    //minimum split size per rack from the configuration
    val minSplitSizePerRack = config.getLong(CombineFileInputFormat.SPLIT_MINSIZE_PERRACK, 0L)

	//ensure the per-node/per-rack minimum split sizes do not exceed maxSplitSize
    if (maxSplitSize < minSplitSizePerNode) {
      super.setMinSplitSizeNode(maxSplitSize)
    }
    if (maxSplitSize < minSplitSizePerRack) {
      super.setMinSplitSizeRack(maxSplitSize)
    }
    super.setMaxSplitSize(maxSplitSize)
  }
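
A worked example of the computation above, assuming 100 small files of 1 KB each and minPartitions = 3:

val totalLen = 100 * 1024L                               // 102400 bytes
val maxSplitSize = Math.ceil(totalLen * 1.0 / 3).toLong  // ceil(34133.3) = 34134
// blocks are then packed into splits of at most 34134 bytes,
// yielding roughly 3 partitions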

3. CombineFileInputFormat.getSplits(): Computing the File Splits

What it does (the corresponding configuration keys are listed after the source below)
	Fetch the split constraints: minSizeNode, minSizeRack, maxSize
	Fetch file information: count, locations, etc.
	Delegate to the concrete split-computation method
public List<InputSplit> getSplits(JobContext job) throws IOException {
        long minSizeNode = 0L;
        long minSizeRack = 0L;
        long maxSize = 0L;
        Configuration conf = job.getConfiguration();
        //fetch minSizeNode, minSizeRack and maxSize; these were set by setMinPartitions; if unset, read them from the configuration, defaulting to 0
        if (this.minSplitSizeNode != 0L) {
            minSizeNode = this.minSplitSizeNode;
        } else {
            minSizeNode = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.node", 0L);
        }

        if (this.minSplitSizeRack != 0L) {
            minSizeRack = this.minSplitSizeRack;
        } else {
            minSizeRack = conf.getLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 0L);
        }

        if (this.maxSplitSize != 0L) {
            maxSize = this.maxSplitSize;
        } else {
            maxSize = conf.getLong("mapreduce.input.fileinputformat.split.maxsize", 0L);
        }

		//ensure minSizeNode and minSizeRack do not exceed maxSize, and minSizeNode does not exceed minSizeRack
        if (minSizeNode != 0L && maxSize != 0L && minSizeNode > maxSize) {
            throw new IOException("Minimum split size pernode " + minSizeNode + " cannot be larger than maximum split size " + maxSize);
        } else if (minSizeRack != 0L && maxSize != 0L && minSizeRack > maxSize) {
            throw new IOException("Minimum split size per rack " + minSizeRack + " cannot be larger than maximum split size " + maxSize);
        } else if (minSizeRack != 0L && minSizeNode > minSizeRack) {
            throw new IOException("Minimum split size per node " + minSizeNode + " cannot be larger than minimum split " + "size per rack " + minSizeRack);
        } else {
        	//fetch the list of file statuses
            List<FileStatus> stats = this.listStatus(job);
            //container for the resulting file splits
            List<InputSplit> splits = new ArrayList();
            //an empty directory returns an empty list immediately
            if (stats.size() == 0) {
                return splits;
            } else {
                Iterator var11 = this.pools.iterator();

                while(var11.hasNext()) {
                    CombineFileInputFormat.MultiPathFilter onepool = (CombineFileInputFormat.MultiPathFilter)var11.next();
                    ArrayList<FileStatus> myPaths = new ArrayList();
                    Iterator iter = stats.iterator();

                    while(iter.hasNext()) {
                        FileStatus p = (FileStatus)iter.next();
                        if (onepool.accept(p.getPath())) {
                            myPaths.add(p);
                            iter.remove();
                        }
                    }

                    this.getMoreSplits(job, myPaths, maxSize, minSizeNode, minSizeRack, splits);
                }

				//compute the splits for the remaining files
                this.getMoreSplits(job, stats, maxSize, minSizeNode, minSizeRack, splits);
                this.rackToNodes.clear();
                return splits;
            }
        }
    }
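
For reference, the three limits read above map to Hadoop configuration keys (the values below are hypothetical); in the wholeTextFiles path, maxSize is normally set by setMinPartitions rather than read from the configuration:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 1L * 1024 * 1024) // minSizeNode
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 2L * 1024 * 1024) // minSizeRack
conf.setLong("mapreduce.input.fileinputformat.split.maxsize",          8L * 1024 * 1024) // maxSize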

3.1. getMoreSplits: The Split Computation Itself

Split computation steps
	Collect block information for each individual file
	Build the data splits from the file blocks and the split-size limits
private void getMoreSplits(JobContext job, List<FileStatus> stats, long maxSize, long minSizeNode, long minSizeRack, List<InputSplit> splits) throws IOException {
        Configuration conf = job.getConfiguration();
        //maps describing the relationships between blocks, racks and nodes
        //rack -> blocks
        HashMap<String, List<CombineFileInputFormat.OneBlockInfo>> rackToBlocks = new HashMap();
        //block -> nodes
        HashMap<CombineFileInputFormat.OneBlockInfo, String[]> blockToNodes = new HashMap();
        //node -> blocks
        HashMap<String, Set<CombineFileInputFormat.OneBlockInfo>> nodeToBlocks = new HashMap();
        //per-file information array
        CombineFileInputFormat.OneFileInfo[] files = new CombineFileInputFormat.OneFileInfo[stats.size()];
        CombineFileInputFormat.OneFileInfo[] files = new CombineFileInputFormat.OneFileInfo[stats.size()];
        if (stats.size() != 0) {
            long totLength = 0L;
            int i = 0;

			//iterate over every file, accumulating the total byte count
            for(Iterator var18 = stats.iterator(); var18.hasNext(); totLength += files[i++].getLength()) {
                FileStatus stat = (FileStatus)var18.next();
                //compute the blocks of this single file
                files[i] = new CombineFileInputFormat.OneFileInfo(stat, conf, this.isSplitable(job, stat.getPath()), rackToBlocks, blockToNodes, nodeToBlocks, this.rackToNodes, maxSize);
            }
            }

			//build the data splits from the file blocks and the size limits
            this.createSplits(nodeToBlocks, blockToNodes, rackToBlocks, totLength, maxSize, minSizeNode, minSizeRack, splits);
        }
    }

3.1.1. Block Computation for a Single File

Rules for determining the block size (applied repeatedly to the bytes remaining in each block location)
	maxSize (the block-size limit) not set
		the remaining bytes become one block
	maxSize set
		remaining bytes < maxSize: the remaining bytes become one block
		maxSize < remaining bytes < 2 * maxSize: the block size is half the remaining bytes
		remaining bytes > 2 * maxSize: the block size is maxSize
Summary
	This code is the construction of the per-file OneFileInfo object: it carves the file into blocks, builds a block object for each, and fills in the OneFileInfo object's fileSize and blocks fields; a worked example follows the source below;
OneFileInfo(FileStatus stat, Configuration conf, boolean isSplitable, HashMap<String, List<CombineFileInputFormat.OneBlockInfo>> rackToBlocks, HashMap<CombineFileInputFormat.OneBlockInfo, String[]> blockToNodes, HashMap<String, Set<CombineFileInputFormat.OneBlockInfo>> nodeToBlocks, HashMap<String, Set<String>> rackToNodes, long maxSize) throws IOException {
			//fetch the file's block locations
            BlockLocation[] locations;
            if (stat instanceof LocatedFileStatus) {//the FileStatus already carries its block locations
                locations = ((LocatedFileStatus)stat).getBlockLocations();
            } else {
                FileSystem fs = stat.getPath().getFileSystem(conf);
                locations = fs.getFileBlockLocations(stat, 0L, stat.getLen());
            }

			//no block locations: store an empty block array
            if (locations == null) {
                this.blocks = new CombineFileInputFormat.OneBlockInfo[0];
            } else {
                if (locations.length == 0 && !stat.isDirectory()) {
                    locations = new BlockLocation[]{new BlockLocation()};
                }
				
				//non-splittable case: the whole file becomes a single block
                if (!isSplitable) {
                    this.blocks = new CombineFileInputFormat.OneBlockInfo[1];
                    this.fileSize = stat.getLen();
                    this.blocks[0] = new CombineFileInputFormat.OneBlockInfo(stat.getPath(), 0L, this.fileSize, locations[0].getHosts(), locations[0].getTopologyPaths());
                } else {
                	//container for the file's blocks
                    ArrayList<CombineFileInputFormat.OneBlockInfo> blocksList = new ArrayList(locations.length);
                    int i = 0;

                    while(true) {
                    	//all locations processed: convert the block list to an array
                        if (i >= locations.length) {
                            this.blocks = (CombineFileInputFormat.OneBlockInfo[])blocksList.toArray(new CombineFileInputFormat.OneBlockInfo[blocksList.size()]);
                            break;
                        }

						//accumulate the file size
                        this.fileSize += locations[i].getLength();
                        //bytes remaining to be carved into blocks
                        long left = locations[i].getLength();
                        //starting offset of the remaining bytes
                        long myOffset = locations[i].getOffset();
                        //size of the block being carved
                        long myLength = 0L;

                        do {
                        	//determine the block size
                        	//maxSize not set: the remaining bytes become one block
                            if (maxSize == 0L) {
                                myLength = left;
                            } 
                            //remaining bytes between maxSize and 2 * maxSize: use half the remaining bytes (avoids a tiny tail block)
                            else if (left > maxSize && left < 2L * maxSize) {
                                myLength = left / 2L;
                            } 
                            //remaining bytes > 2 * maxSize: the block size is maxSize
                            //remaining bytes <= maxSize: the remaining bytes become one block
                            else {
                                myLength = Math.min(maxSize, left);
                            }

							//carve a block of the chosen size
                            CombineFileInputFormat.OneBlockInfo oneblock = new CombineFileInputFormat.OneBlockInfo(stat.getPath(), myOffset, myLength, locations[i].getHosts(), locations[i].getTopologyPaths());
                            //subtract the carved bytes from the remainder
                            left -= myLength;
                            //advance the starting offset of the remaining bytes
                            myOffset += myLength;
                            //add the block to the container
                            blocksList.add(oneblock);
                        } while(left > 0L);

                        ++i;
                    }
                }

				//fill in the block/node/rack mappings
                populateBlockInfo(this.blocks, rackToBlocks, blockToNodes, nodeToBlocks, rackToNodes);
            }

        }
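
A minimal sketch of the carving rule above (a hypothetical helper, not part of Hadoop). It shows why the halving rule for the 1x-2x range exists: carving 100 bytes with maxSize = 40 yields blocks of 40, 30 and 30 rather than 40, 40 and a 20-byte tail:

def carve(totalLen: Long, maxSize: Long): Seq[Long] = {
  var left = totalLen
  val out = scala.collection.mutable.ListBuffer.empty[Long]
  do {
    val len =
      if (maxSize == 0L) left                                    // no limit: one block
      else if (left > maxSize && left < 2L * maxSize) left / 2L  // avoid a tiny tail
      else math.min(maxSize, left)                               // cap at maxSize
    out += len
    left -= len
  } while (left > 0L)
  out.toList
}

carve(100L, 40L)  // List(40, 30, 30)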
3.1.1.1. OneFileInfo: Per-File Information Class
fileSize: total bytes of the file
blocks: array of the file's blocks
static class OneFileInfo {
        private long fileSize = 0L;
        private CombineFileInputFormat.OneBlockInfo[] blocks;
    }
3.1.1.2. OneBlockInfo: Per-Block Information Class
static class OneBlockInfo {
        Path onepath;//file path
        long offset;//starting offset of the block
        long length;//block size in bytes
        String[] hosts;//nodes holding the block
        String[] racks;//racks holding the block
	}
3.1.1.3. Filling in the block/node/rack Mappings
static void populateBlockInfo(CombineFileInputFormat.OneBlockInfo[] blocks, Map<String, List<CombineFileInputFormat.OneBlockInfo>> rackToBlocks, Map<CombineFileInputFormat.OneBlockInfo, String[]> blockToNodes, Map<String, Set<CombineFileInputFormat.OneBlockInfo>> nodeToBlocks, Map<String, Set<String>> rackToNodes) {
            CombineFileInputFormat.OneBlockInfo[] var5 = blocks;
            int var6 = blocks.length;
			
			//iterate over the blocks array
            for(int var7 = 0; var7 < var6; ++var7) {
                CombineFileInputFormat.OneBlockInfo oneblock = var5[var7];
				
				//fill in block -> nodes
                blockToNodes.put(oneblock, oneblock.hosts);
                String[] racks = null;
                if (oneblock.hosts.length == 0) {
                    racks = new String[]{"/default-rack"};
                } else {
                    racks = oneblock.racks;
                }

                int j;
                String node;
                Object blklist;
                //fill in rack -> blocks
                for(j = 0; j < racks.length; ++j) {
                    node = racks[j];
                    blklist = (List)rackToBlocks.get(node);
                    if (blklist == null) {
                        blklist = new ArrayList();
                        rackToBlocks.put(node, blklist);
                    }

                    ((List)blklist).add(oneblock);
                    if (!racks[j].equals("/default-rack")) {
                        CombineFileInputFormat.addHostToRack(rackToNodes, racks[j], oneblock.hosts[j]);
                    }
                }

				//fill in node -> blocks
                for(j = 0; j < oneblock.hosts.length; ++j) {
                    node = oneblock.hosts[j];
                    blklist = (Set)nodeToBlocks.get(node);
                    if (blklist == null) {
                        blklist = new LinkedHashSet();
                        nodeToBlocks.put(node, blklist);
                    }

                    ((Set)blklist).add(oneblock);
                }
            }

        }
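
As an illustration (hypothetical values): after populateBlockInfo processes a single block b1 stored on host h1 in rack /r1, the maps contain:
	blockToNodes: b1 -> [h1]
	rackToBlocks: /r1 -> [b1]
	nodeToBlocks: h1 -> {b1}
	rackToNodes: /r1 -> {h1}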

3.1.2. Split Computation Across All Files

Split logic (a simplified sketch follows the source below)
	1. Iterate over every node, computing splits from that node's block list
		A1. When the total bytes of the node's pending blocks >= maxSize (the split limit), create a split
		B1. When A1 does not hold, and minSizeNode <= total pending bytes < maxSize and no split has yet been created on this node, create a split
		C1. When neither A1 nor B1 holds, mark the pending blocks as unassigned and leave them for later
	2. If unassigned blocks remain after step 1, iterate over every rack, computing splits from each rack's unassigned block list
		A2. When the total bytes of the rack's unassigned blocks >= maxSize, create a split
		B2. When A2 does not hold, and minSizeRack <= total unassigned bytes < maxSize, create a split
		C2. When neither A2 nor B2 holds, put the unassigned blocks in an overflow container for later
	3. If unassigned blocks remain in the overflow container after steps 1 and 2, compute splits from that list
		A3. Iterate over the remaining blocks; whenever the accumulated byte count >= maxSize, create a split
		B3. When A3 does not hold, create one final split from all remaining blocks
void createSplits(Map<String, Set<CombineFileInputFormat.OneBlockInfo>> nodeToBlocks, Map<CombineFileInputFormat.OneBlockInfo, String[]> blockToNodes, Map<String, List<CombineFileInputFormat.OneBlockInfo>> rackToBlocks, long totLength, long maxSize, long minSizeNode, long minSizeRack, List<InputSplit> splits) {
		//pending blocks for the split currently being built
        ArrayList<CombineFileInputFormat.OneBlockInfo> validBlocks = new ArrayList();
        //byte count of the pending blocks
        long curSplitSize = 0L;
        //number of nodes
        int totalNodes = nodeToBlocks.size();
        //total bytes still to be assigned to splits
        long totalLength = totLength;
        //number of splits created per node
        Multiset<String> splitsPerNode = HashMultiset.create();
        //set of nodes whose blocks have been fully handled
        HashSet completedNodes = new HashSet();

        label170:
        //phase 1: iterate over every node
        do {
        	//iterator over the node -> blocks map
            Iterator iter = nodeToBlocks.entrySet().iterator();

            while(true) {
                while(true) {
                    Entry one;
                    String node;
                    do {
                    	//all nodes visited; jump back to the outer loop condition
                        if (!iter.hasNext()) {
                            continue label170;
                        }
						//the node -> blocks entry handled in this iteration
                        one = (Entry)iter.next();
                        //the node handled in this iteration
                        node = (String)one.getKey();
                    } while(completedNodes.contains(node));
					//the block set of the current node
                    Set<CombineFileInputFormat.OneBlockInfo> blocksInCurrentNode = (Set)one.getValue();
                    //iterator over that block set
                    Iterator oneBlockIter = blocksInCurrentNode.iterator();

					//iterate over all blocks on the current node
                    while(oneBlockIter.hasNext()) {
                    	//a single block
                        CombineFileInputFormat.OneBlockInfo oneblock = (CombineFileInputFormat.OneBlockInfo)oneBlockIter.next();
                        if (!blockToNodes.containsKey(oneblock)) {
                        	//drop blocks that were already assigned to a split
                            oneBlockIter.remove();
                        } else {
                        	//add the current block to the pending list
                            validBlocks.add(oneblock);
                            //mark the current block as assigned
                            blockToNodes.remove(oneblock);
                            //accumulate the byte count of the pending blocks
                            curSplitSize += oneblock.length;
                            //total pending bytes on this node >= maxSize (the split limit): create a split
                            if (maxSize != 0L && curSplitSize >= maxSize) {
                            	//create the split
                                this.addCreatedSplit(splits, Collections.singleton(node), validBlocks);
                                //update the bytes still to be assigned
                                totalLength -= curSplitSize;
                                //reset the pending byte count
                                curSplitSize = 0L;
                                splitsPerNode.add(node);
                                //remove the assigned blocks from the current node's set
                                blocksInCurrentNode.removeAll(validBlocks);
                                //clear the pending list
                                validBlocks.clear();
                                break;
                            }
                        }
                    }
					
					//pending blocks remain on this node
                    if (validBlocks.size() != 0) {
                    	//minSizeNode <= pending bytes < maxSize, and no split has been created on this node yet
                        if (minSizeNode != 0L && curSplitSize >= minSizeNode && splitsPerNode.count(node) == 0) {
                        	//create a split
                            this.addCreatedSplit(splits, Collections.singleton(node), validBlocks);
                            totalLength -= curSplitSize;
                            splitsPerNode.add(node);
                            blocksInCurrentNode.removeAll(validBlocks);
                        } else {//pending bytes < minSizeNode
                            Iterator var36 = validBlocks.iterator();

							//put the pending blocks back as unassigned; later phases will pick them up
                            while(var36.hasNext()) {
                                CombineFileInputFormat.OneBlockInfo oneblock = (CombineFileInputFormat.OneBlockInfo)var36.next();
                                blockToNodes.put(oneblock, oneblock.hosts);
                            }
                        }

                        validBlocks.clear();
                        curSplitSize = 0L;
                        completedNodes.add(node);
                    } else if (blocksInCurrentNode.size() == 0) {//all of this node's blocks have been assigned
                        completedNodes.add(node);
                    }
                }
            }
        } while(completedNodes.size() != totalNodes && totalLength != 0L);

        LOG.info("DEBUG: Terminated node allocation with : CompletedNodes: " + completedNodes.size() + ", size left: " + totalLength);
        ArrayList overflowBlocks = new ArrayList();
        HashSet racks = new HashSet();

        Iterator iter;
        label130:
        //phase 2: after the node pass, split the blocks still unassigned
        while(blockToNodes.size() > 0) {
            iter = rackToBlocks.entrySet().iterator();
			//iterate over every rack
            while(true) {
                while(true) {
                    if (!iter.hasNext()) {
                        continue label130;
                    }

                    Entry<String, List<CombineFileInputFormat.OneBlockInfo>> one = (Entry)iter.next();
                    racks.add(one.getKey());
                    //all blocks on the current rack
                    List<CombineFileInputFormat.OneBlockInfo> blocks = (List)one.getValue();
                    boolean createdSplit = false;
                    Iterator var38 = blocks.iterator();
					//iterate over the rack's blocks
                    while(var38.hasNext()) {
                        CombineFileInputFormat.OneBlockInfo oneblock = (CombineFileInputFormat.OneBlockInfo)var38.next();
                        //keep only blocks that are still unassigned
                        if (blockToNodes.containsKey(oneblock)) {
                            validBlocks.add(oneblock);
                            blockToNodes.remove(oneblock);
                            curSplitSize += oneblock.length;
                            //unassigned bytes on this rack >= maxSize
                            if (maxSize != 0L && curSplitSize >= maxSize) {
                            	//create a split
                                this.addCreatedSplit(splits, this.getHosts(racks), validBlocks);
                                //flag that a split was created
                                createdSplit = true;
                                break;
                            }
                        }
                    }

                    if (createdSplit) {
                        curSplitSize = 0L;
                        validBlocks.clear();
                        racks.clear();
                    } else {
                    	//unassigned bytes on this rack < maxSize
                        if (!validBlocks.isEmpty()) {
                        	//minSizeRack <= unassigned bytes < maxSize
                            if (minSizeRack != 0L && curSplitSize >= minSizeRack) {
                            	//create a split
                                this.addCreatedSplit(splits, this.getHosts(racks), validBlocks);
                            } else {//unassigned bytes < minSizeRack
                            	//too small to split at rack level; defer to the overflow phase
                                overflowBlocks.addAll(validBlocks);
                            }
                        }

                        curSplitSize = 0L;
                        validBlocks.clear();
                        racks.clear();
                    }
                }
            }
        }

        assert blockToNodes.isEmpty();

        assert curSplitSize == 0L;

        assert validBlocks.isEmpty();

        assert racks.isEmpty();
		//phase 3: iterator over the blocks still unassigned after the node and rack passes
        iter = overflowBlocks.iterator();
		//iterate over the remaining blocks
        while(iter.hasNext()) {
            CombineFileInputFormat.OneBlockInfo oneblock = (CombineFileInputFormat.OneBlockInfo)iter.next();
            validBlocks.add(oneblock);
            curSplitSize += oneblock.length;

            for(int i = 0; i < oneblock.racks.length; ++i) {
                racks.add(oneblock.racks[i]);
            }
			//accumulated bytes >= maxSize
            if (maxSize != 0L && curSplitSize >= maxSize) {
            	//create a split
                this.addCreatedSplit(splits, this.getHosts(racks), validBlocks);
                curSplitSize = 0L;
                validBlocks.clear();
                racks.clear();
            }
        }
		
		//finally, create one split from whatever remains; its total byte count is < maxSize
        if (!validBlocks.isEmpty()) {
            this.addCreatedSplit(splits, this.getHosts(racks), validBlocks);
        }

    }
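
A minimal sketch of the packing rule shared by all three phases (a hypothetical helper, not part of Hadoop): greedily accumulate block lengths until maxSize is reached, emit a split, and emit whatever remains as one final, smaller split:

def pack(blockLens: Seq[Long], maxSize: Long): Seq[Seq[Long]] = {
  val splits = scala.collection.mutable.ListBuffer.empty[Seq[Long]]
  var cur = scala.collection.mutable.ListBuffer.empty[Long]
  var curSize = 0L
  for (len <- blockLens) {
    cur += len
    curSize += len
    if (maxSize != 0L && curSize >= maxSize) {  // A3: limit reached, emit a split
      splits += cur.toList
      cur = scala.collection.mutable.ListBuffer.empty[Long]
      curSize = 0L
    }
  }
  if (cur.nonEmpty) splits += cur.toList        // B3: trailing split below maxSize
  splits.toList
}

pack(Seq(10L, 10L, 10L, 10L), 25L)  // List(List(10, 10, 10), List(10))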
3.1.2.1. addCreatedSplit: Building a Split
private void addCreatedSplit(List<InputSplit> splitList, Collection<String> locations, ArrayList<CombineFileInputFormat.OneBlockInfo> validBlocks) {
        Path[] fl = new Path[validBlocks.size()];
        long[] offset = new long[validBlocks.size()];
        long[] length = new long[validBlocks.size()];
		//copy the path, offset and length of every pending block
        for(int i = 0; i < validBlocks.size(); ++i) {
            fl[i] = ((CombineFileInputFormat.OneBlockInfo)validBlocks.get(i)).onepath;
            offset[i] = ((CombineFileInputFormat.OneBlockInfo)validBlocks.get(i)).offset;
            length[i] = ((CombineFileInputFormat.OneBlockInfo)validBlocks.get(i)).length;
        }
		//build the split object
        CombineFileSplit thissplit = new CombineFileSplit(fl, offset, length, (String[])locations.toArray(new String[0]));
        //append it to the split list
        splitList.add(thissplit);
    }
3.1.2.2. The Split Information Class
A single split contains 0 to n file blocks
public class CombineFileSplit extends InputSplit implements Writable {
    private Path[] paths;		//file paths of the blocks
    private long[] startoffset;	//starting offsets of the blocks
    private long[] lengths;		//byte counts of the blocks
    private String[] locations;	//nodes holding the blocks
    private long totLength;		//total bytes of the split
   
   	private void initSplit(Path[] files, long[] start, long[] lengths, String[] locations) {
        this.startoffset = start;
        this.lengths = lengths;
        this.paths = files;
        this.totLength = 0L;
        this.locations = locations;
        long[] var5 = lengths;
        int var6 = lengths.length;

        for(int var7 = 0; var7 < var6; ++var7) {
            long length = var5[var7];
            this.totLength += length;
        }

    }
}

4. References

RDD Programming Guide
Part 19: Source analysis of the CombineTextInputFormat split mechanism
