Spark自定义分区(Partitioner)

最新推荐文章于 2021-12-14 23:21:40 发布

江成琳

最新推荐文章于 2021-12-14 23:21:40 发布

阅读量694

点赞数

文章标签： Spark 分区图划分

我们都知道Spark内部提供了HashPartitioner和RangePartitioner两种分区策略(这两种分区的代码解析可以参见：《Spark分区器HashPartitioner和RangePartitioner代码详解》)，这两种分区策略在很多情况下都适合我们的场景。但是有些情况下，Spark内部不能符合咱们的需求，这时候我们就可以自定义分区策略。为此，Spark提供了相应的接口，我们只需要扩展Partitioner抽象类，然后实现里面的三个方法：

 
package org.apache.spark
 
 
 
/**
 
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 
 */
 
abstract class Partitioner extends Serializable {
 
  def numPartitions: Int
 
  def getPartition(key: Any): Int
 
}

　　def numPartitions: Int：这个方法需要返回你想要创建分区的个数；
　　def getPartition(key: Any): Int：这个函数需要对输入的key做计算，然后返回该key的分区ID，范围一定是0到numPartitions-1；
　　equals()：这个是Java标准的判断相等的函数，之所以要求用户实现这个函数是因为Spark内部会比较两个RDD的分区是否一样。

　　假如我们想把来自同一个域名的URL放到一台节点上，比如:http://www.iteblog.com和http://www.iteblog.com/archives/1368，如果你使用HashPartitioner，这两个URL的Hash值可能不一样，这就使得这两个URL被放到不同的节点上。所以这种情况下我们就需要自定义我们的分区策略，可以如下实现：

 
package com.iteblog.utils
 
 
 
import org.apache.spark.Partitioner
 
 
 
/**
 
 * User: 过往记忆
 
 * Date: 2015-05-21
 
 * Time: 下午23:34
 
 * bolg: http://www.iteblog.com
 
 * 本文地址：http://www.iteblog.com/archives/1368
 
 * 过往记忆博客，专注于hadoop、hive、spark、shark、flume的技术博客，大量的干货
 
 * 过往记忆博客微信公共帐号：iteblog_hadoop
 
 */
 
 
 
class IteblogPartitioner(numParts: Int) extends Partitioner {
 
  override def numPartitions: Int = numParts
 
 
 
  override def getPartition(key: Any): Int = {
 
    val domain = new java.net.URL(key.toString).getHost()
 
    val code = (domain.hashCode % numPartitions)
 
    if (code < 0) {
 
      code + numPartitions
 
    } else {
 
      code
 
    }
 
  }
 
 
 
  override def equals(other: Any): Boolean = other match {
 
    case iteblog: IteblogPartitioner =>
 
      iteblog.numPartitions == numPartitions
 
    case _ =>
 
      false
 
  }
 
 
 
  override def hashCode: Int = numPartitions
 
}

因为hashCode值可能为负数，所以我们需要对他进行处理。然后我们就可以在partitionBy()方法里面使用我们的分区：

`1`	`iteblog.partitionBy(new` `IteblogPartitioner(20))`

　　类似的，在Java中定义自己的分区策略和Scala类似，只需要继承org.apache.spark.Partitioner，并实现其中的方法即可。

　　在Python中，你不需要扩展Partitioner类，我们只需要对iteblog.partitionBy()加上一个额外的hash函数，如下：

 
    1 import urlparse
 
  
 
    2  
 
  
 
    3 def iteblog_domain(url):
 
  
 
    4   return hash(urlparse.urlparse(url).netloc)
 
  
 
    5  
 
  
 
    6 iteblog.partitionBy(20, iteblog_domain)
 
   
 
  

     上述部分 
   转载自 
   （http://www.iteblog.com/） 
  

     下述部分转载自  
   http://blog.csdn.net/bluejoe2000/article/details/41415087 
  
 RDD是个抽象类，定义了诸如map()、reduce()等方法，但实际上继承RDD的派生类一般只要实现两个方法：
 
def getPartitions: Array[Partition]
def compute(thePart: Partition, context: TaskContext): NextIterator[T]
 
 getPartitions()用来告知怎么将input分片；
 
 compute()用来输出每个Partition的所有行（行是我给出的一种不准确的说法，应该是被函数处理的一个单元）；
 以一个hdfs文件HadoopRDD为例：
 
[java]view plain copy 
      
 print?
 override def getPartitions: Array[Partition] = {  
   val jobConf = getJobConf()  
   // add the credentials here as this can be called before SparkContext initialized  
   SparkHadoopUtil.get.addCredentials(jobConf)  
   val inputFormat = getInputFormat(jobConf)  
   if (inputFormat.isInstanceOf[Configurable]) {  
     inputFormat.asInstanceOf[Configurable].setConf(jobConf)  
   }  
   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)  
   val array = new Array[Partition](inputSplits.size)  
   for (i <- 0 until inputSplits.size) {  
     array(i) = new HadoopPartition(id, i, inputSplits(i))  
   }  
   array  
 }  
它直接将各个split包装成RDD了，再看compute()： 
 
[java]view plain copy 
      
 print?
 override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {  
   val iter = new NextIterator[(K, V)] {  
   
     val split = theSplit.asInstanceOf[HadoopPartition]  
     logInfo("Input split: " + split.inputSplit)  
     var reader: RecordReader[K, V] = null  
     val jobConf = getJobConf()  
     val inputFormat = getInputFormat(jobConf)  
     HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),  
       context.stageId, theSplit.index, context.attemptId.toInt, jobConf)  
     reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)  
   
     // Register an on-task-completion callback to close the input stream.  
     context.addTaskCompletionListener{ context => closeIfNeeded() }  
     val key: K = reader.createKey()  
     val value: V = reader.createValue()  
   
     // Set the task input metrics.  
     val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)  
     try {  
       /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't 
        * always at record boundaries, so tasks may need to read into other splits to complete 
        * a record. */  
       inputMetrics.bytesRead = split.inputSplit.value.getLength()  
     } catch {  
       case e: java.io.IOException =>  
         logWarning("Unable to get input size to set InputMetrics for task", e)  
     }  
     context.taskMetrics.inputMetrics = Some(inputMetrics)  
   
     override def getNext() = {  
       try {  
         finished = !reader.next(key, value)  
       } catch {  
         case eof: EOFException =>  
           finished = true  
       }  
       (key, value)  
     }  
   
     override def close() {  
       try {  
         reader.close()  
       } catch {  
         case e: Exception => logWarning("Exception in RecordReader.close()", e)  
       }  
     }  
   }  
   new InterruptibleIterator[(K, V)](context, iter)  
 }  
它调用reader返回一系列的K,V键值对。 
 再来看看数据库的JdbcRDD：
 
[java]view plain copy 
      
 print?
 override def getPartitions: Array[Partition] = {  
   // bounds are inclusive, hence the + 1 here and - 1 on end  
   val length = 1 + upperBound - lowerBound  
   (0 until numPartitions).map(i => {  
     val start = lowerBound + ((i * length) / numPartitions).toLong  
     val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1  
     new JdbcPartition(i, start, end)  
   }).toArray  
 }  
它直接将结果集分成numPartitions份。其中很多参数都来自于构造函数： 
 
[java]view plain copy 
      
 print?
 class JdbcRDD[T: ClassTag](  
     sc: SparkContext,  
     getConnection: () => Connection,  
     sql: String,  
     lowerBound: Long,  
     upperBound: Long,  
     numPartitions: Int,  
     mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)  
再看看compute()函数： 
 
[java]view plain copy 
      
 print?
 override def compute(thePart: Partition, context: TaskContext) = new NextIterator[T] {  
   context.addTaskCompletionListener{ context => closeIfNeeded() }  
   val part = thePart.asInstanceOf[JdbcPartition]  
   val conn = getConnection()  
   val stmt = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)  
   
   // setFetchSize(Integer.MIN_VALUE) is a mysql driver specific way to force streaming results,  
   // rather than pulling entire resultset into memory.  
   // see http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html  
   if (conn.getMetaData.getURL.matches("jdbc:mysql:.*")) {  
     stmt.setFetchSize(Integer.MIN_VALUE)  
     logInfo("statement fetch size set to: " + stmt.getFetchSize + " to force MySQL streaming ")  
   }  
   
   stmt.setLong(1, part.lower)  
   stmt.setLong(2, part.upper)  
   val rs = stmt.executeQuery()  
   
   override def getNext: T = {  
     if (rs.next()) {  
       mapRow(rs)  
     } else {  
       finished = true  
       null.asInstanceOf[T]  
     }  
   }  
   
   override def close() {  
     try {  
       if (null != rs && ! rs.isClosed()) {  
         rs.close()  
       }  
     } catch {  
       case e: Exception => logWarning("Exception closing resultset", e)  
     }  
     try {  
       if (null != stmt && ! stmt.isClosed()) {  
         stmt.close()  
       }  
     } catch {  
       case e: Exception => logWarning("Exception closing statement", e)  
     }  
     try {  
       if (null != conn && ! conn.isClosed()) {  
         conn.close()  
       }  
       logInfo("closed connection")  
     } catch {  
       case e: Exception => logWarning("Exception closing connection", e)  
     }  
   }  
 }  

 

江成琳

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Spark自定义分区(Partitioner)

转载自过往记忆（http://www.iteblog.com/）我们都知道Spark内部提供了HashPartitioner和RangePartitioner两种分区策略(这两种分区的代码解析可以参见：《Spark分区器HashPartitioner和RangePartitioner代码详解》)，这两种分区策略在很多情况下都适合我们的场景。但是有些情况下，Spark内部不能符合咱们的需求，
复制链接

扫一扫