With massive data volumes, spark-redis cannot deliver the required processing performance: it traverses keys via SCAN, and even with ample resources it cannot pull hundreds of millions of key-value pairs within one minute. We therefore need to design the Redis storage layout and, using Redisson — a popular, high-performance Redis client — build a custom Spark RDD to improve how the data is read.
RDD is Spark's core abstraction. The full name is Resilient Distributed Dataset. Its most important property is fault tolerance: it can recover automatically from node failures. If an RDD partition on some node is lost because that node fails, the RDD recomputes the partition from its original data source.
A custom RDD requires that its objects be serializable, yet a RedissonClient instance is not serializable. Does that rule out a custom RDD built on Redisson? No: we can extend AbstractIterator with a class that reads data through a cursor, and hold the Redisson client in a static variable so the connection is established once per JVM instead of once per task. The following builds the custom RDD step by step.
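The code later in this article calls `RedisClient.initInstance(properties)` and `RedisClient.getClient()`, but that helper class is never shown. A minimal sketch of such a static holder might look like the following; the `redis.address` property name and the single-server configuration are assumptions, not from the original:

```java
import java.util.Properties;
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public final class RedisClient {
    // Static field: each executor JVM creates the (non-serializable) client once.
    private static volatile RedissonClient client;

    private RedisClient() {}

    // Double-checked locking so concurrent tasks initialize the client only once.
    public static void initInstance(Properties props) {
        if (client == null) {
            synchronized (RedisClient.class) {
                if (client == null) {
                    Config config = new Config();
                    config.useSingleServer()
                          .setAddress(props.getProperty("redis.address"));
                    client = Redisson.create(config);
                }
            }
        }
    }

    public static RedissonClient getClient() {
        return client;
    }
}
```

Because the field is static and never serialized, Spark can ship the surrounding classes to executors without tripping over the non-serializable client.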
First, implement the Partition interface to define the custom partition information:
public static class RedissonRangePartition implements Partition {
    private static final long serialVersionUID = 1L;
    private final int index; // partition index, starting at 0
    private final int from;  // start Redis slot of this partition
    private final int to;    // end Redis slot of this partition (exclusive)
    public RedissonRangePartition(int index, int from, int to) {
        this.index = index;
        this.from = from;
        this.to = to;
    }
    @Override
    public int index() {
        return index;
    }
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof RedissonRangePartition)) {
            return false;
        }
        // Partitions are equal when they carry the same index.
        return ((RedissonRangePartition) obj).index == index;
    }
    @Override
    public int hashCode() {
        return index();
    }
}
Next, define the reading iterator:
public static class RedissonIterator extends AbstractIterator<String> {
    private final int from;
    private final int to;
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<String>();
    private volatile int overRun = 0; // set to 1 when the producer thread finishes
    public class DealThread implements Runnable { // producer thread that reads from Redis
        private final Properties properties;
        public DealThread(final Properties properties) {
            this.properties = properties;
        }
        @Override
        public void run() {
            getKeys(properties);
            overRun = 1;
        }
        private void getKeys(final Properties properties) {
            try {
                RedisClient.initInstance(properties); // initialize the static connection once per JVM
                RedissonClient redisson = RedisClient.getClient(); // obtain the shared client
                // `to` is exclusive, so adjacent ranges such as 0-100 and 100-200 do not overlap.
                for (int i = from; i < to; i++) {
                    // read the data of slot i from Redis and offer the results to `queue`
                }
            } catch (Exception e) {
                log.error("Get Message error:", e);
            }
        }
    }
    public RedissonIterator(int from, int to, final Properties properties) {
        this.from = from;
        this.to = to;
        DealThread dealThread = new DealThread(properties);
        // A separate thread raises throughput by overlapping Redis IO with consumption.
        new Thread(dealThread).start();
    }
    @Override
    public boolean hasNext() {
        // Block until the producer enqueues a value or signals completion.
        while (queue.isEmpty() && overRun == 0) {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                log.error("Get Message error:", e);
            }
        }
        return !queue.isEmpty();
    }
    @Override
    public String next() {
        return queue.poll();
    }
}
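The iterator above is a producer-consumer handoff: the background thread fills a ConcurrentLinkedQueue while `hasNext()` polls until a value arrives or a completion flag is raised. Reduced to plain JDK types (all names here are illustrative, not from the article), the pattern looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class HandoffDemo {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    // volatile: written by the producer thread, read by the consumer thread.
    private volatile boolean done = false;

    // Producer: enqueue all keys, then signal completion.
    public void produce(List<String> keys) {
        Thread producer = new Thread(() -> {
            queue.addAll(keys);
            done = true;
        });
        producer.start();
    }

    // Consumer side: block until a value arrives or the producer is finished.
    public boolean hasNext() {
        while (queue.isEmpty() && !done) {
            try {
                Thread.sleep(10); // simple polling, mirroring the article's 200 ms sleep
            } catch (InterruptedException ignored) { }
        }
        return !queue.isEmpty();
    }

    public String next() {
        return queue.poll();
    }

    public static void main(String[] args) {
        HandoffDemo demo = new HandoffDemo();
        demo.produce(List.of("k1", "k2", "k3"));
        List<String> out = new ArrayList<>();
        while (demo.hasNext()) {
            out.add(demo.next());
        }
        System.out.println(out); // prints [k1, k2, k3]
    }
}
```

The `volatile` flag is what makes the completion signal visible across threads; without it, `hasNext()` could spin forever on a stale cached value of `done`.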
Finally, the custom RDD itself:
public static class RedissonRDD extends RDD<String> {
    private static final long serialVersionUID = 1L;
    private final Properties properties;
    private final String[] jobList; // slot ranges, e.g. "1-1000","1000-1200"
    public RedissonRDD(SparkContext sc, final Properties properties, final String[] jobList) {
        super(sc, new ArrayBuffer<Dependency<?>>(), STRING_TAG);
        this.properties = properties;
        this.jobList = jobList;
    }
    @Override
    public Iterator<String> compute(Partition split, TaskContext context) {
        RedissonRangePartition p = (RedissonRangePartition) split;
        return new RedissonIterator(p.from, p.to, properties);
    }
    @Override
    public Partition[] getPartitions() {
        Partition[] partitions = new Partition[jobList.length];
        for (int i = 0; i < jobList.length; i++) {
            String[] strs = StringUtils.split(jobList[i], "-");
            int startPos = Integer.parseInt(strs[0]);
            int endPos = Integer.parseInt(strs[1]);
            partitions[i] = new RedissonRangePartition(i, startPos, endPos);
        }
        return partitions;
    }
}
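The jobList passed to the RDD is just an array of "from-to" slot ranges, and the partition count equals the number of ranges. A small helper (hypothetical, not part of the article) that splits Redis Cluster's 16384 hash slots into N contiguous ranges could look like:

```java
import java.util.ArrayList;
import java.util.List;

public class SlotRanges {
    private static final int TOTAL_SLOTS = 16384; // Redis Cluster hash slot count

    // Split [0, TOTAL_SLOTS) into n contiguous "from-to" ranges; `to` is exclusive,
    // matching ranges like "0-100","100-200" in the article.
    public static String[] build(int n) {
        List<String> ranges = new ArrayList<>();
        int step = (TOTAL_SLOTS + n - 1) / n; // ceiling division
        for (int from = 0; from < TOTAL_SLOTS; from += step) {
            int to = Math.min(from + step, TOTAL_SLOTS);
            ranges.add(from + "-" + to);
        }
        return ranges.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // 4 partitions: 0-4096, 4096-8192, 8192-12288, 12288-16384
        System.out.println(String.join(",", build(4)));
    }
}
```

Making the partition count a multiple of the number of executor cores keeps all cores busy without creating overly small tasks.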
Finally, invoke it from Spark:
String[] doJobArgs = {"0-100", "100-200"};
try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
    System.out.println(new RedissonRDD(sc.sc(), Config.getProp(), doJobArgs).toJavaRDD().count());
}