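I. Preface
As we know, when we use DataX to run a synchronization job between heterogeneous data sources, for example reading a table from MySQL and writing it to HDFS, all we need to do is write a JSON file that configures the reader and the writer and then launch DataX. It keeps pulling data from the reader and writing it to the writer, and the job only counts as complete once all the required data has been read and written. So how do the reader and the writer cooperate during this process? Let's walk through it step by step.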
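II. Core source code walkthrough
First, let's narrow down the scope of the source code we are going to read. We assume DataX has already assigned the taskGroups and split the job into tasks, and is now about to run each pending task; that is, the TaskExecutor calls its doStart() method, shown below.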
public void doStart() {
    this.writerThread.start();
    // The reader has not started yet, so the writer cannot have finished
    if (!this.writerThread.isAlive() || this.taskCommunication.getState() == State.FAILED) {
        throw DataXException.asDataXException(
                FrameworkErrorCode.RUNTIME_ERROR,
                this.taskCommunication.getThrowable());
    }
    this.readerThread.start();
    // The reader may finish very quickly here
    if (!this.readerThread.isAlive() && this.taskCommunication.getState() == State.FAILED) {
        // The reader may die as soon as it starts; in that case the exception must be thrown immediately
        throw DataXException.asDataXException(
                FrameworkErrorCode.RUNTIME_ERROR,
                this.taskCommunication.getThrowable());
    }
}
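One note before moving on: readerThread and writerThread are instantiated when the TaskExecutor is constructed. Their corresponding readerRunner and writerRunner each hold a reference to a BufferedRecordExchanger instance, named recordSender and recordReceiver respectively, and both exchangers hold a reference to the same MemoryChannel instance. MemoryChannel is the link that carries data between the reader and the writer: the reader fetches data from the source and writes it into its local buffer, and when that buffer is full it is flushed to the MemoryChannel; the writer pulls data from the MemoryChannel into its own local buffer and then consumes it, writing it to the target data source.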
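The wiring described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not DataX's actual classes: SketchChannel and SketchExchanger are hypothetical stand-ins for MemoryChannel and BufferedRecordExchanger, records are plain strings, and the writer simply stops after a known number of records instead of waiting for a terminate marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for MemoryChannel: one bounded queue shared by both sides.
class SketchChannel {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

    void push(String record) throws InterruptedException {
        queue.put(record); // blocks while the queue is full
    }

    String pull() throws InterruptedException {
        return queue.take(); // blocks while the queue is empty
    }
}

// Hypothetical stand-in for BufferedRecordExchanger: each side owns one,
// but both point at the same channel.
class SketchExchanger {
    final SketchChannel channel;

    SketchExchanger(SketchChannel channel) {
        this.channel = channel;
    }
}

class WiringSketch {
    static List<String> transfer(List<String> source) {
        SketchChannel channel = new SketchChannel();
        SketchExchanger recordSender = new SketchExchanger(channel);   // reader side
        SketchExchanger recordReceiver = new SketchExchanger(channel); // writer side

        List<String> target = new ArrayList<>();
        int expected = source.size();
        Thread writerThread = new Thread(() -> {
            try {
                for (int i = 0; i < expected; i++) {
                    target.add(recordReceiver.channel.pull());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread readerThread = new Thread(() -> {
            try {
                for (String record : source) {
                    recordSender.channel.push(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writerThread.start(); // same order as doStart(): writer first, then reader
        readerThread.start();
        try {
            readerThread.join();
            writerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(transfer(Arrays.asList("a", "b", "c")));
    }
}
```

The point of the sketch is only the sharing: two separate exchanger objects, one channel object.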
① The writer thread
Let's start with the writer thread, beginning with WriterRunner's run() method.
@Override
public void run() {
    Validate.isTrue(this.recordReceiver != null);
    Writer.Task taskWriter = (Writer.Task) this.getPlugin();
    // Track waitReadTime; it is ended in the finally block
    PerfRecord channelWaitRead = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_READ_TIME);
    try {
        channelWaitRead.start();
        LOG.debug("task writer starts to do init ...");
        PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_INIT);
        initPerfRecord.start();
        taskWriter.init();
        initPerfRecord.end();
        LOG.debug("task writer starts to do prepare ...");
        PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_PREPARE);
        preparePerfRecord.start();
        taskWriter.prepare();
        preparePerfRecord.end();
        LOG.debug("task writer starts to write ...");
        PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DATA);
        dataPerfRecord.start();
        taskWriter.startWrite(recordReceiver);
        dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication()));
        dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication()));
        dataPerfRecord.end();
        LOG.debug("task writer starts to do post ...");
        PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_POST);
        postPerfRecord.start();
        taskWriter.post();
        postPerfRecord.end();
        super.markSuccess();
    } catch (Throwable e) {
        LOG.error("Writer Runner Received Exceptions:", e);
        super.markFail(e);
    } finally {
        LOG.debug("task writer starts to do destroy ...");
        PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DESTROY);
        desPerfRecord.start();
        super.destroy();
        desPerfRecord.end();
        channelWaitRead.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_READER_TIME));
    }
}
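It looks like a lot, but it simply runs taskWriter's init, prepare, startWrite and post in order, with destroy in the finally block. The part we care about is startWrite.
The implementation of startWrite is left to the plugin author, but there is a convention: our own taskWriter plugin needs to run the loop while ((record = recordReceiver.getFromReader()) != null), so that it keeps pulling the records read by the reader and consuming them.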
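A minimal sketch of that convention might look like the following. The types here are assumptions made so the example is self-contained: LoopRecordReceiver is a hypothetical one-method stand-in for DataX's RecordReceiver, records are plain strings, and CollectingWriterTask is an invented writer that just collects what it receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for DataX's RecordReceiver; only the method used here is declared.
interface LoopRecordReceiver {
    String getFromReader(); // returns null once the reader side has finished
}

// Hypothetical writer task: "writes" each record by appending it to a list.
class CollectingWriterTask {
    final List<String> written = new ArrayList<>();

    void startWrite(LoopRecordReceiver recordReceiver) {
        String record;
        // The conventional loop: keep pulling until null signals the end of the stream.
        while ((record = recordReceiver.getFromReader()) != null) {
            written.add(record);
        }
    }
}

class WriterLoopDemo {
    static List<String> run(List<String> input) {
        Iterator<String> it = input.iterator();
        // A fake receiver that hands out the input and then returns null.
        LoopRecordReceiver receiver = () -> it.hasNext() ? it.next() : null;
        CollectingWriterTask task = new CollectingWriterTask();
        task.startWrite(receiver);
        return task.written;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("row-1", "row-2")));
    }
}
```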
So let's look directly at BufferedRecordExchanger's getFromReader() method:
public Record getFromReader() {
    if (shutdown) {
        throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, "");
    }
    // First check whether the local buffer is exhausted; if so, pull more data from the MemoryChannel
    boolean isEmpty = (this.bufferIndex >= this.buffer.size());
    if (isEmpty) {
        receive();
    }
    // Take one record from the local buffer. If it is a TerminateRecord, the reader has finished,
    // so return null; the caller handles null and ends the writer
    Record record = this.buffer.get(this.bufferIndex++);
    if (record instanceof TerminateRecord) {
        record = null;
    }
    return record;
}
Next, the receive() method:
private void receive() {
    this.channel.pullAll(this.buffer);
    this.bufferIndex = 0;
    this.bufferSize = this.buffer.size();
}
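This mainly calls MemoryChannel's pullAll() method to pull data from the channel into the local buffer, then resets the index of the next consumable record in the local buffer to the first element.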
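To make the interplay of getFromReader() and receive() concrete, here is a small sketch of the receive-side buffering. It rests on several assumptions: records are strings, the TERMINATE constant is a hypothetical sentinel standing in for TerminateRecord, and a plain BlockingQueue stands in for the MemoryChannel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReceiveBufferSketch {
    // Hypothetical sentinel standing in for DataX's TerminateRecord.
    static final String TERMINATE = "__TERMINATE__";

    private final BlockingQueue<String> channel; // stand-in for MemoryChannel
    private final List<String> buffer = new ArrayList<>();
    private final int bufferSize;
    private int bufferIndex = 0;

    ReceiveBufferSketch(BlockingQueue<String> channel, int bufferSize) {
        this.channel = channel;
        this.bufferSize = bufferSize;
    }

    String getFromReader() {
        // Local buffer exhausted: refill it from the channel.
        if (bufferIndex >= buffer.size()) {
            receive();
        }
        String record = buffer.get(bufferIndex++);
        // The sentinel means the reader is done; report that as null.
        return TERMINATE.equals(record) ? null : record;
    }

    private void receive() {
        buffer.clear();
        try {
            buffer.add(channel.take()); // block until at least one record exists
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        channel.drainTo(buffer, bufferSize - 1); // then take the rest of the batch without blocking
        bufferIndex = 0; // next record to consume is the first one
    }

    // Helper for the demo: preload a channel and read it back through the buffer.
    static List<String> drain(List<String> records, int bufferSize) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(records.size() + 1);
        channel.addAll(records);
        channel.add(TERMINATE);
        ReceiveBufferSketch receiver = new ReceiveBufferSketch(channel, bufferSize);
        List<String> out = new ArrayList<>();
        String record;
        while ((record = receiver.getFromReader()) != null) {
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

The writer plugin only ever sees single records and a final null; the batching stays hidden inside the exchanger.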
MemoryChannel's pullAll() mostly delegates to doPullAll(), so let's jump straight to doPullAll():
private ArrayBlockingQueue<Record> queue = null;
private ReentrantLock lock;
private Condition notInsufficient, notEmpty;

protected void doPullAll(Collection<Record> rs) {
    assert rs != null;
    rs.clear();
    try {
        long startTime = System.nanoTime();
        lock.lockInterruptibly();
        // Pull data from the queue into the buffer. If nothing can be pulled, i.e. the queue is empty,
        // await until the reader thread pushes data into the queue and signals the writer
        while (this.queue.drainTo(rs, bufferSize) <= 0) {
            notEmpty.await(200L, TimeUnit.MILLISECONDS);
        }
        waitReaderTime += System.nanoTime() - startTime;
        int bytes = getRecordBytes(rs);
        memoryBytes.addAndGet(-bytes);
        // The writer has pulled data, so the queue now has room. Wake the reader as well, because it
        // may be waiting after failing to push for lack of space in the queue
        notInsufficient.signalAll();
    } catch (InterruptedException e) {
        throw DataXException.asDataXException(
                FrameworkErrorCode.RUNTIME_ERROR, e);
    } finally {
        lock.unlock();
    }
}
② The reader thread
Likewise, start with the run() method of the reader thread:
public void run() {
    assert null != this.recordSender;
    Reader.Task taskReader = (Reader.Task) this.getPlugin();
    // Track waitWriterTime; it is only ended in the finally block
    PerfRecord channelWaitWrite = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_WRITE_TIME);
    try {
        channelWaitWrite.start();
        LOG.debug("task reader starts to do init ...");
        PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_INIT);
        initPerfRecord.start();
        taskReader.init();
        initPerfRecord.end();
        LOG.debug("task reader starts to do prepare ...");
        PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_PREPARE);
        preparePerfRecord.start();
        taskReader.prepare();
        preparePerfRecord.end();
        LOG.debug("task reader starts to read ...");
        PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DATA);
        dataPerfRecord.start();
        taskReader.startRead(recordSender);
        recordSender.terminate();
        dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication()));
        dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication()));
        dataPerfRecord.end();
        LOG.debug("task reader starts to do post ...");
        PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_POST);
        postPerfRecord.start();
        taskReader.post();
        postPerfRecord.end();
        // automatic flush
        // super.markSuccess(); Success must not be marked here; it is flagged by the writerRunner
        // (otherwise the reader could finish first while the writer has not finished yet, a serious bug)
    } catch (Throwable e) {
        LOG.error("Reader runner Received Exceptions:", e);
        super.markFail(e);
    } finally {
        LOG.debug("task reader starts to do destroy ...");
        PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DESTROY);
        desPerfRecord.start();
        super.destroy();
        desPerfRecord.end();
        channelWaitWrite.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_WRITER_TIME));
        long transformerUsedTime = super.getRunnerCommunication().getLongCounter(CommunicationTool.TRANSFORMER_USED_TIME);
        if (transformerUsedTime > 0) {
            PerfRecord transformerRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.TRANSFORMER_TIME);
            transformerRecord.start();
            transformerRecord.end(transformerUsedTime);
        }
    }
}
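The part we care about is startRead.
The implementation of startRead is also left to the plugin author, and again there is a convention: our own taskReader plugin needs to call recordSender.sendToWriter(record); repeatedly, so that records keep flowing toward the MemoryChannel.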
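The reader-side convention can be sketched in the same spirit. Again, the types are assumptions for the sake of a self-contained example: LoopRecordSender is a hypothetical one-method stand-in for DataX's RecordSender, and ListReaderTask is an invented reader whose "data source" is an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DataX's RecordSender; only the method used here is declared.
interface LoopRecordSender {
    void sendToWriter(String record);
}

// Hypothetical reader task whose data source is just a list of rows.
class ListReaderTask {
    private final List<String> sourceRows;

    ListReaderTask(List<String> sourceRows) {
        this.sourceRows = sourceRows;
    }

    void startRead(LoopRecordSender recordSender) {
        // The conventional pattern: hand every record read from the source to the sender.
        for (String row : sourceRows) {
            recordSender.sendToWriter(row);
        }
    }
}

class ReaderLoopDemo {
    static List<String> run(List<String> rows) {
        List<String> sent = new ArrayList<>();
        // A fake sender that simply remembers what it was given.
        new ListReaderTask(rows).startRead(sent::add);
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("row-1", "row-2", "row-3")));
    }
}
```

Note that the plugin does not signal the end of the stream itself; in the run() method shown earlier, recordSender.terminate() is called by the runner after startRead returns.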
So let's look directly at BufferedRecordExchanger's sendToWriter() method:
public void sendToWriter(Record record) {
    if (shutdown) {
        throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, "");
    }
    Validate.notNull(record, "record不能为空."); // message: "record must not be null."
    if (record.getMemorySize() > this.byteCapacity) {
        // message: "a single record exceeds the size limit; the current limit is: %s"
        this.pluginCollector.collectDirtyRecord(record, new Exception(String.format("单条记录超过大小限制,当前限制为:%s", this.byteCapacity)));
        return;
    }
    // First check whether the local buffer is full; if so, flush it to the channel, otherwise keep adding to the buffer
    boolean isFull = (this.bufferIndex >= this.bufferSize || this.memoryBytes.get() + record.getMemorySize() > this.byteCapacity);
    if (isFull) {
        flush();
    }
    this.buffer.add(record);
    this.bufferIndex++;
    memoryBytes.addAndGet(record.getMemorySize());
}
Next, the flush() method:
public void flush() {
    if (shutdown) {
        throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, "");
    }
    this.channel.pushAll(this.buffer);
    this.buffer.clear();
    this.bufferIndex = 0;
    this.memoryBytes.set(0);
}
This mainly calls MemoryChannel's pushAll() method to push the buffered data to the MemoryChannel, then clears the buffer and resets the index.
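The flush-when-full behaviour of sendToWriter() and flush() can be illustrated with a reduced sketch. It makes several assumptions: only the record-count limit is modelled (no byte accounting), records are strings, a list of batches stands in for channel.pushAll(), and finish() is a hypothetical helper that flushes whatever remains at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SendBufferSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int bufferSize;
    // Stand-in for the channel: every flushed batch is recorded here.
    private final List<List<String>> pushedBatches = new ArrayList<>();

    SendBufferSketch(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    void sendToWriter(String record) {
        // Buffer full: push the whole batch first, then buffer the new record.
        if (buffer.size() >= bufferSize) {
            flush();
        }
        buffer.add(record);
    }

    void flush() {
        pushedBatches.add(new ArrayList<>(buffer)); // copy, because the buffer is reused
        buffer.clear();
    }

    // Hypothetical end-of-stream helper: push the partially filled last batch.
    void finish() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    static List<List<String>> batches(List<String> records, int bufferSize) {
        SendBufferSketch sender = new SendBufferSketch(bufferSize);
        for (String record : records) {
            sender.sendToWriter(record);
        }
        sender.finish();
        return sender.pushedBatches;
    }

    public static void main(String[] args) {
        // Five records with a buffer of two are pushed as batches of 2, 2 and 1.
        System.out.println(batches(Arrays.asList("r1", "r2", "r3", "r4", "r5"), 2));
    }
}
```

Batching like this means the lock inside the channel is taken once per batch rather than once per record.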
MemoryChannel's pushAll() mostly delegates to doPushAll(), so let's jump straight to doPushAll():
private ArrayBlockingQueue<Record> queue = null;
private ReentrantLock lock;
private Condition notInsufficient, notEmpty;

protected void doPushAll(Collection<Record> rs) {
    try {
        long startTime = System.nanoTime();
        lock.lockInterruptibly();
        int bytes = getRecordBytes(rs);
        // Check whether the queue has enough room for our list of records. If not, await until the
        // writer consumes data and frees space in the queue, then wakes the reader thread
        while (memoryBytes.get() + bytes > this.byteCapacity || rs.size() > this.queue.remainingCapacity()) {
            notInsufficient.await(200L, TimeUnit.MILLISECONDS);
        }
        this.queue.addAll(rs);
        waitWriterTime += System.nanoTime() - startTime;
        memoryBytes.addAndGet(bytes);
        // The data is now in the queue, so wake the writer thread that may be awaiting because the queue was empty
        notEmpty.signalAll();
    } catch (InterruptedException e) {
        throw DataXException.asDataXException(
                FrameworkErrorCode.RUNTIME_ERROR, e);
    } finally {
        lock.unlock();
    }
}
③ Source code flowchart
See: https://www.processon.com/view/link/5ff18b60e0b34d19e4f8da55
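III. Summary
The cooperation between the reader and the writer is, in essence, a lightweight producer-consumer model built on an ArrayBlockingQueue together with two Conditions (notInsufficient and notEmpty) of a ReentrantLock.
The producer is the reader thread. When its local buffer is full, it tries to push the buffered data to the queue. If the queue does not have enough room, the writer has not finished consuming it yet, so the reader does not push and instead waits on notInsufficient until the writer signals it (or the 200 ms wait times out and it checks again). Once the queue has enough room, it pushes the data and signals notEmpty, telling the writer that data is ready, the queue is not empty, and it can pull.
The consumer is the writer thread. If the queue holds no data, it cannot pull anything into its local buffer for consumption, so it waits on notEmpty until the reader signals it. Once it has pulled data from the queue into its buffer, it signals notInsufficient, telling the reader that the queue has room again and it can keep pushing.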
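The whole model summarized above fits into a short, self-contained sketch. It is a simplified illustration rather than DataX's MemoryChannel: TwoConditionChannel is a hypothetical class, capacity is counted in records only (no byte limit), records are integers, and a negative value serves as an assumed end-of-stream marker. Unlike the source shown earlier, the sketch acquires the lock before entering the try block, so unlock() only runs when the lock is actually held.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified channel: record-count capacity only.
class TwoConditionChannel {
    private final Queue<Integer> queue = new ArrayDeque<>();
    private final int capacity;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notInsufficient = lock.newCondition(); // "queue has room"
    private final Condition notEmpty = lock.newCondition();        // "queue has data"

    TwoConditionChannel(int capacity) {
        this.capacity = capacity;
    }

    // Producer side. The batch must not be larger than the capacity.
    void pushAll(Collection<Integer> batch) {
        lock.lock();
        try {
            while (batch.size() > capacity - queue.size()) {
                notInsufficient.await(200L, TimeUnit.MILLISECONDS);
            }
            queue.addAll(batch);
            notEmpty.signalAll(); // wake a consumer waiting for data
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: waits until at least one record is available.
    void pullAll(Collection<Integer> out, int max) {
        out.clear();
        lock.lock();
        try {
            while (queue.isEmpty()) {
                notEmpty.await(200L, TimeUnit.MILLISECONDS);
            }
            while (!queue.isEmpty() && out.size() < max) {
                out.add(queue.poll());
            }
            notInsufficient.signalAll(); // wake a producer waiting for room
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.unlock();
        }
    }
}

class TwoConditionDemo {
    static List<Integer> run(int total) {
        TwoConditionChannel channel = new TwoConditionChannel(4);
        List<Integer> consumed = new ArrayList<>();
        Thread writer = new Thread(() -> {
            List<Integer> local = new ArrayList<>();
            boolean done = false;
            while (!done) {
                channel.pullAll(local, 4);
                for (int record : local) {
                    if (record < 0) { // assumed end-of-stream marker
                        done = true;
                        break;
                    }
                    consumed.add(record);
                }
            }
        });
        Thread reader = new Thread(() -> {
            List<Integer> local = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                local.add(i);
                if (local.size() == 2) { // local buffer full: flush it
                    channel.pushAll(local);
                    local.clear();
                }
            }
            local.add(-1); // flush the remainder together with the marker
            channel.pushAll(local);
        });
        writer.start();
        reader.start();
        try {
            reader.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(run(10));
    }
}
```

The timed await mirrors the 200 ms wait in the source: even if a signal were missed, each side rechecks its condition shortly afterwards instead of blocking forever.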