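When many language stacks are involved, teams usually split systems apart along network transport protocols; HTTP, RPC, and gRPC are the common choices. These all work well, but they share one problem: you cannot get both low latency and large data volume at the same time. They still fit most scenarios. This article recommends an approach for cases that need both real-time performance and large data volume.
There are only two ways to attack the problem:
Reduce the amount of data transferred
Reduce serialization
Start from how the hardware is actually being used: CPU (spent on compression), memory, and IO (spent on serialization). Network IO is usually the first bottleneck. For LAN transfers, batch as much as possible, for example by producing larger files and transferring those.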
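The solution breaks down into the following steps:
1.Remove the network: deploy the two programs that exchange large volumes of data on the same machine, trading coupling for performance
2.Remove serialization: when, say, Python talks to Java, the in-memory data formats differ between the languages and serialization is normally required; shared memory can be used instead
3.Reduce what is transferred: for files that must be serialized, check whether they can be compressed, and compress whenever compression and decompression remain fast enough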
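The "compress only when it pays off" rule in the last step can be sketched with the JDK's built-in Deflater. This is only an illustration under assumptions: the class name `MaybeCompress` and the one-byte frame marker are made up for this example and are not part of Arrow or of the original project.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical framing: byte 0 says whether the rest is raw or deflate-compressed. */
final class MaybeCompress {
    static final byte RAW = 0;
    static final byte DEFLATED = 1;

    /** Compresses with a fast setting and keeps the result only if it is smaller than the input. */
    static byte[] pack(byte[] payload) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(payload);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(payload.length + 1);
        out.write(DEFLATED);
        byte[] chunk = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        if (out.size() - 1 < payload.length) {
            return out.toByteArray();
        }
        // Compression did not help, so ship the bytes unchanged behind the RAW marker.
        byte[] raw = new byte[payload.length + 1];
        raw[0] = RAW;
        System.arraycopy(payload, 0, raw, 1, payload.length);
        return raw;
    }

    /** Reverses pack(): strips the marker and inflates only when the marker says so. */
    static byte[] unpack(byte[] frame) {
        if (frame[0] == RAW) {
            return Arrays.copyOfRange(frame, 1, frame.length);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(frame, 1, frame.length - 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated frame");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt frame", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With this shape a frame is never more than one byte larger than the raw payload, so data that does not compress well costs almost nothing extra on the wire.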
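Based on these three points, we settled on Apache Arrow as our technical solution. It is a data-sharing solution whose core implementation is in C++, and it covers file-based sharing, memory-based sharing, and RPC-based sharing.
(With memory-based sharing, the file only describes the memory layout and the fields that must be serialized.)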
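To make the memory-sharing idea concrete, here is a minimal sketch that uses only the JDK's memory-mapped files. It is not Arrow's file format: the layout (an 8-byte count followed by 8-byte values) and the class name `MappedLongColumn` are assumptions invented for this illustration. The point is that the reader fetches a value by offset from the mapping instead of deserializing the whole file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical layout: 8-byte little-endian row count, then 8-byte little-endian values. */
final class MappedLongColumn {
    private static final int HEADER_BYTES = 8;

    /** Producer side: map the file and write the values straight into the mapping. */
    static void write(Path file, long[] values) {
        long size = HEADER_BYTES + 8L * values.length;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.order(ByteOrder.LITTLE_ENDIAN);
            buf.putLong(values.length);
            for (long v : values) {
                buf.putLong(v);
            }
            buf.force(); // flush the mapped pages to the file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumer side: map read-only and read one value by offset, without parsing the rest. */
    static long readAt(Path file, int index) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);
            long count = buf.getLong(0);
            if (index < 0 || index >= count) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return buf.getLong(HEADER_BYTES + 8 * index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small demo: write to a temp file, then read one element back through a second mapping. */
    static long demo(long[] values, int index) {
        try {
            Path file = Files.createTempFile("column", ".bin");
            file.toFile().deleteOnExit();
            write(file, values);
            return readAt(file, index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real deployment the writer and reader would be separate processes (for example Python and Java) that agree on the layout, which is the role Arrow's format plays.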
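Arrow's main features:
1.Cross-language, with APIs
Arrow defines its own message header and body, independent of any language
APIs for C/C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, Rust
2.Friendly to numeric data
Values stored next to each other are fast to compute over
3.Fast reads and writes
O(1) random access
SIMD (single instruction, multiple data; a CPU feature) and vectorized computation
No pointer swizzling
Buffers aligned to 8 bytes (64 bits); the format recommends 64-byte alignment
4.High compression ratio for numeric data, and memory can be shared
FlatBuffers serialization (used for the metadata)
Stream-based data access over IPC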
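The O(1) access and alignment points in the list above can be illustrated with a toy fixed-width column. This is a sketch under assumptions, not Arrow's implementation: the class name `Int32Column` is invented here, and the layout simply follows the general idea of a validity bitmap (least-significant bit first) plus a little-endian value buffer, each padded to a multiple of 64 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Toy nullable int32 column: one validity bit and one 4-byte slot per row. */
final class Int32Column {
    private final ByteBuffer validity; // bit i set means row i holds a value
    private final ByteBuffer data;     // 4 bytes per row, little-endian
    private final int length;

    Int32Column(Integer[] values) {
        this.length = values.length;
        this.validity = ByteBuffer.allocateDirect(paddedSize((length + 7) / 8));
        this.data = ByteBuffer.allocateDirect(paddedSize(4 * length))
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < length; i++) {
            if (values[i] != null) {
                int byteIndex = i >> 3;
                validity.put(byteIndex, (byte) (validity.get(byteIndex) | (1 << (i & 7))));
                data.putInt(4 * i, values[i]);
            }
        }
    }

    /** Rounds a byte count up to the next multiple of 64. */
    static int paddedSize(int bytes) {
        return (bytes + 63) & ~63;
    }

    boolean isNull(int index) {
        checkIndex(index);
        return (validity.get(index >> 3) & (1 << (index & 7))) == 0;
    }

    /** O(1): the value's position is computed directly from the row index. */
    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("row " + index + " is null");
        }
        return data.getInt(4 * index);
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

Because every value sits at a fixed offset in one contiguous buffer, a reader in any language can reach row i with a single address computation, and neighbouring values land in the same cache lines.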
Maven dependencies
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
<version>10.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-memory-core</artifactId>
<version>10.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-memory-netty</artifactId>
<version>10.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-format</artifactId>
<version>10.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-dataset</artifactId>
<version>10.0.1</version>
</dependency>
The official file-based example is shown below:
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import java.io.IOException;
String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (
BufferAllocator allocator = new RootAllocator();
DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
Dataset dataset = datasetFactory.finish();
Scanner scanner = dataset.newScan(options);
ArrowReader reader = scanner.scanBatches()
) {
int count = 1;
while (reader.loadNextBatch()) {
try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
System.out.println("Number of rows per batch["+ count++ +"]: " + root.getRowCount());
}
}
} catch (Exception e) {
e.printStackTrace();
}
https://arrow.apache.org/cookbook/java/dataset.html#query-arrow-files
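Because data types are defined differently across languages, values need to be converted (thanks to my colleague for the hard work of writing this utility class).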
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.Float4Vector;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.SmallIntVector;
import org.apache.arrow.vector.TimeStampVector;
import org.apache.arrow.vector.TinyIntVector;
import org.apache.arrow.vector.UInt1Vector;
import org.apache.arrow.vector.UInt2Vector;
import org.apache.arrow.vector.UInt4Vector;
import org.apache.arrow.vector.UInt8Vector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.util.Text;
import java.io.IOException;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
/**
* Apache Arrow utility class
* For .arrow | .feather |
*
* @author rui.chen2
* @date 2023/2/14
*/
@Slf4j
public class ArrowUtils {
/**
* Default scan options; the columns to scan can also be specified with
* Optional.of(new String[]{...})
*/
private static final ScanOptions OPTIONS = new ScanOptions(/*batchSize*/ 32768);
/**
* Java mapping of the C++ memory pool (arrow::MemoryPool)
*/
private static final NativeMemoryPool NATIVE_MEMORY_POOL = NativeMemoryPool.getDefault();
/**
* Root allocator, used to allocate direct memory for Arrow vectors/arrays. It supports creating a tree of child allocators, which makes memory allocations easier to track.
*/
private static final RootAllocator ROOT_ALLOCATOR = new RootAllocator();
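/**
* Data mapping wrapper that manages the objects created while reading data
*
* @author rui.chen2
* @date 2023/2/14
*/
@Data
public static class DataMapper implements AutoCloseable {
/**
* DatasetFactory provides a way to inspect a Dataset's potential schema before materializing it, so users can look at the schemas of the data sources and decide on a unified schema.
*/
private DatasetFactory datasetFactory;
/**
* A container of Fragments, which are the internal iterable units of read data
*/
private Dataset dataset;
/**
* A high-level interface for scanning data over a dataset.
*/
private Scanner scanner;
/**
* Abstract class for reading the Schema and ArrowRecordBatches.
*/
private ArrowReader reader;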
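/**
* Holder for a set of vectors to be loaded/unloaded.
* A VectorSchemaRoot is a container that can hold batches; batches flow through the VectorSchemaRoot as part of a pipeline.
* Note that this differs from other implementations (in C++ and Python, a RecordBatch is a collection of equal-length vector instances and is created anew for each batch).
* The recommended usage is to create a single VectorSchemaRoot based on the known schema and to populate data into that same VectorSchemaRoot over and over in a stream of batches,
* rather than creating a new VectorSchemaRoot instance each time (see Flight or ArrowFileWriter for a better understanding).
* Thus, at any one point, a VectorSchemaRoot may have data
* or may have no data (say it was transferred downstream or not yet populated).
*/
private VectorSchemaRoot vectorSchemaRoot;
/**
* Closes all held resources
*/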
@Override
public void close() throws Exception {
if (vectorSchemaRoot != null) {
vectorSchemaRoot.close();
}
reader.close();
scanner.close();
dataset.close();
datasetFactory.close();
}
}
/**
* Initializes reading with the default scan options
*
* @param uri file:/xx/xxx/xxx.arrow
* @author rui.chen2
* @date 2023/2/14
*/
public static DataMapper readInit(String uri) {
return readInit(uri, OPTIONS);
}
/**
* Initializes reading for the specified column names only
*
* @param uri file:/xx/xxx/xxx.arrow
* @param options scan options; columns can be selected with Optional.of(new String[]{...})
* @return .util.ArrowUtils.DataMapper
* @author rui.chen2
* @date 2023/2/14
*/
public static DataMapper readInit(String uri, ScanOptions options) {
DataMapper mapper = new DataMapper();
mapper.datasetFactory = new FileSystemDatasetFactory(ROOT_ALLOCATOR, NATIVE_MEMORY_POOL, FileFormat.ARROW_IPC, uri);
mapper.dataset = mapper.datasetFactory.finish();
mapper.scanner = mapper.dataset.newScan(options);
mapper.reader = mapper.scanner.scanBatches();
return mapper;
}
/**
* Reads the existing fields
*
* @param mapper the data mapper
* @return java.util.List<org.apache.arrow.vector.types.pojo.Field>
* @author rui.chen2
* @date 2023/2/14
*/
public static List<Field> readFields(DataMapper mapper) throws IOException {
return mapper.reader.getVectorSchemaRoot().getSchema().getFields();
}
/**
* Reads the next batch, closing the previous batch first (if any)
*
* @param mapper the data mapper
* @return org.apache.arrow.vector.VectorSchemaRoot
* @author rui.chen2
* @date 2023/2/14
*/
public static VectorSchemaRoot readNextData(DataMapper mapper) throws IOException {
if (mapper.vectorSchemaRoot != null) {
mapper.vectorSchemaRoot.close();
}
if (mapper.reader.loadNextBatch()) {
mapper.vectorSchemaRoot = mapper.reader.getVectorSchemaRoot();
}
return mapper.vectorSchemaRoot;
}
/**
* Gets the time column
*
* @param vectorSchemaRoot the data
* @return org.apache.arrow.vector.BigIntVector
* @author rui.chen2
* @date 2023/2/15
*/
public static BigIntVector getTimestamps(VectorSchemaRoot vectorSchemaRoot) {
return (BigIntVector) vectorSchemaRoot.getVector("Time");
}
/**
* Gets all the data under the specified fields
*
* @param schemas the data set
* @param fields the field names
* @return java.util.List<java.util.List < ?>>
* @author rui.chen2
* @date 2023/2/15
*/
public static List<List<?>> getList(VectorSchemaRoot schemas, List<String> fields) {
int rowCount = schemas.getRowCount();
List<List<?>> data = new ArrayList<>(fields.size());
for (String field : fields) {
FieldVector vector = schemas.getVector(field);
if (vector == null) {
throw new NullPointerException("No column for field " + field + " exists in the data");
}
if (vector instanceof TinyIntVector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
toInt(list, vector, rowCount);
} else if (vector instanceof SmallIntVector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
toInt(list, vector, rowCount);
} else if (vector instanceof IntVector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
toInt(list, vector, rowCount);
} else if (vector instanceof BigIntVector) {
List<Long> list = new ArrayList<>(rowCount);
data.add(list);
toLong(list, vector, rowCount);
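} else if (vector instanceof UInt1Vector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
// Cast so the UInt1Vector overload is selected; the FieldVector overload would read the byte as signed.
toInt(list, (UInt1Vector) vector, rowCount);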
} else if (vector instanceof UInt2Vector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
toInt(list, vector, rowCount);
} else if (vector instanceof UInt4Vector) {
List<Integer> list = new ArrayList<>(rowCount);
data.add(list);
toInt(list, vector, rowCount);
} else if (vector instanceof UInt8Vector) {
List<Long> list = new ArrayList<>(rowCount);
data.add(list);
toLong(list, vector, rowCount);
} else if (vector instanceof Float4Vector) {
List<Float> list = new ArrayList<>(rowCount);
data.add(list);
toFloat(list, vector, rowCount);
} else if (vector instanceof Float8Vector) {
List<Double> list = new ArrayList<>(rowCount);
data.add(list);
toDouble(list, vector, rowCount);
} else if (vector instanceof BitVector) {
List<Boolean> list = new ArrayList<>(rowCount);
data.add(list);
toBoolean(list, vector, rowCount);
} else if (vector instanceof VarCharVector) {
List<String> list = new ArrayList<>(rowCount);
data.add(list);
toText(list, (VarCharVector) vector, rowCount);
} else if (vector instanceof TimeStampVector) {
List<LocalDateTime> list = new ArrayList<>(rowCount);
data.add(list);
toTimestamp(list, vector, rowCount);
} else {
throw new RuntimeException("Unsupported data type: " + vector.getField().getFieldType().getType());
}
}
return data;
}
public static void writeInit() {
}
public static void toLong(long[] values, UInt8Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = vector.getObjectNoOverflow(i).longValue();
}
}
public static void toLong(Long[] values, UInt8Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = vector.getObjectNoOverflow(i).longValue();
}
}
public static void toLong(List<Long> values, UInt8Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add(vector.getObjectNoOverflow(i).longValue());
}
}
public static void toLong(long[] values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = vector.getObjectNoOverflow(i);
}
}
public static void toLong(Long[] values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = vector.getObjectNoOverflow(i);
}
}
public static void toLong(List<Long> values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add(vector.getObjectNoOverflow(i));
}
}
public static void toLong(long[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (long) object;
}
}
public static void toLong(Long[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (long) object;
}
}
public static void toLong(List<Long> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values.add((long) object);
}
}
public static void toInt(int[] values, UInt1Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = (int) vector.getObjectNoOverflow(i);
}
}
public static void toInt(Integer[] values, UInt1Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = (int) vector.getObjectNoOverflow(i);
}
}
public static void toInt(List<Integer> values, UInt1Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add((int) vector.getObjectNoOverflow(i));
}
}
public static void toInt(int[] values, UInt2Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = (int) vector.get(i) & UInt2Vector.MAX_UINT2;
}
}
public static void toInt(Integer[] values, UInt2Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = (int) vector.get(i) & UInt2Vector.MAX_UINT2;
}
}
public static void toInt(List<Integer> values, UInt2Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add((int) vector.get(i) & UInt2Vector.MAX_UINT2);
}
}
public static void toInt(int[] values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = vector.getObjectNoOverflow(i).intValue();
}
}
public static void toInt(Integer[] values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
// Same conversion as the int[] overload; masking with MAX_UINT2 would drop the upper 16 bits.
values[i] = vector.getObjectNoOverflow(i).intValue();
}
}
public static void toInt(List<Integer> values, UInt4Vector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add(vector.getObjectNoOverflow(i).intValue());
}
}
public static void toInt(int[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
if (object instanceof Character) {
object = (int) ((Character) object);
} else if (object instanceof Byte) {
object = ((Byte) object).intValue();
} else if (object instanceof Short) {
object = ((Short) object).intValue();
}
values[i] = (int) object;
}
}
public static void toInt(Integer[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
if (object instanceof Character) {
object = (int) ((Character) object);
} else if (object instanceof Byte) {
object = ((Byte) object).intValue();
} else if (object instanceof Short) {
object = ((Short) object).intValue();
}
values[i] = (int) object;
}
}
public static void toInt(List<Integer> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
if (object instanceof Character) {
object = (int) ((Character) object);
} else if (object instanceof Byte) {
object = ((Byte) object).intValue();
} else if (object instanceof Short) {
object = ((Short) object).intValue();
}
values.add((int) object);
}
}
public static void toBoolean(boolean[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (boolean) object;
}
}
public static void toBoolean(Boolean[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (boolean) object;
}
}
public static void toBoolean(List<Boolean> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values.add((boolean) object);
}
}
public static void toFloat(float[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (float) object;
}
}
public static void toFloat(Float[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (float) object;
}
}
public static void toFloat(List<Float> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values.add((float) object);
}
}
public static void toDouble(double[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (double) object;
}
}
public static void toDouble(Double[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values[i] = (double) object;
}
}
public static void toDouble(List<Double> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Object object = vector.getObject(i);
values.add((double) object);
}
}
public static void toText(String[] values, VarCharVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Text text = vector.getObject(i);
// Text.toString() decodes the UTF-8 bytes; null slots stay null.
values[i] = text == null ? null : text.toString();
}
}
public static void toText(List<String> values, VarCharVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
Text text = vector.getObject(i);
values.add(text == null ? null : text.toString());
}
}
private static void toTimestamp(LocalDateTime[] values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values[i] = (LocalDateTime) vector.getObject(i);
}
}
private static void toTimestamp(List<LocalDateTime> values, FieldVector vector, int rowCount) {
for (int i = 0; i < rowCount; i++) {
values.add((LocalDateTime) vector.getObject(i));
}
}
}
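Much of the utility class above exists because Java has no unsigned primitive types, so unsigned Arrow columns have to be widened into the next larger Java type. The sketch below shows that widening using only JDK methods; the class name `UnsignedWidening` and its method names are made up for this illustration and are not part of Arrow.

```java
import java.math.BigInteger;

/** Widens raw two's-complement bits into a Java type large enough for the unsigned value. */
final class UnsignedWidening {
    /** uint8 stored in a byte, returned as an int in 0..255. */
    static int u8(byte raw) {
        return Byte.toUnsignedInt(raw);
    }

    /** uint16 stored in a short, returned as an int in 0..65535. */
    static int u16(short raw) {
        return Short.toUnsignedInt(raw);
    }

    /** uint32 stored in an int, returned as a long in 0..4294967295. */
    static long u32(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    /** uint64 stored in a long; no primitive is wide enough, so a BigInteger is returned. */
    static BigInteger u64(long raw) {
        return new BigInteger(Long.toUnsignedString(raw));
    }
}
```

Reading an unsigned value through the signed Java type of the same width keeps the bits but changes the meaning of the top bit, which is why the conversion has to happen at the language boundary.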
If you want memory-based communication instead, see:
https://arrow.apache.org/docs/java/ipc.html#
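Memory-based communication still relies on a mapped file: the two sides interact through a channel and read the data by memory address. Because Arrow defines one unified storage layout for its data types, most of the serialization work goes away.
// Core write code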
try (
ByteArrayOutputStream out = new ByteArrayOutputStream();
ArrowStreamWriter writer = new ArrowStreamWriter(root, /*DictionaryProvider=*/null, Channels.newChannel(out));
) {
// ... do write into the ArrowStreamWriter
writer.start();
// write the first batch
writer.writeBatch();
// write another four batches.
for (int i = 0; i < 4; i++) {
// populate VectorSchemaRoot data and write the second batch
BitVector childVector1 = (BitVector)root.getVector(0);
VarCharVector childVector2 = (VarCharVector)root.getVector(1);
childVector1.reset();
childVector2.reset();
// ... do some populate work here, could be different for each batch
writer.writeBatch();
}
writer.end();
}
// Core read code
try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
reader.loadNextBatch();
VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
// get the encoded vector
IntVector intVector = (IntVector) readRoot.getVector(0);
// get dictionaries and decode the vector
Map<Long, Dictionary> dictionaryMap = reader.getDictionaryVectors();
long dictionaryId = intVector.getField().getDictionary().getId();
try (VarCharVector varCharVector =
(VarCharVector) DictionaryEncoder.decode(intVector, dictionaryMap.get(dictionaryId))) {
// ... use decoded vector
}
}
Only the Java side is shown here. In our case the other side is Python pandas, where the code is very simple:
DataFrame.to_feather(path, **kwargs)