A First Look at TigerGraph's Core Features

Latest recommended article published on 2024-09-19 12:42:27

MeAndJack

Category: Graph Databases. Tags: TigerGraph, graph database, distributed computing, Graph

Article link: https://blog.csdn.net/temotemo/article/details/83382687

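This article introduces the core features of the commercial graph database TigerGraph, including its native distributed graph storage, compact storage with fast access, parallel processing and shared values, and its MPP computation model. TigerGraph is written in C++, has its own graph query language, GSQL, and supports efficient graph analytics. Its automatic partitioning and distributed computation model are meant to keep performance up when processing large-scale data.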

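This post gives a brief introduction to a commercial product that bills itself as a "third-generation" graph database and claims to support both OLAP and OLTP workloads. The vendor has coined a new term for it, NPG (Native Parallel Graph). (It feels like marketing copy inventing new vocabulary... o(╯□╰)o)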

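Because TigerGraph is not open source, we can only learn about its core design from the officially published material. The parts in blue are my own thoughts.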

A Native Distributed Graph

Its data store holds nodes, links, and their attributes. Some graph database products on the market are really wrappers built on top of a more generic NoSQL data store. This virtual graph strategy has a double penalty when it comes to performance.

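The graph database engine stores nodes, links, and their attributes natively. In other words, the data is stored the same way it is modeled at the upper layer. This differs from the design of some other graph databases, which build a graph model on top of a NoSQL storage engine. TigerGraph calls that approach a virtual graph and says it comes with a performance penalty. (Titan, don't run away, I'm talking about you.)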

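Neo4j is also a native graph and uses an index-free (adjacency) design. It seems there is no shortcut: for a graph database engine to run fast, it has to minimize disk I/O and network I/O, and native storage plus memory is the most natural way to get there, unless the industry sees a breakthrough in storage technology.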
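To make the native vs. virtual-graph distinction a bit more concrete, here is a minimal Python sketch. It is not TigerGraph's or Titan's actual storage layout, which is not described in the quoted material. The class names `NativeGraph` and `VirtualGraph` and the key scheme are hypothetical, made up for illustration. The sketch only counts how many store lookups a one-hop neighbor traversal costs under each assumed design.

```python
class NativeGraph:
    """Hypothetical native layout: a vertex maps directly to its neighbor list."""

    def __init__(self):
        self.adjacency = {}  # vertex id -> list of neighbor ids
        self.lookups = 0

    def add_edge(self, src, dst):
        self.adjacency.setdefault(src, []).append(dst)

    def neighbors(self, src):
        self.lookups += 1  # one access returns the whole adjacency list
        return list(self.adjacency.get(src, []))


class VirtualGraph:
    """Hypothetical graph layered on a generic key-value store (assumed key scheme)."""

    def __init__(self):
        self.kv = {}  # stands in for a generic NoSQL store
        self.lookups = 0

    def _get(self, key):
        self.lookups += 1  # every read goes through the generic store
        return self.kv.get(key)

    def add_edge(self, src, dst):
        edge_ids = self.kv.setdefault(("out", src), [])
        edge_id = ("edge", src, len(edge_ids))
        edge_ids.append(edge_id)
        self.kv[edge_id] = dst

    def neighbors(self, src):
        edge_ids = self._get(("out", src)) or []  # first lookup: the edge keys
        return [self._get(e) for e in edge_ids]   # one more lookup per edge


native, virtual = NativeGraph(), VirtualGraph()
for g in (native, virtual):
    for dst in ("b", "c", "d"):
        g.add_edge("a", dst)

print(native.neighbors("a"), native.lookups)
print(virtual.neighbors("a"), virtual.lookups)
```

In this toy model the layered design pays one extra store lookup per edge. A real system would batch or cache such reads, so the numbers are only meant to show where the extra indirection comes from.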

Compact Storage with Fast Access

Internally hash indices are used to reference nodes and links. In Big-O terms, our average access time is O(1) and our average index update time is also O(1).

Users can set parameters that specify how much of the available memory may be used for holding the graph. If the full graph does not fit in memory, then the excess is stored on disk. Best performance is achieved when the full graph fits in memory, of course.

Data values are stored in encoded formats that effectively compress the data. The compression factor varies with the graph structure and data, but typical compression factors are between 2x and 10x. Compression has two advantages: First, a larger amount of graph data can fit in memory and in CPU cache. Such compression reduces not only the memory footprint, but also CPU cache misses, speeding up overall query performance. Second, for users with very large graphs, hardware costs are reduced.

In general, decompression is needed only for displaying the data. When values are used internally, often they may remain encoded and compressed.

Internally, the engine uses hash indices to reference nodes and links. Average access time is O(1), and average index update time is also O(1).
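As a rough illustration of why a hash index gives O(1) average access and update, here is a small Python sketch. It is an assumption-laden toy, not TigerGraph's internal structure: `VertexStore`, its fields, and the slot layout are hypothetical names chosen for this example, and Python's built-in `dict` stands in for the hash index.

```python
class VertexStore:
    """Toy store: a hash index maps external ids to slots in a dense array."""

    def __init__(self):
        self.index = {}  # hash index: external vertex id -> internal slot
        self.slots = []  # dense array of attribute records

    def upsert(self, vertex_id, attrs):
        slot = self.index.get(vertex_id)  # O(1) average hash probe
        if slot is None:
            # New vertex: O(1) average index update plus an append.
            self.index[vertex_id] = len(self.slots)
            self.slots.append(dict(attrs))
        else:
            # Existing vertex: merge attributes in place.
            self.slots[slot].update(attrs)

    def get(self, vertex_id):
        slot = self.index.get(vertex_id)  # O(1) average, independent of graph size
        return None if slot is None else self.slots[slot]


store = VertexStore()
store.upsert("user:1", {"name": "alice"})
store.upsert("user:1", {"age": 30})
print(store.get("user:1"))
```

The point of the sketch is only that neither lookup nor update scans the data: both cost one hash probe on average, no matter how many vertices are stored.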

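In short, users can configure how much memory may be used to hold the graph. If the graph does not fit in memory, the excess is stored on disk. Performance is best when the whole graph fits in memory (that goes without saying, since there is no disk I/O overhead).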

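Typically, the compression (encoding) factor is between 2x and 10x, which brings two advantages:

1. More of a potentially large graph fits in memory and in CPU cache after compression, which reduces CPU cache misses and improves query performance.

2. The smaller memory footprint lowers hardware costs for users with very large graphs.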

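In general, data only needs to be decompressed when it is displayed (for example, nodes or edges with many attributes). When values are used internally, they can often stay encoded and compressed.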
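The official material does not say which encoding TigerGraph uses. As one possible way to picture "values stay encoded when used internally", here is a sketch based on dictionary encoding, a technique I am substituting for illustration. The class `DictionaryColumn` and its methods are hypothetical.

```python
class DictionaryColumn:
    """Toy dictionary encoding: each distinct value is stored once,
    and every row holds only a small integer code."""

    def __init__(self):
        self.code_of = {}  # value -> integer code
        self.values = []   # integer code -> value
        self.codes = []    # one code per row

    def append(self, value):
        code = self.code_of.get(value)
        if code is None:
            code = len(self.values)
            self.code_of[value] = code
            self.values.append(value)
        self.codes.append(code)

    def count_equal(self, value):
        """Count matching rows by comparing integer codes; no row is decoded."""
        code = self.code_of.get(value)
        if code is None:
            return 0
        return sum(1 for c in self.codes if c == code)

    def decode(self, row):
        """Decoding is only needed when a value has to be shown."""
        return self.values[self.codes[row]]


city = DictionaryColumn()
for name in ["Beijing", "Shanghai", "Beijing", "Beijing"]:
    city.append(name)

print(city.count_equal("Beijing"))  # filter runs on the integer codes
print(city.decode(1))               # decode only for display
```

The filter touches only small integers, which is also friendlier to the CPU cache than comparing full strings. Whether TigerGraph works this way is an assumption on my part.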

This set of features is very interesting: the strategy is to improve performance through data compression and engineering.

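The compression part is particularly interesting. From Shannon's paper "A Mathematical Theory of Communication" we know that how far data can be compressed depends on its distribution, so I doubt that a 2x to 10x compression factor can be achieved for every kind of dataset.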
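A quick way to see this dependence on the data distribution is to measure it. The sketch below is my own illustration and has nothing to do with TigerGraph's encoder: it computes the empirical byte-level Shannon entropy of two made-up inputs and compresses them with Python's standard `zlib`. The helper names `entropy_bits_per_byte` and `compression_factor` are hypothetical. Byte-frequency entropy is only a zeroth-order estimate (zlib also exploits repeated runs), but the contrast is clear enough.

```python
import math
import os
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy H = -sum(p * log2(p)) over byte frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_factor(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


skewed = b"A" * 9000 + b"B" * 1000  # highly repetitive, low entropy
random_like = os.urandom(10000)     # close to 8 bits of entropy per byte

for label, data in (("skewed", skewed), ("random", random_like)):
    print(label, entropy_bits_per_byte(data), compression_factor(data))
```

The skewed input compresses by far more than 10x, while the random input barely compresses at all, so a quoted range such as 2x to 10x can only describe typical datasets, not all of them.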