Katana's High-Performance Graph Analytics Library

Latest recommended article published on 2024-06-19 19:00:00

寒冰屋

Original article: https://www.codeproject.com/Articles/5317383/Katana-s-High-Performance-Graph-Analytics-Library

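According to Gartner, Inc., graph processing is one of the top 10 data and analytics trends for 2021. It is an emerging application area and a necessary tool for data scientists working with connected datasets (for example, social, telecommunications, and financial networks; web traffic; and biochemical pathways). Real-world graphs tend to be large and keep growing. Today's social networks, for instance, can have billions of nodes and edges, so high-performance parallel computing is essential.

To that end, Katana Graph, in collaboration with Intel, has designed a high-performance, easy-to-use Python library for graph analytics with (a) highly optimized parallel implementations of important graph analytics algorithms; (b) a high-level Python interface for writing custom parallel algorithms on top of the underlying C++ graph engine; (c) interoperability with pandas, scikit-learn, and Apache Arrow, as well as tools and libraries in the Intel AI software stack; (d) comprehensive support for extract, transform, and load (ETL) from various formats; and (e) a Metagraph plugin.

This article covers what is in the library, how to get it, usage examples, and benchmark data that highlight its performance.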

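Graph Analytics Algorithms in the Library

The key algorithms commonly used in graph processing pipelines come prepackaged in the Katana library. The algorithms currently available are:

Breadth-first search: Returns a directed tree constructed by a breadth-first search starting from a source node
Single-source shortest paths: Computes the shortest paths from a source node to all other nodes
Connected components: Finds the components (that is, groups of nodes) of the graph that are connected internally but not connected to other components
PageRank: Computes a ranking of the nodes in the graph based on the structure of incoming links
Betweenness centrality: Computes the centrality of the nodes in the graph based on the number of shortest paths that pass through each node
Triangle counting: Counts the number of triangles in the graph
Louvain community detection: Computes the communities of the graph that maximize modularity, using the Louvain heuristic
Subgraph extraction: Extracts an induced subgraph of the graph
Jaccard similarity: Computes the Jaccard coefficient of a given node with every other node in the graph
Community detection using label propagation: Computes the communities in the graph using the label propagation algorithm
Local clustering coefficient: Measures the degree to which nodes in the graph tend to cluster together
K-Truss: Finds the maximal induced subgraph of the graph, containing at least three vertices, in which every edge is incident to at least K-2 triangles
K-Core: Finds the maximal subgraph that contains nodes of degree K or more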
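
To make a few of the definitions in the list above concrete, here is a small pure-Python sketch of connected components, triangle counting, K-Core, and the Jaccard coefficient on an undirected graph. It does not use Katana and is not how the library implements these algorithms; the function names and the adjacency-set representation are assumptions chosen for illustration only.

```python
from collections import deque


def build_undirected(edges):
    """Build adjacency sets for an undirected simple graph from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops in this sketch
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connected_components(adj):
    """Return the components as sorted node lists, found by repeated BFS."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(sorted(component))
    return components


def triangle_count(adj):
    """Count each triangle exactly once by requiring u < v < w."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


def k_core(adj, k):
    """Peel nodes of degree < k until none remain; return the surviving nodes."""
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    alive = set(adj)
    stack = [node for node in alive if degree[node] < k]
    while stack:
        node = stack.pop()
        if node not in alive:
            continue  # already peeled via an earlier stack entry
        alive.remove(node)
        for nbr in adj[node]:
            if nbr in alive:
                degree[nbr] -= 1
                if degree[nbr] < k:
                    stack.append(nbr)
    return alive


def jaccard(adj, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# A triangle 0-1-2 with a tail node 3, plus a separate edge 4-5.
adj = build_undirected([(0, 1), (1, 2), (2, 0), (2, 3), (4, 5)])
print(connected_components(adj))  # [[0, 1, 2, 3], [4, 5]]
print(triangle_count(adj))        # 1
print(sorted(k_core(adj, 2)))     # [0, 1, 2]
```

These sequential versions are only meant to pin down what each algorithm returns; the library's value lies in running the same computations in parallel on graphs far too large for code like this.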

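More algorithms are being added to the library, and users can easily add their own, as we demonstrate below.

Getting the Katana Graph Library

Katana Graph's analytics library is open source and freely available under the 3-Clause BSD license. It can be found on GitHub and is easily installed from Anaconda.org: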

$ conda install -c katanagraph/label/dev -c conda-forge katana-python

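Using the Katana Graph Library

Katana's Python library supports ETL from a variety of formats, such as adjacency matrices, pandas DataFrames, NumPy arrays, edge lists, GraphML, and NetworkX. A few examples are shown below: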

import numpy as np
import pandas
from katana.local import Graph
from katana.local.import_data import (
     from_adjacency_matrix,
     from_edge_list_arrays,
     from_edge_list_dataframe,
     from_edge_list_matrix,
     from_graphml)

Input from an Adjacency Matrix

katana_graph = from_adjacency_matrix(
                    np.array([[0, 1, 0], [0, 0, 2], [3, 0, 0]]))
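
As a rough guide to what the matrix above encodes, the following pure-Python sketch lists the edges of an adjacency matrix under the usual convention that row i, column j holds the weight of edge i -> j and zero means no edge. That this is exactly how from_adjacency_matrix interprets its argument is an assumption; the helper name is hypothetical and is not part of Katana.

```python
def adjacency_matrix_to_edges(matrix):
    """Return (source, destination, weight) triples for the nonzero entries.

    Assumes matrix[i][j] is the weight of edge i -> j and 0 means "no edge".
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("adjacency matrix must be square")
    return [
        (src, dst, weight)
        for src, row in enumerate(matrix)
        for dst, weight in enumerate(row)
        if weight != 0
    ]


edges = adjacency_matrix_to_edges([[0, 1, 0], [0, 0, 2], [3, 0, 0]])
print(edges)  # [(0, 1, 1), (1, 2, 2), (2, 0, 3)]
```

Under that convention the example matrix describes a directed three-node cycle with edge weights 1, 2, and 3.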

Input from an Edge List

katana_graph = from_edge_list_arrays(
                    np.array([0, 1, 10]), np.array([1, 2, 0]),
                    prop = np.array([1, 2, 3]))

Input from a Pandas DataFrame

katana_graph = from_edge_list_dataframe(
                    pandas.DataFrame(dict(source=[0, 1, 10],
                                          destination=[1, 2, 0],
                                     prop = [1, 2, 3])))
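
Both edge-list forms above describe the graph as parallel source, destination, and property columns. The sketch below shows one common way such columns can be packed into a compressed sparse row (CSR) layout so that the out-edges of a node can be looked up directly. It is a generic illustration, not a description of Katana's internals: the function names are hypothetical, and taking the node count to be the largest node ID plus one is an assumption.

```python
def edge_list_to_csr(sources, destinations, props):
    """Pack parallel edge columns into CSR arrays (offsets, dests, edge_props)."""
    if not (len(sources) == len(destinations) == len(props)):
        raise ValueError("edge columns must have the same length")
    # Assumption: node IDs are dense integers, so the count is max ID + 1.
    num_nodes = max(max(sources, default=-1), max(destinations, default=-1)) + 1

    # Count the out-degree of every node, then prefix-sum into offsets.
    offsets = [0] * (num_nodes + 1)
    for src in sources:
        offsets[src + 1] += 1
    for node in range(num_nodes):
        offsets[node + 1] += offsets[node]

    # Place each edge into the next free slot of its source node.
    cursor = offsets[:-1].copy()
    dests = [0] * len(sources)
    edge_props = [0] * len(sources)
    for src, dst, prop in zip(sources, destinations, props):
        slot = cursor[src]
        dests[slot] = dst
        edge_props[slot] = prop
        cursor[src] += 1
    return offsets, dests, edge_props


def out_edges(offsets, dests, node):
    """Destinations of the edges leaving the given node."""
    return dests[offsets[node]:offsets[node + 1]]


# The same columns as the edge-list and DataFrame examples above.
offsets, dests, edge_props = edge_list_to_csr([0, 1, 10], [1, 2, 0], [1, 2, 3])
print(len(offsets) - 1)               # 11 nodes, because the largest ID is 10
print(out_edges(offsets, dests, 10))  # [0]
```

Note that the node ID 10 in the example columns implies eleven nodes under this assumption, most of which have no edges.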

Input from GraphML

katana_graph = from_graphml(input_file)

Running Graph Analytics Algorithms

The following example computes the betweenness centrality of the input graph:

import katana.local
from katana.example_utils import get_input
from katana.property_graph import PropertyGraph
# Parentheses are required for an import list that spans several lines.
from katana.analytics import (betweenness_centrality,
                              BetweennessCentralityPlan,
                              BetweennessCentralityStatistics)
katana.local.initialize()

property_name = "betweenness_centrality"
betweenness_centrality(katana_graph, property_name, 16,
                       BetweennessCentralityPlan.outer())
# Compute summary statistics on the same graph that was just analyzed.
stats = BetweennessCentralityStatistics(katana_graph, property_name)

print("Min Centrality:", stats.min_centrality)
print("Max Centrality:", stats.max_centrality)
print("Average Centrality:", stats.average_centrality)
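
For readers who want to see what the centrality values mean, here is a sequential pure-Python sketch of Brandes's algorithm, a standard method for betweenness centrality on unweighted graphs, followed by the same three summary statistics. It is not Katana's implementation, it does not model the numeric argument or the plan object passed in the call above, and the function name and adjacency-list input are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality_brandes(adj):
    """Unnormalized betweenness for an unweighted directed graph.

    adj maps every node to a list of its out-neighbors.
    """
    centrality = {node: 0.0 for node in adj}
    for source in adj:
        # Forward phase: BFS that counts shortest paths (sigma) from source.
        sigma = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        preds = {node: [] for node in adj}
        sigma[source], dist[source] = 1, 0
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality


# A directed path 0 -> 1 -> 2 -> 3: only the two middle nodes lie on
# shortest paths between other nodes.
scores = betweenness_centrality_brandes({0: [1], 1: [2], 2: [3], 3: []})
values = list(scores.values())
print("Min Centrality:", min(values))                        # 0.0
print("Max Centrality:", max(values))                        # 2.0
print("Average Centrality:", sum(values) / len(values))      # 1.0
```

On the four-node path, nodes 1 and 2 each sit on two shortest paths between other pairs, which is why the maximum is 2.0 and the endpoints score 0.0.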

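Katana's Python library is interoperable with pandas, scikit-learn, and Apache Arrow.

In addition to the prepackaged routines listed earlier, data scientists can write their own graph algorithms using a simple Python interface that exposes Katana Graph's optimized C++ engine along with its concurrent data structures and parallel loop constructs. The Katana Graph library already contains a breadth-first search implementation, but the following example illustrates how easy it is to implement such an algorithm using the API: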

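# Note: this listing relies on InsertBag, do_all, do_all_operator, and
# INFINITY being imported or defined beforehand; those lines are not shown.
def bfs(graph: Graph, source):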
    """
    Compute the BFS distance to all nodes from source.

    The algorithm runs in bulk-synchronous fashion, level by level.

    :param graph: The input graph.
    :param source: The source node for the traversal.
    :return: An array of distances, indexed by node ID.
    """
    next_level_number = 0

    # The work lists for the current and next levels using a 
    # Katana concurrent data structure.
    curr_level_worklist = InsertBag[np.uint32]()
    next_level_worklist = InsertBag[np.uint32]()

    # Create and initialize the distance array.
    # source is 0, everywhere else is INFINITY
    distance = np.empty((len(graph),), dtype=np.uint32)
    distance[:] = INFINITY
    distance[source] = 0

    # Start processing with just the source node.
    next_level_worklist.push(source)
    
    # Execute until the worklist is empty.
    while not next_level_worklist.empty():
        # Swap the current and next work lists
        curr_level_worklist, next_level_worklist = (
            next_level_worklist, curr_level_worklist)

        # Clear the worklist for the next level.
        next_level_worklist.clear()
        next_level_number += 1

        # Process the current worklist in parallel by applying
        # bfs_operator for each element of the worklist.
        do_all(
            curr_level_worklist,
            # The call here binds the initial arguments of bfs_operator.
            bfs_operator(graph, next_level_worklist,
                                next_level_number, distance)
        )

    return distance

# This function is marked as a Katana operator, meaning that it will
# be compiled to native code and prepared for use with Katana do_all.
@do_all_operator()
def bfs_operator(graph: Graph, next_level_worklist,
                               next_level_number, distance, node_id):
    """
    The operator called for each node in the work list.

    The initial four arguments are provided by bfs above.
    node_id is taken from the worklist and passed to this
    function by do_all.

    :param next_level_worklist: The work list to add next nodes to.
    :param next_level_number: The level to assign to nodes we find.
    :param distance: The distance array to fill with data.
    :param node_id: The node we are processing.
    :return:
    """
    # Iterate over the out edges of our node
    for edge_id in graph.edges(node_id):
        # Get the destination of the edge
        dst = graph.get_edge_dest(edge_id)
        
        # If the destination has not yet been reached, set its level
        # and add it to the work list so its out edges can be processed
        # in the next level.
        if distance[dst] == INFINITY:
            distance[dst] = next_level_number
            next_level_worklist.push(dst)
        # There is a race here, but it's safe. If multiple calls to
        # operator add the same destination, they will all set the
        # same level. It will create more work because the node will
        # be processed more than once in the next level, but it avoids
        # atomic operations so it can still be a win in low-degree graphs.
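
To make the expected result of the listing above concrete, here is a sequential pure-Python sketch of the same level-by-level traversal. It does not use Katana: plain lists stand in for InsertBag, an ordinary loop stands in for do_all, and the INFINITY value, the function name, and the out_edges adjacency-list input are all assumptions made for this illustration.

```python
INFINITY = 2**32 - 1  # assumed sentinel: the largest value a uint32 can hold


def bfs_levels(out_edges, source):
    """Level-by-level BFS; out_edges[n] lists the destinations of node n."""
    distance = [INFINITY] * len(out_edges)
    distance[source] = 0

    next_level_worklist = [source]
    next_level_number = 0
    while next_level_worklist:
        # Swap: the nodes found in the last round become the current level.
        curr_level_worklist, next_level_worklist = next_level_worklist, []
        next_level_number += 1

        # In the Katana version, do_all runs this loop body in parallel.
        for node_id in curr_level_worklist:
            for dst in out_edges[node_id]:
                if distance[dst] == INFINITY:
                    distance[dst] = next_level_number
                    next_level_worklist.append(dst)
    return distance


# Node 3 is reachable through both node 1 and node 2; node 5 is unreachable.
graph = [[1, 2], [3], [3], [4], [], [0]]
distances = bfs_levels(graph, 0)
print(distances[:5])              # [0, 1, 1, 2, 3]
print(distances[5] == INFINITY)   # True
```

Because this version is sequential, node 3 is added to the next worklist only once. In the parallel listing, two operators could both observe node 3 as unvisited and both push it, which is the harmless duplication that the closing comment in the listing describes.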

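Metagraph Support

Katana Graph's Python analytics library will be available through a Metagraph plugin. Metagraph provides a consistent Python entry point for graph analytics. A graph workflow can be written against a standard API and then dispatched to compatible graph libraries that plug into Metagraph. The open-source graph community will now be able to use Katana Graph's high-performance applications directly. The Metagraph plugin is included in the Anaconda package and can be installed and invoked as follows: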

$ conda create -n metagraph-test -c conda-forge \
                                 -c katanagraph/label/dev \
                                 -c metagraph metagraph-katana

import metagraph as mg
bfs = mg.algos.traversal.bfs_iter(katana_graph, <start node>)

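How Fast Is the Katana Graph Library?

The Katana library has been benchmarked extensively against other graph analytics frameworks and has consistently shown performance comparable to or better than the GAP Benchmark Suite reference implementations. Table 1 shows the performance of Katana Graph relative to the GAP reference implementations on a variety of graphs from different domains.

Table 1. Katana Graph performance measured with the GAP Benchmark Suite. The data are taken from Azad et al. (2020) [2]. System: dual-socket 2.0 GHz Intel® Xeon® Platinum 8153 processors (64 logical cores) and 384 GB of DDR4 memory. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

The Katana Graph library has also been shown to perform well on very large graphs, such as Clueweb12 and WDC12 (with 42 billion and 128 billion edges, respectively, these are among the largest publicly available graphs), on recent byte-addressable memory technologies such as Intel® Optane™ DC persistent memory (Figure 1).

Figure 1. Katana Graph BFS performance on very large graphs. It compares the performance of a single node based on Intel Optane memory with that of a cluster of multiple nodes. Each TACC Stampede Skylake cluster node has two 2.1 GHz Intel Xeon Platinum 8160 processors and 192 GB of DDR4 memory. The Cascade Lake server has two 2.2 GHz 2nd Generation Intel Xeon Scalable processors with 6 TB of Intel Optane PMM and 384 GB of DDR4 DRAM. The Ice Lake server has two 2.2 GHz Intel Xeon Platinum 8352Y processors with 8 TB of Intel Optane PMM and 1 TB of DDR4 DRAM. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.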

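Where Can I Learn More?

I hope you are convinced that the Katana Graph library is a versatile, high-performance option for graph analytics. You can learn more about the library, ask questions, post feature requests, and so on at the GitHub site.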

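References

1. Nguyen, D., Lenharth, A., and Pingali, K. (2013). A lightweight infrastructure for graph analytics. Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13).
2. Azad, A. et al. (2020). Evaluation of graph analytics frameworks using the GAP Benchmark Suite. IEEE International Symposium on Workload Characterization (IISWC).
3. The ClueWeb12 Dataset
4. Web Data Commons – Hyperlink Graphs
5. Gill, G., Dathathri, R., Hoang, L., Peri, R., and Pingali, K. (2020). Single machine graph analytics on massive datasets using Intel Optane DC persistent memory. Proceedings of the VLDB Endowment, 13(8), 1304–1318.
6. Dathathri, R. et al. (2019). Gluon-Async: A bulk-asynchronous system for distributed and heterogeneous graph analytics. Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT).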