Graph Queries and Computation with PySpark GraphFrames
Basic GraphFrames Operations
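GraphFrames is a graph library built on top of Spark DataFrames. It inherits the scalability and performance of DataFrames and exposes a uniform graph-processing API for Scala, Java, and Python. GraphX is built on the RDD API and has no Python API, whereas GraphFrames is built on DataFrames and does support Python.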
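Creating a Graph
Creating a graph is straightforward: pass a vertex DataFrame and an edge DataFrame to GraphFrame. The vertex DataFrame needs an id column, and the edge DataFrame needs src and dst columns whose values refer to vertex ids.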
from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphframes import GraphFrame
# Start a local Spark context and wrap it in a SQLContext to create DataFrames
sc = SparkContext("local", appName="mysqltest")
sqlContext = SQLContext(sc)
# Vertex DataFrame: one row per vertex, keyed by the "id" column
vertices = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
# Edge DataFrame: one row per directed edge from "src" to "dst"
edges = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
], ["src", "dst", "relationship"])
# Build the graph from the vertex and edge DataFrames
g = GraphFrame(vertices, edges)
print(g)
# GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
print(type(g))
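Before handing data to Spark, it can help to sanity-check the vertex and edge tables. The following is a minimal pure-Python sketch (no Spark required) that reuses the sample data above. The helper name check_graph_inputs and the specific checks (unique ids, no dangling edge endpoints) are assumptions chosen for illustration, not something GraphFrames is known to enforce. It also counts incoming edges per vertex, which is the kind of result a graph's in-degree query would report.

```python
from collections import Counter

# Same sample data as the Spark example, as plain Python tuples.
vertex_columns = ["id", "name", "age"]
vertex_rows = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]
edge_columns = ["src", "dst", "relationship"]
edge_rows = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]


def check_graph_inputs(v_cols, v_rows, e_cols, e_rows):
    """Hypothetical helper: return a list of problems; an empty list means OK."""
    problems = []
    # Column-name conventions expected for the vertex and edge tables.
    if "id" not in v_cols:
        problems.append("vertex columns must include 'id'")
    for col in ("src", "dst"):
        if col not in e_cols:
            problems.append(f"edge columns must include '{col}'")
    if problems:
        return problems

    # Extra sanity checks (an assumption of this sketch): ids are unique
    # and every edge endpoint has a matching vertex row.
    ids = [row[v_cols.index("id")] for row in v_rows]
    if len(ids) != len(set(ids)):
        problems.append("vertex ids must be unique")
    known = set(ids)
    src_idx, dst_idx = e_cols.index("src"), e_cols.index("dst")
    for row in e_rows:
        for endpoint in (row[src_idx], row[dst_idx]):
            if endpoint not in known:
                problems.append(f"edge endpoint {endpoint!r} has no vertex row")
    return problems


problems = check_graph_inputs(vertex_columns, vertex_rows, edge_columns, edge_rows)
# Count incoming edges per vertex from the "dst" column.
in_degree = Counter(row[edge_columns.index("dst")] for row in edge_rows)

print(problems)
print(dict(in_degree))
```

An empty problems list means the tables follow the id / src / dst convention. Vertices with no incoming edges, such as "g" here, do not appear in the in-degree counts at all.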