GraphX is built on the RDD API and has no Python API; GraphFrames is built on DataFrames and does support Python.
“GraphFrames is a DataFrame-based external Spark package that provides performance optimizations and also additional functionalities such as motif finding.”
“While the GraphX framework is based on the RDD API, GraphFrames is an external Spark package built on top of the DataFrames API. It inherits the performance advantages of DataFrames using the catalyst optimizer. It can be used in the Java, Scala, and Python programming languages. GraphFrames provides additional functionalities over GraphX such as motif finding, DataFrame-based serialization, and graph queries. GraphX does not provide the Python API, but GraphFrames exposes the Python API as well.”
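To get started with GraphFrames, work through the following steps:
1. Add the jar: graphframes-0.5.0-spark2.1-s_2.11.jar
There are several ways to do this.
1) $SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 (--packages takes Maven coordinates; to use a downloaded jar file, pass --jars graphframes-0.5.0-spark2.1-s_2.11.jar instead)
2) Copy the jar directly into the $SPARK_HOME/jars directory
3) Add the jar to the project in IntelliJ IDEA
Options 1) and 2) are for spark-shell; option 3) is for IDEA.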
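For option 3), the dependency can also be declared in the build instead of adding the jar by hand. The following build.sbt is a minimal sketch, assuming an sbt project on Scala 2.11 and Spark 2.1. The resolver URL and the exact Scala and Spark patch versions are assumptions to check against your own environment.

```scala
// build.sbt (sketch; versions are assumptions chosen to match graphframes-0.5.0-spark2.1-s_2.11)
scalaVersion := "2.11.8"

// This GraphFrames release was distributed as a Spark package (assumed resolver URL).
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"    % "2.1.0",
  "org.apache.spark" %% "spark-graphx" % "2.1.0", // GraphFrames builds on GraphX for some algorithms
  "graphframes"      %  "graphframes"  % "0.5.0-spark2.1-s_2.11"
)
```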
2. Sample code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.graphframes._
object Graphs {
def main(args: Array[String]): Unit = {
// Keep unnecessary log output off the terminal
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
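val spark = SparkSession
.builder()
.appName("Graphs")
// No master is set here: pass one with spark-submit --master, or add .master("local[*]") when running from the IDE
.getOrCreate()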
val vertex = spark.createDataFrame(List(
("1","Jacob",48),
("2","Jessica",45),
("3","Andrew",25),
("4","Ryan",53),
("5","Emily",22),
("6","Lily",52)
)).toDF("id", "name", "age")
vertex.show()
// +---+-------+---+
// | id| name|age|
// +---+-------+---+
// | 1| Jacob| 48|
// | 2|Jessica| 45|
// | 3| Andrew| 25|
// | 4| Ryan| 53|
// | 5| Emily| 22|
// | 6| Lily| 52|
// +---+-------+---+
val edges = spark.createDataFrame(List(
("6","1","Sister"),
("1","2","Husband"),
("2","1","Wife"),
("5","1","Daughter"),
("5","2","Daughter"),
("3","1","Son"),
("3","2","Son"),
("4","1","Friend"),
("1","5","Father"),
("1","3","Father"),
("2","5","Mother"),
("2","3","Mother")
)).toDF("src", "dst", "relationship")
edges.show()
// +---+---+------------+
// |src|dst|relationship|
// +---+---+------------+
// | 6| 1| Sister|
// | 1| 2| Husband|
// | 2| 1| Wife|
// | 5| 1| Daughter|
// | 5| 2| Daughter|
// | 3| 1| Son|
// | 3| 2| Son|
// | 4| 1| Friend|
// | 1| 5| Father|
// | 1| 3| Father|
// | 2| 5| Mother|
// | 2| 3| Mother|
// +---+---+------------+
val graph = GraphFrame(vertex, edges)
graph.vertices.show()
graph.edges.show()
graph.vertices.groupBy().min("age").show()
// +--------+
// |min(age)|
// +--------+
// | 22|
// +--------+
// Motif finding
val motifs = graph.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()
// +--------------+--------------+--------------+--------------+
// | a| e| b| e2|
// +--------------+--------------+--------------+--------------+
// | [1,Jacob,48]| [1,2,Husband]|[2,Jessica,45]| [2,1,Wife]|
// |[2,Jessica,45]| [2,1,Wife]| [1,Jacob,48]| [1,2,Husband]|
// | [5,Emily,22]|[5,1,Daughter]| [1,Jacob,48]| [1,5,Father]|
// | [5,Emily,22]|[5,2,Daughter]|[2,Jessica,45]| [2,5,Mother]|
// | [3,Andrew,25]| [3,1,Son]| [1,Jacob,48]| [1,3,Father]|
// | [3,Andrew,25]| [3,2,Son]|[2,Jessica,45]| [2,3,Mother]|
// | [1,Jacob,48]| [1,5,Father]| [5,Emily,22]|[5,1,Daughter]|
// | [1,Jacob,48]| [1,3,Father]| [3,Andrew,25]| [3,1,Son]|
// |[2,Jessica,45]| [2,5,Mother]| [5,Emily,22]|[5,2,Daughter]|
// |[2,Jessica,45]| [2,3,Mother]| [3,Andrew,25]| [3,2,Son]|
// +--------------+--------------+--------------+--------------+
// filter results
motifs.filter("b.age > 30").show()
// +--------------+--------------+--------------+-------------+
// | a| e| b| e2|
// +--------------+--------------+--------------+-------------+
// | [1,Jacob,48]| [1,2,Husband]|[2,Jessica,45]| [2,1,Wife]|
// |[2,Jessica,45]| [2,1,Wife]| [1,Jacob,48]|[1,2,Husband]|
// | [5,Emily,22]|[5,1,Daughter]| [1,Jacob,48]| [1,5,Father]|
// | [5,Emily,22]|[5,2,Daughter]|[2,Jessica,45]| [2,5,Mother]|
// | [3,Andrew,25]| [3,1,Son]| [1,Jacob,48]| [1,3,Father]|
// | [3,Andrew,25]| [3,2,Son]|[2,Jessica,45]| [2,3,Mother]|
// +--------------+--------------+--------------+-------------+
//3.Loading and saving GraphFrames
graph.vertices.write.parquet("file:///Users/sws/IdeaProjects/JavaScala/src/main/scala/Data/vertices")
graph.edges.write.parquet("file:///Users/sws/IdeaProjects/JavaScala/src/main/scala/Data/edges")
val verticesDF = spark.read.parquet("file:///Users/sws/IdeaProjects/JavaScala/src/main/scala/Data/vertices")
val edgesDF = spark.read.parquet("file:///Users/sws/IdeaProjects/JavaScala/src/main/scala/Data/edges")
val sameGraph = GraphFrame(verticesDF, edgesDF)
}
}
Key points:
1)import org.graphframes._
2)Creating the graph: val graph = GraphFrame(vertex, edges)
3)
vertex.show() // an ordinary DataFrame
graph.vertices.show() // the vertices DataFrame held by the GraphFrame
4)graph.find("(a)-[e]->(b); (b)-[e2]->(a)")
GraphFrame motif finding uses a DataFrame-based DSL to search for structural patterns.
This pattern searches for pairs of vertices a and b that are connected by edges in both directions. It returns a DataFrame of all such structures in the graph, with one column for each named element (vertex or edge) in the motif.
See the Neo4j link for reference.
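To make the meaning of the pattern concrete, here is a plain-Scala sketch of what "(a)-[e]->(b); (b)-[e2]->(a)" selects, using the same edge list as the sample code. It only illustrates the semantics and is not how GraphFrames evaluates motifs; the names Edge, MotifSketch, and bidirectional are hypothetical.

```scala
// Illustration only: GraphFrames itself evaluates motifs on DataFrames, not like this.
final case class Edge(src: String, dst: String, relationship: String)

object MotifSketch {
  // Same edges as in the sample code.
  val edges: List[Edge] = List(
    Edge("6", "1", "Sister"), Edge("1", "2", "Husband"), Edge("2", "1", "Wife"),
    Edge("5", "1", "Daughter"), Edge("5", "2", "Daughter"), Edge("3", "1", "Son"),
    Edge("3", "2", "Son"), Edge("4", "1", "Friend"), Edge("1", "5", "Father"),
    Edge("1", "3", "Father"), Edge("2", "5", "Mother"), Edge("2", "3", "Mother")
  )

  // For every edge e = a -> b, pair it with each edge e2 = b -> a.
  def bidirectional(es: List[Edge]): List[(Edge, Edge)] =
    for {
      e  <- es
      e2 <- es
      if e.dst == e2.src && e2.dst == e.src
    } yield (e, e2)

  def main(args: Array[String]): Unit =
    bidirectional(edges).foreach { case (e, e2) =>
      println(s"${e.src} -[${e.relationship}]-> ${e.dst}; ${e2.src} -[${e2.relationship}]-> ${e2.dst}")
    }
}
```

The Sister (6 to 1) and Friend (4 to 1) edges have no edge going back, so they are dropped. That leaves ten pairs, which agrees with the ten rows printed by motifs.show() in the sample code.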
5)Query example:
motifs.filter("b.age > 30").show()
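Because the motif result is an ordinary DataFrame, the same filter can also be written with Column expressions and combined with conditions on edge attributes. The sketch below assumes Spark and GraphFrames are on the classpath and uses a reduced copy of the family graph; MotifFilterSketch, sonsWithParentOver30, and the local[*] master are illustrative choices, not part of the original program.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

object MotifFilterSketch {
  // Builds a reduced family graph and returns the filtered motif rows.
  def sonsWithParentOver30(spark: SparkSession): DataFrame = {
    val vertices = spark.createDataFrame(List(
      ("1", "Jacob", 48), ("2", "Jessica", 45), ("3", "Andrew", 25)
    )).toDF("id", "name", "age")
    val edges = spark.createDataFrame(List(
      ("1", "2", "Husband"), ("2", "1", "Wife"),
      ("3", "1", "Son"), ("1", "3", "Father"),
      ("3", "2", "Son"), ("2", "3", "Mother")
    )).toDF("src", "dst", "relationship")

    val motifs = GraphFrame(vertices, edges).find("(a)-[e]->(b); (b)-[e2]->(a)")

    // Equivalent in spirit to motifs.filter("b.age > 30"), plus a condition on edge e.
    motifs
      .filter(col("b.age") > 30 && col("e.relationship") === "Son")
      .select(col("a.name").as("son"), col("b.name").as("parent"))
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running outside a cluster.
    val spark = SparkSession.builder().appName("MotifFilterSketch").master("local[*]").getOrCreate()
    sonsWithParentOver30(spark).show()
    spark.stop()
  }
}
```

Each motif column (a, b, e, e2) is a struct, so nested fields such as b.age and e.relationship can be referenced with dotted names in either filter style.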
6)Reading and writing files:
Since GraphFrames are built on top of DataFrames, they inherit all DataFrame-supported DataSources. You can write GraphFrames to the Parquet, JSON, and CSV formats. The location can be local (file:///…), on HDFS, and so on.
graph.vertices.write.parquet("vertices") // on HDFS
spark.read.parquet("vertices")
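The save and load steps can be wrapped into a small round trip. The sketch below assumes Spark and GraphFrames are on the classpath; GraphRoundTripSketch, sampleGraph, roundTrip, the temporary directory, and the local[*] master are illustrative choices. It sets mode("overwrite") because the default save mode fails when the target path already exists, which would otherwise happen on a second run.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphRoundTripSketch {
  // A two-vertex graph, just large enough to exercise the round trip.
  def sampleGraph(spark: SparkSession): GraphFrame = {
    val vertices = spark.createDataFrame(List(
      ("1", "Jacob", 48), ("2", "Jessica", 45)
    )).toDF("id", "name", "age")
    val edges = spark.createDataFrame(List(
      ("1", "2", "Husband"), ("2", "1", "Wife")
    )).toDF("src", "dst", "relationship")
    GraphFrame(vertices, edges)
  }

  // Writes vertices and edges as Parquet under dir, then rebuilds the graph from disk.
  def roundTrip(spark: SparkSession, graph: GraphFrame, dir: String): GraphFrame = {
    graph.vertices.write.mode("overwrite").parquet(s"$dir/vertices")
    graph.edges.write.mode("overwrite").parquet(s"$dir/edges")
    GraphFrame(spark.read.parquet(s"$dir/vertices"), spark.read.parquet(s"$dir/edges"))
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the temporary directory are assumptions for a local run.
    val spark = SparkSession.builder().appName("GraphRoundTripSketch").master("local[*]").getOrCreate()
    val dir = Files.createTempDirectory("graphframes-sketch").toString
    val sameGraph = roundTrip(spark, sampleGraph(spark), dir)
    sameGraph.vertices.show()
    sameGraph.edges.show()
    spark.stop()
  }
}
```

Swapping parquet for json changes the on-disk format, provided the same format is used for both the write and the read.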