Level 1: Bulk Data Import: Loading Large Datasets with Spark SQL
Task description
A craftsman must sharpen his tools before he can do good work; in big-data analysis it is essential first to master the data-import tools. Spark SQL is Spark's built-in module for structured data processing; in this level you will use Spark SQL's data-import statements to load text data. The file graphx-wiki-vertices.txt contains web pages and their ids, and graphx-wiki-edges.txt contains web pages and the ids of the pages they link to.
import org.apache.spark.sql._

object SparkSQLHive {
  def main(args: Array[String]): Unit = {
    // SparkSession manages its own SparkContext; no separate one is needed
    val spark = SparkSession.builder.master("local").appName("tester").enableHiveSupport().getOrCreate()
    spark.sql("use default")
    import spark.implicits._
    // drop the tables if they already exist
    spark.sql("DROP TABLE IF EXISTS vertices")
    spark.sql("DROP TABLE IF EXISTS edges")
    // create the vertices table
    spark.sql("CREATE TABLE IF NOT EXISTS vertices(ID BigInt, Title String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'")
    // load the vertex data from the local file system
    spark.sql("LOAD DATA LOCAL INPATH 'file:///root/graphx-wiki-vertices.txt' INTO TABLE vertices")
    //***************begin***************//
    println("begin to create table in databases")
    // create the edges table; the column names SrcId/DstId are illustrative
    spark.sql("CREATE TABLE IF NOT EXISTS edges(SrcId BigInt, DstId BigInt) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'")
    //***********end***********//
    //***************begin***************//
    println("begin to load data in text file")
    // load the edge data, mirroring the vertex load above
    spark.sql("LOAD DATA LOCAL INPATH 'file:///root/graphx-wiki-edges.txt' INTO TABLE edges")
    //***********end***********//
    println("success")
  }
}
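The DDL above declares tab-separated fields and newline-terminated rows; a minimal pure-Scala sketch of the per-line layout the LOAD assumes (the object and function names, and the sample id and title, are my own, not taken from the real file):

```scala
object VertexLineSketch {
  // split one tab-delimited line into (id, title), mirroring the
  // FIELDS TERMINATED BY '\t' layout declared for the vertices table
  def parseVertex(line: String): (Long, String) = {
    val Array(id, title) = line.split("\t", 2)
    (id.toLong, title)
  }

  def main(args: Array[String]): Unit = {
    // hypothetical line in the graphx-wiki-vertices.txt layout
    println(parseVertex("101\tHypothetical Page Title"))
  }
}
```

The split limit of 2 keeps any further tab inside the title, so only the first tab acts as the field separator, which matches how the two-column table is declared.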
Level 2: Consulting the Ledger: Querying Big Data with Spark
Task description
In the previous level we imported the web-page data into Spark SQL; in this level you will again use a Spark SQL statement to query the vertices table and return the first 5 rows of page data.
import org.apache.spark.sql._

object SparkSQLHive2 {
  def main(args: Array[String]): Unit = {
    // SparkSession manages its own SparkContext; no separate one is needed
    val spark = SparkSession.builder.master("local").appName("tester").enableHiveSupport().getOrCreate()
    // choose the database
    spark.sql("use default")
    import spark.implicits._
    spark.sql("DROP TABLE IF EXISTS vertices")
    // create the vertices table
    spark.sql("CREATE TABLE IF NOT EXISTS vertices(ID BigInt, Title String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'")
    // load the vertex data from the local file system
    spark.sql("LOAD DATA LOCAL INPATH 'file:///root/graphx-wiki-vertices.txt' INTO TABLE vertices")
    //***********begin***********//
    // query the first 5 rows from the vertices table
    val res1 = spark.sql("SELECT * FROM vertices LIMIT 5")
    //***********end***********//
    res1.collect().foreach(println)
  }
}
Level 3: Gold from the Dross: Ranking Pages with PageRank
Task description
In this level you will learn the basic principle of the PageRank algorithm and use it to compute and print the probability that each of the four pages A, B, C and D is visited. PageRank models a random surfer: with probability d (0.8 in the code below) the surfer follows one of the current page's links, otherwise jumps to a random page, so each iteration updates rank(p) = (1 - d)/N + d * sum of rank(q)/outDegree(q) over the pages q that link to p.
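Before wiring this into Spark, the same ten iterations can be run in plain Scala; this is a minimal sketch of the computation the job below performs (the object and function names are my own):

```scala
object PageRankSketch {
  // one pure-Scala PageRank run: damping d, iters iterations;
  // links maps each page to the pages it links to
  def pageRank(links: Map[String, List[String]], d: Double, iters: Int): Map[String, Double] = {
    val n = links.size
    // every page starts with rank 1/N
    var ranks = links.map { case (page, _) => page -> 1.0 / n }
    for (_ <- 0 until iters) {
      // each page sends rank/outDegree to every page it links to
      val contribs = links.toSeq
        .flatMap { case (page, out) => out.map(dest => dest -> ranks(page) / out.size) }
        .groupBy(_._1)
        .map { case (p, cs) => p -> cs.map(_._2).sum }
      // random-jump term plus damped incoming contributions
      ranks = ranks.map { case (p, _) => p -> ((1 - d) / n + d * contribs.getOrElse(p, 0.0)) }
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val links = Map(
      "A" -> List("B", "C"),
      "B" -> List("A", "D"),
      "C" -> List("A"),
      "D" -> List("A", "B", "C"))
    pageRank(links, d = 0.8, iters = 10).toSeq.sortBy(_._1)
      .foreach { case (p, r) => println(f"$p: $r%.4f") }
  }
}
```

Because every page here has at least one outgoing link, the ranks stay a probability distribution (they sum to 1), and A, which is linked to by all three other pages, ends up with the highest score.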
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PageRank {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("PageRank").setMaster("local")
    val sc = new SparkContext(conf)
    // adjacency list of the four pages: each page and the pages it links to
    val links = sc.parallelize(List(
      ("A", List("B", "C")),
      ("B", List("A", "D")),
      ("C", List("A")),
      ("D", List("A", "B", "C"))
    )).partitionBy(new HashPartitioner(10))
      .persist()
    // every page starts with rank 1/N = 0.25
    var ranks = links.mapValues(v => 0.25)
    // PageRank iterations start here
    //***********begin***********//
    for (i <- 0 until 10) {
      // each page sends rank/outDegree to every page it links to
      val contributions = links.join(ranks).flatMap {
        case (pageId, (outLinks, rank)) =>
          outLinks.map(link => (link, rank / outLinks.size))
      }
      //***********end***********//
      //***********begin***********//
      // rank = (1 - d)/N + d * sum(contributions), with d = 0.8 and N = 4
      ranks = contributions
        .reduceByKey((x, y) => x + y)
        .mapValues(v => 0.2 * 0.25 + 0.8 * v)
    }
    //***********end***********//
    // print the final ranks
    ranks.collect().foreach(println)
  }
}