Spark learning in practice:
First create a Scala project: install the Scala plugin in IDEA, and after restarting the IDE you can create Scala projects.
Maven configuration:
<properties>
    <scala.version>2.11.8</scala.version>
    <sparkSql.version>2.1.0</sparkSql.version>
</properties>

<!-- Scala -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<!-- Spark SQL -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${sparkSql.version}</version>
</dependency>
For packaging the Scala program, see https://www.cnblogs.com/xxbbtt/p/8143593.html; a jar built with the plain Maven package goal will throw errors when run.
After packaging, upload the jar to the Linux server; installing the rz command (the lrzsz package) on Linux makes uploading files easier.
Run the Spark program with the following command:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
- Parameter explanations from the official docs:
  --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap "key=value" in quotes (as shown).
  application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  application-arguments: Arguments passed to the main method of your main class, if any (a small sketch follows this list)
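The application-arguments end up as the args array of your Scala main method; that is how the input path reaches the examples further down. A minimal sketch (the object name ArgsEcho and the example invocation are illustrative, not from the original notes):

package com.scala

/**
 * Sketch: the [application-arguments] passed to spark-submit arrive in main's args array,
 * e.g. ./bin/spark-submit --class com.scala.ArgsEcho --master local[2] app.jar /path/to/people.json
 */
object ArgsEcho {
  def main(args: Array[String]): Unit = {
    // In the invocation above, args(0) would be "/path/to/people.json"
    args.zipWithIndex.foreach { case (arg, i) => println(s"args($i) = $arg") }
  }
}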
First Spark programs:
Using SQLContext (Spark 1.x):
package com.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

/**
 * Reads a JSON file through SQLContext, the Spark 1.x entry point for Spark SQL.
 */
object SQLContextTest {
  def main(args: Array[String]): Unit = {
    val path = args(0)
    // Create the SQLContext
    val sparkConf = new SparkConf()
    //sparkConf.setSparkHome("192.168.2.130/home/lc/spark-2.4.0-bin-hadoop2.6").setAppName("SQLContextTest").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    // Processing
    val people = sqlContext.read.format("json").load(path)
    people.printSchema() // the keys of the JSON records
    people.show()        // display the resulting table
    // Release resources
    sc.stop()
  }
}
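The DataFrame returned by sqlContext.read supports the usual transformations beyond show(); the lines below could be appended inside main() above. The name and age columns are assumptions about the JSON file, not something the original notes specify:

    // Continuation of SQLContextTest.main(); the "name"/"age" columns are illustrative only
    people.select("name").show()             // project a single column
    people.filter(people("age") > 20).show() // keep rows matching a Column predicate
    people.registerTempTable("people")       // 1.x-style registration for SQL queries
    sqlContext.sql("SELECT name FROM people WHERE age > 20").show()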
Using HiveContext to work with Hive:
package com.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
 * Uses HiveContext to query Hive tables.
 */
object HiveContextApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sparkContext = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sparkContext)
    hiveContext.table("TBLS").show()
    // Release resources
    sparkContext.stop()
  }
}
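Besides table(), the same HiveContext can run HiveQL directly through its sql() method. A minimal sketch (the table name emp is illustrative, not from the original notes); note that compiling against HiveContext requires the spark-hive dependency, and the job needs a reachable Hive metastore (typically via hive-site.xml in Spark's conf directory):

    // Run HiveQL through the HiveContext created above; "emp" is an illustrative table name
    hiveContext.sql("SELECT * FROM emp LIMIT 10").show()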
Using SparkSession (the unified entry point in Spark 2.x):
package com.scala

import org.apache.spark.sql.SparkSession

/**
 * Uses SparkSession, the unified entry point introduced in Spark 2.0.
 */
object SparkSessionApp {
  def main(args: Array[String]): Unit = {
    val path = args(0)
    val sparkSession = SparkSession
      .builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .getOrCreate()
    // Read the data
    val people = sparkSession.read.json(path)
    // Display the data
    people.show()
    // Release resources
    sparkSession.close()
  }
}
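In Spark 2.x, SparkSession also replaces HiveContext: enabling Hive support on the builder gives the same access to Hive tables. A minimal sketch under the same assumptions as the HiveContext example above (spark-hive dependency and a reachable metastore):

package com.scala

import org.apache.spark.sql.SparkSession

/**
 * Sketch: SparkSession with Hive support, the Spark 2.x replacement for HiveContext.
 */
object SparkSessionHiveApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSessionHiveApp")
      .enableHiveSupport() // takes the place of new HiveContext(sc)
      .getOrCreate()
    spark.table("TBLS").show() // same query as the HiveContext example
    spark.close()
  }
}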