Data processing with spark-shell

Code:

[WBQ@westgisB068 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/18 10:26:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://westgisB068:4040
Spark context available as 'sc' (master = local[*], app id = local-1681784767038).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
         
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val dfraw = spark.read.format("csv").option("header",value="true").option("encoding","utf-8").load("file:///home/WBQ/soft/data/traffic/part-*")
dfraw: org.apache.spark.sql.DataFrame = [卡号: string, 交易时间: string ... 2 more fields]

scala> dfraw.show(5)
+-------+--------------------+--------+--------+
|   卡号|            交易时间|线路站点|交易类型|
+-------+--------------------+--------+--------+
|3697647|2018-10-01T18:47:...|    大新|地铁出站|
|3697647|2018-10-01T18:35:...|宝安中心|地铁入站|
|3697647|2018-10-01T13:49:...|    大新|地铁入站|
|3697647|2018-10-01T14:03:...|宝安中心|地铁出站|
|5344820|2018-10-17T09:34:...|    罗湖|地铁入站|
+-------+--------------------+--------+--------+
only showing top 5 rows
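
The four raw columns are the card ID (卡号), transaction time (交易时间), line/station (线路站点) and transaction type (交易类型). Because the CSV is loaded with only the header option and no schema (inferSchema is off by default), every column arrives as a string, which a quick check on the same dfraw confirms:

  dfraw.printSchema()   // all four fields are reported as string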


scala> val schemas = Seq("cardid","captime","rawstation","trans_type")
schemas: Seq[String] = List(cardid, captime, rawstation, trans_type)

scala> val df01 = dfraw.toDF(schemas:_*)
df01: org.apache.spark.sql.DataFrame = [cardid: string, captime: string ... 2 more fields]

scala> df01.show(5)
+-------+--------------------+----------+----------+
| cardid|             captime|rawstation|trans_type|
+-------+--------------------+----------+----------+
|3697647|2018-10-01T18:47:...|      大新|  地铁出站|
|3697647|2018-10-01T18:35:...|  宝安中心|  地铁入站|
|3697647|2018-10-01T13:49:...|      大新|  地铁入站|
|3697647|2018-10-01T14:03:...|  宝安中心|  地铁出站|
|5344820|2018-10-17T09:34:...|      罗湖|  地铁入站|
+-------+--------------------+----------+----------+
only showing top 5 rows
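
toDF(schemas:_*) renames the columns purely by position. An equivalent, order-independent sketch using withColumnRenamed on the same dfraw, assuming the Chinese headers shown above:

  val df01Alt = dfraw
    .withColumnRenamed("卡号", "cardid")
    .withColumnRenamed("交易时间", "captime")
    .withColumnRenamed("线路站点", "rawstation")
    .withColumnRenamed("交易类型", "trans_type")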


scala> val df02 = df01.filter(col("trans_type").contains("地铁入站")||col("trans_type").contains("地铁出站"))
df02: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, captime: string ... 2 more fields]

scala> df02.show(5)
+-------+--------------------+----------+----------+
| cardid|             captime|rawstation|trans_type|
+-------+--------------------+----------+----------+
|3697647|2018-10-01T18:47:...|      大新|  地铁出站|
|3697647|2018-10-01T18:35:...|  宝安中心|  地铁入站|
|3697647|2018-10-01T13:49:...|      大新|  地铁入站|
|3697647|2018-10-01T14:03:...|  宝安中心|  地铁出站|
|5344820|2018-10-17T09:34:...|      罗湖|  地铁入站|
+-------+--------------------+----------+----------+
only showing top 5 rows
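
col and the other column functions work here without an explicit import because spark-shell pre-imports org.apache.spark.sql.functions._. Since only the two exact metro labels are wanted, isin is a slightly tighter alternative to contains; a sketch against the same df01, assuming trans_type carries exactly these two values for metro records:

  val df02Alt = df01.filter(col("trans_type").isin("地铁入站", "地铁出站"))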


scala> val df03 = df02.select("cardid","captime").where("rawstation is not null").dropDuplicates("cardid","captime")
df03: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, captime: string]

scala> val df03A = df03.where("captime like '2018-10-%'")
df03A: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, captime: string]


scala> def replaceStationName(station:String):String={
     |   var dststation = station
     |   if(!station.endsWith("站"))
     |      dststation = station+"站"
     |   else
     |      dststation = dststation
     |  
     |   if (dststation.equals("马鞍山站"))
     |      dststation = "马安山站"
     |  else if (dststation.equals("深圳大学站"))
     |      dststation = "深大站"
     |  else if (dststation.equals("?Ⅰ站"))
     |      dststation = "子岭站"
     |  else{}
     |  dststation
     | }
replaceStationName: (station: String)String


scala> val addCol_replaceStation = udf(replaceStationName _)
addCol_replaceStation: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4066/783363478@28caba4b,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)


scala> df03.show(5)
+-------+--------------------+                                                  
| cardid|             captime|
+-------+--------------------+
|0000164|2018-10-06T20:53:...|
|0001185|2018-10-05T17:13:...|
|0002117|2018-10-19T07:51:...|
|0002495|2018-10-11T22:11:...|
|0003781|2018-10-19T18:45:...|
+-------+--------------------+
only showing top 5 rows


scala> val df03 = df02.select("cardid","captime","rawstation","trans_type").where("rawstation is not null").dropDuplicates("cardid","captime","rawstation","trans_type")
df03: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, captime: string ... 2 more fields]

scala> val df03A = df03.where("captime like '2018-10-%'")
df03A: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, captime: string ... 2 more fields]

scala> def replaceStationName(station:String):String={
     |   var dststation = station
     |   if(!station.endsWith("站"))
     |      dststation = station+"站"
     |   else
     |      dststation = dststation
     | 
     |   if (dststation.equals("马鞍山站"))
     |      dststation = "马安山站"
     |  else if (dststation.equals("深圳大学站"))
     |      dststation = "深大站"
     |  else if (dststation.equals("?Ⅰ站"))
     |      dststation = "子岭站"
     |  else{}
     |  dststation
     | }
replaceStationName: (station: String)String

scala> val addCol_replaceStation = udf(replaceStationName _)
addCol_replaceStation: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4552/1555689875@313cdbb1,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)

scala> val df04 = df03.withColumn("station",addCol_replaceStation(df03("rawstation")))
df04: org.apache.spark.sql.DataFrame = [cardid: string, captime: string ... 3 more fields]
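
The same normalization can also be written with built-in column functions instead of a UDF, which keeps the logic visible to the optimizer. A minimal sketch of the legible rules (append 站 when it is missing, plus the 马鞍山 and 深圳大学 renames; the garbled third mapping is left out because its source name is unreadable), assuming the rawstation values look like the sample rows above:

  val df04Alt = df03.withColumn("station",
    when(col("rawstation") === "马鞍山", lit("马安山站"))
      .when(col("rawstation") === "深圳大学", lit("深大站"))
      .when(col("rawstation").endsWith("站"), col("rawstation"))
      .otherwise(concat(col("rawstation"), lit("站"))))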

scala> def replacegjTime(captime:String):String = 
     | {
     |    val glString = captime
     |    val gjsdf = new SimpleDateFormat("yyyy-MM-ddTHH-mm-ss");
     |    val dstsdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
     |    val dt= gjsdf.prase(gjString)
     |    val longtime = dt.getTime()
     |    val dstString = dstsdf.format(longtime);
     |    dstString
     | }
<console>:26: error: not found: type SimpleDateFormat
          val gjsdf = new SimpleDateFormat("yyyy-MM-ddTHH-mm-ss");
                          ^
<console>:27: error: not found: type SimpleDateFormat
          val dstsdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                           ^

scala> import java.text.SimpleDateFormat
import java.text.SimpleDateFormat

scala> def replacegjTime(captime:String):String = 
     | {
     |    val glString = captime
     |    val gjsdf = new SimpleDateFormat("yyyy-MM-ddTHH-mm-ss");
     |    val dstsdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
     |    val dt= gjsdf.prase(gjString)
     |    val longtime = dt.getTime()
     |    val dstString = dstsdf.format(longtime);
     |    dstString
     | }
<console>:29: error: value prase is not a member of java.text.SimpleDateFormat
          val dt= gjsdf.prase(gjString)
                        ^
<console>:29: error: not found: value gjString
          val dt= gjsdf.prase(gjString)
                              ^

scala> import java.time.*
<console>:24: error: object * is not a member of package java.time
       import java.time.*
              ^

scala> def replacegjTime(captime:String):String = 
     | {
     |    val glString = captime
     |    val gjsdf = new SimpleDateFormat("yyyy-MM-ddTHH-mm-ss");
     |    val dstsdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
     |    val dt= gjsdf.parse(glString)
     |    val longtime = dt.getTime()
     |    val dstString = dstsdf.format(longtime);
     |    dstString
     | }
replacegjTime: (captime: String)String

scala> val addCol_replacegjTime = udf(replacegjTime _)
addCol_replacegjTime: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4586/1958400275@f8958ec,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)

scala> var df05R = df05.withColumn("time",addCol_replacegjTime(df04("captime"))).select("cardid","time","")
<console>:26: error: not found: value df05
       var df05R = df05.withColumn("time",addCol_replacegjTime(df04("captime"))).select("cardid","time","")
                   ^

scala> df04.show(5)
+-------+--------------------+----------+----------+----------+                 
| cardid|             captime|rawstation|trans_type|   station|
+-------+--------------------+----------+----------+----------+
|0000001|2018-10-31T19:58:...|      龙华|  地铁出站|    龙华站|
|0000003|2018-10-25T18:12:...|      西丽|  地铁入站|    西丽站|
|0000011|2018-10-09T19:31:...|      赤尾|  地铁出站|    赤尾站|
|0000011|2018-10-15T19:06:...|      赤尾|  地铁出站|    赤尾站|
|0000011|2018-10-16T18:17:...|  福田口岸|  地铁入站|福田口岸站|
+-------+--------------------+----------+----------+----------+
only showing top 5 rows


scala> var df05R = df04.withColumn("time",addCol_replacegjTime(df04("captime"))).select("cardid","captime","rawstation","trans_type","station")
df05R: org.apache.spark.sql.DataFrame = [cardid: string, captime: string ... 3 more fields]

scala> df05R.show(5)
+-------+--------------------+----------+----------+----------+                 
| cardid|             captime|rawstation|trans_type|   station|
+-------+--------------------+----------+----------+----------+
|0000001|2018-10-31T19:58:...|      龙华|  地铁出站|    龙华站|
|0000003|2018-10-25T18:12:...|      西丽|  地铁入站|    西丽站|
|0000011|2018-10-09T19:31:...|      赤尾|  地铁出站|    赤尾站|
|0000011|2018-10-15T19:06:...|      赤尾|  地铁出站|    赤尾站|
|0000011|2018-10-16T18:17:...|  福田口岸|  地铁入站|福田口岸站|
+-------+--------------------+----------+----------+----------+
only showing top 5 rows


scala> var df05R = df04.withColumn("time",addCol_replacegjTime(df04("captime"))).select("cardid","time","rawstation","trans_type","station")

df05R: org.apache.spark.sql.DataFrame = [cardid: string, time: string ... 3 more fields]
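
Two caveats on this version: the pattern yyyy-MM-ddTHH-mm-ss compiles here but would fail as soon as the UDF actually executes, because the literal T is not quoted and the time fields in the data are colon-separated; the second session below uses the corrected yyyy-MM-dd'T'HH:mm:ss. The conversion also does not need a UDF at all; a sketch with Spark's built-in timestamp functions on df04, assuming captime strings like 2018-10-01T18:47:44:

  val df05Builtin = df04
    .withColumn("time",
      date_format(to_timestamp(col("captime"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))
    .select("cardid", "time", "rawstation", "trans_type", "station")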

scala>

Generating trajectories and stations

[WBQ@westgisB068 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/25 10:41:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://westgisB068:4040
Spark context available as 'sc' (master = local[*], app id = local-1682390468167).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
         
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val dfraw = spark.read.format("csv").option("header",value="true").option("encoding","utf-8").load("file:///home/WBQ/soft/data/traffic/part-*")
dfraw: org.apache.spark.sql.DataFrame = [卡号: string, 交易时间: string ... 2 more fields]

scala> dfraw.show(5)
+-------+--------------------+--------+--------+
|   卡号|            交易时间|线路站点|交易类型|
+-------+--------------------+--------+--------+
|3697647|2018-10-01T18:47:...|    大新|地铁出站|
|3697647|2018-10-01T18:35:...|宝安中心|地铁入站|
|3697647|2018-10-01T13:49:...|    大新|地铁入站|
|3697647|2018-10-01T14:03:...|宝安中心|地铁出站|
|5344820|2018-10-17T09:34:...|    罗湖|地铁入站|
+-------+--------------------+--------+--------+
only showing top 5 rows


scala> val schemes = Seq("cardid","captime","rawstation","trans_type")
schemes: Seq[String] = List(cardid, captime, rawstation, trans_type)

scala> val df01 = dfraw.toDF(schemes:_*)
df01: org.apache.spark.sql.DataFrame = [cardid: string, captime: string ... 2 more fields]

scala> import java.text.SimpleDateFormat
import java.text.SimpleDateFormat

scala> var separator = "<>"
separator: String = <>

scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

scala> val df02 = df01.select(df01("cardid"),concat_ws(separator,df01("captime"),df01("rawstation")).cast(StringType).as("timelocation"))
df02: org.apache.spark.sql.DataFrame = [cardid: string, timelocation: string]
  
scala> def replaceTime(captime:String):String=
     | {
     |    val gjString = captime
     |    val gjsdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     |    val dstsdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
     |    val dt = gjsdf.parse(gjString)
     |    val longtime = dt.getTime()
     |    val dstString = dstsdf.format(longtime);
     |    dstString
     | }
replaceTime: (captime: String)String

 
scala>  val addCol_replacegjTime = udf(replaceTime _)
addCol_replacegjTime: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3994/294314442@1a5fdcae,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)

scala> var df03 = df01.withColumn("time",addCol_replacegjTime(df01("captime"))).select("cardid","time","rawstation","trans_type")
df03: org.apache.spark.sql.DataFrame = [cardid: string, time: string ... 2 more fields]

scala> df03.show(5)
+-------+-------------------+----------+----------+
| cardid|               time|rawstation|trans_type|
+-------+-------------------+----------+----------+
|3697647|2018-10-01 18:47:44|      大新|  地铁出站|
|3697647|2018-10-01 18:35:34|  宝安中心|  地铁入站|
|3697647|2018-10-01 13:49:27|      大新|  地铁入站|
|3697647|2018-10-01 14:03:52|  宝安中心|  地铁出站|
|5344820|2018-10-17 09:34:29|      罗湖|  地铁入站|
+-------+-------------------+----------+----------+
only showing top 5 rows


scala> val df02 = df03.filter(col("trans_type").contains("地铁入站")||col("trans_type").contains("地铁出站"))
df02: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cardid: string, time: string ... 2 more fields]

scala> df02.show(5)
+-------+-------------------+----------+----------+
| cardid|               time|rawstation|trans_type|
+-------+-------------------+----------+----------+
|3697647|2018-10-01 18:47:44|      大新|  地铁出站|
|3697647|2018-10-01 18:35:34|   宝安中心|  地铁入站|
|3697647|2018-10-01 13:49:27|      大新|  地铁入站|
|3697647|2018-10-01 14:03:52|   宝安中心|  地铁出站|
|5344820|2018-10-17 09:34:29|      罗湖|  地铁入站|
+-------+-------------------+----------+----------+
only showing top 5 rows


scala> val df04 = df02.select(df02("cardid"),concat_ws(separator,df02("time"),df02("rawstation").cast(StringType).as("timelocation")))
df04: org.apache.spark.sql.DataFrame = [cardid: string, concat_ws(<>, time, CAST(rawstation AS STRING) AS timelocation): string]

scala> df04.limit(5).collect.foreach(println)
[3697647,2018-10-01 18:47:44<>大新]
[3697647,2018-10-01 18:35:34<>宝安中心]
[3697647,2018-10-01 13:49:27<>大新]
[3697647,2018-10-01 14:03:52<>宝安中心]
[5344820,2018-10-17 09:34:29<>罗湖]

scala> val df04 = df02.select(df02("cardid"),df02("time"))
df04: org.apache.spark.sql.DataFrame = [cardid: string, time: string]

scala> df04.show(5)
+-------+-------------------+
| cardid|               time|
+-------+-------------------+
|3697647|2018-10-01 18:47:44|
|3697647|2018-10-01 18:35:34|
|3697647|2018-10-01 13:49:27|
|3697647|2018-10-01 14:03:52|
|5344820|2018-10-17 09:34:29|
+-------+-------------------+
only showing top 5 rows


scala> df04.limit(5).collect.foreach(println)
[3697647,2018-10-01 18:47:44]
[3697647,2018-10-01 18:35:34]
[3697647,2018-10-01 13:49:27]
[3697647,2018-10-01 14:03:52]
[5344820,2018-10-17 09:34:29]

 
scala> df03.show(5)
+-------+-------------------+----------+----------+
| cardid|               time|rawstation|trans_type|
+-------+-------------------+----------+----------+
|3697647|2018-10-01 18:47:44|      大新|  地铁出站|
|3697647|2018-10-01 18:35:34|  宝安中心|  地铁入站|
|3697647|2018-10-01 13:49:27|      大新|  地铁入站|
|3697647|2018-10-01 14:03:52|  宝安中心|  地铁出站|
|5344820|2018-10-17 09:34:29|      罗湖|  地铁入站|
+-------+-------------------+----------+----------+
only showing top 5 rows


scala> val df04 = df02.select(df02("cardid"),concat_ws(separator,df02("time"),df02("rawstation")).cast(StringType).as("timelocation"))
df04: org.apache.spark.sql.DataFrame = [cardid: string, timelocation: string]

scala> df04.show(5)
+-------+--------------------+
| cardid|        timelocation|
+-------+--------------------+
|3697647|2018-10-01 18:47:...|
|3697647|2018-10-01 18:35:...|
|3697647|2018-10-01 13:49:...|
|3697647|2018-10-01 14:03:...|
|5344820|2018-10-17 09:34:...|
+-------+--------------------+
only showing top 5 rows


scala> val df05 = df04.groupBy("cardid").agg(collect_set("timelocation"))
df05: org.apache.spark.sql.DataFrame = [cardid: string, collect_set(timelocation): array<string>]

scala> df05.show(5)
+-------+-------------------------+                                             
| cardid|collect_set(timelocation)|
+-------+-------------------------+
|0000029|     [2018-10-14 21:36...|
|0000052|     [2018-10-22 20:27...|
|0000088|     [2018-10-12 18:55...|
|0000102|     [2018-10-07 19:18...|
|0000120|     [2018-10-20 20:59...|
+-------+-------------------------+
only showing top 5 rows


scala> df05.limit(5).collect.foreach(println)
[0000029,WrappedArray(2018-10-14 21:36:02<>市民中心, 2018-10-06 12:15:49<>后海, 2018-10-14 22:00:30<>黄贝岭, 2018-10-14 19:48:00<>黄贝岭, 2018-10-14 20:11:30<>市民中心)]
[0000052,WrappedArray(2018-10-22 20:27:37<>深圳北站, 2018-10-22 21:12:17<>湖贝)]
[0000088,WrappedArray(2018-10-12 18:55:24<>兴东, 2018-10-12 20:07:09<>翠竹)]
[0000102,WrappedArray(2018-10-07 19:18:53<>深圳北站, 2018-10-07 18:57:32<>西丽, 2018-10-07 13:35:33<>世界之窗, 2018-10-07 13:37:47<>世界之窗)]
[0000120,WrappedArray(2018-10-20 20:59:55<>福田口岸, 2018-10-01 16:49:54<>少年宫, 2018-10-01 21:33:55<>上沙, 2018-10-03 17:23:31<>深圳湾公园, 2018-10-17 19:17:00<>上沙, 2018-10-23 18:38:18<>上沙, 2018-10-01 16:05:17<>上沙, 2018-10-20 10:02:00<>上沙, 2018-10-23 18:57:56<>香梅北, 2018-10-04 06:41:55<>上沙, 2018-10-23 20:01:28<>上沙, 2018-10-03 17:01:36<>上沙, 2018-10-23 19:27:56<>香梅北, 2018-10-04 07:25:41<>深圳北站, 2018-10-01 21:20:01<>车公庙, 2018-10-17 18:39:49<>深圳北站, 2018-10-20 10:21:18<>福田口岸, 2018-10-20 21:20:09<>上沙)]
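
collect_set gathers the trajectory points in no particular order, as the printed rows show. Because timelocation starts with a yyyy-MM-dd HH:mm:ss timestamp, lexicographic order is chronological order, so a time-ordered trajectory only needs sort_array; a sketch on the same df04:

  val df05Sorted = df04.groupBy("cardid")
    .agg(sort_array(collect_set("timelocation")).as("trajectory"))

concat_ws(separator, col("trajectory")) would then flatten each array into a single delimited string if a flat text export is wanted.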

scala> :quit
[WBQ@westgisB068 ~]$ 

