[Spark on Kubernetes] Accessing HDFS and Hive via the Spark Operator

The previous post covered deploying the Spark Operator; this one shows how a Spark application reads data from HDFS and Hive.

Environment

Hadoop 3.1
Hive 3.1
Spark 3.0

1. Writing the Code

Java code that queries a Hive table and reads a file from HDFS:

package com.seagate.client.zyspark;

import java.io.File;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDemo {

    public static void main(String[] args) throws Exception {
        sparkQuery2("select * from attribute_1 limit 10");
        sparkQueryHdfs();
    }

    // Query a Hive table through a Hive-enabled SparkSession and print the result.
    public static void sparkQuery2(String sql) throws Exception {
        System.out.println("=========================1");
        // Not actually used here: the warehouse location is taken from hive-site.xml.
        String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
        System.out.println("===========================2");

        long sessionStart = System.currentTimeMillis();
        SparkSession sparkSession = SparkSession.builder()
                .appName("Java Spark Hive Example")
                .enableHiveSupport()
                .getOrCreate();
        System.out.println("SparkDemo cost:" + (System.currentTimeMillis() - sessionStart));
        System.out.println("==============================3");
        System.out.println("==============================3.11:" + sql);

        long queryStart = System.currentTimeMillis();
        sparkSession.sql(sql).show();
        Dataset<Row> sqlDF = sparkSession.sql(sql);
        System.out.println("sparkSession cost:" + (System.currentTimeMillis() - queryStart));
        System.out.println("======================4");
        System.out.println("===========sqlDF count===========:" + sqlDF.count());

        // Collect the rows to the driver and print the first column (serial_num).
        long collectStart = System.currentTimeMillis();
        List<Row> jaList = sqlDF.javaRDD().collect();
        System.out.println("rdd collect cost:" + (System.currentTimeMillis() - collectStart));
        System.out.println("SparkDemo cost:" + (System.currentTimeMillis() - queryStart));
        System.out.println("jaList list:" + jaList.size());

        long printStart = System.currentTimeMillis();
        jaList.stream().forEachOrdered(result ->
                System.out.println("serial_num is :" + result.getString(0)));
        System.out.println("SparkDemo foreach cost:" + (System.currentTimeMillis() - printStart));

        sparkSession.close();
    }

    // Read a plain-text file from HDFS and print every line.
    public static void sparkQueryHdfs() throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("HDFSQuery");
        SparkSession sparkSession = SparkSession.builder()
                .sparkContext(new SparkContext(sparkConf))
                .getOrCreate();

        // Example paths:
        // /warehouse/alerts/dept=it/delta_0000001_0000001_0000/_orc_acid_version
        // /tmp/idat/idat.S_20210708160847348/data.txt
        Dataset<Row> rows = sparkSession.read()
                .option("header", true)
                .option("inferSchema", true)
                .text("hdfs://10.38.149.125:8020/tmp/idat/idat.S_20210708160847348/data.txt")
                .cache();

        List<Row> jaList = rows.javaRDD().collect();
        jaList.stream().forEachOrdered(result -> System.out.println(result.getString(0)));
    }
}
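To compile this class, the build needs the Spark SQL and Hive modules on the classpath. A minimal Maven dependency sketch (versions matching the Spark 3.0.0 environment above; the provided scope is an assumption, since these jars already exist inside the Spark image):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.0.0</version>
    <scope>provided</scope>
</dependency>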

2. Writing the Dockerfile

Package the application jar and the relevant Hadoop configuration files into the image, set the HADOOP_CONF_DIR environment variable, and place the configuration files at the path it points to. Then build the image and push it to Harbor (commands sketched after the Dockerfile).

FROM 10.38.199.203:8090/library/spark:v3.0.0

MAINTAINER zhenyang

USER root

# Create a non-root "hive" user to run the application
RUN useradd --create-home --no-log-init --shell /bin/bash hive
RUN adduser hive sudo

# Ship the Hadoop/Hive client configuration inside the image
RUN mkdir -p /opt/spark/work-dir/hadoop
ADD core-site.xml /opt/spark/work-dir/hadoop
ADD hdfs-site.xml /opt/spark/work-dir/hadoop
ADD hive-site.xml /opt/spark/work-dir/hadoop

# Point Spark at the configuration directory
ENV HADOOP_CONF_DIR /opt/spark/work-dir/hadoop
ENV PATH $PATH:$HADOOP_CONF_DIR

# Application jar
ADD zyspark-0.0.1-SNAPSHOT.jar /tmp
RUN chmod -R 777 /tmp
RUN chmod -R 777 /opt

USER hive
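The build-and-push step might look like the following (the sparkjar:v0.1 tag is taken from the deployment YAML below, the registry address from the base image; adjust to your Harbor project):

docker build -t 10.38.199.203:8090/library/sparkjar:v0.1 .
docker push 10.38.199.203:8090/library/sparkjar:v0.1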

3. Writing the Spark Deployment Spec

Create first-test.yaml. Note that mainApplicationFile uses the local:// scheme, which tells Spark the jar is already inside the image:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: first-test
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "10.38.199.203:8090/library/sparkjar:v0.1"
  imagePullPolicy: Always
  mainClass: com.seagate.client.zyspark.SparkDemo
  mainApplicationFile: "local:///tmp/zyspark-0.0.1-SNAPSHOT.jar"
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.0.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

4. Deploying the Spark Application

Deploy the application and check the log output.
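Assuming first-test.yaml is in the current directory, submitting the application is a single kubectl apply:

kubectl apply -f first-test.yaml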

Check the pods and services:

kubectl get all|grep first

Check the SparkApplication's startup status and events:

kubectl describe SparkApplications first-test

Check the application output log. You can see that the data from the Hive table attribute_1 is printed correctly, followed by the contents of the HDFS file /tmp/idat/idat.S_20210708160847348/data.txt:

kubectl logs first-test-driver
++ id -u
+ myuid=1000
++ id -g
+ mygid=1000
+ set +e
++ getent passwd 1000
+ uidentry=hive:x:1000:1000::/home/hive:/bin/bash
+ set -e
+ '[' -z hive:x:1000:1000::/home/hive:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/work-dir/hadoop::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.42.4.185 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class com.seagate.client.zyspark.SparkDemo local:///tmp/zyspark-0.0.1-SNAPSHOT.jar
21/12/30 03:01:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
=========================1
===========================2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hive.conf.HiveConf).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/12/30 03:01:06 INFO SparkContext: Running Spark version 3.0.0
21/12/30 03:01:07 INFO ResourceUtils: ==============================================================
21/12/30 03:01:07 INFO ResourceUtils: Resources for spark.driver:

21/12/30 03:01:07 INFO ResourceUtils: ==============================================================
21/12/30 03:01:07 INFO SparkContext: Submitted application: Java Spark Hive Example
21/12/30 03:01:07 INFO SecurityManager: Changing view acls to: hive,root
21/12/30 03:01:07 INFO SecurityManager: Changing modify acls to: hive,root
21/12/30 03:01:07 INFO SecurityManager: Changing view acls groups to:
21/12/30 03:01:07 INFO SecurityManager: Changing modify acls groups to:
21/12/30 03:01:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hive, root); groups with view permissions: Set(); users  with modify permissions: Set(hive, root); groups with modify permissions: Set()
21/12/30 03:01:07 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
21/12/30 03:01:07 INFO SparkEnv: Registering MapOutputTracker
21/12/30 03:01:07 INFO SparkEnv: Registering BlockManagerMaster
21/12/30 03:01:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/12/30 03:01:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/12/30 03:01:07 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/12/30 03:01:07 INFO DiskBlockManager: Created local directory at /var/data/spark-8576fef9-3d38-4374-8817-89cc5c6a4d2f/blockmgr-35e9cb7a-f55c-4cda-ac62-2b55592fc7a0
21/12/30 03:01:07 INFO MemoryStore: MemoryStore started with capacity 117.0 MiB
21/12/30 03:01:07 INFO SparkEnv: Registering OutputCommitCoordinator
21/12/30 03:01:08 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/12/30 03:01:08 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://first-test-f27b697e094863f6-driver-svc.default.svc:4040
21/12/30 03:01:08 INFO SparkContext: Added JAR local:///tmp/zyspark-0.0.1-SNAPSHOT.jar at file:/tmp/zyspark-0.0.1-SNAPSHOT.jar with timestamp 1640833268488
21/12/30 03:01:08 WARN SparkContext: The jar local:///tmp/zyspark-0.0.1-SNAPSHOT.jar has been added already. Overwriting of added jars is not supported in the current version.
21/12/30 03:01:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/12/30 03:01:10 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
21/12/30 03:01:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
21/12/30 03:01:10 INFO NettyBlockTransferService: Server created on first-test-f27b697e094863f6-driver-svc.default.svc:7079
21/12/30 03:01:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/30 03:01:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:11 INFO BlockManagerMasterEndpoint: Registering block manager first-test-f27b697e094863f6-driver-svc.default.svc:7079 with 117.0 MiB RAM, BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:11 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:15 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 512, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
21/12/30 03:01:16 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.4.186:52492) with ID 1
21/12/30 03:01:16 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
SparkDemo cost:9993
==============================3
==============================3.11:select * from attribute_1 limit 10
21/12/30 03:01:16 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.4.186:37394 with 117.0 MiB RAM, BlockManagerId(1, 10.42.4.186, 37394, None)
21/12/30 03:01:16 INFO SharedState: loading hive config file: file:/opt/spark/work-dir/hadoop/hive-site.xml
21/12/30 03:01:17 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/warehouse').
21/12/30 03:01:17 INFO SharedState: Warehouse path is '/warehouse'.
21/12/30 03:01:19 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using Spark classes.
21/12/30 03:01:19 INFO HiveConf: Found configuration file file:/opt/spark/work-dir/hadoop/hive-site.xml
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.materializedview.rewriting.incremental does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.server2.webui.cors.allowed.headers does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.hook.proto.base-directory does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.load.data.owner does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.strict.managed.tables does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.create.as.insert.only does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.mapred.supports.subdirectories does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.metastore.db.type does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.metastore.warehouse.external.dir does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.heapsize does not exist
21/12/30 03:01:20 WARN HiveConf: HiveConf of name hive.server2.webui.enable.cors does not exist
21/12/30 03:01:20 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
21/12/30 03:01:21 INFO SessionState: Created local directory: /tmp/hive
21/12/30 03:01:21 INFO SessionState: Created HDFS directory: /tmp/hive/hive/5c2423c4-15f5-4090-aacb-b8903187fc1e
21/12/30 03:01:21 INFO SessionState: Created local directory: /tmp/hive/5c2423c4-15f5-4090-aacb-b8903187fc1e
21/12/30 03:01:21 INFO SessionState: Created HDFS directory: /tmp/hive/hive/5c2423c4-15f5-4090-aacb-b8903187fc1e/_tmp_space.db
21/12/30 03:01:21 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is /warehouse
21/12/30 03:01:21 INFO metastore: Trying to connect to metastore with URI thrift://seadoop-test128.wux.chin.seagate.com:9083
21/12/30 03:01:21 INFO metastore: Opened a connection to metastore, current connections: 1
21/12/30 03:01:21 INFO metastore: Connected to metastore.
21/12/30 03:01:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 396.9 KiB, free 116.6 MiB)
21/12/30 03:01:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.3 KiB, free 116.5 MiB)
21/12/30 03:01:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 34.3 KiB, free: 116.9 MiB)
21/12/30 03:01:26 INFO SparkContext: Created broadcast 0 from
21/12/30 03:01:26 INFO FileInputFormat: Total input paths to process : 1
21/12/30 03:01:26 INFO SparkContext: Starting job: show at SparkDemo.java:182
21/12/30 03:01:26 INFO DAGScheduler: Got job 0 (show at SparkDemo.java:182) with 1 output partitions
21/12/30 03:01:26 INFO DAGScheduler: Final stage: ResultStage 0 (show at SparkDemo.java:182)
21/12/30 03:01:26 INFO DAGScheduler: Parents of final stage: List()
21/12/30 03:01:26 INFO DAGScheduler: Missing parents: List()
21/12/30 03:01:27 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at show at SparkDemo.java:182), which has no missing parents
21/12/30 03:01:27 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.2 KiB, free 116.5 MiB)
21/12/30 03:01:27 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.1 KiB, free 116.5 MiB)
21/12/30 03:01:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 5.1 KiB, free: 116.9 MiB)
21/12/30 03:01:27 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:27 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at show at SparkDemo.java:182) (first 15 tasks are for partitions Vector(0))
21/12/30 03:01:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
21/12/30 03:01:27 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.42.4.186, executor 1, partition 0, ANY, 7404 bytes)
21/12/30 03:01:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.42.4.186:37394 (size: 5.1 KiB, free: 117.0 MiB)
21/12/30 03:01:28 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.42.4.186:37394 (size: 34.3 KiB, free: 116.9 MiB)
21/12/30 03:01:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2885 ms on 10.42.4.186 (executor 1) (1/1)
21/12/30 03:01:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/12/30 03:01:30 INFO DAGScheduler: ResultStage 0 (show at SparkDemo.java:182) finished in 3.203 s
21/12/30 03:01:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
21/12/30 03:01:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
21/12/30 03:01:30 INFO DAGScheduler: Job 0 finished: show at SparkDemo.java:182, took 3.336185 s
21/12/30 03:01:30 INFO CodeGenerator: Code generated in 404.729466 ms
+----------+---------+-------------------+----------------+------------+------+---------+----------+
|serial_num|trans_seq|          test_date|            name|       value|family|operation|event_date|
+----------+---------+-------------------+----------------+------------+------+---------+----------+
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|3TempMove_ENABLE|           1|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|             AAB|      501.42|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|     AMTS_PERIOD|201904040500|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|    BALANCE_RING|       117FN|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|  BAL_RING_ANGLE|         0.0|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|   BAL_RING_SIZE|          09|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16| BEARING_LOT_NUM|           B|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|      BIRTH_DATE|    20190404|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|     BUILD_GROUP|         NEW|   2AN|     PRE2|  20190405|
|  WCJ1MHZ2|        8|04/04/2019 06:11:16|         CELL_ID|         F24|   2AN|     PRE2|  20190405|
+----------+---------+-------------------+----------------+------------+------+---------+----------+

sparkSession cost:14626
======================4
21/12/30 03:01:31 INFO CodeGenerator: Code generated in 63.434136 ms
21/12/30 03:01:31 INFO CodeGenerator: Code generated in 9.203573 ms
21/12/30 03:01:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 396.7 KiB, free 116.1 MiB)
21/12/30 03:01:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 34.2 KiB, free 116.1 MiB)
21/12/30 03:01:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 34.2 KiB, free: 116.9 MiB)
21/12/30 03:01:31 INFO SparkContext: Created broadcast 2 from
21/12/30 03:01:31 INFO SparkContext: Starting job: count at SparkDemo.java:186
21/12/30 03:01:31 INFO FileInputFormat: Total input paths to process : 1
21/12/30 03:01:31 INFO DAGScheduler: Registering RDD 10 (count at SparkDemo.java:186) as input to shuffle 0
21/12/30 03:01:31 INFO DAGScheduler: Got job 1 (count at SparkDemo.java:186) with 1 output partitions
21/12/30 03:01:31 INFO DAGScheduler: Final stage: ResultStage 2 (count at SparkDemo.java:186)
21/12/30 03:01:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
21/12/30 03:01:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
21/12/30 03:01:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[10] at count at SparkDemo.java:186), which has no missing parents
21/12/30 03:01:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 13.1 KiB, free 116.1 MiB)
21/12/30 03:01:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.6 KiB, free 116.1 MiB)
21/12/30 03:01:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 6.6 KiB, free: 116.9 MiB)
21/12/30 03:01:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:31 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[10] at count at SparkDemo.java:186) (first 15 tasks are for partitions Vector(0, 1))
21/12/30 03:01:31 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
21/12/30 03:01:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 10.42.4.186, executor 1, partition 0, ANY, 7393 bytes)
21/12/30 03:01:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.42.4.186:37394 (size: 6.6 KiB, free: 116.9 MiB)
21/12/30 03:01:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.42.4.186:37394 (size: 34.2 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 10.42.4.186, executor 1, partition 1, ANY, 7393 bytes)
21/12/30 03:01:32 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 308 ms on 10.42.4.186 (executor 1) (1/2)
21/12/30 03:01:32 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 41 ms on 10.42.4.186 (executor 1) (2/2)
21/12/30 03:01:32 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
21/12/30 03:01:32 INFO DAGScheduler: ShuffleMapStage 1 (count at SparkDemo.java:186) finished in 0.429 s
21/12/30 03:01:32 INFO DAGScheduler: looking for newly runnable stages
21/12/30 03:01:32 INFO DAGScheduler: running: Set()
21/12/30 03:01:32 INFO DAGScheduler: waiting: Set(ResultStage 2)
21/12/30 03:01:32 INFO DAGScheduler: failed: Set()
21/12/30 03:01:32 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[13] at count at SparkDemo.java:186), which has no missing parents
21/12/30 03:01:32 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 11.5 KiB, free 116.1 MiB)
21/12/30 03:01:32 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 5.3 KiB, free 116.1 MiB)
21/12/30 03:01:32 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 5.3 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:32 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[13] at count at SparkDemo.java:186) (first 15 tasks are for partitions Vector(0))
21/12/30 03:01:32 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
21/12/30 03:01:32 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, 10.42.4.186, executor 1, partition 0, NODE_LOCAL, 7344 bytes)
21/12/30 03:01:32 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.42.4.186:37394 (size: 5.3 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.42.4.186:52492
21/12/30 03:01:32 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 210 ms on 10.42.4.186 (executor 1) (1/1)
21/12/30 03:01:32 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
21/12/30 03:01:32 INFO DAGScheduler: ResultStage 2 (count at SparkDemo.java:186) finished in 0.221 s
21/12/30 03:01:32 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
21/12/30 03:01:32 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Stage finished
21/12/30 03:01:32 INFO DAGScheduler: Job 1 finished: count at SparkDemo.java:186, took 0.724169 s
===========sqlDF count===========:10
21/12/30 03:01:32 INFO BlockManagerInfo: Removed broadcast_3_piece0 on first-test-f27b697e094863f6-driver-svc.default.svc:7079 in memory (size: 6.6 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO CodeGenerator: Code generated in 18.799198 ms
21/12/30 03:01:32 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.42.4.186:37394 in memory (size: 6.6 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO BlockManagerInfo: Removed broadcast_4_piece0 on first-test-f27b697e094863f6-driver-svc.default.svc:7079 in memory (size: 5.3 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 10.42.4.186:37394 in memory (size: 5.3 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO CodeGenerator: Code generated in 20.996038 ms
21/12/30 03:01:32 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 396.9 KiB, free 115.7 MiB)
21/12/30 03:01:32 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 34.3 KiB, free 115.7 MiB)
21/12/30 03:01:32 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 34.3 KiB, free: 116.9 MiB)
21/12/30 03:01:32 INFO SparkContext: Created broadcast 5 from
21/12/30 03:01:32 INFO SparkContext: Starting job: collect at SparkDemo.java:189
21/12/30 03:01:33 INFO FileInputFormat: Total input paths to process : 1
21/12/30 03:01:33 INFO DAGScheduler: Registering RDD 19 (javaRDD at SparkDemo.java:189) as input to shuffle 1
21/12/30 03:01:33 INFO DAGScheduler: Got job 2 (collect at SparkDemo.java:189) with 1 output partitions
21/12/30 03:01:33 INFO DAGScheduler: Final stage: ResultStage 4 (collect at SparkDemo.java:189)
21/12/30 03:01:33 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
21/12/30 03:01:33 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)
21/12/30 03:01:33 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[19] at javaRDD at SparkDemo.java:189), which has no missing parents
21/12/30 03:01:33 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 17.4 KiB, free 115.7 MiB)
21/12/30 03:01:33 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 7.8 KiB, free 115.7 MiB)
21/12/30 03:01:33 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 7.8 KiB, free: 116.8 MiB)
21/12/30 03:01:33 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:33 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[19] at javaRDD at SparkDemo.java:189) (first 15 tasks are for partitions Vector(0, 1))
21/12/30 03:01:33 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
21/12/30 03:01:33 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 4, 10.42.4.186, executor 1, partition 0, ANY, 7393 bytes)
21/12/30 03:01:33 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.42.4.186:37394 (size: 7.8 KiB, free: 116.9 MiB)
21/12/30 03:01:33 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 10.42.4.186:37394 (size: 34.3 KiB, free: 116.8 MiB)
21/12/30 03:01:33 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 5, 10.42.4.186, executor 1, partition 1, ANY, 7393 bytes)
21/12/30 03:01:33 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 4) in 195 ms on 10.42.4.186 (executor 1) (1/2)
21/12/30 03:01:33 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 5) in 51 ms on 10.42.4.186 (executor 1) (2/2)
21/12/30 03:01:33 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
21/12/30 03:01:33 INFO DAGScheduler: ShuffleMapStage 3 (javaRDD at SparkDemo.java:189) finished in 0.257 s
21/12/30 03:01:33 INFO DAGScheduler: looking for newly runnable stages
21/12/30 03:01:33 INFO DAGScheduler: running: Set()
21/12/30 03:01:33 INFO DAGScheduler: waiting: Set(ResultStage 4)
21/12/30 03:01:33 INFO DAGScheduler: failed: Set()
21/12/30 03:01:33 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[24] at javaRDD at SparkDemo.java:189), which has no missing parents
21/12/30 03:01:33 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 22.2 KiB, free 115.6 MiB)
21/12/30 03:01:33 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 10.2 KiB, free 115.6 MiB)
21/12/30 03:01:33 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 10.2 KiB, free: 116.8 MiB)
21/12/30 03:01:33 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[24] at javaRDD at SparkDemo.java:189) (first 15 tasks are for partitions Vector(0))
21/12/30 03:01:33 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
21/12/30 03:01:33 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 6, 10.42.4.186, executor 1, partition 0, NODE_LOCAL, 7344 bytes)
21/12/30 03:01:33 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.42.4.186:37394 (size: 10.2 KiB, free: 116.8 MiB)
21/12/30 03:01:33 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 10.42.4.186:52492
21/12/30 03:01:34 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 6) in 1456 ms on 10.42.4.186 (executor 1) (1/1)
21/12/30 03:01:34 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
21/12/30 03:01:34 INFO DAGScheduler: ResultStage 4 (collect at SparkDemo.java:189) finished in 1.493 s
21/12/30 03:01:34 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job
21/12/30 03:01:34 INFO TaskSchedulerImpl: Killing all running tasks in stage 4: Stage finished
21/12/30 03:01:34 INFO DAGScheduler: Job 2 finished: collect at SparkDemo.java:189, took 1.824811 s
rdd collect cost:2429
SparkDemo cost:18332
jaList list:10
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
serial_num is :WCJ1MHZ2
SparkDemo foreach cost:1
=========================5
21/12/30 03:01:34 INFO SparkUI: Stopped Spark web UI at http://first-test-f27b697e094863f6-driver-svc.default.svc:4040
21/12/30 03:01:34 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/12/30 03:01:34 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/12/30 03:01:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/12/30 03:01:34 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/12/30 03:01:34 INFO MemoryStore: MemoryStore cleared
21/12/30 03:01:34 INFO BlockManager: BlockManager stopped
21/12/30 03:01:34 INFO BlockManagerMaster: BlockManagerMaster stopped
21/12/30 03:01:34 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/12/30 03:01:34 INFO SparkContext: Successfully stopped SparkContext
21/12/30 03:01:34 INFO SparkContext: Running Spark version 3.0.0
21/12/30 03:01:34 INFO ResourceUtils: ==============================================================
21/12/30 03:01:34 INFO ResourceUtils: Resources for spark.driver:

21/12/30 03:01:34 INFO ResourceUtils: ==============================================================
21/12/30 03:01:34 INFO SparkContext: Submitted application: HDFSQuery
21/12/30 03:01:34 INFO SecurityManager: Changing view acls to: hive,root
21/12/30 03:01:34 INFO SecurityManager: Changing modify acls to: hive,root
21/12/30 03:01:34 INFO SecurityManager: Changing view acls groups to:
21/12/30 03:01:34 INFO SecurityManager: Changing modify acls groups to:
21/12/30 03:01:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hive, root); groups with view permissions: Set(); users  with modify permissions: Set(hive, root); groups with modify permissions: Set()
21/12/30 03:01:35 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
21/12/30 03:01:35 INFO SparkEnv: Registering MapOutputTracker
21/12/30 03:01:35 INFO SparkEnv: Registering BlockManagerMaster
21/12/30 03:01:35 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/12/30 03:01:35 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/12/30 03:01:35 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/12/30 03:01:35 INFO DiskBlockManager: Created local directory at /var/data/spark-8576fef9-3d38-4374-8817-89cc5c6a4d2f/blockmgr-132da755-6e4c-485d-b5c0-ab372e6e075d
21/12/30 03:01:35 INFO MemoryStore: MemoryStore started with capacity 117.0 MiB
21/12/30 03:01:35 INFO SparkEnv: Registering OutputCommitCoordinator
21/12/30 03:01:35 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/12/30 03:01:35 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://first-test-f27b697e094863f6-driver-svc.default.svc:4040
21/12/30 03:01:35 INFO SparkContext: Added JAR local:///tmp/zyspark-0.0.1-SNAPSHOT.jar at file:/tmp/zyspark-0.0.1-SNAPSHOT.jar with timestamp 1640833295364
21/12/30 03:01:35 WARN SparkContext: The jar local:///tmp/zyspark-0.0.1-SNAPSHOT.jar has been added already. Overwriting of added jars is not supported in the current version.
21/12/30 03:01:35 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/12/30 03:01:35 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
21/12/30 03:01:35 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
21/12/30 03:01:35 INFO NettyBlockTransferService: Server created on first-test-f27b697e094863f6-driver-svc.default.svc:7079
21/12/30 03:01:35 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/30 03:01:35 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:35 INFO BlockManagerMasterEndpoint: Registering block manager first-test-f27b697e094863f6-driver-svc.default.svc:7079 with 117.0 MiB RAM, BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:35 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:35 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, first-test-f27b697e094863f6-driver-svc.default.svc, 7079, None)
21/12/30 03:01:36 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
21/12/30 03:01:37 INFO ExecutorPodsAllocator: Deleting 1 excess pod requests (1).
21/12/30 03:01:41 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.4.187:40532) with ID 2
21/12/30 03:01:41 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
21/12/30 03:01:41 INFO SharedState: loading hive config file: file:/opt/spark/work-dir/hadoop/hive-site.xml
21/12/30 03:01:41 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/warehouse').
21/12/30 03:01:41 INFO SharedState: Warehouse path is '/warehouse'.
21/12/30 03:01:41 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.4.187:36916 with 117.0 MiB RAM, BlockManagerId(2, 10.42.4.187, 36916, None)
21/12/30 03:01:41 INFO InMemoryFileIndex: It took 106 ms to list leaf files for 1 paths.
21/12/30 03:01:41 INFO FileSourceStrategy: Pruning directories with:
21/12/30 03:01:41 INFO FileSourceStrategy: Pushed Filters:
21/12/30 03:01:41 INFO FileSourceStrategy: Post-Scan Filters:
21/12/30 03:01:41 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
21/12/30 03:01:41 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 393.1 KiB, free 116.6 MiB)
21/12/30 03:01:41 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.0 KiB, free 116.5 MiB)
21/12/30 03:01:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 34.0 KiB, free: 116.9 MiB)
21/12/30 03:01:41 INFO SparkContext: Created broadcast 0 from javaRDD at SparkDemo.java:258
21/12/30 03:01:41 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
21/12/30 03:01:42 INFO SparkContext: Starting job: collect at SparkDemo.java:258
21/12/30 03:01:42 INFO DAGScheduler: Got job 0 (collect at SparkDemo.java:258) with 1 output partitions
21/12/30 03:01:42 INFO DAGScheduler: Final stage: ResultStage 0 (collect at SparkDemo.java:258)
21/12/30 03:01:42 INFO DAGScheduler: Parents of final stage: List()
21/12/30 03:01:42 INFO DAGScheduler: Missing parents: List()
21/12/30 03:01:42 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at javaRDD at SparkDemo.java:258), which has no missing parents
21/12/30 03:01:42 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.7 KiB, free 116.5 MiB)
21/12/30 03:01:42 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.6 KiB, free 116.5 MiB)
21/12/30 03:01:42 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on first-test-f27b697e094863f6-driver-svc.default.svc:7079 (size: 7.6 KiB, free: 116.9 MiB)
21/12/30 03:01:42 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200
21/12/30 03:01:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at javaRDD at SparkDemo.java:258) (first 15 tasks are for partitions Vector(0))
21/12/30 03:01:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
21/12/30 03:01:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.42.4.187, executor 2, partition 0, ANY, 7784 bytes)
21/12/30 03:01:42 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.42.4.187:36916 (size: 7.6 KiB, free: 117.0 MiB)
21/12/30 03:01:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.42.4.187:36916 (size: 34.0 KiB, free: 116.9 MiB)
21/12/30 03:01:44 INFO BlockManagerInfo: Added rdd_2_0 in memory on 10.42.4.187:36916 (size: 1712.0 B, free: 116.9 MiB)
21/12/30 03:01:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3787 ms on 10.42.4.187 (executor 2) (1/1)
21/12/30 03:01:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/12/30 03:01:46 INFO DAGScheduler: ResultStage 0 (collect at SparkDemo.java:258) finished in 3.853 s
21/12/30 03:01:46 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
21/12/30 03:01:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
21/12/30 03:01:46 INFO DAGScheduler: Job 0 finished: collect at SparkDemo.java:258, took 3.864630 s
G1,WRQ01F2Z,9,2TJ,SCOPY,2021-06-20 19:03:33,20210620
G1,WRQ01F2Z,13,2TJ,PRE2,2021-06-22 02:22:38,20210622
G1,WRQ01F2Z,15,2TJ,CAL2,2021-06-24 08:07:09,20210624
G1,WRQ01F2Z,17,2TJ,FNC2,2021-06-29 16:43:22,20210629
G1,WRQ01F2Z,19,2TJ,SPSC2,2021-06-30 03:44:12,20210630
G1,WRQ01F2Z,21,2TJ,CRT2,2021-07-01 03:11:58,20210701
G1,WRQ01F2Z,23,2TJ,FIN2,2021-07-02 15:23:09,20210702
G1,WRQ01F31,9,2TJ,SCOPY,2021-06-20 19:07:12,20210620
G1,WRQ01F31,13,2TJ,PRE2,2021-06-22 06:35:12,20210622
G1,WRQ01F31,15,2TJ,CAL2,2021-06-23 09:35:08,20210623
G1,WRQ01F43,9,2TJ,SCOPY,2021-06-20 22:23:48,20210620
G1,WRQ01F43,13,2TJ,PRE2,2021-06-22 07:53:39,20210622
G1,WRQ01F43,15,2TJ,CAL2,2021-06-23 10:37:49,20210623
G1,WRQ01F9B,9,2TJ,SCOPY,2021-06-20 20:18:36,20210620
G1,WRQ01F9B,13,2TJ,PRE2,2021-06-22 04:25:05,20210622
G1,WRQ01F9B,15,2TJ,CAL2,2021-06-24 08:48:47,20210624
G1,WRQ01F9B,18,2TJ,FNC2,2021-06-30 19:29:23,20210630
G1,WRQ01F9B,20,2TJ,SPSC2,2021-07-01 06:54:52,20210701
G1,WRQ01F9B,22,2TJ,CRT2,2021-07-04 09:02:30,20210704
G1,WRQ01F9B,24,2TJ,FIN2,2021-07-05 21:36:56,20210705
G1,WRQ01FA1,9,2TJ,SCOPY,2021-06-20 20:29:24,20210620
G1,WRQ01FA1,13,2TJ,PRE2,2021-06-22 08:00:46,20210622
G1,WRQ01FA1,15,2TJ,CAL2,2021-06-23 11:57:59,20210623

5. Problems Encountered

When accessing the Hive table, I ran into a problem: the data in attribute_1 could never be read. This is a Hive managed (internal) table; I tried changing the program and recreating the table, but the data still could not be read. I then had the program run show tables first to check whether the table existed - it did, yet its contents still could not be queried, which was very strange.

Searching online turned up the cause: since Hive 3.0, ACID is enabled by default, and newly created managed tables default to ACID tables (Hive transactional tables). Spark does not yet support Hive's ACID feature, so it cannot read data from ACID tables.
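To verify whether a table was created as a transactional table, you can inspect its table parameters in Hive (an illustrative check, not from the original troubleshooting; run it in beeline or the hive CLI):

DESCRIBE FORMATTED attribute_1;
-- a transactional table shows transactional=true under Table Parameters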

The workaround is to change the ACID-related configuration. As a temporary fix:

hive.strict.managed.tables=false 
hive.create.as.insert.only=false 
metastore.create.as.acid=false
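If your cluster manages these through hive-site.xml, the corresponding entries might look like the sketch below (property names as above; whether your distribution takes them from hive-site.xml directly or through its admin UI, such as Ambari, is an assumption about your setup):

<property>
    <name>hive.strict.managed.tables</name>
    <value>false</value>
</property>
<property>
    <name>hive.create.as.insert.only</name>
    <value>false</value>
</property>
<property>
    <name>metastore.create.as.acid</name>
    <value>false</value>
</property>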

After changing these properties, restarting Hive, recreating the table, and redeploying, the Hive table data was finally read successfully. Marking this issue down here for future reference!
