Article link: https://blog.csdn.net/qiyong7578/article/details/108525751

Frequently Used Techniques & Common Problems at Work

Scala

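1. Checking a value's type and converting it
10.isInstanceOf[Int]
10.asInstanceOf[Double]
2. Mutable collections
ArrayBuffer, mutable Map/HashMap, ListBuffer (List itself is immutable)
3. Commonly used Scala operators
find, zip, groupBy, sortBy
4. Exception handling is done with pattern matching
5. ScalikeJDBC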
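A minimal sketch of items 1 and 4, in plain Scala with no Spark dependency. The object and method names (ScalaNotes, describe, parseAge, safeDivide) are made up for illustration. It shows a type test written as a pattern match, and exception handling as a pattern match over the thrown exception, both with try/catch and with scala.util.Try.

```scala
import scala.util.{Failure, Success, Try}

object ScalaNotes {
  // Type test via pattern matching; an alternative to isInstanceOf/asInstanceOf.
  def describe(x: Any): String = x match {
    case i: Int    => s"Int:$i"
    case s: String => s"String:$s"
    case _: Double => "Double"
    case _         => "other"
  }

  // The catch block is itself a pattern match on the exception type.
  def parseAge(s: String): Int =
    try s.trim.toInt
    catch {
      case _: NumberFormatException => -1
    }

  // Same idea with Try: match on Success / Failure instead of try/catch.
  def safeDivide(a: Int, b: Int): Option[Int] = Try(a / b) match {
    case Success(v)                      => Some(v)
    case Failure(_: ArithmeticException) => None
    case Failure(e)                      => throw e
  }
}
```

For converting between numeric types, 10.toDouble is the more common spelling.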

Spark03

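1. When writing to a database, remember to use foreachPartition
2. coalesce is the fix for small files
Use coalesce right after filter. Otherwise each partition may hold very little data after the filter, yet every task still occupies a core, which wastes resources
3. repartition: use it in production when there are too few partitions. It also helps with data skew
4. case classes are used a lot for sorting
A case class is instantiated directly, without new
A case class is serializable out of the box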
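A small sketch of item 2, assuming a local SparkSession. The data, the partition counts (100 and 4) and the object name are made up for illustration. filter keeps the original number of partitions even when most of them end up nearly empty, and coalesce then merges them without a shuffle.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceAfterFilter {
  // Returns (partitions after filter, partitions after coalesce).
  def run(): (Int, Int) = {
    val spark = SparkSession.builder().master("local[2]").appName("coalesce-sketch").getOrCreate()
    try {
      val rdd = spark.sparkContext.parallelize(1 to 100000, 100)
      // The filter keeps only 100 records, but still spread over 100 partitions (100 tasks).
      val filtered = rdd.filter(_ % 1000 == 0)
      // Merge the nearly empty partitions into 4, without a shuffle.
      val compacted = filtered.coalesce(4)
      (filtered.getNumPartitions, compacted.getNumPartitions)
    } finally spark.stop()
  }
}
```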

Spark04

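1. At work, output is mostly written with saveAsTextFile
sc.textFile reads both local and HDFS files, compressed or not
Once the code is packaged and deployed on the server, there is no need to specify HDFS parameters
The dedicated APIs are only needed for reading and writing SequenceFile and object files, and those are rarely used at work
Writing out over JDBC is common at work: foreachPartition, one connection per partition, plus a batch size (JdbcRDD covers the read side)
Writing to HBase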
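The foreachPartition + JDBC pattern could look roughly like this. It is only a sketch: the URL, credentials, table name (word_count) and the default batch size of 500 are all placeholders, not values from these notes, and a matching JDBC driver has to be on the classpath.

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.RDD

object JdbcSinkSketch {
  // Hypothetical connection settings and table; replace with real ones.
  val url = "jdbc:mysql://localhost:3306/demo"
  val user = "demo_user"
  val password = "demo_password"
  val insertSql = "INSERT INTO word_count (word, cnt) VALUES (?, ?)"

  // Writes one partition through a single connection, flushing every batchSize rows.
  // Returns the number of rows added to batches.
  def writePartition(rows: Iterator[(String, Int)], conn: Connection, batchSize: Int): Int = {
    val stmt = conn.prepareStatement(insertSql)
    var n = 0
    try {
      rows.foreach { case (word, cnt) =>
        stmt.setString(1, word)
        stmt.setInt(2, cnt)
        stmt.addBatch()
        n += 1
        if (n % batchSize == 0) stmt.executeBatch()
      }
      stmt.executeBatch() // flush whatever is left
      n
    } finally stmt.close()
  }

  // One connection per partition, opened on the executor, never one per record.
  def save(rdd: RDD[(String, Int)], batchSize: Int = 500): Unit =
    rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection(url, user, password)
      try { writePartition(rows, conn, batchSize); () }
      finally conn.close()
    }
}
```

The connection is created inside foreachPartition on purpose: connections are not serializable, so they cannot be created on the driver and shipped to executors.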

??? In production, is the output really written with foreachPartition + JDBC?

Spark05

1. Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.
spark on

Spark06

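1. At work, never new up an object per record on the executor side; creating one every time is very wasteful
2. At work, a getInstance singleton works, but it is not optimal
The best approach is to handle it with mapPartitions
3. At work with Spark on YARN, which --deploy-mode should be used?
For Spark on YARN, HADOOP_CONF_DIR must be configured
Only then can Spark load the cluster configuration and submit the job to YARN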
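A sketch of items 1 and 2, assuming a local SparkSession. The date strings and the object name are made up, and SimpleDateFormat stands in for any object that is costly to create. With mapPartitions the object is built once per partition and reused for every record in it.

```scala
import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  def run(): Seq[String] = {
    val spark = SparkSession.builder().master("local[2]").appName("mapPartitions-sketch").getOrCreate()
    try {
      val days = spark.sparkContext.parallelize(Seq("2020-09-10", "2020-09-11", "2021-01-01"), 2)
      days.mapPartitions { iter =>
        // Created once per partition (that is, once per task), not once per record.
        val in = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
        val out = new SimpleDateFormat("yyyyMMdd", Locale.US)
        iter.map(day => out.format(in.parse(day)))
      }.collect().toSeq
    } finally spark.stop()
  }
}
```

Because each task gets its own formatter instances, this also sidesteps the fact that SimpleDateFormat is not thread-safe.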

Spark07

1. How to verify that reduceByKey aggregates on the map side
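One possible way to check this from code, sketched under the assumption of a local SparkSession: read the mapSideCombine flag on the ShuffleDependency that the operator creates, and compare reduceByKey with groupByKey. The helper name and sample data are made up. Another option is to run both operators on the same data and compare the shuffle write record counts in the Spark UI.

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MapSideCombineCheck {
  // True if the RDD's first dependency is a shuffle with map-side combine enabled.
  def mapSideCombine(rdd: RDD[_]): Boolean = rdd.dependencies.headOption match {
    case Some(dep: ShuffleDependency[_, _, _]) => dep.mapSideCombine
    case _                                     => false
  }

  // Returns the flag for (reduceByKey, groupByKey).
  def run(): (Boolean, Boolean) = {
    val spark = SparkSession.builder().master("local[2]").appName("combine-check").getOrCreate()
    try {
      val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
      (mapSideCombine(pairs.reduceByKey(_ + _)), mapSideCombine(pairs.groupByKey()))
    } finally spark.stop()
  }
}
```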

Spark08

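1. Why does reading accss_log.del produce 5 tasks when the file is only 131 MB?
Because when reading from the Mac's local file system, splits are computed with a 32 MB block size
When reading from HDFS, each block is 128 MB
2. In YARN mode with two executors (one core each, the YARN default), at most 2 tasks can run at the same time
3. Difference between persist and cache
cache just calls persist with the default level
Spark Core's default storage level is MEMORY_ONLY
For big data, use MEMORY_ONLY_SER and pick a good serialization framework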

/** Persist this RDD with the default storage level (MEMORY_ONLY).
*/
def cache(): this.type = persist()
/**Persist this RDD with the default storage level (MEMORY_ONLY).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

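// This article is very important!
Spark's default serializer is Java serialization; Kryo serialization is also available
Kryo does not support every Serializable type, and classes should be registered in advance for best performance.
4. Data locality wait time: usually set to 40 seconds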
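The settings mentioned above could be wired together roughly as follows. This is a sketch for a local run: AccessLog is a made-up record type, and the 40s locality wait simply mirrors the value in the note, not a general recommendation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object KryoPersistSketch {
  // Hypothetical record type, used only for this illustration.
  case class AccessLog(ip: String, bytes: Long)

  // Returns (was MEMORY_ONLY_SER applied?, total bytes).
  def run(): (Boolean, Long) = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("kryo-persist-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.locality.wait", "40s")
      .registerKryoClasses(Array(classOf[AccessLog])) // register the classes you serialize
    val spark = SparkSession.builder().config(conf).getOrCreate()
    try {
      val logs = spark.sparkContext
        .parallelize(Seq(AccessLog("10.0.0.1", 100L), AccessLog("10.0.0.2", 200L)))
        .persist(StorageLevel.MEMORY_ONLY_SER) // stored serialized in memory
      val total = logs.map(_.bytes).reduce(_ + _)
      (logs.getStorageLevel == StorageLevel.MEMORY_ONLY_SER, total)
    } finally spark.stop()
  }
}
```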

Spark09

Common error: an RDD nested inside another RDD (Spark does not support nested RDDs)

Caused by: org.apache.spark.SparkException: This RDD lacks a
SparkContext. It could happen in the following cases: (1) RDD
transformations and actions are NOT invoked by the driver, but inside
of other transformations; for example, rdd1.map(x =>
rdd2.values.count() * x) is invalid because the values transformation
and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063. (2) When a Spark
Streaming job recovers from checkpoint, this exception will be hit if
a reference to an RDD not defined by the streaming job is used in
DStream operations.
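A minimal sketch of one way around the error, using the rdd1/rdd2 example from the message itself and assuming a local SparkSession (the sample data is made up). The action on rdd2 runs on the driver first, and only its plain Long result is captured by the closure passed to rdd1.map.

```scala
import org.apache.spark.sql.SparkSession

object NestedRddFix {
  def run(): Seq[Long] = {
    val spark = SparkSession.builder().master("local[2]").appName("nested-rdd-fix").getOrCreate()
    try {
      val sc = spark.sparkContext
      val rdd1 = sc.parallelize(Seq(1L, 2L, 3L))
      val rdd2 = sc.parallelize(Seq(("a", 1), ("b", 2)))
      // Invalid: rdd1.map(x => rdd2.values.count() * x)
      // Valid: run the action on the driver, then close over the plain value.
      val n = rdd2.values.count()
      rdd1.map(x => n * x).collect().toSeq
    } finally spark.stop()
  }
}
```

When the inner data is a lookup table and not a single number, a broadcast variable or a join plays the same role.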

SparkSQL01

1. Difference between the Spark SQL and Spark Core cache strategies
Shown in green: InMemoryTableScan
The Spark SQL cache statement is eager; Spark Core cache is lazy
The Spark SQL uncache statement is eager; Spark Core unpersist is eager too

Syntax: cache table tablename
uncache table tablename

spark.table("tablename").cache is lazy
val df = spark.sql("select * from emp")
df.cache()	// lazy
df.show()	// only now does it show up under Storage

SparkSQL02

1. Type mismatch when converting an RDD to a DataFrame
Caused by: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of bigint

toInt does not match LongType; it has to be changed to toLong

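import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// toLong (not toInt) so that the value matches LongType in the schema
val rdd: RDD[Row] = spark.sparkContext.textFile("xuanfeng-spark-sql/data/people.txt")
  .map(_.split(",")).map(x => Row(x(0), x(1).trim.toLong))

// Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
// The available types can be dug out of the StructType source code
val schema =
  StructType(
    StructField("name", StringType, true) ::
      StructField("age", LongType, false) :: Nil)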

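2. Open question: for converting between RDD and DataFrame, besides those two approaches, does the approach used in LogApp count as a third one?
Looking at the underlying code, toDF uses the case approach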

def toDF(colNames: String*): DataFrame = {
    require(schema.size == colNames.size,
      "The number of columns doesn't match.\n" +
        s"Old column names (${schema.size}): " + schema.fields.map(_.name).mkString(", ") + "\n" +
        s"New column names (${colNames.size}): " + colNames.mkString(", "))

    val newCols = logicalPlan.output.zip(colNames).map { case (oldAttribute, newName) =>
      Column(oldAttribute).as(newName)
    }
    select(newCols : _*)
  }
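The require at the top of toDF is where the "requirement failed: The number of columns doesn't match" error comes from. A small sketch that triggers it, assuming a local SparkSession (the sample rows and the object name are made up):

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object ToDFColumnCount {
  // Returns (column names of the valid DataFrame, did the mismatched call fail with IllegalArgumentException?).
  def run(): (Seq[String], Boolean) = {
    val spark = SparkSession.builder().master("local[2]").appName("todf-sketch").getOrCreate()
    import spark.implicits._
    try {
      // Two tuple fields, two names: fine.
      val df = Seq(("Tom", 30L), ("Ann", 25L)).toDF("name", "age")
      // Two columns but only one new name: the require(...) above fails.
      val tooFew = Try(df.toDF("name"))
      val failed = tooFew.failed.toOption.exists(_.isInstanceOf[IllegalArgumentException])
      (df.columns.toSeq, failed)
    } finally spark.stop()
  }
}
```

So when this error appears, the first thing to compare is the number of names passed to toDF against the number of fields in each record.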

SparkSQL03

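1. When starting spark-shell with a driver jar, --jars alone does not always work
--jars is supposed to ship the jars to the driver and executors automatically, but that is not always the case
--driver-class-path passes extra classpath entries to the driver; note that jars added with --jars are automatically included in the classpath

Solution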

val df6 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://xuanfeng001:3306")
  .option("dbtable", "bigdata.mr_etl_log")
  .option("driver", "com.mysql.jdbc.Driver") // hard-code the driver class in the connection options
  .option("user", "root")
  .option("password", "zihaoahaha")
  .load()
  .select("job_id")

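2. To see which keys .option accepts, search for the corresponding XXXOptions.scala file

3. TODO… rewatch the video on the standard format implementation and debug it

4. Custom external data sources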

SparkSQL04

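1. Catalog test
Local: returns a warehouse on the local file system
Server: if $SPARK_HOME/conf contains hive-site.xml, Spark picks up Hive's metadata location on HDFS
Without hive-site.xml, a Spark warehouse is created on the server's local disk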

SparkSQL05

1. df2.select(expr("xuanfeng_concat_ws('|',province,city,district)").as("cw2")).show(false)
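For a call like this to work, xuanfeng_concat_ws has to be registered as a UDF first. The notes do not show the original definition, so the registration below is only a hypothetical stand-in that joins three strings with a separator; the sample row is made up as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object ConcatWsUdfSketch {
  def run(): Seq[String] = {
    val spark = SparkSession.builder().master("local[2]").appName("udf-sketch").getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical stand-in for the author's UDF: join three fields with a separator.
      spark.udf.register("xuanfeng_concat_ws",
        (sep: String, a: String, b: String, c: String) => Seq(a, b, c).mkString(sep))

      val df2 = Seq(("Zhejiang", "Hangzhou", "Xihu")).toDF("province", "city", "district")
      // A UDF registered by name can be called from a SQL expression string.
      df2.select(expr("xuanfeng_concat_ws('|',province,city,district)").as("cw2"))
        .as[String].collect().toSeq
    } finally spark.stop()
  }
}
```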

SS01

1. exactly-once
2. Combine Streaming with batch and interactive queries: code reuse
For example, ETL logic shared between offline and real-time jobs

SS02

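1. textFileStream
2. Combine Streaming with batch and interactive queries: code reuse
For example, ETL logic shared between offline and real-time jobs
3. In production, streaming output is rarely written to HDFS
Small files
Streaming data is time-sensitive; landing it on a file system adds serialization on every read and write, plus replicas
If it really has to be written, use saveAsTextFiles (saveAsTextFile on each RDD)
4. Use the print operator for testing
To write a DStream out to any kind of database, use foreachRDD and then call foreachPartition on the RDD
When a DStream needs an RDD-level operation, use transform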
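The last item could be wired up roughly like this. It is a sketch only: the word-count logic and the names (StreamSinkSketch, wire, writePartition) are made up, and writePartition just prints where a real job would open a connection and write. sortBy exists on RDD but not on DStream, which is the kind of case transform is for.

```scala
import org.apache.spark.streaming.dstream.DStream

object StreamSinkSketch {
  // Hypothetical sink for one partition; a real job would open one connection here
  // and write the records in batches. Returns the number of records handled.
  def writePartition(records: Iterator[(String, Int)]): Int = {
    var n = 0
    records.foreach { case (word, count) =>
      println(s"$word -> $count")
      n += 1
    }
    n
  }

  def wire(lines: DStream[String]): Unit = {
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // transform: apply an RDD-only operation (sortBy) to every batch.
    val sorted = counts.transform(rdd => rdd.sortBy(_._2, ascending = false))
    sorted.print() // handy while testing
    // foreachRDD + foreachPartition: the usual shape for writing to an external store.
    sorted.foreachRDD { rdd =>
      rdd.foreachPartition { records => writePartition(records); () }
    }
  }
}
```

wire only declares the pipeline; nothing runs until the StreamingContext that owns the DStream is started.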

SS05

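How to implement exactly-once
At Ruolao's company (若老公司): when saving results, maintain idempotence; for an RDBMS
upsert (equivalent to merge)
When saving results and offsets together: maintain atomicity
Transaction control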
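Why idempotence helps can be seen with a tiny plain-Scala simulation. An immutable Map stands in for a table with a primary key; this is an illustration, not database code, and all names are made up. If a batch is replayed after a failure, an upsert leaves the table unchanged, while an additive write double-counts.

```scala
object IdempotentUpsertSketch {
  // In-memory stand-in for a table keyed by primary key.
  type Table = Map[String, Long]

  // Upsert: insert new keys, overwrite existing ones.
  def upsert(table: Table, batch: Seq[(String, Long)]): Table = table ++ batch

  // Additive write, for contrast: replaying the same batch changes the result.
  def accumulate(table: Table, batch: Seq[(String, Long)]): Table =
    batch.foldLeft(table) { case (t, (k, v)) => t.updated(k, t.getOrElse(k, 0L) + v) }
}
```

With upsert, applying the same batch once or twice gives the same table. With accumulate it does not, which is why the second option in the notes keeps results and offsets in one transaction.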

Common commands

1. Start Spark with the MySQL driver
spark-shell --jars ~/lib/mysql-connector-java-5.1.28.jar
2. How to debug and run a Scala file in spark-shell
https://blog.csdn.net/zg_hover/article/details/106680040
3. zeppelin-daemon.sh
start, stop, restart
4. Start a socket
nc -lk 9527
5. ps -ef|grep hbase  gets the process pid
netstat -nlp|grep pid  finds the ports used by that process
6. Firewall commands
iptables -F  flushes the firewall rules (never run this directly in production)
iptables -L  lists the firewall rules
7. View YARN logs
Run in a terminal:
yarn logs -applicationId [application_id] > log.txt
yarn logs -applicationId application_1602915670097_0005 > logs.txt