Spark 2.2.3: notes on basic concepts and methods
1. Code + worked examples: a complete guide to processing big data with Spark (Part 1)
https://www.jianshu.com/p/826c16298ca6
2. Code + worked examples: a complete guide to processing big data with Spark (Part 2)
https://zhuanlan.zhihu.com/p/95022557
For Spark deployment and startup, see
https://github.com/heibaiying/BigData-Notes
Spark local mode vs. cluster mode
https://blog.csdn.net/learn_tech/article/details/83654290
spark-shell --master spark://server01:7077 --total-executor-cores 3 --executor-memory 1g
--master spark://server01:7077: the machine on which the master process runs
--total-executor-cores 3: total number of CPU cores across all executors for this application
--executor-memory 1g: amount of memory given to each executor
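Once the shell is up, these settings can be verified from inside spark-shell; a minimal sketch (sc is the SparkContext the shell provides, and the property names are standard Spark configuration keys):
// Check which master we connected to and what resources were requested
sc.master                                           // e.g. spark://server01:7077
sc.getConf.get("spark.executor.memory", "not set")  // set by --executor-memory
sc.getConf.get("spark.cores.max", "not set")        // set by --total-executor-cores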
Common spark-shell commands
See the original article at https://www.jianshu.com/p/826c16298ca6 for more.
1. In the Ubuntu file manager, press Ctrl+L to display the path of the currently open folder; this is a handy way to find a file's path.
2. Load local Ubuntu files and work with them:
val userRDD=sc.textFile("file:///home/fgq/Downloads/u.user");
val movieRDD=sc.textFile("file:///home/fgq/Downloads/u.item");
val ratingRDD=sc.textFile("file:///home/fgq/Downloads/u.data");
userRDD.first();
userRDD.count();
userRDD.take(1);
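A hedged sketch of going one step further, assuming u.user is the MovieLens 100k user file whose fields are separated by "|" (user id|age|gender|occupation|zip code):
// Split each line into fields and count distinct occupations
val userFields = userRDD.map(line => line.split('|'));
userFields.map(fields => fields(3)).distinct().count();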
Split on the "\t" delimiter and return the fields at the specified index positions
// Create an RDD from ratingRDD that only contains the two columns of interest, i.e. movie_id, rating.
val rdd_movid_rating=ratingRDD.map(x=>(x.split("\t")(1),x.split("\t")(2)));
Split on the "|" delimiter and return the fields at the specified index positions
// Create an RDD from movieRDD that only contains the two columns of interest, i.e. movie_id, title.
val rdd_movid_title=movieRDD.map(x=>(x.split('|')(0),x.split('|')(1)));
Using leftOuterJoin
// Merge these two pair RDDs on movie_id. For this we use the leftOuterJoin() transformation. See the transformations documentation.
val rdd_movid_title_rating=rdd_movid_rating.leftOuterJoin(rdd_movid_title);
A sample element looks like Array((736,(4,Some(Shadowlands (1993))))); the value inside the Some is reached via t._2._2
// Use the RDD from the previous step to create a (movie, 1) tuple pair RDD
val rdd_title_rating=rdd_movid_title_rating.map(t=>(t._2._2,1));
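Because leftOuterJoin may find no match, t._2._2 is an Option[String]; an alternative sketch that unwraps it explicitly (the "unknown title" fallback is just an illustrative placeholder):
// Unwrap the Option with getOrElse so the keys are plain titles rather than Some(...)/None
val rdd_title_rating_plain = rdd_movid_title_rating.map(t => (t._2._2.getOrElse("unknown title"), 1));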
// Use the reduceByKey transformation to reduce by movie title
val rdd_title_ratingcnt=rdd_title_rating.reduceByKey((x,y)=>x+y);
// Get the final answer using the top action (takeOrdered would return the ascending order instead)
val finalResultRDD=rdd_title_ratingcnt.map(x=>(x._2,x._1));
finalResultRDD.top(25);
takeOrdered and top are opposites:
top sorts the RDD's elements in descending order and returns the first N;
takeOrdered sorts them in ascending order and returns the first N.
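A quick illustration on a small RDD:
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3));
nums.top(2);          // Array(5, 4)  -- largest elements, descending
nums.takeOrdered(2);  // Array(1, 2)  -- smallest elements, ascending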
In Spark, two kinds of operations can be performed on an RDD: transformations and actions (a short sketch contrasting the two follows this list).
1. Transformations: create a new dataset from an existing RDD
Examples: map, filter, distinct, flatMap, reduceByKey, groupByKey
2. Actions: mechanisms for getting results out of Spark
Examples: collect, reduce, take, takeOrdered
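A minimal sketch of the difference: transformations are lazy and only describe the computation, while the action at the end triggers it.
val words = sc.parallelize(Seq("spark", "rdd", "spark", "action"));
val counts = words.map(w => (w, 1)).reduceByKey(_ + _);  // transformations: nothing executes yet
counts.collect();                                        // action: the job actually runs here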
Creating & working with Spark DataFrames: https://www.jianshu.com/p/009126dec52f
val ratings=spark.read.format("csv").load("file:///home/fgq/Downloads/u.data");
or
val ratings= spark.read.option("header","false").option("inferSchema","false").csv("file:///home/fgq/Downloads/u.data");
To treat the first row as the header and let Spark infer the column types:
val ratings= spark.read.option("header","true").option("inferSchema","true").csv("file:///home/fgq/Downloads/u.data");
More ways to create & work with DataFrames
1.val dfUsers = spark.read.format("csv").option("header", "true").load("file:///root/data/user.csv")
2.scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import spark.implicits._
import spark.implicits._
// Read the file and convert it into an RDD[Row]
scala> val uRdd = spark.sparkContext.textFile("file:///root/data/user.csv")
.map(x => x.split(","))
.mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
.map(Row.fromSeq(_))
uRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[26] at map at <console>:30
// Define the schema
scala> val schema = StructType(Array(StructField("user_id", StringType, true),
StructField("locale", StringType, true),StructField("birthyear", StringType, true),
StructField("gender",StringType, true), StructField("joinedAt", StringType, true),
StructField("location", StringType, true), StructField("timezone", StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(user_id,StringType,true), StructField(locale,StringType,true), StructField(birthyear,StringType,true), StructField(gender,StringType,true), StructField(joinedAt,StringType,true), StructField(location,StringType,true), StructField(timezone,StringType,true))
// Create the DataFrame
scala> val dfUsers = spark.createDataFrame(uRdd, schema)
dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]
scala> dfUsers.printSchema
// root
// |-- user_id: string (nullable = true)
// |-- locale: string (nullable = true)
// |-- birthyear: string (nullable = true)
// |-- gender: string (nullable = true)
// |-- joinedAt: string (nullable = true)
// |-- location: string (nullable = true)
// |-- timezone: string (nullable = true)
scala> dfUsers show 3
Note: the first line of the file contains the column names, so mapPartitionsWithIndex() is used to drop it.
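An alternative sketch for dropping the header row: compare each line with the first record instead of using the partition index (this assumes no data line is identical to the header).
val rawRdd = spark.sparkContext.textFile("file:///root/data/user.csv")
val headerLine = rawRdd.first()
val dataRdd = rawRdd.filter(line => line != headerLine).map(_.split(","))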
3.scala> val dfUsers = spark.sparkContext.textFile("file:///root/data/users.csv")
.map(_.split(","))
.mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
.map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6)))
.toDF("user_id", "locale", "birthyear", "gender", "joinedAt", "location", "timezone")
dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]
scala> dfUsers show 3
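A further variant, sketched under the assumption of the same seven string columns: map each row to a case class and let the previously imported spark.implicits._ build the DataFrame and its schema.
case class User(user_id: String, locale: String, birthyear: String, gender: String,
                joinedAt: String, location: String, timezone: String)
val dfUsersCC = spark.sparkContext.textFile("file:///root/data/users.csv")
  .map(_.split(","))
  .mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)  // drop the header line
  .map(x => User(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))
  .toDF()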
Practical commands for Spark and Hive managed by Cloudera Manager (CM)
1. Problem: df.createOrReplaceTempView("dfTable") fails with warnings/errors such as:
20/08/17 16:12:07 WARN hive.metastore: Failed to connect to the MetaStore Server...
20/08/17 16:12:08 WARN hive.metastore: Failed to connect to the MetaStore Server...
20/08/17 16:12:09 WARN hive.metastore: Failed to connect to the MetaStore Server...
20/08/17 16:12:10 WARN metadata.Hive: Failed to register all functions.
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
or
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Fix: the root cause is that the Hive metastore database has not been initialized. Following the steps in BigData-Notes/notes/installation/Linux环境下Hive的安装部署.md, manually configure the MySQL settings in hive-site.xml, then proceed as follows:
1. Locate the schematool command-line tool and change into its directory:
If CDH was installed from parcels, schematool is at: /opt/cloudera/parcels/CDH/lib/hive/bin/schematool
If CDH was installed from packages, schematool is usually at: /usr/lib/hive/bin/schematool
2. When re-initializing Hive (with the MySQL connection settings already configured), the following error may appear:
Schema initialization FAILED! Metastore state would be inconsistent !!
When this error occurs, resolve it as follows:
Delete the metastore_db directory (it sits under the root path, or its location can be found in CM's Hive configuration).
Drop the corresponding database (hive_new) in MySQL:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node1:3306/hive_new?createDatabaseIfNotExist=true</value>
</property>
If the database name configured in the config file is hive_new, dropping that database is enough.
3. After the steps above, re-run the initialization:
schematool -dbType mysql -initSchema
4. Finally, restart the Hive service in CM.
5. Then log in to MySQL and check the hive_new database: run use hive_new until the database switch succeeds (this may take a little while).
Then open the Hive CLI and run show databases until it returns results (you may need to open Hue as well).
At that point the problem is resolved.
2. If the environment is managed only by YARN under CM, then Spark can only read file paths on HDFS; local paths will not be found.
val df=spark.read.format("json").load("hdfs://nameservice1/user/spark/2015-summary.json")
// DataFrame query
df.select("DEST_COUNTRY_NAME").show(2);
df.createOrReplaceTempView("dfTable")
// Spark SQL query
spark.sql("SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME FROM dfTable LIMIT 2").show()
3. Path of the Hive configuration files on a cluster managed by Cloudera Manager:
/etc/hive/conf