Creating RDDs
1. From an in-memory collection (not a string: parallelize takes a local Seq)
sc.parallelize(xxx)
Example: val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))
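The Seq above holds three-field tuples, so its shape can be checked locally without any Spark dependency; a minimal sketch (the field names in the comments are just an illustration, not from the original):

```scala
// Each element is an (Int, Array[String], Int) tuple; tuple fields
// are accessed positionally with _1, _2, _3.
val data = Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6))
val first = data.head
assert(first._1 == 1)          // first Int field
assert(first._2.head == "1.0") // the Array[String] field
assert(first._3 == 3)          // last Int field
```

sc.parallelize simply distributes such a local collection across the cluster as an RDD of the same element type.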
2. From a file
Reading a text file (Row requires import org.apache.spark.sql.Row):
val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))
Creating DataFrames
1. From an in-memory collection (toDF needs import spark.implicits._)
val test_df = Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)).toDF("imei", "feature", "id")
2. From an RDD
rdd.toDF(xxx)
Example:
import spark.implicits._
val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))
val testDF = testrdd.toDF("id", "score", "imei")
3. From a file
(1) Parquet files: val parquetFileDF = spark.read.parquet(HDFS_PATH)
(2) Text files: first create an RDD from the file, then convert that RDD to a DataFrame. Note the lines must be mapped to tuples, not Row: there is no implicit encoder for RDD[Row], so .toDF would not compile on it.
import spark.implicits._
val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => (attributes(0).trim, attributes(1).trim))
val cityDF = citylevel.toDF("cityid", "citylevel")
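The split-and-trim parsing in this pipeline is plain Scala, so it can be sanity-checked on a local Seq without a cluster; a sketch with made-up sample lines standing in for the file contents:

```scala
// Simulate textFile -> split -> trim on a local collection.
// The input lines here are hypothetical CSV rows, not from the original.
val lines = Seq("bj, 1", "sh , 2")   // stand-in for sc.textFile(HDFS_PATH)
val rows = lines
  .map(_.split(","))                                        // each line -> Array of fields
  .map(attributes => (attributes(0).trim, attributes(1).trim)) // strip stray whitespace
assert(rows == Seq(("bj", "1"), ("sh", "2")))
```

On a real SparkSession the same two map steps run on the RDD, and toDF then names the two tuple fields as columns.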