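When writing a Spark program, querying a particular field of a CSV file is usually done in one of the following ways:
**Method (1):** query the DataFrame directly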
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")
val selectedData = df.select("year", "model")
Reference: https://github.com/databricks/spark-csv
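The Spark 1.x snippet passes `customSchema` to `.schema(...)` without showing how it is defined. Below is a minimal sketch of what such a schema could look like, assuming `cars.csv` has `year`, `make` and `model` columns. The column names, their types and the wrapper object `CarsSchema` are assumptions for illustration and are not taken from the original code.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CarsSchema {
  // Assumed column layout for cars.csv; adjust names and types to the real file.
  val customSchema: StructType = StructType(Seq(
    StructField("year", IntegerType, nullable = true),
    StructField("make", StringType, nullable = true),
    StructField("model", StringType, nullable = true)
  ))
}
```

Supplying the schema explicitly means Spark does not have to guess the column types, and `df.select("year", "model")` then refers to columns declared here.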
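The code above that reads the CSV file is the Spark 1.x style. In Spark 2.x the entry point is a SparkSession rather than sqlContext, so it is written slightly differently: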
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
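Since Spark 2.0 a CSV data source is built into Spark, so the same read can also be written without the external spark-csv format name. The following is a sketch under stated assumptions: the object name `BuiltinCsvExample`, the helper `readPeople` and the `local[*]` master are hypothetical and chosen only to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BuiltinCsvExample {
  // Reads a CSV file with the built-in csv source (Spark 2.0+).
  // "header" treats the first line as column names;
  // "DROPMALFORMED" drops rows that cannot be parsed.
  def readPeople(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv(path)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying this out on one machine.
    val spark = SparkSession.builder()
      .appName("builtin-csv-example")
      .master("local[*]")
      .getOrCreate()

    val df = readPeople(spark, "people.csv").cache()
    df.printSchema()

    spark.stop()
  }
}
```

Without an explicit schema or the `inferSchema` option, every column is read as a string, so numeric fields may need a cast before they are compared.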
**Method (2):** build a case class
case class Person(name: String, age: Long)
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql(