Querying arbitrary fields in Spark and outputting the result as a DataFrame

Latest recommended article published on 2024-07-31 15:54:40

WeChat official account 【禅与大数据】, subscriptions welcome



Article link: https://blog.csdn.net/cafebar123/article/details/78636605


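This article describes how to query fields flexibly when processing CSV files in Spark, particularly when the number and names of the fields are not known in advance. Using DataFrames together with StructField and StructType, it works through an example of how to build the query strategy, including creating a new array that records field positions, so that reads and writes need no hard-coded field names. The approach is useful when there are many fields and queries must be built dynamically.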

Summary generated automatically by CSDN.

When writing a Spark program, querying a particular field of a CSV file is usually written like this:
**Method (1):** query directly with a DataFrame

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema) // customSchema: a StructType that must be defined beforehand
    .load("cars.csv")
val selectedData = df.select("year", "model")

Reference: https://github.com/databricks/spark-csv
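
The snippet above passes a `customSchema` without defining it. A minimal sketch of how such a schema could be built with `StructType` and `StructField`, assuming every column is read as a nullable string. The helper name `buildStringSchema` and the column list are illustrative assumptions, not taken from the original post:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaSketch {
  // Hypothetical helper: build a schema in which every column is a nullable string.
  def buildStringSchema(fieldNames: Seq[String]): StructType =
    StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))

  // Assumed column names for cars.csv; adjust them to the real header.
  val customSchema: StructType =
    buildStringSchema(Seq("year", "make", "model", "comment"))
}
```

Because the schema is derived from a plain list of names, the same helper works when the field names are only known at run time, which is the situation this article is concerned with.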

The code above reads a CSV file the Spark 1.x way; the Spark 2.x version is slightly different:
val df = sparkSession.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("people.csv")
    .cache()
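
Spark 2.x also ships a built-in CSV data source, so the external package name is not strictly required there. A minimal sketch, assuming a local `SparkSession` and a hypothetical file `people.csv` with a header row; the object and method names are illustrative only:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvReadSketch {
  // Read a CSV file with the built-in Spark 2.x reader.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")         // first line holds the column names
      .option("mode", "DROPMALFORMED")  // drop rows that fail to parse
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-read-sketch")
      .master("local[*]")               // assumption: local run for illustration
      .getOrCreate()
    val df = readCsv(spark, "people.csv") // hypothetical path
    df.printSchema()
    spark.stop()
  }
}
```

Without an explicit schema or the `inferSchema` option, every column is read as a string.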

**Method (2):** build a case class.

case class Person(name: String, age: Long)
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql(