How do I convert a CSV file to an RDD?
Suppose the CSV file has this format:
user, topic, hits
om, scala, 120
daniel, spark, 80
3754978, spark, 1
We can use the first line to define a header class:
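// Maps each column name to its position so fields can be looked up by name.
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index: Map[String, Int] = header.zipWithIndex.toMap
  // Return the value of column `key` from an already-split row.
  def apply(array: Array[String], key: String): String = array(index(key))
}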
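To see what the header class does before Spark is involved, here is a minimal sketch that applies it to two lines of the sample file using plain Scala arrays. The `parseLine` helper is a hypothetical name introduced only for this illustration, and the class is repeated so the snippet runs on its own.

```scala
// Same header class as above, repeated so this snippet is self-contained.
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index: Map[String, Int] = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}

// Hypothetical helper: split a line on commas and trim each field.
def parseLine(line: String): Array[String] = line.split(",").map(_.trim)

val header = new SimpleCSVHeader(parseLine("user, topic, hits"))
val row = parseLine("om, scala, 120")

val user = header(row, "user")        // "om"
val hits = header(row, "hits").toInt  // 120
```

Looking fields up by name rather than by position keeps the later transformations readable and means they do not break if the column order in the file changes.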
Then we can use this header class to extract the data:
val csv = sc.textFile("file.csv") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // split each line into trimmed fields
val header = new SimpleCSVHeader(data.first()) // build the header from the first line
val rows = data.filter(line => header(line, "user") != "user") // drop the header row
val users = rows.map(row => header(row, "user"))
val usersByHits = rows.map(row => header(row, "user") -> header(row, "hits").toInt)
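The snippet above assumes an existing `sc`, as in spark-shell. As a rough sketch of how the same steps could run as a standalone local-mode program, the version below builds its own `SparkContext` and uses `sc.parallelize` on in-memory lines instead of `sc.textFile`, so no file is needed. The object name `CsvToRddSketch`, the method `run`, and the app name are assumptions made for this illustration, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Same header class as above, repeated so this snippet is self-contained.
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index: Map[String, Int] = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}

// Hypothetical wrapper object, not part of the original answer.
object CsvToRddSketch {
  // Parse CSV lines (header first) and return (user, hits) pairs.
  def run(lines: Seq[String]): Array[(String, Int)] = {
    val conf = new SparkConf().setAppName("csv-to-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val csv = sc.parallelize(lines)                // stands in for sc.textFile("file.csv")
      val data = csv.map(_.split(",").map(_.trim))   // split each line into trimmed fields
      val header = new SimpleCSVHeader(data.first()) // build the header from the first line
      val rows = data.filter(line => header(line, "user") != "user") // drop the header row
      rows.map(row => header(row, "user") -> header(row, "hits").toInt).collect()
    } finally {
      sc.stop() // always release the local Spark context
    }
  }
}

val sample = Seq("user, topic, hits", "om, scala, 120", "daniel, spark, 80", "3754978, spark, 1")
val usersByHits = CsvToRddSketch.run(sample)
```

Note that filtering on `header(line, "user") != "user"` would also drop any data row whose user is literally named "user", and that splitting on "," does not handle quoted fields containing commas, so this approach suits simple files like the sample.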