Environment
- CentOS 6.5
- CDH 5.15
- Spark 1
Input file contents (pipe-delimited):
$ cat test.txt
1|2|3|test
2|4|6|wwww
Converting to Parquet with PySpark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    # All four columns are read as nullable strings
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("num1", StringType(), True),
        StructField("num2", StringType(), True),
        StructField("string", StringType(), True),
    ])

    # Split each pipe-delimited line into a list of fields
    rdd = sc.textFile("/var/tmp/test.txt").map(lambda line: line.split("|"))

    df = sqlContext.createDataFrame(rdd, schema)
    df.write.parquet('/var/tmp/test.parq')
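The conversion hinges on the `line.split("|")` step, which must yield exactly four fields per line to match the schema. A plain-Python sketch of that mapping, using the sample lines from above (no Spark required):

```python
# Illustration of the line.split("|") step used in the PySpark job.
sample = ["1|2|3|test", "2|4|6|wwww"]

rows = [line.split("|") for line in sample]
print(rows)  # [['1', '2', '3', 'test'], ['2', '4', '6', 'wwww']]

# Caveat: split() keeps empty strings but does not pad missing fields,
# so a malformed line such as "1||3" yields 3 fields, not 4 -- such a
# row would not match the 4-column schema.
print("1||3".split("|"))  # ['1', '', '3']
```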
CDH ships the parquet-tools command for inspecting Parquet files:
parquet-tools cat sample.parq       # print all records
parquet-tools head -n 2 sample.parq # print the first 2 records
parquet-tools schema sample.parq    # show the file's schema
parquet-tools meta sample.parq      # row-group and column metadata
parquet-tools dump sample.parq      # low-level column/page dump
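When parquet-tools is not on the PATH, a quick sanity check is that every Parquet file begins and ends with the 4-byte magic `PAR1` (this comes from the Parquet format itself, not CDH). A minimal stdlib sketch; note that `df.write.parquet` produces a directory of part files, so this check applies to the individual part files inside it:

```python
def looks_like_parquet(path):
    """Check for the 4-byte PAR1 magic at both ends of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # whence=2: seek relative to end of file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```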