For more Spark learning example code, see: https://github.com/xubo245/SparkLearning
1. Description:
Compress Avro data with the deflate codec at different compression levels.
2. Code:
/**
 * @author xubo
 * @time 20160502
 * ref https://github.com/databricks/spark-avro
 */
package org.apache.spark.avro.learning

import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

/**
 * Write Avro output with a chosen deflate compression level
 */
object AvroCompression {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AvroCompression").setMaster("local")
    val sc = new SparkContext(conf)
    // import needed for the .avro method to be added
    import com.databricks.spark.avro._
    val sqlContext = new SQLContext(sc)

    // Select the deflate codec and set its level; change "5" to compare levels
    sqlContext.setConf("spark.sql.avro.compression.codec", "deflate")
    sqlContext.setConf("spark.sql.avro.deflate.level", "5")

    // The Avro records get converted to Spark types, filtered, and
    // then written back out as Avro records
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("file/data/avro/input/episodes.avro")
    df.show()

    // Timestamp suffix so that every run writes to a fresh output directory
    val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
    val filtered = df.filter("doctor > 5")
    filtered.write
      .format("com.databricks.spark.avro")
      .save("file/data/avro/output/episodes/AvroCompression" + iString)
    filtered.show()

    sc.stop()
  }
}
3. Results:
+--------------------+----------------+------+
|               title|        air_date|doctor|
+--------------------+----------------+------+
|   The Eleventh Hour|    3 April 2010|    11|
|   The Doctor's Wife|     14 May 2011|    11|
| Horror of Fang Rock|3 September 1977|     4|
|  An Unearthly Child|23 November 1963|     1|
|The Mysterious Pl...|6 September 1986|     6|
|                Rose|   26 March 2005|     9|
|The Power of the ...| 5 November 1966|     2|
|          Castrolava|  4 January 1982|     5|
+--------------------+----------------+------+
+--------------------+----------------+------+
|               title|        air_date|doctor|
+--------------------+----------------+------+
|   The Eleventh Hour|    3 April 2010|    11|
|   The Doctor's Wife|     14 May 2011|    11|
|The Mysterious Pl...|6 September 1986|     6|
|                Rose|   26 March 2005|     9|
+--------------------+----------------+------+
4. Stored file:
Objavro.codecdeflateavro.schema{"type":"record","name":"topLevelRecord","fields":[{"name":"title","type":["string","null"]},{"name":"air_date","type":["string","null"]},{"name":"doctor","type":["int","null"]}]}
File size with deflate level 5: 379 bytes
File size when the default compression level is left unchanged: 391 bytes
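The effect of the deflate level on output size can also be explored outside Spark. The following is only a minimal sketch, assuming plain java.util.zip.Deflater in raw (nowrap) mode is a fair stand-in for the Avro deflate codec. The object name DeflateLevelSketch, the helper deflatedSize, and the sample data are hypothetical and are not part of spark-avro or of the program above.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.Deflater

// Hypothetical helper object for illustration only; not part of spark-avro.
object DeflateLevelSketch {

  /** Compresses `input` with raw deflate at `level` and returns the compressed size in bytes. */
  def deflatedSize(input: Array[Byte], level: Int): Int = {
    // nowrap = true produces raw deflate data without the zlib header
    // (assumption: this approximates what the Avro deflate codec does)
    val deflater = new Deflater(level, true)
    deflater.setInput(input)
    deflater.finish()
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](4096)
    // Drain the compressor until all input has been consumed and flushed
    while (!deflater.finished()) {
      val n = deflater.deflate(buffer)
      out.write(buffer, 0, n)
    }
    deflater.end()
    out.size()
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, not the real contents of episodes.avro
    val sample = ("The Eleventh Hour,3 April 2010,11\n" * 200).getBytes("UTF-8")
    println(s"raw size: ${sample.length} bytes")
    for (level <- Seq(Deflater.NO_COMPRESSION, 1, 5, 9)) {
      println(s"level $level -> ${deflatedSize(sample, level)} bytes")
    }
  }
}
```

On very small inputs such as the four filtered rows above, most of the file is header and schema, so different levels are likely to change the size by only a few bytes, which is in line with the 379 versus 391 bytes observed here.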