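5. Loading and Saving Your Data
Engineers may want to discover more output formats that suit their intended downstream consumers. Data scientists can likely focus on the format their data is already in.
5.1 Motivation
Spark supports a wide range of input and output sources.
File formats and filesystems
File formats: text, JSON, SequenceFiles, protocol buffers
Filesystems: NFS, HDFS, S3
Structured data sources through Spark SQL
Databases and key/value stores
5.2 File Formats
Formats range from unstructured, like text, to semistructured, like JSON, to structured, like SequenceFiles.
5.3 Text Files
Loading and saving text files in Spark is very simple. When we load a text file as an RDD, each input line becomes one element of the RDD.
We can also load multiple whole text files at the same time into a pair RDD, where the key is the filename and the value is the contents of each file.
Loading text files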
Example 5-1. Loading a text file in Python
input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-2. Loading a text file in Scala
val input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-3. Loading a text file in Java
JavaRDD<String> input = sc.textFile("file:///home/holden/repos/spark/README.md");
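wholeTextFiles() can be very useful when each file represents a certain time period's data. If we had files representing sales data from different periods, we could easily compute the average for each period.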
Example 5-4. Average value per file in Scala
val input = sc.wholeTextFiles("file:///home/holden/salesFiles")
val result = input.mapValues{y =>
  val nums = y.split(" ").map(x => x.toDouble)
  nums.sum / nums.size.toDouble
}
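The per-file average can be sketched without Spark to show what happens to each (filename, contents) pair. This is only an illustration in plain Python: the file names and numbers are hypothetical, and unlike the Scala example it splits on any whitespace rather than on a single space.

```python
def average_per_file(files):
    """files: iterable of (filename, contents) pairs, the shape wholeTextFiles() produces."""
    result = {}
    for name, contents in files:
        # Each file holds whitespace-separated numbers.
        nums = [float(x) for x in contents.split()]
        result[name] = sum(nums) / len(nums)
    return result


# Hypothetical file names and contents, for illustration only.
sales_files = [
    ("salesFiles/period-1.txt", "10 20 30"),
    ("salesFiles/period-2.txt", "5 15"),
]
averages = average_per_file(sales_files)
```

The key stays untouched and only the value is transformed, which is the role mapValues plays in the Scala version.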
Saving text files
Example 5-5. Saving as a text file in Python
result.saveAsTextFile(outputFile)
5.4 JSON
Example 5-6. Loading unstructured JSON in Python
import json
data = input.map(lambda x: json.loads(x))
Example 5-7. Loading JSON in Scala
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
...
case class Person(name: String, lovesPandas: Boolean) // Must be a top-level class
...
// Parse it into a specific case class. We use flatMap to handle errors
// by returning an empty list (None) if we encounter an issue and a
// list with one element if everything is ok (Some(_)).
val result = input.flatMap(record => {
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception => None
  }})
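The skip-on-error pattern described in the comments above can be sketched in plain Python, assuming one JSON object per line. The sample records and the function name parse_records are hypothetical; a generator like this takes an iterator of lines, so it has the shape that a per-partition function would need.

```python
import json


def parse_records(lines):
    """Yield one dict per well-formed JSON line and skip malformed lines."""
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue


# Hypothetical input lines, for illustration only.
lines = [
    '{"name": "Sparky The Bear", "lovesPandas": true}',
    'this line is not JSON',
    '{"name": "Holden", "lovesPandas": false}',
]
people = list(parse_records(lines))
# Keep only panda lovers and serialize them back to one JSON string each.
panda_lovers = [json.dumps(p) for p in people if p["lovesPandas"]]
```

Silently dropping bad records is convenient for small jobs, but it hides data problems, so counting the skipped lines is often worth the extra code.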
Example 5-8. Loading JSON in Java
class ParseJson implements FlatMapFunction<Iterator<String>, Person> {
  public Iterable<Person> call(Iterator<String> lines) throws Exception {
    ArrayList<Person> people = new ArrayList<Person>();
    ObjectMapper mapper = new ObjectMapper();
    while (lines.hasNext()) {
      String line = lines.next();
      try {
        people.add(mapper.readValue(line, Person.class));
      } catch (Exception e) {
        // skip records on failure
      }
    }
    return people;
  }
}
JavaRDD<String> input = sc.textFile("file.json");
JavaRDD<Person> result = input.mapPartitions(new ParseJson());
Saving JSON
Example 5-9. Saving JSON in Python
(data.filter(lambda x: x['lovesPandas']).map(lambda x: json.dumps(x))
.saveAsTextFile(outputFile))
Example 5-10. Saving JSON in Scala
result.filter(p => p.lovesPandas).map(mapper.writeValueAsString(_))
  .saveAsTextFile(outputFile)
Example 5-11. Saving JSON in Java
class WriteJson implements FlatMapFunction<Iterator<Person>, String> {
  public Iterable<String> call(Iterator<Person> people) throws Exception {
    ArrayList<String> text = new ArrayList<String>();
    ObjectMapper mapper = new ObjectMapper();
    while (people.hasNext()) {
      Person person = people.next();
      text.add(mapper.writeValueAsString(person));
    }
    return text;
  }
}
JavaRDD<Person> result = input.mapPartitions(new ParseJson()).filter(
  new LikesPandas());
JavaRDD<String> formatted = result.mapPartitions(new WriteJson());
formatted.saveAsTextFile(outfile);
5.5 Comma-Separated Values and Tab-Separated Values
Loading CSV
Example 5-12. Loading CSV with textFile() in Python
import csv
import StringIO
...
def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader.next()
input = sc.textFile(inputFile).map(loadRecord)
Example 5-13. Loading CSV with textFile() in Scala
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader
...
val input = sc.textFile(inputFile)
val result = input.map{ line =>
  val reader = new CSVReader(new StringReader(line));
  reader.readNext();
}
Example 5-14. Loading CSV with textFile() in Java
import au.com.bytecode.opencsv.CSVReader;
import java.io.StringReader;
...
public static class ParseLine implements Function<String, String[]> {
  public String[] call(String line) throws Exception {
    CSVReader reader = new CSVReader(new StringReader(line));
    return reader.readNext();
  }
}
JavaRDD<String> csvFile1 = sc.textFile(inputFile);
JavaRDD<String[]> csvData = csvFile1.map(new ParseLine());
Example 5-15. Loading CSV in full in Python
def loadRecords(fileNameContents):
    """Load all the records in a given file"""
    input = StringIO.StringIO(fileNameContents[1])
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return reader
fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)
Example 5-16. Loading CSV in full in Scala
case class Person(name: String, favoriteAnimal: String)
val input = sc.wholeTextFiles(inputFile)
val result = input.flatMap{ case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt));
  reader.readAll().map(x => Person(x(0), x(1)))
}
Example 5-17. Loading CSV in full in Java
public static class ParseLine
  implements FlatMapFunction<Tuple2<String, String>, String[]> {
  public Iterable<String[]> call(Tuple2<String, String> file) throws Exception {
    CSVReader reader = new CSVReader(new StringReader(file._2()));
    return reader.readAll();
  }
}
JavaPairRDD<String, String> csvData = sc.wholeTextFiles(inputFile);
JavaRDD<String[]> keyedRDD = csvData.flatMap(new ParseLine());
Saving CSV
Example 5-18. Writing CSV in Python
def writeRecords(records):
    """Write out CSV lines"""
    output = StringIO.StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "favoriteAnimal"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]
pandaLovers.mapPartitions(writeRecords).saveAsTextFile(outputFile)
Example 5-19. Writing CSV in Scala
pandaLovers.map(person => List(person.name, person.favoriteAnimal).toArray)
  .mapPartitions{people =>
    val stringWriter = new StringWriter();
    val csvWriter = new CSVWriter(stringWriter);
    csvWriter.writeAll(people.toList)
    Iterator(stringWriter.toString)
  }.saveAsTextFile(outFile)
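The Python listings in this section use Python 2's StringIO module and reader.next(). A minimal Python 3 sketch of the same per-line parse and per-partition write, runnable without Spark, might look like the following. The function names and the sample rows are hypothetical.

```python
import csv
import io

FIELDS = ["name", "favoriteAnimal"]


def load_record(line):
    """Parse one CSV line into a dict keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDS)
    return next(reader)


def write_records(records):
    """Write all records of one partition into a single CSV string."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=FIELDS)
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]


# Hypothetical input lines; the second has a quoted field containing a comma.
rows = [load_record(line) for line in ["Holden,panda", '"Sparky, Jr.",bear']]
text = write_records(rows)[0]
```

Parsing line by line only works when no field contains an embedded newline, which is why the section also shows the whole-file approach.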
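5.6 SequenceFiles
SequenceFiles have sync markers that allow Spark to seek to a point in the file and then resynchronize with the record boundaries. This allows Spark to read SequenceFiles in parallel from multiple nodes. SequenceFiles are a common input/output format for Hadoop MapReduce jobs.
SequenceFiles consist of elements that implement Hadoop's Writable interface, as Hadoop uses a custom serialization framework.
Loading SequenceFiles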
Example 5-20. Loading a SequenceFile in Python
data = sc.sequenceFile(inFile,
    "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
Example 5-21. Loading a SequenceFile in Scala
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map{case (x, y) => (x.toString, y.get())}
Example 5-22. Loading a SequenceFile in Java
public static class ConvertToNativeTypes implements
  PairFunction<Tuple2<Text, IntWritable>, String, Integer> {
  public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> record) {
    return new Tuple2(record._1.toString(), record._2.get());
  }
}
JavaPairRDD<Text, IntWritable> input = sc.sequenceFile(fileName, Text.class,
  IntWritable.class);
JavaPairRDD<String, Integer> result = input.mapToPair(
  new ConvertToNativeTypes());
Saving SequenceFiles
Example 5-23. Saving a SequenceFile in Scala
val data = sc.parallelize(List(("Panda", 3), ("Kay", 6), ("Snail", 2)))
data.saveAsSequenceFile(outputFile)
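5.7 Object Files
5.8 Hadoop Input and Output Formats
Loading with other Hadoop input formats
To read a file using the new Hadoop API, we need to tell Spark a few things: a path and three classes. The first class is the format class, which represents our input format. A similar function, hadoopFile(), exists for Hadoop input formats implemented with the older API. The second class is the class of the key, and the third is the class of the value.
One of the simplest Hadoop input formats is KeyValueTextInputFormat, which can be used for reading key/value data from text files.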
Example 5-24. Loading KeyValueTextInputFormat() with old-style API in Scala
val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map{
  case (x, y) => (x.toString, y.toString)
}
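To make the key/value split concrete, here is a plain-Python sketch of how a line of text can be turned into a (key, value) pair. It assumes the separator is a tab and that a line without a separator becomes a key with an empty value; the function name and sample lines are hypothetical, and the sketch does not call Hadoop.

```python
def split_key_value(line, separator="\t"):
    """Split a line at the first separator into (key, value)."""
    key, _, value = line.partition(separator)
    return (key, value)


# Hypothetical input lines, for illustration only.
pairs = [split_key_value(line) for line in ["panda\t3", "snail\ta\tb", "lonely"]]
```

Only the first separator matters, so any later tabs stay inside the value.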
Example 5-25. Loading LZO-compressed JSON with Elephant Bird in Scala
val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
  classOf[LongWritable], classOf[MapWritable], conf)
// Each MapWritable in "input" represents a JSON object
Saving with Hadoop output formats
Example 5-26. Saving a SequenceFile in Java
public static class ConvertToWritableTypes implements
  PairFunction<Tuple2<String, Integer>, Text, IntWritable> {
  public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> record) {
    return new Tuple2(new Text(record._1), new IntWritable(record._2));
  }
}
JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(input);
JavaPairRDD<Text, IntWritable> result = rdd.mapToPair(new ConvertToWritableTypes());
result.saveAsHadoopFile(fileName, Text.class, IntWritable.class,
  SequenceFileOutputFormat.class);
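5.9 Non-Filesystem Data Sources
5.10 Example: Protocol Buffers
Protocol buffers were first developed at Google for internal remote procedure calls (RPCs) and have since been open sourced. Protocol buffers (PBs) are structured data, with the fields and the types of fields clearly defined. They are optimized to be fast to encode and decode while taking up little space. Compared to XML, PBs are 3 to 10 times smaller and can be 20 to 100 times faster to encode and decode.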
Example 5-27. Sample protocol buffer definition
message Venue {
  required int32 id = 1;
  required string name = 2;
  required VenueType type = 3;
  optional string address = 4;
  enum VenueType {
    COFFEESHOP = 0;
    WORKPLACE = 1;
    CLUB = 2;
    OMNOMNOM = 3;
    OTHER = 4;
  }
}
message VenueResponse {
  repeated Venue results = 1;
}
Example 5-28. Elephant Bird protocol buffer writeout in Scala
val job = new Job()
val conf = job.getConfiguration
LzoProtobufBlockOutputFormat.setClassConf(classOf[Places.Venue], conf);
val dnaLounge = Places.Venue.newBuilder()
dnaLounge.setId(1);
dnaLounge.setName("DNA Lounge")
dnaLounge.setType(Places.Venue.VenueType.CLUB)
val data = sc.parallelize(List(dnaLounge.build()))
val outputData = data.map{ pb =>
  val protoWritable = ProtobufWritable.newInstance(classOf[Places.Venue]);
  protoWritable.set(pb)
  (null, protoWritable)
}
outputData.saveAsNewAPIHadoopFile(outputFile, classOf[Text],
  classOf[ProtobufWritable[Places.Venue]],
  classOf[LzoProtobufBlockOutputFormat[ProtobufWritable[Places.Venue]]], conf)
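5.11 File Compression
Frequently when working with Big Data, we find ourselves needing to use compressed data to save storage space and network overhead. With most Hadoop output formats, we can specify a compression codec that will compress the data. As we have already seen, Spark's native input formats (textFile and sequenceFile) can automatically handle some types of compression for us. When you read compressed data, some compression codecs can be used to guess the compression type automatically.
These compression options apply only to the Hadoop formats that support compression, namely those that are written out to a filesystem. The database Hadoop formats generally do not implement support for compression, or if they have compressed records, that is configured in the database itself. Choosing an output compression codec can have a big impact on future users of the data. With distributed systems such as Spark, we normally try to read our data in from multiple different machines. To make this possible, each worker needs to be able to find the start of a new record. Some compression formats make this impossible, which requires a single node to read in all of the data and can easily lead to a bottleneck. Formats that can be easily read from multiple machines are called "splittable."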
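The difference between splittable and non-splittable data can be illustrated in plain Python. The sketch below is an assumption-laden toy, not how Spark or Hadoop is implemented: lines_in_split is a hypothetical helper that assigns each newline-terminated record to the byte range in which its first byte falls, and the gzip part shows that a reader starting in the middle of a gzip stream cannot recover any records.

```python
import gzip
import zlib


def lines_in_split(data, start, end):
    """Return the records whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Skip forward to the first record that begins at or after start.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])
        pos = newline + 1
    return records


# Uncompressed text: two workers can each take half of the bytes.
data = b"".join(b"record-%d\n" % i for i in range(1000))
cut = len(data) // 2
first_half = lines_in_split(data, 0, cut)
second_half = lines_in_split(data, cut, len(data))

# Gzip: the second half of the compressed bytes is unreadable on its own.
compressed = gzip.compress(data)
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    gzip_tail_readable = True
except (OSError, EOFError, zlib.error):
    gzip_tail_readable = False
```

With plain text, every record is read exactly once no matter where the cut falls. With gzip, the worker holding the second half has nothing to start from, so one reader must process the whole file.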
5.12 Filesystems
Local/"Regular" FS
Example 5-29. Loading a compressed text file from the local filesystem in Scala
val rdd = sc.textFile("file:///home/holden/happypandas.gz")
Amazon S3
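To access S3 in Spark, you need to set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. Paths to S3 start with s3n://.
HDFS
Spark and HDFS can be collocated on the same machines, and Spark can take advantage of this data locality to avoid network overhead. Paths take the form hdfs://master:port/path.
Note: The HDFS protocol changes across Hadoop versions, so watch for version compatibility. By default Spark is built against Hadoop 1.0.4.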
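The three path styles in this section differ only in their URI scheme and authority. The following plain-Python sketch uses the standard library's URL parser to show how such paths break down; the port number 9000, the bucket name, and the object path are hypothetical. It also shows why a local path needs three slashes: with only two, the first path segment is read as a host name.

```python
from urllib.parse import urlparse


def describe(uri):
    """Return (scheme, authority, path) for a filesystem URI."""
    parts = urlparse(uri)
    return (parts.scheme, parts.netloc, parts.path)


local = describe("file:///home/holden/happypandas.gz")
two_slashes = describe("file://home/holden/happypandas.gz")
hdfs = describe("hdfs://master:9000/path")           # hypothetical port
s3 = describe("s3n://bucket/my-files/part-00000")    # hypothetical bucket and key
```

Here two_slashes ends up with the authority "home" and the path "/holden/happypandas.gz", which is rarely what was intended.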
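5.13 Structured Data with Spark SQL
Spark SQL is a component added in Spark 1.0 that is quickly becoming Spark's preferred way to work with structured and semistructured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records.
Apache Hive
One commonly used structured data source is Apache Hive. To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's conf directory. Once you have done this, you create a HiveContext object, which is the entry point to Spark SQL, and you can write Hive Query Language (HQL) queries against your tables to get data back as RDDs of rows.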
Example 5-30. Creating a HiveContext and selecting data in Python
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print firstRow.name
Example 5-31. Creating a HiveContext and selecting data in Scala
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // Field 0 is the name
Example 5-32. Creating a HiveContext and selecting data in Java
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT name, age FROM users");
Row firstRow = rows.first();
System.out.println(firstRow.getString(0)); // Field 0 is the name
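JSON
If you have JSON data with a consistent schema across records, Spark SQL can infer the schema and load this data as rows, making it very simple to pull out the fields you need. To load JSON data, first create a HiveContext. Then call the HiveContext.jsonFile method to get an RDD of Row objects for the whole file. Apart from using the whole Row object, you can also register this RDD as a table and select specific fields from it.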
Example 5-33. Sample tweets in JSON
{"user": {"name": "Holden", "location": "San Francisco"}, "text": "Nice day out today"}
{"user": {"name": "Matei", "location": "Berkeley"}, "text": "Even nicer here :)"}
Example 5-34. JSON loading with Spark SQL in Python
tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
results = hiveCtx.sql("SELECT user.name, text FROM tweets")
Example 5-35. JSON loading with Spark SQL in Scala
val tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
val results = hiveCtx.sql("SELECT user.name, text FROM tweets")
Example 5-36. JSON loading with Spark SQL in Java
SchemaRDD tweets = hiveCtx.jsonFile(jsonFile);
tweets.registerTempTable("tweets");
SchemaRDD results = hiveCtx.sql("SELECT user.name, text FROM tweets");
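To see what the query SELECT user.name, text picks out of the sample tweets in Example 5-33, the same projection can be sketched in plain Python with the json module. This is only an illustration of the nested field access, not of how Spark SQL evaluates the query; the function name is hypothetical.

```python
import json

# The two sample tweets from Example 5-33, one JSON object per line.
tweet_lines = [
    '{"user": {"name": "Holden", "location": "San Francisco"}, "text": "Nice day out today"}',
    '{"user": {"name": "Matei", "location": "Berkeley"}, "text": "Even nicer here :)"}',
]


def select_name_and_text(lines):
    """Return (user.name, text) for each JSON record."""
    rows = []
    for line in lines:
        tweet = json.loads(line)
        rows.append((tweet["user"]["name"], tweet["text"]))
    return rows


results = select_name_and_text(tweet_lines)
```

The dotted name user.name in the query corresponds to the nested lookup tweet["user"]["name"] here.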
5.14 Databases
For JDBC sources (including MySQL, Postgres, and others), we need to construct an org.apache.spark.rdd.JdbcRDD.
Example 5-37. JdbcRDD in Scala
def createConnection() = {
  Class.forName("com.mysql.jdbc.Driver").newInstance();
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden");
}
def extractValues(r: ResultSet) = {
  (r.getInt(1), r.getString(2))
}
val data = new JdbcRDD(sc,
  createConnection, "SELECT * FROM panda WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 3, numPartitions = 2, mapRow = extractValues)
println(data.collect().toList)
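JdbcRDD parameters
1. We provide a function to establish a connection to our database. This lets each node create its own connection to load data over, after performing any configuration required to connect.
2. We provide a query that can read a range of the data, as well as lowerBound and upperBound values for the parameters of this query. These parameters allow Spark to query different ranges of the data on different machines, so we don't get bottlenecked trying to load all the data on a single node.
3. The last parameter is a function that converts each row of output from a java.sql.ResultSet into a format that is convenient for manipulating our data.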
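The three parameters above can be sketched end to end in plain Python. This is a toy under stated assumptions, not JdbcRDD itself: an in-memory SQLite database stands in for MySQL, the panda rows are made up, and partition_bounds shows one way an inclusive id range could be divided among partitions (Spark's own division may differ in detail).

```python
import sqlite3


def partition_bounds(lower, upper, num_partitions):
    """Divide the inclusive range [lower, upper] into contiguous sub-ranges."""
    length = upper - lower + 1
    return [
        (lower + (i * length) // num_partitions,
         lower + ((i + 1) * length) // num_partitions - 1)
        for i in range(num_partitions)
    ]


def create_connection():
    """Parameter 1: each partition opens its own connection."""
    conn = sqlite3.connect(":memory:")
    # An in-memory database is empty, so this stand-in re-creates the table.
    conn.execute("CREATE TABLE panda (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO panda VALUES (?, ?)",
                     [(1, "Holden"), (2, "Sparky"), (3, "Matei")])
    return conn


def extract_values(row):
    """Parameter 3: convert a result row into the tuple we want to work with."""
    return (row[0], row[1])


# Parameter 2: a query with two placeholders for the lower and upper bound.
QUERY = "SELECT id, name FROM panda WHERE ? <= id AND id <= ? ORDER BY id"


def load(lower, upper, num_partitions):
    partitions = []
    for bounds in partition_bounds(lower, upper, num_partitions):
        conn = create_connection()
        try:
            rows = conn.execute(QUERY, bounds).fetchall()
        finally:
            conn.close()
        partitions.append([extract_values(r) for r in rows])
    return partitions


partitions = load(1, 3, 2)
```

Each partition runs the same query with its own pair of bounds, so no single connection has to return the whole table.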
Cassandra
Example 5-38. sbt requirements for Cassandra connector
"com.datastax.spark" %% "spark-cassandra-connector" % "1.0.0-rc5",
"com.datastax.spark" %% "spark-cassandra-connector-java" % "1.0.0-rc5"
Example 5-39. Maven requirements for Cassandra connector
<dependency> <!-- Cassandra -->
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector</artifactId>
<version>1.0.0-rc5</version>
</dependency>
<dependency> <!-- Cassandra -->
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java</artifactId>
<version>1.0.0-rc5</version>
</dependency>
We set spark.cassandra.connection.host to point to our Cassandra cluster, and if we have a username and password, we can set them with spark.cassandra.auth.username and spark.cassandra.auth.password.
Example 5-40. Setting the Cassandra property in Scala
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "hostname")
val sc = new SparkContext(conf)
Example 5-41. Setting the Cassandra property in Java
SparkConf conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", cassandraHost);
JavaSparkContext sc = new JavaSparkContext(
  sparkMaster, "basicquerycassandra", conf);
Example 5-42. Loading the entire table as an RDD with key/value data in Scala
// Implicits that add functions to the SparkContext & RDDs.
import com.datastax.spark.connector._
// Read entire table as an RDD. Assumes your table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
val data = sc.cassandraTable("test" , "kv")
// Print some basic stats on the value field.
data.map(row => row.getInt("value")).stats()
Example 5-43. Loading the entire table as an RDD with key/value data in Java
import com.datastax.spark.connector.CassandraRow;
import static com.datastax.spark.connector.CassandraJavaUtil.javaFunctions;
// Read entire table as an RDD. Assumes your table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
JavaRDD<CassandraRow> data = javaFunctions(sc).cassandraTable("test" , "kv");
// Print some basic stats.
System.out.println(data.mapToDouble(new DoubleFunction<CassandraRow>() {
  public double call(CassandraRow row) { return row.getInt("value"); }
}).stats());
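In addition to loading an entire table, we can also query subsets of our data. We can restrict the data by adding a where() clause to the cassandraTable() call, for example, sc.cassandraTable().where("key=?", "panda").
The Cassandra connector supports saving to Cassandra from a variety of RDD types. We can directly save RDDs of CassandraRow objects, which is useful for copying data between tables.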
Example 5-44. Saving to Cassandra in Scala
val rdd = sc.parallelize(List(Seq("moremagic", 1)))
rdd.saveToCassandra("test" , "kv", SomeColumns("key", "value"))
HBase
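Spark can access HBase through its Hadoop input format, implemented in the org.apache.hadoop.hbase.mapreduce.TableInputFormat class. This input format returns key/value pairs, where the key is of type org.apache.hadoop.hbase.io.ImmutableBytesWritable and the value is of type org.apache.hadoop.hbase.client.Result.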
Example 5-45. Scala example of reading from HBase
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tablename") // which table to scan
val rdd = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
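To optimize reading from HBase, TableInputFormat includes multiple settings, such as limiting the scan to just one set of columns and limiting the time ranges scanned.
Elasticsearch
Spark can both read and write data from Elasticsearch using Elasticsearch-Hadoop. Elasticsearch is a new open source, Lucene-based search system.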
Example 5-46. Elasticsearch output in Scala
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat")
jobConf.setOutputCommitter(classOf[FileOutputCommitter])
jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweets")
jobConf.set(ConfigurationOptions.ES_NODES, "localhost")
FileOutputFormat.setOutputPath(jobConf, new Path("-"))
output.saveAsHadoopDataset(jobConf)
Example 5-47. Elasticsearch input in Scala
def mapWritableToInput(in: MapWritable): Map[String, String] = {
  in.map{case (k, v) => (k.toString, v.toString)}.toMap
}
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set(ConfigurationOptions.ES_RESOURCE_READ, args(1))
jobConf.set(ConfigurationOptions.ES_NODES, args(2))
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]], classOf[Object],
  classOf[MapWritable])
// Extract only the map
// Convert the MapWritable[Text, Text] to Map[String, String]
val tweets = currentTweets.map{ case (key, value) => mapWritableToInput(value) }