motivation
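- File formats and filesystems: text, JSON, SequenceFiles, and similar formats stored on filesystems such as NFS and HDFS.
- Structured data sources through Spark SQL: an API for structured data sources such as JSON and Hive.
- Databases and key-value stores: built-in and third-party libraries are used to connect to Cassandra, HBase, Elasticsearch, and JDBC databases.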
file format
hdfs://namenodehost/parent/child
hdfs:///parent/child
file:///parent/child
sc.textFile("hdfs://host:port_no/data/searches")
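The URI forms above follow the generic scheme://authority/path layout. As a rough illustration only (Hadoop parses URIs with its own Java code, not with Python), the sketch below uses the standard library's urlparse to show which part of each string is read as the host. The helper name split_uri is hypothetical.

```python
from urllib.parse import urlparse

def split_uri(uri):
    """Return (scheme, authority, path) for a URI string."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path

# With an explicit authority, the first segment after "//" is the host.
print(split_uri("hdfs://namenodehost/parent/child"))
# -> ('hdfs', 'namenodehost', '/parent/child')

# Three slashes mean an empty authority, so the whole rest is the path.
print(split_uri("file:///parent/child"))
# -> ('file', '', '/parent/child')

# With only two slashes, "parent" is taken as the host, not a directory.
print(split_uri("file://parent/child"))
# -> ('file', 'parent', '/child')
```

This is why a local absolute path is normally written with three slashes after the scheme.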
text files
# Read a single file
input = sc.textFile("file:///home/holden/repos/spark/README.md")
input = sc.textFile("README.md")
input3 = sc.textFile("hdfs://Master:50070/test/sample.txt")
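# Look up the host name and port in Hadoop's core-site.xml (fs.defaultFS).
# In Hadoop 2.x, 50070 is the default NameNode web UI port, so verify the port against that setting.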
# Read a whole directory; returns (file name, file contents) pairs
input = sc.wholeTextFiles("file:///home/holden/repos/spark/")
# Write data
result.saveAsTextFile(outputFile)
json
import json
data = input.map(lambda x: json.loads(x))
(data.filter(lambda x: x['lovesPandas']).map(lambda x: json.dumps(x))
.saveAsTextFile(outputFile))
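The snippet above assumes every line is valid JSON and carries a lovesPandas key. A minimal sketch of a more tolerant variant follows, assuming that malformed lines and non-object values should simply be skipped. The helpers parse_json_line and loves_pandas are hypothetical names, and the logic is checked on a plain Python list so that it runs without a cluster.

```python
import json

def parse_json_line(line):
    """Return [record] for a line holding a JSON object, [] otherwise."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    return [record] if isinstance(record, dict) else []

def loves_pandas(record):
    """Treat a missing lovesPandas key as False instead of raising KeyError."""
    return bool(record.get("lovesPandas", False))

# Local check with a plain list standing in for the RDD of lines.
lines = ['{"name": "Holden", "lovesPandas": true}', 'not json', '{"name": "Sparky"}']
records = [r for line in lines for r in parse_json_line(line)]
panda_lovers = [json.dumps(r) for r in records if loves_pandas(r)]
print(panda_lovers)

# With Spark (assumes an existing RDD of lines named `input`):
# input.flatMap(parse_json_line).filter(loves_pandas).map(json.dumps).saveAsTextFile(outputFile)
```

Returning a list from the parser lets flatMap drop bad lines without a separate filter step.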
csv tsv
import csv
import io
...
# Load CSV line by line: each input line is one record
def loadRecord(line):
    """Parse a CSV line"""
    input = io.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return next(reader)
input = sc.textFile(inputFile).map(loadRecord)

# Load CSV in full: needed when a field may contain embedded newlines
def loadRecords(fileNameContents):
    """Load all the records in a given file"""
    # fileNameContents is a (file name, file contents) pair from wholeTextFiles
    input = io.StringIO(fileNameContents[1])
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return reader
fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)

# Write CSV: one output string per partition
def writeRecords(records):
    """Write out CSV lines"""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "favoriteAnimal"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]
pandaLovers.mapPartitions(writeRecords).saveAsTextFile(outputFile)
sequence files
object files
hadoop input and output values
file compression
file system
local/regular FS
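Note that when reading from a local path, the file must exist at the same path on every node.
If that condition is not met, you can read the file on the driver and then use parallelize
to distribute its contents to the workers.
However, distributing data to the workers this way is slow, so we recommend putting your files
in a shared filesystem such as HDFS, NFS, or S3.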
val rdd = sc.textFile("file:///home/holden/happypandas.gz")
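The driver-side fallback described above can be sketched as follows. This is a minimal illustration, assuming an existing SparkContext is passed in as sc and that the whole file fits in driver memory. The helper names read_local_lines and distribute_local_file are hypothetical.

```python
def read_local_lines(path):
    """Read a text file on the driver and return its lines without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def distribute_local_file(sc, path, num_slices=None):
    """Read `path` on the driver, then ship the lines to the workers.

    `sc` is assumed to be an existing SparkContext; `num_slices` is passed
    through to parallelize as the number of partitions.
    """
    lines = read_local_lines(path)
    return sc.parallelize(lines, num_slices)
```

A call such as distribute_local_file(sc, "/home/holden/README.md") would yield an RDD of lines even when the file exists only on the driver machine. Because every line passes through the driver, this suits small files only; larger inputs belong on a shared filesystem.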
amazon S3
hdfs
hdfs://master:port/path