DataFrame data
Saving
Create the data
output_df = sqlContext.createDataFrame(
    [{'a': [2, 3, 4], 'b': [1, 2, 3], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']}])
Display the data
output_df.show()
+---------+---------+---------+---------+
|        a|        b|        c|        d|
+---------+---------+---------+---------+
|[2, 3, 4]|[1, 2, 3]|[a, b, c]|[a, b, c]|
|[3, 4, 5]|[4, 5, 6]|[c, c, d]|[c, c, d]|
+---------+---------+---------+---------+
At this point, saving the data directly raises an error:
output_df.write.csv('./my',header=True)
Py4JJavaError: An error occurred while calling o37.csv.
: java.lang.UnsupportedOperationException: CSV data source does not support array<bigint> data type.
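Before writing, it can help to check which columns the CSV source will reject. The sketch below is one possible check, not part of the original workflow: the helper name `unsupported_csv_columns` is hypothetical, and it assumes its input is the list of `(name, type)` string pairs that `output_df.dtypes` returns.

```python
def unsupported_csv_columns(dtypes):
    """Return the names of columns whose type string marks a nested type.

    `dtypes` is a list of (column_name, type_string) pairs, in the shape
    that DataFrame.dtypes returns, e.g. ('a', 'array<bigint>').
    """
    nested_prefixes = ('array<', 'map<', 'struct<')
    return [name for name, type_string in dtypes
            if type_string.startswith(nested_prefixes)]


# Type strings written out by hand to match the DataFrame created above.
example_dtypes = [('a', 'array<bigint>'), ('b', 'array<bigint>'),
                  ('c', 'array<string>'), ('d', 'array<string>')]
print(unsupported_csv_columns(example_dtypes))  # ['a', 'b', 'c', 'd']
```

With a live session this would be called as `unsupported_csv_columns(output_df.dtypes)`.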
The data needs to be processed first
def to_csv(x):
    # Join each array column into a single comma-separated string
    return [','.join([str(i) for i in x[0]]), ','.join([str(i) for i in x[1]]),
            ','.join([str(i) for i in x[2]]), ','.join([str(i) for i in x[3]])]
output_df = output_df.rdd.map(to_csv).toDF(['a', 'b', 'c', 'd'])
output_df.show()
+-----+-----+-----+-----+
|    a|    b|    c|    d|
+-----+-----+-----+-----+
|2,3,4|1,2,3|a,b,c|a,b,c|
|3,4,5|4,5,6|c,c,d|c,c,d|
+-----+-----+-----+-----+
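The `to_csv` function above hard-codes four columns. As a sketch of one way to generalize it (the name `join_array_fields` and the `sep` parameter are hypothetical, not from the code above), the same joining can be applied to however many array columns a row has:

```python
def join_array_fields(row, sep=','):
    """Join every list-like field of a row into one delimited string."""
    return [sep.join(str(item) for item in field) for field in row]


# A plain tuple stands in for a Spark Row here; both are iterable by position.
print(join_array_fields(([2, 3, 4], [1, 2, 3], ['a', 'b', 'c'], ['a', 'b', 'c'])))
# ['2,3,4', '1,2,3', 'a,b,c', 'a,b,c']
```

It could be passed to `output_df.rdd.map(join_array_fields)` in place of `to_csv`. Values that themselves contain the separator would not survive a round trip, so this only suits data like the example here.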
Now the data can be saved
output_df.write.csv('./my',header=True)
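The data is saved under the ./my directory.
It is written in CSV format, and the directory contains 3 CSV files; the number of files depends on the number of partitions. Because we created only two rows, fewer than the number of partitions, most partitions are empty and only 3 files are produced. If the number of partitions is less than or equal to the number of rows, the number of CSV files should equal the number of partitions.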
Check the number of partitions
output_df.rdd.getNumPartitions()
8
Next, create a dataset with more than 8 rows
outputdf = sqlContext.createDataFrame(
    [{'a': [2, 3, 4], 'b': [1, 2, 3], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
     {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']}])
output_df = outputdf.rdd.map(to_csv).toDF(['a', 'b', 'c', 'd'])
output_df.show()
+-----+-----+-----+-----+
|    a|    b|    c|    d|
+-----+-----+-----+-----+
|2,3,4|1,2,3|a,b,c|a,b,c|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
+-----+-----+-----+-----+
Inspect the contents of each partition
partitions = output_df.rdd.glom().collect()
partitions
[[Row(a='2,3,4', b='1,2,3', c='a,b,c', d='a,b,c')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
[Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')]]
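As the output shows, partitions has length 8, so all 8 partitions hold data. The 8 is the number of CPU cores on my machine: I am using the default partitioning here, which equals the number of cores.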
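The partition layout above can be reproduced with a little arithmetic. The sketch below assumes Spark distributes a local collection of `n` rows over `k` partitions by giving partition `i` the rows from index `i*n//k` up to `(i+1)*n//k`. That rule is my reading of how `parallelize` slices a list, not something stated in this post, and the function name `partition_sizes` is hypothetical.

```python
def partition_sizes(num_rows, num_partitions):
    """Number of rows per partition under the assumed slicing rule."""
    sizes = []
    for i in range(num_partitions):
        start = i * num_rows // num_partitions
        end = (i + 1) * num_rows // num_partitions
        sizes.append(end - start)
    return sizes


print(partition_sizes(12, 8))  # [1, 2, 1, 2, 1, 2, 1, 2]
print(partition_sizes(2, 8))   # [0, 0, 0, 1, 0, 0, 0, 1]
```

The first result matches the `glom()` output above (1, 2, 1, 2, ... rows per partition), and the second shows how two rows leave most of the 8 partitions empty.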
Save the file
# ./my already exists from the earlier write, so overwrite it
output_df.write.csv('./my', mode='overwrite', header=True)
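Check the saved files
The number of saved CSV files is now 8, matching the number of partitions.
If we set the number of partitions manually, the partition count matches what we set; by default it is the number of CPU cores.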
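As a sketch of setting the partition count by hand before saving: `repartition` and `write.csv` are standard DataFrame methods, while the wrapper name `write_csv_with_partitions` and its parameters are hypothetical.

```python
def write_csv_with_partitions(df, path, num_partitions):
    """Repartition df, then write it as CSV with a header row.

    The directory at `path` is overwritten if it already exists.
    Returns the repartitioned DataFrame so the caller can inspect it.
    """
    repartitioned = df.repartition(num_partitions)
    repartitioned.write.csv(path, mode='overwrite', header=True)
    return repartitioned
```

For example, `write_csv_with_partitions(output_df, './my', 4)` would request 4 partitions; calling `.rdd.getNumPartitions()` on the returned DataFrame should then report 4.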
Reading
Read the saved CSV data
output_df = sqlContext.read.csv('./my', header=True)
Restore the data to a form that can be used for computation
def from_to_csv(s):
    # Split each comma-separated string back into a typed list
    return [[float(x) for x in s[0].split(',') if len(s[0]) > 0],
            [int(x) for x in s[1].split(',') if len(s[1]) > 0],
            [str(x) for x in s[2].split(',') if len(s[2]) > 0],
            [str(x) for x in s[3].split(',') if len(s[3]) > 0]]
output_df = output_df.rdd.map(from_to_csv).toDF(['a','b','c','d'])
output_df.show()
+---------------+---------+---------+---------+
|              a|        b|        c|        d|
+---------------+---------+---------+---------+
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[2.0, 3.0, 4.0]|[1, 2, 3]|[a, b, c]|[a, b, c]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
+---------------+---------+---------+---------+
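One caveat with `from_to_csv`: depending on the reader settings, an empty CSV field may come back as `None` rather than an empty string, and `len(None)` raises a `TypeError`. The sketch below is one defensive variant; the helper name `parse_list` and its `cast` and `sep` parameters are hypothetical.

```python
def parse_list(field, cast=str, sep=','):
    """Split a delimited string into a list, converting each item with cast.

    None and the empty string both map to an empty list.
    """
    if not field:
        return []
    return [cast(item) for item in field.split(sep)]


print(parse_list('3,4,5', int))    # [3, 4, 5]
print(parse_list('2,3,4', float))  # [2.0, 3.0, 4.0]
print(parse_list(None, int))       # []
```

A row could then be restored with `[parse_list(s[0], float), parse_list(s[1], int), parse_list(s[2]), parse_list(s[3])]`.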