Consider i have a list of python dictionary key value pairs , where key correspond to column name of a table, so for below list how to convert it into a pyspark dataframe with two cols arg1 arg2?
[{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]
How can i use the following construct to do it?
df = sc.parallelize([
...
]).toDF
Where to place arg1 arg2 in the above code (...)
解决方案
Old way:
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
return Row(**OrderedDict(sorted(d.items())))
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
.map(convert_to_row) \
.toDF()