Creating DataFrames: customers, products, sales
- Create the following three DataFrames (df_customers, df_products, df_sales)
1) Create df_customers:
customers = [(1, "James", 21, "M"), (2, "Liz", 25, "F"), (3, "John", 31, "M"),
             (4, "Jennifer", 45, "F"), (5, "Robert", 41, "M"), (6, "Sandra", 45, "F")]
df_customers = spark.createDataFrame(customers, ["cID", "name", "age", "gender"]) # list -> DF
df_customers.show()
+---+--------+---+------+
|cID|    name|age|gender|
+---+--------+---+------+
|  1|   James| 21|     M|
|  2|     Liz| 25|     F|
|  3|    John| 31|     M|
|  4|Jennifer| 45|     F|
|  5|  Robert| 41|     M|
|  6|  Sandra| 45|     F|
+---+--------+---+------+
2) Create df_products:
products = [(1, "iPhone", 600, 400), (2, "Galaxy", 500, 400), (3, "iPad", 400, 300),
            (4, "Kindel", 200, 100), (5, "MacBook", 1200, 900), (6, "Dell", 500, 400)]
df_products = sc.parallelize(products).toDF(["pId", "name", "price", "cost"]) # list -> RDD -> DF
df_products.show()
+---+-------+-----+----+
|pId|   name|price|cost|
+---+-------+-----+----+
|  1| iPhone|  600| 400|
|  2| Galaxy|  500| 400|
|  3|   iPad|  400| 300|
|  4| Kindel|  200| 100|
|  5|MacBook| 1200| 900|
|  6|   Dell|  500| 400|
+---+-------+-----+----+
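Both constructions map tuple positions to column names, so every tuple needs the same number of fields in the same order. A small pure-Python check can be run on the list before it is handed to Spark. This is only a sketch; `check_rows` is a hypothetical helper and not part of PySpark:

```python
def check_rows(rows, columns):
    """Return the rows as dicts keyed by column name; raise on a row with the wrong field count."""
    records = []
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(f"row {i} has {len(row)} fields, expected {len(columns)}")
        # Pair each positional value with its column name.
        records.append(dict(zip(columns, row)))
    return records


products = [(1, "iPhone", 600, 400), (2, "Galaxy", 500, 400), (3, "iPad", 400, 300),
            (4, "Kindel", 200, 100), (5, "MacBook", 1200, 900), (6, "Dell", 500, 400)]

records = check_rows(products, ["pId", "name", "price", "cost"])
print(records[0])
```

The returned dicts make it easy to eyeball which value lands in which column before building the DataFrame.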
3) Create df_sales:
sales = [("01/01/2015", "iPhone", "USA", 40000), ("01/02/2015", "iPhone",