Spark Example GOGOGO!
1. Create an RDD
from pyspark.sql import Row

some_rdd = sc.parallelize([Row(name="John", age=19), Row(name="Smith", age=23), Row(name="Sarah", age=18)])
NOTE: I use Hive backed by MySQL for the metastore here, so start those services first. `sc` and `sqlContext` are assumed to be the ones the PySpark shell provides.
2. Infer the schema
# Infer the schema of the DataFrame from the Row objects
some_df = sqlContext.createDataFrame(some_rdd)
some_df.printSchema()
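Conceptually, createDataFrame looks at sample records and maps each field's Python type to a Spark SQL type. A rough pure-Python sketch of that idea (only an illustration of the concept, not Spark's actual inference code):

```python
# Illustration only: map the Python type of each field in a sample record
# to a Spark SQL type name, mimicking what schema inference does.
def infer_schema(record):
    type_names = {bool: "boolean", int: "long", float: "double", str: "string"}
    return {field: type_names.get(type(value), "string")
            for field, value in record.items()}

sample = {"name": "John", "age": 19}
print(infer_schema(sample))  # {'name': 'string', 'age': 'long'}
```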
3. Specify the schema explicitly
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Another RDD is created from a list of tuples
another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two non-nullable fields: person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
# Create a DataFrame by applying the schema to the RDD and print the schema
another_df = sqlContext.createDataFrame(another_rdd, schema)
another_df.printSchema()
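Each StructField carries a name, a type, and a nullable flag; with nullable=False, null values are rejected. A hypothetical pure-Python sketch of applying such a (name, type, nullable) schema to a tuple, just to illustrate what the flags mean:

```python
# Hypothetical sketch: each field is (name, expected_type, nullable),
# mirroring StructField(name, dataType, nullable).
schema = [("person_name", str, False), ("person_age", int, False)]

def apply_schema(row, schema):
    out = {}
    for value, (name, expected, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError("field %s is not nullable" % name)
        elif not isinstance(value, expected):
            raise TypeError("field %s expects %s" % (name, expected.__name__))
        out[name] = value
    return out

print(apply_schema(("John", 19), schema))  # {'person_name': 'John', 'person_age': 19}
```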
4. Create a DataFrame directly from JSON
import os
import sys

if len(sys.argv) < 2:
    path = "file://" + os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
else:
    path = sys.argv[1]
# Create a DataFrame from the file(s) pointed to by path.
# jsonFile is the old Spark 1.x call; from Spark 1.4 on, prefer sqlContext.read.json(path).
people = sqlContext.jsonFile(path)
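Note that jsonFile (and read.json) expects one JSON object per line (JSON Lines), not a pretty-printed JSON array. A small pure-Python sketch of parsing that format, using the records from the Spark example file people.json:

```python
import json

# One JSON object per line (JSON Lines), as in the Spark example people.json.
raw = '{"name":"Michael"}\n{"name":"Andy", "age":30}\n{"name":"Justin", "age":19}\n'

records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(records[1]["name"])  # Andy
```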
5. Register the DataFrame as a table
# registerAsTable is deprecated; registerTempTable is the Spark 1.x name.
# A temp table lives only in this SQLContext; it is not written to the Hive metastore.
people.registerTempTable("people")
6. Query with SQL
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
for each in teenagers.collect():
    print(each[0])
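What that SQL computes, sketched in plain Python over the same sample data from step 1:

```python
# Plain-Python equivalent of the step-6 SQL, over the step-1 sample data.
people = [{"name": "John", "age": 19},
          {"name": "Smith", "age": 23},
          {"name": "Sarah", "age": 18}]

teenagers = [p["name"] for p in people if 13 <= p["age"] <= 19]
print(teenagers)  # ['John', 'Sarah']
```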