For work reasons I recently started writing Spark ML programs in Python. Weak fundamentals meant some very simple problems held me up for a long time, so I'm writing them down here as a small summary; maybe you'll run into the same issues. One more thing: the official documentation really is the best resource, even though my English is poor.
First, my goal: train a KMeans model on the Iris dataset. Iris is a well-known dataset from the UCI Machine Learning Repository, commonly used for clustering (and classification) tests. Iris.txt: http://archive.ics.uci.edu/ml/index.php
However, the spark.ml package in Spark 1.6 doesn't ship a Python KMeans example (and I'd rather not use spark.mllib, for reasons you can guess). My approach:
- Turn each line of data into a Row object and define a schema
- Use VectorAssembler to combine the feature columns into a single vector column for the KMeans algorithm
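Before the Spark code, the per-line parsing that the `map` lambdas perform can be sketched in plain Python (the sample line below is a made-up Iris record, not read from the actual file):

```python
# A hypothetical line from Iris.data: four measurements plus the species label.
line = "5.1,3.5,1.4,0.2,Iris-setosa"

# Mirror of the split-and-convert logic inside the Spark map lambdas:
parts = line.split(",")
features = [float(x) for x in parts[:4]]  # the four numeric columns
label = parts[4]                          # the species string

print(features)  # [5.1, 3.5, 1.4, 0.2]
print(label)     # Iris-setosa
```

Each Spark `Row` then just gives these parsed values named fields (f1 through f5).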
```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

sc = SparkContext(appName="IrisKMeans")
sqlContext = SQLContext(sc)

# Load the Iris data and split each comma-separated line.
lines = sc.textFile("data/Iris.data")
parts = lines.map(lambda l: l.split(","))
# Wrap each line in a Row: four float features plus the label string.
data = parts.map(lambda p: Row(f1=float(p[0]), f2=float(p[1]), f3=float(p[2]), f4=float(p[3]), f5=p[4]))
schemaKmeans = sqlContext.createDataFrame(data)
# Assemble the four numeric columns into a single vector column.
ass = VectorAssembler(inputCols=["f1", "f2", "f3", "f4"], outputCol="features")
out = ass.transform(schemaKmeans)
# out.show(truncate=False)
kmeansModel = KMeans(k=3, featuresCol="features", predictionCol="prediction").fit(out)
centers = kmeansModel.clusterCenters()
print(centers)
```
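For intuition, the `prediction` column the fitted model produces is just nearest-center assignment by Euclidean distance. A minimal plain-Python sketch with made-up centers (the values are illustrative, not real output from `clusterCenters()`):

```python
import math

# Hypothetical cluster centers of the shape clusterCenters() returns.
centers = [[5.0, 3.4, 1.5, 0.2],
           [5.9, 2.8, 4.4, 1.4],
           [6.8, 3.1, 5.7, 2.1]]

def predict(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

print(predict([5.1, 3.5, 1.4, 0.2], centers))  # 0
```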
There is also another way to convert an RDD to a DataFrame, by programmatically specifying the schema:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName("RDDToDataFrame").getOrCreate()
sc = spark.sparkContext

# Load a text file and convert each line to a tuple.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

# Create a temporary view so the DataFrame can be queried with SQL.
schemaPeople.createOrReplaceTempView("people")
results = spark.sql("SELECT name FROM people")
results.show()
```
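Conceptually, that final SQL query just projects the name field out of each row. The same operation over a plain Python list of tuples (the sample rows are hypothetical, in the (name, age) shape built above):

```python
# Hypothetical rows in the (name, age) tuple shape the RDD produces.
people = [("Michael", "29"), ("Andy", "30"), ("Justin", "19")]

# Plain-Python equivalent of: SELECT name FROM people
names = [name for name, age in people]
print(names)  # ['Michael', 'Andy', 'Justin']
```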