数据下载:
数据为kaggle上的关于商场客户的数据,地址:https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
数据准备:
数据集很小,四个特征值:性别,年龄,收入能力,消费能力,这里我们用收入能力和消费能力两项对客户进行聚类处理
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[1]').appName('learn_cluster').getOrCreate()
# 导入数据
df = spark.read.csv('file:///home/ffzs/python-projects/learn_spark/Mall_Customers.csv', header=True, inferSchema=True)
# 更换列名
df = df.withColumnRenamed('Annual Income (k$)', 'Income').withColumnRenamed('Spending Score (1-100)', 'Spend')
# 看下数据
df.show(3)
+----------+------+---+------+-----+
|CustomerID|Gender|Age|Income|Spend|
+----------+------+---+------+-----+
| 1| Male