LightGBM Series, Part 3: Using StringIndexer and Pipeline in PySpark

Article link: https://blog.csdn.net/fangfanglovezhou/article/details/118312950

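StringIndexer and Pipeline are two of the most commonly used tools for feature extraction in PySpark. This post explains how they work through examples, starting with StringIndexer: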

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
# Configure Spark and create the SparkSession object
spark = SparkSession.builder.master('local').appName('StringIndexerDemo').getOrCreate()
# Create a simple DataFrame
df = spark.createDataFrame([
    (0, "a", "s"), (1, "b", "o"),
    (2, "c", "g"), (3, "a", "l"),
    (4, "a", "p"), (5, "c", "u")],
    ["id", "category", "name"])
print("*********df**********")
df.show()

# Create a StringIndexer and set its input and output columns
indexer1 = StringIndexer(inputCol='category', outputCol='categoryIndex')
# Fit the indexer on the DataFrame to learn the label-to-index mapping
model1 = indexer1.fit(df)
# Use the fitted model to transform the DataFrame
indexed1 = model1.transform(df)
print("***********indexed1**********")
indexed1.show()

# Repeat the same steps for the 'name' column, starting from indexed1
indexer2 = StringIndexer(inputCol='name', outputCol='nameIndex')
model2 = indexer2.fit(indexed1)
indexed2 = model2.transform(indexed1)
print("************indexed2*********")
indexed2.show()

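As the code shows, using StringIndexer takes three steps. First, create a StringIndexer object with two parameters: the name of the input column and the name of the output column. Second, call fit on the DataFrame to train a model; the distinct strings are sorted by how often they occur, and the most frequent one gets index 0, the next one 1, then 2, 3, and so on. Finally, call transform to convert each value in the input column to its index, which returns a new DataFrame with the output column added.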

The results are as follows:

*********df**********
+---+--------+----+
| id|category|name|
+---+--------+----+
|  0|       a|   s|
|  1|       b|   o|
|  2|       c|   g|
|  3|       a|   l|
|  4|       a|   p|
|  5|       c|   u|
+---+--------+----+
***********indexed1**********
+---+--------+----+-------------+
| id|category|name|categoryIndex|
+---+--------+----+-------------+
|  0|       a|   s|          0.0|
|  1|       b|   o|          2.0|
|  2|       c|   g|          1.0|
|  3|       a|   l|          0.0|
|  4|       a|   p|          0.0|
|  5|       c|   u|          1.0|
+---+--------+----+-------------+
************indexed2*********
+---+--------+----+-------------+---------+
| id|category|name|categoryIndex|nameIndex|
+---+--------+----+-------------+---------+
|  0|       a|   s|          0.0|      0.0|
|  1|       b|   o|          2.0|      5.0|
|  2|       c|   g|          1.0|      2.0|
|  3|       a|   l|          0.0|      3.0|
|  4|       a|   p|          0.0|      4.0|
|  5|       c|   u|          1.0|      1.0|
+---+--------+----+-------------+---------+
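
The frequency-based numbering described above can be illustrated without Spark. The following is a minimal pure-Python sketch of the rule, not Spark's actual implementation; the function name frequency_index is hypothetical, and breaking ties alphabetically is an assumption of this sketch. In the nameIndex column above, where every value occurs exactly once, the tie order is different, so the order among equally frequent labels should not be relied on.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically
    # (an assumption of this sketch, not taken from the article).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The 'category' column from the example DataFrame
categories = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_index(categories)
print(mapping)                            # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print([mapping[c] for c in categories])   # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```

The second printed list matches the categoryIndex column shown above: "a" occurs three times and gets 0.0, "c" occurs twice and gets 1.0, and "b" occurs once and gets 2.0.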

Now let's see how to achieve the same thing with Pipeline:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
# Configure Spark and create the SparkSession object
spark = SparkSession.builder.master('local').appName('StringIndexerDemo').getOrCreate()
# Create a simple DataFrame
df = spark.createDataFrame([
    (0, "a", "s"), (1, "b", "o"),
    (2, "c", "g"), (3, "a", "l"),
    (4, "a", "p"), (5, "c", "u")],
    ["id", "category", "name"])
print("*********df**********")
df.show()
# Define the two indexers; nothing is fitted yet
indexer1 = StringIndexer(inputCol='category', outputCol='categoryIndex')
indexer2 = StringIndexer(inputCol='name', outputCol='nameIndex')
# Chain both indexers into a single pipeline, in execution order
pipeline = Pipeline(stages=[indexer1, indexer2])
# Fit all stages in order, then apply them to the DataFrame
model = pipeline.fit(df)
dfres = model.transform(df)
dfres.show()

The output is as follows:

+---+--------+----+-------------+---------+
| id|category|name|categoryIndex|nameIndex|
+---+--------+----+-------------+---------+
|  0|       a|   s|          0.0|      0.0|
|  1|       b|   o|          2.0|      5.0|
|  2|       c|   g|          1.0|      2.0|
|  3|       a|   l|          0.0|      3.0|
|  4|       a|   p|          0.0|      4.0|
|  5|       c|   u|          1.0|      1.0|
+---+--------+----+-------------+---------+

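The result is the same as before. As its name suggests, Pipeline provides an assembly line for operations: you put the operations into the pipeline in order, and they are then executed one after another in that order.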
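
To make the "run the stages in order" idea concrete, here is a minimal pure-Python sketch of how such a pipeline could chain its stages. It is an illustration under stated assumptions, not Spark's implementation: ToyIndexer and ToyPipeline are hypothetical classes, rows are plain dictionaries instead of a DataFrame, fit returns the object itself instead of a separate fitted model, and ties between equally frequent labels are broken alphabetically, so the nameIndex values differ from the output shown above.

```python
from collections import Counter


class ToyIndexer:
    """Hypothetical stand-in for a stage that learns a label-to-index map."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = None

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label first; alphabetical tie-break is an assumption.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.mapping = {label: float(i) for i, label in enumerate(ordered)}
        return self

    def transform(self, rows):
        # Return new rows with the output column added; inputs are not modified.
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class ToyPipeline:
    """Hypothetical pipeline: fits and applies its stages strictly in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        current = rows
        for stage in self.stages:
            # Each stage is fitted on the output of the previous stage.
            current = stage.fit(current).transform(current)
        return self

    def transform(self, rows):
        current = rows
        for stage in self.stages:
            current = stage.transform(current)
        return current


rows = [
    {"id": 0, "category": "a", "name": "s"}, {"id": 1, "category": "b", "name": "o"},
    {"id": 2, "category": "c", "name": "g"}, {"id": 3, "category": "a", "name": "l"},
    {"id": 4, "category": "a", "name": "p"}, {"id": 5, "category": "c", "name": "u"},
]
pipeline = ToyPipeline([ToyIndexer("category", "categoryIndex"),
                        ToyIndexer("name", "nameIndex")])
result = pipeline.fit(rows).transform(rows)
for row in result:
    print(row)
```

The point of the sketch is the loop in fit: each stage sees the data produced by the stage before it, which is what the manual version did by fitting indexer2 on indexed1.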