pyspark: the StringIndexer, IndexToString, and VectorIndexer transformers

Gadaite

Tags: python spark

Article link: https://blog.csdn.net/weixin_46408961/article/details/120407665

First, import the libraries and the classes we need:

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, IndexToString, VectorIndexer
from pyspark.ml.linalg import Vectors

# Create (or reuse) a SparkSession with the default configuration
spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()

I. StringIndexer:

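StringIndexer encodes a column of categorical features (or labels) as numeric indices. By default the indices are assigned in order of descending frequency, starting from 0, so the most frequent label gets index 0. If the input column is numeric, its values are cast to strings first. Strictly speaking, StringIndexer is an estimator: calling fit() on it produces the model that performs the transformation. In machine learning it can be used, for example, to improve the efficiency of decision-tree algorithms.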

# Build a DataFrame to use with StringIndexer
df = spark.createDataFrame(
    [(1, "spark"), (2, "hadoop"), (3, "scala"), (4, "java"), (5, "python"),
     (6, "spark"), (7, "java"), (8, "java"), (9, "python"), (10, "python")],
    ["id", "category"])
df.show()

Output:

Set the input and output columns:

# StringIndexer is an estimator: it must be fitted before it can transform data
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

Fit the model:

#%%
# fit() learns the label-to-index mapping and returns a StringIndexerModel
model = indexer.fit(df)
modelIndex = model.transform(df)
modelIndex.show()

Result:
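Because the indices depend only on label frequency, the expected mapping for this data can be worked out without Spark. The following is a minimal plain-Python sketch of that rule, not Spark's implementation. The helper name `frequency_index` is hypothetical, and the alphabetical tie-breaking between equally frequent labels is an assumption based on the documented Spark 3.x behaviour for the default `frequencyDesc` ordering.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically
    # (assumption: this mirrors the default ordering in Spark 3.x).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The same category values as in the DataFrame above
categories = ["spark", "hadoop", "scala", "java", "python",
              "spark", "java", "java", "python", "python"]
mapping = frequency_index(categories)
print(mapping)
```

Under these assumptions, "java" and "python" (three occurrences each) take indices 0 and 1, "spark" (two occurrences) takes 2, and the two singletons follow.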

II. IndexToString: the inverse of StringIndexer. It maps the numeric indices back to the original string labels.

# Build the transformer
stringer = IndexToString(inputCol="categoryIndex", outputCol="originalcategory")

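#%%
# IndexToString is a plain transformer, so no fit() is needed; call transform() directly
modelString = stringer.transform(modelIndex)
modelString.show()
# %%
# To keep only id and the restored column, select() is enough; no temporary view is required
data = modelString.select("id", "originalcategory")
data.show()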

Result (the output of the select query is on the right):
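Conceptually, the reverse step is just a lookup of each index in an ordered list of labels. The sketch below shows that idea in plain Python, not through the Spark API. The function name `index_to_string` is hypothetical, and the label order used here is an assumption (descending frequency with alphabetical tie-breaking), so the actual order in a given Spark version may differ.

```python
def index_to_string(indices, labels):
    """Look up each numeric index in the ordered label list."""
    return [labels[int(i)] for i in indices]


# Assumed label order: position in the list == index assigned to the label
labels = ["java", "python", "spark", "hadoop", "scala"]
# Indices for ids 1..10 under that assumed order
indices = [2.0, 3.0, 4.0, 0.0, 1.0, 2.0, 0.0, 0.0, 1.0, 1.0]

restored = index_to_string(indices, labels)
print(restored)
```

In Spark itself, IndexToString reads the labels from the metadata that StringIndexer attaches to the index column, unless they are supplied explicitly through the `labels` parameter.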

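III. VectorIndexer: when all features have been assembled into a single vector column, VectorIndexer indexes individual components of that vector. With maxCategories=2, any feature with at most 2 distinct values is treated as categorical and re-indexed; features with more distinct values are left unchanged as continuous features.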

# Data for VectorIndexer: one dense vector per row
dfs = spark.createDataFrame(
    [(Vectors.dense(-1.0, 1.0, 1.0),),
     (Vectors.dense(-1.0, 3.0, 3.0),),
     (Vectors.dense(0.0, 5.0, 1.0),)],
    ["features"])
dfs.show()

# %%
# Build the estimator, set the input and output columns, and fit it
vec = VectorIndexer(inputCol="features", outputCol="indexvec", maxCategories=2)
# maxCategories: features with at most 2 distinct values are treated as categorical
modelvec = vec.fit(dfs)

#%%
# Use the categoryMaps attribute to get the indexed features and their value mappings
categoryfeatures = modelvec.categoryMaps.keys()

# %%
print("choose "+str(len(categoryfeatures))+\
    " categorical features: "+str(categoryfeatures))

#%%
modelres = modelvec.transform(dfs)
modelres.show()

Original data and transformed result:

choose 2 categorical features: KeysView({0: {0.0: 0, -1.0: 1}, 2: {1.0: 0, 3.0: 1}})
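The category maps printed above can be reproduced by hand, which makes the role of maxCategories easier to see. The following is a plain-Python sketch of the selection rule, not Spark's implementation. The helper names `fit_category_maps` and `transform` are hypothetical. Giving 0.0 index 0 when it is present matches the mapping printed in this article; ordering the remaining values in ascending order is an assumption.

```python
def fit_category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        # Only features with few distinct values are treated as categorical
        if len(distinct) <= max_categories:
            # Assumption: 0.0 gets index 0 when present, the rest stay ascending
            if 0.0 in distinct:
                distinct.remove(0.0)
                distinct.insert(0, 0.0)
            maps[j] = {value: i for i, value in enumerate(distinct)}
    return maps


def transform(rows, maps):
    """Replace categorical values by their index; leave other features as-is."""
    return [[float(maps[j][v]) if j in maps else v for j, v in enumerate(row)]
            for row in rows]


# The same three vectors as in the DataFrame above
rows = [[-1.0, 1.0, 1.0], [-1.0, 3.0, 3.0], [0.0, 5.0, 1.0]]
category_maps = fit_category_maps(rows, max_categories=2)
indexed = transform(rows, category_maps)
print(category_maps)
print(indexed)
```

Feature 1 has three distinct values (1.0, 3.0, 5.0), which exceeds maxCategories=2, so it is left untouched; features 0 and 2 each have two distinct values and are re-indexed.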
