PySpark Feature Tool
1. Data Preparation
We define some test data to make it easy to verify that each function works. For beginners especially, understanding exactly what a function takes as input and what it produces as output is the key to understanding and using these feature transforms:
df = spark.createDataFrame([
('zhu', "Hi I heard about pySpark"),
('xiang', "I wish python could use case classes"),
('yu', "Logistic regression models are neat")
], ["id", "sentence"])
# functionTestData
+-----+------------------------------------+
|id   |sentence                            |
+-----+------------------------------------+
|zhu  |Hi I heard about pySpark            |
|xiang|I wish python could use case classes|
|yu   |Logistic regression models are neat |
+-----+------------------------------------+
2. Data Loading
#!/usr/bin/env python
# -*- coding: utf-8 -*-
########################################################################################################################
# Creator : Zhu Xiangyu.DOTA
# Creation Time : 2020-2-17 12:45:09
# Description : PySpark feature engineering toolkit
# Modify By :
# Modify Time :
# Modify Content :
# Script Version : 2.0.0.9
########################################################################################################################
import math
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DOTAd_Features_Tool').enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 1000)
# Note: spark.default.parallelism is a core setting fixed when the context is
# created; setting it through spark.conf.set() at runtime has no effect.
spark.conf.set("spark.default.parallelism", 2000)
def get_params():
    return {
        # Available functions
        'column1' : "TFIDF",            # Term frequency-inverse document frequency
        'column2' : "Word2Vec",
        'column3' : "CountVectorizer",
        'column4' : "OneHotEncoder",
        'column5' : "StringIndexer",
        'column6' : "IndexToString",
        'column7' : "PCA",
        'column8' : "Binarizer",
        'column9' : "Tokenizer",
        'column10': "StopWordsRemover",
        'column11': "NGram",
        'column12': "DCT",              # Discrete cosine transform
        'column13': "ChiSqSelector",    # Chi-squared feature selection
        'column14': "PearsonCorr",      # Pearson correlation coefficient
    }
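The mapping above pairs a column name with the name of the transform to apply. A small, Spark-free sketch shows how such a {column: function} request can be validated before dispatch; `SUPPORTED` and `validate_params` are illustrative helpers, not part of the original toolkit:

```python
# Names of the transforms the toolkit claims to support (from get_params above).
SUPPORTED = {
    "TFIDF", "Word2Vec", "CountVectorizer", "OneHotEncoder", "StringIndexer",
    "IndexToString", "PCA", "Binarizer", "Tokenizer", "StopWordsRemover",
    "NGram", "DCT", "ChiSqSelector", "PearsonCorr",
}

def validate_params(params):
    """Return (column, function) pairs, rejecting any unknown function name."""
    for col, func in params.items():
        if func not in SUPPORTED:
            raise ValueError("Unsupported transform %r for column %r" % (func, col))
    return list(params.items())

print(validate_params({'sentence': "TFIDF"}))  # [('sentence', 'TFIDF')]
```

Failing fast here is cheaper than letting an unknown name blow up later inside a string-built `exec` call.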
def main():
    # Reset params
    ######################################################################################
    #
    # database.table to read from
    dataset_Name = ""
    dataset = spark.sql("select * from {dataset_Name}".format(dataset_Name=dataset_Name)).fillna(0)
    #
    # database.table to store the results in
    saveAsTable_Name = ""
    #
    # Apply a function to column col: {col: function}
    params = {'sentence': "TFIDF"}
    #
    ######################################################################################
    #
    # functionTestData
    df = spark.createDataFrame([
        ('zhu', "Hi I heard about pySpark"),
        ('xiang', "I wish python could use case classes"),
        ('yu', "Logistic regression models are neat")
    ], ["id", "sentence"])
    # Feature transform
    features = featureTool(dataset, params)  # Test mode: dataset = df
    features.show(5)
    # Save features as a table
    saveResult(features, saveAsTable_Name)
3. Saving Results
# SaveTableAs
def saveResult(res, saveAsTable_Name='dota_tmp.dota_features_tool_save_result', saveFormat="orc", saveMode="overwrite"):
    res.write.saveAsTable(name=saveAsTable_Name, format=saveFormat, mode=saveMode)
4. Feature Functions
def featureTool(df, params):
    dataCols, targetCols = df.columns, params.keys()
    exeColumns = list(params.keys())[0]
    exeDefFunction = params[exeColumns]
    print(exeColumns + "-->" + exeDefFunction + "(df,{exeColumns})".format(exeColumns=exeColumns))
    exeOrder = "feat={exeDef}(df,'{exeCols}','{outputCol}')".format(
        exeCols=exeColumns, exeDef=exeDefFunction, outputCol=exeDefFunction + '_' + exeColumns)
    print("exeOrder : " + exeOrder)
    # In Python 3, exec() cannot create a new local variable inside a function,
    # so run the statement in an explicit namespace and read the result back out.
    local_ns = {'df': df}
    exec(exeOrder, globals(), local_ns)
    return local_ns['feat']
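Building a statement string and handing it to `exec` works, but a dictionary of callables achieves the same name-to-function routing without string evaluation. A minimal sketch, using a hypothetical `tfidf_stub` in place of a real PySpark transform:

```python
# Dispatch without exec(): map transform names to callables and call directly.
def tfidf_stub(df, inputCol, outputCol):
    # A real transform would return a new DataFrame; here we just tag the call.
    return ("TFIDF", inputCol, outputCol)

DISPATCH = {"TFIDF": tfidf_stub}

def feature_tool(df, params):
    col = list(params.keys())[0]
    name = params[col]
    # Unknown names raise KeyError here instead of failing inside exec().
    return DISPATCH[name](df, inputCol=col, outputCol=name + '_' + col)

print(feature_tool(None, {'sentence': "TFIDF"}))
# ('TFIDF', 'sentence', 'TFIDF_sentence')
```

Beyond avoiding `exec`, this keeps the set of callable transforms explicit and easy to audit.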
4.1 TFIDF
This weighting scheme is often combined with cosine similarity in the vector space model to judge how similar two documents are, and TF-IDF is the model actually used at scale in search engines and similar applications. Its core idea: if a word w appears frequently in document d but rarely in other documents, then w has strong discriminative power and is well suited to distinguishing d from the rest of the corpus.
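The idea can be checked with a tiny pure-Python computation on the test sentences, using the textbook weight tf(w, d) * log(N / df(w)). This is only an illustration of the formula, not Spark's implementation (which hashes terms via HashingTF and smooths the IDF):

```python
import math

# Toy corpus: three "documents" as token lists, mirroring the test data above.
docs = [
    "hi i heard about pyspark".split(),
    "i wish python could use case classes".split(),
    "logistic regression models are neat".split(),
]

def tfidf(word, doc, docs):
    tf = doc.count(word)                           # raw term frequency in doc
    df = sum(1 for d in docs if word in d)         # number of docs containing word
    idf = math.log(len(docs) / df) if df else 0.0  # inverse document frequency
    return tf * idf

# "pyspark" occurs in only one document, so it gets a high weight there.
print(round(tfidf("pyspark", docs[0], docs), 4))  # log(3/1) ≈ 1.0986
# "i" occurs in two of the three documents, so its weight is lower.
print(round(tfidf("i", docs[0], docs), 4))        # log(3/2) ≈ 0.4055
```

A word appearing in every document gets idf = log(1) = 0, i.e. no discriminative power, which is exactly the intuition described above.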
def TFIDF(df, inputCol="sentence", outputCol="tfidf", numFeatures=20):
    """
    Term frequency