A Detailed Guide to the pyspark.ml.feature Module (continuously updated)

Latest recommended article published on 2024-08-01 17:39:08

NoOne-csdn


Article link: https://blog.csdn.net/weixin_40161254/article/details/100146315


Tokenizer(inputCol=None,outputCol=None)

A tokenizer that converts the input string to lowercase and splits it on whitespace.
Example:

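from pyspark.ml.feature import Tokenizer

# Assumes an active SparkSession named `spark`.
# One-row DataFrame holding a multi-line string (continuation lines are indented).
df=spark.createDataFrame([('''Machine learning can be applied to a wide variety
        of data types, such as vectors, text, images, and
        structured data. This API adopts the DataFrame from
        Spark SQL in order to support a variety of data types.''',)],['text'])

print(df.select('text').take(1))

# Lowercase the text and split it on whitespace
tokenizer=Tokenizer(inputCol='text',outputCol='token')
print(tokenizer.transform(df).select('token').take(1))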

[Row(text='Machine learning can be applied to a wide variety\n        of data types, such as vectors, text, images, and\n        structured data. This API adopts the DataFrame from\n        Spark SQL in order to support a variety of data types.')]
[Row(token=['machine', 'learning', 'can', 'be', 'applied', 'to', 'a', 'wide', 'variety', '', '', '', '', '', '', '', '', 'of', 'data', 'types,', 'such', 'as', 'vectors,', 'text,', 'images,', 'and', '', '', '', '', '', '', '', '', 'structured', 'data.', 'this', 'api', 'adopts', 'the', 'dataframe', 'from', '', '', '', '', '', '', '', '', 'spark', 'sql', 'in', 'order', 'to', 'support', 'a', 'variety', 'of', 'data', 'types.'])]
Note: the text is split on whitespace only, so punctuation stays attached to the tokens ('types,', 'data.') and each run of consecutive whitespace characters (the newline plus the indentation) produces empty tokens.
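
The empty tokens in the output above can be reproduced without Spark. Here is a minimal plain-Python sketch, assuming the tokenizer lowercases the text and then splits on every single whitespace character. `simple_tokenize` is a hypothetical helper name used only for illustration; it is not part of pyspark and is not Spark's implementation.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split on every single whitespace character.

    Hypothetical helper that only mimics the behaviour described above.
    """
    return re.split(r"\s", text.lower())

# A newline followed by four spaces is five consecutive whitespace
# characters, which leaves four empty strings between the two words.
sample = "Wide variety\n    of data types,"
tokens = simple_tokenize(sample)
print(tokens)
```

Because the split happens on each whitespace character rather than on a run of whitespace, n consecutive whitespace characters leave n-1 empty tokens. That matches the eight empty strings per line break in the output above.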

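Attributes and methods:

getInputCol(): get the input column name
getOutputCol(): get the output column name
setInputCol(value): set the input column name
setOutputCol(value): set the output column name
setParams(self,inputCol=None,outputCol=None): set the parameters
transform(dataset): transform the dataset
params: list all parameters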

print(tokenizer.params)
print(tokenizer.inputCol)
print(tokenizer.outputCol)
>>>[Param(parent='Tokenizer_f79b7f105329', name='inputCol', doc='input column name.'), Param(parent='Tokenizer_f79b7f105329', name='outputCol', doc='output column name.')]
>>>Tokenizer_f79b7f105329__inputCol
>>>Tokenizer_f79b7f105329__outputCol

RegexTokenizer

RegexTokenizer(minTokenLength=1,gaps=True,pattern='\s+',inputCol=None,outputCol=None,toLowercase=True)
RegexTokenizer builds on Tokenizer and offers more advanced tokenization based on regular-expression matching. By default the pattern describes the separators (one or more whitespace characters) on which the input text is split. Alternatively, setting gaps to False makes the pattern describe the tokens themselves, and every match is returned as a token.
-gaps: whether the regex splits on gaps (True) or matches tokens (False)
-toLowercase: whether to convert the text to lowercase

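from pyspark.ml.feature import RegexTokenizer

# gaps=False with the default pattern '\s+': the whitespace runs themselves become the tokens
regtoken=RegexTokenizer(inputCol='text',outputCol='regtoken',gaps=False)
print(regtoken.transform(df).select('regtoken').take(1))
print("*"*12)


# gaps=True: split on whitespace runs and on the characters , . "
regtoken=RegexTokenizer(inputCol='text',outputCol='regtoken',gaps=True,pattern=r'\s+|[,.\"]')
print(regtoken.transform(df).select('regtoken').take(1))
print("*"*12)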

[Row(regtoken=[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n        ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n        ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n        ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '])]

[Row(regtoken=['machine', 'learning', 'can', 'be', 'applied', 'to', 'a', 'wide', 'variety', 'of', 'data', 'types', 'such', 'as', 'vectors', 'text', 'images', 'and', 'structured', 'data', 'this', 'api', 'adopts', 'the', 'dataframe', 'from', 'spark', 'sql', 'in', 'order', 'to', 'support', 'a', 'variety', 'of', 'data', 'types'])]
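
The two modes map onto two familiar regex operations: splitting on a pattern versus collecting its matches. The sketch below illustrates this with Python's `re` module. It rests on some assumptions: `regex_tokenize` and its argument names are hypothetical, Python's regex dialect differs from the Java one Spark uses, and this is not Spark's code.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True,
                   min_token_length=1, to_lowercase=True):
    """Hypothetical helper mimicking the two modes described above."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        # The pattern describes the separators: split on it.
        candidates = re.split(pattern, text)
    else:
        # The pattern describes the tokens: collect every match.
        candidates = re.findall(pattern, text)
    # Drop tokens shorter than the minimum length (this removes empty strings).
    return [t for t in candidates if len(t) >= min_token_length]

text = "Spark SQL, in order."
split_tokens = regex_tokenize(text, pattern=r'\s+|[,.\"]', gaps=True)
matched_tokens = regex_tokenize(text, pattern=r"\w+", gaps=False)
print(split_tokens)
print(matched_tokens)
```

Both calls yield the same four words here. With gaps=True the pattern lists what to throw away; with gaps=False it lists what to keep, which is why the default whitespace pattern with gaps=False returned only whitespace in the first output above.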

StopWordsRemover

StopWordsRemover(inputCol=None,outputCol=None,stopWords=None,locale=None)
Filters stop words out of the input.
Example: filter out the words "data" and "of"

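from pyspark.ml.feature import StopWordsRemover
# df1 is the tokenized DataFrame produced by the Tokenizer example above
df1=tokenizer.transform(df)
remover=StopWordsRemover(inputCol=tokenizer.getOutputCol(),outputCol='words',stopWords=['data','of'])
remover.transform(df1).select('words').take(1)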

View the built-in default stop words
remover.loadDefaultStopWords('english')

['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
'your',
'yours',
'yourself',
'yourselve…………]

View the stop words currently set on the remover
remover.getStopWords()

['data', 'of']
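
Conceptually the filtering step is a membership test against the stop-word list. The following plain-Python sketch assumes case-insensitive matching by default; `remove_stop_words` and its `case_sensitive` flag are hypothetical names for illustration, not pyspark API.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not in the stop-word list.

    Hypothetical helper; it only illustrates the filtering idea.
    """
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Compare lowercase forms so that 'Data' and 'data' are treated alike.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]

tokens = ["a", "variety", "of", "data", "types."]
filtered = remove_stop_words(tokens, ["data", "of"])
print(filtered)
```

Only exact tokens are removed: 'types.' survives, and a token such as 'data.' would too, because the trailing period makes it a different string. This is one reason to strip punctuation during tokenization first.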

StringIndexer

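StringIndexer(inputCol=None,outputCol=None,handleInvalid='error',stringOrderType='frequencyDesc')
StringIndexer maps a column of strings to numeric indices. By default the labels are ordered by frequency, so the most frequent label gets index 0.
-handleInvalid: how to handle invalid data (labels not seen during fitting, or NULL values). "skip" filters out the row; "error" throws an error; "keep" assigns the invalid data an additional index of its own
-inputCol: name of the column to index
-outputCol: name of the column that stores the indices
-stringOrderType: how the labels are ordered before indices are assigned. "frequencyDesc" orders by descending frequency, so the most frequent label gets index 0; "frequencyAsc" orders by ascending frequency, so the least frequent label gets index 0; "alphabetDesc" orders alphabetically, descending; "alphabetAsc" orders alphabetically, ascending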

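# Load a tab-separated file with a header row (the path is the author's local directory)
Path = "/Users/cyy/Desktop/PythonProject/"
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("delimiter", "\t")\
    .load(Path + "data/train.tsv")

# Keep only the categorical column to be indexed
df=df.select('alchemy_category')

from pyspark.ml.feature import StringIndexer
# frequencyAsc: the least frequent label gets index 0
indexer=StringIndexer(inputCol='alchemy_category',outputCol='indexer',stringOrderType='frequencyAsc')

# fit() learns the label-to-index mapping and returns a model
model=indexer.fit(df)

df.select('alchemy_category').show()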

+------------------+
|  alchemy_category|
+------------------+
|          business|
|        recreation|
|            health|
|            health|
|            sports|
|                 ?|
|arts_entertainment|
|                 ?|
|                 ?|
|                 ?|
|          business|
|            sports|
|            health|
|                 ?|
|        recreation|
|        recreation|
|        recreation|
|arts_entertainment|
|        recreation|
|            health|
+------------------+
only showing top 20 rows

model.transform(df).show()

+------------------+-------+
|  alchemy_category|indexer|
+------------------+-------+
|          business|   10.0|
|        recreation|   12.0|
|            health|    9.0|
|            health|    9.0|
|            sports|    8.0|
|                 ?|   13.0|
|arts_entertainment|   11.0|
|                 ?|   13.0|
|                 ?|   13.0|
|                 ?|   13.0|
|          business|   10.0|
|            sports|    8.0|
|            health|    9.0|
|                 ?|   13.0|
|        recreation|   12.0|
|        recreation|   12.0|
|        recreation|   12.0|
|arts_entertainment|   11.0|
|        recreation|   12.0|
|            health|    9.0|
+------------------+-------+
only showing top 20 rows

indexer.getStringOrderType()

'frequencyAsc'

indexer.getHandleInvalid()

'error'
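
To make the ordering options concrete, here is a plain-Python sketch of how a label-to-index mapping could be derived from value counts. `fit_string_index` is a hypothetical helper, not pyspark API, and it does not model how Spark breaks ties between labels with equal frequency.

```python
from collections import Counter

def fit_string_index(values, order="frequencyDesc"):
    """Build a label -> index mapping for the given ordering.

    Hypothetical helper; tie-breaking between equally frequent
    labels is not modelled here.
    """
    counts = Counter(values)
    if order == "frequencyDesc":
        labels = sorted(counts, key=lambda v: -counts[v])
    elif order == "frequencyAsc":
        labels = sorted(counts, key=lambda v: counts[v])
    elif order == "alphabetDesc":
        labels = sorted(counts, reverse=True)
    elif order == "alphabetAsc":
        labels = sorted(counts)
    else:
        raise ValueError("unknown order: " + order)
    # Indices are stored as floats, as in the indexer column above.
    return {label: float(i) for i, label in enumerate(labels)}

values = ["?", "?", "?", "health", "health", "sports"]
desc_index = fit_string_index(values, "frequencyDesc")
asc_index = fit_string_index(values, "frequencyAsc")
print(desc_index)
print(asc_index)
```

With frequencyAsc the most common value ends up with the largest index. That is why '?' maps to 13.0 in the table above.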

OneHotEncoder

OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)
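One-hot encoding
Machine learning algorithms often have to deal with categorical features, for example a person's gender (male, female) or home country (China, the United States, France, and so on).
These feature values are not continuous but discrete and unordered, so they usually have to be converted to numbers.
A one-hot code uses as many bits as there are states, with exactly one bit set to 1 and all others set to 0. In communication network protocol stacks, one-hot codes with eight or sixteen states are typically used, with the system reserving one state code and leaving the rest to the user.
In machine learning (Logistic Regression, SVM, and so on), discrete categorical data must be converted to numbers. Take gender, which can only be male, female, or other: how should these three values be represented numerically? A simple approach is male = 0, female = 1, other = 2. What is wrong with that?
Representing categories with such sequential integers can hurt training, because the model treats the codes as magnitudes: the arbitrary number assigned to a category changes how much weight the feature carries in a sample. A category encoded as 1000 would influence the model far more than one encoded as 1, although the codes carry no such meaning. One-hot encoding is introduced so that the arbitrary numeric representation of categories does not harm the model.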

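from pyspark.ml.feature import OneHotEncoder
# The encoder reads the 'indexer' column, so apply the fitted StringIndexer model first.
# Spark 2.x API (Transformer); in Spark 3.x OneHotEncoder is an Estimator: call fit() before transform().
encoder=OneHotEncoder(inputCol='indexer',outputCol='features',dropLast=False)
encoder.transform(model.transform(df)).head().features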

SparseVector(14, {10: 1.0})

encoder.getDropLast()

False

encoder.getInputCol()

'indexer'

encoder.transform(model.transform(df)).show()

+------------------+-------+---------------+
|  alchemy_category|indexer|       features|
+------------------+-------+---------------+
|          business|   10.0|(14,[10],[1.0])|
|        recreation|   12.0|(14,[12],[1.0])|
|            health|    9.0| (14,[9],[1.0])|
|            health|    9.0| (14,[9],[1.0])|
|            sports|    8.0| (14,[8],[1.0])|
|                 ?|   13.0|(14,[13],[1.0])|
|arts_entertainment|   11.0|(14,[11],[1.0])|
|                 ?|   13.0|(14,[13],[1.0])|
|                 ?|   13.0|(14,[13],[1.0])|
|                 ?|   13.0|(14,[13],[1.0])|
………………
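
Each entry such as (14,[10],[1.0]) is a sparse vector: length 14, with the value 1.0 at position 10. The dense equivalent, and the effect of dropLast, can be sketched in plain Python. `one_hot` is a hypothetical helper, not pyspark API.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot vector for a category index.

    Hypothetical helper. With drop_last=True the vector has one slot
    fewer and the last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

full = one_hot(2, 4, drop_last=False)
dropped_middle = one_hot(1, 4, drop_last=True)
dropped_last = one_hot(3, 4, drop_last=True)
print(full, dropped_middle, dropped_last)
```

Dropping the last slot removes a redundant column, since the last category is implied when all other slots are 0. With dropLast=False, as in the example above, all 14 categories keep their own position.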

VectorAssembler(inputCols=None,outputCol=None,handleInvalid='error')

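VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for merging raw features and features generated by different feature transformers into one feature vector before training ML models such as logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns are concatenated into a vector in the specified order.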

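from pyspark.ml.feature import VectorAssembler
# assemblerInputs (list of input column names) and df2 (the preprocessed DataFrame)
# come from earlier steps that are not shown in this article
assembler=VectorAssembler(inputCols=assemblerInputs,outputCol="features")
df3=assembler.transform(df2)
print(df2.take(1))
print(df3.select('features').take(1))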

[Row(url='http://1000awesomethings.com/2008/12/31/862-the-laugh-echo/', alchemy_category='recreation', alchemy_category_score=0.303424, avglinksize=2.352941176, commonlinkratio_1=0.722826087, commonlinkratio_2=0.375, commonlinkratio_3=0.304347826, commonlinkratio_4=0.288043478, compression_ratio=0.482968988, embed_ratio=0.0, framebased=0.0, frameTagRatio=0.034038638, hasDomainLink=0.0, html_ratio=0.222980757, image_ratio=0.218579235, is_news=1.0, lengthyLinkDomain=1.0, linkwordscore=14.0, news_front_page=0.0, non_markup_alphanum_characters=9935.0, numberOfLinks=184.0, numwords_in_url=3.0, parametrizedLinkRatio=0.347826087, spelling_errors_ratio=0.13832853, label=1.0, alchemy_category_Index=1.0, alchemy_category_IndexVec=SparseVector(14, {1: 1.0}))]
[Row(features=SparseVector(36, {1: 1.0, 14: 0.3034, 15: 2.3529, 16: 0.7228, 17: 0.375, 18: 0.3043, 19: 0.288, 20: 0.483, 23: 0.034, 25: 0.223, 26: 0.2186, 27: 1.0, 28: 1.0, 29: 14.0, 31: 9935.0, 32: 184.0, 33: 3.0, 34: 0.3478, 35: 0.1383}))]
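
The concatenation itself is simple: walk the input columns in order, append scalars, and flatten vectors. Here is a plain-Python sketch of the idea. `assemble` and the column names in `row` are hypothetical and chosen only for illustration; vectors are represented as plain lists.

```python
def assemble(row, input_cols):
    """Concatenate the listed columns of one row into a flat feature list.

    Hypothetical helper: numbers and booleans contribute one entry,
    vectors (plain lists here) contribute all their entries.
    """
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(x) for x in value)
        else:
            # float(True) is 1.0, so booleans become 0.0 or 1.0
            features.append(float(value))
    return features

row = {"category_vec": [0.0, 1.0, 0.0], "score": 0.303424, "is_news": True}
features = assemble(row, ["category_vec", "score", "is_news"])
print(features)
```

The same layout is visible in the output above: the 14-slot one-hot vector occupies positions 0 to 13 of the 36-entry result (the single 1.0 at position 1), and the numeric columns follow from position 14 onward.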

NGram(n=2,inputCol=None,outputCol=None)

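The N-Gram language model
N-Gram (sometimes called the N-gram model) is an important concept in natural language processing. In NLP it has two main applications:

(1) Given a corpus, N-Grams can be used to predict or assess whether a sentence is plausible.

(2) N-Grams can also be used to measure how different two strings are, a common technique in fuzzy matching.
A Chinese language model uses the collocation information between adjacent words in context. When continuous, unspaced pinyin, strokes, or digits representing letters or strokes have to be converted into a string of Chinese characters (a sentence), the model can compute the sentence with the highest probability and convert to Chinese characters automatically, without the user choosing manually. This avoids the ambiguity caused by many characters sharing the same pinyin (or stroke string, or digit string).
The model assumes that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be estimated by counting, directly in a corpus, how often N words occur together. The most commonly used variants are the bigram (Bi-Gram) and the trigram (Tri-Gram).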
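
For the bigram case, the counting estimate described above is P(w2 | w1) = count(w1 w2) / count(w1). The sketch below computes it on a toy corpus in plain Python. `bigram_probabilities` and the corpus are hypothetical, and no smoothing is applied. This illustrates the language-model idea only; it is separate from what the NGram transformer in pyspark does.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(next | previous) from raw counts (no smoothing).

    Hypothetical helper for illustration only.
    """
    # Count each word in the role of "previous word" (every token but the last).
    context_counts = Counter(tokens[:-1])
    # Count each pair of adjacent words.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}

corpus = "a variety of data types and a variety of models".split()
probs = bigram_probabilities(corpus)
print(probs[("a", "variety")])
print(probs[("of", "data")])
```

In this toy corpus "a" is always followed by "variety", while "of" is followed by "data" half the time and by "models" the other half. A sentence's probability under the model would then be the product of such conditional probabilities.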

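from pyspark.sql import Row
from pyspark.ml.feature import NGram
# One row whose inputTokens column is ['a', 'b', 'c', 'd', 'e']
df=spark.createDataFrame([Row(inputTokens=list('abcde'))])
# Each output string joins n consecutive tokens with a space
ngram=NGram(n=2,inputCol='inputTokens',outputCol='nGrams')
ngram.transform(df).show()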

+---------------+--------------------+
|    inputTokens|              nGrams|
+---------------+--------------------+
|[a, b, c, d, e]|[a b, b c, c d, d e]|
+---------------+--------------------+
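
The output above is a sliding window over the token list, with each window joined by a single space. A plain-Python sketch of the same idea follows; `ngrams` is a hypothetical helper, not pyspark API.

```python
def ngrams(tokens, n=2):
    """Join every window of n consecutive tokens with a space.

    Hypothetical helper; returns an empty list when there are
    fewer than n tokens.
    """
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(list("abcde"), n=2)
trigrams = ngrams(list("abcde"), n=3)
too_short = ngrams(["a"], n=2)
print(bigrams)
print(trigrams)
print(too_short)
```

A list of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams, as in the table above.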

ngram.getN()