ML Other Features (Part 1)

Feature Extraction

NLP-Related Feature Extraction
The NGram model takes a list of tokenized text and generates word pairs (n-grams).

from pyspark.sql import SparkSession
import pyspark.ml.feature as ft
import pyspark.sql.functions as func
import pyspark.ml.clustering as clus
from pyspark.ml import Pipeline
import pyspark.sql.types as typ
import numpy as np
spark = SparkSession.builder.master('local').appName('NLP').getOrCreate()
text_data = spark.createDataFrame([
    ['''Machine learning can be applied to a wide variety 
        of data types, such as vectors, text, images, and 
        structured data. This API adopts the DataFrame from 
        Spark SQL in order to support a variety of data types.'''],
    ['''DataFrame supports many basic and structured types; 
        see the Spark SQL datatype reference for a list of 
        supported types. In addition to the types listed in 
        the Spark SQL guide, DataFrame can use ML Vector types.'''],
    ['''A DataFrame can be created either implicitly or 
        explicitly from a regular RDD. See the code examples 
        below and the Spark SQL programming guide for examples.'''],
    ['''Columns in a DataFrame are named. The code examples 
        below use names such as "text," "features," and "label."''']
], ['input'])

In this single-column DataFrame each row is just a block of free-form text, so the first step is to tokenize it. We use RegexTokenizer instead of Tokenizer so that we can specify the pattern on which to split the text:

tokenizer = ft.RegexTokenizer(inputCol='input',
                              outputCol='input_arr',
                              pattern=r'\s+|[,.\"]')
tok = tokenizer.transform(text_data).select('input_arr')
tok.take(1)
[Row(input_arr=['machine', 'learning', 'can', 'be', 'applied', 'to', 'a', 'wide', 'variety', 'of', 'data', 'types', 'such', 'as', 'vectors', 'text', 'images', 'and', 'structured', 'data', 'this', 'api', 'adopts', 'the', 'dataframe', 'from', 'spark', 'sql', 'in', 'order', 'to', 'support', 'a', 'variety', 'of', 'data', 'types'])]

This pattern splits the text on any run of whitespace and also removes commas, periods, and quotation marks.
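
To see what the pattern does, here is a quick plain-Python illustration of ours using the standard re module; on top of this split, RegexTokenizer also lowercases the tokens (toLowercase defaults to True) and drops the empty strings that re.split leaves behind (minTokenLength defaults to 1):

import re
# Split on runs of whitespace or on a comma, period, or double quote
re.split(r'\s+|[,.\"]', 'Machine learning, such as vectors.')
# ['Machine', 'learning', '', 'such', 'as', 'vectors', '']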

The text still contains a lot of noise: words such as "be" or "a", which usually carry no value when analyzing text. We therefore remove these stop words with StopWordsRemover():

stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(),
                                outputCol='input_stop')

stopwords.transform(tok).select('input_stop').take(1)
[Row(input_stop=['machine', 'learning', 'applied', 'wide', 'variety', 'data', 'types', 'vectors', 'text', 'images', 'structured', 'data', 'api', 'adopts', 'dataframe', 'spark', 'sql', 'order', 'support', 'variety', 'data', 'types'])]
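
By default, StopWordsRemover uses its built-in English stop word list. If your corpus has its own noise words, you can extend that list through the stopWords parameter; the following is a minimal sketch of ours (the custom_stop and stopwords_custom names are not part of the original example):

# Hypothetical: extend the default English list with corpus-specific words
custom_stop = ft.StopWordsRemover.loadDefaultStopWords('english') \
    + ['dataframe', 'spark']
stopwords_custom = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(),
                                       outputCol='input_stop',
                                       stopWords=custom_stop)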
ngram = ft.NGram(n=2,
                 inputCol=stopwords.getOutputCol(),
                 outputCol='nGrams')
pipeline = Pipeline(stages=[tokenizer,
                            stopwords,
                            ngram])
data_ngram = pipeline.fit(text_data).transform(text_data)
data_ngram.select('nGrams').take(1)
[Row(nGrams=['machine learning', 'learning applied', 'applied wide', 'wide variety', 'variety data', 'data types', 'types vectors', 'vectors text', 'text images', 'images structured', 'structured data', 'data api', 'api adopts', 'adopts dataframe', 'dataframe spark', 'spark sql', 'sql order', 'order support', 'support variety', 'variety data', 'data types'])]

With that, we have our n-grams and can move on to further NLP processing.
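
As one possible next step (a sketch of ours, not part of the original text), the n-grams could be turned into TF-IDF features by appending HashingTF and IDF stages to the pipeline; the variable names below are our own:

# Hash each n-gram into a fixed-size term-frequency vector
hashing_tf = ft.HashingTF(inputCol='nGrams',
                          outputCol='tf',
                          numFeatures=1024)
# Re-weight the term frequencies by inverse document frequency
idf = ft.IDF(inputCol='tf', outputCol='tf_idf')
pipeline_tfidf = Pipeline(stages=[tokenizer, stopwords, ngram,
                                  hashing_tf, idf])
data_tfidf = pipeline_tfidf.fit(text_data).transform(text_data)
data_tfidf.select('tf_idf').take(1)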

Discretizing Continuous Variables

We often need to work with highly non-linear continuous features that are hard to feed to a model through a single coefficient; in such cases it can be difficult to explain the relationship between the feature and the target with just one coefficient. Sometimes it is useful to bucket such values into discrete categories, turning the raw value into a bucket index that the model can treat as a categorical feature.

Let's create some fake data first:

# Generate a non-linear, oscillating series of fake values
x = np.arange(0, 100)
x = x / 100.0 * np.pi * 4
y = x * np.sin(x / 1.764) + 20.1234
# Create the DataFrame
schema = typ.StructType([
    typ.StructField('continuous_var',
                   typ.DoubleType(),
                   False)
])
data = spark.createDataFrame(
    [[float(e), ] for e in y],
    schema=schema
)

Now use the QuantileDiscretizer model to split the continuous variable into five buckets (the numBuckets parameter):

discretizer = ft.QuantileDiscretizer(numBuckets=5,
                                     inputCol='continuous_var',
                                     outputCol='discretized')
data_discretized = discretizer.fit(data).transform(data)
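
To sanity-check the result (our own addition, reusing the func alias imported at the top), look at the mean value and the row count in each bucket:

data_discretized.groupBy('discretized') \
    .agg(func.mean('continuous_var').alias('mean'),
         func.count('continuous_var').alias('count')) \
    .orderBy('discretized') \
    .show()
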
Standardizing Continuous Variables

Standardizing continuous variables not only helps to better understand the relationships between features, but also aids computational efficiency and guards against some numerical pitfalls.
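
As a reminder (standard z-score standardization, stated here for clarity rather than taken from the original), the scaler replaces each value x with (x - mean) / stddev, where mean and stddev are computed per column.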

# First create a single vector representing the continuous variable:
vectorizer = ft.VectorAssembler(
    inputCols=['continuous_var'],
    outputCol='continuous_vec'
)

Next, build the normalizer and the pipeline. With withMean and withStd set to True, the scaler subtracts the mean and scales the data to unit standard deviation:

normalizer = ft.StandardScaler(
    inputCol=vectorizer.getOutputCol(),
    outputCol='normalized',
    withMean=True,
    withStd=True
)
pipeline = Pipeline(stages=[vectorizer,
                            normalizer])
data_standardized = pipeline.fit(data).transform(data)

The data now oscillates around 0 with unit variance.
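
To verify (a sketch of ours; vector_to_array requires Spark 3.0+, and is needed because StandardScaler outputs a vector column):

from pyspark.ml.functions import vector_to_array
# Unpack the one-element vector, then check mean and standard deviation
data_standardized.select(
    vector_to_array('normalized')[0].alias('norm')
).agg(func.mean('norm'), func.stddev('norm')).show()
# Expect a mean close to 0 and a standard deviation close to 1.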
