MLlib stands for Machine Learning Library. MLlib provides functionality in three core areas:
- Data preparation: feature extraction, transformation, selection, hashing of categorical features, and some natural language processing methods
- Machine learning algorithms: implementations of some popular and advanced regression, classification, and clustering algorithms
- Utilities: statistical methods such as descriptive statistics, the chi-square test, linear algebra (sparse and dense matrices and vectors), and model evaluation methods
Loading and transforming the data
import pyspark.sql.types as typ
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
labels = [
('INFANT_ALIVE_AT_REPORT', typ.StringType()),
('BIRTH_YEAR', typ.IntegerType()),
('BIRTH_MONTH', typ.IntegerType()),
('BIRTH_PLACE', typ.StringType()),
('MOTHER_AGE_YEARS', typ.IntegerType()),
('MOTHER_RACE_6CODE', typ.StringType()),
('MOTHER_EDUCATION', typ.StringType()),
('FATHER_COMBINED_AGE', typ.IntegerType()),
('FATHER_EDUCATION', typ.StringType()),
('MONTH_PRECARE_RECODE', typ.StringType()),
('CIG_BEFORE', typ.IntegerType()),
('CIG_1_TRI', typ.IntegerType()),
('CIG_2_TRI', typ.IntegerType()),
('CIG_3_TRI', typ.IntegerType()),
('MOTHER_HEIGHT_IN', typ.IntegerType()),
('MOTHER_BMI_RECODE', typ.IntegerType()),
('MOTHER_PRE_WEIGHT', typ.IntegerType()),
('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
('DIABETES_PRE', typ.StringType()),
('DIABETES_GEST', typ.StringType()),
('HYP_TENS_PRE', typ.StringType()),
('HYP_TENS_GEST', typ.StringType()),
('PREV_BIRTH_PRETERM', typ.StringType()),
('NO_RISK', typ.StringType()),
('NO_INFECTIONS_REPORTED', typ.StringType()),
('LABOR_IND', typ.StringType()),
('LABOR_AUGM', typ.StringType()),
('STEROIDS', typ.StringType()),
('ANTIBIOTICS', typ.StringType()),
('ANESTHESIA', typ.StringType()),
('DELIV_METHOD_RECODE_COMB', typ.StringType()),
('ATTENDANT_BIRTH', typ.StringType()),
('APGAR_5', typ.IntegerType()),
('APGAR_5_RECODE', typ.StringType()),
('APGAR_10', typ.IntegerType()),
('APGAR_10_RECODE', typ.StringType()),
('INFANT_SEX', typ.StringType()),
('OBSTETRIC_GESTATION_WEEKS', typ.IntegerType()),
('INFANT_WEIGHT_GRAMS', typ.IntegerType()),
('INFANT_ASSIST_VENTI', typ.StringType()),
('INFANT_ASSIST_VENTI_6HRS', typ.StringType()),
('INFANT_NICU_ADMISSION', typ.StringType()),
('INFANT_SURFACANT', typ.StringType()),
('INFANT_ANTIBIOTICS', typ.StringType()),
('INFANT_SEIZURES', typ.StringType()),
('INFANT_NO_ABNORMALITIES', typ.StringType()),
('INFANT_ANCEPHALY', typ.StringType()),
('INFANT_MENINGOMYELOCELE', typ.StringType()),
('INFANT_LIMB_REDUCTION', typ.StringType()),
('INFANT_DOWN_SYNDROME', typ.StringType()),
('INFANT_SUSPECTED_CHROMOSOMAL_DISORDER', typ.StringType()),
('INFANT_NO_CONGENITAL_ANOMALIES_CHECKED', typ.StringType()),
('INFANT_BREASTFED', typ.StringType())
]
schema = typ.StructType([
typ.StructField(e[0], e[1], False) for e in labels
])
Load the data. The .read.csv() method can read either uncompressed or gzip-compressed comma-separated values. The header parameter set to True indicates that the first row contains the header, and we use the schema to specify the correct data types:
births = spark.read.csv('file:///Program Files/Pyproject/pyspark/data/births_train.csv.gz',
header=True,
schema=schema)
# 首先定义重编码字典:
recode_dictionary = {
'YNU':{
'Y': 1,
'N': 0,
'U': 0
}
}
selected_features = [
'INFANT_ALIVE_AT_REPORT',
'BIRTH_PLACE',
'MOTHER_AGE_YEARS',
'FATHER_COMBINED_AGE',
'CIG_BEFORE',
'CIG_1_TRI',
'CIG_2_TRI',
'CIG_3_TRI',
'MOTHER_HEIGHT_IN',
'MOTHER_PRE_WEIGHT',
'MOTHER_DELIVERY_WEIGHT',
'MOTHER_WEIGHT_GAIN',
'DIABETES_PRE',
'DIABETES_GEST',
'HYP_TENS_PRE',
'HYP_TENS_GEST',
'PREV_BIRTH_PRETERM'
]
births_trimmed = births.select(selected_features)
A large number of features in the dataset take the values Yes/No/Unknown; we will encode Yes as 1 and everything else as 0.
There is also a small issue with how the number of cigarettes smoked by the mother is coded: 0 means the mother smoked no cigarettes before or during the pregnancy; a value between 1 and 97 states the actual number of cigarettes smoked; 98 indicates 98 or more; and 99 means the actual number is unknown. We will assume the unknown case is 0 and recode accordingly.
import pyspark.sql.functions as func
def recode(col, key):
    return recode_dictionary[key][col]

def correct_cig(feat):
    return func.when(func.col(feat) != 99, func.col(feat)).otherwise(0)
rec_integer = func.udf(recode, typ.IntegerType())
The recode method looks up the correct key in the recode_dictionary (given the key) and returns the corrected value. The correct_cig method checks whether the value of the feature feat is not equal to 99 and, if so, returns the feature's value; if the value equals 99, it returns 0.
We cannot use the recode function directly on a DataFrame; it needs to be converted to a UDF that Spark will understand. The rec_integer function does just that: by passing our recode function and specifying the return value data type, we can then use it to recode the Yes/No/Unknown features.
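Before wrapping the lookup in a UDF, the dictionary logic itself can be checked in plain Python (a minimal sketch using the same recode_dictionary as above):

```python
# Same mapping as defined earlier: Yes -> 1, No/Unknown -> 0
recode_dictionary = {
    'YNU': {
        'Y': 1,
        'N': 0,
        'U': 0
    }
}

def recode(col, key):
    # Look up the sub-dictionary for `key` and map the raw value to an integer
    return recode_dictionary[key][col]

print(recode('Y', 'YNU'))  # 1
print(recode('U', 'YNU'))  # 0
```

This is exactly the function that func.udf(recode, typ.IntegerType()) turns into a column-level transformation.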
births_transformed = births_trimmed.withColumn('CIG_BEFORE', correct_cig('CIG_BEFORE'))\
.withColumn('CIG_1_TRI', correct_cig('CIG_1_TRI'))\
.withColumn('CIG_2_TRI', correct_cig('CIG_2_TRI'))\
.withColumn('CIG_3_TRI', correct_cig('CIG_3_TRI'))
The .withColumn() method takes the column name as its first parameter and the transformation as the second one.
cols = [(col.name, col.dataType) for col in births_transformed.schema]
YNU_cols = []
for i, s in enumerate(cols):
    if s[1] == typ.StringType():
        dis = births.select(s[0]).distinct().rdd.map(lambda row: row[0]).collect()
        if 'Y' in dis:
            YNU_cols.append(s[0])
First, we create a list of tuples (cols) holding the column names and their corresponding data types. Next, we loop through the list and calculate the distinct values of all string columns; if 'Y' is among the returned values, we append the column name to the YNU_cols list.
DataFrames can transform features in bulk while selecting them.
births.select([
'INFANT_NICU_ADMISSION',
rec_integer(
'INFANT_NICU_ADMISSION', func.lit('YNU')).alias('INFANT_NICU_ADMISSION_RECODE')
]).take(5)
[Row(INFANT_NICU_ADMISSION='Y', INFANT_NICU_ADMISSION_RECODE=1),
Row(INFANT_NICU_ADMISSION='Y', INFANT_NICU_ADMISSION_RECODE=1),
Row(INFANT_NICU_ADMISSION='U', INFANT_NICU_ADMISSION_RECODE=0),
Row(INFANT_NICU_ADMISSION='N', INFANT_NICU_ADMISSION_RECODE=0),
Row(INFANT_NICU_ADMISSION='U', INFANT_NICU_ADMISSION_RECODE=0)]
We select the 'INFANT_NICU_ADMISSION' column and pass the name of the feature to the rec_integer method. We also alias the newly transformed column as 'INFANT_NICU_ADMISSION_RECODE'. This way we can check that the UDF works as intended.
exprs_YNU = [
rec_integer(x, func.lit('YNU')).alias(x)
if x in YNU_cols
else x
for x in births_transformed.columns
]
births_transformed = births_transformed.select(exprs_YNU)
births_transformed.select(YNU_cols[-5:]).show(5)
+------------------+
|PREV_BIRTH_PRETERM|
+------------------+
| 0|
| 0|
| 0|
| 1|
| 0|
+------------------+
only showing top 5 rows
Descriptive statistics
The .colStats() method calculates the descriptive statistics based on a sample.
The method works on an RDD of data and returns a MultivariateStatisticalSummary object that contains the following descriptive statistics:
- count: the row count
- max: the maximum value in the column
- mean: the mean of the values in the column
- min: the minimum value in the column
- normL1: the L1-norm of the values in the column
- normL2: the L2-norm of the values in the column
- numNonzeros: the number of non-zero values in the column
- variance: the variance of the values in the column
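These column-wise summaries can be reproduced with NumPy on a toy array (a sketch on made-up numbers; note that MLlib's variance is the unbiased sample variance, hence ddof=1):

```python
import numpy as np

# Each row is an observation, each column a feature
data = np.array([[28., 44.],
                 [30., 40.],
                 [25., 47.]])

count = data.shape[0]                      # row count
col_max = data.max(axis=0)                 # max per column
col_min = data.min(axis=0)                 # min per column
mean = data.mean(axis=0)                   # mean per column
variance = data.var(axis=0, ddof=1)        # unbiased sample variance
normL1 = np.abs(data).sum(axis=0)          # L1-norm per column
normL2 = np.sqrt((data ** 2).sum(axis=0))  # L2-norm per column
numNonzeros = (data != 0).sum(axis=0)      # non-zero count per column
```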
import pyspark.mllib.stat as st
import numpy as np
numeric_cols = ['MOTHER_AGE_YEARS','FATHER_COMBINED_AGE',
'CIG_BEFORE','CIG_1_TRI','CIG_2_TRI','CIG_3_TRI',
'MOTHER_HEIGHT_IN','MOTHER_PRE_WEIGHT',
'MOTHER_DELIVERY_WEIGHT','MOTHER_WEIGHT_GAIN'
]
numeric_rdd = births_transformed\
.select(numeric_cols)\
.rdd \
.map(lambda row: [e for e in row])
mllib_stats = st.Statistics.colStats(numeric_rdd)
for col, m, v in zip(numeric_cols,
                     mllib_stats.mean(),
                     mllib_stats.variance()):
    print('{0}: \t{1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))
MOTHER_AGE_YEARS: 28.30 6.08
FATHER_COMBINED_AGE: 44.55 27.55
CIG_BEFORE: 1.43 5.18
CIG_1_TRI: 0.91 3.83
CIG_2_TRI: 0.70 3.31
CIG_3_TRI: 0.58 3.11
MOTHER_HEIGHT_IN: 65.12 6.45
MOTHER_PRE_WEIGHT: 214.50 210.21
MOTHER_DELIVERY_WEIGHT: 223.63 180.01
MOTHER_WEIGHT_GAIN: 30.74 26.23
categorical_cols = [e for e in births_transformed.columns
if e not in numeric_cols]
categorical_rdd = births_transformed\
.select(categorical_cols)\
.rdd \
.map(lambda row: [e for e in row])
for i, col in enumerate(categorical_cols):
    agg = categorical_rdd.groupBy(lambda row: row[i]).map(lambda row: (row[0], len(row[1])))
    print(col, sorted(agg.collect(), key=lambda el: el[1], reverse=True))
corrs = st.Statistics.corr(numeric_rdd)
for i, el in enumerate(corrs > 0.5):
    correlated = [
        (numeric_cols[j], corrs[i][j])
        for j, e in enumerate(el)
        if e == 1.0 and j != i]
    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1} : {2:.2f}'.format(numeric_cols[i], e[0], e[1]))
CIG_BEFORE-to-CIG_1_TRI : 0.83
CIG_BEFORE-to-CIG_2_TRI : 0.72
CIG_BEFORE-to-CIG_3_TRI : 0.62
CIG_1_TRI-to-CIG_BEFORE : 0.83
CIG_1_TRI-to-CIG_2_TRI : 0.87
CIG_1_TRI-to-CIG_3_TRI : 0.76
CIG_2_TRI-to-CIG_BEFORE : 0.72
CIG_2_TRI-to-CIG_1_TRI : 0.87
CIG_2_TRI-to-CIG_3_TRI : 0.89
CIG_3_TRI-to-CIG_BEFORE : 0.62
CIG_3_TRI-to-CIG_1_TRI : 0.76
CIG_3_TRI-to-CIG_2_TRI : 0.89
MOTHER_PRE_WEIGHT-to-MOTHER_DELIVERY_WEIGHT : 0.54
MOTHER_PRE_WEIGHT-to-MOTHER_WEIGHT_GAIN : 0.65
MOTHER_DELIVERY_WEIGHT-to-MOTHER_PRE_WEIGHT : 0.54
MOTHER_DELIVERY_WEIGHT-to-MOTHER_WEIGHT_GAIN : 0.60
MOTHER_WEIGHT_GAIN-to-MOTHER_PRE_WEIGHT : 0.65
MOTHER_WEIGHT_GAIN-to-MOTHER_DELIVERY_WEIGHT : 0.60
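The same >0.5 screening can be sketched with NumPy's corrcoef on toy data (the three rows here are made-up series, not the birth data):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100)
data = np.vstack([
    x,
    x + 0.1 * np.random.randn(100),  # strongly correlated with x
    np.random.randn(100),            # independent noise
])
corrs = np.corrcoef(data)  # 3x3 Pearson correlation matrix

# List the off-diagonal pairs whose correlation exceeds 0.5
pairs = [(i, j, corrs[i, j])
         for i in range(corrs.shape[0])
         for j in range(corrs.shape[1])
         if i != j and corrs[i, j] > 0.5]
```

As in the output above, each highly correlated pair appears twice, once in each direction, because the correlation matrix is symmetric.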
The weight features are highly correlated, so we keep only 'MOTHER_PRE_WEIGHT' (and, among the equally correlated cigarette counts, only 'CIG_1_TRI'):
featrues_to_keep = [
'INFANT_ALIVE_AT_REPORT',
'BIRTH_PLACE',
'MOTHER_AGE_YEARS',
'FATHER_COMBINED_AGE',
'CIG_1_TRI',
'MOTHER_HEIGHT_IN',
'MOTHER_PRE_WEIGHT',
'DIABETES_PRE',
'DIABETES_GEST',
'HYP_TENS_PRE',
'HYP_TENS_GEST',
'PREV_BIRTH_PRETERM'
]
births_transformed = births_transformed.select([e for e in featrues_to_keep])
Statistical testing
We use MLlib's .chiSqTest() method:
import pyspark.mllib.linalg as ln
for cat in categorical_cols[1:]:
    agg = births_transformed \
        .groupby('INFANT_ALIVE_AT_REPORT') \
        .pivot(cat) \
        .count()
    agg_rdd = agg.rdd \
        .map(lambda row: (row[1:])) \
        .flatMap(lambda row: [0 if e == None else e for e in row]) \
        .collect()

    row_length = len(agg.collect()[0]) - 1
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)

    test = st.Statistics.chiSqTest(agg)
    print(cat, round(test.pValue, 4))
print(ln.Matrices.dense(3,2, [1,2,3,4,5,6]))
DenseMatrix([[1., 4.],
[2., 5.],
[3., 6.]])
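As the printout shows, .dense() fills the matrix column by column. The chi-square statistic that .chiSqTest() computes on such a contingency matrix can be worked out by hand with NumPy (toy counts, not the birth data):

```python
import numpy as np

# Toy 3x2 contingency table: rows = category levels, cols = outcome classes
obs = np.array([[10., 40.],
                [20., 30.],
                [30., 20.]])

total = obs.sum()
row_tot = obs.sum(axis=1, keepdims=True)
col_tot = obs.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / total             # expected counts under independence
chi2 = ((obs - expected) ** 2 / expected).sum()  # Pearson chi-square statistic
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)    # degrees of freedom
```

The p-value reported by .chiSqTest() is then the tail probability of the chi-square distribution with dof degrees of freedom at chi2.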
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg
hashing = ft.HashingTF(7)
births_hashed = births_transformed.rdd \
    .map(lambda row: [
        list(hashing.transform(row[1]).toArray())
        if col == 'BIRTH_PLACE'
        else row[i]
        for i, col in enumerate(featrues_to_keep)]) \
    .map(lambda row: [[e] if type(e) == int else e for e in row]) \
    .map(lambda row: [item for sublist in row for item in sublist]) \
    .map(lambda row: reg.LabeledPoint(row[0], ln.Vectors.dense(row[1:])))
We create the hashing model. Because our feature has seven levels, we use as many features for the hashing trick. The model is then used to convert the 'BIRTH_PLACE' feature into a SparseVector; this data structure is preferred when the data has many columns but only a few of them carry non-zero values in a row. We then combine all the features together and finally create a LabeledPoint.
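The hashing trick itself can be sketched in plain Python (using a stable CRC32 hash here rather than MLlib's internal hash function, since Python's built-in hash() is not stable across runs):

```python
import zlib

def hashing_tf(terms, num_features=7):
    # Map each term to a bucket via a stable hash and count occurrences,
    # mirroring what HashingTF does for a document of terms
    vec = [0] * num_features
    for term in terms:
        idx = zlib.crc32(term.encode('utf-8')) % num_features
        vec[idx] += 1
    return vec

vec = hashing_tf(['1'])  # a one-term document, e.g. a BIRTH_PLACE code
```

Collisions are possible: two distinct levels may land in the same bucket, which is the price paid for a fixed-size representation.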
Splitting into training and testing data
We need to split the dataset into two sets: one for training and one for testing. RDDs have a handy method to do just that: .randomSplit(). The method takes a list of proportions that are to be used to randomly split the dataset.
births_train, births_test = births_hashed.randomSplit([0.6, 0.4])
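What .randomSplit([0.6, 0.4]) does can be mimicked in plain Python: each element is independently assigned to a split with the given probability, so the 60/40 proportions are approximate rather than exact (a sketch):

```python
import random

random.seed(666)
data = list(range(10000))

train, test = [], []
for e in data:
    # Each element independently goes to `train` with probability 0.6
    (train if random.random() < 0.6 else test).append(e)

ratio = len(train) / len(data)  # close to 0.6, but not exactly 0.6
```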
Predicting the chances of infant survival
We will build two models: a linear classifier (logistic regression) and a non-linear classifier (a random forest). For the former, we will use all the features; for the latter, we will use the .chiSqSelector() method to select the top four features.
Logistic regression in MLlib
Logistic regression is in some sense a benchmark for building any classification model. MLlib used to provide a logistic regression model estimated with the stochastic gradient descent (SGD) algorithm. This model was deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.
The LogisticRegressionWithLBFGS model uses the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization algorithm, a quasi-Newton method that approximates the BFGS algorithm.
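Whichever optimizer fits the weights, the logistic model itself predicts via the sigmoid of a linear combination of the features (a minimal sketch with made-up weights, not the trained model's coefficients):

```python
import math

def predict_proba(weights, intercept, features):
    # P(y = 1 | x) = sigmoid(w . x + b)
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

p = predict_proba([0.5, -0.25], 0.1, [1.0, 2.0])  # a probability in (0, 1)
```

L-BFGS only changes how the weights are found, not how predictions are made; thresholding p at 0.5 gives the class label.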
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
LR_Model = LogisticRegressionWithLBFGS.train(births_train, iterations=10)
from pyspark.mllib.tree import RandomForest
selector = ft.ChiSqSelector(4).fit(births_train)

top_Features_train = (
    births_train.map(lambda row: row.label) \
        .zip(selector \
            .transform(births_train.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))

top_Features_test = (
    births_test.map(lambda row: row.label) \
        .zip(selector \
            .transform(births_test.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))
RF_model = RandomForest.trainClassifier(data=top_Features_train,
                                        numClasses=2,
                                        categoricalFeaturesInfo={},
                                        numTrees=6,
                                        featureSubsetStrategy='all',
                                        seed=666)
import pyspark.mllib.evaluation as ev

RF_results = (
    top_Features_test.map(lambda row: row.label) \
        .zip(RF_model.predict(top_Features_test \
            .map(lambda row: row.features)))
)
RF_evaluation = ev.BinaryClassificationMetrics(RF_results)
print('Area under PR: {0:.2f}'.format(RF_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}'.format(RF_evaluation.areaUnderROC))
RF_evaluation.unpersist()