Other ML Features (Part 2)

Classification

Use RandomForestClassifier to model the chances of survival of an infant.

from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.ml import Pipeline
import pyspark.sql.types as typ
import pyspark.ml.classification as cl
import pyspark.ml.feature as ft
spark = SparkSession.builder.master('local').appName('RandomForest').getOrCreate()
labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.IntegerType()),
    ('DIABETES_GEST', typ.IntegerType()),
    ('HYP_TENS_PRE', typ.IntegerType()),
    ('HYP_TENS_GEST', typ.IntegerType()),
    ('PREV_BIRTH_PRETERM', typ.IntegerType())
]
schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])
births = spark.read.csv('file:///Program Files/Pyproject/pyspark/data/births_transformed.csv.gz',
                       header=True,
                       schema=schema)
births = births.withColumn('BIRTH_PLACE_INT',
                           births['BIRTH_PLACE'].cast(typ.IntegerType()))
encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT',
                           outputCol='BIRTH_PLACE_VEC')
featuresCreator = ft.VectorAssembler(
    inputCols=[col[0] for col in labels[2:]] + [encoder.getOutputCol()],
    outputCol='features'
)
births = births.withColumn(
    'INFANT_ALIVE_AT_REPORT',
    func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType())
)
births_train, births_test = births.randomSplit([0.7, 0.3], seed=6)
classifier = cl.RandomForestClassifier(numTrees=5,
                                      maxDepth=5,
                                      labelCol='INFANT_ALIVE_AT_REPORT')
pipeline = Pipeline(stages=[encoder,
                            featuresCreator,
                            classifier])
model = pipeline.fit(births_train)
test = model.transform(births_test)
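
The fitted PipelineModel keeps the trained forest as its last stage, so the ensemble itself can be inspected. A minimal sketch (this inspection is an addition, not part of the original listing):

rfModel = model.stages[-1]         # fitted RandomForestClassificationModel
print(rfModel.featureImportances)  # SparseVector of per-feature importances
print(len(rfModel.trees))          # the five individual decision trees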

Check how the RandomForestClassifier model performs compared with the LogisticRegression model:

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test, {evaluator.metricName: 'areaUnderPR'}))
0.7596829423006186
0.7388027863240764

Test how a model with a single tree performs:

Rclassifier = cl.DecisionTreeClassifier(
    maxDepth=5,
    labelCol='INFANT_ALIVE_AT_REPORT'
)
pipeline = Pipeline(stages=[encoder,
                            featuresCreator,
                            Rclassifier])
model = pipeline.fit(births_train)
test = model.transform(births_test)
evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test, {evaluator.metricName: 'areaUnderPR'}))
0.732816393095637
0.7247560264070172
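
Since this model is a single tree, its learned splits can be printed directly. A minimal sketch (not in the original listing):

treeModel = model.stages[-1]    # fitted DecisionTreeClassificationModel
print(treeModel.toDebugString)  # text rendering of the tree's split rules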
Clustering

Clustering is another important part of machine learning.

# Use a KMeans model to find similar observations
import pyspark.ml.clustering as clus
kmeans = clus.KMeans(k=5, featuresCol='features')
pipeline = Pipeline(stages=[encoder,
                            featuresCreator,
                            kmeans])
model = pipeline.fit(births_train)

Examine the differences between the clusters:

test = model.transform(births_test)
test.groupBy('prediction').agg({'*': 'count',
                               'MOTHER_HEIGHT_IN': 'avg'
                               }).collect()
[Row(prediction=1, avg(MOTHER_HEIGHT_IN)=66.08835341365462, count(1)=249),
 Row(prediction=3, avg(MOTHER_HEIGHT_IN)=84.95989974937343, count(1)=399),
 Row(prediction=4, avg(MOTHER_HEIGHT_IN)=63.927140458869616, count(1)=8935),
 Row(prediction=2, avg(MOTHER_HEIGHT_IN)=67.26710816777042, count(1)=453),
 Row(prediction=0, avg(MOTHER_HEIGHT_IN)=65.41015200868621, count(1)=3684)]
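
The fitted KMeansModel, the last stage of the pipeline, also exposes the coordinates of the five cluster centers. A minimal sketch (not in the original listing):

kmeansModel = model.stages[-1]
for center in kmeansModel.clusterCenters():  # one vector per cluster
    print(center)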
Topic mining

Clustering models are not limited to numeric data. In natural language processing (NLP), problems such as topic extraction can also rely on clustering to detect documents with similar topics.
First, create the dataset. It consists of paragraphs selected at random from the Internet: three of them deal with nature and national parks, while the remaining three cover technology topics.

text_data = spark.createDataFrame([
    ['''To make a computer do anything, you have to write a 
    computer program. To write a computer program, you have 
    to tell the computer, step by step, exactly what you want 
    it to do. The computer then "executes" the program, 
    following each step mechanically, to accomplish the end 
    goal. When you are telling the computer what to do, you 
    also get to choose how it's going to do it. That's where 
    computer algorithms come in. The algorithm is the basic 
    technique used to get the job done. Let's follow an 
    example to help get an understanding of the algorithm 
    concept.'''],
    ['''Laptop computers use batteries to run while not 
    connected to mains. When we overcharge or overheat 
    lithium ion batteries, the materials inside start to 
    break down and produce bubbles of oxygen, carbon dioxide, 
    and other gases. Pressure builds up, and the hot battery 
    swells from a rectangle into a pillow shape. Sometimes 
    the phone involved will operate afterwards. Other times 
    it will die. And occasionally—kapow! To see what's 
    happening inside the battery when it swells, the CLS team 
    used an x-ray technology called computed tomography.'''],
    ['''This technology describes a technique where touch 
    sensors can be placed around any side of a device 
    allowing for new input sources. The patent also notes 
    that physical buttons (such as the volume controls) could 
    be replaced by these embedded touch sensors. In essence 
    Apple could drop the current buttons and move towards 
    touch-enabled areas on the device for the existing UI. It 
    could also open up areas for new UI paradigms, such as 
    using the back of the smartphone for quick scrolling or 
    page turning.'''],
    ['''The National Park Service is a proud protector of 
    America’s lands. Preserving our land not only safeguards 
    the natural environment, but it also protects the 
    stories, cultures, and histories of our ancestors. As we 
    face the increasingly dire consequences of climate 
    change, it is imperative that we continue to expand 
    America’s protected lands under the oversight of the 
    National Park Service. Doing so combats climate change 
    and allows all American’s to visit, explore, and learn 
    from these treasured places for generations to come. It 
    is critical that President Obama acts swiftly to preserve 
    land that is at risk of external threats before the end 
    of his term as it has become blatantly clear that the 
    next administration will not hold the same value for our 
    environment over the next four years.'''],
    ['''The National Park Foundation, the official charitable 
    partner of the National Park Service, enriches America’s 
    national parks and programs through the support of 
    private citizens, park lovers, stewards of nature, 
    history enthusiasts, and wilderness adventurers. 
    Chartered by Congress in 1967, the Foundation grew out of 
    a legacy of park protection that began over a century 
    ago, when ordinary citizens took action to establish and 
    protect our national parks. Today, the National Park 
    Foundation carries on the tradition of early park 
    advocates, big thinkers, doers and dreamers—from John 
    Muir and Ansel Adams to President Theodore Roosevelt.'''],
    ['''Australia has over 500 national parks. Over 28 
    million hectares of land is designated as national 
    parkland, accounting for almost four per cent of 
    Australia's land areas. In addition, a further six per 
    cent of Australia is protected and includes state 
    forests, nature parks and conservation reserves.National 
    parks are usually large areas of land that are protected 
    because they have unspoilt landscapes and a diverse 
    number of native plants and animals. This means that 
    commercial activities such as farming are prohibited and 
    human activity is strictly monitored.''']
], ['documents'])
# Use the RegexTokenizer and StopWordsRemover transformers:
tokenizer = ft.RegexTokenizer(inputCol='documents',
                              outputCol='input_arr',
                              pattern=r'\s+|[,."]')
stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(),
                                outputCol='input_stop')

Next in the pipeline is CountVectorizer: this model counts the words in each document and returns a count vector. The vector's length equals the total number of distinct words across all documents.

stringIndexer = ft.CountVectorizer(inputCol=stopwords.getOutputCol(),
                                  outputCol='input_indexed')
tokenized = stopwords.transform(tokenizer.transform(text_data))
stringIndexer.fit(tokenized).transform(tokenized).select('input_indexed').take(2)
[Row(input_indexed=SparseVector(257, {2: 7.0, 6: 1.0, 7: 3.0, 8: 3.0, 13: 3.0, 14: 1.0, 15: 2.0, 19: 1.0, 22: 2.0, 23: 1.0, 38: 1.0, 69: 1.0, 83: 1.0, 108: 1.0, 112: 1.0, 122: 1.0, 124: 1.0, 126: 1.0, 136: 1.0, 160: 1.0, 178: 1.0, 184: 1.0, 186: 1.0, 196: 1.0, 202: 1.0, 224: 1.0, 229: 1.0, 236: 1.0, 237: 1.0, 240: 1.0, 243: 1.0, 249: 1.0, 253: 1.0})),
 Row(input_indexed=SparseVector(257, {23: 1.0, 24: 2.0, 30: 2.0, 31: 1.0, 37: 2.0, 40: 2.0, 47: 1.0, 52: 1.0, 53: 1.0, 59: 1.0, 60: 1.0, 70: 1.0, 71: 1.0, 74: 1.0, 76: 1.0, 89: 1.0, 91: 1.0, 96: 1.0, 97: 1.0, 99: 1.0, 101: 1.0, 102: 1.0, 107: 1.0, 109: 1.0, 117: 1.0, 127: 1.0, 130: 1.0, 138: 1.0, 141: 1.0, 148: 1.0, 149: 1.0, 153: 1.0, 158: 1.0, 167: 1.0, 172: 1.0, 181: 1.0, 182: 1.0, 183: 1.0, 187: 1.0, 195: 1.0, 199: 1.0, 208: 1.0, 216: 1.0, 227: 1.0, 228: 1.0, 233: 1.0, 247: 1.0}))]

The texts contain 257 distinct words, and each document is represented by the counts of each word's occurrences.
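
To verify this, the learned vocabulary can be read off the fitted CountVectorizerModel via its vocabulary attribute. A minimal sketch (re-fitting here purely for inspection; this check is not part of the original listing):

cv_model = stringIndexer.fit(tokenized)
print(len(cv_model.vocabulary))  # 257 distinct words
print(cv_model.vocabulary[:5])   # the most frequent terms come first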
To predict the topics, we use the LDA model: Latent Dirichlet Allocation.

clustering = clus.LDA(k=2,
                     optimizer='online',
                     featuresCol=stringIndexer.getOutputCol())

The k parameter specifies the number of topics; the optimizer parameter can be either 'online' or 'em' (the latter stands for the expectation-maximization algorithm).

# Build the pipeline
pipeline = Pipeline(stages=[tokenizer,
                            stopwords,
                            stringIndexer,
                            clustering])
topics = pipeline.fit(text_data).transform(text_data)
topics.select('topicDistribution').collect()
[Row(topicDistribution=DenseVector([0.9886, 0.0114])),
 Row(topicDistribution=DenseVector([0.1736, 0.8264])),
 Row(topicDistribution=DenseVector([0.0239, 0.9761])),
 Row(topicDistribution=DenseVector([0.2119, 0.7881])),
 Row(topicDistribution=DenseVector([0.0208, 0.9792])),
 Row(topicDistribution=DenseVector([0.0118, 0.9882]))]
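
To interpret what each topic is about, the fitted LDA stage's describeTopics() can be combined with the CountVectorizer vocabulary. A minimal sketch (it re-fits the pipeline so the fitted stages stay accessible; the variable names here are my own):

ldaPipelineModel = pipeline.fit(text_data)
ldaModel = ldaPipelineModel.stages[-1]            # fitted LDAModel
vocab = ldaPipelineModel.stages[2].vocabulary     # fitted CountVectorizerModel
for row in ldaModel.describeTopics(5).collect():  # top 5 terms per topic
    print([vocab[i] for i in row['termIndices']])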
Regression

Predict MOTHER_WEIGHT_GAIN

features = ['MOTHER_AGE_YEARS', 'MOTHER_HEIGHT_IN',
           'MOTHER_PRE_WEIGHT', 'DIABETES_PRE',
           'DIABETES_GEST', 'HYP_TENS_PRE',
           'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM',
           'CIG_BEFORE', 'CIG_1_TRI', 'CIG_2_TRI',
           'CIG_3_TRI']

Since all of these features are numeric, we assemble them together and use ChiSqSelector to select only the six most important features:

featuresCreator = ft.VectorAssembler(inputCols=features[1:],
                                     outputCol='features')
selector = ft.ChiSqSelector(numTopFeatures=6,
                            outputCol='selectedFeatures',
                            labelCol='MOTHER_WEIGHT_GAIN')

To predict the weight gain, use a gradient-boosted trees regressor:

import pyspark.ml.regression as reg
regressor = reg.GBTRegressor(maxDepth=3,
                            maxIter=15,
                            labelCol='MOTHER_WEIGHT_GAIN')
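
Note that, as configured, GBTRegressor reads its default featuresCol='features', so the ChiSqSelector output is not actually consumed downstream; the pipeline below keeps this original setup and its reported result. To train on only the six selected features, the regressor would have to point at the selector's output column, e.g. (hypothetical variant; would change the result):

regressor_selected = reg.GBTRegressor(maxDepth=3,
                                      maxIter=15,
                                      labelCol='MOTHER_WEIGHT_GAIN',
                                      featuresCol=selector.getOutputCol())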
# Create the pipeline
pipeline = Pipeline(stages=[featuresCreator,
                            selector,
                            regressor])
weightGain = pipeline.fit(births_train)

Having created the weightGain model, see whether it performs well on the testing data:

evaluator = ev.RegressionEvaluator(predictionCol='prediction',
                                  labelCol='MOTHER_WEIGHT_GAIN')
print(evaluator.evaluate(weightGain.transform(births_test),
                        {evaluator.metricName: 'r2'}))
0.48370094215000026
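
RegressionEvaluator supports metrics other than r2 as well, for example rmse or mae, requested the same way (a sketch; output not shown):

print(evaluator.evaluate(weightGain.transform(births_test),
                        {evaluator.metricName: 'rmse'}))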

The model has no other independent features that correlate better with the MOTHER_WEIGHT_GAIN label, so it cannot sufficiently explain its variation.
