Machine Learning Pipelines for R


Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions – to go from the domain of the original input variables to the domain of the original output variables (via the model). This is usually quite a laborious and repetitive process that leads to messy code and notebooks.
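
To make the pain concrete, consider a minimal sketch of the manual version of this workflow in base R – the data and column names below (input, response) are hypothetical:

# hypothetical data with a log-linear relationship between input and response
raw_data <- data.frame(input = 1:50)
raw_data$response <- exp(0.5 * log(raw_data$input) + rnorm(50, sd = 0.1))

# transform both variables before fitting
train_df <- data.frame(x1 = log(raw_data$input), y = log(raw_data$response))
model <- lm(y ~ 1 + x1, data = train_df)

# scoring new data means repeating the transform, then inverting it on the predictions
new_df <- data.frame(x1 = log(c(10, 25, 40)))
pred_response <- exp(predict(model, new_df))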

The pipeliner package aims to provide an elegant solution to these issues by implementing a common interface and workflow with which it is possible to:

  • define transformation and inverse-transformation functions;
  • fit a model on training data; and then,
  • generate a prediction (or model-scoring) function that automatically applies the entire pipeline of transformations and inverse-transformations to the inputs and outputs of the inner-model and its predicted values (or scores) – a minimal sketch follows this list.
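
To make this concrete before diving into the details, here is a minimal sketch of an end-to-end pipeline for R's built-in faithful dataset, using log transformations in the spirit of the example above. Each of the four section types used here is described in the next section, and pred_model is the column pipeliner uses to hold the inner-model's predictions (as in the full worked example further below):

library(pipeliner)

log_pipeline <- pipeline(
  faithful,
  transform_features(function(df) data.frame(x1 = log(df$waiting))),
  transform_response(function(df) data.frame(y = log(df$eruptions))),
  estimate_model(function(df) lm(y ~ 1 + x1, df)),
  inv_transform_response(function(df) data.frame(pred_eruptions = exp(df$pred_model)))
)

# predictions are automatically mapped back to the original (untransformed) units
head(predict(log_pipeline, faithful))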

The idea of pipelines is inspired by the machine learning pipelines implemented in Apache Spark's MLlib library (which are in turn inspired by Python's scikit-learn package). This package is still in its infancy and the latest development version can be downloaded from its GitHub repository using the devtools package (bundled with RStudio):

devtools::install_github("alexioannides/pipeliner")

Pipes in the Pipeline

There are currently four types of pipeline section – a section being a function that wraps a user-defined function – that can be assembled into a pipeline:

  • transform_features: wraps a function that maps input variables (or features) to another space – e.g.,

    transform_features(function(df) {
      data.frame(x1 = log(df$var1))
    })
  • transform_response: wraps a function that maps the response variable to another space – e.g.,

    transform_response(function(df) {
      data.frame(y = log(df$response))
    })
  • estimate_model: wraps a function that defines how to estimate a model from training data in a data.frame – e.g.,

    estimate_model(function(df) {
      lm(y ~ 1 + x1, df)
    })
  • inv_transform_response: wraps a function that is the inverse to transform_response, so that we can map from the space of inner-model predictions back to the original response domain – e.g.,

    inv_transform_response(function(df) {
      data.frame(pred_response = exp(df$pred_model))
    })

As demonstrated above, each of these functions expects as its argument another unary function of a data.frame (i.e. it must be a function of a single data.frame). With the exception of estimate_model – whose input function must return an object with a predict method defined for its class in the current environment (e.g. predict.lm for linear models built using lm()) – the transform functions also expect their input functions to return data.frames consisting entirely of columns not present in the input data.frame. If any of these rules are violated then appropriately named errors will be thrown to help you locate the issue.
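
For example, any fitting function whose return value has a predict method for its class will satisfy the estimate_model contract – a sketch, assuming the same y and x1 columns as above, that swaps lm() for glm() (for which predict.glm exists):

# glm() returns an object of class "glm", for which predict.glm is defined,
# so it satisfies the estimate_model contract just like lm()
estimate_model(function(df) {
  glm(y ~ 1 + x1, data = df, family = gaussian())
})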

If this sounds complex and convoluted then I encourage you to skip to the examples below – this framework is very simple to use in practice. Simplicity is the key aim here.

Two Interfaces to Rule Them All

I am a great believer in, and proponent of, functional programming – especially for data-related tasks like building machine learning models. At the same time, the notion of a 'machine learning pipeline' is well represented by a simple object-oriented class hierarchy (which is how it is implemented in Apache Spark). I couldn't decide which style of interface was best, so I implemented both within pipeliner (using the same underlying code) and ensured their outputs can be used interchangeably. To keep this introduction simple, however, I'm only going to talk about the functional interface – those interested in the (more) object-oriented approach are encouraged to read the manual pages for the ml_pipeline_builder 'class'.
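
For a flavour of the alternative, here is a rough sketch of what the object-oriented route might look like – the method names below are my assumptions based on the manual pages, so treat ?ml_pipeline_builder as the authority:

# a rough sketch of the object-oriented interface (method names assumed;
# see ?ml_pipeline_builder for the authoritative API)
lm_pipeline <- ml_pipeline_builder()

lm_pipeline$transform_features(function(df) data.frame(x1 = log(df$waiting)))
lm_pipeline$transform_response(function(df) data.frame(y = log(df$eruptions)))
lm_pipeline$estimate_model(function(df) lm(y ~ 1 + x1, df))
lm_pipeline$inv_transform_response(function(df) data.frame(pred_eruptions = exp(df$pred_model)))

lm_pipeline$fit(faithful)
head(lm_pipeline$predict(faithful))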

Example Usage with a Functional Flavor

We use the faithful dataset shipped with R, together with the pipeliner package, to estimate a linear regression model for the eruption duration of ‘Old Faithful’ as a function of the inter-eruption waiting time. The transformations we apply to the input and response variables – before we estimate the model – are simple scaling by the mean and standard deviation (i.e. mapping the variables to z-scores).

The end-to-end process for building the pipeline, estimating the model and generating in-sample predictions (that include all interim variable transformations), is as follows,

library(pipeliner)

data <- faithful

lm_pipeline <- pipeline(
  data,

  # map the input variable to a z-score
  transform_features(function(df) {
    data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
  }),

  # map the response variable to a z-score
  transform_response(function(df) {
    data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
  }),

  # estimate a linear model in the transformed space
  estimate_model(function(df) {
    lm(y ~ 1 + x1, df)
  }),

  # map inner-model predictions back to the original response units
  inv_transform_response(function(df) {
    data.frame(pred_eruptions = df$pred_model * sd(df$eruptions) + mean(df$eruptions))
  })
)

in_sample_predictions <- predict(lm_pipeline, data, verbose = TRUE)
head(in_sample_predictions)
##   eruptions waiting         x1 pred_model pred_eruptions
## 1     3.600      79  0.5960248  0.5369058       4.100592
## 2     1.800      54 -1.2428901 -1.1196093       2.209893
## 3     3.333      74  0.2282418  0.2056028       3.722452
## 4     2.283      62 -0.6544374 -0.5895245       2.814917
## 5     4.533      85  1.0373644  0.9344694       4.554360
## 6     2.883      55 -1.1693335 -1.0533487       2.285521

Accessing Inner Models and Prediction Functions

We can access the estimated inner models directly and compute summaries, etc – for example,

summary(lm_pipeline$inner_model)
##
## Call:
## lm(formula = y ~ 1 + x1, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.13826 -0.33021  0.03074  0.30586  1.04549
##
## Residual standard error: 0.435 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Pipeline prediction functions can also be accessed directly in a similar way – for example,

pred_function <- lm_pipeline$predict
predictions <- pred_function(data, verbose = FALSE)

head(predictions)
##   pred_eruptions
## 1       4.100592
## 2       2.209893
## 3       3.722452
## 4       2.814917
## 5       4.554360
## 6       2.285521

Turbo-Charged Pipelines in the Tidyverse

The pipeliner approach to building models becomes even more concise when combined with the set of packages in the tidyverse (http://tidyverse.org). For example, the 'Old Faithful' pipeline could be rewritten as,

library(tidyverse)

lm_pipeline <- data %>%
  pipeline(
    transform_features(function(df) {
      transmute(df, x1 = (waiting - mean(waiting)) / sd(waiting))
    }),

    transform_response(function(df) {
      transmute(df, y = (eruptions - mean(eruptions)) / sd(eruptions))
    }),

    estimate_model(function(df) {
      lm(y ~ 1 + x1, df)
    }),

    inv_transform_response(function(df) {
      transmute(df, pred_eruptions = pred_model * sd(eruptions) + mean(eruptions))
    })
  )

head(predict(lm_pipeline, data))
## [1] 4.100592 2.209893 3.722452 2.814917 4.554360 2.285521

Nice, compact and expressive (if I do say so myself)!

Compact Cross-validation

If we now introduce the modelr package into this workflow and adopt the list-columns pattern described in Hadley Wickham's R for Data Science, we can also achieve wonderfully compact end-to-end model estimation and cross-validation,

library(modelr)

# define a function that estimates a machine learning pipeline on a single fold of the data
pipeline_func <- function(df) {
  pipeline(
    df,
    transform_features(function(df) {
      transmute(df, x1 = (waiting - mean(waiting)) / sd(waiting))
    }),

    transform_response(function(df) {
      transmute(df, y = (eruptions - mean(eruptions)) / sd(eruptions))
    }),

    estimate_model(function(df) {
      lm(y ~ 1 + x1, df)
    }),

    inv_transform_response(function(df) {
      transmute(df, pred_eruptions = pred_model * sd(eruptions) + mean(eruptions))
    })
  )
}

# 5-fold cross-validation using machine learning pipelines
cv_rmse <- crossv_kfold(data, 5) %>%
  mutate(model = map(train, ~ pipeline_func(as.data.frame(.x))),
         predictions = map2(model, test, ~ predict(.x, as.data.frame(.y))),
         residuals = map2(predictions, test, ~ .x - as.data.frame(.y)$eruptions),
         rmse = map_dbl(residuals, ~ sqrt(mean(.x ^ 2)))) %>%
  summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse))

cv_rmse
## # A tibble: 1 × 2
##   mean_rmse    sd_rmse
##       <dbl>      <dbl>
## 1 0.4877222 0.05314748

Forthcoming Attractions

I built pipeliner largely to fill a hole in my own workflows. Up until now I've used Max Kuhn's excellent caret package quite a bit, but for in-the-moment model building (e.g. within an R Notebook) it wasn't simplifying the code that much, and the style doesn't quite fit with the tidy and functional world that I now inhabit most of the time. So, I plugged the hole myself. I intend to live with pipeliner for a while to get an idea of where it might go next, but I am always open to suggestions (and bug notifications) – please leave any ideas here.

