The key metrics separating real-time analytics from more traditional, batch, off-line analytics are latency and availability. At one end of the spectrum are complex, long-running batch analyses with slow response times and low availability requirements. At the other end are lightweight, real-time analytic applications with fast response times (on the order of milliseconds) and high availability (99.99..%). Another distinction is the capability to respond to individual events, or to ordered sequences of events, versus handling only collections of events in micro-batches without preserving their order. The complexity of the analysis performed on the real-time data is also a major differentiator: capabilities range from simple filtering and aggregation to complex predictive procedures. The level of integration between model generation and model scoring also needs to be considered for real-time applications. Machine learning algorithms specially designed for online model building exist and are offered by some streaming data platforms, but their number is small. Practical solutions can instead be built by combining an off-line model generation platform with a data streaming platform augmented with scoring capabilities.
In this blog we describe a new prototype for real-time analytics integrating two components: Oracle Stream Explorer (OSX) and Oracle R Enterprise (ORE). Examples of target applications for this type of integration are equipment monitoring through sensors, anomaly detection, and failure prediction for large systems composed of many components.
The basic architecture is illustrated below:
OSX is a middleware platform for developing streaming data applications. These applications monitor and process large amounts of streaming data in real time, from a multitude of sources such as sensors, social media, and financial feeds. Readers unfamiliar with OSX should visit Getting Started with Event Processing for OSX.
In OSX, streaming data flows into, through, and out of an application. Applications can be created, configured, and deployed with pre-built components provided with the platform, or built from customized adapters and event beans. The application in this case is a custom scoring application for real-time data. A thorough description of the application building process can be found in the following guide: Developing Applications for Event Processing with Oracle Stream Explorer.
In our solution prototype for streaming analytics, the model exchange between ORE and OSX is realized by converting the R models to a PMML representation. After that, JPMML – the Java Evaluator API for PMML – is leveraged for reading the model and building a custom OSX scoring application.
The end-to-end workflow is represented below:
and the subsequent sections of this blog will summarize the essential aspects.
Model Generation
Models are generated in R with ORE. For example, embedded R execution with ore.groupApply() can build one model per group of data, in parallel, inside the database:
res <- ore.groupApply(
  X=…,
  INDEX=…,
  function(dat,frml) {mdl<-…},
  …,
  parallel=np)
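Filled in with hypothetical names (not the actual objects used in the prototype), such a call could look like the sketch below: one randomForest model is built per equipment group, in parallel inside the database, and the resulting models are pulled back to the client for the PMML conversion described in the next section.
# Hypothetical sketch: SENSOR_DATA is assumed to be an ore.frame with an
# EQUIP_ID grouping column; the formula and the degree of parallelism np
# are placeholders.
np  <- 4
res <- ore.groupApply(
  X = SENSOR_DATA,
  INDEX = SENSOR_DATA$EQUIP_ID,
  function(dat, frml) {
    library(randomForest)           # load the package on the database server
    randomForest(frml, data = dat)  # one model per group
  },
  frml = FAILED ~ TEMP + PRESSURE + VIBRATION,
  parallel = np)
mdls <- ore.pull(res)               # per-group models, ready for pmml() conversion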
Model representation in PMML
PMML (the Predictive Model Markup Language) is an XML-based standard for representing data mining models. The Data Dictionary and Transformation Dictionary components describe the raw data and its desired form for the modeling algorithms, while the Mining Schema component assigns the active and target variables and enumerates the policies for missing data, outliers, and so on. Besides the specifications of the data mining algorithms and the accompanying pre- and post-processing steps, PMML can also describe more complex modeling concepts like model composition, model hierarchies, model verification and field scoping – to find out more about PMML's structure and functionality go to General Structure. PMML representations have been standardized for several classes of data mining algorithms. Details are available at the same location.
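As a quick illustration of this structure, converting even a toy model and printing the result exposes these components; the snippet below uses a small lm fit on the iris data set and is not part of the prototype.
# Toy example: inspect the PMML components of a small linear model.
library(pmml)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
doc <- pmml(fit)
cat(toString(doc))   # shows the DataDictionary, MiningSchema and RegressionModel elements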
PMML in R
The conversion of R models to PMML is provided by the pmml package, which supports the following model functions (the providing R package is shown in parentheses):
ada (ada)
arules (arules)
coxph (survival)
glm (stats)
glmnet (glmnet)
hclust (stats)
kmeans (stats)
ksvm (kernlab)
lm (stats)
multinom (nnet)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
rfsrc (randomForestSRC)
rpart (rpart)
svm (e1071)
The r2pmml package offers complementary support for:
gbm (gbm)
train(caret)
and a conversion to PMML for randomForest with much better performance. Check the details at converting_randomforest.
The conversion to PMML is done via the pmml() generic function, which dispatches the appropriate method depending on the class of the supplied model.
library(pmml)
mdl <- randomForest(…)
pmmdl <- pmml(mdl)
Exporting the PMML model
In the current prototype, the PMML model is exported to the streaming platform as a physical XML file. A better solution is to leverage R's serialization interface, which supports a rich set of connections through pipes, URLs, sockets, etc.
The PMML objects can also be saved in ORE datastores within the database, and specific policies can be implemented to control their access and usage.
write(toString(pmmdl), file="..")                  # export as an XML file
serialize(pmmdl, connection)                       # or send over an R connection (pipe, socket, URL)
ore.save(pmmdl, name=dsname, grantable=TRUE)       # or save in an ORE datastore
ore.grant(name=dsname, type="datastore", user=…)   # and grant access to other users
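On the receiving end, or in a later R session, the datastore copy can be restored and re-exported as the XML file consumed by the scoring application. A minimal sketch, assuming dsname still holds the datastore name and using a hypothetical file path:
# Restore the saved PMML object from the ORE datastore and write it out
# as an XML file for the OSX scoring application.
ore.load(name = dsname)                            # restores pmmdl into the session
write(toString(pmmdl), file = "scoring_model.xml")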
OSX Applications and the Event Processing Network (EPN)
In OSX, applications are modeled as data flow graphs named Event Processing Networks (EPNs). Data flows into, through, and out of these networks. When raw data flows into an OSX application it is first converted into events. Events flow through the different stages of the application, where they are processed according to the specifics of the application. At the end, events are converted back into data in a format suitable for consumption by downstream applications.
The EPN for our prototype is basic:
The JPMML-Evaluator
JPMML offers support for:
Association Rules
Cluster Models
Regression
General Regression
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Tree Models
Support Vector Machines
Ensemble Models
which covers most of the models that can be converted to PMML from R using the pmml() method, except for time series, sequence rules, and text models.
The Scoring Processor
The Scoring Processor is the custom component of the EPN that invokes the JPMML-Evaluator API; its main steps are outlined below.
- The PMML model is loaded from the XML document:
pmml = pmmlUtil.loadModel(pmmlFileName);
- An instance of the Model Evaluator is created. In the example below we assume that the type of model is not known in advance, so the instantiation is delegated to a ModelEvaluatorFactory instance:
ModelEvaluatorFactory modelEvaluatorFactory =
ModelEvaluatorFactory.newInstance();
ModelEvaluator<?> evaluator = modelEvaluatorFactory.newModelManager(pmml);
- This Model Evaluator instance is queried for the field definitions. For the active fields:
List<FieldName> activeModelFields = evaluator.getActiveFields();
- The subsequent data preparation performs several tasks: value conversion between the Java type system and the PMML type system, validation of these values according to the specifications in the Data Field element, and handling of invalid values, missing values, and outliers as specified by the Mining Field element:
// executed for each active field of the incoming event
FieldValue activeValue = evaluator.prepare(activeField, inputValue);
pmmlArguments.put(activeField, activeValue);
- The prediction is executed next:
Map<FieldName, ?> results = evaluator.evaluate(pmmlArguments);
- Once this is done, the scoring results and other fields are mapped to output events. This step needs to differentiate between target values that are Java primitive values and those that are more complex result objects (instances of Computable):
FieldName targetName = evaluator.getTargetField();
Object targetValue = results.get(targetName);
if (targetValue instanceof Computable){ ….
More details about this approach can be found at JPMML-Evaluator: Preparing arguments for evaluation and Java (JPMML) Prediction using R PMML model.
The key aspect is that the JPMML-Evaluator API provides the functionality for implementing the Scoring Processor independently of the actual model being used. The active variables, mappings, assignments, and predictor invocations are determined automatically from the PMML representation. This approach gives the scoring application flexibility. Suppose that several PMML models have been generated off-line for the same system component, equipment part, etc. Then, for example, an n-variable logistic model could be replaced by an m-variable decision tree model from the UI control, by just pointing a Scoring Processor variable to the new PMML object. Moreover, the substitution can be executed via signal events sent through the UI Application Control (upper left of the EPN) without stopping and restarting the scoring application. This is practical because the real-time data keeps flowing in!
Tested models
The R algorithms listed below were tested, and the predictions produced by the OSX PMML/JPMML scoring application were identical to those obtained in R.
lm (stats)
glm (stats)
rpart (rpart)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
The prototype is new and other algorithms are currently being tested. The details will follow in a subsequent post.
Acknowledgment
The OSX-ORE PMML/JPMML-based prototype for real-time scoring was developed together with Mauricio Arango | A-Team Cloud Solutions Architects. The work was presented at BIWA 2016.