For training
We will write a fairly standard pipeline, broadly applicable to many ML problems, built from a few core functions (sketched in code after this list) that:
Load records of data.
Clean data by removing incomplete records and imputing missing values when necessary.
Preprocess and format data in a way that can be understood by a model.
Hold out a set of data that will not be trained on but will be used to validate model results (a validation set).
Train a model on a given subset of data and return a trained model and summary statistics.
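Here is a minimal sketch of such a pipeline. The CSV input, the column names ("label" and the feature columns), and the choice of a logistic regression model are all illustrative assumptions, not requirements of the approach.

```python
# A sketch of the training pipeline; data source, columns, and model
# choice are assumptions for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_records(path):
    # Load records of data (here, from a hypothetical CSV file).
    return pd.read_csv(path)


def clean_data(df):
    # Remove records missing the label, then impute remaining missing
    # feature values with each column's median.
    df = df.dropna(subset=["label"])
    return df.fillna(df.median(numeric_only=True))


def preprocess(df, feature_columns):
    # Format the data so the model can consume it: a numeric feature
    # matrix and a label vector.
    return df[feature_columns].to_numpy(), df["label"].to_numpy()


def train(X, y):
    # Hold out a validation set, fit the model on the rest, and return
    # the trained model together with summary statistics.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = LogisticRegression().fit(X_train, y_train)
    stats = {"validation_accuracy": accuracy_score(y_val, model.predict(X_val))}
    return model, stats
```

Keeping each step in its own function makes it easy to reuse the same loading and preprocessing code at inference time, which is exactly what the next section relies on.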
For inference
We will reuse some functions from the training pipeline and write a few custom ones. Ideally, we need functions that (sketched after this list):
Load a trained model and keep it in memory (to provide faster results).
Preprocess an example (the same way as in training).
Gather any relevant outside information.
Pass one example through the model (an inference function).
Postprocess, to clean up results before serving them to users.
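Below is a minimal sketch of these inference functions, under a few assumptions: the model was serialized with joblib to a hypothetical "model.joblib" file, examples arrive as dictionaries, and gather_outside_information and postprocess are illustrative placeholders. The preprocessing step would reuse the same logic as the training pipeline.

```python
# A sketch of the inference functions; the model path, joblib
# serialization, and the gather/postprocess steps are assumptions.
import joblib

_MODEL = None  # cache the model in memory across calls for faster results


def load_model(path="model.joblib"):
    # Load the trained model once and keep it in memory.
    global _MODEL
    if _MODEL is None:
        _MODEL = joblib.load(path)
    return _MODEL


def gather_outside_information(example):
    # Placeholder: fetch any relevant outside information (e.g. user
    # metadata) and attach it to the example before inference.
    return example


def postprocess(raw_prediction):
    # Clean up the raw model output before serving it to users.
    return {"prediction": int(raw_prediction)}


def predict_one(example, feature_columns):
    # Pass a single example through the model (the inference function),
    # preprocessing it the same way as in training.
    model = load_model()
    example = gather_outside_information(example)
    features = [[example[column] for column in feature_columns]]
    return postprocess(model.predict(features)[0])
```

Caching the model in a module-level variable is one simple way to avoid reloading it on every request; in a real service the same effect is often achieved by loading the model once at application startup.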