Implementing Baseline Machine Learning Algorithms

[The theory comes from "How to Implement Baseline Machine Learning Algorithms From Scratch With Python"]

Background

When we come across a prediction problem, we usually want to use either classification or regression methods to capture its basic pattern. The most common way is to build a model that learns this pattern from labeled data and then use the model to make predictions on other, unlabeled data. This raises two questions: how do we know how well the model performs, and what can we compare it against? Here, I will implement two simple models, called baseline machine learning algorithms. They give us a reference point against which we can compare the performance of our own models and measure any improvement.

Random Prediction Algorithm

This algorithm predicts, for each row of the testing data, a random outcome chosen from the outcomes observed in the training data.

There are several things to keep in mind:

(1) The random predictions for the testing data depend on the outcomes in the training data, so we should first store all of the unique outcomes from the training data and then randomly assign those unique values to the testing data as predictions;

(2) We should fix the random number seed, since the generated random numbers decide which training outcome is assigned to each testing row. Fixing the seed ensures that we get the same sequence of random numbers, and therefore the same predictions, every time we run the algorithm;

(3) We assume the last column of our dataset is the label, so we only need the labels of the training data, which are then randomly assigned to the testing data as predictions.
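As a small illustration of point (2), the sketch below draws the same numbers twice by resetting the seed. It uses Python's standard `random` module; the seed value `1` and the names `first_run` and `second_run` are arbitrary choices for this example.

```python
from random import seed, randrange

# Fix the seed, then draw five random integers in the range 0..9.
seed(1)
first_run = [randrange(10) for _ in range(5)]

# Resetting the same seed restarts the same sequence of random numbers.
seed(1)
second_run = [randrange(10) for _ in range(5)]

print(first_run == second_run)  # True
```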

For example, to get the last column of a dataset in Python, we could do the following.

First, create a dataset.


Here I generated a 3x4 array as the dataset and take the last column, [3, 7, 11], as its labels.

Then we visit each row and read its last element (for row in test: print(row[-1])).
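The steps above can be sketched as follows. The text only states the shape of the array and its last column, so the values in the other columns (the integers 0 to 11 in row order) are an assumption, as is the helper list `labels`.

```python
# Assumed 3x4 dataset: only the shape and the last column [3, 7, 11]
# come from the text; the other values are illustrative.
test = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]

labels = []
for row in test:
    print(row[-1])          # the last element of each row is its label
    labels.append(row[-1])  # keep the labels for later use
```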

Finally, these values are randomly assigned to the testing data as their labels.

Put together, the algorithm collects the unique labels from the training data and, for each row of the testing data, picks one of them at random.
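A minimal sketch of the random prediction algorithm, under the assumptions listed earlier (the label is the last column, and the seed is fixed), might look like this. The function name `random_algorithm` and the toy `train` and `test` lists are illustrative choices; the unique labels are sorted so that their order does not depend on how Python orders a set.

```python
from random import seed, randrange

def random_algorithm(train, test):
    """Predict a randomly chosen training label for each test row."""
    # Collect the labels (last column) observed in the training data.
    output_values = [row[-1] for row in train]
    # Keep only the distinct labels, in a deterministic order.
    unique = sorted(set(output_values))
    predicted = []
    for _ in test:
        # Pick one of the unique labels uniformly at random.
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)  # fix the seed so the predictions are reproducible
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)
```

Running the script again with the same seed should print the same list of predictions.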


Zero Rule Algorithm

Whereas the random prediction algorithm assigns prediction values to the testing data at random, the Zero Rule algorithm assigns them based on how often each value appears among the labels of the training dataset: it always predicts the most frequent one. For example, if 90% of the training labels are '0' and 10% are '1', we would assign '0' as the prediction for every row of the testing data, which gives about 90% accuracy as long as the testing data has a similar label distribution.

Based on this, the algorithm can be implemented in the following steps:

Firstly, we store all the labels of the training dataset;

Secondly, we find the value that appears most often among those labels;

Finally, we assign that value as the prediction for every row of the test dataset.


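In code, the whole algorithm is a single function that counts the training labels and returns the most frequent one for every row of the testing dataset.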
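The three steps above can be sketched as follows, again assuming the label is the last column of each row. The function name `zero_rule_algorithm_classification` and the toy data are illustrative. Note that if two labels are tied for most frequent, this sketch does not define which one is chosen.

```python
def zero_rule_algorithm_classification(train, test):
    """Predict the most frequent training label for every test row."""
    # Step 1: store all the labels of the training dataset.
    output_values = [row[-1] for row in train]
    # Step 2: find the label that appears most often.
    prediction = max(set(output_values), key=output_values.count)
    # Step 3: assign that label to every row of the test dataset.
    return [prediction for _ in test]

train = [['0'], ['0'], ['0'], ['0'], ['1'], ['1']]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_classification(train, test)
print(predictions)  # ['0', '0', '0', '0']
```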


Regression

Regression problems require predicting a real value. A good default prediction for real values is the central tendency of the training labels, which can be either their mean or their median. In other words, we can use the mean or median of the training labels as the prediction for every row of the testing data.
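A minimal sketch of this regression baseline, using `mean` and `median` from Python's standard `statistics` module, is shown below. The function name `zero_rule_algorithm_regression`, the `use_median` flag, and the toy data are assumptions made for this example.

```python
from statistics import mean, median

def zero_rule_algorithm_regression(train, test, use_median=False):
    """Predict the mean (or median) training label for every test row."""
    # The label is assumed to be the last column of each training row.
    output_values = [row[-1] for row in train]
    # Choose the measure of central tendency.
    center = median(output_values) if use_median else mean(output_values)
    return [center for _ in test]

# The value 100 is an outlier, so the mean and the median differ clearly.
train = [[10], [12], [14], [100]]
test = [[None], [None], [None]]
mean_predictions = zero_rule_algorithm_regression(train, test)
median_predictions = zero_rule_algorithm_regression(train, test, use_median=True)
print(mean_predictions)
print(median_predictions)
```

On this toy data the mean is pulled toward the outlier while the median is not, which is one reason to consider the median when the labels are skewed.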






