Your Prediction Is Only As Good As Your Data

I have often seen software engineers and data scientists assume that they can keep increasing their prediction accuracy simply by improving their machine learning algorithm. Here, I want to approach the classification problem from a different angle: I suggest that data scientists analyze the distribution of their data to measure how much information it actually contains. This gives an upper bound on how far the accuracy of a predictive algorithm can be improved, and helps ensure that optimization effort is not wasted.

Information and Entropy

In information theory, mathematicians have developed useful measures such as entropy to quantify the information content of data. Think of a biased coin with a head probability of 1%. If one flips this coin, observing a head (the rare event) conveys more information than observing a tail (the far more likely event). One can formulate the information content of a random event as the negative logarithm of the event's probability:

      I(x) = -\log_2 P(x)

This captures the described intuition. Mathematicians have also formulated another measure, called entropy, which captures the average information in a random process, measured in bits. The entropy of a discrete random variable X is:

      H(X) = -\sum_i P(x_i) \log_2 P(x_i)
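
As a quick illustration, a minimal Python sketch of this formula might look like the following (the helper name entropy_bits is an arbitrary choice of mine; zero-probability terms are skipped per the convention that 0 * log2(0) = 0):

```python
import math

def entropy_bits(probs):
    """Entropy of a discrete distribution, in bits.

    `probs` is a sequence of probabilities that sum to 1. Zero-probability
    terms are skipped, following the convention that 0 * log2(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits
```
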
For the first example, let's assume we have a coin with P(H) = 0% and P(T) = 100%. Using the convention that 0 \cdot \log_2(0) = 0, we can compute the entropy of the coin as follows:

      H = -0 \cdot \log_2(0) - 1 \cdot \log_2(1) = 0 \text{ bits}

For the second example, let's consider a coin where P(H) = 1% and P(T) = 1 - P(H) = 99%. Plugging in the numbers, one finds that the entropy of such a coin is:

      H = -0.01 \log_2(0.01) - 0.99 \log_2(0.99) \approx 0.08 \text{ bits}

Finally, if the coin has P(H) = P(T) = 0.5 (i.e. a fair coin), its entropy is calculated as follows:

      H = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1 \text{ bit}
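
These three values can also be checked numerically, for example with SciPy's scipy.stats.entropy, which returns the Shannon entropy when given a single distribution; base=2 gives the result in bits (using SciPy here is my own convenience, not something the argument depends on):

```python
from scipy.stats import entropy

# Shannon entropy (in bits) of the three coins discussed above.
print(entropy([0.0, 1.0], base=2))    # deterministic coin -> 0.0
print(entropy([0.01, 0.99], base=2))  # heavily biased coin -> ~0.081
print(entropy([0.5, 0.5], base=2))    # fair coin -> 1.0
```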

Entropy and Predictability

So, what do these examples tell us? If we have a coin whose head probability is zero, the coin's entropy is zero, meaning that the average information in the coin is zero. This makes sense because flipping such a coin always comes up tails, so the prediction accuracy is 100%. In other words, when the entropy is zero, we have maximum predictability.

In the second example, the head probability is not zero but still very close to it, which again makes the coin very predictable and its entropy low.

Finally, in the last example we have a 50/50 chance of seeing heads or tails, which maximizes the entropy and consequently minimizes the predictability. One can show that a fair coin has the maximum entropy of 1 bit, making any prediction no better than a random guess.
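
This inverse relationship can be made concrete with a short numerical sketch. For a single coin flip, the best any predictor can do is always guess the more likely side, so its accuracy is capped at max(P(H), P(T)); the toy loop below (my own illustration) pairs that ceiling with the entropy for a few head probabilities:

```python
import math

def entropy_bits(p_head):
    """Entropy in bits of a coin with head probability p_head."""
    return -sum(p * math.log2(p) for p in (p_head, 1.0 - p_head) if p > 0.0)

for p_head in (0.01, 0.05, 0.1, 0.3, 0.5):
    # The best single-flip predictor always guesses the more likely side.
    best_accuracy = max(p_head, 1.0 - p_head)
    print(f"P(H)={p_head:.2f}  entropy={entropy_bits(p_head):.3f} bits  "
          f"best accuracy={best_accuracy:.2f}")
```

As the entropy climbs toward 1 bit, the accuracy ceiling drops toward 0.5, which is no better than a random guess.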

Kullback–Leibler Divergence

As a last example, we show how ideas from information theory can be used to measure the distance between two probability distributions. Let's assume we are modeling two random processes by their PMFs, P(.) and Q(.). One can build on the entropy measure to compute a distance between the two PMFs (also known as the relative entropy) as follows:

      D_{KL}(P \,\|\, Q) = \sum_i P(x_i) \log_2 \frac{P(x_i)}{Q(x_i)}

The distance function above is known as the KL divergence; it measures how far the distribution Q is from P (note that it is not symmetric, so it is not a true metric). The KL divergence can be very useful in various applications, such as NLP problems where we want to measure the distance between the distributions of two documents (e.g. modeled as bags of words).
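
As a minimal sketch of that use case, the snippet below computes the KL divergence between two toy bag-of-words distributions; the word counts and the add-one smoothing are illustrative assumptions of mine. With two arguments, scipy.stats.entropy returns the KL divergence rather than the entropy:

```python
import numpy as np
from scipy.stats import entropy

# Toy bag-of-words counts for two documents over a shared 4-word vocabulary
# ("data", "model", "coin", "entropy"); the numbers are made up for illustration.
doc_p = np.array([10, 5, 1, 4], dtype=float)
doc_q = np.array([2, 8, 6, 1], dtype=float)

# Add-one smoothing so no probability is exactly zero; otherwise
# D_KL can become infinite when Q assigns zero mass to a word P uses.
p = (doc_p + 1) / (doc_p + 1).sum()
q = (doc_q + 1) / (doc_q + 1).sum()

# With two arguments, scipy.stats.entropy computes the KL divergence.
print(entropy(p, q, base=2))  # D_KL(P || Q) in bits
print(entropy(q, p, base=2))  # a different value: KL divergence is asymmetric
```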

Wrap-up

In this post, we showed that entropy, a measure from information theory, provides a way to quantify how much information a given dataset contains. We also highlighted the inverse relationship between entropy and predictability. This means that one can use the entropy to derive an upper bound on the accuracy attainable for the prediction problem at hand.


Source: http://www.aioptify.com/informationbound.php

