MADlib: A SQL-Based Data Mining Solution (30): Model Evaluation with Prediction Metrics

wzy0623

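I. Prediction Metrics

This module provides a set of metrics for evaluating the quality of model predictions. Unless noted otherwise, each function takes a set of predicted values and observed values and computes the requested metric from them. All functions support grouping, with the exception of the confusion matrix.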

II. Prediction Metric Functions

Mean absolute error: mean_abs_error(table_in, table_out, prediction_col, observed_col, grouping_cols)
Mean absolute percentage error: mean_abs_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols)
Mean percentage error: mean_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols)
Mean squared error: mean_squared_error(table_in, table_out, prediction_col, observed_col, grouping_cols)
R² score: r2_score(table_in, table_out, prediction_col, observed_col, grouping_cols)
Adjusted R² score: adjusted_r2_score(table_in, table_out, prediction_col, observed_col, num_predictors, training_size, grouping_cols)
Metrics for binary classification: binary_classifier(table_in, table_out, prediction_col, observed_col, grouping_cols)
Area under the ROC curve (binary classification): area_under_roc(table_in, table_out, prediction_col, observed_col, grouping_cols)
Confusion matrix for a multi-class classifier: confusion_matrix(table_in, table_out, prediction_col, observed_col, grouping_cols)

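III. Parameters

table_in: TEXT. Name of the input table.

table_out: TEXT. Name of the output table. For consistency, every metric writes its output to a table even when grouping is not used, so in some cases the output table contains a single value.

prediction_col: TEXT. Name of the column of predicted values in the input table.

observed_col: TEXT. Name of the column of observed values in the input table.

num_predictors (adjusted_r2_score only): INTEGER. Number of parameters in the predictive model, not counting the constant term.

training_size (adjusted_r2_score only): INTEGER. Number of rows used for training, excluding any NULL rows.

grouping_cols (optional): TEXT. Default NULL. Names of the grouping columns in the input table.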

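IV. Function Details

1. r2_score

This function returns the coefficient of determination (R²) between the predicted and observed values. An R² of 1 means the regression line fits the data perfectly, while an R² of 0 means the line does not fit the data at all. Negative values of R² can occur when fitting non-linear functions to data. See reference [1] for details.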
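
As a cross-check outside the database, here is a minimal Python sketch of the usual definition R² = 1 - SS_res / SS_tot. The helper name r2_score and the pairs list are illustrative assumptions; the numbers are the sample (pred, obs) rows used in the examples section, and this is not necessarily how MADlib computes the value internally.

```python
# Sample (prediction, observation) pairs from the examples section.
pairs = [(37.5, 53.1), (12.3, 34.2), (74.2, 65.4), (91.1, 82.1)]

def r2_score(pairs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs = [o for _, o in pairs]
    mean_obs = sum(obs) / len(obs)
    # Residual sum of squares: squared gaps between observation and prediction.
    ss_res = sum((o - p) ** 2 for p, o in pairs)
    # Total sum of squares: squared gaps between observation and its mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

print(r2_score(pairs))
```

On this data the sketch agrees with the r2_score value reported in the examples section (about 0.2799).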

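2. adjusted_r2_score

This function returns the adjusted R² score. Adjusted R² compensates for the fact that R² automatically increases when extra explanatory variables are added to a model. It takes two additional parameters that describe the model's degrees of freedom (num_predictors) and the size of the training set (training_size):

num_predictors: the number of model parameters other than the constant term. For example, if it is set to 3, the model could take a form such as 7 + 5x + 39y + 0.91z.
training_size: the number of rows in the training set (excluding any NULL rows).

Neither parameter can be inferred from the predictions and the test data, which is why they must be supplied explicitly. See reference [1] for details.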
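
The adjustment itself is a single line of arithmetic. The sketch below uses the standard textbook formula, 1 - (1 - R²)(n - 1)/(n - p - 1); the function name adjusted_r2 is hypothetical, and the formula is an assumption about what the module does, although it does reproduce the value reported in the examples section for num_predictors = 3 and training_size = 100.

```python
def adjusted_r2(r2, num_predictors, training_size):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    n, p = training_size, num_predictors
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# R^2 taken from the r2_score example; 3 predictors, 100 training rows.
print(adjusted_r2(0.279929088443375, 3, 100))
```

Because (n - 1)/(n - p - 1) is greater than 1 whenever p > 0, the adjusted score is always lower than the raw R² when R² is below 1.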

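3. binary_classifier

This function returns an output table containing a number of metrics commonly used for binary classification. The metrics are defined as follows:

tp: count of positive samples correctly classified as positive.
tn: count of negative samples correctly classified as negative.
fp: count of negative samples incorrectly classified as positive.
fn: count of positive samples incorrectly classified as negative.
tpr = tp / (tp + fn)
tnr = tn / (fp + tn)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
fpr = fp / (fp + tn)
fdr = 1 - ppv
fnr = fn / (fn + tp)
acc = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)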
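
These definitions translate directly into code. The following Python sketch (the function name binary_metrics is illustrative, not part of MADlib) evaluates them for the four counts reported at threshold 0.5 in the examples section: tp = 41, fp = 22, fn = 8, tn = 29.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the rate metrics from the four confusion counts."""
    ppv = tp / (tp + fp)
    return {
        "tpr": tp / (tp + fn),            # true positive rate (recall)
        "tnr": tn / (fp + tn),            # true negative rate
        "ppv": ppv,                       # positive predictive value (precision)
        "npv": tn / (tn + fn),            # negative predictive value
        "fpr": fp / (fp + tn),            # false positive rate
        "fdr": 1 - ppv,                   # false discovery rate
        "fnr": fn / (fn + tp),            # false negative rate
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

m = binary_metrics(tp=41, fp=22, fn=8, tn=29)
for name, value in m.items():
    print(f"{name:4s} {value:.6f}")
```

Note that tpr + fnr = 1 and tnr + fpr = 1, which is a quick sanity check on any row of the output table.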

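4. area_under_roc

This function returns the area under the receiver operating characteristic curve (AUC) for binary classification. The ROC curve plots the classifier's TPR against its FPR (both defined under binary_classifier above). See reference [2] for details. Note that the binary_classifier function can be used to obtain the data (TPR and FPR values) needed to plot the ROC curve.

Notes:

For the binary_classifier and area_under_roc functions:

The observed_col column must be either a numeric column with two values, 0 and 1, or a boolean column. For the purposes of the metric calculations, 0 is treated as negative and 1 as positive.
The prediction_col column contains the corresponding likelihood values. A larger value means greater certainty that the observed value is 1; a smaller value means greater certainty that it is 0.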

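5. confusion_matrix

This function returns the confusion matrix for a multi-class classification. Each column of the matrix represents the instances of a predicted class, and each row represents the instances of an actual class. This allows a more detailed analysis than the proportion of correct guesses (accuracy) alone. See reference [3] for details. Note that the confusion matrix does not support grouping.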

V. Examples

1. Create sample data

drop table if exists test_set;
create table test_set(
                  pred float8,
                  obs float8
                );
insert into test_set values
  (37.5,53.1), (12.3,34.2), (74.2,65.4), (91.1,82.1);

2. Run the mean absolute error function

drop table if exists table_out;
select madlib.mean_abs_error( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

 mean_abs_error 
----------------
         13.825
(1 row)

3. Run the mean absolute percentage error function

drop table if exists table_out;
select madlib.mean_abs_perc_error( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

        avg        
-------------------
 0.294578793636013
(1 row)

4. Run the mean percentage error function

drop table if exists table_out;
select madlib.mean_perc_error( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

  mean_perc_error  
-------------------
 -0.17248930032771
(1 row)

5. Run the mean squared error function

drop table if exists table_out;
select madlib.mean_squared_error( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

 mean_squared_error 
--------------------
           220.3525
(1 row)
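
The four error metrics above can be cross-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the formulas (in particular the sign convention (pred - obs) / obs for the percentage error) are assumptions chosen because they reproduce the outputs shown above, not a statement about MADlib's internals.

```python
# Same (pred, obs) rows as the test_set table created in step 1.
pairs = [(37.5, 53.1), (12.3, 34.2), (74.2, 65.4), (91.1, 82.1)]
n = len(pairs)

# Mean absolute error: average size of the miss.
mae = sum(abs(p - o) for p, o in pairs) / n
# Mean absolute percentage error: miss relative to the observed value.
mape = sum(abs(p - o) / abs(o) for p, o in pairs) / n
# Mean percentage error: signed, so over- and under-predictions can cancel.
mpe = sum((p - o) / o for p, o in pairs) / n
# Mean squared error: penalizes large misses more heavily.
mse = sum((p - o) ** 2 for p, o in pairs) / n

print(mae, mape, mpe, mse)
```

The negative mean percentage error indicates that, relative to the observed values, the predictions are too low on average, even though two of the four rows over-predict.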

6. Run the R² score function

drop table if exists table_out;
select madlib.r2_score( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

     r2_score      
-------------------
 0.279929088443375
(1 row)

7. Run the adjusted R² score function

drop table if exists table_out;
select madlib.adjusted_r2_score( 'test_set', 'table_out', 'pred', 'obs', 3, 100);
select * from table_out;

Result:

     r2_score      | adjusted_r2_score 
-------------------+-------------------
 0.279929088443375 | 0.257426872457231
(1 row)

8. Create sample data for the binary classifier metrics

drop table if exists test_set;
create table test_set as
    select ((a*8)::integer)/8.0 pred,
        ((a*0.5+random()*0.5)>0.5) obs
    from (select random() as a from generate_series(1,100)) x;

9. Run the binary classifier metrics function

drop table if exists table_out;
select madlib.binary_classifier( 'test_set', 'table_out', 'pred', 'obs');

10. View the true positive rate and false positive rate

select threshold, tpr, fpr from table_out order by threshold;

Result (the sample data is generated with random(), so your values will differ):

       threshold        |        tpr        |        fpr         
------------------------+-------------------+--------------------
 0.00000000000000000000 |                 1 |                  1
 0.12500000000000000000 |                 1 |  0.882352941176471
 0.25000000000000000000 | 0.979591836734694 |  0.745098039215686
 0.37500000000000000000 | 0.897959183673469 |  0.568627450980392
 0.50000000000000000000 | 0.836734693877551 |  0.431372549019608
 0.62500000000000000000 | 0.693877551020408 |  0.313725490196078
 0.75000000000000000000 | 0.551020408163265 |  0.176470588235294
 0.87500000000000000000 | 0.428571428571429 | 0.0980392156862745
 1.00000000000000000000 | 0.163265306122449 | 0.0196078431372549
(9 rows)

11. View all metrics at a given threshold

-- Set extended display on for easier reading of output
\x on
select * from table_out where threshold=0.5;

Result:

-[ RECORD 1 ]---------------------
threshold | 0.50000000000000000000
tp        | 41
fp        | 22
fn        | 8
tn        | 29
tpr       | 0.836734693877551
tnr       | 0.568627450980392
ppv       | 0.650793650793651
npv       | 0.783783783783784
fpr       | 0.431372549019608
fdr       | 0.349206349206349
fnr       | 0.163265306122449
acc       | 0.7
f1        | 0.73214285714285714286

12. Run the area under the ROC curve function

drop table if exists table_out;
select madlib.area_under_roc( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out;

Result:

               area_under_roc                
---------------------------------------------
 0.77691076430572228891752501000400160064025
(1 row)
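
To see where this number comes from, the (threshold, tpr, fpr) rows printed in step 10 can be integrated with the trapezoidal rule. The Python sketch below is an illustration under two assumptions: the curve is closed with an extra (0, 0) point, and trapezoids are used between neighboring points. It happens to reproduce the value above, but that does not prove MADlib uses the same procedure.

```python
# (threshold, tpr, fpr) rows copied from the output of step 10.
roc = [
    (0.000, 1.0, 1.0),
    (0.125, 1.0, 0.882352941176471),
    (0.250, 0.979591836734694, 0.745098039215686),
    (0.375, 0.897959183673469, 0.568627450980392),
    (0.500, 0.836734693877551, 0.431372549019608),
    (0.625, 0.693877551020408, 0.313725490196078),
    (0.750, 0.551020408163265, 0.176470588235294),
    (0.875, 0.428571428571429, 0.0980392156862745),
    (1.000, 0.163265306122449, 0.0196078431372549),
]

# ROC points as (fpr, tpr), plus the origin, ordered by increasing fpr.
points = sorted([(fpr, tpr) for _, tpr, fpr in roc] + [(0.0, 0.0)])

# Trapezoidal rule: width in fpr times the average tpr of the two endpoints.
auc = 0.0
for (x0, y0), (x1, y1) in zip(points, points[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2.0

print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so roughly 0.78 indicates a moderately informative score on this randomly generated data.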

13. Create sample data for the confusion matrix

drop table if exists test_set;
create table test_set as
    select (x+y)%5+1 as pred,
        (x*y)%5 as obs
    from generate_series(1,5) x,
        generate_series(1,5) y;

14. Run the confusion matrix function

drop table if exists table_out;
select madlib.confusion_matrix( 'test_set', 'table_out', 'pred', 'obs');
select * from table_out order by class;

Result:

 row_id | class | confusion_arr 
--------+-------+---------------
      1 |     0 | {0,1,2,2,2,2}
      2 |     1 | {0,2,0,1,1,0}
      3 |     2 | {0,0,0,2,2,0}
      4 |     3 | {0,0,2,0,0,2}
      5 |     4 | {0,2,1,0,0,1}
      6 |     5 | {0,0,0,0,0,0}
(6 rows)
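
The table can be reproduced by plain counting. The following Python sketch rebuilds the same 25 (pred, obs) pairs as step 13 and tallies them, with rows as actual classes and columns as predicted classes; the names pairs, classes and matrix are illustrative and not part of MADlib.

```python
from collections import Counter

# Same data as step 13: pred = (x+y)%5+1, obs = (x*y)%5 for x, y in 1..5.
pairs = [((x + y) % 5 + 1, (x * y) % 5)
         for x in range(1, 6) for y in range(1, 6)]

# Every class that appears as either a prediction or an observation: 0..5.
classes = sorted({c for pair in pairs for c in pair})

# Count each (actual, predicted) combination.
counts = Counter((obs, pred) for pred, obs in pairs)

# One row per actual class; entry j counts predictions of classes[j].
matrix = {obs: [counts[(obs, pred)] for pred in classes] for obs in classes}

for obs in classes:
    print(obs, matrix[obs])
```

Class 5 is never observed, so its row is all zeros, and class 0 is never predicted, so the first column is all zeros. Correct predictions sit on the diagonal, which here holds only 2 of the 25 samples.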