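Problem description:
Running the query with spark-sql produced NaN values. On inspection, every one of them came from a stddev calculation, yet the same query in hive returned 0.0 for those rows.
The SQL used was: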
select
phone
, tour_ymd
, stddev(total_price) as total_price_stddev
, stddev(bedroom_cnt) as bedroom_cnt_stddev
, stddev(tour_last_mintues) as tour_last_mintues_stddev
, stddev(showing_last_3day_cnt) as showing_last_3day_cnt_stddev
, stddev(showing_last_7day_cnt) as showing_last_7day_cnt_stddev
, stddev(showing_last_15day_cnt) as showing_last_15day_cnt_stddev
, stddev(showing_last_30day_cnt) as showing_last_30day_cnt_stddev
, stddev(temp) as stddevtemp
, stddev(humidity) as stddevhumidity
, stddev(aqi) as stddevaqi
from my_tb
where my_condition
group by phone, tour_ymd
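Cause:
The same SQL runs in both engines, but the results show that they compute different things: hive divides by N (the population standard deviation), while spark-sql divides by N-1 (the standard deviation with Bessel's correction). When a (phone, tour_ymd) group contains only one row, N-1 is 0, so the corrected formula has no defined value and spark-sql reports NaN, whereas dividing by N gives 0.0.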
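The difference between the two divisors can be reproduced outside either engine. The following is a minimal sketch in plain Python, not the engines' actual implementation: the function names `stddev_pop` and `stddev_samp` are borrowed from the SQL functions of the same name for readability, and the sample values are made up for illustration.

```python
import math

def stddev_pop(values):
    """Population standard deviation: divide the squared deviations by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def stddev_samp(values):
    """Sample standard deviation (Bessel's correction): divide by N - 1.

    For a single value the divisor is 0; this sketch returns NaN in that
    case to mirror the result observed in spark-sql (an assumption about
    how the NaN arises, not a copy of Spark's code).
    """
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    if n - 1 == 0:
        return float("nan")
    return math.sqrt(squared_dev / (n - 1))

# Hypothetical groups: one with a single row, one with several rows.
single = [350.0]
several = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(stddev_pop(single), stddev_samp(single))
print(stddev_pop(several), stddev_samp(several))
```

For the single-row group the population formula gives 0.0 and the corrected formula gives NaN, which matches the hive and spark-sql results described above. Both engines also provide `stddev_pop` and `stddev_samp` as explicit function names, so writing the intended variant in the query avoids depending on what plain `stddev` maps to.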
Background on variance and standard deviation
- Variance