



Standard verification methods


"Eyeball" verification
One of the oldest and best verification methods is the good old-fashioned visual, or “eyeball”, method: look at the forecast and observations side by side and use human judgment to discern the forecast errors. Common ways to present data are as time series and maps.
The eyeball method is great if you only have a few forecasts, or you have lots of time, or you’re not interested in quantitative verification statistics. Even when you do want statistics, it is a very good idea to look at the data from time to time!
However, the eyeball method is not quantitative, and it is very prone to individual, subjective biases of interpretation. Therefore it must be used with caution in any formal verification procedure.
The following sections give fairly brief descriptions of the standard verification methods and scores for dichotomous, multi-category, continuous, and probabilistic forecasts. For greater detail and discussion of the standard methods see Stanski et al. (1989) or one of the excellent books on forecast verification and statistics.

Methods for dichotomous (yes/no) forecasts


A dichotomous forecast says, “yes, an event will happen”, or “no, the event will not happen”. Rain and fog prediction are common examples of yes/no forecasts. For some applications a threshold may be specified to separate “yes” and “no”, for example, winds greater than 50 knots.
To verify this type of forecast we start with a contingency table that shows the frequency of “yes” and “no” forecasts and occurrences. The four combinations of forecasts (yes or no) and observations (yes or no), called the joint distribution, are:
hit - event forecast to occur, and did occur
miss - event forecast not to occur, but did occur
false alarm - event forecast to occur, but did not occur
correct negative - event forecast not to occur, and did not occur

The total numbers of observed and forecast occurrences and non-occurrences are given on the lower and right sides of the contingency table, and are called the marginal distribution.
The contingency table is a useful way to see what types of errors are being made. A perfect forecast system would produce only hits and correct negatives, and no misses or false alarms.
A large variety of categorical statistics are computed from the elements in the contingency table to describe particular aspects of forecast performance. We will illustrate these statistics using a (made-up) example. Suppose a year’s worth of official daily rain forecasts and observations produced the following contingency table:
                 Observed yes   Observed no   Total
Forecast yes           82            38        120
Forecast no            23           222        245
Total                 105           260        365
Categorical statistics that can be computed from the yes/no contingency table are given below. Sometimes these scores are known by alternate names shown in parentheses.

Accuracy (fraction correct)

Answers the question: Overall, what fraction of the forecasts were correct?
Range: 0 to 1. Perfect score: 1.
Characteristics: Simple, intuitive. Can be misleading since it is heavily influenced by the most common category, usually “no event” in the case of rare weather.
In the example above, Accuracy = (82+222) / 365 = 0.83, indicating that 83% of all forecasts were correct.
Bias score (frequency bias)

Answers the question: How did the forecast frequency of “yes” events compare to the observed frequency of “yes” events?
Range: 0 to ∞. Perfect score: 1.
Characteristics: Measures the ratio of the frequency of forecast events to the frequency of observed events. Indicates whether the forecast system has a tendency to underforecast (BIAS<1) or overforecast (BIAS>1) events. Does not measure how well the forecast corresponds to the observations, only measures relative frequencies.
In the example above, BIAS = (82+38) / (82+23) = 1.14, indicating slight overforecasting of rain frequency.
Probability of detection (hit rate)
(also denoted H)

Answers the question: What fraction of the observed “yes” events were correctly forecast?
Range: 0 to 1. Perfect score: 1.
Characteristics: Sensitive to hits, but ignores false alarms. Very sensitive to the climatological frequency of the event. Good for rare events. Can be artificially improved by issuing more “yes” forecasts to increase the number of hits. Should be used in conjunction with the false alarm ratio (below). POD is also an important component of the Relative Operating Characteristic (ROC) used widely for probabilistic forecasts.
In the example above, POD = 82 / (82+23) = 0.78, indicating that roughly 3/4 of the observed rain events were correctly predicted.
False alarm ratio

Answers the question: What fraction of the predicted “yes” events actually did not occur (i.e., were false alarms)?
Range: 0 to 1. Perfect score: 0.
Characteristics: Sensitive to false alarms, but ignores misses. Very sensitive to the climatological frequency of the event. Should be used in conjunction with the probability of detection (above).
In the example above, FAR = 38/(82+38) = 0.32, indicating that in roughly 1/3 of the forecast rain events, rain was not observed.
Probability of false detection (false alarm rate)
(also denoted F)

Answers the question: What fraction of the observed “no” events were incorrectly forecast as “yes”?
Range: 0 to 1. Perfect score: 0.
Characteristics: Sensitive to false alarms, but ignores misses. Can be artificially improved by issuing fewer “yes” forecasts to reduce the number of false alarms. Not often reported for deterministic forecasts, but is an important component of the Relative Operating Characteristic (ROC) used widely for probabilistic forecasts.
In the example above, POFD = 38/(222+38) = 0.15, indicating that for 15% of the observed “no rain” events the forecasts were incorrect.
Success ratio

Answers the question: What fraction of the forecast “yes” events were correctly observed?
Range: 0 to 1. Perfect score: 1.
Characteristics: Gives information about the likelihood of an observed event, given that it was forecast. It is sensitive to false alarms but ignores misses. SR is equal to 1-FAR. POD is plotted against SR in the categorical performance diagram.
In the example above, SR = 82/(82+38) = 0.68, indicating that for 68% of the forecast rain events, rain was actually observed.
Threat score (critical success index)
(also denoted CSI)

Answers the question: How well did the forecast “yes” events correspond to the observed “yes” events?
Range: 0 to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration, that is, TS is only concerned with forecasts that count. Sensitive to hits, penalizes both misses and false alarms. Does not distinguish source of forecast error. Depends on climatological frequency of events (poorer scores for rarer events) since some hits can occur purely due to random chance.
In the example above, TS = 82/(82+23+38) = 0.57, meaning that slightly more than half of the “rain” events (observed and/or predicted) were correctly forecast.
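The ratio scores described so far can be reproduced directly from the four cells of the example table. The following is a minimal sketch; the variable and dictionary key names are illustrative choices, not part of any standard library.

```python
# Counts from the worked rain example in the text (365 daily forecasts).
hits, misses, false_alarms, correct_negatives = 82, 23, 38, 222
total = hits + misses + false_alarms + correct_negatives

scores = {
    "accuracy": (hits + correct_negatives) / total,
    "bias": (hits + false_alarms) / (hits + misses),
    "pod": hits / (hits + misses),
    "far": false_alarms / (hits + false_alarms),
    "pofd": false_alarms / (correct_negatives + false_alarms),
    "success_ratio": hits / (hits + false_alarms),
    "threat_score": hits / (hits + misses + false_alarms),
}

# Print each score to two decimals for comparison with the values in the text.
for name, value in scores.items():
    print(f"{name:14s} {value:.2f}")
```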
Equitable threat score (Gilbert skill score)
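(also denoted GSS) ETS = (hits - hits_random) / (hits + misses + false alarms - hits_random), where hits_random = (hits + misses) x (hits + false alarms) / total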

Answers the question: How well did the forecast “yes” events correspond to the observed “yes” events (accounting for hits due to chance)?
Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate). The ETS is often used in the verification of rainfall in NWP models because its “equitability” allows scores to be compared more fairly across different regimes. Sensitive to hits. Because it penalises both misses and false alarms in the same way, it does not distinguish the source of forecast error.
In the example above, ETS = (82-34)/(82+23+38-34) = 0.44. ETS gives a lower score than TS.
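The value 34 subtracted in the example is the number of hits expected by chance. A short sketch of that calculation, assuming the usual definition of random hits as (observed yes) x (forecast yes) / total:

```python
# Counts from the worked rain example in the text.
hits, misses, false_alarms, correct_negatives = 82, 23, 38, 222
total = hits + misses + false_alarms + correct_negatives

# Hits expected if forecasts and observations were statistically independent.
hits_random = (hits + misses) * (hits + false_alarms) / total

ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
ts = hits / (hits + misses + false_alarms)

print(f"hits_random = {hits_random:.1f}, ETS = {ets:.2f}, TS = {ts:.2f}")
```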

Hanssen and Kuipers discriminant (true skill statistic, Peirce’s skill score)
(also denoted TSS and PSS)

Answers the question: How well did the forecast separate the “yes” events from the “no” events?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Uses all elements in contingency table. Does not depend on climatological event frequency. The expression is identical to HK = POD - POFD, but the Hanssen and Kuipers score can also be interpreted as (accuracy for events) + (accuracy for non-events) - 1. For rare events HK is unduly weighted toward the first term (same as POD), so this score may be more useful for more frequent events. Can be expressed in a form similar to the ETS except the hits random term is unbiased. See Woodcock (1976) for a comparison of HK with other scores.
In the example above, HK = 82 / (82+23) - 38 / (38+222) = 0.63

Heidke skill score (Cohen’s k)

Answers the question: What was the accuracy of the forecast relative to that of random chance?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is a form of the generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance. In meteorology, at least, random chance is usually not the best forecast to compare to - it may be better to use climatology (long-term average value) or persistence (forecast = most recent observation, i.e., no change) or some other standard.
In the example above, HSS = 0.61.
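The HSS value quoted above can be checked with a short calculation. This sketch assumes the standard 2x2 form, in which the number of correct forecasts expected by chance is computed from the marginal totals; HK is included for comparison.

```python
# Counts from the worked rain example in the text.
hits, misses, false_alarms, correct_negatives = 82, 23, 38, 222
total = hits + misses + false_alarms + correct_negatives

# Correct forecasts (yes and no) expected from random chance, given the marginals.
expected_correct = (
    (hits + misses) * (hits + false_alarms)
    + (correct_negatives + misses) * (correct_negatives + false_alarms)
) / total

hss = (hits + correct_negatives - expected_correct) / (total - expected_correct)

# Hanssen and Kuipers discriminant, HK = POD - POFD.
hk = hits / (hits + misses) - false_alarms / (false_alarms + correct_negatives)

print(f"HSS = {hss:.2f}, HK = {hk:.2f}")
```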

Odds ratio

Answers the question: What is the ratio of the odds of a “yes” forecast being correct, to the odds of a “yes” forecast being wrong?
Odds ratio - Range: 0 to ∞, 1 indicates no skill. Perfect score: ∞
Log odds ratio - Range: -∞ to ∞, 0 indicates no skill. Perfect score: ∞
Characteristics: Measures the ratio of the odds of making a hit to the odds of making a false alarm. The logarithm of the odds ratio is often used instead of the original value. Takes prior probabilities into account. Gives better scores for rarer events. Less sensitive to hedging. Do not use if any of the cells in the contingency table are equal to 0. Used widely in medicine but not yet in meteorology – see Stephenson (2000) for more information.
Note that the odds ratio is not the same as the ratio of the probability of making a hit (hits / # forecasts) to the probability of making a false alarm (false alarms / # forecasts), since both of those can depend on the climatological frequency (i.e., the prior probability) of the event.
In the example above, OR = (82 x 222) / (23 x 38) = 20.8, indicating that the odds of a “yes” prediction being correct are over 20 times greater than the odds of a “yes” forecast being incorrect.

Odds ratio skill score (Yule’s Q)

Answers the question: What was the improvement of the forecast over random chance?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1
Characteristics: Independent of the marginal totals (i.e., of the threshold chosen to separate “yes” and “no”), so is difficult to hedge. See Stephenson (2000) for more information.
In the example above, ORSS = [(82 x 222)-(23 x 38)] / [(82 x 222)+(23 x 38)] = 0.91.
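The odds ratio, its logarithm, and the odds ratio skill score can be computed together. The sketch below uses a hypothetical helper name and refuses tables with an empty cell, in line with the caution given for the odds ratio.

```python
import math


def odds_ratio_scores(hits, misses, false_alarms, correct_negatives):
    """Return (odds ratio, log odds ratio, ORSS) for a 2x2 contingency table."""
    cells = (hits, misses, false_alarms, correct_negatives)
    if any(cell == 0 for cell in cells):
        raise ValueError("odds ratio is undefined when a table cell is 0")
    good = hits * correct_negatives
    bad = misses * false_alarms
    odds_ratio = good / bad
    orss = (good - bad) / (good + bad)
    return odds_ratio, math.log(odds_ratio), orss


# Counts from the worked rain example in the text.
odds_ratio, log_odds_ratio, orss = odds_ratio_scores(82, 23, 38, 222)
print(f"OR = {odds_ratio:.1f}, log OR = {log_odds_ratio:.2f}, ORSS = {orss:.2f}")
```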

Methods for multi-category forecasts


Methods for verifying multi-category forecasts also start with a contingency table showing the frequency of forecasts and observations in the various bins. It is analogous to a scatter plot for categories.
Multi-category contingency table
In this table n(Fi,Oj) denotes the number of forecasts in category i that had observations in category j, N(Fi) denotes the total number of forecasts in category i, N(Oj) denotes the total number of observations in category j, and N is the total number of forecasts.
The distributions approach to forecast verification examines the relationship among the elements in the multi-category contingency table. A perfect forecast system would have values of non-zero elements only along the diagonal, and values of 0 for all entries off the diagonal. The off-diagonal elements give information about the specific nature of the forecast errors. The marginal distributions (N’s at right and bottom of table) show whether the forecast produces the correct distribution of categorical values when compared to the observations. Murphy and Winkler (1987), Murphy et al. (1989) and Brooks and Doswell (1996) develop this approach in detail.
The advantage of the distributions approach is that the nature of the forecast errors can more easily be diagnosed. The disadvantage is that it is more difficult to condense the results into a single number. There are fewer statistics that summarize the performance of multi-category forecasts. However, any multi-category forecast verification can be converted to a series of K-1 yes/no-type verifications by defining “yes” to be “in category i” or “in category i or higher”, and “no” to be “not in category i” or “below category i”.

Histogram - Plot the relative frequencies of forecast and observed categories

Answers the question: How well did the distribution of forecast categories correspond to the distribution of observed categories?
Characteristics: Shows similarity between location, spread, and skewness of forecast and observed distributions. Does not give information on the correspondence between the forecasts and observations. Histograms give information similar to box plots.

Accuracy

Answers the question: Overall, what fraction of the forecasts were in the correct category?
Range: 0 to 1. Perfect score: 1.
Characteristics: Simple, intuitive. Can be misleading since it is heavily influenced by the most common category.
Heidke skill score

Answers the question: What was the accuracy of the forecast in predicting the correct category, relative to that of random chance?
Range: -∞ to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is one form of a generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance. Requires a large sample size to make sure that the elements of the contingency table are all adequately sampled. In meteorology, at least, random chance is usually not the best forecast to compare to - it may be better to use climatology (long-term average value) or persistence (forecast is most recent observation, i.e., no change) or some other standard.
Hanssen and Kuipers discriminant (true skill statistic, Peirce’s skill score)

Answers the question: What was the accuracy of the forecast in predicting the correct category, relative to that of random chance?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1
Characteristics: Similar to the Heidke skill score (above), except that in the denominator the fraction of correct forecasts due to random chance is for an unbiased forecast.
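As a rough illustration, the multi-category accuracy, HSS and HK can be computed from a K x K table as follows. The sketch assumes the usual forms, in which both skill scores share the numerator (accuracy minus chance agreement) and differ only in the denominator: HSS uses the forecast and observed marginals, HK uses the observed marginals alone. The 3 x 3 table is made up for the example, and the function name is hypothetical.

```python
def multicategory_scores(table):
    """Return (accuracy, HSS, HK); table[i][j] = n(Fi, Oj), rows are forecasts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_forecast = [sum(table[i]) / n for i in range(k)]
    p_observed = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    accuracy = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance given the two marginal distributions.
    chance = sum(pf * po for pf, po in zip(p_forecast, p_observed))

    hss = (accuracy - chance) / (1.0 - chance)
    # HK replaces the chance term in the denominator with that of an unbiased forecast.
    hk = (accuracy - chance) / (1.0 - sum(po * po for po in p_observed))
    return accuracy, hss, hk


# Made-up 3-category table (not from the text).
example = [
    [20, 5, 0],
    [5, 30, 10],
    [0, 5, 25],
]
accuracy, hss, hk = multicategory_scores(example)
print(f"accuracy = {accuracy:.2f}, HSS = {hss:.3f}, HK = {hk:.3f}")
```

Applied to the 2 x 2 rain table, the same function reproduces the dichotomous HSS and HK values, which is a useful consistency check.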
Gerrity score
GS = (1/N) Σi Σj n(Fi,Oj) sij, where sij are the elements of a scoring matrix, defined separately for the diagonal (i = j) and off-diagonal (i ≠ j) entries, and built from the sample probabilities (observed frequencies) pi = N(Oi)/N.

Answers the question: What was the accuracy of the forecast in predicting the correct category, relative to that of random chance?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1
Characteristics: Uses all entries in the contingency table, does not depend on the forecast distribution, and is equitable (i.e., random and constant forecasts score a value of 0). GS does not reward conservative forecasting like HSS and HK, but rather rewards forecasts for correctly predicting the less likely categories. Smaller errors are penalized less than larger forecast errors. This is achieved through the use of the scoring matrix. A more detailed discussion and examples for 3-category forecasts can be found in Jolliffe and Stephenson (2012).
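The scoring matrix is easier to follow in code than in prose. The sketch below assumes the construction commonly attributed to Gerrity (1992): with a_r = (1 - sum of p_1..p_r) / (sum of p_1..p_r), the entry for i <= j is s_ij = [sum of 1/a_r for r < i, minus (j - i), plus sum of a_r for r = j..K-1] / (K - 1), and the matrix is symmetric. Treat this as an assumption to check against the cited reference rather than a definitive implementation; the function name is hypothetical.

```python
def gerrity_score(table):
    """Gerrity score; table[i][j] = n(Fi, Oj). Every category must be observed."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Sample probabilities of the observed categories (column totals / N).
    p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Odds-like ratios a_r built from the cumulative observed probabilities.
    a = []
    cumulative = 0.0
    for r in range(k - 1):
        cumulative += p[r]
        a.append((1.0 - cumulative) / cumulative)

    # Symmetric scoring matrix; the diagonal is the case j == i.
    s = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            value = (
                sum(1.0 / a[r] for r in range(i))
                - (j - i)
                + sum(a[r] for r in range(j, k - 1))
            ) / (k - 1)
            s[i][j] = s[j][i] = value

    return sum(table[i][j] * s[i][j] for i in range(k) for j in range(k)) / n


# For two categories the score reduces to HK for the rain example (about 0.63).
print(f"GS (2x2 rain example) = {gerrity_score([[82, 38], [23, 222]]):.2f}")
```

Under this construction a perfect forecast scores 1 and a constant forecast scores 0, which matches the equitability property described above.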

Methods for forecasts of continuous variables


Verifying forecasts of continuous variables measures how the values of the forecasts differ from the values of the observations. The continuous verification methods and statistics will be demonstrated on a sample data set of 10 temperature forecasts taken from Stanski et al. (1989):
Verification of continuous forecasts often includes some exploratory plots such as scatter plots and box plots, as well as various summary scores.
Scatter plot
Scatter plot - Plots the forecast values against the observed values.

Answers the question: How well did the forecast values correspond to the observed values?
Characteristics: Good first look at correspondence between forecast and observations. An accurate forecast will have points on or near the diagonal.
Scatter plots of the error can reveal relationships between the observed or forecast values and the errors.

Box plot

Box plot - Plots boxes showing the range of data falling between the 25th and 75th percentiles, with a horizontal line inside the box showing the median value and whiskers showing the complete range of the data.

Answers the question: How well did the distribution of forecast values correspond to the distribution of observed values?
Characteristics: Shows similarity between location, spread, and skewness of forecast and observed distributions. Does not give information on the correspondence between the forecasts and observations. Box plots give information similar to histograms.
Mean error

Answers the question: What is the average forecast error?
Range: -∞ to ∞. Perfect score: 0.
Characteristics: Simple, familiar. Also called the (additive) bias. Does not measure the magnitude of the errors. Does not measure the correspondence between forecasts and observations, i.e., it is possible to get a perfect score for a bad forecast if there are compensating errors.
In the example above, Mean Error = 0.8 C
(Multiplicative) bias

Answers the question: How does the average forecast magnitude compare to the average observed magnitude?
Range: -∞ to ∞. Perfect score: 1.
Characteristics: Simple, familiar. Best suited for quantities that have 0 as a lower or upper bound. Does not measure the magnitude of the errors. Does not measure the correspondence between forecasts and observations, i.e., it is possible to get a perfect score for a bad forecast if there are compensating errors.
In the example above, Bias = 1.06.
Mean absolute error

Answers the question: What is the average magnitude of the forecast errors?
Range: 0 to ∞. Perfect score: 0.
Characteristics: Simple, familiar. Does not indicate the direction of the deviations.
In the example above, MAE = 2.8 C
Root mean square error

Answers the question: What is the average magnitude of the forecast errors?
Range: 0 to ∞. Perfect score: 0.
Characteristics: Simple, familiar. Measures “average” error, weighted according to the square of the error. Does not indicate the direction of the deviations. The RMSE puts greater influence on large errors than smaller errors, which may be a good thing if large errors are especially undesirable, but may also encourage conservative forecasting.
In the example above, RMSE = 3.2 C
The root mean square factor is similar to RMSE, but gives a multiplicative error instead of an additive error.
Mean squared error

Measures the mean squared difference between the forecasts and observations.
Range: 0 to ∞. Perfect score: 0.
Characteristics: Can be decomposed into component error sources following Murphy (1987). Units of MSE are the square of the basic units.
In the example above, MSE = 10 degrees squared.
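The summary scores quoted above (mean error, bias, MAE, RMSE, MSE) can be reproduced in a few lines. The sample table itself is not shown in this copy of the text, so the ten forecast/observation pairs below are an assumption: they were chosen because they reproduce every quoted statistic, and should be checked against Stanski et al. (1989) before being relied on.

```python
import math

# ASSUMED sample data (degrees C); consistent with the statistics quoted in the text.
forecasts = [5, 10, 9, 15, 22, 13, 17, 17, 19, 23]
observations = [-1, 8, 12, 13, 18, 10, 16, 19, 23, 24]

errors = [f - o for f, o in zip(forecasts, observations)]
n = len(errors)

mean_error = sum(errors) / n                               # additive bias
multiplicative_bias = sum(forecasts) / sum(observations)   # ratio of the means
mae = sum(abs(e) for e in errors) / n
mse = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)

print(f"ME = {mean_error:.1f}, bias = {multiplicative_bias:.2f}, "
      f"MAE = {mae:.1f}, RMSE = {rmse:.1f}, MSE = {mse:.0f}")
```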
Linear error in probability space (LEPS)

LEPS = (1/N) Σ |CDFo(Fi) - CDFo(Oi)|. Measures the error in probability space as opposed to measurement space, where CDFo() is the cumulative distribution function of the observations, determined from an appropriate climatology.
Range: 0 to 1. Perfect score: 0.
Characteristics: Does not discourage forecasting extreme values if they are warranted. Requires knowledge of climatological PDF. Not yet in wide usage. Potts et al. (1996) derived an improved version of the LEPS score that is equitable and does not “bend back” (give better scores for worse forecasts near the extremes).

In the example above, suppose the climatological temperature is normally distributed with a mean of 14 C and a variance of 50 degrees squared. Then according to the first expression, LEPS = 0.106.
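A sketch of the basic LEPS calculation, taken here to be the mean absolute difference between the climatological cumulative probabilities of forecast and observation. The normal climatology (mean 14 C, variance 50) comes from the example; the ten data pairs are an assumption, since the sample table is not shown in this copy of the text.

```python
import math

# ASSUMED sample data (degrees C); the sample table is not reproduced in the text.
forecasts = [5, 10, 9, 15, 22, 13, 17, 17, 19, 23]
observations = [-1, 8, 12, 13, 18, 10, 16, 19, 23, 24]

CLIM_MEAN = 14.0
CLIM_STD = math.sqrt(50.0)


def climatological_cdf(x):
    """Cumulative probability of x under the normal climatology."""
    z = (x - CLIM_MEAN) / CLIM_STD
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


leps = sum(
    abs(climatological_cdf(f) - climatological_cdf(o))
    for f, o in zip(forecasts, observations)
) / len(forecasts)

print(f"LEPS = {leps:.3f}")
```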
Stable equitable error in probability space (SEEPS)

where n(Fi,Oj) is the joint occurrence of forecast category i and observed category j in the 3x3 contingency table, and the scoring matrix is given by
Like LEPS, SEEPS measures the error in probability space as opposed to measurement space. It was developed to assess rainfall forecasts, where (1-p1) is the climatological probability of rain (i.e., accumulation exceeding 0.2 mm, following WMO guidelines), and p2=2p3 divides the climatological cumulative rainfall distribution into “light” (lower 2/3 of rain rates ≥0.2 mm) and “heavy” (upper 1/3 of rain rates ≥0.2 mm). The threshold delineating “light” and “heavy” rain is denoted tL/H.
Range: 0 to 1. Perfect score: 0.
Characteristics: Encourages forecasting of all categories. Resistant to hedging. Requires knowledge of climatological PDF. 1-SEEPS may be preferred as it is positively oriented. Use of locally derived thresholds allows aggregation/comparison of scores across climatologically varying regimes. For further stability require 0.1 < p1 < 0.85, that is, climate not too dry or too wet so that rain (or no rain) is an extreme event. For more information see Rodwell et al. (2010).
Correlation coefficient

Addresses the question: How well did the forecast values correspond to the observed values?
Range: -1 to 1. Perfect score: 1.
Characteristics: Good measure of linear association or phase error. Visually, the correlation measures how close the points of a scatter plot are to a straight line. Does not take forecast bias into account – it is possible for a forecast with large errors to still have a good correlation coefficient with the observations. Sensitive to outliers.
In the example above, r = 0.914
Anomaly correlation

Addresses the question: How well did the forecast anomalies correspond to the observed anomalies?
Range: -1 to 1. Perfect score: 1.
Characteristics: Measures correspondence or phase difference between forecast and observations, subtracting out the climatological mean at each point, C, rather than the sample mean values. The anomaly correlation is frequently used to verify output from numerical weather prediction (NWP) models. AC is not sensitive to forecast bias, so a good anomaly correlation does not guarantee accurate forecasts. Both forms of the equation are in common use – see Jolliffe and Stephenson (2012) or Wilks (2011) for further discussion.
In the example above, if the climatological temperature is 14 C, then AC = 0.904. AC is more often used in spatial verification.
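The two correlation values can be compared side by side. The sketch below computes the ordinary (Pearson) correlation and the uncentered form of the anomaly correlation, which is the form that reproduces the value quoted above. The ten data pairs are an assumption, since the sample table is not shown in this copy of the text; the function names are illustrative.

```python
import math

# ASSUMED sample data (degrees C); the sample table is not reproduced in the text.
forecasts = [5, 10, 9, 15, 22, 13, 17, 17, 19, 23]
observations = [-1, 8, 12, 13, 18, 10, 16, 19, 23, 24]
climatology = 14.0  # climatological temperature used in the example


def correlation(x, y):
    """Pearson correlation: deviations are taken from the sample means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def anomaly_correlation(f, o, clim):
    """Uncentered anomaly correlation: deviations are taken from climatology."""
    f_anom = [a - clim for a in f]
    o_anom = [b - clim for b in o]
    num = sum(a * b for a, b in zip(f_anom, o_anom))
    return num / math.sqrt(sum(a * a for a in f_anom) * sum(b * b for b in o_anom))


r = correlation(forecasts, observations)
ac = anomaly_correlation(forecasts, observations, climatology)
print(f"r = {r:.3f}, AC = {ac:.3f}")
```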
S1 score

S1 = 100 x Σ|∆F - ∆O| / Σ max(|∆F|, |∆O|), where ∆F (∆O) refers to the horizontal gradient in the forecast (observations).
Answers the question: How well did the forecast gradients correspond to the observed gradients?
Range: 0 to ∞. Perfect score: 0.
Characteristics: It is usually applied to geopotential height or sea level pressure fields in meteorology. Long historical records of S1 in NWP show improvement in model performance over the years. Because S1 depends only on gradients, good scores can be achieved even when the forecast values are biased. Also depends on spatial resolution of the forecast.
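A minimal sketch of the S1 calculation, assuming the usual form S1 = 100 x Σ|∆F - ∆O| / Σ max(|∆F|, |∆O|) with the sums taken over pairs of adjacent grid points. The gradient values are made up for illustration, and the function name is hypothetical.

```python
def s1_score(forecast_gradients, observed_gradients):
    """S1 score from paired forecast and observed gradients (same units)."""
    numerator = sum(
        abs(df - do) for df, do in zip(forecast_gradients, observed_gradients)
    )
    denominator = sum(
        max(abs(df), abs(do))
        for df, do in zip(forecast_gradients, observed_gradients)
    )
    if denominator == 0:
        raise ValueError("S1 is undefined when all gradients are zero")
    return 100.0 * numerator / denominator


# Made-up pressure differences (hPa) between adjacent grid points.
forecast_gradients = [2.0, -1.0, 3.0, 0.5]
observed_gradients = [1.5, -2.0, 2.0, 1.0]

s1 = s1_score(forecast_gradients, observed_gradients)
print(f"S1 = {s1:.1f}")
```

Adding a constant offset to the forecast field leaves every gradient, and therefore S1, unchanged, which is why a biased forecast can still score well.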
Skill score

Answers the question: What is the relative improvement of the forecast over some reference forecast?
Range: Lower bound depends on what score is being used to compute skill and what reference forecast is used, but upper bound is always 1; 0 indicates no improvement over the reference forecast. Perfect score: 1.
Characteristics: Implies information about the value or worth of a forecast relative to an alternative (reference) forecast. In meteorology the reference forecast is usually persistence (no change from most recent observation) or climatology. The skill score can be unstable for small sample sizes. When MSE is the score used in the above expression then the resulting statistic is called the reduction of variance.
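The generic skill score is usually written as (score_forecast - score_reference) / (score_perfect - score_reference). The sketch below applies it to MSE with climatology (14 C, as in the examples above) as the reference, giving the reduction of variance. The ten data pairs are an assumption, since the sample table is not shown in this copy of the text.

```python
def skill_score(score_forecast, score_reference, score_perfect=0.0):
    """Relative improvement over a reference; 1 is perfect, 0 is no improvement."""
    return (score_forecast - score_reference) / (score_perfect - score_reference)


def mse(predictions, observations):
    """Mean squared error between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)


# ASSUMED sample data (degrees C); the sample table is not reproduced in the text.
forecasts = [5, 10, 9, 15, 22, 13, 17, 17, 19, 23]
observations = [-1, 8, 12, 13, 18, 10, 16, 19, 23, 24]
climatology_forecast = [14.0] * len(observations)  # reference: always forecast 14 C

mse_forecast = mse(forecasts, observations)
mse_reference = mse(climatology_forecast, observations)
reduction_of_variance = skill_score(mse_forecast, mse_reference)

print(f"MSE = {mse_forecast:.1f}, reference MSE = {mse_reference:.1f}, "
      f"skill = {reduction_of_variance:.2f}")
```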

See also Methods for spatial forecasts for more scientific/diagnostic techniques.
See also Other methods for additional scores for forecasts of continuous variables.

Methods for probabilistic forecasts


A probabilistic forecast gives a probability of an event occurring, with a value between 0 and 1 (or 0 and 100%). In general, it is difficult to verify a single probabilistic forecast. Instead, a set of probabilistic forecasts, pi, is verified using observations that those events either occurred (oi=1) or did not occur (oi=0).
An accurate probability forecast system has:
reliability - agreement between forecast probability and mean observed frequency
sharpness - tendency to forecast probabilities near 0 or 1, as opposed to values clustered around the mean
resolution - ability of the forecast to resolve the set of sample events into subsets with characteristically different outcomes

Reliability diagram (called “attributes diagram” when the no-resolution and no-skill w.r.t. climatology lines are included)

The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into K bins (for example, 0-5%, 5-15%, 15-25%, etc.). The sample size in each bin is often included as a histogram or values beside the data points.
Answers the question: How well do the predicted probabilities of an event correspond to their observed frequencies?
Characteristics: Reliability is indicated by the proximity of the plotted curve to the diagonal. The deviation from the diagonal gives the conditional bias. If the curve lies below the line, this indicates overforecasting (probabilities too high); points above the line indicate underforecasting (probabilities too low). The flatter the curve in the reliability diagram, the less resolution it has. A forecast of climatology does not discriminate at all between events and non-events, and thus has no resolution. Points between the “no skill” line and the diagonal contribute positively to the Brier skill score. The frequency of forecasts in each probability bin (shown in the histogram) shows the sharpness of the forecast.
The reliability diagram is conditioned on the forecasts (i.e., given that an event was predicted, what was the outcome?), and can be expected to give information on the real meaning of the forecast. It is a good partner to the ROC, which is conditioned on the observations. Some users may find a reliability table (table of observed relative frequency associated with each forecast probability) easier to understand than a reliability diagram.
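The numbers behind a reliability diagram form a reliability table, which can be built as sketched below. The forecasts and outcomes are made up, equal-width bins are used for simplicity (the text suggests bins such as 0-5%, 5-15%, which would need explicit bin edges), and the function name is hypothetical.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    event_counts = [0] * n_bins
    counts = [0] * n_bins
    for p, o in zip(probabilities, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        prob_sums[b] += p
        event_counts[b] += o
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], event_counts[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Made-up probability forecasts and binary outcomes (1 = event occurred).
probabilities = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

table = reliability_table(probabilities, outcomes)
for mean_p, obs_freq, count in table:
    print(f"forecast {mean_p:.2f}  observed {obs_freq:.2f}  n = {count}")
```

In this made-up sample the lowest bin underforecasts (0.25 observed against 0.10 forecast) and the highest bin overforecasts (0.75 against 0.90), so the plotted curve would be flatter than the diagonal.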

Brier score

Answers the question: What is the magnitude of the probability forecast errors?
Measures the mean squared probability error. Murphy (1973) showed that it could be partitioned into three terms: (1) reliability, (2) resolution, and (3) uncertainty.
Range: 0 to 1. Perfect score: 0.
Characteristics: Sensitive to climatological frequency of the event: the rarer an event, the easier it is to get a good BS without having any real skill. Negative orientation (smaller score better) - can “fix” by subtracting BS from 1.
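The Brier score is the mean of (pi - oi)² over all forecasts, and the three-term partition can be verified numerically. The sketch below uses made-up forecasts and groups them by distinct forecast value, in which case BS = reliability - resolution + uncertainty holds exactly; the function names are illustrative.

```python
from collections import defaultdict


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and binary outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_decomposition(probabilities, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(probabilities, outcomes):
        groups[p].append(o)
    reliability = sum(
        len(obs) * (p - sum(obs) / len(obs)) ** 2 for p, obs in groups.items()
    ) / n
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Made-up probability forecasts and binary outcomes (1 = event occurred).
probabilities = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

bs = brier_score(probabilities, outcomes)
reliability, resolution, uncertainty = brier_decomposition(probabilities, outcomes)
print(f"BS = {bs:.3f} = {reliability:.3f} - {resolution:.3f} + {uncertainty:.3f}")
```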
Brier skill score

Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of predicting whether or not an event occurred?
Range: -∞ to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.
Characteristics: Measures the improvement of the probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology), thus taking climatological frequency into account. Not strictly proper. Unstable when applied to small data sets; the rarer the event, the larger the number of samples needed.
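With the Brier score as the underlying measure, the skill score takes the form BSS = 1 - BS / BS_reference. A short sketch using made-up forecasts, with the sample climatology (a constant forecast equal to the observed event frequency) as the reference:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and binary outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Made-up probability forecasts and binary outcomes (1 = event occurred).
probabilities = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

# Reference forecast: always issue the sample climatological frequency.
base_rate = sum(outcomes) / len(outcomes)
reference = [base_rate] * len(outcomes)

bs = brier_score(probabilities, outcomes)
bs_reference = brier_score(reference, outcomes)
bss = 1.0 - bs / bs_reference

print(f"BS = {bs:.3f}, reference BS = {bs_reference:.3f}, BSS = {bss:.2f}")
```

With only twelve cases this value is very unstable, which illustrates the caution above about small data sets.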

Relative operating characteristic

Relative operating characteristic - Plot hit rate (POD) vs false alarm rate (POFD), using a set of increasing probability thresholds (for example, 0.05, 0.15, 0.25, etc.) to make the yes/no decision. The area under the ROC curve is frequently used as a score.

Answers the question: What is the ability of the forecast to discriminate between events and non-events?
ROC: Perfect: Curve travels from bottom left to top left of diagram, then across to top right of diagram. Diagonal line indicates no skill.
ROC area: Range: 0 to 1, 0.5 indicates no skill. Perfect score: 1
Characteristics: ROC measures the ability of the forecast to discriminate between two alternative outcomes, thus measuring resolution. It is not sensitive to bias in the forecast, so says nothing about reliability. A biased forecast may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the forecast through calibration. The ROC can thus be considered as a measure of potential usefulness.
The ROC is conditioned on the observations (i.e., given that an event occurred, what was the corresponding forecast?). It is therefore a good companion to the reliability diagram, which is conditioned on the forecasts.
More information on ROC can be found in Mason 1982, Jolliffe and Stephenson 2012 (ch.3), and the WISE site.
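The ROC construction can be sketched as follows: each probability threshold converts the forecasts to yes/no, giving one (POFD, POD) point, and the area is estimated with the trapezoidal rule. The data and thresholds are made up, and the function names are illustrative.

```python
def roc_points(probabilities, outcomes, thresholds):
    """Return sorted (POFD, POD) points, including the (0, 0) and (1, 1) end points."""
    n_events = sum(outcomes)
    n_non_events = len(outcomes) - n_events
    points = [(0.0, 0.0), (1.0, 1.0)]
    for threshold in thresholds:
        # Forecast "yes" whenever the probability reaches the threshold.
        hits = sum(1 for p, o in zip(probabilities, outcomes) if p >= threshold and o == 1)
        false_alarms = sum(1 for p, o in zip(probabilities, outcomes) if p >= threshold and o == 0)
        points.append((false_alarms / n_non_events, hits / n_events))
    return sorted(points)


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up probability forecasts and binary outcomes (1 = event occurred).
probabilities = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

points = roc_points(probabilities, outcomes, thresholds=[0.05, 0.2, 0.4, 0.8])
area = roc_area(points)
print(f"ROC area = {area:.3f}")
```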

Discrimination diagram

Discrimination diagram - Plot the likelihood of each forecast probability when the event occurred and when it did not occur. A summary score can be computed as the absolute value of the difference between the mean values of each distribution.

Answers the question: What is the ability of the forecast to discriminate between events and non-events?
Perfect discrimination is when there is no overlap between the distributions of forecast probabilities for observed events and non-events. As with the ROC, the discrimination diagram is conditioned on the observations (i.e., given that an event occurred, what was the corresponding forecast?). Some users may find the discrimination diagram easier to understand than the ROC.

Ranked probability score

RPS = (1/(M-1)) Σm=1..M [(Σk=1..m pk) - (Σk=1..m ok)]², where M is the number of forecast categories, pk is the predicted probability in forecast category k, and ok is an indicator (0=no, 1=yes) for the observation in category k.
Answers the question: How well did the probability forecast predict the category that the observation fell into?
Range: 0 to 1. Perfect score: 0.
Characteristics: Measures the sum of squared differences in cumulative probability space for a multi-category probabilistic forecast. Penalizes forecasts more severely when their probabilities are further from the actual outcome. Negative orientation - can “fix” by subtracting RPS from 1. For two forecast categories the RPS is the same as the Brier Score.
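Continuous version: RPS = ∫ [Pfcst(x) - Pobs(x)]² dx, where Pfcst and Pobs are the cumulative distributions of the forecast and the observation.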
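For the discrete (multi-category) case, the RPS is the sum of squared differences between the cumulative forecast probabilities and the cumulative observation indicator, divided by M - 1. A minimal sketch with made-up 3-category forecasts; the function name and the zero-based category index are illustrative choices.

```python
def ranked_probability_score(category_probabilities, observed_category):
    """RPS for one forecast; observed_category is a zero-based category index."""
    m = len(category_probabilities)
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    total = 0.0
    for k, p in enumerate(category_probabilities):
        cumulative_forecast += p
        cumulative_observed += 1.0 if k == observed_category else 0.0
        total += (cumulative_forecast - cumulative_observed) ** 2
    return total / (m - 1)


# Made-up forecast probabilities for (below, near, above) normal.
forecast = [0.2, 0.5, 0.3]
rps_near = ranked_probability_score(forecast, observed_category=1)
rps_above = ranked_probability_score(forecast, observed_category=2)
print(f"RPS if 'near' observed = {rps_near:.3f}, if 'above' observed = {rps_above:.3f}")

# With two categories the RPS equals the Brier score for that forecast.
rps_two = ranked_probability_score([0.3, 0.7], observed_category=1)
print(f"two-category RPS = {rps_two:.2f}")
```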
Ranked probability skill score

Answers the question: What is the relative improvement of the probability forecast over climatology in predicting the category that the observations fell into?
Range: -∞ to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.
Characteristics: Measures the improvement of the multi-category probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology). Strictly proper. Takes climatological frequency into account. Unstable when applied to small data sets.
Relative value (value score) (Richardson, 2000; Wilks, 2001)

Answers the question: For a cost/loss ratio C/L for taking action based on a forecast, what is the relative improvement in economic value between climatological and perfect information?
Range: -∞ to 1. Perfect score: 1.
Characteristics: The relative value is a skill score of expected expense, with climatology as the reference forecast. Because the cost/loss ratio is different for different users of forecasts, the value is generally plotted as a function of C/L.
Like ROC, it gives information that can be used in decision making. When applied to a probabilistic forecast system (for example, an ensemble prediction system), the optimal value for a given C/L may be achieved by a different forecast probability threshold than the optimal value for a different C/L. In this case it is necessary to compute relative value curves for the entire range of probabilities, then select the optimal values (the upper envelope of the relative value curves) to represent the value of the probabilistic forecast system.
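A sketch of the relative value calculation for a yes/no forecast, assuming the usual static cost/loss model: taking action costs C and fully prevents the loss L, and expenses are expressed in units of L. The rain-example counts from the dichotomous section are reused, and the function name is hypothetical.

```python
def relative_value(hits, misses, false_alarms, correct_negatives, cost_loss_ratio):
    """Relative value V = (E_climate - E_forecast) / (E_climate - E_perfect)."""
    total = hits + misses + false_alarms + correct_negatives
    base_rate = (hits + misses) / total
    # Forecast user: pays C on every "yes" forecast and loses L on every miss.
    expense_forecast = (hits + false_alarms) / total * cost_loss_ratio + misses / total
    # Climatology: always act or never act, whichever is cheaper on average.
    expense_climate = min(cost_loss_ratio, base_rate)
    # Perfect information: act only when the event actually occurs.
    expense_perfect = base_rate * cost_loss_ratio
    return (expense_climate - expense_forecast) / (expense_climate - expense_perfect)


# Rain example counts; value as a function of the user's cost/loss ratio.
for ratio in (0.1, 0.2, 105 / 365, 0.5, 0.8):
    value = relative_value(82, 23, 38, 222, ratio)
    print(f"C/L = {ratio:.2f}  relative value = {value:.2f}")
```

In this model the value peaks when C/L equals the climatological event frequency (105/365 here), where it coincides with the Hanssen and Kuipers discriminant; users with very small or very large C/L gain little or even lose relative to climatology.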
See also Methods for ensemble prediction systems for more scientific/diagnostic techniques.







