Z-Values in Logistic Regression

The z-value is the regression coefficient divided by its standard error. It is also sometimes called the z-statistic. It is usually given in the third column of the logistic regression coefficient table output. Thus, in the example below, the z-value for the regression coefficient for ResidenceLength is 0.024680/0.013800 = 1.79.
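The arithmetic for the ResidenceLength row can be checked directly; the estimate and standard error below are copied from the coefficient table in this post:

```python
# z-value = estimate / standard error, using the ResidenceLength
# row of the coefficient table (values copied from this post).
estimate = 0.024680
std_error = 0.013800

z_value = estimate / std_error
print(round(z_value, 2))  # 1.79, matching the "z value" column
```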

If the z-value is too big in magnitude (i.e., either too positive or too negative), it indicates that the corresponding true regression coefficient is not 0 and the corresponding X-variable matters. A good rule of thumb is to use a cut-off value of 2 which approximately corresponds to a two-sided hypothesis test with a significance level of \alpha=0.05. So, for the ResidenceLength variable, the z-value is 1.79 which is not large enough to provide strong evidence that ResidenceLength matters.

Note: The relationship between the regression coefficient, its standard error, the z-value, and the p-value is virtually identical in both logistic regression and regular least-squares regression. So if you understand this in regular regression, you also understand it in logistic regression.

Detailed Explanation

In statistics, the letter “Z” is often used to refer to a random variable that has a standard normal distribution. A standard normal distribution is a normal distribution with expectation 0 and standard deviation 1. This is the normal distribution that is generally tabulated in the back of any basic statistics book.

Because of this, the term “z-value” is often used to refer to the value of a statistic that has a standard normal distribution. Sometimes it is also used to refer to percentile points from the standard normal distribution that are compared to the value of a statistic. For example, one might refer to “the z-value corresponding to a 95% confidence interval” (which would be 1.96).
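The 1.96 figure can be reproduced from the standard normal quantile function; here is a minimal check using only the Python standard library:

```python
from statistics import NormalDist  # standard normal by default

# Upper 2.5% point of the standard normal distribution:
# the "z-value corresponding to a 95% confidence interval".
z_crit = NormalDist().inv_cdf(0.975)
print(round(z_crit, 2))  # 1.96
```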

In basic univariate statistics, z-statistics and z-values usually come about as a result of standardizing a statistic such as the sample mean \bar X or sample proportion \hat p. Standardizing a statistic means subtracting its expected value \mu_\text{stat} and then dividing by its standard error \sigma_\text{stat} (the standard error of a statistic is its standard deviation). The leading example of this from basic statistics would be a z-statistic derived from the sample mean:

  \[Z = { \bar X - \mu \over \sigma/\sqrt{n} }\]

Here \bar X is the statistic to be standardized (the sample mean), \mu is its expectation (which, for the sample mean, is the same as the population mean), and the standard deviation of the statistic \sigma_\text{stat} is \sigma_{\bar X} = \sigma/\sqrt{n}. Here \sigma is the population standard deviation, and the formula \sigma_{\bar X} = \sigma/\sqrt{n} comes about because of the relationship between the standard deviation of a sample mean and the population standard deviation. Finally, the statistic has a normal distribution as the sample size gets large because of the central limit theorem.
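As a quick illustration of this formula with made-up numbers (the sample mean, population mean, population standard deviation, and sample size below are hypothetical, chosen only so the arithmetic is easy to follow):

```python
import math

# Hypothetical values, chosen only to illustrate the formula.
x_bar = 5.2    # sample mean
mu = 5.0       # population mean (= expectation of the sample mean)
sigma = 1.0    # population standard deviation
n = 100        # sample size

se = sigma / math.sqrt(n)   # standard error of the sample mean
z = (x_bar - mu) / se
print(round(z, 2))  # 2.0 -- the sample mean is about 2 standard errors above mu
```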

Hopefully, the above reminds you of enough of your basic statistics that using these ideas in the context of logistic regression will make sense.

So where do z-values come about in logistic regression? They primarily come about as a result of standardizing the logistic regression coefficients when testing whether or not the individual X-variables are related to the Y-variable. For example, consider the coefficient table output from the logistic regression in the “Kid Creative” example I discussed in the post Understanding Logistic Regression Output: Part 2 — Which Variables Matter.


Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)  Odds Ratio
(Intercept)     -17.910000    2.223000    -8.06    0.0000
Income            0.000202    0.000024     8.55    0.0000      1.0002
IsFemale          1.646000    0.465100     3.54    0.0004      5.1862
IsMarried         0.566200    0.586400     0.97    0.3343      1.7616
HasCollege       -0.279400    0.443700    -0.63    0.5290      0.7562
IsProfessional    0.225300    0.465000     0.49    0.6280      1.2527
IsRetired        -1.159000    0.932300    -1.24    0.2140      0.3138
Unemployed        0.988600    4.690000     0.21    0.8330      2.6875
ResidenceLength   0.024680    0.013800     1.79    0.0738      1.0250
DualIncome        0.451800    0.521500     0.87    0.3863      1.5711
Minors            1.133000    0.463500     2.44    0.0145      3.1050
Own               1.056000    0.559400     1.89    0.0590      2.8748
House            -0.926500    0.621800    -1.49    0.1362      0.3959
White             1.864000    0.545400     3.42    0.0006      6.4495
English           1.530000    0.840700     1.82    0.0687      4.6182
PrevChildMag      1.557000    0.711900     2.19    0.0287      4.7446
PrevParentMag     0.477700    0.624000     0.77    0.4439      1.6124
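As an aside, the last column of the table is just the exponential of the first: the odds ratio is exp(β̂). A quick check against two rows (values copied from the table):

```python
import math

# Estimate -> Odds Ratio is exp(estimate); values copied from the table above.
rows = [
    ("IsFemale", 1.646000, 5.1862),
    ("ResidenceLength", 0.024680, 1.0250),
]

for name, estimate, table_odds_ratio in rows:
    odds_ratio = math.exp(estimate)
    # agrees with the Odds Ratio column up to rounding
    print(f"{name}: exp({estimate}) = {odds_ratio:.4f} (table: {table_odds_ratio})")
```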

You will see that the z-values are given in the third column of numbers in the table. These z-values are computed as the test statistic for the hypothesis test that the true corresponding regression coefficient \beta is 0. (Note: The p-values computed from the z-values are given in the fourth column of numbers in the regression coefficient output table. I generally do not look at the z-values, but rather use the p-values.)

More specifically, suppose we want to determine if an X-variable matters (that is, has a significant relationship to the Y variable). We determine this by testing the null hypothesis that the corresponding regression coefficient \beta is 0. In hypothesis testing, we assume the null hypothesis is true, and then see if the data provide evidence against it. So in this case, we assume \beta is 0. That is, we assume the expectation of the fitted regression coefficient \hat\beta is 0. So we standardize the regression coefficient as follows:

  \[Z = {\hat\beta - 0 \over \hat\sigma_{\hat\beta} } = \hat\beta/\hat\sigma_{\hat\beta}\]
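Applied to a few rows of the coefficient table above (estimates and standard errors copied from the output), this division reproduces the z value column up to rounding:

```python
# (estimate, standard error) pairs copied from the coefficient table.
rows = {
    "IsFemale": (1.646000, 0.465100),
    "ResidenceLength": (0.024680, 0.013800),
    "Minors": (1.133000, 0.463500),
}

for name, (beta_hat, se_hat) in rows.items():
    z = beta_hat / se_hat          # standardized coefficient
    print(f"{name}: z = {z:.2f}")  # matches the "z value" column
```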

Note that there is no closed-form formula for \hat\sigma_{\hat\beta}. It is computed numerically as part of fitting the model, which requires solving a non-linear system of equations.

So, for example, consider the ResidenceLength regression coefficient in the coefficient output table above. For this variable \hat\beta=0.024680 and \hat\sigma_{\hat\beta}=0.013800, so the Z-value is \hat\beta/\hat\sigma_{\hat\beta} = 0.024680/0.013800 = 1.79. So the value in the third column of numbers is Z=1.79.

How do we interpret the Z-values? As a rough rule of thumb, if the absolute value of the Z-value is bigger than 2.0, the variable is significant (which means that there is statistical evidence that it is related to the Y variable). This gives a rough hypothesis test with a significance level of about \alpha=0.05.
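The rule of thumb is easy to sketch in code: flag any coefficient with |z| > 2. The z-values below are copied from a few rows of the table above; this is only the rough cut-off, not a replacement for the p-values.

```python
# z values copied from the coefficient table above (a subset of rows).
z_values = {
    "Income": 8.55, "IsFemale": 3.54, "IsMarried": 0.97,
    "ResidenceLength": 1.79, "Minors": 2.44, "White": 3.42,
}

# Rough rule of thumb: |z| > 2 suggests the variable matters
# (approximately a two-sided test at alpha = 0.05).
significant = [name for name, z in z_values.items() if abs(z) > 2.0]
print(significant)  # ['Income', 'IsFemale', 'Minors', 'White']
```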

More precisely, in the hypothesis test, select a significance level such as \alpha=0.05. Determine the corresponding critical value for the test. This will depend on whether the hypothesis test is one-sided or two-sided. If it is one-sided, the critical value will be the upper \alpha percentage point of the standard normal distribution (generally referred to as z_\alpha). If it is a two-sided test (most common), then the critical value is the upper \alpha/2 percentage point (generally referred to as z_{\alpha/2}). The absolute value of the Z-value is then compared to the appropriate critical value to determine if the test is significant. That is, the regression coefficient is significantly different from 0 if:

  \[|\text{Z-value}| \ge z_\alpha \;\;\;\text{(one-sided)}\]

or

  \[|\text{Z-value}| \ge z_{\alpha/2} \;\;\;\text{(two-sided)}\]
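Equivalently, the p-values in the fourth column of the table follow from the z-values via the standard normal CDF: for a two-sided test, p = 2(1 − Φ(|Z|)). A check for the ResidenceLength row, using only the Python standard library (estimate and standard error copied from the table):

```python
from statistics import NormalDist

# ResidenceLength row of the table: estimate and standard error.
z = 0.024680 / 0.013800           # z-value, about 1.79

# Two-sided p-value: probability that a standard normal exceeds |z|
# in either direction.
p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(round(p, 4))  # close to the 0.0738 shown in the Pr(>|z|) column
```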
