Analysis of Variance, Regression Analysis, and Multilevel Regression Analysis (1)

W2388727409

Original link: http://www.cnblogs.com/JoAnnal/p/6734515.html

1. Analysis of Variance

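Definition: Analysis of variance (ANOVA; also known in Chinese as 变异数分析) was developed by R. A. Fisher and is used to test whether the means of two or more samples differ significantly. Because many factors are at work, the data collected in a study fluctuate. The causes of this fluctuation fall into two categories: uncontrollable random factors, and controllable factors that the study imposes and that affect the outcome.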

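SST = SSB + SSW

Here SST is the total sum of squares, SSB the between-group sum of squares, and SSW the within-group sum of squares. In a one-way layout SSW is the error sum of squares (also written SSE), so it is not added a second time.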
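As a quick numerical check, the sketch below computes the total, between-group, and within-group sums of squares for a one-way layout and confirms that the total splits into the other two. The data and the function name `anova_sums_of_squares` are made up for illustration.

```python
from statistics import mean

def anova_sums_of_squares(groups):
    """Return (SST, SSB, SSW) for a list of groups of observations."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    # Total variation around the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    # Variation of the group means around the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ssw = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return sst, ssb, ssw

# Made-up data: three groups of three observations each.
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]
sst, ssb, ssw = anova_sums_of_squares(groups)
print(sst, ssb, ssw)  # 60.0 54.0 6.0
```

In this toy data most of the variation lies between groups (54 out of 60), which is the situation in which an ANOVA would report a group effect.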

Classes of models

There are three classes of models used in the analysis of variance, and these are outlined here.

Fixed-effects models

The fixed-effects model (class I) of analysis of variance applies to situations in which the experimenter applies one or more treatments to the subjects of the experiment to see whether the response variable values change. This allows the experimenter to estimate the ranges of response variable values that the treatment would generate in the population as a whole.

Fixed effects model

In statistics, a fixed effects model is a statistical model that represents the observed quantities in terms of explanatory variables that are treated as if the quantities were non-random.

Random-effects models

The random-effects model (class II) is used when the treatments are not fixed. This occurs when the various factor levels are sampled from a larger population. Because the levels themselves are random variables, some assumptions and the method of contrasting the treatments (a multi-variable generalization of simple differences) differ from the fixed-effects model.

In statistics, a random effects model, also called a variance components model, is a kind of hierarchical linear model. It assumes that the data being analysed are drawn from a hierarchy of different populations whose differences relate to that hierarchy. In econometrics, random effects models are used in the analysis of hierarchical or panel data when one assumes no fixed effects (it allows for individual effects). The random effects model is a special case of the mixed model. Contrast this with the biostatistics definitions, as biostatisticians use "fixed" and "random" effects to refer respectively to the population-average and subject-specific effects (where the latter are generally assumed to be unknown, latent variables).

Simple example

Suppose m large elementary schools are chosen randomly from among thousands in a large country. Suppose also that n pupils of the same age are chosen randomly at each selected school. Their scores on a standard aptitude test are ascertained. Let Y_ij be the score of the jth pupil at the ith school. A simple way to model the relationships of these quantities is

$$Y_{ij} = \mu + U_i + W_{ij},$$

where μ is the average test score for the entire population. In this model U_i is the school-specific random effect: it measures the difference between the average score at school i and the average score in the entire country. The term W_ij is the individual-specific effect, i.e., it's the deviation of the j-th pupil’s score from the average for the i-th school.

The model can be augmented by including additional explanatory variables, which would capture differences in scores among different groups. For example:

$$Y_{ij} = \mu + \beta_1 \mathrm{Sex}_{ij} + \beta_2 \mathrm{Race}_{ij} + \beta_3 \mathrm{ParentsEduc}_{ij} + U_i + W_{ij},$$

where Sex_ij is the dummy variable for boys/girls, Race_ij is the dummy variable for white/black pupils, and ParentsEduc_ij records the average education level of child’s parents. This is a mixed model, not a purely random effects model, as it introduces fixed-effects terms for Sex, Race, and Parents' Education.

Variance components

The variance of Y_ij is the sum of the variances τ² and σ² of U_i and W_ij respectively.

Let

$$\overline{Y}_{i\bullet} = \frac{1}{n} \sum_{j=1}^{n} Y_{ij}$$

be the average, not of all scores at the ith school, but of those at the ith school that are included in the random sample. Let

$$\overline{Y}_{\bullet\bullet} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} Y_{ij}$$

be the grand average.

Let

$$SSW = \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \overline{Y}_{i\bullet})^{2}$$

$$SSB = n \sum_{i=1}^{m} (\overline{Y}_{i\bullet} - \overline{Y}_{\bullet\bullet})^{2}$$

be respectively the sum of squares due to differences within groups and the sum of squares due to difference between groups. Then it can be shown that

$$\frac{1}{m(n-1)} E(SSW) = \sigma^{2}$$

and

$$\frac{1}{(m-1)n} E(SSB) = \frac{\sigma^{2}}{n} + \tau^{2}.$$

These "expected mean squares" can be used as the basis for estimation of the "variance components" σ² and τ².
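The two expected mean squares suggest a simple method-of-moments estimator: estimate σ² by SSW / (m(n − 1)), then subtract that estimate divided by n from SSB / ((m − 1)n) to get τ². The sketch below tries this on simulated school data. The parameter values, the seed, and the function name `estimate_variance_components` are all assumptions chosen for illustration, and the design is assumed balanced (n pupils in every school).

```python
import random
from statistics import mean

def estimate_variance_components(y):
    """Method-of-moments estimates (sigma^2, tau^2) for a balanced design.

    y is a list of m groups (schools), each holding n observations (pupils).
    """
    m, n = len(y), len(y[0])
    group_means = [mean(row) for row in y]
    grand_mean = mean(group_means)  # equals the overall mean when balanced
    ssw = sum((v - gm) ** 2 for row, gm in zip(y, group_means) for v in row)
    ssb = n * sum((gm - grand_mean) ** 2 for gm in group_means)
    sigma2_hat = ssw / (m * (n - 1))
    tau2_hat = ssb / ((m - 1) * n) - sigma2_hat / n
    return sigma2_hat, tau2_hat

# Simulate Y_ij = mu + U_i + W_ij with assumed parameter values.
rng = random.Random(0)
mu, tau, sigma = 100.0, 5.0, 10.0
m, n = 200, 30
scores = []
for _ in range(m):
    u_i = rng.gauss(0.0, tau)  # school-specific random effect
    scores.append([mu + u_i + rng.gauss(0.0, sigma) for _ in range(n)])

sigma2_hat, tau2_hat = estimate_variance_components(scores)
print(sigma2_hat, tau2_hat)  # should land near sigma**2 = 100 and tau**2 = 25
```

Note that this estimator of τ² can come out negative in small samples; software for mixed models usually handles that case differently.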

Dummy variable (statistics)

From Wikipedia, the free encyclopedia

In statistics and econometrics, particularly in regression analysis, a dummy variable (also known as an indicator variable, design variable, Boolean indicator, categorical variable, binary variable, or qualitative variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables are used as devices to sort data into mutually exclusive categories (such as smoker/non-smoker). For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy variable can thus be thought of as a truth value represented as a numerical value 0 or 1 (as is sometimes done in computer programming).

The difference between coding a dummy variable as 0/1 and as 1/2:

A dummy independent variable (also called a dummy explanatory variable) which for some observation has a value of 0 will cause that variable's coefficient to have no role in influencing the dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter the intercept. For example, suppose Gender is one of the qualitative variables relevant to a regression. Then, female and male would be the categories included under the Gender variable. If female is arbitrarily assigned the value of 1, then male would get the value 0. Then the intercept (the value of the dependent variable if all other explanatory variables hypothetically took on the value zero) would be the constant term for males but would be the constant term plus the coefficient of the gender dummy in the case of females.
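To make the 0/1 versus 1/2 comparison concrete, the sketch below fits the same made-up data twice with a one-regressor least-squares line. The data and the helper `simple_ols` are assumptions for illustration. Recoding the dummy as 1/2 leaves the coefficient on the dummy unchanged, but the intercept no longer equals the mean of any group, because no observation has the value 0.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up outcome values: three males (mean 12), then three females (mean 17).
y = [10.0, 12.0, 14.0, 15.0, 17.0, 19.0]
female_01 = [0, 0, 0, 1, 1, 1]            # male = 0, female = 1
female_12 = [d + 1 for d in female_01]    # male = 1, female = 2

a01, b01 = simple_ols(female_01, y)
a12, b12 = simple_ols(female_12, y)
print(a01, b01)  # intercept 12 is the male mean; slope 5 is female minus male
print(a12, b12)  # slope is still 5, but the intercept is 12 - 5 = 7
```

With 0/1 coding the intercept is directly interpretable as the base-group mean, which is the usual reason for preferring it.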

ANOVA models

A regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature) is called an Analysis of Variance (ANOVA) model.

ANOVA model with one qualitative variable

Suppose we want to run a regression to find out if the average annual salary of public school teachers differs among three geographical regions in Country A with 51 states: (1) North (21 states) (2) South (17 states) (3) West (13 states). Say that the simple arithmetic average salaries are as follows: $24,424.14 (North), $22,894 (South), $26,158.62 (West). The arithmetic averages are different, but are they statistically different from each other? To compare the mean values, Analysis of Variance techniques can be used. The regression model can be defined as:

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + u_i$$

where

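Y_i = average annual salary of public school teachers in state i

D_2i = 1 if state i is in the North Region

D_2i = 0 otherwise (any other region)

D_3i = 1 if state i is in the South Region

D_3i = 0 otherwise (any other region)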

In this model, we have only qualitative regressors, taking the value of 1 if the observation belongs to a specific category and 0 if it belongs to any other category. This makes it an ANOVA model.

Now, taking the expectation of both sides, we obtain the following:

Mean salary of public school teachers in the North Region:

E(Y_i|D_2i = 1, D_3i = 0) = α₁ + α₂

Mean salary of public school teachers in the South Region:

E(Y_i|D_2i = 0, D_3i = 1) = α₁ + α₃

Mean salary of public school teachers in the West Region:

E(Y_i|D_2i = 0, D_3i = 0) = α₁

(The error term does not appear in the expected values because it is assumed to satisfy the usual OLS conditions, i.e., E(u_i) = 0.)

The expected values can be interpreted as follows: the mean salary of public school teachers in the West is equal to the intercept term α₁ in the multiple regression equation, and the differential intercept coefficients, α₂ and α₃, show by how much the mean salaries of teachers in the North and South Regions differ from that of the teachers in the West. Thus, the mean salaries of teachers in the North and South are compared against the mean salary of the teachers in the West. Hence, the West Region becomes the base group or the benchmark group, i.e., the group against which the comparisons are made. The omitted category, i.e., the category to which no dummy is assigned, is taken as the base group category.

Using the given data, the result of the regression would be:

Ŷ_i = 26,158.62 − 1734.473D_2i − 3264.615D_3i

se = (1128.523) (1435.953) (1499.615)

t = (23.1759) (−1.2078) (−2.1776)

p = (0.0000) (0.2330) (0.0349)

R² = 0.0901

where se = standard error, t = t-statistic, p = p-value

The regression result can be interpreted as follows: the mean salary of the teachers in the West (base group) is about $26,158, the salary of the teachers in the North is lower by about $1734 ($26,158.62 − $1734.473 = $24,424.14, which is the average salary of the teachers in the North), and that of the teachers in the South is lower by about $3265 ($26,158.62 − $3264.615 = $22,894, which is the average salary of the teachers in the South).

To find out if the mean salaries of the teachers in the North and South are statistically different from that of the teachers in the West (the comparison category), we have to find out if the slope coefficients of the regression result are statistically significant. (To test whether groups differ significantly, we test whether the slope coefficients are significant. Intuitively, a significant coefficient means the grouping variable contributes significantly to explaining the response, so that group differs significantly from the base group. A non-significant coefficient means the grouping variable contributes little and the groups differ little.) For this, we need to consider the p values. The estimated slope coefficient for the North is not statistically significant, as its p value is 23 percent; however, that of the South is statistically significant at the 5% level, as its p value is only around 3.5 percent. Thus the overall result is that the mean salaries of the teachers in the West and North are not statistically different from each other, but the mean salary of the teachers in the South is statistically lower than that in the West by around $3265. The model is diagrammatically shown in Figure 2. This model is an ANOVA model with one qualitative variable having 3 categories.
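The correspondence between dummy-variable coefficients and group means can be checked numerically. The sketch below is a minimal pure-Python least-squares fit via the normal equations; the salary figures are hypothetical (they are not the 51-state data behind the estimates quoted in the text), and the helpers `solve_linear_system` and `ols` are written here only for illustration.

```python
from statistics import mean

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical salaries, three states per region.
salaries = {
    "North": [24000.0, 25000.0, 23500.0],
    "South": [22000.0, 23000.0, 22500.0],
    "West": [26000.0, 27000.0, 25500.0],
}
design, y = [], []
for region, values in salaries.items():
    for value in values:
        # Columns: intercept, D2 (North), D3 (South). West is the base group.
        design.append([1.0, float(region == "North"), float(region == "South")])
        y.append(value)

alpha1, alpha2, alpha3 = ols(design, y)
print(alpha1)  # mean salary in the West (base group)
print(alpha2)  # North mean minus West mean
print(alpha3)  # South mean minus West mean
```

The fit only reproduces the group means; judging significance additionally requires the standard errors and p values, which this sketch does not compute.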

ANCOVA models

A regression model that contains a mixture of both quantitative (numeric) and qualitative (categorical) variables is called an Analysis of Covariance (ANCOVA) model. ANCOVA models are extensions of ANOVA models. They statistically control for the effects of quantitative explanatory variables (also called covariates or control variables).

Dummy dependent variables

What happens if the dependent variable is a dummy?

A model with a dummy dependent variable (also known as a qualitative dependent variable) is one in which the dependent variable, as influenced by the explanatory variables, is qualitative in nature. Some decisions regarding 'how much' of an act must be performed involve a prior decision on whether to perform the act or not. For example, the amount of output to produce, the cost to be incurred, etc. involve prior decisions on whether to produce or not, whether to spend or not, etc. Such "prior decisions" become dependent dummies in the regression model.

For example, the decision of a worker to be a part of the labour force becomes a dummy dependent variable. The decision is dichotomous, i.e., the decision has two possible outcomes: yes and no. So the dependent dummy variable Participation would take on the value 1 if participating, 0 if not participating.

When the qualitative dependent dummy variable has more than two values (such as affiliation to many political parties), it becomes a multiresponse, multinomial, or polychotomous model.

Dependent dummy variable models

Analysis of dependent dummy variable models can be done through different methods. One such method is the usual OLS method, which in this context is called the linear probability model. An alternative method is to assume that there is an unobservable continuous latent variable Y^* and that the observed dichotomous variable Y = 1 if Y^* > 0, 0 otherwise. This is the underlying concept of the logit and probit models. These models are discussed in brief below.

Linear probability model

An ordinary least squares model in which the dependent variable Y is a dichotomous dummy, taking the values of 0 and 1, is the linear probability model (LPM).

The model assumes that, for a binary outcome (Bernoulli trial) $Y$ and its associated vector of explanatory variables $X$,

$$\Pr(Y=1 \mid X=x) = x'\beta.$$

For this model,

$$E[Y \mid X=x] = \Pr(Y=1 \mid X=x) = x'\beta,$$

and hence the vector of parameters β can be estimated using least squares. This method of fitting would be inefficient, and can be improved by adopting an iterative scheme based on weighted least squares, in which the model from the previous iteration is used to supply estimates of the conditional variances, $\operatorname{Var}(Y \mid X=x)$, which vary between observations. This approach can be related to fitting the model by maximum likelihood.

Some problems are inherent in the LPM model:

The regression line will not fit the data well, and hence goodness-of-fit measures such as R² will not be reliable.
Models that are analyzed using the LPM approach will have heteroscedastic disturbances.
The error term will have a non-normal distribution.
The LPM may give predicted values of the dependent variable that are greater than 1 or less than 0. This will be difficult to interpret as the predicted values are intended to be probabilities, which must lie between 0 and 1.
There might exist a non-linear relationship between the variables of the LPM, in which case the linear regression will not fit the data accurately.
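The out-of-range problem listed above is easy to reproduce. The sketch below fits a straight line by ordinary least squares to a made-up 0/1 outcome and prints the fitted "probabilities" at the ends of the range; the data and the helper `simple_ols` are assumptions for illustration.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: the binary outcome switches from 0 to 1 as x grows.
x = [float(v) for v in range(10)]
y = [0.0] * 5 + [1.0] * 5

intercept, slope = simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]
print(round(fitted[0], 3), round(fitted[-1], 3))  # -0.182 1.182
```

The fitted value at x = 0 is negative and the one at x = 9 exceeds 1, so neither can be read as a probability.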

Alternatives to LPM

Figure 4: A cumulative distribution function.

To avoid the limitations of the LPM, what is needed is a model in which, as the explanatory variable X_i increases, P_i = E(Y_i | X_i) = Pr(Y_i = 1 | X_i) remains within the range between 0 and 1. Thus the relationship between the independent and dependent variables is necessarily non-linear. (Why necessarily? A straight line in X_i with non-zero slope is unbounded, so for large or small enough X_i it must leave the interval [0, 1].)

For this purpose, a cumulative distribution function (CDF) can be used to estimate the dependent dummy variable regression. Figure 4 shows an 'S'-shaped curve, which resembles the CDF of a random variable. In this model, the probability is between 0 and 1 and the non-linearity has been captured. The choice of the CDF to be used is now the question.

Two alternative CDFs can be used: the logistic and normal CDFs. The logistic CDF gives rise to the logit model and the normal CDF gives rise to the probit model.

Logit model

The shortcomings of the LPM led to the development of a more refined and improved model called the logit model. In the logit model, the cumulative distribution of the error term in the regression equation is logistic. The regression is more realistic in that it is non-linear.

The logit model is estimated using the maximum likelihood approach. In this model, $P(Y=1|X)$, the probability of the dependent variable taking the value 1 given the independent variable, is:

$$P_i = \frac{1}{1 + e^{-z_i}} = \frac{e^{z_i}}{1 + e^{z_i}}$$

where $z_{i}=\alpha _{1}+\alpha _{2}X_{i}$.

The model is then expressed in terms of the odds: what is modeled in the logistic regression is the natural logarithm of the odds, the odds being defined as $P/(1-P)$. Taking the natural log of the odds, the logit (L_i) is expressed as

$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = z_i = \alpha_1 + \alpha_2 X_i.$$

This relationship shows that L_i is linear in X_i, but the probabilities P_i are not linear in X_i.
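The contrast between a linear logit and non-linear probabilities can be seen numerically. In the sketch below the coefficient values α₁ = −2 and α₂ = 0.5 are hypothetical, chosen only so that the numbers come out round; no model is being estimated.

```python
import math

def logistic(z):
    """Logistic CDF: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Natural log of the odds p / (1 - p); the inverse of logistic()."""
    return math.log(p / (1.0 - p))

alpha1, alpha2 = -2.0, 0.5          # hypothetical coefficients
xs = [0.0, 2.0, 4.0, 6.0, 8.0]      # equally spaced values of X_i

probs = [logistic(alpha1 + alpha2 * x) for x in xs]
logits = [logit(p) for p in probs]

for x, p, l in zip(xs, probs, logits):
    print(x, round(p, 3), round(l, 3))
```

Each step of 2 in X_i raises the logit by exactly 1, while the probability rises by unequal amounts: the steps are largest near P_i = 0.5 and shrink toward the tails.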

Probit model

Another model that was developed to offset the disadvantages of the LPM is the probit model. The probit model uses the same approach to non-linearity as does the logit model; however, it uses the normal CDF instead of the logistic CDF.
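To compare the two link functions side by side, the sketch below evaluates the standard normal CDF (built from `math.erf`) and the logistic CDF at a few points. The evaluation points are arbitrary and no model is fitted; the function names are chosen here for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(z, round(normal_cdf(z), 4), round(logistic_cdf(z), 4))
```

Both curves are S-shaped, pass through 0.5 at z = 0, and stay between 0 and 1. On the same scale the logistic CDF approaches 0 and 1 more slowly, which is why logit and probit coefficients fitted to the same data are not directly comparable in magnitude.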

Reposted from: https://www.cnblogs.com/JoAnnal/p/6734515.html
