Statistics Primer -- L1 for Data Science

A census is an effort to collect information about every object in a population. In practice it is usually impractical to get information from every object, so we draw a sample and use it to make inferences. The sampling method then becomes important for drawing the right conclusion: a good sampling method should cover all the different kinds of objects in the population.

Let's say we have a fair coin to toss, so we know the probability of heads (or tails) is one half. If we toss it 100 times, we expect to get about 50 heads. This is the kind of question probability lets us answer.

Statistics, on the other hand, goes the opposite way: we have a coin but we don't know whether it is fair. We toss it 100 times and observe 60 heads. The question now is whether this is a fair coin whose probability of heads is 0.5. Statistics lets us infer population characteristics, such as fairness, from the sample data.

In the real world, the only way we can observe our population model is through data. How to reconstruct the model from observations is the main question of this course.

(data -> model)

The median is usually more stable across random subsets of the population, but the mean has some very nice statistical properties that make analysis easier.

For example, suppose we have 100 observations and rank them from low to high. The value of the 25th observation is the 25th percentile. One of the most commonly used percentiles is the 99th percentile, or p99, and the median is the same as the 50th percentile.
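
As a quick illustration, percentiles can be computed directly with NumPy; this is a minimal sketch and the data below is made up:

```python
import numpy as np

# 100 hypothetical observations (made-up data for illustration)
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)

p25 = np.percentile(data, 25)   # 25th percentile
p50 = np.percentile(data, 50)   # median = 50th percentile
p99 = np.percentile(data, 99)   # p99, often quoted for latency metrics
print(p25, p50, p99)
```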


 



Suppose we sample n numbers from the normal distribution N(\mu,\sigma^2) and call them X_1, X_2, \ldots, X_n. If we compute the mean of these numbers and call it \bar{X}, then \bar{X} follows the normal distribution N(\mu,\frac{\sigma^2}{n}), where n is the number of entities in a single sample set.
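
A minimal simulation sketch of this fact, assuming (for illustration) \mu = 0, \sigma = 1, and n = 100:

```python
import numpy as np

mu, sigma, n = 0.0, 1.0, 100       # assumed values for illustration
reps = 10_000                      # number of sample sets

rng = np.random.default_rng(1)
samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)        # one sample mean per sample set

print(xbar.mean())                 # close to mu = 0
print(xbar.std())                  # close to sigma / sqrt(n) = 0.1
```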

Note that now we are talking about any distribution, not only the normal distribution, as long as it has a finite mean and finite variance. In statistics, n is usually considered sufficiently large once it is greater than 30.

The central limit theorem links essentially any distribution to a normal distribution. That is why the normal distribution is so important.

For example, with \sigma = 1 and n = 100, the standard deviation of \bar{X} is \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{10} = 0.1.

--

Remember that the central limit theorem does not depend on what underlying distribution we are drawing from, so we can also do the following:
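
Below is a quick sketch drawing from an exponential distribution; the choice is an assumption for illustration, and any distribution with finite mean and variance behaves the same way:

```python
import numpy as np

n, reps = 50, 10_000
rng = np.random.default_rng(2)

# Exponential with scale 1: mean = 1, variance = 1 -- clearly not normal
samples = rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean())                 # close to 1
print(xbar.std())                  # close to 1 / sqrt(50) ~= 0.141
# A histogram of xbar looks approximately normal even though the
# underlying distribution is strongly skewed.
```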

--


Now that we know the central limit theorem, what can we do with it?

A point estimate is a good-enough single-number guess of a population parameter, computed from the given data.

From the above we can see that the smaller (1-\alpha) is, the smaller the z value will be, and hence the narrower the interval. This shows the trade-off: if we want to shrink the confidence interval, our confidence level will correspondingly drop.
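
As a sketch of the computation, using the normal approximation \bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n} with made-up numbers:

```python
import numpy as np
from scipy import stats

xbar, sigma, n = 0.03, 1.0, 100    # made-up sample mean, known sigma, sample size
alpha = 0.05

z = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for a 95% interval
half_width = z * sigma / np.sqrt(n)
print(xbar - half_width, xbar + half_width)

# A lower confidence level (larger alpha) gives a smaller z and a
# narrower interval, which is exactly the trade-off described above.
```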

Now we draw a sample of 100 and repeat this 100 times. From each sample we compute a 95% confidence interval. We know the true population mean is \mu = 0 in this case, and what we want to see is that, out of the 100 confidence intervals we create, about 95% of them cover the population mean of 0.

We can see that most intervals indeed cover 0, while some do not. If we sampled just once, the interval we got could be any of them; but when we repeat the experiment enough times, the tendency appears.
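
A minimal simulation sketch of this experiment, assuming \mu = 0, \sigma = 1, and n = 100 as before:

```python
import numpy as np
from scipy import stats

mu, sigma, n, reps = 0.0, 1.0, 100, 100
z = stats.norm.ppf(0.975)              # 95% confidence level
rng = np.random.default_rng(3)

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    half = z * sigma / np.sqrt(n)
    if xbar - half <= mu <= xbar + half:
        covered += 1

print(covered)                         # typically around 95 of the 100 intervals
```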

This phenomenon is exactly governed by the central limit theorem, because the point estimator \hat{\theta} here is just \bar{X}. Its distribution follows a normal distribution, and \bar{X} is the midpoint of each confidence interval.


Usually we want to reject the null hypothesis. The world is stochastic, so every time we make a very deterministic conclusion, we run the risk of making errors.

 

It means we want a very small \alpha.

The null distribution under H_0 is that, given the probability of heads is 0.5, the number of heads in 10 tosses follows a binomial distribution. For example, x = 0 means: if getting a head counts as a success, what is the probability of getting 0 successes (0 heads) in 10 tosses? Here it is about 0.001, which means we can still observe 0 heads in ten trials, though with a very small probability.

Here we need to make a somewhat subjective decision about when to reject the null hypothesis. For example, we know there is only a small probability of observing 0/1/2 or 8/9/10 heads in 10 tosses; it is possible but unlikely. So if we observe one of these cases, we will say the null hypothesis cannot be true. This is the rejection region we have to define. By doing so we run the risk of making a type 1 error, so we need a significance level before we draw conclusions: it means we are fully aware that we will be wrong about \alpha fraction of the time.

In this case, the type 1 error rate is just the sum of the probabilities over the rejection region, which is \alpha \approx 0.11.
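
A short sketch of these numbers using the Binomial(10, 0.5) null distribution:

```python
from scipy import stats

null = stats.binom(n=10, p=0.5)

print(null.pmf(0))                       # ~0.001: probability of 0 heads in 10 tosses
rejection_region = [0, 1, 2, 8, 9, 10]
alpha = sum(null.pmf(k) for k in rejection_region)
print(alpha)                             # ~0.11, the type 1 error rate of this rule
```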


Let's look at only the blue curve first.

We usually pre-define \alpha = 0.05 to control the type 1 error. That determines a vertical black line such that the area under the blue curve to the left of the line is 0.95. The critical value is then where the black line intersects the x-axis.

The area under the blue curve to the right of the black line is the rejection region, chosen so that the probability of landing in this region is 0.05 when the null hypothesis is true. (If we make an observation that lands in this region, we say H_0 is not true; if H_0 is actually true, this happens with probability 0.05.) In other words, if we set the critical value so that the rejection region is anything larger than 107.8, then whenever we reject the null hypothesis we have \alpha, the type 1 error rate, equal to 0.05 for that conclusion.
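
The exact parameters behind the figure are not given here; assuming a population standard deviation of \sigma = 15 and a sample size of n = 10 (which reproduces a critical value near 107.8), a minimal sketch of the calculation is:

```python
import numpy as np
from scipy import stats

mu0, sigma, n, alpha = 100, 15, 10, 0.05   # assumed parameters for illustration
se = sigma / np.sqrt(n)                     # standard error of the sample mean

critical_value = stats.norm.ppf(1 - alpha, loc=mu0, scale=se)
print(critical_value)                       # ~107.8: reject H0 if the sample mean exceeds this
```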

Suppose we do exactly that: we take a measurement and get a sample whose mean is 105, which is not in the rejection region. This means we cannot reject the null hypothesis -- we don't have strong evidence to say the average IQ of students in this class is larger than 100.

If the sample mean is 110, which is in the rejection region, then we can reject the null hypothesis, and the conclusion is: the average IQ of students in the class is larger than 100, at a significance level of 0.05. (Always remember we start by setting \alpha and then derive the critical value.)

Now we also consider the red curves. They correspond to an alternative hypothesis H_a. The two red curves are different because we use different values of \mu for H_a.

In the right figure, for example, the red curve corresponds to \mu = 110. If the sample mean we get is 105, which is not in the rejection region of H_0, then we cannot reject H_0. If H_a is actually true, we are making a type 2 error, and the type 2 error rate \beta is the area under the red curve to the left of the black line. The power (1-\beta) is then the area under the red curve to the right of the black line.
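
Continuing the same sketch, with the same assumed \sigma = 15 and n = 10 and the alternative mean \mu_a = 110 from the right figure:

```python
import numpy as np
from scipy import stats

mu0, mu_a, sigma, n, alpha = 100, 110, 15, 10, 0.05   # assumed parameters
se = sigma / np.sqrt(n)

critical_value = stats.norm.ppf(1 - alpha, loc=mu0, scale=se)   # ~107.8
beta = stats.norm.cdf(critical_value, loc=mu_a, scale=se)       # type 2 error rate
power = 1 - beta
print(beta, power)                                              # beta ~0.32, power ~0.68
```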

We can see that the less the H_0 and H_a curves (blue and red) overlap, the lower the probability of a type 2 error, which means the larger the power.

What if we have more samples in a sample set? The distribution curve becomes narrower. This makes intuitive sense: the more data we collect for computing the average, the less uncertainty there is, so the curve concentrates on a smaller range. The more data we have when computing the sample mean \bar{X}, the smaller the variance of the distribution of \bar{X}.

As a result, the critical value becomes smaller if we keep \alpha = 0.05. This means we don't need a very extreme sample mean to reject the null hypothesis.

Takeaway: if we have more samples or data points, it becomes much easier to draw conclusions, because we can detect smaller and smaller differences. Making the null distribution curve (blue) and the alternative distribution curve (red) narrower and narrower makes them easier to distinguish.
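
A quick sketch of this takeaway, again under the assumed \sigma = 15 and \mu_a = 110, varying only n:

```python
import numpy as np
from scipy import stats

mu0, mu_a, sigma, alpha = 100, 110, 15, 0.05   # assumed parameters

for n in [10, 30, 100]:
    se = sigma / np.sqrt(n)
    crit = stats.norm.ppf(1 - alpha, loc=mu0, scale=se)
    power = 1 - stats.norm.cdf(crit, loc=mu_a, scale=se)
    print(n, round(crit, 1), round(power, 2))
# As n grows, the critical value moves toward 100 and the power rises.
```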


Let's introduce a different way to draw conclusions for a hypothesis test:

The observation we made here is 7 heads out of 10 tosses. So data at least as extreme as this observation means 7, 8, 9, or 10 heads out of 10 tosses. We calculate their probabilities and sum them up.
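
In code, this p-value is the upper tail of Binomial(10, 0.5) starting at 7 (a one-sided sketch):

```python
from scipy import stats

null = stats.binom(n=10, p=0.5)
p_value = sum(null.pmf(k) for k in [7, 8, 9, 10])   # P(X >= 7)
# equivalently: null.sf(6)
print(p_value)                                      # ~0.17, larger than alpha = 0.05
```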

In this case, if we set the significance level \alpha = 0.05, the p-value is greater than it, so we do not reject the null hypothesis. The p-value essentially tells us the type 1 error rate we would incur if we rejected H_0 based on an observation this extreme.

The p-value is the probability of observing outcomes at least as extreme as the observed data, assuming the null hypothesis holds. If the p-value is less than the pre-defined \alpha, we reject the null hypothesis: we are willing to reject it, knowing fully that we will make a type 1 error with a probability of (at most) \alpha.

Remember that \alpha is something we specify before the experiment, while the p-value is something we can only obtain after the experiment.

Let's go back to the fair-coin case, with H_0 being that the coin is fair and H_a that it is not fair. We specify \alpha = 0.05 before flipping any coin. After making N tosses and observing X heads, we calculate the p-value: say it is 0.02. Since the p-value is less than the pre-specified \alpha of 0.05, we can say: "with a significance level (i.e., type 1 error rate) of 0.05, we reject H_0 (and say this coin is not fair)". If the pre-specified \alpha were 0.01, the calculated p-value would be larger than 0.01, and we would say: "with a significance level of 0.01, we cannot reject H_0".


The p-value here is the area under the blue curve to the right of a vertical black line at x = 110 on the x-axis. It is smaller than the pre-defined \alpha = 0.05, so in this case we reject the null hypothesis.
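
Under the same assumed parameters as in the earlier sketches (\sigma = 15, n = 10, H_0: \mu = 100), this area can be computed as:

```python
import numpy as np
from scipy import stats

mu0, sigma, n = 100, 15, 10        # assumed parameters, as in the earlier sketches
xbar = 110                         # observed sample mean
se = sigma / np.sqrt(n)

p_value = stats.norm.sf(xbar, loc=mu0, scale=se)   # area to the right of 110
print(p_value)                                     # ~0.017 < 0.05, so reject H0
```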

Sometimes we also consider the two-sided case:

Before we run an experiment, we always want to know how many samples we need in each group. 

Note the difference between the two peaks of the red and blue curves, and remember the meaning of m and n. This difference is important because we want it to be large when setting the sample sizes (m, n). When the two peaks are far apart, the two curves overlap less and are easier to distinguish, so we can draw conclusions more easily.

* We use the sample standard deviations s_1, s_2 as estimators of \sigma_1, \sigma_2.

* Determining the desired difference means determining \Delta_0 (see the sketch below).
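
As a sketch, here is a commonly used sample-size formula for comparing two means with equal group sizes and a two-sided test; all numbers below are made up for illustration:

```python
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80
s1, s2 = 10.0, 12.0        # sample standard deviations, used as estimates of sigma_1, sigma_2
delta0 = 5.0               # desired detectable difference Delta_0 (made-up value)

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)

# n per group for detecting delta0 with the desired power
n_per_group = (z_alpha + z_beta) ** 2 * (s1**2 + s2**2) / delta0**2
print(int(np.ceil(n_per_group)))   # samples needed in each group
```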
