[Statistics] Interview Notes

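Reference textbooks: Probability and Mathematical Statistics (《概率论与数理统计》) by Mao Shisong (茆诗松); 《白话统计学》 (the plain-language statistics book, the one written by a foreign author)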

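  • What does "the frequency stabilizes at the probability" mean?
    The frequency converges in probability to the probability.
    As n grows, the chance that the deviation $\left|\frac{S_n}{n}-p\right|$ between the frequency $\frac{S_n}{n}$ of event A and the probability $p$ exceeds a preassigned precision $\epsilon$ becomes smaller and smaller, and can be made as small as we like; that is, $\lim_{n\to\infty} P\left(\left|\frac{S_n}{n}-p\right| \ge \epsilon\right) = 0$.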
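  • Monte Carlo methods for computing a definite integral
    Random-point (hit-or-miss) method: justified by Bernoulli's law of large numbers;
    Sample-mean method: justified by Khinchin's law of large numbers.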
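  • About the standard deviation
    The standard deviation measures the average spread of a distribution.
    Why is the denominator of the sample variance (n-1) rather than n?
    $s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}, \quad s_n^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n}$
    My own view: ① From a statistical standpoint, (n-1) is the degrees of freedom, so dividing by it gives the average squared deviation per degree of freedom. The sample variance with denominator (n-1) is an unbiased estimator of the population variance $\sigma^2$, whereas the version with denominator n is biased. ② In plain words: given a sample, the sum of squared deviations computed around any number other than the sample mean is larger than the one computed around the sample mean; in other words, the sample mean minimizes the sum of squared deviations. It follows that the sum of squared deviations around the population mean is at least as large as the one around the sample mean. The variance computed around the population mean is what we actually want, but the population mean is usually unknown and the sample mean is all we have. Since using the sample mean underestimates the true sum of squared deviations (the one around the population mean), we compensate by dividing by a smaller number, (n-1) instead of n.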
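  • Skew: most values of the distribution are concentrated at one end, with relatively few values at the other end forming a tail. Outliers have a large influence on the mean and usually occur in the tail, so the mean of a skewed distribution is pulled toward the tail. The median is far less affected by outliers. As a result, the mean is typically larger than the median in a positively skewed distribution and smaller than the median in a negatively skewed one. In a symmetric distribution the mean equals the median (the converse does not hold in general: an equal mean and median do not guarantee symmetry).
    Positive skew: most values are concentrated at the low end, and a few values at the high end form the tail.
    Negative skew: most values are concentrated at the high end, and a few values at the low end form the tail.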
  • Outliers: the smaller the sample, the more visible an outlier is and the greater its influence on the mean.
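The (n-1) versus n argument in the standard deviation bullet can be checked with a small simulation. This is only a sketch under assumptions of my own: normal data with $\sigma = 2$, samples of size 5, and helper names (`variance`, `average_estimates`) invented for the illustration.

```python
import random

def variance(xs, ddof):
    """Sum of squared deviations around the sample mean, divided by (n - ddof)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)

def average_estimates(n=5, sigma=2.0, trials=20000, seed=0):
    """Average both estimators over many N(0, sigma^2) samples of size n."""
    rng = random.Random(seed)
    sum_n_minus_1 = 0.0
    sum_n = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        sum_n_minus_1 += variance(xs, ddof=1)  # s^2, denominator n - 1
        sum_n += variance(xs, ddof=0)          # s_n^2, denominator n
    return sum_n_minus_1 / trials, sum_n / trials

avg_s2, avg_sn2 = average_estimates()
print("true variance    :", 2.0 ** 2)
print("average of s^2   :", round(avg_s2, 3))
print("average of s_n^2 :", round(avg_sn2, 3))
```

The average of $s^2$ should sit close to the true variance 4, while the average of $s_n^2$ should sit close to $\frac{n-1}{n}\sigma^2 = 3.2$, which is the underestimate described above.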

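Law of large numbers
The law of large numbers asks under what conditions the arithmetic mean of a sequence of random variables converges in probability to the arithmetic mean of their expectations.
The theorem section of the textbook covers Bernoulli's, Chebyshev's, Markov's and Khinchin's laws of large numbers.
These laws differ in their conditions but share the same conclusion: the arithmetic mean of the random sequence converges in probability to the arithmetic mean of its expectations.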

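  • Bernoulli's law of large numbers: think of Bernoulli trials. The condition is a sequence of independent, identically distributed trials with the same success probability p.
  • Chebyshev's law of large numbers: think of Chebyshev's inequality. From the two sides of the inequality one can recall the condition: the random variables are pairwise uncorrelated, and their variances exist and share a common upper bound. The proof of this law also uses Chebyshev's inequality.
  • Note: Chebyshev's law of large numbers has weaker conditions than Bernoulli's law of large numbers: it only requires the variables to be uncorrelated and does not require them to be identically distributed.
  • Markov's law of large numbers: once the Markov condition $\frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right)\to 0$ holds, the conclusion of the law of large numbers follows. Note that the Markov condition implicitly assumes that the variances exist.
  • Khinchin's law of large numbers: its special feature is that it does not require the variance to exist, unlike the three laws above. In exchange it is stronger elsewhere: the random variables must be independent and identically distributed, and then existence of the expectation is enough for the conclusion.
    Note: Khinchin's law of large numbers provides a way to approximate a mathematical expectation.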
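The two Monte Carlo integration methods mentioned near the top of these notes can be sketched as follows. The integrand $f(x) = x^2$ on $[0, 1]$ (true value 1/3), the sample sizes and the function names are my own choices for illustration; the random-point method as written assumes $0 \le f(x) \le 1$ on the interval.

```python
import random

def f(x):
    """Integrand, chosen so that 0 <= f(x) <= 1 on [0, 1]."""
    return x * x

def random_point_method(n, rng):
    """Hit-or-miss: share of uniform points in the unit square lying under the curve.

    Each point is a Bernoulli trial with success probability equal to the
    integral, so Bernoulli's law of large numbers applies.
    """
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if y <= f(x):
            hits += 1
    return hits / n

def sample_mean_method(n, rng):
    """Average of f(U) for U uniform on [0, 1].

    The f(U_i) are i.i.d. with expectation equal to the integral, so
    Khinchin's law of large numbers applies.
    """
    return sum(f(rng.random()) for _ in range(n)) / n

rng = random.Random(1)
est_points = random_point_method(100_000, rng)
est_mean = sample_mean_method(100_000, rng)
print("random-point estimate:", round(est_points, 4))
print("sample-mean estimate :", round(est_mean, 4))
print("exact value          :", round(1 / 3, 4))
```

Both estimates should land close to 1/3; for this integrand the sample-mean method tends to be a little less noisy at the same n.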

The law of large numbers in English
The law of large numbers tells us that the sample mean approaches the expected value of the population as the sample size grows.


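Central limit theorem
It asks under what conditions the distribution function of a (standardized) sum of independent random variables converges to the normal distribution.
An error produced by adding up a large number of small random factors approximately follows a normal distribution.
The central limit theorem in English
The central limit theorem tells us that it doesn't matter what distribution you start with: if you draw large enough samples from it (and its variance is finite), the sample mean will be approximately normally distributed. It is a basis of statistical inference.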
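A quick simulation makes the statement concrete. This sketch assumes an Exponential(1) population (clearly not normal), samples of size 50 and 5000 repetitions; all of these are arbitrary choices of mine.

```python
import math
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n drawn from Exponential(1)."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(2)
n, reps = 50, 5000
means = sample_means(n, reps, rng)

# Exponential(1) has mean 1 and standard deviation 1, so the CLT suggests
# that the sample mean is roughly N(1, 1/n) when n is large.
se = 1.0 / math.sqrt(n)
inside = sum(1 for m in means if abs(m - 1.0) <= 1.96 * se) / reps

print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means  :", round(statistics.stdev(means), 3), "vs 1/sqrt(n) =", round(se, 3))
print("share within 1.96 se:", round(inside, 3))
```

If the normal approximation is reasonable, the last number should be close to 0.95 even though the underlying population is strongly skewed.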


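T-distribution VS Normal distribution
As the central limit theorem ($n \to \infty$) suggests, the normal distribution is mostly used with large samples. One more point: the sample data are required to be collected naturally (that is what my book says; I am also curious how "collected naturally" is defined, and I may fill in this gap later).
The t-distribution, in contrast, suits small and medium sample sizes, and the data in that case are mostly collected under controlled conditions. How should we understand this?
Think back to how the t-distribution was born: Gosset worked at a brewery studying things like malt for brewing, so the data he collected were very likely produced under controlled conditions.
When the population variance is unknown, the t-distribution can also be used to construct pivotal quantities, and hence for hypothesis tests and so on.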
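One way to see why small samples with unknown variance call for the t-distribution is to compare how often 95% confidence intervals actually cover the true mean. The sketch below assumes normal data with n = 5; the critical value 2.776 is the 97.5% point of the t-distribution with 4 degrees of freedom, taken from a standard t table, and the other numbers are arbitrary.

```python
import math
import random

Z_CRIT = 1.96    # 97.5% point of N(0, 1)
T_CRIT = 2.776   # 97.5% point of t with 4 degrees of freedom (only valid for n = 5)

def mean_and_sd(xs):
    """Sample mean and sample standard deviation (denominator n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return m, s

def coverage(crit, n=5, mu=10.0, sigma=3.0, trials=20000, seed=3):
    """Share of intervals  xbar +/- crit * s / sqrt(n)  that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar, s = mean_and_sd(xs)
        if abs(xbar - mu) <= crit * s / math.sqrt(n):
            hits += 1
    return hits / trials

cov_z = coverage(Z_CRIT)
cov_t = coverage(T_CRIT)
print("coverage with normal critical value:", round(cov_z, 3))
print("coverage with t critical value     :", round(cov_t, 3))
```

With the estimated standard deviation plugged in, the normal critical value gives intervals that are too narrow for n = 5, so their coverage falls noticeably below 95%, while the t-based intervals stay close to it.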
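Statistic VS sufficient statistic VS pivotal quantity: telling the three concepts apart
A statistic is a function of the sample $x_1, x_2, x_3, \ldots, x_n$ that must not itself contain unknown parameters. A statistic is still a random variable, and random variables have distributions; the distribution of a statistic generally does contain the unknown parameters (this is not hard to see: a statistic is a function of the sample, and the distribution of the sample, as random variables, contains the unknown parameters).
A sufficient statistic is a statistic that carries all the information the sample holds about the unknown parameters of the population distribution (a bit long; read it a few times).
A pivotal quantity is a function of both the sample and the unknown parameters whose distribution does not depend on the unknown parameters. For example:
if $X \sim N(\mu, \sigma^2)$, then $Y = \frac{X-\mu}{\sigma} \sim N(0,1)$. The random variable $Y$ is a pivotal quantity: it contains the unknown parameters, but its distribution does not.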
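As a rough numerical check of the pivotal-quantity idea, the sketch below standardizes the sample mean, $(\bar{X}-\mu)/(\sigma/\sqrt{n})$, for two very different parameter settings; with n = 1 it reduces to the $Y$ in the example. The parameter values, seeds and function name are made up for the illustration.

```python
import math
import random
import statistics

def pivot_draws(mu, sigma, n=10, reps=10000, seed=4):
    """Draws of (xbar - mu) / (sigma / sqrt(n)) for samples from N(mu, sigma^2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        draws.append((xbar - mu) / (sigma / math.sqrt(n)))
    return draws

# Two unrelated parameter settings, each with its own seed.
for mu, sigma, seed in [(0.0, 1.0, 4), (50.0, 12.0, 5)]:
    draws = pivot_draws(mu, sigma, seed=seed)
    print(f"mu={mu}, sigma={sigma}: "
          f"mean={statistics.fmean(draws):.3f}, sd={statistics.stdev(draws):.3f}")
```

In both cases the draws should have mean near 0 and standard deviation near 1, which is what "the distribution does not depend on the unknown parameters" looks like in practice.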
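Null hypothesis VS alternative hypothesis
The null and alternative hypotheses are not on an equal footing. We take the side of the null hypothesis, which is usually a conclusion built up from long experience. Rejecting the null hypothesis when it is actually true (a Type I error) is a serious matter, so the null hypothesis is not rejected lightly, only when the data give enough evidence against it. Conversely, failing to reject the null hypothesis does not mean it is true; there is simply not enough evidence to show it is wrong.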
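The asymmetry between the two hypotheses can be illustrated with a simulated two-sided z-test of $H_0: \mu = 0$ with known $\sigma$. This is a sketch with settings of my own (n = 30, $\alpha$ = 0.05, a small true effect of 0.2), not a procedure from the textbook.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96

def rejection_rate(true_mu, n=30, sigma=1.0, trials=10000, seed=6):
    """Share of two-sided z-tests of H0: mu = 0 (sigma known) that reject H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        z = xbar / (sigma / math.sqrt(n))
        if abs(z) > Z_CRIT:
            rejections += 1
    return rejections / trials

type_1_rate = rejection_rate(true_mu=0.0)  # H0 is true
power = rejection_rate(true_mu=0.2)        # H0 is false, but only slightly
print("rejection rate when H0 is true :", round(type_1_rate, 3))
print("rejection rate when H0 is false:", round(power, 3))
```

When $H_0$ is true the test rejects only about 5% of the time, so a true null is rarely thrown out. When $H_0$ is slightly false, the test still fails to reject in most runs, which is why "not rejected" should not be read as "proved true".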

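Why do we compare two population means?
Statistics in Europe and America grew largely out of the need to improve production in agriculture and similar fields. Showing that a new method really is better than the old one (in statistical terms, that "there is a significant difference") requires comparison, for example comparing the means of two samples, or analysis of variance (comparing the means of several populations).
For example, how do we decide whether a new feeding method fattens pigs more effectively? We set up a control group (pigs fed the old way) and a treatment group (pigs fed the new way), holding other variables fixed (feeding time, identical initial condition of the pigs [I know this is almost impossible to achieve], and so on). Then we collect data and compare the two sample means; if there is a significant difference, we conclude that the new feeding method is more effective, which in turn improves agricultural production.
Recall that in the book The Lady Tasting Tea (《女士品茶》), Fisher started out doing agricultural research, and most of his papers were published in agricultural and biological journals (though Karl Pearson may also have had something to do with that).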
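Independent samples t-test VS dependent (paired) samples t-test
In common: both test whether means are equal.
Differences: ① The independent samples t-test compares two samples on a single dependent variable, whereas the dependent samples t-test compares the mean values of a single sample on two variables (for example, to compare workers' productivity before and after a holiday we test 30 workers; the 30 workers are one sample, and their productivity before the holiday and after the holiday are the two variables). ② The degrees of freedom differ: for the independent samples t-test (with pooled variance) they are the sum of the two sample sizes minus 2, and for the dependent samples t-test they are the number of pairs minus 1.
The dependent samples t-test removes the differences between experimental units, so the difference attributable to the independent variable shows up more clearly.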
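The two test statistics and their degrees of freedom can be written out directly. The productivity numbers below are invented for illustration (six workers rather than 30), and the independent version assumes equal variances (pooled estimate).

```python
import math
from statistics import fmean, stdev, variance

def independent_t(a, b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b); df = n1 + n2 - 2."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (fmean(a) - fmean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def paired_t(first, second):
    """Paired t statistic based on the differences second - first; df = pairs - 1."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical productivity of the same six workers before and after a holiday.
before = [20.0, 22.0, 19.0, 24.0, 25.0, 21.0]
after = [22.0, 23.0, 21.0, 25.0, 27.0, 22.0]

t_ind, df_ind = independent_t(after, before)  # wrongly treats the groups as unrelated
t_pair, df_pair = paired_t(before, after)     # uses the pairing

print(f"independent: t = {t_ind:.3f}, df = {df_ind}")
print(f"paired     : t = {t_pair:.3f}, df = {df_pair}")
```

On these made-up numbers the paired statistic (about 6.71 with 5 degrees of freedom) is far larger than the independent one (about 1.14 with 10 degrees of freedom), because pairing removes the worker-to-worker variation from the comparison.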

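How should we understand a "statistically significant difference"?
A difference remains even after sampling randomness is accounted for (that is, a systematic difference).
For example, in an independent samples t-test we want to know whether two population means are equal. The null hypothesis is that the two population means are equal ($\mu_1 = \mu_2$). Rejecting it means there is a statistically significant difference: the gap between the two samples is large enough to indicate that the populations they come from really differ (a systematic difference).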

Some technical English

  • Central limit theorem
    The central limit theorem is the basis for a lot of statistics.
    It turns out that it doesn't matter what distribution you start with: if you collect samples from that distribution, the sample mean will be approximately normally distributed.
    The sample mean will be approximately normally distributed for large sample sizes, regardless of the distribution we are sampling from.
    The distribution of the sample mean tends towards the normal distribution as the sample size increases.
    We can use the distribution of the mean to construct confidence intervals.
    We can run a test when we ask whether there is a difference between the means of two samples.
  • Why is the central limit theorem important?
    We can often use well-developed statistical inference procedures that are based on the normal distribution, even if we are sampling from a population that is not normal, provided we have a large sample size.
  • Law of large numbers
    The law of large numbers just says that the sample mean will approach the expected value of the random variable.
    As you take more and more observations, the average of the sample is going to approximate the true average.
    The mean of your sample is going to converge to the true mean of the population.
  • Introducing statistics
    Statistics is everywhere. The purpose of statistics is to find the truth, but it can also be used to lie.
    Statistics can help us make decisions in uncertain situations.
    Statistics is the study and practice of collecting and analyzing data and figuring out how to use that information.
    Statistics is becoming more and more critical as academia, business and government come to rely on data-driven decisions, greatly expanding the demand for statisticians.
  • t-test
    The dependent t-test is also known as the paired samples t-test, and it is appropriate when the same subjects are being compared or when two samples are matched at the level of individual subjects.
  • Big data
    Big data is a term for data sets that are so large and complex that traditional data-processing tools cannot handle them well.
    Big data is characterized by the "four Vs": volume, variety, velocity and veracity. That is, big data comes in large amounts; it is also a mixture of structured and unstructured information and arrives at real-time speed. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or other methods that extract value from data.
  • do a dependent t-test
    do a thorough analysis
    do interval estimates
    The sample means are just point estimates.
    The standard error of the mean is the standard deviation divided by the square root of N.
  • Experimental design
    Experimental research requires random selection and random assignment. Random selection means that the individuals included in a sample should be randomly selected from the population.
    If we want to generalize to that population, then we need to get a random sample from that population.
    An experiment also depends on random assignment to conditions.
  • Linear regression
    We can put the regression line on the scatter plot. I mean that the regression line goes through the points so that it minimizes the overall distance between the line and the dots.
    The slope of this line is called the regression coefficient.
    We can estimate the coefficients in a regression equation, for both simple regression and multiple regression. The values of the coefficients are estimated such that the regression model yields optimal predictions.
    This is known in statistics as ordinary least squares estimation. The idea is to minimize the sum of squared residuals.
  • Stem-and-leaf plot
    From a distance, a stem-and-leaf plot looks a lot like a histogram.
    The stems are listed vertically and the leaves extend horizontally. Leaves appear in numerical order; I mean that leaves with smaller digits are closer to the stem. It gives us information about the data and their frequencies by stacking values on top of each other.
    From a stem-and-leaf plot, we can see what the individual values are and how they are spread out, as with the bars of a histogram. I mean that a stem-and-leaf plot uses values from the raw data.
  • Histogram
    data visualization
    make all sorts of beautiful graphs
    Histograms use the height of the bars to show how frequently data values occur.
  • Box plot
    Box plots use some of our measures of central tendency and spread to display our data.
    A box plot has two major parts: the box and the whiskers.
    The box is a rectangle that stretches across the interquartile range of our data.
    At the median, there is a line splitting the rectangle into two parts.
    If one of these parts is longer than the other, the data in that quartile are more spread out.
    Box plots can help us look for potential outliers and compare data from two samples.
    It can be tempting to think of outliers as wrong data, but that is not always the case.
    More specifically, if some points are outside the fences of a box plot, we call them outliers.
  • Scatter plot
    Sometimes, to study the relationship between two continuous variables, we need to visualize our data using a scatter plot. The scatter plot is said to be one of the most generally useful inventions in the history of statistical graphics.
    We put one variable on the x-axis and the other variable on the y-axis, and we look at the relationship.
    A scatter plot allows us to see the shape and spread of the data.
    The data may be clustered, and scatter plots are useful for identifying all kinds of relationships, both linear and nonlinear.
    The linear relationship is a classic example, and lines are a great way to describe a relationship because they have a nice formula, which allows us to predict one variable based on the value of another.
    Correlation measures the way two variables move together, both the direction and the closeness of the movement.
  • ANOVA (Analysis of Variance)
    ANOVA is one of the most common statistical procedures that we encounter.
    We can then do multiple pairwise comparisons. ANOVA is appropriate when the predictors are all categorical and the dependent variable is continuous.
    Its most common application is analyzing data from randomized controlled experiments.
    More specifically, randomized controlled experiments that generate more than two groups.
    If we have only two group means, we can use a dependent t-test or an independent t-test.
    If they are two independent samples, then that's an independent t-test.
  • F-distribution
    The family of F-distributions depends on the number of subjects per group and the number of groups.
    We can generate it in R. We can borrow some code off the Internet to generate some sampling distributions in R.
    It's not symmetrical because negative values are impossible.
    It's a ratio of variances.
    We can describe it as the between-groups variance relative to the within-group variance.
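The "between-groups variance relative to within-group variance" phrase can be made concrete with a one-way ANOVA F statistic. The notes mention R; the sketch below uses Python instead, with three tiny made-up groups, and computes only the statistic and its degrees of freedom (no p-value).

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F = (between-groups mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - fmean(g)) ** 2 for x in g) for g in groups)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical data: three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

For this toy data the between-groups mean square is 13 and the within-group mean square is 1, so F is 13 on (2, 6) degrees of freedom. The two degrees of freedom are exactly the "number of groups" and "subjects per group" dependence mentioned in the F-distribution bullet.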

Self-introduction in English

Good morning, professors. I'm balabala, from China University of Mining and Technology.
It is really an honor to have this opportunity for an interview. Let me introduce myself briefly.
I major in statistics in the School of Mathematics.
During the past three years, I've spent most of my time studying, and I've acquired the basic knowledge of my major, such as mathematical analysis, advanced algebra, probability and mathematical statistics.
I believe people in a school of mathematics are diligent and careful, so I hope I can continue studying in a school of mathematics.
I've learned a little about my intended field, applied statistics.
It has a clear practical background and is used to solve practical problems.
What appeals to me about statistics is that almost all professions require statistical knowledge, such as engineering and technology, economics and finance, and the social sciences and humanities.
Finally, I want to say why I chose this university. The reason is that it has a high level of education, a good reputation and a firm foundation in postgraduate training and scientific research.
That's all for my introduction. Thank you for listening.
Academic record: My academic performance ranked second, and I won prizes such as the National Scholarship and the National Encouragement Scholarship, and I've passed CET Band 6. In addition, I've finished a provincial project on the incomplete decomposition of large-scale sparse matrices. My work included reading papers and running some numerical experiments.
