Quantile-Quantile Plot

Latest recommended article published 2024-04-29 19:16:15

shikai1030


Purpose: Check If Two Data Sets Can Be Fit With the Same Distribution

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By a quantile, we mean the point below which a given fraction (or percent) of points lies. That is, the 0.3 (or 30%) quantile is the point below which 30% of the data fall and above which 70% fall.

A 45-degree reference line is also plotted. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions.

The advantages of the q-q plot are:

1. The sample sizes do not need to be equal.
2. Many distributional aspects can be simultaneously tested. For example, shifts in location, shifts in scale, changes in symmetry, and the presence of outliers can all be detected from this plot. For example, if the two data sets come from populations whose distributions differ only by a shift in location, the points should lie along a straight line that is displaced either up or down from the 45-degree reference line.

The q-q plot is similar to a probability plot. For a probability plot, the quantiles for one of the data samples are replaced with the quantiles of a theoretical distribution.
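As a small illustration of the quantile definition in this section, the R sketch below computes the 0.3 quantile of a sample and checks what share of the points falls below it. The sample is synthetic and the seed, mean, and standard deviation are arbitrary assumptions made only for this example; they are not handbook data.

```r
# Synthetic sample, for illustration only (not data from the handbook)
set.seed(1)
x <- rnorm(1000, mean = 600, sd = 50)

# 0.3 quantile: the value below which about 30% of the points lie
q30 <- quantile(x, probs = 0.3, names = FALSE)

# Fraction of the sample strictly below that value (should be close to 0.3)
frac_below <- mean(x < q30)
```

R's quantile() supports several estimation rules through its type argument, so the exact value of q30 can differ slightly between rules, while the share of points below it stays near 30%.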
Sample Plot

This q-q plot shows that:

1. These 2 batches do not appear to have come from populations with a common distribution.
2. The batch 1 values are significantly higher than the corresponding batch 2 values.
3. The differences are increasing from values 525 to 625. Then the values for the 2 batches get closer again.
Definition: Quantiles for Data Set 1 Versus Quantiles of Data Set 2

The q-q plot is formed by:

Vertical axis: Estimated quantiles from data set 1
Horizontal axis: Estimated quantiles from data set 2

Both axes are in units of their respective data sets. That is, the actual quantile level is not plotted. For a given point on the q-q plot, we know that the quantile level is the same for both points, but not what that quantile level actually is.

If the data sets have the same size, the q-q plot is essentially a plot of sorted data set 1 against sorted data set 2. If the data sets are not of equal size, the quantiles are usually picked to correspond to the sorted values from the smaller data set and then the quantiles for the larger data set are interpolated.
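The construction described in this definition can be sketched in R as follows. The helper name qq_pairs and its column names are hypothetical, and linear interpolation over the rank index is only one possible way to interpolate the larger sample; other quantile rules would give slightly different points.

```r
# Hypothetical helper: returns the paired quantiles of a two-sample q-q plot.
qq_pairs <- function(set1, set2) {
  s1 <- sort(set1)
  s2 <- sort(set2)
  n1 <- length(s1)
  n2 <- length(s2)
  # If the sizes differ, interpolate the larger sorted sample down to the
  # size of the smaller one (linear interpolation over the rank index).
  if (n1 > n2) s1 <- approx(seq_len(n1), s1, n = n2)$y
  if (n2 > n1) s2 <- approx(seq_len(n2), s2, n = n1)$y
  # Vertical axis: data set 1; horizontal axis: data set 2
  data.frame(horizontal = s2, vertical = s1)
}

# Equal sizes: simply sorted set 1 against sorted set 2
pairs_equal <- qq_pairs(c(3, 1, 2), c(30, 10, 20))

# Unequal sizes: five values are interpolated down to three
pairs_unequal <- qq_pairs(c(1, 2, 3, 4, 5), c(10, 20, 30))
```

Plotting pairs_unequal$horizontal against pairs_unequal$vertical and adding abline(0, 1) would give the q-q plot with its 45-degree reference line.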
Questions

The q-q plot is used to answer the following questions:

1. Do two data sets come from populations with a common distribution?
2. Do two data sets have common location and scale?
3. Do two data sets have similar distributional shapes?
4. Do two data sets have similar tail behavior?
Importance: Check for Common Distribution

When there are two data samples, it is often desirable to know if the assumption of a common distribution is justified. If so, then location and scale estimators can pool both data sets to obtain estimates of the common location and scale. If two samples do differ, it is also useful to gain some understanding of the differences. The q-q plot can provide more insight into the nature of the difference than analytical methods such as the chi-square and Kolmogorov-Smirnov 2-sample tests.
Related Techniques

Bihistogram
T Test
F Test
2-Sample Chi-Square Test
2-Sample Kolmogorov-Smirnov Test
Case Study

The quantile-quantile plot is demonstrated in the ceramic strength data case study.
Software

Q-Q plots are available in some general purpose statistical software programs, including Dataplot. If the two samples contain the same number of data points, it should be relatively easy to write a macro in statistical programs that do not support the q-q plot. If the numbers of points are not equal, writing a macro for a q-q plot may be difficult.
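In R, for instance, the base function qqplot() in the stats package handles two samples of unequal size directly. The sketch below uses two synthetic batches whose sizes, means, and standard deviations are arbitrary assumptions for illustration; show_qq is a hypothetical helper name.

```r
# Synthetic batches, for illustration only (not the handbook's data)
set.seed(42)
batch1 <- rnorm(200, mean = 620, sd = 40)
batch2 <- rnorm(150, mean = 600, sd = 40)

# qqplot(x, y) puts x on the horizontal axis and y on the vertical axis.
# plot.it = FALSE returns the paired quantiles without drawing anything.
qq <- qqplot(batch2, batch1, plot.it = FALSE)

# Hypothetical helper: draw the pairs with a dashed 45-degree reference line
show_qq <- function(qq) {
  plot(qq$x, qq$y, xlab = "Batch 2 quantiles", ylab = "Batch 1 quantiles")
  abline(0, 1, lty = 2)
}
```

Calling show_qq(qq) draws the plot; because the samples differ in size, the returned vectors have the length of the smaller sample.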

http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm

Q-Q plot is short for Quantile-Quantile Plot. It is used in many kinds of research, mainly to show visually how observed values differ from expected values.

It is easy to produce in SPSS: Analyze - Descriptive Statistics - Q-Q Plots.

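The Q-Q plot is mainly used to assess the difference between the observed and expected values of a quantitative trait. The quantitative trait data we collect are usually assumed to be approximately normally distributed. In GWAS studies, the X and Y axes of the Q-Q plot represent the -log10 P values of the individual SNPs, with expected values on the X axis and observed values on the Y axis. The expected line is a dashed 45° line starting from the origin, and the observed values are drawn as solid points.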

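Key points of the Q-Q plot:

1. Why is the expected dashed line at 45°? The expected line is drawn in the first quadrant of the QQ plot. In theory, a point A should sit where its expected value equals its observed value, that is, at coordinates A(x, y) with x = y. The expected line is therefore a 45° line starting from the origin.

2. How are the coordinates of the observed points obtained? Again let point A have coordinates (x, y), where x is the expected value and y is the observed value. A QQ plot computation in R looks like this: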

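# Read the per-SNP P values; the file must contain a PVAL column
pvals <- read.table("DGI_chr3_pvals.txt", header = TRUE)

# Observed: sort the P values in ascending order, then take -log10 (Y axis)
observed <- sort(pvals$PVAL)
lobs <- -log10(observed)

# Expected: map the ranks 1..n to i / (n + 1), then take -log10 (X axis)
expected <- seq_along(observed)
lexp <- -log10(expected / (length(expected) + 1))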

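In detail: first sort the P values from smallest to largest. lobs holds the Y coordinates and lexp holds the X coordinates. The Y coordinate is -log10 of the observed P value, while the X coordinate depends only on the rank of the P value and on how many P values there are. For example, with just three P values, P1=0.0001, P2=0.001, and P3=0.01, length(observed)=3. For P1=0.0001, expected=1 and lexp=-log10(1/(3+1)); for P2=0.001, expected=2 and lexp=-log10(2/(3+1)); for P3=0.01, expected=3 and lexp=-log10(3/(3+1)); and so on.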
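The three-value example can be checked numerically with a short R sketch. The variable n is introduced here only for readability; the rest follows the same formulas as the code above.

```r
# The three P values from the example, already in ascending order
observed <- c(0.0001, 0.001, 0.01)
n <- length(observed)

lobs <- -log10(observed)              # Y coordinates: 4, 3, 2
lexp <- -log10(seq_len(n) / (n + 1))  # X coordinates: -log10 of 1/4, 2/4, 3/4

round(lexp, 3)  # 0.602 0.301 0.125
```

The smallest P value therefore gets the largest coordinate on both axes, so the most significant SNPs appear at the upper right of the plot.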

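If points deviate from the line, the observed values differ from the expected values. In GWAS studies, when a SNP shows a large deviation, that deviation is interpreted as being caused by the genetic effect of the variant at that SNP locus.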
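To look for such deviations, the coordinates can be plotted against the dashed 45° line. The following is a minimal sketch, not the original author's code: gwas_qq is a hypothetical helper, dropping missing P values is an added assumption, and the uniform P values are synthetic stand-ins for a study with no true associations.

```r
# Hypothetical helper (not from the original post): builds the Q-Q
# coordinates from a vector of P values and optionally draws the plot.
gwas_qq <- function(p, draw = TRUE) {
  p <- sort(p[!is.na(p)])  # assumption: missing P values are dropped
  n <- length(p)
  coords <- data.frame(
    lexp = -log10(seq_len(n) / (n + 1)),  # expected, X axis
    lobs = -log10(p)                      # observed, Y axis
  )
  if (draw) {
    plot(coords$lexp, coords$lobs, pch = 19,  # solid points
         xlab = "Expected -log10(P)", ylab = "Observed -log10(P)")
    abline(0, 1, lty = 2)  # dashed 45-degree line from the origin
  }
  invisible(coords)
}

# Synthetic P values drawn from a uniform distribution, for illustration only
set.seed(7)
null_coords <- gwas_qq(runif(1000), draw = FALSE)
```

With draw = TRUE, points from uniform P values should stay close to the dashed line, while SNPs with unusually small P values would rise above it at the upper right.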

http://blog.csdn.net/likelet/article/details/7377664