GWAS 教程之QC

秋秋秋今天吃肉肉了吗

已于 2022-05-14 19:52:34 修改

阅读量1.1k

点赞数 4

文章标签： perl bash

于 2022-05-14 18:58:58 首次发布

本文链接：https://blog.csdn.net/xiaopeiqi0419/article/details/124770155

版权

这篇博客详细介绍了GWAS数据分析的初步步骤，包括从git克隆数据，进行个体和SNP的缺失率检查，性别一致性检验，MAF过滤，HWE检验，异质性检验以及个体相关性检验。每一步骤都提供了具体的plink和R脚本命令，指导用户如何进行数据预处理，以确保后续分析的准确性。

摘要由CSDN通过智能技术生成

数据文件下载，输入下行命令直接克隆即可。

git clone https://github.com/MareesAT/GWA_tutorial.git

就会下载4个Tutorial的文件：

1_QC_GWAS.zip

2_Population_stratification.zip

3_Association_GWAS

4_ PRS.doc

前三个过程必须按顺序跑完，每个步骤需要前一个步骤完成之后的文件。PRS是独立的，不依赖于前三个步骤。

这个数据来源于hapmap3，这里的脚本也可以用于以后你的数据分析过程。但是这个示例数据全都来源于同一种群，而且是二分表型。所以如果你要处理的表型是数量性状，包含多个种群的话，就不太适用。

在follow之前，首先要确保你已经安装了plink或者plink2和R。

QC过程

所有的执行命令都放在 1_Main_script_QC_GWAS.txt中，依次照着步骤来在终端中输入即可。

步骤一检查个体和SNP的缺失情况

plink --bfile HapMap_3_r3_1 --missing

选项和参数说明：

plink : 用plink执行，如果你没有将plink放入你的profile文件，那么执行plink的时候需要提供你安装plink的绝对路径。比如说你安装在/home/software下，就要改为/home/software/plink。

--bfile : 提供plink的后缀名为.bed, .bim, .fam二进制文件。plink会自动读取这三个文件，因此我们不需要加上这个后缀名。只需要给HapMap_3_r3_1

--missing ：计算缺失率。

会生成两个文件：plink.imiss 和 plink.lmiss。plink.imiss记录了每个个体的缺失SNPs的比例：

plink.lmiss记录了有多少个个体缺失了某一SNP，即SNP在个体中的缺失率。

做出直方图查看这两个的缺失率情况，这一步要确保你已经安装了R，我用的是conda创建R的虚拟环境，很方便，每个环境是独立的。不清楚的可以查查资料如何安装虚拟环境。

Rscript --no-save hist_miss.R

移除有较高缺失率的SNP和个体，可以根据你自己的需要设置过滤阈值，比如移除缺失率大于0.2或0.02的：

# Delete SNPs with missingness >0.2.
plink --bfile HapMap_3_r3_1 --geno 0.2 --make-bed --out HapMap_3_r3_2

# Delete individuals with missingness >0.2.
plink --bfile HapMap_3_r3_2 --mind 0.2 --make-bed --out HapMap_3_r3_3

# Delete SNPs with missingness >0.02.
plink --bfile HapMap_3_r3_3 --geno 0.02 --make-bed --out HapMap_3_r3_4

# Delete individuals with missingness >0.02.
plink --bfile HapMap_3_r3_4 --mind 0.02 --make-bed --out HapMap_3_r3_5

参数说明

--make-bed : 输出plink的二进制结果文件

--out 可以定义结果文件前缀名。如HapMap_3_r3_5就会生成HapMap_3_r3_5.bim, HapMap_3_r3_5.bed, HapMap_3_r3_5.fam文件。

步骤二性别分歧检验

在前面的步骤中我们生成了HapMap_3_r3_5.bim, HapMap_3_r3_5.bed, HapMap_3_r3_5.fam。接下来检验样本中性别不一致的情况。

在个体中，F值小于0.2的视为女性，F值大于0.8的视为男性。如果不符合这个条件的话，会被plink认为是“PROBLEM”

plink --bfile HapMap_3_r3_5 --check-sex

生成plink.sexcheck，

Rscript --no-save gender_check.R

结果表明只有一个女性出现性别分歧的情况。

如果有出现性别分歧的个体的话，可以用下列命令进行删除或者是填充（impute）

删除

grep "PROBLEM" plink.sexcheck| awk '{print$1,$2}'> sex_discrepancy.txt

生成含有性别分歧的列表名单sex_discrepancy.txt

$ cat sex_discrepancy.txt
1349 NA10854

用plink移除名单上的个体

plink --bfile HapMap_3_r3_5 --remove sex_discrepancy.txt --make-bed --out HapMap_3_r3_6

impute

这里的impute是基于你的数据集中的基因型信息来填充的

plink --bfile HapMap_3_r3_5 --impute-sex --make-bed --out HapMap_3_r3_6

步骤三 MAF过滤

生成只包含常染色体的bfile，并且删除MAF较低的SNPs

1. 只选常染色上的SNPs

awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt
plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7

2. 看SNP的MAF分布情况，生生成MAF_check.frq文件

plink --bfile HapMap_3_r3_7 --freq --out MAF_check
Rscript --no-save MAF_check.R

删除低MAF的个体，比如MAF<0.05, 最后移除了325318个variants，剩下1073226的variants。

plink --bfile HapMap_3_r3_7 --maf 0.05 --make-bed --out HapMap_3_r3_8

常见的MAF的阈值为0.01或0.05，主要取决于你的样本大小。

步骤四 HWE检验

删除不满足Hardy-Weinberg equilibrium的SNPs。

首先检验下所有SNPs HWE p值的分布情况

plink --bfile HapMap_3_r3_8 --hardy

生成plink.hwe文件，如下所示：

只挑选HWE<0.00001的SNPs，

# Selecting SNPs with HWE p-value below 0.00001, required for one of the two plot generated by the next Rscript, allows to zoom in on strongly deviating SNPs. 
awk '{ if ($9 <0.00001) print $0 }' plink.hwe>plinkzoomhwe.hwe
Rscript --no-save hwe.R

默认情况下，--hwe只过滤controls，所以还得需要以下步骤。control设定一个严格的HWE阈值，

case的阈值不用那么严格。

plink --bfile HapMap_3_r3_8 --hwe 1e-6 --make-bed --out HapMap_hwe_filter_step1
plink --bfile HapMap_hwe_filter_step1 --hwe 1e-10 --hwe-all --make-bed --out HapMap_3_r3_9

步骤五异质性检验

生成样本异质率的分布图，并移除异质率偏离平均值大于3个标准差的个体。

异质性检验是在没有很高相关性的SNPs上进行的。所以要先移除一些具有高LD的区域，并用命令--indep-pairwise prune SNPs。示例中已经提供了一个高度相关的SNPs的列表：inversion.txt，

我们要去除掉列表上的这些SNPs，生成indepSNP.prune.in和indepSNP.prune.out文件

plink --bfile HapMap_3_r3_9 --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP

参数说明：

50 5 0.2 分别表示：the window size, the number of SNPs to shift the window at each step, and the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously.

提取只在indepSNP.prune.in中的SNPs

plink --bfile HapMap_3_r3_9 --extract indepSNP.prune.in --het --out R_check

Rscript --no-save check_heterozygosity_rate.R

查看异质性偏离平均数大于3个标准差的频率

Rscript --no-save heterozygosity_outliers_list.R

上述所以输出命令保存在：fail-het-qc.txt

要让plink兼容这个文件，需要把标题引号去掉，并输出第一列，第二列。

sed 's/"// g' fail-het-qc.txt | awk '{print$1, $2}'> het_fail_ind.txt

移除掉异质性离群的：

plink --bfile HapMap_3_r3_9 --remove het_fail_ind.txt --make-bed --out HapMap_3_r3_10

步骤六个体相关性检验

在这个教程中移除个体之间π帽大于0.2的个体。

检验个体之间π帽大于0.2的相关性，生成pihat_min0.2.genome文件

The following commands will visualize specifically these parent-offspring relations, using the z values.

awk '{ if ($8 >0.9) print $0 }' pihat_min0.2.genome>zoom_pihat.genome

生成相关性的图

Rscript --no-save Relatedness.R

The generated plots show a considerable amount of related individuals (explentation plot; PO = parent-offspring, UN = unrelated individuals) in the Hapmap data, this is expected since the dataset was constructed as such.

为了证明大部分的亲缘关系是由于父母-后代，我们只包括founders(在数据集中没有父母的个人)。

plink --bfile HapMap_3_r3_10 --filter-founders --make-bed --out HapMap_3_r3_1

然后再移除π帽大于0.2的个体

plink --bfile HapMap_3_r3_11 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders

The file 'pihat_min0.2_in_founders.genome' shows that, after exclusion of all non-founders, only 1 individual pair with a pihat greater than 0.2 remains in the HapMap data.
This is likely to be a full sib or DZ twin pair based on the Z values. Noteworthy, they were not given the same family identity (FID) in the HapMap data.

plink --bfile HapMap_3_r3_11 --missing

# Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.
# In our dataset the individual 13291 NA07045 had the lower call rate.

编辑 0.2_low_call_rate_pihat.txt，把上面具有lower call rate的信息输入进去。

然后再删除在'related' pairs with a pihat > 0.2，而且lower call rate的个体。

plink --bfile HapMap_3_r3_11 --remove 0.2_low_call_rate_pihat.txt --make-bed --out HapMap_3_r3_12

至此，大功告成！

参考文献：A tutorial on conducting genome‐wide association studies:Quality control and statistical analysis

GWAS 教程之QC

QC过程

步骤一 检查个体和SNP的缺失情况

步骤二 性别分歧检验

步骤三 MAF过滤

步骤四 HWE检验

步骤五 异质性检验

步骤六 个体相关性检验

步骤一检查个体和SNP的缺失情况

步骤二性别分歧检验

步骤五异质性检验

步骤六个体相关性检验