写这篇博文的背景
接手了一个卵巢WES的项目,在搭建HRD分析流程时,遇到了许多问题,这里做个记录,把所有修改过的最终用docker建立了镜像,方便部署,docker相关的可以看下我之前的文章<<docker部署conda环境下的小程序_小程序 docker-CSDN博客>>。总体安装见sztup/scarHRD,我这里cnv分析时,用的是python版Sequenza-utils — sequenza-utils 2.2.0 documentation,前期分析的大致步骤如下:
# 生成gc文件
$ sequenza-utils gc_wiggle
# 统计等位基因深度和频率
$ sequenza-utils bam2seqz
# 片段化
$ sequenza-utils seqz_binning
用sequenza R包分析时,遇到问题报错了 ,这是我的第一个问题;另外,R sequenza包依赖copynumber包,但是copynumber包没法分析GRCh38版本的参考基因组,这是第二个问题。用python sequenza-utils时,bam2seqz子命令没有传递bed文件的参数,这是我想解决的第三个问题。已经把修改好的代码直接上传到github上了,如下:
- https://github.com/WortJohn/sequenza/tree/WortJohn-patch-1
- https://github.com/WortJohn/copynumber
- https://github.com/WortJohn/sequenza-utils
问题一:sequenza R包运行报错
Solve the following issues:
Collecting GC information Error in (function (x, col_types) : unused argument (parallel = 1)
Calls: main ... preprocess.seqz -> sequenza.extract -> gc.sample.stats -> chunk.apply
Execution halted
The reason:
The "chunk.apply" function from R iotools package with new version does not have "parallel" parameter which has been changed to "CH.PARALLEL"
报错信息如上,主要是iotools R 包版本更新,有些参数名发生了改变,可以直接下载修改过的源码,重新安装
# 安装命令
$ Rscript -e "devtools::install_local('./sequenza-WortJohn-patch-1.zip', force=TRUE, repos=NULL, type='source', dependencies=FALSE)"
问题二:GRCh38版本问题
之前安装的copynumber包不支持GRCh38版本,这里补充了hg38和mm10参考基因组版,安装方法和第一个问题一样
# 部分内容
> ?copynumber::pcf
assembly: a string specifying which genome assembly version should be
applied to determine chromosome arms. Allowed options are
"hg38", "hg19", "hg18", "hg17" and "hg16" (corresponding to
the four latest human genome annotations in the UCSC genome
browser).
问题三:补充参数
这个主要是分析wgs时,想着只计算目标区域的,默认安装的又没有一个参数可以实现这个功能
$ sequenza-utils bam2seqz --help
error: the following arguments are required: -n/--normal, -t/--tumor, -gc
usage: sequenza-utils bam2seqz [-p] -n NORMAL -t TUMOR -gc GC [-F FASTA] [-o OUT] [-n2 NORMAL2] [-C CHR [CHR ...]] [-L TARGET_FILE]
[--parallel NPROC] [-S SAMTOOLS] [-T TABIX] [-q QLIMIT] [-f QFORMAT] [-N N] [--hom HOM] [--het HET]
[--het_f HET_F]
Input/Output:
Input and output files.
-p, --pileup Use pileups as input files instead of BAMs.
-n NORMAL, --normal NORMAL
Name of the BAM/pileup file from the reference/normal sample
-t TUMOR, --tumor TUMOR
Name of the BAM/pileup file from the tumor sample
-gc GC The GC-content wiggle file
-F FASTA, --fasta FASTA
The reference FASTA file used to generate the intermediate pileup. Required when input are BAM
-o OUT, --output OUT Name of the output file. To use gzip compression name the file ending in .gz. Default STDOUT.
-n2 NORMAL2, --normal2 NORMAL2
Optional BAM/pileup file used to compute the depth.normal and depth-ratio, instead of using the normal BAM.
Genotype:
Options regarding the genotype filtering.
--hom HOM Threshold to select homozygous positions. Default 0.9.
--het HET Threshold to select heterozygous positions. Default 0.25.
--het_f HET_F Threshold of frequency in the forward strand to trust heterozygous calls. Default -0.2 (Disabled, effective with values
>= 0).
Subset indexed files:
Option regarding samtools and bam indexes.
-C CHR [CHR ...], --chromosome CHR [CHR ...]
Argument to restrict the input/output to a chromosome or a chromosome region. Coordinate format is Name:pos.start-
pos.end, eg: chr17:7565097-7590856, for a particular region; eg: chr17, for the entire chromosome. Chromosome names can
checked in the BAM/pileup files and are depending on the FASTA reference used for alignment. Default behavior is to not
selecting any chromosome.
-L TARGET_FILE, --target_file TARGET_FILE
overlap (BED) regions in TARGET_FILE
--parallel NPROC Defines the number of chromosomes to run in parallel. The output will be divided in multiple files, one for each
chromosome. The file name will be composed by the output argument (used as prefix) and a chromosome name given by the
chromosome argument list. This imply that both output and chromosome argument need to be set correctly.
-S SAMTOOLS, --samtools SAMTOOLS
Path of samtools exec file to access the indexes and compute the pileups. Default "samtools"
-T TABIX, --tabix TABIX
Path of the tabix binary. Default "tabix"
Quality and Format:
Options that change the quality threshold and format.
-q QLIMIT, --qlimit QLIMIT
Minimum nucleotide quality score for inclusion in the counts. Default 20.
-f QFORMAT, --qformat QFORMAT
Quality format, options are "sanger" or "illumina". This will add an offset of 33 or 64 respectively to the qlimit
value. Default "sanger".
-N N Threshold to filter positions by the sum of read depth of the two samples. Default 20.
这里修改了源代码,增加了-L参数,安装方法
$ wget -c https://github.com/WortJohn/sequenza-utils/archive/refs/heads/main.zip
$ unzip main.zip
$ pip install ./sequenza-utils-main/
最终把修改后的程序用docker封装了
$ docker run zsr/hrd:v1.0
Usage:
run_hrd_pipe.R small_seqz_file output_directory output_prefix [grch38 or grch37, default=grch38]
镜像Dockerfile
最后把我最终封装使用的Dockerfile抛出给大家参考
# 使用官方的miniconda3基础镜像
FROM continuumio/miniconda3
# 设置工作目录
WORKDIR /work
# 将环境配置文件复制到镜像中
COPY hrd.yaml data.table_1.16.4.tar.gz scarHRD-master.zip ./
# 创建Conda环境
RUN conda env create -f hrd.yaml
# 激活环境
RUN echo "source activate hrd" > ~/.bashrc
ENV PATH /opt/conda/envs/hrd/bin:$PATH
RUN . activate hrd
# 安装特定软件包
RUN R CMD INSTALL data.table_1.16.4.tar.gz
RUN Rscript -e "devtools::install_local('./scarHRD-master.zip', build_vignettes = TRUE, repos=NULL, type='source')"
COPY run_hrd_pipe.R /opt/conda/bin/
RUN chmod 755 /opt/conda/bin/run_hrd_pipe.R
COPY sequenza-WortJohn-patch-1.zip ./
RUN Rscript -e "devtools::install_local('./sequenza-WortJohn-patch-1.zip', force=TRUE, repos=NULL, type='source', dependencies=FALSE)"
COPY sequenza-utils.v2.zip ./
RUN conda install conda-forge::unzip
RUN unzip sequenza-utils.v2.zip
RUN pip install ./sequenza-utils.v2/
COPY copynumber_1.38.0.v2.zip ./
RUN Rscript -e "devtools::install_local('./copynumber_1.38.0.v2.zip', force=TRUE, repos=NULL, type='source', dependencies=FALSE)"
# 设置容器启动时的命令
CMD ["run_hrd_pipe.R"]