cellranger 操作笔记-2:构建绵羊单细胞转录组参考基因组

本文详细介绍了如何为单细胞测序数据准备参考基因组,包括从Ensembl下载绵羊的参考基因组和注释文件,使用cellranger工具过滤非编码转录本,创建10X兼容的参考基因组,并解决线粒体基因缺失的问题。此外,还展示了如何添加外源基因如GFP到参考基因组中,以供后续分析使用。
摘要由CSDN通过智能技术生成

参考10X官方教程:Find the input files -Software -Single Cell Gene Expression -Official 10x Genomics Support

1. 说明

    用于单细胞测序数据的参考基因组与bulk转录组、芯片、重测的参考基因组存在差别,需要对下载好的参考基因组进行加工。

    10X提供构建好的人、小鼠的参考基因组,并且附有详细的构建过程和参数

2. 参考基因组和注释文件

参考基因组,约800M

wget -b -c http://ftp.ensembl.org/pub/release-103/fasta/ovis_aries_rambouillet/dna/Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz
##最好是下载*.dna.primary_assembly.fa.gz文件,但是绵羊中没有,下载.dna.toplevel.fa.gz参考基因组
gunzip Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz 

注释文件:约13M。cellranger仅支持.gtf格式注释文件,不支持.gff格式。Ensemble中有多种类型注释文件,下载*.chr.gtf.gz.

wget http://ftp.ensembl.org/pub/release-103/gtf/ovis_aries_rambouillet/Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.gtf.gz
gunzip Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.gtf.gz

3. 过滤注释文件

    GTF文件中包含非polyA转录本的注释信息,一些基因碱基序列与蛋白编码基因碱基序列重叠,由于注释信息的重叠,从而造成了读段比对到多个基因上(multi-mapped)。这种情况下,不会对这些多重比对的读段进行计数。为此,我们需要移除.gtf中这些转录本的注释信息,仅留下--attribute=gene_biotype:protein_coding的转录本

    有些地方理解的不太准确,原文如下:GTF files can contain entries for non-polyA transcripts that overlap with protein-coding gene models.  These entries can cause reads to be flagged as mapped to multiple genes (multi-mapped) because of the overlapping annotations. In the case where reads are flagged as multi-mapped, they are not counted. See article on Which reads are considered for UMI counting by Cell Ranger. To remove these entries from the GTF, add this filter argument to the mkgtf command: --attribute=gene_biotype:protein_coding. If you are interested in seeing all of the filters used to build references available on our support site, click here. If you are using a GTF file that does not contain gene_biotype attributes or is missing other entries, don't worry too much; there may still be enough information to build a reference. A minimal GTF file only needs to contain exon features for protein coding genes.

    cellranger mkgtf --help  查看参数

Genes GTF tool for 10x Genomics Cell Ranger.
Filter user-supplied GTF files for use as Cell Ranger-compatible
genes files for mkref tool.
The commands below should be preceded by 'cellranger':

Usage:
    mkgtf <input_gtf> <output_gtf> [--attribute=KEY:VALUE...]
    mkgtf -h | --help | --version

Arguments:
    input_gtf           Path to input genes GTF file.
    output_gtf          Path to filtered output genes GTF file.

Options:
    --attribute=<key:value>
                        Key-value pair in attributes field to be kept in the GTF
                            file.
    -h --help           Show this message.
    --version           Show version.

       运行命令,很快,1min

cellranger mkgtf Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.gtf  Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.filtered.gtf --attribute=gene_biotype:protein_coding

4. 准备单细胞分析参考基因组

    cellranger mkref --help 查看命令参数

Reference preparation tool for 10x Genomics Cell Ranger.

Build a Cell Ranger-compatible reference folder from user-supplied genome FASTA and gene GTF files. Creates a new folder named after the genome.

The commands below should be preceded by 'cellranger':

Usage:
    mkref
        --genome=NAME ...
        --fasta=PATH ...
        --genes=PATH ...
        [options]
    mkref -h | --help | --version

Arguments:
    genome  #输出文件夹      Unique genome name(s), used to name output folder
                            [a-zA-Z0-9_-]+. Specify multiple genomes by
                            specifying the --genome argument multiple times; the
                            output folder will be <name1>_and_<name2>.
    fasta   #FASTA参考基因组绝对路径 
                            Path(s) to FASTA file containing your genome reference.
                            Specify multiple genomes by specifying the --fasta
                            argument multiple times.
    genes   #.filtered.gtf注释文件绝对路径
                            Path(s) to genes GTF file(S) containing annotated genes
                            for your genome reference. Specify multiple genomes
                            by specifying the --genes argument multiple times.

Options:
    --nthreads=<num>    This option is currently ignored due to a bug, and will be re-enabled
                          in the next Cell Ranger release.
    --memgb=<num>       Maximum memory (GB) used when aligning reads with STAR.
                            Defaults to 16.
    --ref-version=<str> Optional reference version string to include with
                            reference.
    -h --help           Show this message.
    --version           Show version.

    生成参考基因组文件夹,在服务器中需要1-2h,用nohup命令挂起。

nohup cellranger mkref --genome=ovis_aries --fasta=Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa --genes=Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.filtered.gtf &

    命令运行详情:


['/home/hanjiangang/software/cellranger-6.0.0/bin/rna/mkref', '--genome=ovis_aries', '--fasta=Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa', '--genes=Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.filtered.gtf']
Apr 15 14:36:45 ..... started STAR run
Apr 15 14:36:45 ... starting to generate Genome files
Apr 15 14:38:52 ... starting to sort Suffix Array. This may take a long time...
Apr 15 14:39:03 ... sorting Suffix Array chunks and saving them to disk...
Apr 15 16:40:45 ... loading chunks from disk, packing SA...
Apr 15 16:41:47 ... finished generating suffix array
Apr 15 16:41:47 ... generating Suffix Array index
Apr 15 16:46:07 ... completed Suffix Array index
Apr 15 16:46:07 ..... processing annotations GTF
Apr 15 16:46:19 ..... inserting junctions into the genome indices
Apr 15 16:55:08 ... writing Genome to disk ...
Apr 15 16:55:23 ... writing Suffix Array to disk ...
Apr 15 16:56:00 ... writing SAindex to disk
Apr 15 16:56:08 ..... finished successfully
Creating new reference folder at /home/hanjiangang/single_Cell/example/ref/ovis_aries/ovis_aries
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

>>> Reference successfully created! <<<

You can now specify this reference on the command line:
cellranger --transcriptome=/home/hanjiangang/single_Cell/example/ref/ovis_aries/ovis_aries ...

    生成的参考基因组目录结构:有一个log.out的输出文件,其中有fasta包含的染色体信息

tree ovis_aries
ovis_aries/
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

5. Trouble:无染色体或染色体注释信息

    在下游对细胞进行质控时,发现对线粒体基因比例进行筛选,之后发现没有注释到线粒体基因,感觉没有道理

    最终在注释文件中发现了问题所在

    首先需要确定下载的参考基因组是否含有线粒体序列,结果是有的

grep ">" Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa  | grep "MT"
>MT dna:primary_assembly primary_assembly:Oar_rambouillet_v1.0:MT:1:16616:1 REF

    然后查看注释文件是否有线粒体注释信息,结果是!!没有!!步步是坑啊

##艹,没有
 awk -F '\t' '{print $1}' Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.chr.gtf | grep "MT"
##是按照10x官网标准下载的*.chr.gtf文件,结果没有线粒体基因信息


##重新下载注释文件
wget -b -c http://ftp.ensembl.org/pub/release-103/gtf/ovis_aries_rambouillet/Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.gtf.gz

awk -F '\t' '{print $1}' Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.gtf | grep "MT" 
#N多mt
MT
MT
MT
MT
MT
MT
MT
MT
MT
MT
MT
MT

    但是!!!在NCBI上显示线粒体中是有13个基因表达的,而这13个基因,有的在Ensembl注释文件有,有的没有,比如“ATP8”,鬼知道之后的分析会不会报错!!

grep "ATP8" Ovis_aries_rambouillet.Oar_rambouillet_v1.0.103.gtf | grep "MT"

6. 用重新下载的注释文件制作10x参考基因组

############################5. 在参考基因组和注释文件中添加marker gene或外源基因,一般分析不涉及该内容,转自10X官网流程##################################

    以GFP(绿色荧光蛋白)为例

    下载GFP碱基序列(GFP有多种碱基序列,在这里仅展示了其中一种),

wwget -O GFP_orig.fa --no-check-certificate https://www.ebi.ac.uk/ena/browser/api/fasta/AAA27722.1?download=true

head GFP_orig.fa 
>ENA|AAA27722|AAA27722.1 Aequorea victoria green-fluorescent protein
ATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGT
GATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGA
AAACTTACCCTTAAATTTATTTGCACTACTGGAAAGCTACCTGTTCCATGGCCAACACTT
GTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAG
CATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTA

    在GFP_orig.fa文件中有“|”字符和很多空格,可能造成下游分析报错,修改GFP_orig.fa文件的title

cat GFP_orig.fa | sed s/ENA\|AAA27722\|AAA27722\.\1\ Aequorea\ victoria\ green\-fluorescent\ protein/GFP/ > GFP.fa
head GFP.fa 
>GFP
ATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGT
GATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGA
AAACTTACCCTTAAATTTATTTGCACTACTGGAAAGCTACCTGTTCCATGGCCAACACTT
GTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAG
CATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTAC
AAAGATGACGGGAACTACAAATCACGTGCTGAAGTCAAGTTTGAAGGTGATACCCTCGTT
AATAGAATTGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAA
ATGGAATACAACTATAACTCACACAATGTATACATCATGGCAGACAAACAAAAGAATGGA
ATCAAAGTTAACTTCAAAATTAGACACAACATTGAAGATGGAAGCGTTCAACTAGCAGAC

    查看序列长度,717个碱基

cat GFP.fa | grep -v "^>" | tr -d "\n" | wc -c
717

    制作GFP.fa的注释文件

echo -e 'GFP\tunknown\texon\t1\t717\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' > GFP.gtf
head GFP.gtf 
GFP	unknown	exon	1	717	.	+	.	gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding"

    将GFP.fa的碱基序列添加到参考基因组中,并将新的参考基因组命名为 Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

cp Danio_rerio.GRCz11.dna.primary_assembly.fa Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa   ##重命名
cat GFP.fa >> Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa  ##添加GFP碱基序列
grep ">" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa    ##查看是否添加成功

输出如下:
>1 dna:chromosome chromosome:GRCz11:1:1:59578282:1 REF
>10 dna:chromosome chromosome:GRCz11:10:1:45420867:1 REF
>11 dna:chromosome chromosome:GRCz11:11:1:45484837:1 REF
...
>MT dna:chromosome chromosome:GRCz11:MT:1:16596:1 REF
>KN149696.2 dna:scaffold scaffold:GRCz11:KN149696.2:1:368252:1 REF
>KN147651.2 dna:scaffold scaffold:GRCz11:KN147651.2:1:351968:1 REF
>KN149690.1 dna:scaffold scaffold:GRCz11:KN149690.1:1:343018:1 REF
>KN149686.1 dna:scaffold scaffold:GRCz11:KN149686.1:1:260365:1 REF
>KN147652.2 dna:scaffold scaffold:GRCz11:KN147652.2:1:252640:1 REF
>KN149688.2 dna:scaffold scaffold:GRCz11:KN149688.2:1:252035:1 REF
>KN149691.1 dna:scaffold scaffold:GRCz11:KN149691.1:1:233193:1 REF
...
>GFP
添加成功

    查看基因组中contigs 数量,994个

grep -c "^>" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

    将GFP的注释信息添加到参考基因组的注释信息中,并命名为Danio_rerio.GRCz11.99.chr.filtered.GFP.gtf

cp Danio_rerio.GRCz11.99.chr.filtered.gtf Danio_rerio.GRCz11.99.chr.filtered.GFP.gtf
cat GFP.gtf >> Danio_rerio.GRCz11.99.chr.filtered.GFP.gtf
tail Danio_rerio.GRCz11.99.chr.filtered.GFP.gtf

输出如下:
MT  RefSeq  start_codon 15308   15310   .   +   0   gene_id "ENSDARG00000063924"; gene_version "3"; transcript_id "ENSDART00000093625"; transcript_version "3"; exon_number "1"; gene_name "mt-cyb"; gene_source "RefSeq"; gene_biotype "protein_coding"; transcript_name "mt-cyb-201"; transcript_source "RefSeq"; transcript_biotype "protein_coding";
GFP unknown exon    1   717 .   +   .   gene_id GFP; transcript_id GFP; gene_name GFP; gene_biotype protein_coding;

将作为输出文件,再次运行cellranger mkref

cellranger mkref --genome=Danio.rerio_genome_GFP --fasta=Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa --genes=Danio_rerio.GRCz11.99.chr.filtered.GFP.gtf

6. 10X官方网站还提供了外源基因、恒河猴、猕猴、兔、大鼠等物种的单细胞参考基因组注释方法,有兴趣的可以参考官网。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值