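When doing bioinformatics analysis, I usually prefer to download genomes from ENSEMBL. However, that database names chromosomes with bare numbers (1, 2, ...), while some analysis software requires a chr prefix. This post shows how to add the chr prefix to both the GTF file and the genome FASTA file.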
$ ll Homo_sapiens.GRCh38.* |cut -d ' ' -f 5-
842M Apr 22 2023 Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
52M Apr 24 2023 Homo_sapiens.GRCh38.110.gtf.gz
For the GTF file:
$ zcat Homo_sapiens.GRCh38.110.gtf.gz |sed '/^#/!s/^/chr/g' > Homo_sapiens.GRCh38.110.gtf
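In the address /^#/!, the /^#/ part matches lines that begin with #, and the ! negates the match. The substitution s/^/chr/ therefore runs only on lines that are not header comments, so the leading # lines are left untouched.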
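To see the negated address in isolation, here is a minimal sketch on a three-line toy file. The file names sample.gtf and sample.chr.gtf and their contents are made up for illustration and are not part of the ENSEMBL download.

```shell
# Toy GTF-like input (hypothetical): one header comment and two feature lines.
printf '#!genome-build GRCh38\n1\thavana\tgene\n2\thavana\tgene\n' > sample.gtf

# Prefix only the lines that do NOT start with '#'.
sed '/^#/!s/^/chr/' sample.gtf > sample.chr.gtf

# The header line is unchanged; column 1 of the feature lines now carries the prefix.
cut -f 1 sample.chr.gtf
# #!genome-build GRCh38
# chr1
# chr2
```

The g flag is left out here because ^ can match only once per line, so the result is the same with or without it.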
Check the chromosome prefixes in the GTF file:
$ cat Homo_sapiens.GRCh38.110.gtf |grep -v '^#' |cut -f 1 |uniq |head -6
chr1
chr2
chr3
chr4
chr5
chr6
For the genome file:
$ zcat Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz |sed 's/^>/>chr/g' > Homo_sapiens.GRCh38.dna.primary_assembly.fa
Check the chromosome prefixes in the genome file:
$ cat Homo_sapiens.GRCh38.dna.primary_assembly.fa |grep '>' |head -6
>chr1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>chr10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>chr11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>chr12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>chr13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>chr14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
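As an optional sanity check, the sequence names in the two converted files can be compared. The sketch below is only one way to do it: missing_in_fasta is a hypothetical helper name, and demo.fa and demo.gtf are toy files invented for the example. It prints every name that appears in column 1 of the GTF but has no matching header in the FASTA.

```shell
# Hypothetical helper: list sequence names used in the GTF but absent from the FASTA.
missing_in_fasta() {
    gtf=$1
    fasta=$2
    names_dir=$(mktemp -d)
    # Column 1 of every non-comment GTF line, de-duplicated and sorted.
    grep -v '^#' "$gtf" | cut -f 1 | sort -u > "$names_dir/gtf.txt"
    # First word of every FASTA header, without the leading '>'.
    grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u > "$names_dir/fa.txt"
    # comm -23 keeps the lines that occur only in the first file.
    comm -23 "$names_dir/gtf.txt" "$names_dir/fa.txt"
    rm -r "$names_dir"
}

# Toy inputs (made up): the GTF refers to chr3, which the FASTA does not contain.
printf '>chr1 dna:chromosome\nACGT\n>chr2 dna:chromosome\nACGT\n' > demo.fa
printf '#!genome-build GRCh38\nchr1\thavana\tgene\nchr3\thavana\tgene\n' > demo.gtf

missing_in_fasta demo.gtf demo.fa
# chr3
```

On the files from this post the call would be missing_in_fasta Homo_sapiens.GRCh38.110.gtf Homo_sapiens.GRCh38.dna.primary_assembly.fa, and no output means every GTF sequence name has a FASTA header. Note that the sed commands above prefix every sequence name, so MT becomes chrMT and unplaced scaffolds get a chr prefix as well; tools that expect UCSC-style names such as chrM may need an extra renaming step.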