How to choose a reference genome and the corresponding .gtf file

When you want to map your sequences of interest to a reference genome, it is important to choose the files that match your research goal. The picture above shows all released versions in the Ensembl archives.

Archives

Take Ensembl 106 as an example:

On the right side of the website, you can easily see the version that you are looking at and the release date. If you are looking for the human genome, just click "Human", and you will get:

Under the "Gene annotation" section, you can download the FASTA file for the genome as well as the .gtf file:

1. The file names in the picture above follow several patterns. What do they consist of?

2. What do "rm" and "sm" mean?
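As far as I understand the README in the Ensembl FTP directory, genomic FASTA files are named `<species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz`, for example `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz`. The sketch below splits such a name into those parts. The function name `parse_ensembl_fasta_name` and the dictionary keys are my own choices for illustration, and the parser only covers the three genomic sequence types discussed in this post.

```python
# Sequence types used in Ensembl genomic FASTA names:
# unmasked, hard-masked (repeat-masked) and soft-masked.
SEQ_TYPES = {"dna", "dna_rm", "dna_sm"}


def parse_ensembl_fasta_name(filename):
    """Split an Ensembl-style genomic FASTA file name into labelled parts.

    Hypothetical helper: it assumes the pattern
    <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz
    """
    # Remove the FASTA suffix, compressed or not.
    for suffix in (".fa.gz", ".fa"):
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"not a FASTA file name: {filename}")

    parts = stem.split(".")
    # Locate the sequence-type token; everything between the species
    # and that token is the assembly name (which may itself contain dots).
    idx = next((i for i, p in enumerate(parts) if p in SEQ_TYPES), None)
    if idx is None or idx < 2 or idx + 1 >= len(parts):
        raise ValueError(f"unexpected file name layout: {filename}")

    return {
        "species": parts[0],
        "assembly": ".".join(parts[1:idx]),
        "sequence_type": parts[idx],
        "id_type": parts[idx + 1],
        # The id is absent for whole-genome files such as primary_assembly.
        "id": ".".join(parts[idx + 2:]) or None,
    }


info = parse_ensembl_fasta_name("Homo_sapiens.GRCh38.dna_sm.chromosome.1.fa.gz")
print(info)
```

For the name above, `sequence_type` comes out as `dna_sm`, `id_type` as `chromosome`, and `id` as `1`.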

Unmasked Genomic DNA Sequences (dna):

This refers to the original, raw sequence of genomic DNA without any modifications. All nucleotide bases (adenine, cytosine, guanine, and thymine represented by A, C, G, and T, respectively) are presented exactly as they appear in the genome.

Masked Genomic DNA (dna_rm):

In masked genomic DNA, certain regions of the DNA sequence are hidden or "masked." Specifically, interspersed repeats and low-complexity regions are identified using a tool like RepeatMasker and are replaced with 'N's. The idea is that these repetitive sequences often do not contain meaningful information for certain types of analyses, and masking them can make the analysis more focused and efficient.

For example, if a section of the original sequence was ACGTACGT, and the CGTA was a repeat, it would be displayed as ANNNNCGT in the masked sequence.
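To make the effect concrete, here is a minimal sketch of hard-masking one interval of a sequence. It is not how RepeatMasker works internally. It only mimics the output, assuming the repeat coordinates are already known. The function name `hard_mask` is hypothetical.

```python
def hard_mask(sequence, start, end):
    """Replace sequence[start:end] with 'N' (0-based, end-exclusive)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("interval lies outside the sequence")
    # Keep the flanks untouched and overwrite the repeat base by base,
    # so the masked sequence has the same length as the original.
    return sequence[:start] + "N" * (end - start) + sequence[end:]


# The repeat CGTA occupies positions 2 to 5 (1-based) of ACGTACGT.
masked = hard_mask("ACGTACGT", 1, 5)
print(masked)  # ANNNNCGT
```

Note that the original bases cannot be recovered from the masked string, which is the key difference from soft-masking.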

Soft-Masked Genomic DNA (dna_sm):

Soft-masking is similar to hard-masking but less aggressive. In a soft-masked sequence, repeats and low-complexity regions are still identified but instead of replacing them with 'N's, the original nucleotide bases are simply converted to lowercase (a, c, g, t).

For example, if a section of the original sequence was ACGTACGT, and the CGTA was identified as a repeat, it would be displayed as AcgtaCGT in the soft-masked sequence.

This retains the original information while still signaling that certain portions of the sequence are repetitive or low-complexity, which allows software tools more flexibility in how they handle these regions.
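The flexibility mentioned here can be sketched in a few lines: a soft-masked sequence can be turned back into the unmasked one with an uppercase conversion, or into a hard-masked one by replacing lowercase bases with 'N'. The helper names `soft_mask` and `soft_to_hard` are hypothetical, and the repeat coordinates are assumed to be known in advance.

```python
def soft_mask(sequence, start, end):
    """Lowercase sequence[start:end] (0-based, end-exclusive)."""
    return sequence[:start] + sequence[start:end].lower() + sequence[end:]


def soft_to_hard(sequence):
    """Replace every lowercase (soft-masked) base with 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


soft = soft_mask("ACGTACGT", 1, 5)
print(soft)                # AcgtaCGT
print(soft.upper())        # ACGTACGT, the unmasked sequence is recovered
print(soft_to_hard(soft))  # ANNNNCGT, same as the hard-masked version
```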

3. What does "primary assembly" mean?

In genomics, the "primary assembly" is the main representation of an organism's genome: the longest and most accurate sequences (contigs, scaffolds, and chromosomes) that could be assembled from the read data. In Ensembl file names, "primary_assembly" contains all top-level sequence regions but excludes haplotypes and patches, whereas "toplevel" includes them.

4. Which file should I choose?

If you are mapping reads to genes, use the unmasked "dna" file. The hard-masked "dna_rm" file hides repeat regions behind 'N's, so reads that originate from those regions cannot be aligned, which can leave many reads unmapped.
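If you are unsure which variant a downloaded file is, one rough check is to count how many bases are uppercase, lowercase, or 'N'. The sketch below does this for any iterable of FASTA lines. The function name `masking_summary` and the result keys are hypothetical. Keep in mind that even the unmasked file contains some 'N's, because assembly gaps are also written as 'N'.

```python
def masking_summary(fasta_lines):
    """Count unmasked, soft-masked and N bases in FASTA-formatted lines."""
    counts = {"unmasked": 0, "soft_masked": 0, "N": 0}
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and header lines such as ">1 dna:chromosome ...".
        if not line or line.startswith(">"):
            continue
        for base in line:
            if base in "Nn":
                counts["N"] += 1
            elif base.islower():
                counts["soft_masked"] += 1
            else:
                counts["unmasked"] += 1
    return counts


# Tiny in-memory example instead of a real genome file.
example = [">1 example record", "ACgtaNNT", "acGT"]
print(masking_summary(example))
```

For a real download you could pass an open file handle instead of the list, for example one returned by `gzip.open(path, "rt")`, where `path` is the location of your own `.fa.gz` file. A file with a large "N" share is probably hard-masked, and one with many lowercase bases is soft-masked.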
