How to choose a reference genome and the corresponding .gtf file

MorganaFu

Last modified: 2023-09-20 14:41:39

Tags: notes, experience sharing

First published: 2023-09-20 14:41:37

Article link: https://blog.csdn.net/MorganaFu/article/details/133074665

When you map sequences of interest to a reference genome, it is important to choose the files that match your research goal. The picture above lists all released versions of the Ensembl archives.

Unmasked Genomic DNA Sequences (`dna`):

This refers to the original, raw sequence of genomic DNA without any modifications. All nucleotide bases (adenine, cytosine, guanine, and thymine represented by A, C, G, and T, respectively) are presented exactly as they appear in the genome.

Masked Genomic DNA (`dna_rm`):

In masked genomic DNA, certain regions of the DNA sequence are hidden or "masked." Specifically, interspersed repeats and low-complexity regions are identified using a tool like RepeatMasker and are replaced with 'N's. The idea is that these repetitive sequences often do not contain meaningful information for certain types of analyses, and masking them can make the analysis more focused and efficient.

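For example, if a section of the original sequence was ACGTACGT and the CGTA (bases 2 to 5) was identified as a repeat, it would be displayed as ANNNNCGT in the masked sequence. The length of the sequence does not change, so genomic coordinates stay the same.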
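A minimal Python sketch of hard-masking, assuming the repeat coordinates are already known (in practice they come from a tool such as RepeatMasker). The function name `hard_mask` and the 0-based, end-exclusive coordinates are illustrative choices, not part of any Ensembl tooling.

```python
def hard_mask(sequence: str, start: int, end: int) -> str:
    """Replace sequence[start:end] with 'N' (0-based, end-exclusive).

    Hypothetical helper for illustration only.
    """
    return sequence[:start] + "N" * (end - start) + sequence[end:]


original = "ACGTACGT"
# The repeat CGTA occupies bases 2 to 5, i.e. the slice [1:5].
masked = hard_mask(original, 1, 5)
print(masked)  # ANNNNCGT
```

The masked bases cannot be recovered from the output, which is the practical difference from soft-masking.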

Soft-Masked Genomic DNA (`dna_sm`):

Soft-masking is similar to hard-masking but less aggressive. In a soft-masked sequence, repeats and low-complexity regions are still identified but instead of replacing them with 'N's, the original nucleotide bases are simply converted to lowercase (a, c, g, t).

For example, if a section of the original sequence was ACGTACGT and the CGTA (bases 2 to 5) was identified as a repeat, it would be displayed as AcgtaCGT in the soft-masked sequence.

This retains the original information while still signaling that certain portions of the sequence are repetitive or low-complexity, which allows software tools more flexibility in how they handle these regions.
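The flexibility of soft-masking can be sketched in Python: a soft-masked sequence can be turned back into the unmasked sequence or into a hard-masked one, whereas a hard-masked sequence cannot be reversed. The helper names below are hypothetical, and the repeat coordinates are assumed to be known in advance.

```python
def soft_mask(sequence: str, start: int, end: int) -> str:
    """Lowercase sequence[start:end] (0-based, end-exclusive)."""
    return sequence[:start] + sequence[start:end].lower() + sequence[end:]


def soft_to_unmasked(sequence: str) -> str:
    """Recover the unmasked sequence by uppercasing every base."""
    return sequence.upper()


def soft_to_hard(sequence: str) -> str:
    """Convert soft-masking to hard-masking: lowercase bases become 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


soft = soft_mask("ACGTACGT", 1, 5)
print(soft)                    # AcgtaCGT
print(soft_to_unmasked(soft))  # ACGTACGT
print(soft_to_hard(soft))      # ANNNNCGT
```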

3. What does "primary assembly" mean?

The term "primary assembly" in genomics refers to the foundational DNA sequence that is assembled from raw sequencing data. This sequence serves as the main representation of an organism's genome: the aim is to build the longest and most accurate sequences (contigs and scaffolds) possible from the read data. In Ensembl file names, `primary_assembly` contains all top-level sequence regions excluding haplotypes and patches, whereas `toplevel` also includes those alternative regions.

4. Which file should I choose?

If your target sequences are genes, you should use the .dna file, because the hard-masked file (`dna_rm`) hides the repeat regions, which can leave many reads unmapped.
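Whichever sequence type you pick, the FASTA and the .gtf file should come from the same Ensembl release and the same assembly. The sketch below builds the two file locations from one set of parameters so that they cannot drift apart. The URL layout is an assumption based on how the Ensembl FTP site is commonly organized, and `ensembl_files` is a hypothetical helper; check the result against the actual directory listing for your species and release before downloading.

```python
def ensembl_files(species: str, assembly: str, release: int,
                  seq_type: str = "dna",
                  region: str = "primary_assembly") -> dict:
    """Build matching FASTA and GTF URLs for one Ensembl release.

    Assumed layout (verify on the FTP site):
      .../release-<N>/fasta/<species>/dna/<Species>.<assembly>.<seq_type>.<region>.fa.gz
      .../release-<N>/gtf/<species>/<Species>.<assembly>.<N>.gtf.gz
    """
    if seq_type not in {"dna", "dna_rm", "dna_sm"}:
        raise ValueError(f"unknown sequence type: {seq_type}")
    base = f"https://ftp.ensembl.org/pub/release-{release}"
    lower = species.lower()            # directory name, e.g. homo_sapiens
    capitalized = lower.capitalize()   # file prefix, e.g. Homo_sapiens
    return {
        "fasta": f"{base}/fasta/{lower}/dna/"
                 f"{capitalized}.{assembly}.{seq_type}.{region}.fa.gz",
        "gtf": f"{base}/gtf/{lower}/"
               f"{capitalized}.{assembly}.{release}.gtf.gz",
    }


files = ensembl_files("homo_sapiens", "GRCh38", 110)
print(files["fasta"])
print(files["gtf"])
```

Not every species has a `primary_assembly` file; when it is absent, passing `region="toplevel"` is the usual alternative.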