When you map your sequences of interest to a reference genome, it is important to choose the files that match your research goal. The picture above shows all the released versions of the Ensembl archives.
Take Ensembl 106 as an example:
On the right side of the website, you can see which version you are looking at and its release date. If you are looking for the human genome, just click "Human", and you will get:
Under the "Gene annotation" section, you can download the FASTA file for the genome, as well as the .gtf file:
1. The file names in the picture above come in many formats. What do they consist of?
2. What do "rm" and "sm" mean?
Unmasked Genomic DNA Sequences (dna):
This refers to the original, raw sequence of genomic DNA without any modifications. All nucleotide bases (adenine, cytosine, guanine, and thymine represented by A, C, G, and T, respectively) are presented exactly as they appear in the genome.
Masked Genomic DNA (dna_rm):
In masked genomic DNA, certain regions of the DNA sequence are hidden or "masked." Specifically, interspersed repeats and low-complexity regions are identified using a tool like RepeatMasker and are replaced with 'N's. The idea is that these repetitive sequences often do not contain meaningful information for certain types of analyses, and masking them can make the analysis more focused and efficient.
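For example, if a section of the original sequence was ACGTACGT, and the CGTA was a repeat, it would be displayed as ANNNNCGT in the masked sequence. Each masked base is replaced by exactly one N, so the sequence length and all coordinates stay the same.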
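The hard-masking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hard_mask` and the half-open `(start, end)` interval format are made up for this example, and real pipelines rely on a tool such as RepeatMasker to find the repeat positions.

```python
from typing import Sequence, Tuple


def hard_mask(sequence: str, repeats: Sequence[Tuple[int, int]]) -> str:
    """Replace every base inside the given repeat intervals with 'N'.

    Intervals are 0-based and half-open: (start, end) covers
    sequence[start:end]. This format is an assumption for the sketch.
    """
    bases = list(sequence)
    for start, end in repeats:
        for i in range(start, end):
            bases[i] = "N"  # one N per masked base, so the length is unchanged
    return "".join(bases)


# CGTA occupies positions 1..4 of ACGTACGT
print(hard_mask("ACGTACGT", [(1, 5)]))  # ANNNNCGT
```

Because each base is swapped for a single N, coordinates in the masked file still line up with the unmasked one.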
Soft-Masked Genomic DNA (dna_sm):
Soft-masking is similar to hard-masking but less aggressive. In a soft-masked sequence, repeats and low-complexity regions are still identified but instead of replacing them with 'N's, the original nucleotide bases are simply converted to lowercase (a, c, g, t).
For example, if a section of the original sequence was ACGTACGT, and the CGTA was identified as a repeat, it would be displayed as AcgtaCGT in the soft-masked sequence.
This retains the original information while still signaling that certain portions of the sequence are repetitive or low-complexity, which allows software tools more flexibility in how they handle these regions.
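To make the "flexibility" concrete, here is a minimal Python sketch of soft-masking and of what a downstream tool could do with it. The function names (`soft_mask`, `soft_to_hard`, `masked_fraction`) and the half-open interval format are assumptions made for this example, not part of any Ensembl tool.

```python
from typing import Sequence, Tuple


def soft_mask(sequence: str, repeats: Sequence[Tuple[int, int]]) -> str:
    """Lowercase the bases inside each 0-based, half-open (start, end) interval."""
    bases = list(sequence.upper())
    for start, end in repeats:
        bases[start:end] = [b.lower() for b in bases[start:end]]
    return "".join(bases)


def soft_to_hard(soft_masked: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one (lowercase -> N)."""
    return "".join("N" if b.islower() else b for b in soft_masked)


def masked_fraction(soft_masked: str) -> float:
    """Fraction of bases flagged as repeat or low-complexity (lowercase)."""
    if not soft_masked:
        return 0.0
    return sum(b.islower() for b in soft_masked) / len(soft_masked)


soft = soft_mask("ACGTACGT", [(1, 5)])
print(soft)                   # AcgtaCGT
print(soft.upper())           # ACGTACGT (ignore the masking entirely)
print(soft_to_hard(soft))     # ANNNNCGT (treat it as hard-masked)
print(masked_fraction(soft))  # 0.5
```

The soft-masked form can always be converted to the unmasked or the hard-masked form, but not the other way round, which is why it is the most flexible of the three.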
3. What does "primary assembly" mean?
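The term "primary assembly" in genomics refers to the foundational DNA sequence that is assembled from raw sequencing data. This sequence serves as the main representation of an organism's genome. Generally, in a primary assembly, the aim is to create the longest and most accurate sequences (contigs and scaffolds) possible from the read data. In Ensembl file names, "primary_assembly" marks the file that contains all top-level sequence regions but excludes haplotypes and patches, whereas the "toplevel" files include them.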
4. Which file should I choose?
If your target sequences are genes, you should use the unmasked .dna file, because the hard-masked file hides the repeat regions, which can leave many reads unmapped.
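As a rough sketch, the choice of file can be turned into a download link. The helper name `ensembl_fasta_url` is hypothetical, and the URL pattern below is an assumption based on how the Ensembl FTP site is commonly laid out; check it against the actual directory listing for your release and species before relying on it.

```python
def ensembl_fasta_url(
    species: str = "Homo_sapiens",
    assembly: str = "GRCh38",
    release: int = 106,
    seq_type: str = "dna",              # "dna", "dna_rm" or "dna_sm"
    region: str = "primary_assembly",   # or "toplevel"
) -> str:
    """Build a genome FASTA URL (assumed FTP layout, verify before use)."""
    if seq_type not in {"dna", "dna_rm", "dna_sm"}:
        raise ValueError(f"unknown sequence type: {seq_type}")
    base = f"https://ftp.ensembl.org/pub/release-{release}/fasta"
    file_name = f"{species}.{assembly}.{seq_type}.{region}.fa.gz"
    return f"{base}/{species.lower()}/dna/{file_name}"


# Unmasked primary assembly for human, Ensembl 106
print(ensembl_fasta_url())
```

Switching `seq_type` to `"dna_sm"` or `"dna_rm"` gives the soft-masked or hard-masked counterpart of the same file.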