cellranger VDJ 算法篇
Step | Operation |
---|---|
Read subsampling | Reduce the reads for a given barcode to at most 80,000, because more reads don’t help. |
Read trimming | Trim off read bases after enrichment primers. |
Graph formation | Build a De Bruijn graph using k = 20 |
Reference-free graph simplification | Simplify the graph by removing ‘noise’ edges. |
Reference-assisted graph simplification | Same, but this time use the reference. |
UMI filtering | Filter out UMIs that are likely to be artifacts. |
Contig construction | Make contigs by looking for the best path through the graph for each UMI. |
Competitive deletion of contigs | Remove contigs that are much weaker than other contigs and which are likely to be artifacts. |
Contig confidence | Define the high confidence contigs, which are likely to represent bona fide transcripts from a single cell associated to a barcode. |
Contig quality scores | Assign a quality score to each base on each contig. |
VDJ Cell Calling Algorithm
Detection of T or B cells depends on identification and counting of V(D)J transcripts from them. Some T and B cells have very low expression levels for these transcripts, and thus these may not be detected. Conversely, sufficiently high levels of extracellular mRNA may result in some barcodes being misidentified as T or B cells. Thus the goal of the VDJ cell calling algorithm is to approximate the set of barcodes that contain a T or B cell.
The algorithm is executed as part of the assembly algorithm. To be identified as a T or B cell, a barcode must satisfy the following three requirements:
- There must be a productive, confident contig, and if there is only one such contig, there must be more than one UMI supporting its junction region. (In the denovo case, we require only that there is a contig.) Although other cell types can exhibit transcription within the TCR and BCR loci, only T and B cells produce fully rearranged transcripts that contain both a V and a C segment. Therefore having a productive contig is good evidence that a transcript from a T or B cell was present in the GEM. However, it is possible that the transcript was background – present in the fluid between cells rather than in an intact cell. Requiring more than one UMI provides some protection against this case.
- There must be at least three filtered UMIs having at least two read pairs each (see Assembly Algorithm). This reduces the likelihood of mis-identifying a cell as a T or B cell based solely on background transcripts.
- Compute the N50 value of the number of read pairs per UMI, across all barcodes. If for a given barcode, the maximum read pair count across filtered UMIs is less then 3% of this N50, do not call the barcode a cell. This provides some protection against transcripts arising from index hopping on an Illumina flowcell, and from other forms of cross-library contamination.
Annotation Algorithm
For a given dataset, the pipeline first determines if the data are TCR or BCR, then accordingly aligns all contigs to the TCR or BCR reference sequences. In rare (mixed) cases contigs are aligned to both. Alignment is seeded on 12-mer perfect matches, followed by heuristic extension; we also search backward from C segment alignments for J segment alignments that do not have 12-mer perfect matches, as these will arise occasionally from somatic hypermutation.
It is important to understand that the choice of V(D)J reference sequences in an alignment can be arbitrary, depending on how similar the reference sequences are to each other. For D segments, which are both short and more mutated, it is often not possible to find a confident alignment, and an alignment may not be shown.
Productive Contigs
A contig is termed productive if the following conditions are met:
- Full length requirement. The contig matches the initial part of a V gene. The contig continues on, ultimately matching the terminal part of a J gene.
- Start requirement. The initial part of the V matches a start codon on the contig. Note that in the human and mouse reference sequences supplied by 10x, every V segment begins with a start codon.
- Nonstop requirement. There is no stop codon between the V start and the J stop.
- In-frame requirement. The J stop minus the V start equals one mod three. This just says that the codons on the V and J segments are in frame.
- CDR3 requirement. There is an annotated CDR3 sequence (see below).
- Structure requirement. Let VJ denote the sum of the lengths of the V and J segments. Let len denote the J stop minus the V start, measured on the contig. Then VJ - len lies between -25 and +25, except for IGH, which must be between -55 and +25. This condition is imposed to preclude anomalous structure changes that are unlikely to correspond to functional proteins.
We require that a CDR3 sequence have length between 5 and 27 amino acids, start with a C, and not contain a stop codon
Clonotype Grouping
T cells. The lack of somatic hypermutation in T cell receptors (TCRs) yields biological clonotypes which have identical V(D)J transcripts. Technical artifacts (e.g. arising in reverse transcription) can result in the computed clonotypes having isolated differences. These are generally rare.
B cells. Fully rearranged B cell receptors (BCRs) can undergo somatic hypermutation (SHM), which can increase antigen affinity. Thus for BCRs, V(D)J transcripts in a clonotype can differ at any position, as shown below:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZHwE4tuA-1610073180341)(https://support.10xgenomics.com/img/cellranger-workflows/vdj/what_is_a_clonotype.png)]
Clonotyping grouping and the associated filtering are complex and heuristic. The concepts are described briefly here, and some link out to the enclone site for technical documentation.
A. Clonotype grouping. This includes two key steps:
Step 1: Cells are placed into groupings called exact subclonotypes if they have essentially identical V(D)J transcripts, meaning the sequences extending from the start codon through the J gene are identical, and the constant region gene assignments are the same. The algorithm does not test for somatic hypermutations (SHM) in the 5’-UTR or constant region, and moreover the data normally extend only up to the primer site in the constant region.
Step 2: Only productive transcripts are used. Cells which are missing a chain (typically because of lower coverage) are placed in their own exact subclonotype, although they may later be joined into the same clonotype (below).
Exact clonotypes are iteratively merged into clonotypes based on comparing each pair of exact subclonotypes to each other. They can be merged into a single clonotype based on two types of evidence:
(a) Similarity of their rearrangement junction regions. Somatic hypermutation during clonal evolution can cause junction regions to diverge.
(b) Evidence of shared SHM outside their junction regions. This is assessed by identifying positions within V genes that are identical on the two exact subclonotypes, differ from the reference, and do not represent germline differences. An outline of how this is implemented is described on the enclone site.
B. Cell filtering during the clonotype grouping process. During library generation, artifacts can arise by two primary mechanisms:
(a) Reverse transcription or sequencing can introduce base call errors. These usually occur at bases having low quality scores. Cells with these low quality bases are screened out, typically at a low rate.
(b) GEMs may contain material from two or more cells: either entire intact cells, cell fragments, or individual mRNA molecules. Detection of contamination is complex and is accomplished via multiple heuristic filters. A partial guide to the filters can be found on the enclone website. Note that if the multi pipeline is run with both gene expression and V(D)J data, then barcodes which are not called cells by the gene expression pipeline will also be deleted from the V(D)J cell set.