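1. Downloading and installing Cell Ranger
Go to the 10x Genomics software page (Software Overview - 10x Genomics), scroll down to Cell Ranger and click Download to reach the download page. That page offers the Cell Ranger software, reference genomes and other resources. You can either download the Cell Ranger tarball and upload it to your server, or copy the curl command shown on the page and run it on the server directly. Note that the site asks for personal information (name, email, institution) before granting access. The signed URL in the command is time-limited (see its Expires parameter), so copy a fresh command from the download page rather than reusing the one below.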
curl -o cellranger-8.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.0.tar.gz?Expires=1713279020&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=ZTePihPd94hgAqjrHT03NaPNcF2nwdzDyDhycY2sBXVaspcenoza5qIK623rXip8x3Mou2mUpaqpWlPu1KpYQMwrHk~TQM~bqUvOLY8CgVRJvy5gd2RxpPzWcTr5~q9IrRDWRj2v6PDCR-362~a8Gp8w7iLFgMYO3FFO5IJK0I~ToaEH~PKDwUDNfLMkZZQ~hTojrbEVCM-OTjEBrvS5Q3or1Ehv0IAmRCnCGUJeqa4e99qzblFBozqAl9DV3m4ca6ujrLYPSUjFuW72wxOIP2eVTNG-Aal0pgneWtor9ZUt~jMZ-2VTb~T3anOiLGS~XOEz3fLK8Q3JwEHtlckP8Q__"



After the download finishes, extract the archive, add the extracted Cell Ranger directory to your PATH, and run cellranger count --help to check that Cell Ranger starts correctly.

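2. Using the cellranger count command
Cell Ranger provides several subcommands: mkfastq converts BCL files to FASTQ files; count analyzes FASTQ files and produces the input files used for Seurat analysis; aggr combines the count outputs of multiple samples.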

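Run the command below to process the FASTQ files of a single-cell sequencing run. The parameters are explained after the command; --expect-cells is discussed in more detail in section 4.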
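cellranger count --id Sample_1 --transcriptome /home/Data/refdata-gex-GRCh38-2024-A --fastqs Sample_1 --sample Sample_1 --create-bam=true --expect-cells 10000 --localcores 8 --localmem 40
Parameters:
--id Sample_1: run ID, which is also the name of the output directory
--transcriptome /home/Data/refdata-gex-GRCh38-2024-A: path to the reference transcriptome
--fastqs Sample_1: path to the directory containing the FASTQ files
--sample Sample_1: sample name, i.e. the prefix of the FASTQ file names to process
--create-bam=true: whether to write a BAM file; this parameter is required in Cell Ranger 8.0
--expect-cells 10000: expected number of recovered cells, which depends on the number of cells loaded in the experiment; the default was 3000 before Cell Ranger 7.0, and from 7.0 onward the value is estimated automatically when the option is omitted
--localcores 8: maximum number of cores Cell Ranger may use; by default it uses all available cores
--localmem 40: maximum amount of memory Cell Ranger may use, in GB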
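Beyond this basic usage, further options are available: --create-bam=false skips BAM generation (in Cell Ranger 8.0 this replaces the --no-bam flag of earlier versions); --nosecondary skips secondary analysis such as clustering; --include-introns=false excludes intronic reads from UMI counting.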

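3. Output files of cellranger count
After the run, Cell Ranger produces the following output:
web_summary.html: a summary report with the sample's UMI counts, quality-control metrics and related information;
filtered_feature_bc_matrix directory: the input for Seurat, containing barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz;
filtered_feature_bc_matrix.h5: the single-cell data stored in HDF5 format;
analysis directory: results of the secondary analysis, namely PCA, cell clustering, t-SNE and UMAP dimensionality reduction, and differential expression between clusters.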
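
To make the layout of the filtered_feature_bc_matrix directory concrete, here is a minimal Python sketch that reads the three files with the standard library only. It assumes the usual layout of these files: one barcode per line, a tab-separated features file (ID, name, feature type), and a Matrix Market coordinate file whose rows are features and whose columns are barcodes. The function names are hypothetical helpers, not part of Cell Ranger or Seurat.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\r\n") for line in handle if line.strip()]


def load_feature_bc_matrix(matrix_dir):
    """Load a feature-barcode matrix directory into plain Python structures.

    Returns (barcodes, features, counts), where counts maps
    (feature_index, barcode_index) -> UMI count with 0-based indices.
    """
    matrix_dir = Path(matrix_dir)
    barcodes = read_lines(matrix_dir / "barcodes.tsv.gz")
    # Each features line is tab-separated: feature ID, feature name, feature type.
    features = [line.split("\t") for line in read_lines(matrix_dir / "features.tsv.gz")]

    # Matrix Market: lines starting with '%' are header/comments, then one
    # size line "rows cols entries", then 1-based "row col value" entries.
    body = [
        line
        for line in read_lines(matrix_dir / "matrix.mtx.gz")
        if not line.startswith("%")
    ]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    counts = {}
    for line in body[1:]:
        row, col, value = line.split()
        counts[(int(row) - 1, int(col) - 1)] = int(value)

    # Sanity checks: the matrix dimensions must match the two label files.
    assert (n_rows, n_cols) == (len(features), len(barcodes))
    assert n_entries == len(counts)
    return barcodes, features, counts
```

For real datasets a sparse-matrix library is the better choice; the dictionary here only serves to show how the three files fit together.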

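4. Algorithms behind cellranger count
The overall workflow is shown in the figure below. FASTQ reads are first aligned to the reference genome to produce a BAM file; the BAM file is then quantified to produce the raw gene expression matrix; finally, cell calling is performed on the number of UMIs per barcode. Along the way, reads, UMIs and cells each pass through successive filters.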

Alignment
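Read trimming: each full-length cDNA construct carries a 30-bp TSO sequence at its 5' end and a polyA sequence at its 3' end. These non-template sequences interfere with alignment, so both are trimmed before alignment to improve mapping sensitivity. In the output BAM file, the tags ts:i and pa:i give the number of TSO and polyA bases trimmed, respectively.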
This section on read trimming applies to 3' Gene Expression assays.
Each full-length cDNA construct is flanked by a 30-bp template switch oligo (TSO) sequence (AAGCAGTGGTATCAACGCAGAGTACATGGG) at the 5' end and a poly-A sequence at the 3' end. The fragment size distribution of the sequencing library influences the likelihood of sequencing reads containing these sequences; reads derived from shorter RNA molecules are more prone to include both TSO and poly-A sequences compared to those from longer RNA molecules.
Due to the presence of non-template sequences in the form of either TSO or poly-A, low-complexity ends can complicate read mapping. As a result, the TSO sequence is trimmed from the 5' end and the poly-A is trimmed from the 3' end of read 2 before alignment. This trimming enhances the sensitivity of the assay and improves the computational efficiency of the software pipeline.
The tags ts:i and pa:i in the output BAM files indicate the number of TSO nucleotides trimmed from the 5' end of read 2 and the number of poly-A nucleotides trimmed from the 3' end. These trimmed bases are present in the sequence of the BAM record, and the CIGAR string reveals the position of these soft-clipped sequences.
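
The trimming idea can be illustrated with a short Python sketch. It is deliberately simplified and is not Cell Ranger's implementation: it only recognizes an exact, full-length TSO prefix, and it treats a trailing run of A's as poly-A only if the run reaches a hypothetical minimum length (`min_poly_a`, an assumption made for this example). The returned `ts` and `pa` values play the role of the ts:i and pa:i tags.

```python
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"  # 30-bp TSO sequence quoted above


def trim_read2(seq, min_poly_a=10):
    """Trim a leading TSO and a trailing poly-A run from a read 2 sequence.

    Returns (trimmed_sequence, ts, pa): ts is the number of TSO bases
    removed from the 5' end, pa the number of poly-A bases removed from
    the 3' end. `min_poly_a` is an illustrative threshold, not a
    documented Cell Ranger setting.
    """
    ts = len(TSO) if seq.startswith(TSO) else 0
    body = seq[ts:]
    stripped = body.rstrip("A")
    pa = len(body) - len(stripped)
    if pa < min_poly_a:
        # A short run of A's is treated as genuine template sequence.
        stripped, pa = body, 0
    return stripped, ts, pa


if __name__ == "__main__":
    read = TSO + "ACGTACGTCC" + "A" * 12
    print(trim_read2(read))
```

In the real BAM records the trimmed bases stay in the sequence and are soft-clipped, as the text above notes; the sketch returns the trimmed sequence only to keep the example small.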


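Genome alignment: Cell Ranger aligns reads with STAR and classifies each read by the region it maps to: exonic (at least 50% of the read overlaps an exon), intronic (not exonic, but intersecting an intron) or intergenic (neither exonic nor intronic).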
Cell Ranger employs the STAR aligner for splicing-aware alignment of reads to the genome. It categorizes reads into exonic, intronic, or intergenic based on their alignment, using the transcript annotation GTF file. A read is classified as exonic if at least 50% of it overlaps with an exon. It is deemed intronic if it does not qualify as exonic but intersects an intron, and it is labeled intergenic if it fits neither of the previous categories.
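
The 50% rule can be written down as a small Python sketch. It assumes a read is a single unspliced interval and that exons and introns are given as non-overlapping half-open intervals on the same chromosome; both are simplifications for illustration, and the function names are hypothetical.

```python
def overlap(start, end, intervals):
    """Bases of the half-open interval [start, end) covered by `intervals`.

    `intervals` is a list of non-overlapping (start, end) pairs.
    """
    return sum(max(0, min(end, e) - max(start, s)) for s, e in intervals)


def classify_read(start, end, exons, introns):
    """Classify a read as 'exonic', 'intronic' or 'intergenic'."""
    length = end - start
    # Exonic: at least 50% of the read overlaps an exon.
    if overlap(start, end, exons) * 2 >= length:
        return "exonic"
    # Intronic: not exonic, but the read intersects an intron.
    if overlap(start, end, introns) > 0:
        return "intronic"
    return "intergenic"


if __name__ == "__main__":
    exons = [(100, 200), (300, 400)]
    introns = [(200, 300)]
    for read in [(150, 250), (180, 280), (500, 600)]:
        print(read, classify_read(*read, exons, introns))
```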
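MAPQ adjustment: for a read that aligns to a single exonic locus and also to one or more non-exonic loci, the exonic locus takes priority. The read is treated as confidently mapped to the exonic locus and is assigned MAPQ 255.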
For reads that align to a single exonic locus but also align to one or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus with MAPQ 255.
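Transcriptome alignment: Cell Ranger further annotates the reads confidently mapped to exonic and intronic regions at the transcript level, classifying them as exonic reads or intronic reads; both count as transcriptomic reads. Antisense reads, which align only to the strand opposite an annotated gene, are filtered out in downstream analysis.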
Cell Ranger further aligns confidently mapped exonic and intronic reads to annotated transcripts by examining their compatibility with the transcriptome. As shown below, reads are classified based on whether they are exonic (light blue) or intronic (red) and whether they are sense or antisense (purple).
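By default, the transcriptomic reads shown in dark blue in the figure above are included in UMI counting. Intronic reads are counted as well, which maximizes sensitivity, for instance for intronic reads arising from unspliced transcripts. With include-introns=false, intronic reads are filtered out and only the exonic reads (light blue) are counted. Note that reads with MAPQ below 255 and reads that map to more than one gene are filtered out and do not contribute to UMI counts.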
Starting in Cell Ranger 7.0, by default, the cellranger count and cellranger multi pipelines will include intronic reads for whole transcriptome gene expression analysis. Any reads that map in the sense orientation to a single gene - the reads labeled transcriptomic (blue) in the diagram above - are carried forward to UMI counting. Cell Ranger ignores antisense reads (purple). As shown above, antisense reads are defined as any read with alignments to an entire gene on the opposite strand and no sense alignments. Consequently, the web summary metrics "Reads Mapped Confidently to Transcriptome" and "Reads Mapped Antisense to Gene" will reflect reads mapped confidently to exonic regions, as well as intronic regions. The default setting that includes intronic reads is recommended to maximize sensitivity, for example in cases where the input to the assay consists of nuclei, as there may be high levels of intronic reads generated by unspliced transcripts.
To exclude intronic reads, include-introns must be set to false. In this case, Cell Ranger still uses transcriptomic (blue) reads with sense alignments (and ignores antisense alignments) for UMI counting, however a read is now classified as antisense if it has any alignments to a transcript exon on the opposite strand and no sense alignments.
Furthermore, a read is considered uniquely mapping if it is compatible with only a single gene. Only uniquely mapping reads are carried forward to UMI counting (see this page for methods to check for multi-mapped reads).
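
The selection rule for UMI counting (sense orientation, a single compatible gene, and optionally exonic only) can be sketched in Python as follows. The representation of a read as a list of (gene, region, strand) tuples is a hypothetical simplification for this example; Cell Ranger works on BAM alignments and transcript annotations.

```python
def select_gene(alignments, include_introns=True):
    """Return the gene a read is counted toward, or None if it is filtered.

    `alignments` is a list of (gene, region, strand) tuples, with region in
    {"exonic", "intronic"} and strand in {"sense", "antisense"}.
    """
    allowed = {"exonic", "intronic"} if include_introns else {"exonic"}
    # Only sense alignments in an allowed region can support a gene.
    genes = {
        gene
        for gene, region, strand in alignments
        if strand == "sense" and region in allowed
    }
    # A read is uniquely mapping only if it is compatible with one gene.
    return next(iter(genes)) if len(genes) == 1 else None


if __name__ == "__main__":
    print(select_gene([("GAPDH", "exonic", "sense")]))
    print(select_gene([("GAPDH", "intronic", "sense")], include_introns=False))
    print(select_gene([("A", "exonic", "sense"), ("B", "exonic", "sense")]))
```

Reads for which the function returns None correspond to the filtered cases described above: antisense-only reads, reads compatible with several genes, and intronic reads when introns are excluded.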
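10x Barcode correction: to decide whether a barcode sequence is correct, Cell Ranger compares it with the known barcodes stored in the barcode whitelist file. Barcodes that are not on the whitelist but lie within one Hamming distance of a whitelist sequence are corrected where possible. The original uncorrected barcode is stored in the CR tag of the BAM file and the corrected barcode in the CB tag.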
To determine whether a barcode sequence is correct, Cell Ranger compares the sequence to the known barcodes for a given assay chemistry, which are stored in a barcode whitelist file.
Cell Ranger uses the following algorithm to correct putative barcode sequences against the whitelist:
- Count the observed frequency of every barcode on the whitelist in the dataset.
- For every observed barcode in the dataset that is not on the whitelist and is at most one Hamming distance away from the whitelist sequences:
  - Compute the posterior probability that the observed barcode did originate from the whitelist barcode but has a sequencing error at the differing base (by base quality score).
  - Replace the observed barcode with the whitelist barcode that has the highest posterior probability (>0.975).
The corrected barcodes are used for all downstream analysis and output files. In the output BAM file, the original uncorrected barcode is encoded in the CR tag, and the corrected barcode sequence is encoded in the CB tag. Reads that cannot be assigned a corrected barcode will not have a CB tag.
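
The correction steps listed above can be sketched in Python. The exact posterior formula is not given in the text, so this example makes two assumptions of its own: the prior of a whitelist barcode is proportional to its observed count plus a pseudocount of 1, and the likelihood of a candidate is the Phred error probability at the differing base. The function name and arguments are hypothetical.

```python
def correct_barcode(observed, quals, whitelist_counts, threshold=0.975):
    """Return the corrected barcode, or None if no confident correction exists.

    observed         -- barcode sequence read from the sequencer
    quals            -- Phred quality score for each base of `observed`
    whitelist_counts -- dict mapping every whitelist barcode to how often
                        it was observed in the dataset (0 if never seen)
    """
    if observed in whitelist_counts:
        return observed  # already a valid barcode (this is the CB value)

    # Enumerate whitelist barcodes at Hamming distance 1 and weight each by
    # prior (count + 1, an assumed pseudocount) times the error probability
    # of the base that would have to be wrong.
    weights = {}
    for pos, base in enumerate(observed):
        p_error = 10 ** (-quals[pos] / 10)
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = observed[:pos] + alt + observed[pos + 1:]
            if candidate in whitelist_counts:
                weights[candidate] = (whitelist_counts[candidate] + 1) * p_error

    total = sum(weights.values())
    if total == 0:
        return None  # no whitelist barcode within one mismatch
    best = max(weights, key=weights.get)
    return best if weights[best] / total > threshold else None


if __name__ == "__main__":
    counts = {"AAAA": 100, "AAAT": 0, "CCCC": 5}
    print(correct_barcode("AAAC", [30, 30, 30, 30], counts))
```

A read for which the function returns None would keep its CR tag but receive no CB tag.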
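UMI counting: Cell Ranger first groups the reads confidently mapped to the transcriptome that share the same barcode, UMI and gene annotation, and then filters these read groups.
In the first filtering step, if two groups have the same barcode and gene and their UMIs differ by a single base, a base substitution error is the likely cause; the two groups are merged and the UMI of the group with more reads is kept. In the second filtering step, if two or more groups have the same barcode and UMI but different gene annotations, only the group with the most reads is kept and the others are discarded; in case of a tie, all of these groups are discarded.
Finally, each remaining combination of barcode, UMI and gene is counted as one UMI, which gives an absolute count of the cell's original mRNA molecules.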
Prior to UMI counting, Cell Ranger attempts to correct sequencing errors within UMI sequences. Reads that are confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene but their UMIs differ by a single base (i.e., are one Hamming distance apart), it implies a probable base substitution error. In this case, the UMI for the group with lesser support is adjusted to match the more prevalent UMI.
Cell Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
After these two filtering steps, each observed barcode, UMI, gene combination is recorded as a UMI count in the unfiltered feature-barcode matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.
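
The two filtering steps can be sketched in Python on a list of (barcode, UMI, gene) tuples, one per confidently mapped read. This is an illustrative simplification, not Cell Ranger's code: corrections are applied in a single pass (a UMI is moved to its best-supported neighbor at Hamming distance 1, without following chains of corrections), and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_umis(reads):
    """Count UMIs per (barcode, gene) from (barcode, umi, gene) read tuples."""
    support = Counter(reads)  # reads supporting each (barcode, umi, gene)

    # Step 1: within each (barcode, gene), merge a UMI into a better
    # supported UMI that differs from it by exactly one base.
    by_bc_gene = defaultdict(dict)
    for (bc, umi, gene), n in support.items():
        by_bc_gene[(bc, gene)][umi] = n
    corrected = Counter()
    for (bc, gene), umis in by_bc_gene.items():
        for umi, n in umis.items():
            better = [u for u, m in umis.items() if m > n and hamming(u, umi) == 1]
            target = max(better, key=umis.get) if better else umi
            corrected[(bc, target, gene)] += n

    # Step 2: for each (barcode, UMI), keep only the gene with the most
    # supporting reads; a tie for the maximum discards all groups.
    by_bc_umi = defaultdict(dict)
    for (bc, umi, gene), n in corrected.items():
        by_bc_umi[(bc, umi)][gene] = n
    counts = Counter()
    for (bc, umi), genes in by_bc_umi.items():
        top = max(genes.values())
        winners = [g for g, n in genes.items() if n == top]
        if len(winners) == 1:
            counts[(bc, winners[0])] += 1
    return counts


if __name__ == "__main__":
    reads = (
        [("BC1", "AAAA", "G1")] * 5 + [("BC1", "AAAT", "G1")]
        + [("BC1", "CCCC", "G1")] * 2 + [("BC1", "CCCC", "G2")] * 2
    )
    print(count_umis(reads))
```

In the demo, AAAT is merged into AAAA (one UMI for G1), while CCCC is discarded because G1 and G2 tie for read support.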
Calling cell barcodes
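The cell-calling step decides, from the number of UMIs under each barcode, which barcodes represent real cells and are carried into downstream analysis. Since Cell Ranger v3.0, cell calling is based on the EmptyDrops algorithm: an initial set of cells is identified first and then refined.
In the first step, a threshold on the total UMI count per barcode identifies the high RNA content cells. The threshold derives from the --expect-cells parameter of cellranger count: with --expect-cells 3000, let m be the 99th percentile of the total UMI counts of the top 3000 barcodes; any barcode whose total UMI count exceeds m/10 is called a cell. This is the OrdMag algorithm. When --expect-cells is not given (Cell Ranger 7.0 and later), the value is estimated by minimizing a loss function built on OrdMag.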
Cell Ranger v3.0 and later have an improved cell-calling algorithm that is better able to identify populations of low RNA content cells, especially when low RNA content cells are mixed into a population of high RNA content cells. For example, tumor samples often contain large tumor cells mixed with smaller tumor infiltrating lymphocytes (TIL) and researchers may be particularly interested in the TIL population.
The algorithm is based on the EmptyDrops method (Lun et al., 2019) and operates in two main phases:
- Initial cell identification: apply a threshold based on total UMI counts per barcode to pinpoint cells, effectively distinguishing the initial set of high RNA content cells.
- Refinement: examine the RNA profiles of the remaining barcodes to differentiate between "empty" and cell-containing partitions, thereby identifying low RNA content cells that may have UMI counts similar to those of empty GEMs.
In the first step, the original Cell Ranger cell calling algorithm is used to identify the primary mode of high RNA content cells, using a cutoff based on the total UMI count for each barcode. Cell Ranger may take as input an expected number of recovered cells (e.g., see --expect-cells for count), N, or as of Cell Ranger 7.0, estimate this number.
The order of magnitude algorithm (OrdMag) estimates the initial number of recovered cells such that barcodes are called cells if total UMI counts exceed m/10, where m is the 99th percentile of top N barcodes based on total UMI counts. The estimation of expect-cells then finds a value x that approximates OrdMag(x) by minimizing a loss function, where expect-cells = argmin_x (OrdMag(x) - x)^2 / x. Cell Ranger's grid-search goes from 2 to ~45k cells. In the example below, the optimized value for expect-cells is 1,137:
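
A compact Python sketch of OrdMag and of the grid search for expect-cells follows. It mirrors the description above but is not Cell Ranger's code: how the 99th percentile of the top N barcodes is picked (here, the barcode at rank round(0.01 * N) from the top) is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right


def ordmag(sorted_totals, expected_cells):
    """Number of barcodes called as cells for a given expected cell count.

    `sorted_totals` holds the total UMI count per barcode in ascending order.
    """
    n_top = min(expected_cells, len(sorted_totals))
    # Assumed percentile rule: skip the top 1% of the expected cells.
    idx = min(int(round(0.01 * expected_cells)), n_top - 1)
    m = sorted_totals[-1 - idx]
    # Barcodes are called cells if their total UMI count exceeds m / 10.
    return len(sorted_totals) - bisect_right(sorted_totals, m / 10)


def estimate_expected_cells(umi_totals, max_cells=45000):
    """Grid search for x minimizing (OrdMag(x) - x)^2 / x, for x >= 2."""
    sorted_totals = sorted(umi_totals)
    grid = range(2, min(len(sorted_totals), max_cells) + 1)
    return min(grid, key=lambda x: (ordmag(sorted_totals, x) - x) ** 2 / x)


if __name__ == "__main__":
    # Toy data: 100 cell-like barcodes and 1000 near-empty barcodes.
    totals = [1000] * 100 + [5] * 1000
    print(ordmag(sorted(totals), 100), estimate_expected_cells(totals))
```

On the toy data every grid value yields OrdMag(x) = 100, so the loss is zero exactly at x = 100, which is the estimate returned.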
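In the second step, barcodes with low UMI counts are re-examined on the basis of their gene expression profiles in order to recover low RNA content cells. First, a set of low-UMI barcodes that likely represent empty GEMs is used to build an expression profile, the background model; Simple Good-Turing smoothing gives this model non-zero estimates for genes that were not observed in the empty-GEM set. The expression profile of each barcode not called as a cell in the first step is then compared with the model, and barcodes whose profile differs strongly from it are also called as cells.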
In the second step, a set of barcodes with low UMI counts that likely represent "empty" GEM partitions is selected. A model of the RNA profile of selected barcodes is created. This model, called the background model, is a multinomial distribution over genes. It uses Simple Good-Turing smoothing to provide a non-zero model estimate for genes that were not observed in the representative empty GEM set. Finally, the RNA profile of each barcode not called as a cell in the first step is compared to the background model. Barcodes whose RNA profile strongly disagrees with the background model are added to the set of positive cell calls. This second step identifies cells that are clearly distinguishable from the profile of empty GEMs, even though they may have much lower RNA content than the largest cells in the experiment.
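
The comparison against the background model can be sketched in Python. This is a simplified stand-in, not the EmptyDrops or Cell Ranger implementation: additive (pseudocount) smoothing is used here in place of Simple Good-Turing smoothing, and the deviation from the background is scored with a Monte Carlo p-value on the multinomial log-likelihood. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter


def background_model(empty_profiles, n_genes, pseudocount=1.0):
    """Multinomial gene probabilities estimated from empty-GEM profiles.

    Each profile is a dict {gene_index: umi_count}. The pseudocount is a
    simple substitute for Simple Good-Turing smoothing.
    """
    totals = [pseudocount] * n_genes
    for profile in empty_profiles:
        for gene, count in profile.items():
            totals[gene] += count
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def log_likelihood(profile, probs):
    """Multinomial log-probability of a profile under the background."""
    n = sum(profile.values())
    ll = math.lgamma(n + 1)
    for gene, count in profile.items():
        ll += count * math.log(probs[gene]) - math.lgamma(count + 1)
    return ll


def empty_p_value(profile, probs, n_sim=1000, seed=0):
    """Monte Carlo p-value that `profile` was drawn from the background."""
    rng = random.Random(seed)
    n = sum(profile.values())
    observed = log_likelihood(profile, probs)
    genes = range(len(probs))
    as_unlikely = 0
    for _ in range(n_sim):
        draw = Counter(rng.choices(genes, weights=probs, k=n))
        if log_likelihood(draw, probs) <= observed:
            as_unlikely += 1
    return (as_unlikely + 1) / (n_sim + 1)


if __name__ == "__main__":
    probs = background_model([{0: 50, 1: 50}] * 10, n_genes=3)
    print(empty_p_value({2: 20}, probs))         # profile unlike the background
    print(empty_p_value({0: 10, 1: 10}, probs))  # profile like the background
```

A barcode with a small p-value has an RNA profile that strongly disagrees with the background and would be added to the cell calls; a real analysis would also correct the p-values for multiple testing.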
Below is an example of a challenging cell calling scenario where 300 high RNA content 293T cells are mixed with 2,000 low RNA content PBMC cells. On the left is the cell calling result with the cell calling algorithm prior to Cell Ranger 3.0 and on the right is the Cell Ranger 3.0 result. You can see that low RNA content cells are successfully identified by the new algorithm.
The plot shows the count of filtered UMIs mapped to each barcode. Barcodes can be determined to be cell-associated based on their UMI count or by their RNA profiles. Therefore some regions of the graph can contain both cell-associated and background-associated barcodes. The color of the graph represents the local density of barcodes that are cell-associated.
When the set of called cells does not match expectations, you can re-run cellranger count with the --force-cells option, or select the cells for analysis from the raw barcode list.
In some cases the set of barcodes called as cells may not match the desired set of barcodes based on visual inspection. This can be remedied by either re-running count or reanalyze with the --force-cells option, or by selecting the desired barcodes from the raw feature-barcode matrix in downstream analysis. Custom barcode selection can also be done by specifying --barcodes to reanalyze.
References
Running Cell Ranger count - Official 10x Genomics Support
Cell Ranger's Gene Expression Algorithm - Official 10x Genomics Support


