Input / output
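The package provides several I/O functions:
- repLoad - loads repertoires in compatible formats.
- repSave - saves changes and writes the repertoire data to a file in a specific format (immunarch or VDJtools).
repLoad detects the input file format automatically. immunarch currently supports the following immune repertoire data formats: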
- "immunarch" - the current software tool, in case you forgot :)
- "immunoseq" - https://www.immunoseq.com
- "mitcr" - MiTCR, software for processing T-cell repertoire sequencing data (GitHub: milaboratory/mitcr)
- "migec" - MIGEC: Molecular Identifier Guided Error Correction pipeline
- "migmap" - MiGMAP, an HTS-compatible wrapper for the IgBlast V-(D)-J mapping tool (GitHub: mikessh/migmap)
- "tcr" - tcR, which is no longer supported and has been superseded by immunarch (https://immunarch.com/)
- "vdjtools" - VDJtools: a framework for post-analysis of repertoire sequencing data
- "imgt" - IMGT/HighV-QUEST
- "airr" - AIRR Data Representations (AIRR Standards 1.4 documentation)
- "10x" - 10x Genomics V(D)J annotations (Single Cell Immune Profiling)
- "archer" - ArcherDX clonotype tables, https://archerdx.com/immunology/
- "imseq" - IMSEQ (IMmunogenetic SEQuence Analysis)
- "rtcr" - https://github.com/uubram/RTCR
- "vidjil" - Vidjil
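To parse IgBLAST results, first process the data with MigMap.
You can load data from a single file, from a list of repertoire file paths, or from a folder containing repertoire files.
Local files
If you have local files, just specify the path to the file or to the folder containing the files, then load the data with repLoad:
# `path` is the path to your file, or to the folder with your files (including the metadata file).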
immdata <- repLoad(path)
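The three input forms mentioned above (a single file, a vector of file paths, a folder) can be sketched as follows. The folder and file names are made up for illustration, and the repLoad calls are left commented out because the placeholder files are empty and would not parse as repertoires.

```r
# Hypothetical folder with three empty placeholder files (names are invented).
repertoire_dir <- file.path(tempdir(), "example_repertoires")
dir.create(repertoire_dir, showWarnings = FALSE)
sample_files <- file.path(repertoire_dir, c("sample1.txt", "sample2.txt", "sample3.txt"))
invisible(file.create(sample_files))

# 1. A single file
single_file <- sample_files[1]
# 2. A character vector of file paths
several_files <- list.files(repertoire_dir, pattern = "\\.txt$", full.names = TRUE)
# 3. The folder itself
whole_folder <- repertoire_dir

# With real repertoire files in place, each of these could be passed to repLoad:
# library(immunarch)
# immdata <- repLoad(single_file)
# immdata <- repLoad(several_files)
# immdata <- repLoad(whole_folder)
```

Building paths with file.path rather than pasting strings avoids mistakes with path separators across operating systems.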
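Using the example files
A folder with example files is available for download (test_data.zip or test_data.tar.gz); download and extract it to test data loading.
If you are not familiar with file paths, you can download the example data into your working directory. You can get the working directory with getwd().
You can also download all the files into a folder inside the working directory and load them all by passing the folder name, in quotes, to repLoad, for example 'example':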
immdata <- repLoad('example')
The example data also ships with the immunarch package. You can load all the example files with the following commands:
#path to the folder with example data
file_path <- paste0(system.file(package="immunarch"), "/extdata/io/")
immdata <- repLoad(file_path)
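In other cases you may want to provide a metadata file and place it in the same folder as the repertoire files. It must be named exactly "metadata.txt".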
# For instance you have a following structure in your folder:
# >_ ls
# immunoseq1.txt
# immunoseq2.txt
# immunoseq3.txt
# metadata.txt
# To load the whole folder with every file in it type:
file_path <- paste0(system.file(package="immunarch"), "/extdata/io/")
immdata <- repLoad(file_path)
print(names(immdata))
# In order to do that your folder must contain metadata file named
# "metadata.txt".
# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"
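# Suppose you do not have "metadata.txt". The result still has a "meta" element
# (as far as I can tell, a placeholder holding only the sample names):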
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"
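A metadata file can be written from R itself. The sketch below assumes that metadata.txt is tab-delimited and that its first column is named Sample and holds the repertoire file names without extension; the Group column, its values and the temporary folder are invented for illustration.

```r
# Hypothetical folder matching the listing above (repertoire files omitted).
data_dir <- file.path(tempdir(), "immunoseq_example")
dir.create(data_dir, showWarnings = FALSE)

# Assumption: first column "Sample" holds file names without extension;
# "Group" is an invented annotation column.
metadata <- data.frame(
  Sample = c("immunoseq1", "immunoseq2", "immunoseq3"),
  Group = c("control", "control", "treated"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file named exactly "metadata.txt".
metadata_path <- file.path(data_dir, "metadata.txt")
write.table(metadata, metadata_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to check the layout.
metadata_check <- read.delim(metadata_path, stringsAsFactors = FALSE)
print(metadata_check)
```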
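Basic data manipulation with dplyr and immunarch
An introduction to dplyr can be found here: https://CRAN.R-project.org/package=dplyr/vignettes/dplyr.html
Get the most abundant clonotypes
The top function returns the most abundant clonotypes of the given repertoire: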
top(immdata$data[[1]])
## # A tibble: 10 × 15
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
## 1 173 0.0204 TGCGCCAGC… CASSQE… TRBV4… TRBD1 TRBJ2… 16 18 26
## 2 163 0.0192 TGCGCCAGC… CASSYR… TRBV4… TRBD1 TRBJ2… 11 13 18
## 3 66 0.00776 TGTGCCACC… CATSTN… TRBV15 TRBD1 TRBJ2… 11 16 22
## 4 54 0.00635 TGTGCCACC… CATSIG… TRBV15 TRBD2 TRBJ2… 11 19 25
## 5 48 0.00565 TGTGCCAGC… CASSPW… TRBV27 TRBD1 TRBJ1… 11 16 23
## 6 48 0.00565 TGCGCCAGC… CASQGD… TRBV4… TRBD1 TRBJ1… 8 13 19
## 7 40 0.00471 TGCGCCAGC… CASSQD… TRBV4… TRBD1 TRBJ2… 16 21 26
## 8 31 0.00365 TGTGCCAGC… CASSEE… TRBV2 TRBD1 TRBJ1… 15 17 20
## 9 30 0.00353 TGCGCCAGC… CASSQP… TRBV4… TRBD1 TRBJ2… 14 23 28
## 10 28 0.00329 TGTGCCAGC… CASSWV… TRBV6… TRBD1 TRBJ2… 12 20 25
## # ℹ 5 more variables: J.start <int>, VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>,
## # Sequence <lgl>
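To make the behaviour concrete, the following base R sketch reproduces the idea on a toy table: sort by Clones in decreasing order and keep the first rows. The toy data and the helper name top_sketch are hypothetical, and this is not immunarch's implementation.

```r
# Invented toy repertoire with only two of the immunarch columns.
toy_repertoire <- data.frame(
  Clones = c(5, 173, 28, 66, 1),
  CDR3.aa = c("CASSLGF", "CASSQEF", "CASSWVF", "CATSTNF", "CASRRDF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order by clone count (largest first), keep n rows.
top_sketch <- function(repertoire, n = 10) {
  ordered <- repertoire[order(repertoire$Clones, decreasing = TRUE), , drop = FALSE]
  head(ordered, n)
}

top3 <- top_sketch(toy_repertoire, n = 3)
print(top3)
```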
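Filter functional / non-functional / in-frame / out-of-frame clonotypes
Conveniently, these functions are vectorised over lists of data frames, so passing immdata$data returns a list of filtered data frames. The example below returns the coding sequences of the first repertoire only: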
coding(immdata$data[[1]])
The other functions work in a similar way:
noncoding(immdata$data[[1]])
nrow(inframes(immdata$data[[1]]))
nrow(outofframes(immdata$data[[1]]))
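The logic behind these filters can be approximated in base R. The sketch assumes that a stop codon is marked with * and a frameshift with ~ in CDR3.aa, and that a sequence is in-frame when the length of CDR3.nt is a multiple of three. These conventions, the toy data and the is_coding / is_inframe helper names are assumptions for illustration; immunarch's own implementation may differ.

```r
# Invented toy repertoire: one coding, one with a stop codon, one frameshifted.
toy_repertoire <- data.frame(
  CDR3.nt = c("TGCGCCAGCAGCTTC", "TGCGCCAGCTGATTC", "TGCGCCAGCAGTTC"),
  CDR3.aa = c("CASSF", "CAS*F", "CAS~SF"),
  stringsAsFactors = FALSE
)

# Assumption: "*" marks a stop codon and "~" a frameshift in CDR3.aa.
is_coding <- function(repertoire) !grepl("[*~]", repertoire$CDR3.aa)
# Assumption: in-frame means the nucleotide length is divisible by 3.
is_inframe <- function(repertoire) nchar(repertoire$CDR3.nt) %% 3 == 0

coding_rows <- toy_repertoire[is_coding(toy_repertoire), ]
noncoding_rows <- toy_repertoire[!is_coding(toy_repertoire), ]
inframe_count <- sum(is_inframe(toy_repertoire))
outofframe_count <- sum(!is_inframe(toy_repertoire))
```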
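Get a subset of clonotypes with a specific V gene
Subsetting a data frame by the values in a given column is simple. In this example, the resulting data frame contains only records with the "TRBV10-1" V gene: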
filter(immdata$data[[1]], V.name == 'TRBV10-1')
## # A tibble: 24 × 15
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
## 1 2 0.000235 TGCGCCAGC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20 25
## 2 2 0.000235 TGCGCCAGC… CASSDG… TRBV1… TRBD1 TRBJ2… 13 15 22
## 3 1 0.000118 TGCGCCAGC… CASSGD… TRBV1… TRBD2 TRBJ2… 8 10 15
## 4 1 0.000118 TGCGCCACC… CATLRS… TRBV1… TRBD1 TRBJ2… 6 7 9
## 5 1 0.000118 TGCGCCAGC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20 22
## 6 1 0.000118 TGCGCCAGC… CASSES… TRBV1… TRBD2 TRBJ2… 16 17 21
## 7 1 0.000118 TGCGCCAGC… CASRAS… TRBV1… TRBD2 TRBJ2… 10 13 21
## 8 1 0.000118 TGCGCCAGC… CASRRD… TRBV1… TRBD1 TRBJ2… 8 13 19
## 9 1 0.000118 TGCGCCAGC… CASSEV… TRBV1… TRBD1 TRBJ2… 14 19 24
## 10 1 0.000118 TGCGCCAGC… CASSEG… TRBV1… TRBD2 TRBJ2… 13 19 27
## # ℹ 14 more rows
## # ℹ 5 more variables: J.start <int>, VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>,
## # Sequence <lgl>
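The same kind of subsetting extends to several genes or to combined conditions. The sketch below uses base R on an invented toy table so that it runs without dplyr; with dplyr loaded, the first condition could be written as filter(toy_repertoire, V.name %in% genes_of_interest).

```r
# Invented toy repertoire with gene annotations.
toy_repertoire <- data.frame(
  Clones = c(2, 1, 7, 3),
  V.name = c("TRBV10-1", "TRBV4-1", "TRBV15", "TRBV10-1"),
  J.name = c("TRBJ2-1", "TRBJ2-7", "TRBJ2-1", "TRBJ1-1"),
  stringsAsFactors = FALSE
)

# Keep clonotypes whose V gene is any of several genes.
genes_of_interest <- c("TRBV10-1", "TRBV15")
by_v_gene <- toy_repertoire[toy_repertoire$V.name %in% genes_of_interest, ]

# Combine a V gene condition with a J gene condition.
by_v_and_j <- subset(toy_repertoire, V.name == "TRBV10-1" & J.name == "TRBJ2-1")
```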
Downsampling
# Downsample with the repSample function
ds <- repSample(immdata$data, "downsample", 100)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1 MS2 MS3 MS4
## 99 95 95 98 89 95 82 100 94 98
## MS5 MS6
## 82 100
ds <- repSample(immdata$data, "sample", .n = 10)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1 MS2 MS3 MS4
## 10 10 10 10 10 10 10 10 10 10
## MS5 MS6
## 10 10
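The row counts below 100 in the first output are easier to read with a sketch of what downsampling does: it draws a fixed number of reads (or UMIs) from the pool of all clones, so several draws can land on the same clonotype and the number of distinct clonotypes can end up smaller than the number of reads drawn. The base R version below illustrates that idea on invented data; the helper name downsample_sketch is hypothetical and this is not repSample's actual code.

```r
# Hypothetical helper: draw n_reads reads without replacement and re-count clonotypes.
downsample_sketch <- function(repertoire, n_reads) {
  # One entry per read, labelled with the row index of its clonotype.
  read_pool <- rep(seq_len(nrow(repertoire)), times = repertoire$Clones)
  drawn <- sample(read_pool, size = n_reads, replace = FALSE)
  new_counts <- tabulate(drawn, nbins = nrow(repertoire))
  kept <- new_counts > 0
  result <- repertoire[kept, , drop = FALSE]
  result$Clones <- new_counts[kept]
  result$Proportion <- result$Clones / sum(result$Clones)
  result
}

set.seed(42)
# Invented toy repertoire with 100 reads spread over 7 clonotypes.
toy_repertoire <- data.frame(
  Clones = c(50, 30, 10, 5, 3, 1, 1),
  CDR3.aa = paste0("CASS", c("A", "D", "E", "G", "L", "P", "Q"), "F"),
  stringsAsFactors = FALSE
)
ds_toy <- downsample_sketch(toy_repertoire, n_reads = 20)
print(ds_toy)
```

By contrast, the "sample" call above picks a fixed number of clonotypes (rows), which is why every repertoire ends up with exactly 10 rows.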
immunarch data format
immunarch comes with its own data format, consisting of tab-delimited columns, which can be specified as follows:
- "Clones" - count or number of barcodes (events, UMIs) or reads;
- "Proportion" - proportion of barcodes (events, UMIs) or reads;
- "CDR3.nt" - CDR3 nucleotide sequence;
- "CDR3.aa" - CDR3 amino acid sequence;
- "V.name" - names of aligned Variable gene segments;
- "D.name" - names of aligned Diversity gene segments or NA;
- "J.name" - names of aligned Joining gene segments;
- "V.end" - last positions of aligned V gene segments (1-based);
- "D.start" - positions of the 5' end of aligned D gene segments (1-based);
- "D.end" - positions of the 3' end of aligned D gene segments (1-based);
- "J.start" - first positions of aligned J gene segments (1-based);
- "VJ.ins" - number of inserted nucleotides (N-nucleotides) at the V-J junction (-1 for receptors with VDJ recombination);
- "VD.ins" - number of inserted nucleotides (N-nucleotides) at the V-D junction (-1 for receptors with VJ recombination);
- "DJ.ins" - number of inserted nucleotides (N-nucleotides) at the D-J junction (-1 for receptors with VJ recombination);
- "Sequence" - full nucleotide sequence.
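As an illustration of this layout, the sketch below builds a one-row table with the fifteen columns and writes it as a tab-delimited file using base R only. All values are invented, the file name is hypothetical, and whether repLoad accepts a file produced this way is not verified here; repSave remains the package's own way to write repertoires.

```r
# One invented clonotype of a VDJ-recombined receptor (hence VJ.ins = -1).
toy_repertoire <- data.frame(
  Clones = 3,
  Proportion = 1,
  CDR3.nt = "TGCGCCAGCAGCCAAGAGTTC",
  CDR3.aa = "CASSQEF",
  V.name = "TRBV4-1",
  D.name = "TRBD1",
  J.name = "TRBJ2-1",
  V.end = 12L, D.start = 14L, D.end = 16L, J.start = 18L,  # 1-based positions
  VJ.ins = -1, VD.ins = 1, DJ.ins = 1,
  Sequence = NA,
  stringsAsFactors = FALSE
)

# Write and re-read the table as tab-delimited text.
format_path <- file.path(tempdir(), "toy_repertoire.tsv")
write.table(toy_repertoire, format_path, sep = "\t", quote = FALSE, row.names = FALSE)
round_trip <- read.delim(format_path, stringsAsFactors = FALSE)
print(names(round_trip))
```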
Source: Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires • immunarch