workLog: 用seqkit快速进行简单的抗体NGS测序数据分析

最新推荐文章于 2024-03-14 18:03:30 发布

lamovrevx

最新推荐文章于 2024-03-14 18:03:30 发布

阅读量1.1k

点赞数 1

分类专栏： worklog

本文链接：https://blog.csdn.net/lamovrevx/article/details/106096091

版权

本文介绍如何利用seqkit工具进行抗体Vregion数据的初步分析，包括序列统计、去除接头、翻译、去重、重复序列分析、CDR3特征统计等步骤，以快速获取基本的抗体测序分析结果。

摘要由CSDN通过智能技术生成

Starting point

抗体Vregion数据双端测序，拼接后有1E7条序列（FLASH拼接，10,760,889），长度在250-490之间。有正向反向序列mix在一起。
目的是做一些基本的分析工作，希望拿到的分析结果包括：

Total number of sequences
#原始seq, clean去头尾翻译之后的可用序列。

Unique sequences number
#seqkit rmdup dupfile

Single occurrence sequences number
#total unique - replicated unique
Repeated sequences
#dup
Highest frequency

Identified VH family
#annotation FR2 align to VH (V…GL…)

Functional sequences number (无frame shift, 无stop codon等的完整正确VHH序列）
#cleaned unique remove stop codon

VHH sequence number
#total unique - VH

CDRH3 Cysteine数目
#annotation CDR3 cys stats
CDRH3 length distribution
#annotation CDR3 length
CDRH3 amino acids composition
#annotation CDR3 stats by aa

$ seqkit stats outM.fa
file     format  type    num_seqs        sum_len  min_len  avg_len  max_len
outM.fa  FASTA   DNA   10,760,889  4,571,253,828      250    424.8      490

怎样分组比对比较合理又省时呢。先用seqkit的sort by sequence 排序。这次的前51.4%是反向的。分了一百组先

$ seqkit sort -s outM.fa -o sortAll.fa
$ seqkit split -p 100 sortAll.fa

挑出第10组做个trim试一下，用cutadapt
5’端是ccatgg, 3’端是gcggccgc. 第10组是反向的，但是不懂为什么 -a -g不能在同一条命令里面去除。

$ cutadapt -g gcggccgc sortAll.part_010.fa -o 10tr.fa
Total reads processed:                 107,609
Reads with adapters:                   107,609 (100.0%)
Reads written (passing filters):       107,609 (100.0%)
Sequence: GCGGCCGC; Type: regular 5'; Length: 8; Trimmed: 107609 times.

处理另一边adapter时候，顺便把<288(96aa)的序列给删掉。

$ cutadapt -a ccatgg  -m 288  10tr.fa -o 10trag.fa
=== Summary ===
Total reads processed:                 107,609
Reads with adapters:                   105,822 (98.3%)
Reads written (passing filters):       107,609 (100.0%)

最低0.47元/天解锁文章

lamovrevx

关注

1
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
workLog: 用seqkit快速进行简单的抗体NGS测序数据分析

Starting point抗体Vregion数据双端测序，拼接后有1E7条序列（FLASH拼接，10,760,889），长度在250-490之间。有正向反向序列mix在一起。目的是做一些基本的分析工作，希望拿到的分析结果包括：Total number of sequences#原始seq, clean去头尾翻译之后的可用序列。Unique sequences number#seqkit rmdup dupfileSingle occurrence sequences number#tota
复制链接

扫一扫