Starting point
抗体Vregion数据双端测序,拼接后有1E7条序列(FLASH拼接,10,760,889),长度在250-490之间。有正向反向序列mix在一起。
目的是做一些基本的分析工作,希望拿到的分析结果包括:
Total number of sequences
#原始seq, clean去头尾翻译之后的可用序列。
Unique sequences number
#seqkit rmdup dupfile
Single occurrence sequences number
#total unique - replicated unique
Repeated sequences
#dup
Highest frequency
Identified VH family
#annotation FR2 align to VH (V…GL…)
Functional sequences number (无frame shift, 无stop codon等的完整正确VHH序列)
#cleaned unique remove stop codon
VHH sequence number
#total unique - VH
CDRH3 Cysteine数目
#annotation CDR3 cys stats
CDRH3 length distribution
#annotation CDR3 length
CDRH3 amino acids composition
#annotation CDR3 stats by aa
$ seqkit stats outM.fa
file format type num_seqs sum_len min_len avg_len max_len
outM.fa FASTA DNA 10,760,889 4,571,253,828 250 424.8 490
怎样分组比对比较合理又省时呢。先用seqkit的sort by sequence 排序。这次的前51.4%是反向的。分了一百组先
$ seqkit sort -s outM.fa -o sortAll.fa
$ seqkit split -p 100 sortAll.fa
挑出第10组做个trim试一下,用cutadapt
5’端是ccatgg, 3’端是gcggccgc. 第10组是反向的,但是不懂为什么 -a -g不能在同一条命令里面去除。
$ cutadapt -g gcggccgc sortAll.part_010.fa -o 10tr.fa
Total reads processed: 107,609
Reads with adapters: 107,609 (100.0%)
Reads written (passing filters): 107,609 (100.0%)
Sequence: GCGGCCGC; Type: regular 5'; Length: 8; Trimmed: 107609 times.
处理另一边adapter时候,顺便把<288(96aa)的序列给删掉。
$ cutadapt -a ccatgg -m 288 10tr.fa -o 10trag.fa
=== Summary ===
Total reads processed: 107,609
Reads with adapters: 105,822 (98.3%)
Reads written (passing filters): 107,609 (100.0%)
Total basepairs processed