转录组Unigene 筛选ORF，翻译成蛋白质

生信bug日记

已于 2024-01-28 15:36:13 修改

阅读量546

点赞数 3

文章标签： linux

于 2024-01-27 21:20:11 首次发布

本文链接：https://blog.csdn.net/weixin_46070445/article/details/135887971

版权

使用软件TransDecoder

conda安装TransDecoder，也可在github上直接获取Releases · TransDecoder/TransDecoder (github.com)使用

conda activate python3
conda install -y -c bioconda/label/cf201901 transdecoder
#下载速度比较慢，下好后，如果直接运行TransDecoder.Predict，会显示command not found。

#在whereis transdecoder时，会提醒没有安装URI/Escape模块，使用Perl安装一下
###启动 Perl CPAN

perl -MCPAN -e shell

###安装 Perl URI/Escape 模块:

install URI::Escape

### 退出：
q

使用 TransDecoder

# Step 1: 提取最长的开放阅读框

        TransDecoder.LongOrfs -t Demo.Unigene.fa

        #默认情况下，TransDecoder.LongOrfs将识别长度至少为100个氨基酸的开放阅读框。你可以通过-m参数来降低这个值，但是要知道随着最小长度的变短，ORF预测的假阳性率迅速增长。

    # Step 2: (可选)

        #可选地，可以通过blast或者pfam搜索已知蛋白的同源序列来识别ORF。

    # Step 3: 预测可能的编码区

        TransDecoder.Predict -t Demo.Unigene.fa

       #候选编码区的最终集合可以在文件.transdecoder中找到。扩展名包括.pep，.cds，.gff3和.bed。

通过blast或者pfam搜索已知蛋白的同源序列来识别ORF

#安装hmmer
conda install -y hmmer
hmmbuild -h

#获取比较pfam库
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz 

#解压缩 
gunzip Pfam-A.hmm.gz

#得到 PFAM 数据库的 HMM 文件。HMM 文件是文本文件，需要将其变成二进制格式，以加快运算速度，同时进行压缩

# 建立索引数据库
hmmpress Pfam-A.hmm

#数据进行索引比对,具体命令可参考hmmscan -h
hmmscan --domtblout pfam.domtblout Pfam-A.hmm Demo.Unigene.fa.transdecoder.pep


#安装blast
conda install -c bioconda blast
#下载swissprot索引库
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

gunzip uniprot_sprot.fasta.gz
#建立索引库
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_db -parse_seqids

#数据进行索引比对
blastp -query Demo.Unigene.fa.transdecoder.pep -db uniprot_db -max_target_seqs 1 -outfmt
6 -evalue 1e-5 -num_threads 10 > blastp.outfmt6


#整合比对信息保留匹配的序列
TransDecoder.Predict -t Demo.Unigene.fa --retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6

生信bug日记

关注

3
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫