cd-hit: removing redundant sequences

CAAS_IFR_zp

Last modified: 2024-01-04 21:09:03

Tags: python conda ubuntu

First published: 2023-04-18 15:24:25

Article link: https://blog.csdn.net/m0_53945548/article/details/130214667

conda install -c bioconda cd-hit
# OR: download the source release and build it
wget https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz
tar -xvzf cd-hit-v4.8.1-2019-0228.tar.gz
cd cd-hit-v4.8.1-2019-0228/
make

Installing with conda into the base environment is the most convenient option.
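
After installing, it can be worth confirming that the programs are actually on the PATH (after a source build they stay in the build directory unless you add it). A minimal sketch, assuming a bash-like shell; `have_tool` is a hypothetical helper name, not part of cd-hit:

```bash
#!/usr/bin/env bash
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the two main cd-hit programs can be found.
for tool in cd-hit cd-hit-est; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found, check the conda environment or the build directory"
    fi
done
```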

Programs in the cd-hit package

* cd-hit 		Cluster peptide sequences
* cd-hit-est 		Cluster nucleotide sequences
* cd-hit-2d 		Compare 2 peptide databases
* cd-hit-est-2d 	Compare 2 nucleotide databases
* psi-cd-hit 		Cluster proteins at <40% cutoff
* cd-hit-lap 		Identify overlapping reads
* cd-hit-dup 		Identify duplicates from single or paired Illumina reads
* cd-hit-454 		Identify duplicates from 454 reads
* cd-hit-otu 		Cluster rRNA tags
* cd-hit-para 		Cluster sequences in parallel on a computer cluster
* h-cd-hit 		Hierarchical clustering

Parameters

   -i	input filename in fasta format, required
   # Input file, required. It must be in FASTA format: a header line starting with '>' followed by the sequence, repeated for each record.
   
   -o	output filename, required
   # Output file name, required.
   
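   -c	sequence identity threshold, default 0.9
   # Identity threshold, default 0.9. -c has to be chosen together with the word length -n: for cd-hit (proteins) the word length ranges from 5 down to 2, and the lowest usable -c is 0.4. A smaller -c merges more sequences, so pick a smaller value if you want fewer clusters.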

   -M	memory limit (in MB) for the program, default 800; 0 for unlimited;
   # Memory limit for the program.
   
   -T	number of threads, default 1; with 0, all CPUs will be used
   # Number of threads.
   
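   -n	word_length, default 5, see user's guide for choosing it
   # Word length: the number of adjacent bases or amino acids in the short words that two sequences must share when they are compared. Longer words go with higher identity thresholds. The default is 5; the minimum is 2, which goes with the lowest threshold, -c 0.4.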
   
   -l	length of throw_away_sequences, default 10
   # Sequences of this length or shorter are discarded. This is handy because BLAST sometimes produces very short sequences.
 	
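   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
 	# The minimum length a shorter sequence must have, as a fraction of the representative's length, to join the cluster. The default 0.0 imposes no limit.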
 	
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
 	# Same as -s, but given as an absolute number of residues that the difference must not exceed.

   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first 
 	cluster that meets the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meets the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
 	# Controls whether a sequence is simply assigned to the first representative it matches. With 1, it is also compared against the other representatives and placed in the closest cluster, which gives better clustering but takes longer. Default 0.
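
The link between -c and -n can be written down as a small lookup. The following is only a sketch, assuming the ranges given in the cd-hit user's guide for the protein program (5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, 2 for 0.4 to 0.5). `protein_word_length` is a hypothetical helper name, and cd-hit-est uses a different table with longer words.

```bash
#!/usr/bin/env bash
# protein_word_length is a hypothetical helper: suggests a -n value for
# cd-hit (protein) from the -c identity threshold passed as the first argument.
protein_word_length() {
    awk -v c="$1" 'BEGIN {
        if (c > 1.0 || c < 0.4) { print "unsupported"; exit }
        if (c >= 0.7) print 5
        else if (c >= 0.6) print 4
        else if (c >= 0.5) print 3
        else print 2
    }'
}

protein_word_length 0.9    # prints 5
protein_word_length 0.45   # prints 2
```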

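The two length cutoffs, -s and -S, are easy to mix up, so here is a small worked check. It is a sketch that only mirrors the wording of the help text (length at least s times the representative, and a length difference of at most S); `passes_length_cutoffs` is a hypothetical helper, not something cd-hit provides.

```bash
#!/usr/bin/env bash
# passes_length_cutoffs is a hypothetical helper illustrating -s and -S.
# Arguments: representative length, candidate length, -s value, -S value.
# Prints "yes" if the candidate satisfies both cutoffs, otherwise "no".
passes_length_cutoffs() {
    awk -v rep="$1" -v len="$2" -v s="$3" -v S="$4" 'BEGIN {
        ratio_ok = (len >= s * rep)   # -s: at least s times the representative length
        diff_ok  = (rep - len <= S)   # -S: absolute length difference at most S
        result = (ratio_ok && diff_ok) ? "yes" : "no"
        print result
    }'
}

passes_length_cutoffs 200 185 0.9 999999   # prints yes (185 >= 180)
passes_length_cutoffs 200 170 0.9 999999   # prints no  (170 <  180)
passes_length_cutoffs 200 150 0.0 60       # prints yes (difference 50)
passes_length_cutoffs 200 130 0.0 60       # prints no  (difference 70)
```
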
Basic usage

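Since -i expects FASTA, a quick sanity check of the input before a long run can save time. A minimal sketch, assuming only that every record starts with a '>' header line; `fasta_summary` and the sample content are made up for illustration:

```bash
#!/usr/bin/env bash
# fasta_summary is a hypothetical helper: counts header lines (starting with ">")
# and flags sequence data that appears before the first header.
fasta_summary() {
    awk '
        /^>/ { headers++; next }
        NF > 0 && headers == 0 { orphan = 1 }
        END {
            if (orphan || headers == 0) print "not valid FASTA"
            else print headers " records"
        }
    ' "$1"
}

# Small hand-written example file, for illustration only.
sample=$(mktemp)
printf '>seq1 first example\nACGTACGT\nACGT\n>seq2\nTTGGCCAA\n' > "$sample"
fasta_summary "$sample"   # prints "2 records"
```
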
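# -i and -o are required; input.fasta and output.fasta are placeholder names
cd-hit-est -i input.fasta -o output.fasta -n 9 -g 1 -c 0.95 -G 0 -M 0 -d 0 -aS 0.9 -T 0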
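
A run writes two files: the representative sequences under the name given to -o, and a cluster listing with .clstr appended to that name. Below is a minimal sketch for inspecting the listing, assuming the usual .clstr layout (one `>Cluster N` line per cluster, with the representative marked by a trailing `*`). The helper names and the sample content are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a cd-hit .clstr file.

# Count the clusters (one ">Cluster N" line each).
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Print the name of each cluster representative (the member line ending in "*").
list_representatives() {
    sed -E -n 's/^[0-9]+[[:space:]]+[0-9]+(nt|aa), >(.*)\.\.\. \*$/\2/p' "$1"
}

# Hand-written sample in the .clstr layout, for illustration only.
sample=$(mktemp)
printf '>Cluster 0\n0\t1500nt, >seqA... *\n1\t1480nt, >seqB... at +/98.50%%\n>Cluster 1\n0\t900nt, >seqC... *\n' > "$sample"

count_clusters "$sample"         # prints 2
list_representatives "$sample"   # prints seqA and seqC, one per line
```

On real output, the same helpers would be pointed at the .clstr file next to whatever name was passed to -o.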