cd-hit: removing redundant sequences

conda install -c bioconda cd-hit
#OR
wget https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz
tar -xvzf cd-hit-v4.8.1-2019-0228.tar.gz
cd cd-hit-v4.8.1-2019-0228/
make

Installing with conda in the base environment is the more convenient option.
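
To confirm that either installation route worked, it can help to check that the binaries are reachable from the shell. The following is a minimal sketch assuming a POSIX shell; `have_tool` is a hypothetical helper name, not something cd-hit provides. After a source build the binaries sit in the build directory, so that directory may need to be added to PATH first.

```shell
# Hypothetical helper: succeed if the named command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of the two most commonly used programs.
for tool in cd-hit cd-hit-est; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```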

Programs in the cd-hit package

* cd-hit 		Cluster peptide sequences
* cd-hit-est 		Cluster nucleotide sequences
* cd-hit-2d 		Compare 2 peptide databases
* cd-hit-est-2d 	Compare 2 nucleotide databases
* psi-cd-hit 		Cluster proteins at <40% cutoff
* cd-hit-lap 		Identify overlapping reads
* cd-hit-dup 		Identify duplicates from single or paired Illumina reads
* cd-hit-454 		Identify duplicates from 454 reads
* cd-hit-otu 		Cluster rRNA tags
* cd-hit-para 		Cluster sequences in parallel on a computer cluster
* h-cd-hit 		Hierarchical clustering
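
For plain clustering, the choice between the first two programs comes down to the sequence type: cd-hit for peptides, cd-hit-est for nucleotides. The sketch below guesses the type of a FASTA file with a simple heuristic. Both the helper name `pick_cdhit_program` and the heuristic itself (only A, C, G, T, U, N means nucleotide) are assumptions for illustration, not part of cd-hit.

```shell
# Hypothetical helper: guess which cd-hit program suits a FASTA file.
# Assumed heuristic: if the sequence lines contain only A, C, G, T, U or N
# (in either case), treat the file as nucleotide; otherwise as protein.
pick_cdhit_program() {
    fasta="$1"
    # Drop header lines, strip whitespace, uppercase, then look for any
    # character outside the nucleotide alphabet.
    if grep -v '^>' "$fasta" | tr -d ' \t\r\n' | tr 'a-z' 'A-Z' \
            | grep -q '[^ACGTUN]'; then
        echo "cd-hit"
    else
        echo "cd-hit-est"
    fi
}
```

Calling `pick_cdhit_program my_sequences.fasta` (a placeholder file name) prints the suggested program name. Files with ambiguity codes or gap characters would be classified as protein by this heuristic, so the result is worth a manual check.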

Parameters

  -i	input filename in fasta format, required
   # Input file (required), in FASTA format: a header line starting with '>' followed by its sequence line(s), repeated for each record.
   
   -o	output filename, required
   # Output file name (required).
   
   -c	sequence identity threshold, default 0.9
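 	# Sequence identity threshold, default 0.9. -c must be adjusted together with the word length (-n):
 	# for protein cd-hit the word length ranges from 5 down to 2, and the lowest usable -c is 0.4.
 	# The fewer clusters you want, the lower the -c value to choose.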

   -M	memory limit (in MB) for the program, default 800; 0 for unlimited;
   # Memory the program is allowed to use.
   
   -T	number of threads, default 1; with 0, all CPUs will be used
   # Number of threads.
   
   -n	word_length, default 5, see user's guide for choosing it
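   # Word length: the number of adjacent bases or amino acids in the short words that two sequences
   # must share when they are compared. Longer words go with higher identity thresholds.
   # The default is 5 and the minimum is 2; at 2 the identity threshold can be set as low as 0.4.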
   
   -l	length of throw_away_sequences, default 10
   # Length threshold for sequences to throw away. Useful when BLAST output includes very short sequences.
 	
   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
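 	# The minimum length a shorter sequence must reach, as a fraction of the representative's length,
 	# to join its cluster. The default is 0.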
 	
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
 	# Same as -s, but expressed as the absolute number of residues the length difference must not exceed.

   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first 
 	cluster that meet the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meet the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
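 	# Controls whether a sequence is assigned to the first representative it matches. With 1, it is also
 	# compared against the other representatives to find the closest cluster, which gives better
 	# clustering but takes longer. The default is 0.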
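
Because -c and -n have to move together, a small lookup can save a trip to the manual. The sketch below encodes the ranges given in the cd-hit user's guide for the protein program (5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, 2 for 0.4 to 0.5). The helper name `protein_word_length` is hypothetical, and cd-hit-est uses a different table that is not covered here.

```shell
# Hypothetical helper: suggest a -n value for protein cd-hit from a -c value.
# Ranges follow the cd-hit user's guide for the protein program only.
protein_word_length() {
    awk -v c="$1" 'BEGIN {
        if (c > 1.0 || c < 0.4) {
            # Outside the range protein cd-hit supports.
            print "unsupported threshold: " c > "/dev/stderr"
            exit 1
        }
        if      (c >= 0.7) n = 5
        else if (c >= 0.6) n = 4
        else if (c >= 0.5) n = 3
        else               n = 2
        print n
    }'
}
```

For example, `protein_word_length 0.65` suggests a word length for a 65% identity run; values below 0.4 are rejected with a non-zero exit status.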

Basic usage

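# input.fasta and output.fasta are placeholder file names; -i and -o are required.
cd-hit-est -i input.fasta -o output.fasta -n 9 -g 1 -c 0.95 -G 0 -M 0 -d 0 -aS 0.9 -T 0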
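
Alongside the representative sequences, cd-hit writes a cluster listing to a file named after the -o value with `.clstr` appended, in which each cluster begins with a `>Cluster N` line followed by one line per member. A quick way to see how much redundancy was removed is to summarize that file. The following is a sketch; the helper names `count_clusters` and `cluster_sizes` are hypothetical.

```shell
# Hypothetical helper: number of clusters in a .clstr file.
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Hypothetical helper: print "cluster_id<TAB>member_count" for each cluster.
cluster_sizes() {
    awk '
        /^>Cluster/ {
            # A new cluster starts: flush the previous one, if any.
            if (seen) print id "\t" n
            id = $2; n = 0; seen = 1
            next
        }
        NF > 0 { n++ }                      # every non-empty line is a member
        END { if (seen) print id "\t" n }   # flush the last cluster
    ' "$1"
}
```

With a placeholder output name such as `my_output`, `count_clusters my_output.clstr` gives the number of representatives kept, and `cluster_sizes my_output.clstr | sort -k2,2nr | head` lists the largest clusters first.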