要求:
大约10000个细菌基因组序列,每个fasta文件1个基因组,可能有多个contig或者单个contig。要求在每个基因组上随机截取2000bp的片段,每个基因组截取10段。
代码
- 联合使用grep,awk,和awk的随机化种子rand
#!/bin/bash
# this code used for resample seg from a genome contigs files
# $1:genome id containing many contigs fasta files
# $2:length of fragement to get
line=`grep -c ">" $1.fna`
if [ $line -ge 10 ]
then
echo $1_$line >> line10
grep ">" $1.fna|shuf -n 10|sed 's/>//'|awk '{system("grep -A 1 "$1" ""'"$1"'"".fna")}'|awk -v len=$2 'BEGIN{srand();}{if(NR%2==1){print $1"_""'"$1"'"} else if (NR%2==0) {start=int((length($0)-len)*rand());print substr($0,start,len);}}' >$1_2k.fa;
elif [ $line -lt 10 ]&&[ $line -gt 0 ]
then
echo $1_$line >> lin