Chapter12 Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
We’ll see how to write rerunnable Bash shell scripts, automate fileprocessing tasks with find and xargs, run pipelines in parallel, and see a simple makefile.
Basic Bash Scripting
Writing and Running Robust Bash Scripts
1. A robust Bash header
#!/bin/bash
set -e
set -u
set -o pipefail
(1) This is called the shebang, and it indicates the path to the interpreter used to execute this script.
(2) 若某一行shell命令error,shell默认仍会执行完整个命令。set -e 使得error时直接终结整个命令,但某些可能不会有影响的error,如test -d file.txt will return a nonzero exit status if its argument is not a directory,却不会终结。
(3) 若某一命令含有未定义路径,shell仍会执行,如 rm -rf $TEMP_DIR/. If the shell variable $TEMP_DIR isn’t set, Bash will still substitute its value (which is nothing) in place of it. The end result is rm -rf /。使用set -u阻止类似命令的执行。
(4) 即使设置了set -e ,但在pipe中也只有最后一个program出现error时,才会导致程序跳出。set -o pipefail使得pipe 中任一program error都会终止程序。
2. Running Bash scripts
two ways:
bash script.sh: can run any script (as long as it has read permissions) with bash script.sh
./script.sh: might use interpreters other than /bin/bash (e.g., zsh, csh, etc.)
executable permissions:
$ chmod u+x script.sh
Variables and Command Arguments
$ results_dir="results/"
$ echo $results_dir
results/
${variables}
sample="CNTRL01A"
mkdir ${sample}_aln/
Quoting variables: prevents commands from interpreting any spaces or other special characters
sample="CNTRL01A"
mkdir "${sample}_aln/"
1. Command-line arguments
#!/bin/bash
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
$ bash args.sh arg1 arg2 arg3
script name: args.sh
first arg: arg1
second arg: arg2
third arg: arg3
Bash assigns the number of command-line arguments to the variable $# (this does not count the script name, $0, as an argument). This is useful for user-friendly messages:
#!/bin/bash
if [ "$#" -lt 3 ] # are there less than 3 arguments?
then
echo "error: too few arguments, you provided $#, 3 required"
echo "usage: script.sh arg1 arg2 arg3"
exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
Running this with too few arguments gives an error (and leads the process to exit with a nonzero exit status)
$ ./script.sh some_arg
error: too few arguments, you provided 1, 3 required
usage: script.sh arg1 arg2 arg3
Conditionals in a Bash Script: if Statements
Bash a bit unique is that a command’s exit status provides the true and false (remember: contrary to other languages, 0 represents true/success and anything else is false/failure)
if [commands]
then
[if-statements]
else
[else-statements]
fi
中文详解:https://blog.csdn.net/zhan570556752/article/details/80399154
[commands] is a placeholder for any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then; otherwise execution continues to the block after else. The then keyword can be placed on the same line as if, but then a semicolon is required: if [commands]; then.
e.g.:
#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
# commands to run if "pattern" is found
echo "found 'pattern' in 'some_file.txt'"
fi
#!/bin/bash
if grep "pattern" file_1.txt > /dev/null &&
grep "pattern" file_2.txt > /dev/null
then
echo "found 'pattern' in 'file_1.txt' and in 'file_2.txt'"
fi
$ test "ATG" = "ATG" ; echo "$?"
0 $
test "ATG" = "atg" ; echo "$?"
1 $
test 3 -lt 1 ; echo "$?"
1 $
test 3 -le 3 ; echo "$?"
0
$ test -d some_directory ; echo $? # is this a directory?
0 $
test -f some_file.txt ; echo $? # is this a file?
0 $
test -r some_file.txt ; echo $? $ is this file readable?
0 $
test -w some_file.txt ; echo $? $ is this file writable?
1
Combining test with if
if test -f some_file.txt
then
[...]
fi
Bash provides a simpler syntactic alternative to test statements: [ -f some_file.txt ] . Note the spaces around and within the brackets—these are required. This makes for much simpler if statements involving comparisons:
if [ -f some_file.txt ]
then
[...]
fi
When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, ! as negation, and parentheses to group statements. Our familiar && and || operators won’t work in test, because these are shell operators.
e.g.Ensure our script has enough arguments and that the input file is readable:
#!/bin/bash
set -e
set -u
set -o pipefail
if [ "$#" -ne 1 -o ! -r "$1" ]
then
echo "usage: script.sh file_in.txt"
exit 1
fi
Processing Files with Bash Using for Loops and Globbing
There are three essential parts to creating a pipeline to process a set of files:
• Selecting which files to apply the commands to
• Looping over the data and applying the commands
• Keeping track of the names of any output files created
$ cat samples.txt
zmaysA R1 seq/zmaysA_R1.fastq
zmaysA R2 seq/zmaysA_R2.fastq
zmaysB R1 seq/zmaysB_R1.fastq
zmaysB R2 seq/zmaysB_R2.fastq
zmaysC R1 seq/zmaysC_R1.fastq
zmaysC R2 seq/zmaysC_R2.fastq
e.g.
Suppose that we want to loop over every file, gather quality statistics on each and every file (using the imaginary program fastq_stat), and save this information to an output file. Each output file should have a name based on the input file, so if a summary file indicates something is wrong we know which file was affected. There’s a lot of little parts to this, so we’re going to step through how to do this a piece at a time learning about Bash arrays, basename, and a few other shell tricks along the way.
(1) Load our filenames into a Bash array, which we can then loop over
$ sample_names=(zmaysA zmaysB zmaysC)
$ echo ${sample_names[0]}
zmaysA
$ echo ${sample_names[2]}
zmaysC
All elements are extracted with the cryptic-looking ${sample_files[@]}:
$ echo ${sample_names[@]}
zmaysA zmaysB zmaysC
Access how many elements are in the array (and each element’s index):
$ echo ${#sample_names[@]}
3
$ echo ${!sample_names[@]}
0 1 2
从samples.txt中提取第三列构建bash array
Extract the third column using cut -f 3 from samples.txt.
$ sample_files=($(cut -f 3 samples.txt))
$ echo ${sample_files[@]}
seq/zmaysA_R1.fastq seq/zmaysA_R2.fastq seq/zmaysB_R1.fastq
seq/zmaysB_R2.fastq seq/zmaysC_R1.fastq seq/zmaysC_R2.fastq
Leaving us with the most basic filename
$ basename seqs/zmaysA_R1.fastq
zmaysA_R1.fastq
Basename can also strip a suffix (e.g., extension) provided as the second argument from a filename (or alternatively using the argument -s):
$ basename seqs/zmaysA_R1.fastq .fastq
zmaysA_R1
$ basename -s .fastq seqs/zmaysA_R1.fastq
zmaysA_R1
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt
# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))
for fastq_file in ${sample_files[@]}
do
# strip .fastq from each FASTQ file, and add suffix
# "-stats.txt" to create an output filename for each FASTQ file
results_file="$(basename $fastq_file .fastq)-stats.txt"
# run fastq_stat on a file, writing results to the filename we've
# above
fastq_stat $fastq_file > stats/$results_file
done
e.g.
Our loop must iterate over unique sample names, and we use these sample names to re-create the input FASTQ files used in alignment. This will be clearer in an example; suppose that we use the aligner BWA and our genome reference is named zmays_AGPv3.20.fa:
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt
# our reference
reference=zmays_AGPv3.20.fa
# create a Bash array from the first column, which are
# sample names. Because there are duplicate sample names
# (one for each read pair), we call uniq
sample_names=($(cut -f 1 "$sample_info" | uniq))
for sample in ${sample_names[@]}
do
# create an output file from the sample name
results_file="${sample}.sam"
bwa mem $reference ${sample}_R1.fastq ${sample}_R2.fastq \
> $results_file
done
Automating File-Processing with fnd and xargs
Using fnd and xargs
Finding Files with fnd
Unlike ls, find is recursive (it will search for matching files in subdirectories, and subdirectories of subdirectories, etc.)
A quick way to see a project directory’s structure:
$ find zmays-snps
zmays-snps
zmays-snps/analysis
zmays-snps/data
zmays-snps/data/seqs
zmays-snps/data/seqs/zmaysA_R1.fastq
zmays-snps/data/seqs/zmaysA_R2.fastq
zmays-snps/data/seqs/zmaysB_R1.fastq
zmays-snps/data/seqs/zmaysB_R2.fastq
zmays-snps/data/seqs/zmaysC_R1.fastq
zmays-snps/data/seqs/zmaysC_R2.fastq
zmays-snps/scripts
-maxdepth: 限制find的目录深度——当前目录:1; 第一子目录: 2
-name: 寻找文件名为**的文件
$ find data/seqs -name "zmaysB*fastq"
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq
find’s Expressions
-type: 指定类型(e.g., files, directories, named pipes, links, etc.)
$ find data/seqs -name "zmaysB*fastq" -type f
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq
logical AND/OR:
$ find data/seqs -name "zmaysB*fastq" -and -type f
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq
$ find data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq" -type f
data/seqs/zmaysA_R1.fastq
data/seqs/zmaysA_R2.fastq
data/seqs/zmaysC_R1.fastq
data/seqs/zmaysC_R2.fastq
$ find seqs -type f "!" -name "zmaysC*fastq"
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq
$ touch seqs/zmaysB_R1-temp.fastq
$ find seqs -type f "!" -name "zmaysC*fastq"
seqs/zmaysB_R1-temp.fastq
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq
不包含seqs/zmaysB_R1-temp.fastq:
$ find seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq
Note that find’s operators like !, (, and ) should be quoted
find’s -exec: Running Commands on find’s Results
$ find . -name "*-temp.fastq" -exec rm -i {} \;
remove ./zmaysA_R1-temp.fastq? y
remove ./zmaysA_R2-temp.fastq? y
remove ./zmaysB_R1-temp.fastq? y
remove ./zmaysC_R1-temp.fastq? y
remove ./zmaysC_R2-temp.fastq? y
-exec: 识别语句末尾的;再运行
rm -i: 删除之前询问
xargs: A Unix Powertool
xargs allows us to take input passed to it from standard in, and use this input as arguments to another program, which allows us to build commands programmatically from values received through standard in.
xargs works by taking input from standard in and splitting it by spaces, tabs, and newlines into arguments. Then, these arguments are passed to the command supplied. For example, to emulate the behavior of find -exec with rm, we use xargs with rm:
$ find . -name "*-temp.fastq"
./zmaysA_R1-temp.fastq
./zmaysA_R2-temp.fastq
./zmaysC_R1-temp.fastq
./zmaysC_R2-temp.fastq
$ find . -name "*-temp.fastq" | xargs rm
An example of how to run find and xargs using the null byte delimiter: 用于文件名中存在空格
$ find . -name "samples [AB].txt" -print0 | xargs -0 rm
-n:设置一次传递参数的个数:
$ find . -name "*-temp.fastq" | xargs -n 1 rm
$ find . -name "*-temp.fastq" > files-to-delete.txt
$ cat files-to-delete.txt
./zmaysA_R1-temp.fastq
./zmaysA_R2-temp.fastq
./zmaysC_R1-temp.fastq
./zmaysC_R2-temp.fastq
$ cat files-to-delete.txt | xargs rm
remember, xargs’s behavior is very simple: it just places arguments after the command you provide
$ find . -name "*-temp.fastq" | xargs -n 1 echo "rm -i" > delete-temp.sh
$ cat delete-temp.sh
rm -i ./zmaysA_R1-temp.fastq
rm -i ./zmaysA_R2-temp.fastq
rm -i ./zmaysC_R1-temp.fastq
rm -i ./zmaysC_R2-temp.fastq
run this simple script with:
$ bash delete-temp.sh
remove ./zmaysA_R1-temp.fastq? y
remove ./zmaysA_R2-temp.fastq? y
remove ./zmaysC_R1-temp.fastq? y
remove ./zmaysC_R2-temp.fastq? y
Using xargs with Replacement Strings to Apply Commands to Files
xargs -I 和占位符**{}**
e.g. we want our output filenames to be paired with our input filenames and have corresponding
names.
$ find . -name "*.fastq" | xargs basename -s ".fastq"
zmaysA_R1
zmaysA_R2
zmaysB_R1
zmaysB_R2
zmaysC_R1
zmaysC_R2
$ find . -name "*.fastq" | xargs basename -s ".fastq" | \
xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
xargs and Parallelization
xargs is that it can launch a limited number of processes in parallel
-P < num >
$ find . -name "*.fastq" | xargs basename -s ".fastq" | \
xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
xargs, Pipes, and Redirects
不能在xargs后使用管道符和重定向:shell process that reads your xargs command will interpret pipes and redirects as what to do with xarg’s output, not as part of the command run by xargs
解决方法:create a small Bash script containing the commands to process a single sample, and have xargs run this script in many parallel Bash processes.
e.g.
#!/bin/bash
set -e
set -u
set -o pipefail
sample_name=$(basename -s ".fastq" "$1")
some_program ${sample_name}.fastq | another_program >
${sample_name}-results.txt
Then, run this with:
$ find . -name "*.fastq" | xargs -n 1 -P 4 bash script.sh
Make and Makefles: Another Option for Pipelines
Another very powerful tool used to create bioinformatics pipelines