Bioinformatics Data Skills by Oreilly学习笔记-12

最新推荐文章于 2024-05-20 09:36:30 发布

weixin_42953727

最新推荐文章于 2024-05-20 09:36:30 发布

阅读量364

点赞数 1

分类专栏： bioinformatics 文章标签： Bioinformatics

本文链接：https://blog.csdn.net/weixin_42953727/article/details/100802214

版权

bioinformatics 专栏收录该内容

27 篇文章 5 订阅

订阅专栏

本文详细介绍了如何编写可重复运行的Bash shell脚本，利用find和xargs自动化文件处理任务，以及并行化任务。讨论了Bash脚本的头部设置，变量和命令参数，条件语句，以及for循环处理文件。还提到了find和xargs的使用，包括find的表达式，使用xargs进行文件操作和并行化处理，以及makefile在构建pipeline中的作用。

摘要由CSDN通过智能技术生成

Chapter12 Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks

We’ll see how to write rerunnable Bash shell scripts, automate fileprocessing tasks with find and xargs, run pipelines in parallel, and see a simple makefile.

Basic Bash Scripting

Writing and Running Robust Bash Scripts

1. A robust Bash header

#!/bin/bash
set -e
set -u
set -o pipefail

(1) This is called the shebang, and it indicates the path to the interpreter used to execute this script.
(2) 若某一行shell命令error，shell默认仍会执行完整个命令。set -e 使得error时直接终结整个命令，但某些可能不会有影响的error，如test -d file.txt will return a nonzero exit status if its argument is not a directory，却不会终结。
(3) 若某一命令含有未定义路径，shell仍会执行，如 rm -rf $TEMP_DIR/. If the shell variable $TEMP_DIR isn’t set, Bash will still substitute its value (which is nothing) in place of it. The end result is rm -rf /。使用set -u阻止类似命令的执行。
(4) 即使设置了set -e ，但在pipe中也只有最后一个program出现error时，才会导致程序跳出。set -o pipefail使得pipe 中任一program error都会终止程序。

2. Running Bash scripts
two ways:
bash script.sh: can run any script (as long as it has read permissions) with bash script.sh
./script.sh: might use interpreters other than /bin/bash (e.g., zsh, csh, etc.)
executable permissions:

$ chmod u+x script.sh

Variables and Command Arguments

$ results_dir="results/"
$ echo $results_dir
results/

${variables}

sample="CNTRL01A"
mkdir ${sample}_aln/

Quoting variables: prevents commands from interpreting any spaces or other special characters

sample="CNTRL01A"
mkdir "${sample}_aln/"

1. Command-line arguments

#!/bin/bash
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"

$ bash args.sh arg1 arg2 arg3
script name: args.sh
first arg: arg1
second arg: arg2
third arg: arg3

Bash assigns the number of command-line arguments to the variable $# (this does not count the script name, $0, as an argument). This is useful for user-friendly messages:

#!/bin/bash
if [ "$#" -lt 3 ] # are there less than 3 arguments?
then
echo "error: too few arguments, you provided $#, 3 required"
echo "usage: script.sh arg1 arg2 arg3"
exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"

Running this with too few arguments gives an error (and leads the process to exit with a nonzero exit status)

$ ./script.sh some_arg
error: too few arguments, you provided 1, 3 required
usage: script.sh arg1 arg2 arg3

Conditionals in a Bash Script: if Statements

Bash a bit unique is that a command’s exit status provides the true and false (remember: contrary to other languages, 0 represents true/success and anything else is false/failure)

if [commands]
then
[if-statements]
else
[else-statements]
fi

中文详解：https://blog.csdn.net/zhan570556752/article/details/80399154
[commands] is a placeholder for any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then; otherwise execution continues to the block after else. The then keyword can be placed on the same line as if, but then a semicolon is required: if [commands]; then.

e.g.:

#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
# commands to run if "pattern" is found
echo "found 'pattern' in 'some_file.txt'"
fi

#!/bin/bash
if grep "pattern" file_1.txt > /dev/null &&
grep "pattern" file_2.txt > /dev/null
then
echo "found 'pattern' in 'file_1.txt' and in 'file_2.txt'"
fi

$ test "ATG" = "ATG" ; echo "$?"
0 $
test "ATG" = "atg" ; echo "$?"
1 $
test 3 -lt 1 ; echo "$?"
1 $
test 3 -le 3 ; echo "$?"
0

在这里插入图片描述

$ test -d some_directory ; echo $? # is this a directory?
0 $
test -f some_file.txt ; echo $? # is this a file?
0 $
test -r some_file.txt ; echo $? $ is this file readable?
0 $
test -w some_file.txt ; echo $? $ is this file writable?
1

在这里插入图片描述
Combining test with if

if test -f some_file.txt
then
[...]
fi

Bash provides a simpler syntactic alternative to test statements: [ -f some_file.txt ] . Note the spaces around and within the brackets—these are required. This makes for much simpler if statements involving comparisons:

if [ -f some_file.txt ]
then
[...]
fi

When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, ! as negation, and parentheses to group statements. Our familiar && and || operators won’t work in test, because these are shell operators.
e.g.Ensure our script has enough arguments and that the input file is readable:

#!/bin/bash
set -e
set -u
set -o pipefail
if [ "$#" -ne 1 -o ! -r "$1" ]
then
echo "usage: script.sh file_in.txt"
exit 1
fi

Processing Files with Bash Using for Loops and Globbing

There are three essential parts to creating a pipeline to process a set of files:
• Selecting which files to apply the commands to
• Looping over the data and applying the commands
• Keeping track of the names of any output files created

$ cat samples.txt
zmaysA R1 seq/zmaysA_R1.fastq
zmaysA R2 seq/zmaysA_R2.fastq
zmaysB R1 seq/zmaysB_R1.fastq
zmaysB R2 seq/zmaysB_R2.fastq
zmaysC R1 seq/zmaysC_R1.fastq
zmaysC R2 seq/zmaysC_R2.fastq

e.g.
Suppose that we want to loop over every file, gather quality statistics on each and every file (using the imaginary program fastq_stat), and save this information to an output file. Each output file should have a name based on the input file, so if a summary file indicates something is wrong we know which file was affected. There’s a lot of little parts to this, so we’re going to step through how to do this a piece at a time learning about Bash arrays, basename, and a few other shell tricks along the way.
(1) Load our filenames into a Bash array, which we can then loop over

$ sample_names=(zmaysA zmaysB zmaysC)

$ echo ${sample_names[0]}
zmaysA
$ echo ${sample_names[2]}
zmaysC

All elements are extracted with the cryptic-looking ${sample_files[@]}:

$ echo ${sample_names[@]}
zmaysA zmaysB zmaysC

Access how many elements are in the array (and each element’s index):

$ echo ${#sample_names[@]}
3
$ echo ${!sample_names[@]}
0 1 2

从samples.txt中提取第三列构建bash array
Extract the third column using cut -f 3 from samples.txt.

$ sample_files=($(cut -f 3 samples.txt))
$ echo ${sample_files[@]}
seq/zmaysA_R1.fastq seq/zmaysA_R2.fastq seq/zmaysB_R1.fastq
seq/zmaysB_R2.fastq seq/zmaysC_R1.fastq seq/zmaysC_R2.fastq

Leaving us with the most basic filename

$ basename seqs/zmaysA_R1.fastq
zmaysA_R1.fastq

Basename can also strip a suffix (e.g., extension) provided as the second argument from a filename (or alternatively using the argument -s):

$ basename seqs/zmaysA_R1.fastq .fastq
zmaysA_R1
$ basename -s .fastq seqs/zmaysA_R1.fastq
zmaysA_R1

#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt
# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))
for fastq_file in ${sample_files[@]}
do
# strip .fastq from each FASTQ file, and add suffix
# "-stats.txt" to create an output filename for each FASTQ file
results_file="$(basename $fastq_file .fastq)-stats.txt"
# run fastq_stat on a file, writing results to the filename we've
# above
fastq_stat $fastq_file > stats/$results_file
done

e.g.
Our loop must iterate over unique sample names, and we use these sample names to re-create the input FASTQ files used in alignment. This will be clearer in an example; suppose that we use the aligner BWA and our genome reference is named zmays_AGPv3.20.fa:

#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt
# our reference
reference=zmays_AGPv3.20.fa
# create a Bash array from the first column, which are
# sample names. Because there are duplicate sample names
# (one for each read pair), we call uniq
sample_names=($(cut -f 1 "$sample_info" | uniq))
for sample in ${sample_names[@]}
do
# create an output file from the sample name
results_file="${sample}.sam"
bwa mem $reference ${sample}_R1.fastq ${sample}_R2.fastq \
> $results_file
done

Automating File-Processing with fnd and xargs

Using fnd and xargs

Finding Files with fnd

Unlike ls, find is recursive (it will search for matching files in subdirectories, and subdirectories of subdirectories, etc.)
A quick way to see a project directory’s structure:

$ find zmays-snps
zmays-snps
zmays-snps/analysis
zmays-snps/data
zmays-snps/data/seqs
zmays-snps/data/seqs/zmaysA_R1.fastq
zmays-snps/data/seqs/zmaysA_R2.fastq
zmays-snps/data/seqs/zmaysB_R1.fastq
zmays-snps/data/seqs/zmaysB_R2.fastq
zmays-snps/data/seqs/zmaysC_R1.fastq
zmays-snps/data/seqs/zmaysC_R2.fastq
zmays-snps/scripts

-maxdepth: 限制find的目录深度——当前目录：1; 第一子目录: 2
-name: 寻找文件名为**的文件

$ find data/seqs -name "zmaysB*fastq"
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq

find’s Expressions

-type: 指定类型（e.g., files, directories, named pipes, links, etc.）

$ find data/seqs -name "zmaysB*fastq" -type f
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq

logical AND/OR:

$ find data/seqs -name "zmaysB*fastq" -and -type f
data/seqs/zmaysB_R1.fastq
data/seqs/zmaysB_R2.fastq

$ find data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq" -type f
data/seqs/zmaysA_R1.fastq
data/seqs/zmaysA_R2.fastq
data/seqs/zmaysC_R1.fastq
data/seqs/zmaysC_R2.fastq

在这里插入图片描述

$ find seqs -type f "!" -name "zmaysC*fastq"
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq

$ touch seqs/zmaysB_R1-temp.fastq
$ find seqs -type f "!" -name "zmaysC*fastq"
seqs/zmaysB_R1-temp.fastq
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq

不包含seqs/zmaysB_R1-temp.fastq：

$ find seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"
seqs/zmaysA_R1.fastq
seqs/zmaysA_R2.fastq
seqs/zmaysB_R1.fastq
seqs/zmaysB_R2.fastq

Note that find’s operators like !, (, and ) should be quoted

find’s -exec: Running Commands on find’s Results

$ find . -name "*-temp.fastq" -exec rm -i {} \;
remove ./zmaysA_R1-temp.fastq? y
remove ./zmaysA_R2-temp.fastq? y
remove ./zmaysB_R1-temp.fastq? y
remove ./zmaysC_R1-temp.fastq? y
remove ./zmaysC_R2-temp.fastq? y

-exec: 识别语句末尾的;再运行
rm -i: 删除之前询问

xargs: A Unix Powertool

xargs allows us to take input passed to it from standard in, and use this input as arguments to another program, which allows us to build commands programmatically from values received through standard in.
xargs works by taking input from standard in and splitting it by spaces, tabs, and newlines into arguments. Then, these arguments are passed to the command supplied. For example, to emulate the behavior of find -exec with rm, we use xargs with rm:

$ find . -name "*-temp.fastq"
./zmaysA_R1-temp.fastq
./zmaysA_R2-temp.fastq
./zmaysC_R1-temp.fastq
./zmaysC_R2-temp.fastq
$ find . -name "*-temp.fastq" | xargs rm

An example of how to run find and xargs using the null byte delimiter: 用于文件名中存在空格

$ find . -name "samples [AB].txt" -print0 | xargs -0 rm

-n：设置一次传递参数的个数：

$ find . -name "*-temp.fastq" | xargs -n 1 rm

$ find . -name "*-temp.fastq" > files-to-delete.txt
$ cat files-to-delete.txt
./zmaysA_R1-temp.fastq
./zmaysA_R2-temp.fastq
./zmaysC_R1-temp.fastq
./zmaysC_R2-temp.fastq
$ cat files-to-delete.txt | xargs rm

remember, xargs’s behavior is very simple: it just places arguments after the command you provide

$ find . -name "*-temp.fastq" | xargs -n 1 echo "rm -i" > delete-temp.sh
$ cat delete-temp.sh
rm -i ./zmaysA_R1-temp.fastq
rm -i ./zmaysA_R2-temp.fastq
rm -i ./zmaysC_R1-temp.fastq
rm -i ./zmaysC_R2-temp.fastq

run this simple script with:

$ bash delete-temp.sh
remove ./zmaysA_R1-temp.fastq? y
remove ./zmaysA_R2-temp.fastq? y
remove ./zmaysC_R1-temp.fastq? y
remove ./zmaysC_R2-temp.fastq? y

Using xargs with Replacement Strings to Apply Commands to Files

xargs -I 和占位符**{}**
e.g. we want our output filenames to be paired with our input filenames and have corresponding
names.

$ find . -name "*.fastq" | xargs basename -s ".fastq"
zmaysA_R1
zmaysA_R2
zmaysB_R1
zmaysB_R2
zmaysC_R1
zmaysC_R2
$ find . -name "*.fastq" | xargs basename -s ".fastq" | \
xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt

xargs and Parallelization

xargs is that it can launch a limited number of processes in parallel
-P < num >

$ find . -name "*.fastq" | xargs basename -s ".fastq" | \
xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt

xargs, Pipes, and Redirects

不能在xargs后使用管道符和重定向：shell process that reads your xargs command will interpret pipes and redirects as what to do with xarg’s output, not as part of the command run by xargs
解决方法：create a small Bash script containing the commands to process a single sample, and have xargs run this script in many parallel Bash processes.
e.g.

#!/bin/bash
set -e
set -u
set -o pipefail
sample_name=$(basename -s ".fastq" "$1")
some_program ${sample_name}.fastq | another_program >
${sample_name}-results.txt

Then, run this with:

$ find . -name "*.fastq" | xargs -n 1 -P 4 bash script.sh

Make and Makefles: Another Option for Pipelines

Another very powerful tool used to create bioinformatics pipelines

weixin_42953727

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录