Chapter 3 Remedial Unix Shell
In this chapter, we’ll cover remedial concepts that underlie how we use the shell in bioinformatics: streams, redirection, pipes, working with running programs, and command substitution.
For a project like variant calling, this program would include steps for raw sequence read processing, read mapping, variant calling, filtering variant calls, and final data analysis.
Working with Streams and Redirection
1. Redirecting Standard Error
$ ls -l tb1.fasta leafy1.fasta > listing.txt 2> listing.stderr
$ cat listing.txt
-rw-r--r-- 1 vinceb staff 152 Jan 20 21:24 tb1.fasta
$ cat listing.stderr
ls: leafy1.fasta: No such file or directory
Additionally, 2> has a counterpart, 2>>, which is analogous to >>: it appends to a file rather than overwriting it.
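The difference between 2> and 2>> can be sketched as follows. This is a minimal illustration run in a scratch directory created with mktemp -d; the file names present.fasta and missing.fasta are hypothetical, and the || true guards only keep the sketch going after ls reports its expected failure.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch present.fasta              # missing.fasta is deliberately never created

# 2> creates (or truncates) listing.stderr and writes the error message to it.
ls present.fasta missing.fasta > listing.txt 2> listing.stderr || true

# 2>> appends, so the earlier error message is kept.
ls missing.fasta >> listing.txt 2>> listing.stderr || true

wc -l < listing.stderr           # two error lines, one per failed lookup
```

Had the second command used 2> instead of 2>>, listing.stderr would hold only the most recent error.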
Output written to /dev/null disappears, which is why it’s sometimes jokingly referred to as a “black hole” by nerds.
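As a small sketch of discarding one stream, the hypothetical function chatty below stands in for a program that writes a result to standard output and a diagnostic to standard error:

```shell
# Hypothetical stand-in for a chatty program: one line on each stream.
chatty() {
    echo "result line"
    echo "diagnostic line" >&2
}

# Send standard error to /dev/null; standard output is still captured.
kept=$(chatty 2> /dev/null)
echo "$kept"
```

Only the result line survives; the diagnostic is thrown away and cannot be recovered later, so this is best reserved for messages you are sure you do not need.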
Using tail -f to Monitor Redirected Standard Error
$ program < inputfile > outputfile
It’s a bit more common to use Unix pipes (e.g., cat inputfile | program > outputfile) than <. Many programs we’ll see later (like grep, awk, sort) can also take a file argument in addition to input through standard input.
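The three ways of feeding a file to a program can be compared side by side. This sketch uses sort as the program and a hypothetical names.txt created in a scratch directory:

```shell
workdir=$(mktemp -d)
printf 'chr2\nchr10\nchr1\n' > "$workdir/names.txt"

via_redirect=$(sort < "$workdir/names.txt")      # standard input redirection
via_pipe=$(cat "$workdir/names.txt" | sort)      # pipe from cat
via_argument=$(sort "$workdir/names.txt")        # file argument

# All three forms hand sort the same bytes, so the results are identical.
[ "$via_redirect" = "$via_pipe" ] && [ "$via_pipe" = "$via_argument" ] && echo "all three agree"
```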
2. Pipes in Action: Creating Simple Programs with Grep and Pipes
e.g., find the non-nucleotide characters in a FASTA file:
$ grep -v "^>" tb1.fasta | \
grep --color -i "[^ATCG]"
First, we remove the FASTA header lines, which begin with the > character.
Second, we find any characters that are not A, T, C, or G with the pattern [^ATCG], ignoring case with -i. Finally, grep’s --color option highlights the matching non-nucleotide characters.
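To try the pipeline without a real tb1.fasta, here is a sketch on a tiny made-up file (example.fasta and its three records are hypothetical). The --color option is left out because the result is captured in a variable rather than shown on a terminal.

```shell
workdir=$(mktemp -d)
cat > "$workdir/example.fasta" <<'EOF'
>seq1 hypothetical record
ATCGATCGNNATCG
>seq2 hypothetical record
atcgatcgatcg
>seq3 hypothetical record
ATCGYATCG
EOF

# Drop header lines, then keep sequence lines containing any character
# outside A, T, C, G (case-insensitive because of -i).
hits=$(grep -v "^>" "$workdir/example.fasta" | grep -i "[^ATCG]")
echo "$hits"
```

The lowercase seq2 line is not reported because of -i, while the lines containing N and Y are.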
3. Combining Pipes and Redirection
$ program1 input.txt 2> program1.stderr | \
program2 2> program2.stderr > results.txt
program1 processes the input.txt input file and then outputs its results to standard output. program1’s standard error stream is redirected to the program1.stderr logfile. As before, the backslash is used to split these commands across multiple lines to improve readability (and is optional in your own work).
Meanwhile, program2 uses the standard output from program1 as its standard input. The shell redirects program2’s standard error stream to the program2.stderr logfile, and program2’s standard output to results.txt
$ program1 2>&1 | grep "error"
The 2>&1 operator redirects standard error to the standard output stream, so both streams pass through the pipe to grep.
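The effect of 2>&1 on a pipe can be checked with a sketch like the following, where the hypothetical function chatty plays the role of program1 and writes its error message to standard error:

```shell
# Hypothetical stand-in for program1.
chatty() {
    echo "result: 42 records"
    echo "error: record 7 is malformed" >&2
}

# Without 2>&1 the error line bypasses the pipe, so grep never sees it.
# (Standard error is sent to /dev/null here only to keep the terminal quiet.)
without=$(chatty 2> /dev/null | grep -c "error" || true)

# With 2>&1 both streams travel down the pipe.
with=$(chatty 2>&1 | grep -c "error" || true)

echo "matches without 2>&1: $without, with 2>&1: $with"
```

The || true is there because grep exits with a nonzero status when it finds no matches.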
4. Even More Redirection: A tee in Your Pipe
$ program1 input.txt | tee intermediate-file.txt | program2 > results.txt
program1’s standard output is both written to intermediate-file.txt and piped directly into program2’s standard input.
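A runnable sketch of the same shape, with printf standing in for program1 and wc -l for program2 (both stand-ins are assumptions made for illustration):

```shell
workdir=$(mktemp -d)

# printf plays program1, wc -l plays program2.
printf '%s\n' read1 read2 read3 \
    | tee "$workdir/intermediate-file.txt" \
    | wc -l > "$workdir/results.txt"

cat "$workdir/intermediate-file.txt"    # the three lines that flowed through the pipe
cat "$workdir/results.txt"              # the line count computed downstream
```

The intermediate file is a copy of the stream at that point in the pipe, which is handy for debugging a later step without rerunning the earlier one.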
Managing and Interacting with Processes
1. Background Processes
We can tell the Unix shell to run a program in the background by appending an ampersand (&) to the end of our command:
$ program1 input.txt > results.txt &
[1] 26577
The number returned by the shell is the process ID or PID of program1. This is a unique ID that allows you to identify and check the status of program1 later on. We can check what processes we have running in the background with jobs:
$ jobs
[1]+ Running program1 input.txt > results.txt
fg will bring the most recent process to the foreground. If you have many processes running in the background, they will all appear in the list output by the program jobs. The numbers like [1] are job IDs (which are different from the process IDs your system assigns your running programs). To return a specific background job to the foreground, use fg %<num>, where <num> is its number in the job list. If we wanted to return program1 to the foreground, both fg and fg %1 would do the same thing, as there’s only one background process:
$ fg
program1 input.txt > results.txt
Running a process in the background does not guarantee that it won’t die when your terminal closes. To prevent this, we need to use the tool nohup or run it from within Tmux, two topics we’ll cover in much more detail in Chapter 4.
To place a process already running in the foreground into the background, suspend it with Control-Z and then resume it in the background with bg:
$ program1 input.txt > results.txt # forgot to append ampersand
$ # enter control-z
[1]+ Stopped program1 input.txt > results.txt
$ bg
[1]+ program1 input.txt > results.txt
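In a script there is no interactive job list to consult, but the shell variable $! holds the PID of the most recent background command and wait blocks until that process exits. A minimal sketch, with a sleep-then-write subshell standing in for a long-running program1 (an assumption for illustration):

```shell
workdir=$(mktemp -d)

# A short sleep followed by a write stands in for a long-running program1.
( sleep 1; echo "done" > "$workdir/results.txt" ) &
pid=$!                       # PID of the most recent background command
echo "started background process $pid"

# Other work could happen here while the background process runs.

wait "$pid"                  # block until it exits; wait returns its exit status
echo "background process exited with status $?"
```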
2. Killing Processes
Control-C kills a process only if it is running in the foreground, so if it’s in the background you’ll first have to bring it to the foreground with fg, as discussed earlier.
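As a supplementary sketch, a background process can also be ended without bringing it to the foreground by sending it a signal with the standard kill command. Here sleep 30 is a stand-in for a runaway program:

```shell
sleep 30 &                   # stand-in for a runaway background program
pid=$!

kill "$pid"                  # sends SIGTERM by default

# When a process dies from a signal, wait reports a status above 128
# (bash uses 128 plus the signal number).
status=0
wait "$pid" || status=$?
echo "exit status after kill: $status"
```

In an interactive shell, kill also accepts a job ID such as %1 in place of the PID.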
Exit Status: How to Programmatically Tell Whether Your Command Worked
One concern with long-running processes is that you’re probably not going to wait around to monitor them. How do you know when they complete? How do you know if they successfully finished without an error? Unix programs exit with an exit status, which indicates whether a program terminated without a problem or with an error. By Unix standards, an exit status of 0 indicates the process ran successfully, and any nonzero status indicates some sort of error has occurred (and hopefully the program prints an understandable error message, too).
Occasionally, programmers forget to handle errors well (and this does indeed happen in bioinformatics programs), and programs can error out and still return a zero exit status.
$ program1 input.txt > results.txt
$ echo $?
0
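Exit statuses can be observed with any standard tool. This sketch uses grep -q, which exits with 0 when it finds a match and 1 when it finds none; the file example.fasta is hypothetical. The || form captures a nonzero status without stopping a script that runs under set -e.

```shell
workdir=$(mktemp -d)
printf '>seq1\nATCG\n' > "$workdir/example.fasta"

# A FASTA header is present, so grep exits with status 0.
found_status=0
grep -q "^>" "$workdir/example.fasta" || found_status=$?
echo "FASTA header search exit status: $found_status"

# No FASTQ-style header is present, so grep exits with status 1.
missing_status=0
grep -q "^@" "$workdir/example.fasta" || missing_status=$?
echo "FASTQ header search exit status: $missing_status"
```

Because every command overwrites $?, it has to be read immediately after the command of interest.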
Suppose we want to start program2 only after program1 returns a zero (successful) exit code. The shell operator && executes subsequent commands only if previous commands have completed successfully:
$ program1 input.txt > intermediate-results.txt && \
program2 intermediate-results.txt > results.txt
Using the || operator, we can have the shell execute a command only if the previous command has failed (exited with a nonzero status). This is useful for warning messages:
$ program1 input.txt > intermediate-results.txt || \
echo "warning: an error occurred"
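Both operators can be tried with real commands. In this sketch grep and wc stand in for program1 and program2, and input.fasta is a hypothetical two-record file:

```shell
workdir=$(mktemp -d)
printf '>seq1\nATCG\n>seq2\nGGCC\n' > "$workdir/input.fasta"

# The second step runs only because the first (grep) exits with status 0.
grep "^>" "$workdir/input.fasta" > "$workdir/headers.txt" && \
    wc -l < "$workdir/headers.txt" > "$workdir/count.txt"

# The fallback runs only because grep finds no FASTQ-style headers
# and therefore exits with a nonzero status.
grep "^@" "$workdir/input.fasta" > "$workdir/fastq-headers.txt" || \
    echo "warning: no FASTQ headers found" > "$workdir/warnings.txt"

cat "$workdir/count.txt" "$workdir/warnings.txt"
```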
Additionally, if you don’t care about the exit status and you just wish to execute two commands sequentially, you can use a single semicolon (;):
$ false; true; false; echo "none of the previous mattered"
none of the previous mattered
Command Substitution
$ grep -c '^>' input.fasta
416
$ echo "There are $(grep -c '^>' input.fasta) entries in my FASTA file."
There are 416 entries in my FASTA file.
Using this command substitution approach, we can easily create dated directories using the command date +%F, where the argument +%F simply tells the date program to output the date in a particular format. date has multiple formatting options, so your European colleagues can specify a date as “19 May 2011” whereas your American colleagues can specify “May 19, 2011”:
$ mkdir results-$(date +%F)
$ ls results-2015-04-13
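A self-contained sketch of the same idea, run in a scratch directory. The two extra format strings are illustrations of the European and American styles using standard strftime-style specifiers:

```shell
workdir=$(mktemp -d)

stamp=$(date +%F)                       # ISO 8601 style: YYYY-MM-DD
mkdir "$workdir/results-$stamp"
ls "$workdir"

# The same date rendered in two other styles.
date "+%d %B %Y"                        # day, month name, year
date "+%B %d, %Y"                       # month name, day, year
```

The YYYY-MM-DD form has the practical advantage that directory names sort chronologically when sorted alphabetically.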
alias mkpr="mkdir -p {data/seqs,scripts,analysis}"
alias today="date +%F"
mkdir results-$(today)
Use aliases with caution!