Bioinformatics Data Skills by O'Reilly — Study Notes 3

Chapter 3 Remedial Unix Shell

In this chapter, we’ll cover remedial concepts that deeply underlie how we use the shell in bioinformatics: streams, redirection, pipes, working with running programs, and command substitution.

For a project like variant calling, this workflow would include steps for raw sequence read processing, read mapping, variant calling, filtering variant calls, and final data analysis.

Working with Streams and Redirection

1. Redirecting Standard Error

$ ls -l tb1.fasta leafy1.fasta > listing.txt 2> listing.stderr
$ cat listing.txt
-rw-r--r-- 1 vinceb staff 152 Jan 20 21:24 tb1.fasta
$ cat listing.stderr
ls: leafy1.fasta: No such file or directory

Additionally, 2> has an appending counterpart, 2>>, which is analogous to >> (it appends to the file rather than overwriting it).
Output written to /dev/null disappears, which is why it’s sometimes jokingly referred to as a “blackhole” by nerds.
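
A minimal sketch of 2>> in use (reusing the book's placeholder program1; input1.txt, input2.txt, and stderr.log are assumed filenames):

$ program1 input1.txt 2>> stderr.log    # first run's errors written to stderr.log
$ program1 input2.txt 2>> stderr.log    # second run's errors appended below them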
Using tail -f to Monitor Redirected Standard Error
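
A short sketch of the pattern this sidebar describes (program1 stands in for any long-running command; stderr.log is an assumed filename): redirect standard error to a logfile, run the program in the background, and follow the logfile with tail -f. Control-C stops tail without affecting the background job.

$ program1 input.txt > results.txt 2> stderr.log &
$ tail -f stderr.log    # prints new lines of stderr.log as they are appended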

$ program < inputfile > outputfile

It’s a bit more common to use Unix pipes (e.g., cat inputfile | program > outputfile) than <. Many programs we’ll see later (like grep, awk, and sort) can also take a file argument in addition to input through standard input.
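
For example, the following two lines are equivalent ways to run grep (file.txt is a placeholder filename):

$ grep "pattern" file.txt             # input given as a file argument
$ cat file.txt | grep "pattern"       # same input supplied through standard input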

2. Pipes in Action: Creating Simple Programs with Grep and Pipes
e.g., find the non-DNA nucleotide letters in a FASTA file:

$ grep -v "^>" tb1.fasta | \
grep --color -i "[^ATCG]"

First, we remove the FASTA header lines, which begin with the > character.
Second, we want to find any characters that are not A, T, C, or G; the -i option makes the match case-insensitive. Finally, we add grep’s --color option to color the matching nonnucleotide characters.

3. Combining Pipes and Redirection

$ program1 input.txt 2> program1.stderr | \
program2 2> program2.stderr > results.txt

program1 processes the input.txt input file and then outputs its results to standard output. program1’s standard error stream is redirected to the program1.stderr logfile. As before, the backslash is used to split these commands across multiple lines to improve readability (and is optional in your own work).

Meanwhile, program2 uses the standard output from program1 as its standard input. The shell redirects program2’s standard error stream to the program2.stderr logfile, and program2’s standard output to results.txt.

$ program1 2>&1 | grep "error"

The 2>&1 operator is what redirects standard error to the standard output stream.
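
The same operator is also handy for capturing both streams in one logfile; a minimal sketch (all-output.log is an assumed filename):

$ program1 input.txt > all-output.log 2>&1    # stdout goes to the file, then stderr joins it

Note that the order matters here: the file redirection must come before 2>&1.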

4. Even More Redirection: A tee in Your Pipe

$ program1 input.txt | tee intermediate-file.txt | program2 > results.txt

program1’s standard output is both written to intermediate-file.txt and piped directly into program2’s standard input.

Managing and Interacting with Processes

1. Background Processes
We can tell the Unix shell to run a program in the background by appending an ampersand (&) to the end of our command:

$ program1 input.txt > results.txt &
[1] 26577

The number returned by the shell is the process ID or PID of program1. This is a unique ID that allows you to identify and check the status of program1 later on. We can check what processes we have running in the background with jobs:

$ jobs
[1]+ Running program1 input.txt > results.txt

fg will bring the most recent process to the foreground. If you have many processes running in the background, they will all appear in the list output by the program jobs. The numbers like [1] are job IDs (which are different from the process IDs your system assigns your running programs). To return a specific background job to the foreground, use fg %<num>, where <num> is its number in the job list. If we wanted to return program1 to the foreground, both fg and fg %1 would do the same thing, as there’s only one background process:

$ fg
program1 input.txt > results.txt

Running a process in the background does not guarantee that it won’t die when your terminal closes. To prevent this, we need to use the tool nohup or run it from within Tmux, two topics we’ll cover in much more detail in Chapter 4.
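
As a quick preview of the nohup approach (details in Chapter 4), a hedged sketch: nohup makes the command ignore the hangup signal sent when the terminal closes, so the job survives logout; without a redirection, nohup collects the output in a file named nohup.out.

$ nohup program1 input.txt > results.txt &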

To place a process already running in the foreground into the background, first suspend it with Control-z and then resume it in the background with bg:

$ program1 input.txt > results.txt # forgot to append ampersand
$ # enter control-z
[1]+ Stopped program1 input.txt > results.txt
$ bg
[1]+ program1 input.txt > results.txt

2. Killing Processes
Control-C only works if the process is running in the foreground, so if it’s in the background you’ll have to bring it to the foreground first with fg, as discussed earlier.
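
A minimal sketch (assuming the process is background job 1): either bring it to the foreground and interrupt it with Control-C, or send it a termination signal directly with kill:

$ fg %1      # bring job 1 to the foreground, then press Control-C
$ kill %1    # or send SIGTERM to job 1 without foregrounding it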

Exit Status: How to Programmatically Tell Whether Your Command Worked

One concern with long-running processes is that you’re probably not going to wait around to monitor them. How do you know when they complete? How do you know if they successfully finished without an error? Unix programs exit with an exit status, which indicates whether a program terminated without a problem or with an error. By Unix standards, an exit status of 0 indicates the process ran successfully, and any nonzero status indicates some sort of error has occurred (and hopefully the program prints an understandable error message, too).
Occasionally, programmers forget to handle errors well (and this does indeed happen in bioinformatics programs), and programs can error out and still return a zero exit status.

$ program1 input.txt > results.txt
$ echo $?
0

we want to start program2 only after program1 returns a zero (successful) exit code. The shell operator && executes subsequent commands only if previous commands have completed successfully:

$ program1 input.txt > intermediate-results.txt && \
program2 intermediate-results.txt > results.txt

Using the || operator, we can have the shell execute a command only if the previous command has failed (exited with a nonzero status). This is useful for warning messages:

$ program1 input.txt > intermediate-results.txt || \
echo "warning: an error occurred"

Additionally, if you don’t care about the exit status and you just wish to execute two commands sequentially, you can use a single semicolon (;):

$ false; true; false; echo "none of the previous mattered"
none of the previous mattered

Command Substitution

Command substitution runs a Unix command inline and returns its output as a string that can be used in another command. For example, counting the FASTA entries (header lines beginning with >) in a file:
$ grep -c '^>' input.fasta
416
$ echo "There are $(grep -c '^>' input.fasta) entries in my FASTA file."
There are 416 entries in my FASTA file.

Using this command substitution approach, we can easily create dated directories using the command date +%F, where the argument +%F simply tells the date program to output the date in a particular format. date has multiple formatting options, so your European colleagues can specify a date as “19 May 2011,” whereas your American colleagues can specify “May 19, 2011.”
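
A quick sketch of a few of these format strings (the comments show what the output would look like for an assumed date of April 13, 2015; month names depend on your locale):

$ date +%F              # ISO 8601 style: 2015-04-13
$ date "+%d %B %Y"      # European style: 13 April 2015
$ date "+%B %d, %Y"     # American style: April 13, 2015

The ISO-style %F format used below has the nice property that dated directory names sort chronologically: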

$ mkdir results-$(date +%F)
$ ls results-2015-04-13
Aliases like the following (added to your shell configuration file, e.g., ~/.bashrc or ~/.bash_profile) shorten these common commands:

alias mkpr="mkdir -p {data/seqs,scripts,analysis}"
alias today="date +%F"

$ mkdir results-$(today)
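
A hypothetical use of the mkpr alias: after creating and entering a new project directory (zmays-snps is a placeholder name), it lays out the standard subdirectories in one step:

$ mkdir zmays-snps && cd zmays-snps
$ mkpr
$ ls
analysis    data    scripts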

Use aliases with caution!!!

