2019/07/12_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_4

#Learning objectives

  • Learn how to search for characters or patterns in a text file using the grep command
  • Learn how to write to file and append to file using output redirection
  • Explore how to use the pipe (|) character to chain together commands

#Searching files

In the previous section we used less to search within a single file. To search across multiple files without opening them, we use grep.

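grep is a command-line utility for searching plain-text data sets for lines matching a pattern or regular expression (regex). (I did not fully understand this definition at first.) In short, grep searches plain-text data and prints every line that matches the given pattern or regex.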
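
As a minimal sketch of searching several files at once without opening them, the example below builds two toy files in a temporary directory. The file names a.txt and b.txt and their contents are made up for illustration and are not part of the course data.

```shell
# Hypothetical toy files in a temporary directory (not the course data).
workdir=$(mktemp -d)
printf 'ACGT\nNNNN\n' > "$workdir/a.txt"
printf 'TTTT\nGGNNNNGG\n' > "$workdir/b.txt"

# Search both files for NNNN. When given more than one file, grep
# prefixes each matching line with the name of the file it came from.
matches=$(cd "$workdir" && grep NNNN a.txt b.txt)
echo "$matches"
```

Each printed line has the form filename:matching line, so it is clear which file a hit belongs to.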

Suppose we want to see how many reads in our file Mov10_oe_1.subset.fq are “bad”, with 10 consecutive Ns (NNNNNNNNNN).

$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq

 

We get back a lot of lines. What if we want to see the whole fastq record for each of these reads?

We can use the -B and -A arguments for grep to return the matched line plus one line before it (-B 1) and two lines after it (-A 2). Since each fastq record is four lines and the second line is the sequence, this returns the whole record for every match.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq

I do not fully understand this yet; I think it will only make sense with hands-on practice.
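
For hands-on practice, here is a small sketch of what -B 1 -A 2 does, run on a made-up three-read FASTQ file. The file toy.fq and its reads are hypothetical, not the course data.

```shell
# Hypothetical three-read FASTQ file; reads 1 and 3 contain a run of Ns.
workdir=$(mktemp -d)
toy_fq="$workdir/toy.fq"
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNACGT' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGTACGTACGTAC' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNNNACGTAC' '+' 'IIIIIIIIIIIIIIIIII' \
  > "$toy_fq"

# Without context options, only the two matching sequence lines are printed.
grep NNNNNNNNNN "$toy_fq"

# -B 1 adds the header line before each match, and -A 2 adds the "+" line
# and the quality line after it, so each hit becomes a full 4-line record.
records=$(grep -B 1 -A 2 NNNNNNNNNN "$toy_fq")
echo "$records"
```

The second command prints the complete records for read1 and read3 and skips read2. Note that GNU grep also prints a line containing -- between non-adjacent groups of matches when context options are used.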

#Redirection

The redirection command for writing something to a file is >.

Let’s try it out and put all the records whose sequence contains ‘NNNNNNNNNN’ in Mov10_oe_1.subset.fq into another file called bad_reads.txt.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt

The prompt should sit there a little bit, and then it should look like nothing happened. But you should have a new file called bad_reads.txt.

$ ls -l

Take a look at the file and see if it contains what you think it should. NOTE: If a file named bad_reads.txt had already existed in our directory, the redirect would have overwritten it without any warning.

Use > to redirect output: > followed by a file name writes the output of the command to a new file with that name.
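
The overwrite behavior is easy to check with a throwaway file. This is a minimal sketch; the temporary directory and the echoed text are placeholders rather than real search results.

```shell
workdir=$(mktemp -d)
out="$workdir/bad_reads.txt"    # hypothetical location used only for this demo

echo "first search" > "$out"    # creates the file
echo "second search" > "$out"   # silently replaces the previous contents

contents=$(cat "$out")
echo "$contents"
```

Only the text from the second command remains in the file, which is why it is worth checking for an existing file before redirecting with >.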

The redirection command for appending something to an existing file is >>.

If we use >>, the output is appended to the end of the file rather than overwriting it. This is useful for saving the results of more than one search in the same file.
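
A short sketch of appending, again using a throwaway file. The name searches.txt and the echoed lines are assumptions made for illustration.

```shell
workdir=$(mktemp -d)
out="$workdir/searches.txt"           # hypothetical file name

echo "result of search 1" > "$out"    # start a fresh file
echo "result of search 2" >> "$out"   # add to the end, keeping the first line

# tr strips the padding that some wc implementations put before the number.
line_count=$(wc -l < "$out" | tr -d ' ')
cat "$out"
```

Both lines are still in the file afterwards, in the order they were written.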

The redirection command for using the output of a command as input for a different command is |.

In other words, the pipe lets the second command read what the first command would otherwise have printed to the screen.

We can also count the number of lines using the wc command. wc stands for word count; its -l option reports the number of lines.

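With |, both commands run, but the output of the command before the | is not printed; only the output of the command after the | appears on screen.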
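
As a sketch of piping grep into wc -l, the example below counts matching lines in a made-up file. The file seqs.txt and its four sequences are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'ACGT' 'NNNNNNNNNNAC' 'GGNNNNNNNNNN' 'TTTT' > "$workdir/seqs.txt"

# grep finds the two lines with ten consecutive Ns; the pipe hands them to
# wc -l, which prints only how many lines it received, not the lines themselves.
# tr strips the padding that some wc implementations put before the number.
bad_count=$(grep NNNNNNNNNN "$workdir/seqs.txt" | wc -l | tr -d ' ')
echo "$bad_count"
```

The same pattern applied to Mov10_oe_1.subset.fq would give the number of "bad" reads instead of listing them.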

‘cut’ is a program that extracts columns from files. By default it treats the tab character as the column delimiter.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | head

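This command keeps the lines that contain exon, extracts columns 1, 4, 5, and 7 from them (in a GTF file: chromosome, start, end, and strand), and shows the first 10 rows.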
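
To see what cut -f1,4,5,7 keeps, here is a sketch on a made-up two-line, tab-separated file laid out like a GTF. The file toy.gtf and its values are hypothetical.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"   # hypothetical GTF-like file, one exon line and one CDS line
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A";\n'  > "$gtf"
printf 'chr1\tsrc\tCDS\t120\t180\t.\t+\t0\tgene_id "A";\n' >> "$gtf"

# Keep the exon line, then keep only columns 1, 4, 5 and 7.
# cut splits on tabs by default, which matches the GTF layout.
exon_cols=$(grep exon "$gtf" | cut -f1,4,5,7)
echo "$exon_cols"
```

One thing to keep in mind: grep exon matches the word anywhere on the line, not only in the feature column, so a file whose attribute column mentions exon on non-exon lines would need a stricter pattern.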

Removing duplicate exons

We can use a new tool, sort, to remove exons that show up more than once. The sort command with the -u option returns only unique lines.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | head

Counting the total number of exons

First, let’s check how many lines we would have without using sort -u by piping the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | wc -l

Now, to count how many unique exons are on chromosome 1, we will add back the sort -u and pipe the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | wc -l
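
The difference between the two counts can be sketched on a tiny made-up input in which one exon appears twice. The file coords.txt stands in for the output of the cut step and is hypothetical.

```shell
workdir=$(mktemp -d)
coords="$workdir/coords.txt"   # hypothetical cut output with one repeated exon
printf 'chr1\t100\t200\t+\n'  > "$coords"
printf 'chr1\t300\t400\t-\n' >> "$coords"
printf 'chr1\t100\t200\t+\n' >> "$coords"

# Count every line, duplicates included.
total=$(wc -l < "$coords" | tr -d ' ')
# Collapse identical lines with sort -u first, then count what is left.
unique=$(sort -u "$coords" | wc -l | tr -d ' ')
echo "total=$total unique=$unique"
```

The gap between the two numbers is the number of duplicate lines that sort -u removed.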

#Commands, options, and keystrokes covered in this lesson

grep
> (output redirection)
>> (output redirection, append)
| (output redirection, pipe)
wc
cut
sort

 

 

 

 
