U33: Match making
You will often want to search files to find lines that match a certain pattern. The Unix command grep does this (and much more). FASTA header lines begin with a ‘>’ character, so to find only the header lines in a FASTA file we can use grep, which just requires that you specify a pattern to search for and one or more files to search.
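As a minimal sketch of the header search, the commands below build a tiny FASTA file and ask grep for every line containing ‘>’. The file name example.fasta and its contents are invented for illustration and are not taken from the course data.

```shell
# Create a small, made-up FASTA file to search (hypothetical data).
printf '>seq1 first example\nATGCGT\nTTGACA\n>seq2 second example\nGGGTGA\n' > example.fasta

# Print only the lines that contain '>' (the FASTA header lines).
# The pattern is quoted so the shell does not treat '>' as output redirection.
grep ">" example.fasta
```

Quoting the pattern matters here: an unquoted > would be read by the shell as "redirect output to a file" rather than passed to grep.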
One common option is to get grep to show the lines that don’t match your input pattern. You can do this with the -v option; with ‘>’ as the pattern, this shows just the sequence part of the FASTA file.
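A short sketch of the inverted match, again using an invented file (example.fasta and its contents are assumptions made for this example):

```shell
# Create a small, made-up FASTA file (hypothetical data).
printf '>seq1 first example\nATGCGT\nTTGACA\n>seq2 second example\nGGGTGA\n' > example.fasta

# -v inverts the match: print every line that does NOT contain '>',
# which leaves only the sequence lines.
grep -v ">" example.fasta
```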
U34: Your first ever Unix pipe
It is often useful to look at the output from any command in a controlled manner. Unix can send the output from any command to any other Unix program (as long as the second program accepts input of some sort) by using what is known as a pipe. This is implemented using the ‘|’ character. Any time you run a Unix program or command that outputs a lot of text to the screen, you can instead pipe that output into the less program.
If you press the forward slash (/) key in less, you can then specify a search pattern. Type ATGTGA after the slash and press enter. The less program will highlight the location of these matches on each line. Note that grep matches patterns on a per-line basis, so if one line ended ATG and the next line started TGA, then grep would not find it.
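The per-line behaviour of grep can be seen with a couple of pipes. This is only a sketch: split.fasta and its contents are made up, and tr and wc are used here as convenient non-interactive programs to pipe into, not because the notes prescribe them.

```shell
# Two made-up sequence lines: the first ends in ATG, the next starts with TGA.
printf '>seq1\nCCCCATG\nTGACCCC\n' > split.fasta

# grep checks each line separately, so the motif that spans the line break
# is missed: wc -l reports 0 matching lines.
grep -v ">" split.fasta | grep "ATGTGA" | wc -l

# Deleting the newlines with tr first joins the sequence into a single line,
# and the same pattern is then found: wc -l reports 1 matching line.
grep -v ">" split.fasta | tr -d '\n' | grep "ATGTGA" | wc -l

# To page through long output interactively, pipe it into less instead:
#   grep -v ">" split.fasta | less
```

Each ‘|’ hands the output of the command on its left to the command on its right, so pipes can be chained as many times as needed.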
U35: Heads and tails
Sometimes we just want to see a few lines to get a feeling for what the output looks like, or just to check that our program (or Unix command) is working properly. The head and tail commands show (by default) the first or last 10 lines of a file.
The -i option of grep ‘ignores’ case (upper-case or lower-case letters) when searching.
When grep is given * in place of a file name, the * character acts as a wildcard meaning ‘search all files in the current directory’, and piping the result into the head command restricts the total amount of output to 10 lines. Notice that the output also includes the name of the file containing the matching pattern.
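A sketch that combines these pieces is shown below. The scratch directory, the file names a.txt and b.txt, their contents, and the search pattern ACGTC are all assumptions chosen for illustration; the -n option of tail is used only to make the output short.

```shell
# Work in a fresh scratch directory so that * only matches the files made here.
cd "$(mktemp -d)"

# Two made-up files containing mixed-case sequence (hypothetical data).
printf 'acgtcaaa\nTTTTTTTT\nGGACGTCG\n' > a.txt
printf 'ccccACGTC\nAAAAAAAA\n' > b.txt

# -i ignores case; * expands to every file in the current directory.
# Because more than one file is searched, grep prefixes each match with
# the name of the file it came from. head keeps at most the first 10 lines.
grep -i ACGTC * | head

# tail shows the end of the output instead; -n 1 asks for the last line only.
grep -i ACGTC * | tail -n 1
```

Note that the cd changes the working directory of the current shell, so run this in a throwaway terminal session.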
U36: Getting fancy with regular expressions
A concept that is supported by many Unix programs and also by most programming languages (including Perl) is that of using regular expressions.
# grep "^ATG.*ACACAC.*TGA$" chr1.fasta | less
# grep "^ATG*ACACAC*TGA$" chr1.fasta | less
The asterisk in a regular expression is similar to, but NOT the same as, the other asterisks that we have seen so far. An asterisk in a regular expression means: ‘match zero or more of the preceding character or pattern’.
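The difference between ‘.*’ and a bare ‘*’ can be checked on a few short lines. This is a sketch on invented data (lines.txt and its four sequences are assumptions), using the two patterns shown above.

```shell
# Four made-up sequence lines (hypothetical data).
printf 'ATGCCCACACACGGGTGA\nATACACATGA\nATGGGACACACCCTGA\nCCCATGACACACTGA\n' > lines.txt

# '.*' means "any run of any characters", so this matches lines that start
# with ATG, contain ACACAC somewhere later, and end with TGA.
# It prints the first and third lines.
grep "^ATG.*ACACAC.*TGA$" lines.txt

# Without the dots, each * applies only to the letter just before it:
# AT, then zero or more G, then ACACA, then zero or more C, then TGA.
# It prints the second and third lines.
grep "^ATG*ACACAC*TGA$" lines.txt
```

The second line, ATACACATGA, shows the ‘zero or more’ rule at work: it contains no G after AT and no C after ACACA, yet the pattern without dots still matches it.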
# grep "ACGT" chr1.fasta | less
# grep "AC.GT" chr1.fasta | less
# grep "AC*GT" chr1.fasta | less
grep "A...T" chr1.fasta | less
grep "AG*T" chr1.fasta | less
grep "A*G*C*T*" chr1.fasta | less
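To see what each of these patterns accepts, it can help to run them against a handful of very short lines. The sketch below uses an invented file (motifs.txt and its six lines are assumptions) in place of chr1.fasta, and omits less so that the output is printed directly.

```shell
# Six made-up lines, one short string per line (hypothetical data).
printf 'ACGT\nACTGT\nAGT\nACCCGT\nAT\nGGGG\n' > motifs.txt

# Literal match: only ACGT.
grep "ACGT" motifs.txt

# A dot matches exactly one character of any kind: only ACTGT.
grep "AC.GT" motifs.txt

# A, then zero or more C, then GT: ACGT, AGT and ACCCGT.
grep "AC*GT" motifs.txt

# A, any three characters, then T: only ACTGT.
grep "A...T" motifs.txt

# A, then zero or more G, then T: AGT and AT.
grep "AG*T" motifs.txt

# Every part is optional, so the pattern can match nothing at all,
# and grep therefore prints every line, including GGGG.
grep "A*G*C*T*" motifs.txt
```

The last pattern is a useful warning: a regular expression built only from starred characters matches every line, so it filters nothing.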