Unix and Perl Primer for Biologists - Part 2: Advanced Unix - Reading Notes (U37-U45)

Latest recommended article published on 2021-04-30 01:41:58

whereis redhat

Article link: https://blog.csdn.net/weixin_38872771/article/details/86604150

U37: Counting with grep
Running grep -c simply counts how many lines match the specified pattern.

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c i2 intron_IME_data.fasta 
9785
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCCA" intron_IME_data.fasta 
0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta 
11
# grep -c prints only a count, so piping it to less shows that number, not the matching lines
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta | less
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# ls
At_genes.gff  At_proteins.fasta  chr1.fasta  intron_IME_data.fasta
# search every file in the current directory whose name ends in .fasta
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCA"  *.fasta
At_proteins.fasta:0
chr1.fasta:2
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACGT" *.fasta
At_proteins.fasta:70
chr1.fasta:50612
intron_IME_data.fasta:11924
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "^ATG.*ACACAC.*TGA$"  *.fasta
At_proteins.fasta:0
chr1.fasta:3
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC.GT" *.fasta
At_proteins.fasta:47
chr1.fasta:62327
intron_IME_data.fasta:17998
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC*GT" *.fasta
At_proteins.fasta:2600
chr1.fasta:288454
intron_IME_data.fasta:103917
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "A*C*G*T*" *.fasta
At_proteins.fasta:269436
chr1.fasta:385224
intron_IME_data.fasta:250978
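
The jump in counts between these patterns comes from what each metacharacter allows. The sketch below repeats the same four kinds of pattern on a tiny made-up file (the sequences are hypothetical, not course data) so the difference is easy to check by eye. The last pattern, A*C*G*T*, can match the empty string, so it matches every line and its count is simply the number of lines in the file.

```shell
# Tiny sample file (hypothetical sequences, not from the course data).
demo=$(mktemp)
printf 'ACGT\nACCGT\nAGT\nACTGT\nTTTT\n' > "$demo"

exact=$(grep -c 'ACGT' "$demo")        # the literal string ACGT
dot=$(grep -c 'AC.GT' "$demo")         # exactly one character between AC and GT
star=$(grep -c 'AC*GT' "$demo")        # A, then zero or more C, then GT
all=$(grep -c 'A*C*G*T*' "$demo")      # every part is optional, so every line matches
total=$(wc -l < "$demo" | tr -d ' ')   # line total, to compare with $all

echo "ACGT=$exact AC.GT=$dot AC*GT=$star A*C*G*T*=$all lines=$total"
rm -f "$demo"
```

Note that AC*GT also matches AGT, because * means "zero or more of the preceding character".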

U38: Regular expressions in less
While viewing a file with less, type a forward slash (/) followed by a pattern, and less will search for and highlight every match. The search runs forward from your current position in the file; type a question mark (?) instead to search backwards. The real bonus is that the patterns you specify can be regular expressions.

Task U38.1
Try viewing a sequence file with less and then searching for a pattern such as ATCG.*TAG$. This should make it easier to see exactly where your regular expression pattern matches. After typing a forward-slash (or a question-mark), you can press the up and down arrows to select previous searches.

[Screenshot: searching forward with / and backward with ? in less]
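
Searching in less is interactive, so one way to preview what a pattern will hit is to run the same pattern through grep -n first. This is only a sketch on made-up sequences; the regular expression flavour used by less depends on how it was built, although a simple pattern like this one should behave the same in both tools.

```shell
# Hypothetical sequences, one per line.
demo=$(mktemp)
printf 'GGATCGTTTAG\nATCGAAATAGC\nCCCC\nATCGTAG\n' > "$demo"

# -n prefixes each matching line with its line number.
matches=$(grep -n 'ATCG.*TAG$' "$demo")
printf '%s\n' "$matches"

# Interactive equivalent (not run here): open the file at the first match.
#   less -p 'ATCG.*TAG$' "$demo"
rm -f "$demo"
```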

U39: Let me transl(iter)ate that for you
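The Unix command tr (short for transliterate) replaces one set of characters with another, for example converting upper-case characters to lower-case characters.
[Screenshot: using tr to convert upper case to lower case]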
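
A minimal sketch of tr on made-up sequences: the first command lower-cases its input, and the second uses the same character-for-character mapping to complement DNA bases (an illustration added here, not part of the original notes).

```shell
# Map each upper-case letter to its lower-case counterpart.
lower=$(printf 'ACGTTGCA\n' | tr 'A-Z' 'a-z')

# Map A->T, C->G, G->C, T->A, one character at a time.
comp=$(printf 'AACG\n' | tr 'ACGT' 'TGCA')

echo "$lower $comp"
```

tr reads only from standard input, which is why the input arrives through a pipe here.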

U40: That’s what she sed
sed can change a particular pattern into something completely different, and is capable of performing a variety of text manipulations. The 's' at the start of the sed expression puts sed in 'substitute' mode, where you specify one pattern (between the first two forward slashes) to be replaced by another (between the second and third forward slashes).
[Screenshot: head piped into sed]
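
As a small sketch on a made-up sequence, the two commands below show substitute mode with and without the trailing g flag. Without g, sed replaces only the first match on each line.

```shell
# Replace only the first ATG on the line.
first=$(printf 'ATGCATGC\n' | sed 's/ATG/atg/')

# The trailing g replaces every ATG on the line.
every=$(printf 'ATGCATGC\n' | sed 's/ATG/atg/g')

echo "$first $every"
```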

U41: Word up
It helps to get a feeling for how large a file is before you start running lots of commands against it. In particular, it is useful to know how many lines it has, because many Unix commands such as grep and sed work on a line-by-line basis. The Unix command wc (word count) counts the number of lines, words and bytes in the specified file(s). Running wc -l shows just the line count.
[Screenshot: wc output for a file]
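
A quick sketch of wc on a two-line made-up file. The tr -d ' ' step is only there because some versions of wc pad their numbers with spaces.

```shell
demo=$(mktemp)
printf 'ACGT TTGA\nGGCC\n' > "$demo"

wc "$demo"                              # lines, words and bytes, then the file name
lines=$(wc -l < "$demo" | tr -d ' ')    # line count only
words=$(wc -w < "$demo" | tr -d ' ')    # word count only
bytes=$(wc -c < "$demo" | tr -d ' ')    # byte count only

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$demo"
```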

U42: GFF and the art of redirection
This unit works with a GFF file. GFF is a common file format in bioinformatics, used to describe the location of various features on a DNA sequence. Features can be exons, genes, binding sites etc., and the sequence can be a single gene or (more commonly) an entire chromosome. The first step is to create a new (smaller) file that contains a subset of the original.
To do that, the output of a command has to be redirected into an actual file, and that is what the > symbol does. It is one of three redirection operators in Unix.
For now, all you really need to know about the format is that every GFF file has 9 fields, each separated by a tab character. There should always be some text at every position (even if it is just a '.' character). The last field is often used to store a lot of text.
[Screenshot: GFF features]
[Screenshot: redirecting a subset of the GFF file into a new file]
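
The sketch below shows redirection on a tiny hand-made GFF file. The file names, the 'demo' source field and the IDs are all hypothetical; only the nine tab-separated fields follow the format described above. It uses > to create a file, >> to append to one, and < to read from one.

```shell
workdir=$(mktemp -d)
gff="$workdir/demo.gff"

# Three records, nine tab-separated fields each (made-up values).
printf 'Chr1\tdemo\tgene\t3631\t5899\t.\t+\t.\tID=gene1\n'      >  "$gff"
printf 'Chr1\tdemo\texon\t3631\t3913\t.\t+\t.\tParent=gene1\n'  >> "$gff"
printf 'Chr1\tdemo\texon\t3996\t4276\t.\t+\t.\tParent=gene1\n'  >> "$gff"

# > creates (or overwrites) the subset file with the exon lines.
grep 'exon' "$gff" > "$workdir/subset.gff"

# >> appends to the subset file instead of overwriting it.
head -n 1 "$gff" >> "$workdir/subset.gff"

# < feeds a file to a command on standard input.
subset_lines=$(wc -l < "$workdir/subset.gff" | tr -d ' ')

# Count the fields on the first line by turning tabs into newlines.
fields=$(head -n 1 "$gff" | tr '\t' '\n' | wc -l | tr -d ' ')

echo "subset lines: $subset_lines, fields per record: $fields"
rm -rf "$workdir"
```

Be careful with >: it silently replaces an existing file of the same name.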
U43: Not just a pipe dream
The 2nd and/or 3rd fields of a GFF file are usually used to describe some sort of biological feature. We might be interested in seeing how many different features are in our file:
[Screenshot: cut, sort and uniq joined in a pipeline]

1. The cut command first takes the At_genes_subset.gff file and 'cuts' out just the 3rd column (as specified by the -f option). Luckily, the default behavior of cut is to split text into columns based on tab characters (if the columns were separated by another character such as a comma, we would need the -d option to specify the comma).
2. The sort command takes the output of the cut command and sorts it alphanumerically.
3. The uniq command (in its default mode) collapses each run of identical adjacent lines into a single line, which is why the input is sorted first (otherwise you would see thousands of fields which said 'curated', 'Coding_transcript' etc.).
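
The three steps above can be sketched on a small made-up table. The rows are shortened to five tab-separated columns for readability, so this is not a valid GFF file, but column 3 still holds the feature type. The example also shows why sort has to come before uniq.

```shell
demo=$(mktemp)
printf 'Chr1\tdemo\tgene\t1\t9\n'    >  "$demo"
printf 'Chr1\tdemo\texon\t1\t4\n'    >> "$demo"
printf 'Chr1\tdemo\tgene\t20\t30\n'  >> "$demo"
printf 'Chr1\tdemo\texon\t20\t25\n'  >> "$demo"
printf 'Chr1\tdemo\texon\t27\t30\n'  >> "$demo"

# Without sort, uniq only merges neighbouring duplicates.
unsorted=$(cut -f 3 "$demo" | uniq | wc -l | tr -d ' ')

# With sort, identical values become neighbours and collapse to one line each.
features=$(cut -f 3 "$demo" | sort | uniq)

# uniq -c additionally reports how often each value occurred.
cut -f 3 "$demo" | sort | uniq -c

echo "lines without sort: $unsorted"
printf '%s\n' "$features"
rm -f "$demo"
```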

Suppose we want to find which features start earliest in the chromosome sequence. The start coordinate of a feature is always in column 4 of the GFF file, so we cut out just the two columns of interest (3 and 4); the -f option of cut specifies which columns to extract. By default sort orders its input alphanumerically, so we use the -n option to sort numerically. Because we could sort on either column, -k 2 tells sort to use the second column. Finally, head keeps just the first 10 rows of output: the lines from the GFF file with the lowest starting coordinates.
[Screenshot: cut, sort and head joined in a pipeline]
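
A minimal sketch of this pipeline on made-up rows (again shortened to five tab-separated columns, with hypothetical coordinates), keeping the three earliest features instead of ten:

```shell
demo=$(mktemp)
printf 'Chr1\tdemo\tgene\t5928\t8737\n'  >  "$demo"
printf 'Chr1\tdemo\texon\t6437\t7069\n'  >> "$demo"
printf 'Chr1\tdemo\tgene\t3631\t5899\n'  >> "$demo"
printf 'Chr1\tdemo\texon\t3996\t4276\n'  >> "$demo"
printf 'Chr1\tdemo\tCDS\t3760\t3913\n'   >> "$demo"

# Keep feature type and start coordinate, sort numerically on the
# second of those two columns, then take the top three rows.
earliest=$(cut -f 3,4 "$demo" | sort -n -k 2 | head -n 3)

printf '%s\n' "$earliest"
rm -f "$demo"
```

Without -n the coordinates would be compared as text, so 10000 would sort before 3631.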

U44: The end of the line
Pressing Enter generates a newline, which is represented internally by either a line feed or a carriage return character, depending on what computer you are using (Windows actually uses a combination of both). As a result, a text file created on one system can look unreadable in a Unix text viewer. In Unix (and in Perl and other programming languages) the patterns \n and \r denote the line feed and carriage return characters respectively. A common fix is to replace every \r with \n.

Use less to look at the Data/Misc/excel_data.csv file. This is a short file that was exported from a Mac version of Microsoft Excel. In less it appears as a single line, with ^M characters where the newlines should be. You can convert these carriage returns into Unix-friendly line-feed characters with the tr command, like so:
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv
[Screenshot: excel_data.csv in less, shown as one line with ^M characters]

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv 

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv 
sequence 1,acacagagag
sequence 2,acacaggggaaa
sequence 3,ttcacagaga
sequence 4,cacaccaaacac
sequence 5,tttatatttaatata
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv  >excel_data_formatted.csv
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  excel_data_formatted.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data_formatted.csv

[Screenshot: tr output redirected to excel_data_formatted.csv and viewed in less]
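
The same conversion can be tried on a throwaway file, which makes the effect on the line count visible. The file names and contents below are made up. The last command is an extra illustration for Windows-style files, where each line ends in \r\n and the usual fix is to delete the \r rather than replace it.

```shell
workdir=$(mktemp -d)

# Two records separated by carriage returns only (old Mac style).
printf 'sequence 1,acac\rsequence 2,ttgg\r' > "$workdir/mac.csv"
before=$(wc -l < "$workdir/mac.csv" | tr -d ' ')    # no line feeds yet

# Replace each carriage return with a line feed and save the result.
tr '\r' '\n' < "$workdir/mac.csv" > "$workdir/unix.csv"
after=$(wc -l < "$workdir/unix.csv" | tr -d ' ')

# Windows-style input: drop the \r and keep the \n.
windows=$(printf 'a,1\r\nb,2\r\n' | tr -d '\r' | wc -c | tr -d ' ')

echo "lines before: $before, after: $after"
rm -rf "$workdir"
```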

U45: This one goes to 11
This unit uses the Arabidopsis file intron_IME_data.fasta. Every intron sequence in this file has a header line that contains the following pieces of information:
1. gene name
2. intron position in gene
3. distance of intron from transcription start site (TSS)
4. type of sequence that intron is located in (either CDS or UTR)

The task is to extract five sequences from this file that are: a) from first introns, b) in the 5' UTR, and c) closest to the TSS. Notice that the solution uses one of the other redirect operators, <, to read from a file.
[Screenshot: the U45 command pipeline and its output]
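
One possible way to chain the earlier commands for this task is sketched below. It is not necessarily the course's own answer. It assumes that the four header items are joined by underscores (for example >GENE_i1_300_5UTR), that each sequence sits on a single line, and that the characters @ and # never occur in the file; the gene IDs and sequences are made up. The idea is to fold each record onto one line, filter and sort the records, then unfold them again.

```shell
workdir=$(mktemp -d)
fasta="$workdir/introns.fasta"

# Hypothetical records: name, intron number, distance from TSS, type.
cat > "$fasta" <<'EOF'
>AT1G00001.1_i1_300_5UTR
ACGTACGT
>AT1G00002.1_i2_50_CDS
GGGGCCCC
>AT1G00003.1_i1_120_5UTR
TTTTAAAA
>AT1G00004.1_i1_80_CDS
CCCCGGGG
>AT1G00005.1_i1_45_5UTR
ATATATAT
EOF

closest=$(
  tr '\n' '@' < "$fasta" |    # join the whole file into one long line
  sed 's/>/#>/g' |            # mark the start of every record
  tr '#' '\n' |               # one record (header@sequence@) per line
  grep 'i1_.*5UTR' |          # first introns located in the 5' UTR
  sort -t _ -k 3,3n |         # numeric sort on the distance field
  head -n 2 |                 # keep the two closest (the task asks for five)
  tr '@' '\n'                 # unfold back into header and sequence lines
)

printf '%s\n' "$closest"
headers=$(printf '%s\n' "$closest" | grep '^>')
rm -rf "$workdir"
```

Because each record ends with @, the unfolded output has a blank line after every sequence.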

Summary
If you have learnt (and understood) all of the Unix commands so far then you probably will never need to learn anything more in order to do a lot of productive Unix work. But keep on dipping into the man page for all of these commands to explore them in even further detail. If you include the three, as-yet-unmentioned, commands in the last column, then you will probably be able to achieve >95% of everything that you will ever want to do in Unix (remember, you can use the man command to find out more about top, ps, and kill). The power comes from how you can use combinations of these commands.
[Image: summary table of the Unix commands covered so far]