MATLAB functions and features:
Bioinformatics Toolbox User's Guide:
http://www.mathworks.com/access/helpdesk/help/pdf_doc/bioinfo/bioinfo_ug.pdf
Bioinformatics Toolbox Reference:
http://www.mathworks.com/access/helpdesk/help/pdf_doc/bioinfo/bioinfo_ref.pdf
Hope this is helpful to everyone.
------------------------------------------------------------------------------
Features and Functions
Bioinformatics Toolbox includes many functions to help you with genome and proteome analysis. Most functions are implemented in M-code (the MATLAB programming language) with the source available for you to view. This open environment lets you explore and customize the existing toolbox algorithms or develop your own.
Data Formats and Databases: Access online databases, copy data into the MATLAB workspace, and read and write to files with standard bioinformatic formats.
Sequence Alignments: Compare nucleotide or amino acid sequences using pair-wise and multiple sequence alignment functions.
Sequence Utilities and Statistics: Manipulate sequences and determine physical, chemical, and biological characteristics.
Protein Property Analysis: Determine protein characteristics and simulate enzyme cleavage reactions.
Phylogenetic Analysis: Explore phylogenetic data with functions and a GUI to draw phylograms (trees).
Microarray Data Analysis: Read, filter, normalize, and visualize microarray data.
Mass Spectrometry Data Analysis: Preprocess raw mass spectrometry data and use statistical learning functions to identify patterns.
Graph Theory Functions: Apply basic graph theory algorithms to sparse matrices.
Graph Visualization: View relationships between data visually with interactive maps, hierarchy plots, and pathways.
Statistical Learning and Visualization: Classify and identify features in data sets, set up cross-validation experiments, and compare different classification methods.
Prototyping and Development Environment: Create new algorithms, try new ideas, and analyze alternatives.
Data Visualization: Visually compare pair-wise sequence alignments, multiply aligned sequences, and gene expression data from microarrays, and plot nucleic acid and protein characteristics.
Algorithm Sharing and Application Deployment: Create GUIs and stand-alone applications.
Data Formats and Databases
Bioinformatics Toolbox supports access to many of the databases on the Web and other online data sources. It also reads many common genome file formats, so that you do not have to write and maintain your own file readers.
Web-based databases — You can directly access public databases on the Web and copy sequence and gene expression information into MATLAB.
The sequence databases currently supported are GenBank (getgenbank), GenPept (getgenpept), European Molecular Biology Laboratory EMBL (getembl), and Protein Data Bank PDB (getpdb). You can also access data from the NCBI Gene Expression Omnibus (GEO) web site by using a single function (getgeodata).
Get multiply aligned sequences (gethmmalignment), hidden Markov model profiles (gethmmprof), and phylogenetic tree data (gethmmtree) from the PFAM database.
Gene Ontology database — Load the database from the Web into a gene ontology object (geneont). Select sections of the ontology with methods for the geneont object (getancestors, getdescendants, getmatrix, getrelatives), and manipulate data with utility functions (goannotread, num2goid).
Read data from instruments — Read data generated from gene sequencing instruments (scfread, joinseq, traceplot), mass spectrometers (jcampread), and Agilent microarray scanners (agferead).
Reading data formats — The toolbox provides a number of functions for reading data from common bioinformatic file formats.
Sequence data: GenBank (genbankread), GenPept (genpeptread), EMBL (emblread), PDB (pdbread), and FASTA (fastaread)
Multiply aligned sequences: ClustalW and GCG formats (multialignread)
Gene expression data from microarrays: Gene Expression Omnibus (GEO) data (geosoftread), GenePix data in GPR and GAL files (gprread, galread), SPOT data (sptread), Affymetrix® GeneChip® data (affyread), and ImaGene results files (imageneread).
Note: The function affyread only works on PC supported platforms.
Hidden Markov model profiles: PFAM-HMM file (pfamhmmread)
Writing data formats — The functions for getting data from the Web include the option to save the data to a file. In addition, there is a dedicated function for writing sequence data to a file in the FASTA format (fastawrite).
BLAST searches — Request Web-based BLAST searches (blastncbi), get the results from a search (getblast) and read results from a previously saved BLAST formatted report file (blastread).
MATLAB has built-in support for other industry-standard file formats including Microsoft Excel and comma-separated value (CSV) files. Additional functions perform ASCII and low-level binary I/O, allowing you to develop custom functions for working with any data format.
Sequence Alignments
You can select from a list of analysis methods to perform pair-wise or multiple sequence alignment.
Pair-wise sequence alignment — Efficient MATLAB implementations of standard algorithms such as the Needleman-Wunsch (nwalign) and Smith-Waterman (swalign) algorithms for pair-wise sequence alignment. The toolbox also includes standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.
Multiple sequence alignment — Functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and making manual adjustments.
Multiple sequence profiles — MATLAB implementations for multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
Biological codes — Look up the letters or numeric equivalents for commonly used biological codes (aminolookup, baselookup, geneticcode, revgeneticcode).
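To make the pair-wise alignment item above more concrete, here is a minimal Python sketch of the dynamic-programming recurrence behind Needleman-Wunsch global alignment. It is an illustration only, not the toolbox's nwalign implementation: the function name nw_score and the simple match/mismatch/linear-gap scores are assumptions chosen for brevity, whereas a real alignment would normally use a scoring matrix such as BLOSUM or PAM.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Hypothetical helper: uses a flat match/mismatch score and a linear
    gap penalty instead of a substitution matrix.
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]
```

Only the score is computed here, using two rows of the table; recovering the alignment itself would require keeping the full table and tracing back from the last cell.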
Sequence Utilities and Statistics
You can manipulate and analyze your sequence to gain a deeper understanding of your data. Use a graphical user interface (GUI) with many of the sequence functions in Bioinformatics Toolbox (seqtool).
Sequence conversion and manipulation — The toolbox provides routines for common operations, such as converting DNA or RNA sequences to amino acid sequences, that are basic to working with nucleic acid and protein sequences (aa2int, aa2nt, dna2rna, rna2dna, int2aa, int2nt, nt2aa, nt2int, seqcomplement, seqrcomplement, seqreverse).
You can manipulate your sequence by performing an in-silico digestion with restriction endonucleases (restrict) and proteases (cleave).
Sequence statistics — Determine various statistics about a sequence (aacount, basecount, codoncount, dimercount, nmercount, ntdensity, codonbias, cpgisland, oligoprop), search for specific patterns within a sequence (seqshowwords, seqwordcount), or search for open reading frames (seqshoworfs). In addition, you can create random sequences for test cases (randseq).
Sequence utilities — Determine a consensus sequence from a set of multiply aligned amino acid or nucleotide sequences (seqconsensus), or compute a sequence profile (seqprofile). Format a sequence for display (seqdisp) or graphically show a sequence alignment with frequency data (seqlogo).
Additional functions in MATLAB efficiently handle string operations with regular expressions (regexp, seq2regexp) to look for specific patterns in a sequence and search through a library for string matches (seqmatch).
Look for possible cleavage sites in a DNA/RNA sequence by searching for palindromes (palindromes).
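The conversion and counting operations listed in this section are simple to state precisely. The following Python sketch shows what a reverse complement, a DNA-to-RNA conversion, a base count, and an n-mer count compute. The function names are hypothetical and only loosely mirror seqrcomplement, dna2rna, basecount, and nmercount; ambiguity codes and input validation are left out.

```python
from collections import Counter

# Watson-Crick complements for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def dna_to_rna(dna):
    """Replace thymine with uracil."""
    return dna.upper().replace("T", "U")


def base_count(dna):
    """Count A, C, G and T; bases that do not occur are reported as 0."""
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def nmer_count(seq, n):
    """Count every overlapping window of length n."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def is_palindromic_site(dna):
    """A site is palindromic when it equals its own reverse complement."""
    return dna.upper() == reverse_complement(dna)
```

For example, is_palindromic_site("GAATTC") is true, which is the sense of "palindrome" used when searching for possible cleavage sites.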
Protein Property Analysis
You can use a collection of protein analysis methods to extract information from your data. The toolbox provides functions to calculate various properties of a protein sequence, such as the atomic composition (atomiccomp), molecular weight (molweight), and isoelectric point (isoelectric). You can cleave a protein with an enzyme (cleave, rebasecuts) and create distance and Ramachandran plots for PDB data (pdbdistplot, ramachandran). The toolbox contains a graphical user interface for protein analysis (proteinplot) and plotting 3-D protein and other molecular structures with information from molecule model files, such as PDB files (molviewer).
Amino acid sequence utilities — Calculate amino acid statistics for a sequence (aacount) and get information about character codes (aminolookup).
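As an illustration of what simulating an enzyme cleavage reaction involves, the sketch below digests a protein sequence in Python using the commonly cited simplified trypsin rule: cut after lysine (K) or arginine (R) unless the next residue is proline (P). The function name trypsin_digest is hypothetical, and the rule is an assumption for this example; the toolbox's cleave function takes its own enzyme or pattern arguments and may behave differently.

```python
def trypsin_digest(protein):
    """Split a protein into fragments using a simplified trypsin rule.

    Assumed rule: cleave on the C-terminal side of K or R,
    except when the following residue is P.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    # Keep whatever remains after the last cleavage site.
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments
```

With this rule, trypsin_digest("AKRPGKT") yields the fragments AK, RPGK, and T: the R is not cut because it is followed by P.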
Phylogenetic Analysis
Functions for phylogenetic tree building and analysis.
Phylogenetic tree data — Read Newick-formatted tree files into the MATLAB workspace as phylogenetic tree objects (phytreeread, phytree), and write phylogenetic tree objects to Newick-formatted files (phytreewrite).
Create a phylogenetic tree — Calculate the pair-wise distance between biological sequences (seqpdist), estimate the substitution rates (dnds, dndsml), build a phylogenetic tree from pair-wise distances (seqlinkage, seqneighjoin, reroot), and view the tree in an interactive GUI that allows you to view, edit, and explore the data (phytreetool or view). This GUI also allows you to prune branches, reorder, rename, and explore distances.
Phylogenetic tree object methods — You can access the functionality of the phytreetool GUI using methods for a phylogenetic tree object (phytree). Get property values (get) and node names (getbyname). Calculate the patristic distances between pairs of leaf nodes (pdist, weights) and draw a phylogenetic tree object in a MATLAB figure window as a phylogram, cladogram, or radial treeplot (plot). Manipulate tree data by selecting branches and leaves using a specified criterion (select, subtree) and removing nodes (prune). Compare trees (getcanonical) and use Newick-formatted strings (getnewickstr).
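Tree building starts from a matrix of pair-wise distances between sequences. The Python sketch below shows one standard way to obtain such distances for aligned nucleotide sequences: the proportion of differing sites (p-distance) followed by the Jukes-Cantor correction. It is offered as an assumption-laden illustration; the function names are hypothetical, gaps are treated as ordinary characters, and seqpdist supports several methods whose details are not reproduced here.

```python
import math


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def jukes_cantor(a, b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3).

    math.log raises ValueError when p >= 0.75, where the
    correction is undefined.
    """
    p = p_distance(a, b)
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Symmetric matrix of Jukes-Cantor distances between all pairs."""
    n = len(seqs)
    return [[jukes_cantor(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]
```

A matrix like this is the kind of input that linkage or neighbor-joining methods turn into a tree.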
Microarray Data Analysis
MATLAB is widely used for microarray data analysis. However, the standard normalization and visualization tools that scientists use can be difficult to implement. Bioinformatics Toolbox includes these standard functions.
Microarray data — Read Affymetrix GeneChip files (affyread) and plot data (probesetplot), ImaGene results files (imageneread), SPOT files (sptread) and Agilent microarray scanner files (agferead). Read GenePix GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus (GEO) data from the web (getgeodata) and read GEO data from files (geosoftread).
A utility function (magetfield) extracts data from one of the microarray reader functions (gprread, agferead, sptread, imageneread).
Microarray normalization and filtering — The toolbox provides a number of methods for normalizing microarray data, such as lowess normalization (malowess) and mean normalization (manorm), as well as quantile normalization across multiple arrays (quantilenorm). You can use filtering functions to clean raw data before analysis (geneentropyfilter, genelowvalfilter, generangefilter, genevarfilter), and calculate the range and variance of values (exprprofrange, exprprofvar).
Microarray visualization — The toolbox contains routines for visualizing microarray data. These routines include spatial plots of microarray data (maimage, redgreencmap), box plots (maboxplot), loglog plots (maloglog), and intensity-ratio plots (mairplot). You can also view clustered expression profiles (clustergram, redgreencmap). You can create 2-D scatter plots of principal components from the microarray data (mapcaplot).
Microarray utility functions — Use the following functions to work with Affymetrix GeneChip data sets. Get library information for a probe (probelibraryinfo), gene information from a probe set (probesetlookup), and probe set values from CEL and CDF information (probesetvalues). Show probe set information from NetAffx (probesetlink) and plot probe set values (probesetplot).
The toolbox accesses statistical routines to perform cluster analysis and to visualize the results, and you can view your data through statistical visualizations such as dendrograms and classification and regression trees.
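Of the normalization methods named in this section, quantile normalization across multiple arrays is the easiest to show in a few lines. The Python sketch below sorts each array, averages the values at each rank across arrays, and writes those rank means back in each array's original order. It is a simplified illustration under stated assumptions: the name quantile_normalize is hypothetical, tied values are ranked by position rather than averaged, and missing values are not handled, so results can differ from quantilenorm.

```python
def quantile_normalize(columns):
    """Give every array (column) the same value distribution.

    columns: list of equal-length lists, one list per array.
    Returns new lists; the input is not modified.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must have the same number of values")

    # Mean of the r-th smallest value across all arrays.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]

    normalized = []
    for col in columns:
        # order[rank] is the index of the value holding that rank.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization, every array contains exactly the same set of values, and only their assignment to genes differs.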
Mass Spectrometry Data Analysis
The mass spectrometry functions preprocess and classify raw data from SELDI-TOF and MALDI-TOF spectrometers.
Reading raw data into MATLAB — Load raw mass/charge and ion intensity data from comma-separated-value (CSV) files, or read a JCAMP-DX formatted file with mass spectrometry data (jcampread) into MATLAB.
If your data is stored in TXT files, you can load it with the importdata function.
Preprocessing raw data — Resample high-resolution data to a lower resolution (msresample) where the extra data points are not needed. Correct the baseline (msbackadj). Align a spectrum to a set of reference masses (msalign) and visually verify the alignment (msheatmap). Normalize the area between spectra for comparing (msnorm), and filter out noise (mslowess and mssgolay).
Spectrum analysis — Load spectra into a GUI (msviewer) for selecting mass peaks and further analysis.
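As a small illustration of one preprocessing step, the Python sketch below rescales a spectrum so that the area under its intensity curve equals a common target, which is one way to make spectra comparable. The function names and the trapezoid-rule area are assumptions for this example; msnorm offers its own options and defaults, which are not reproduced here.

```python
def area_under_curve(mz, intensity):
    """Trapezoid-rule area of intensity over the mass/charge axis."""
    return sum((mz[i + 1] - mz[i]) * (intensity[i] + intensity[i + 1]) / 2
               for i in range(len(mz) - 1))


def normalize_area(mz, intensity, target=1.0):
    """Scale intensities so the area under the curve equals target."""
    area = area_under_curve(mz, intensity)
    if area == 0:
        raise ValueError("spectrum has zero area and cannot be normalized")
    scale = target / area
    return [value * scale for value in intensity]
```

Applying normalize_area to every spectrum with the same target puts them on a common scale before peaks are compared.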
[Figure: the roles of the various mass spectrometry functions in Bioinformatics Toolbox.]
Graph Theory Functions
Graph theory functions in Bioinformatics Toolbox apply basic graph theory algorithms to sparse matrices. A sparse matrix represents a graph: the nonzero entries in the matrix represent the edges of the graph, and the values of these entries represent the associated weight (cost, distance, length, or capacity) of each edge. Graph algorithms that use the weight information ignore an edge whose weight is NaN or Inf. Graph algorithms that do not use the weight information still consider such an edge, because these algorithms look only at the connectivity described by the sparse matrix and not at the values stored in it.
Sparse matrices can represent four types of graphs:
Directed Graph — Sparse matrix, either double real or logical. Row (column) index indicates the source (target) of the edge. Self-loops (values in the diagonal) are allowed, although most of the algorithms ignore these values.
Undirected Graph — Lower triangle of a sparse matrix, either double real or logical. An algorithm expecting an undirected graph ignores values stored in the upper triangle of the sparse matrix and values in the diagonal.
Directed Acyclic Graph (DAG) — Sparse matrix, double real or logical, with zero values in the diagonal. While a zero-valued diagonal is a requirement of a DAG, it does not guarantee a DAG. An algorithm expecting a DAG will not test for cycles because this would add unwanted complexity.
Spanning Tree — Undirected graph with no cycles and with one connected component.
There are no attributes attached to the graphs; sparse matrices representing all four types of graphs can be passed to any graph algorithm. All functions will return an error on nonsquare sparse matrices.
Graph algorithms do not pretest for graph properties because such tests can introduce a time penalty. For example, there is an efficient shortest path algorithm for DAGs; however, testing whether a graph is acyclic is expensive compared to running the algorithm itself. Therefore, it is important to select a graph theory function and properties appropriate for the type of graph represented by your input matrix. If the algorithm receives a graph type that is different from what it expects, it will either:
Return an error when it reaches an inconsistency, for example, if you pass a cyclic graph to the graphshortestpath function and specify Acyclic as the method property.
Produce an invalid result. For example, if you pass a directed graph to a function with an algorithm that expects an undirected graph, it will ignore values in the upper triangle of the sparse matrix.
The graph theory functions include graphallshortestpaths, graphconncomp, graphisdag, graphisomorphism, graphisspantree, graphmaxflow, graphminspantree, graphpred2path, graphshortestpath, graphtopoorder, graphtraverse.
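The conventions described in this section (row index as edge source, column index as edge target, lower triangle only for undirected graphs, NaN or Inf weights skipped by weighted algorithms) can be sketched in Python as follows. This is an illustration under assumptions, not the toolbox's graphshortestpath: the sparse matrix is modeled as a dictionary keyed by (row, column), nodes are numbered from 0 rather than from 1 as in MATLAB, and Dijkstra's algorithm is used, which requires non-negative weights.

```python
import heapq
import math


def shortest_path(edges, n, source, target, directed=True):
    """Shortest path over a sparse matrix given as {(row, col): weight}.

    Returns (distance, path). If target is unreachable, returns
    (math.inf, []). The name and signature are hypothetical.
    """
    adjacency = {node: [] for node in range(n)}
    for (row, col), weight in edges.items():
        if math.isnan(weight) or math.isinf(weight):
            continue  # weighted algorithms ignore NaN/Inf edges
        if directed:
            adjacency[row].append((col, weight))
        elif row > col:
            # Undirected: only the lower triangle is read; the upper
            # triangle and the diagonal are ignored.
            adjacency[row].append((col, weight))
            adjacency[col].append((row, weight))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in adjacency[node]:
            candidate = d + weight
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```

Passing a matrix whose edges sit only in the upper triangle with directed=False finds no path at all, which is the kind of silently invalid result described above for a graph type the algorithm does not expect.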
Graph Visualization
Bioinformatics Toolbox includes functions, objects, and methods for creating, viewing, and manipulating graphs, such as interaction maps, hierarchy plots, and pathways.
The object constructor function (biograph) lets you create a biograph object to hold graph data. Methods of the biograph object let you calculate the position of nodes (dolayout), draw the graph (view), get handles to the nodes and edges (getnodesbyid and getedgesbynodeid) to further query information, and find relations between the nodes (getancestors, getdescendants, and getrelatives). There are also methods that apply basic graph theory algorithms to the biograph object.
Various properties of a biograph object let you programmatically change the properties of the rendered graph. You can customize the node representation, for example, drawing pie charts inside every node (CustomNodeDrawFcn). Or you can associate your own callback functions to nodes and edges of the graph, for example, opening a Web page with more information about the nodes (NodeCallback and EdgeCallback).
Statistical Learning and Visualization
Bioinformatics Toolbox provides functions that build on the classification and statistical learning tools in Statistics Toolbox (classify, kmeans, and treefit).
These functions include imputation tools (knnimpute), support vector machine classifiers (svmclassify, svmtrain) and K-nearest neighbor classifiers (knnclassify).
Other functions set up cross-validation experiments (crossvalind) and compare the performance of different classification methods (classperf). In addition, there are tools for selecting diverse and discriminating features (rankfeatures, randfeatures).
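Setting up a cross-validation experiment mostly comes down to assigning each observation to a fold. The Python sketch below produces k-fold assignments with fold sizes that differ by at most one, then derives the training and test indices for a given fold. The function names and the seed parameter are hypothetical, and this is only one plausible scheme; crossvalind supports several methods whose exact behavior is not reproduced here.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Return a list assigning each sample a fold number in 0..k-1."""
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and the number of samples")
    # Balanced labels first, then shuffled so folds are random.
    folds = [i % k for i in range(n_samples)]
    random.Random(seed).shuffle(folds)
    return folds


def split_for_fold(folds, fold):
    """Indices used for training and for testing in one round."""
    test = [i for i, f in enumerate(folds) if f == fold]
    train = [i for i, f in enumerate(folds) if f != fold]
    return train, test
```

Looping fold over range(k), training on the train indices and scoring on the test indices, gives one performance estimate per fold that can then be averaged.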
Prototyping and Development Environment
MATLAB is a prototyping and development environment where you can create algorithms and easily compare alternatives.
Integrated environment — Explore biological data in an environment that integrates programming and visualization. Create reports and plots with the built-in functions for mathematics, graphics, and statistics.
Open environment — Access the source code for Bioinformatics Toolbox functions. The toolbox includes many of the basic bioinformatics functions you will need to use, and it includes prototypes for some of the more advanced functions. Modify these functions to create your own custom solutions.
Interactive programming language — Test your ideas by typing functions that are interpreted interactively with a language whose basic data element is an array. The arrays do not require dimensioning and allow you to solve many technical computing problems. Using matrices for sequences or groups of sequences allows you to work efficiently and not worry about writing loops or other programming controls.
Programming tools — Use a visual debugger for algorithm development and refinement and an algorithm performance profiler to accelerate development.
Data Visualization
MATLAB 2-D and volume visualization features let you create custom graphical representations of multidimensional data sets. You can also create montages and overlays, and export finished graphics to a PostScript image file or copy them directly into Microsoft PowerPoint.
Algorithm Sharing and Application Deployment
The open MATLAB environment lets you share your analysis solutions with other MATLAB users, and it includes tools to create custom software applications. With the addition of the MATLAB Compiler, you can create stand-alone applications independent of MATLAB, and with the addition of MATLAB Builder for COM, you can create GUIs and stand-alone applications within other programming environments.
Share algorithms with other MATLAB users — You can share data analysis algorithms created in the MATLAB language across all MATLAB supported platforms by giving M-files to other MATLAB users. You can also create GUIs within MATLAB using the Graphical User Interface Development Environment (GUIDE).
Deploy MATLAB GUIs — Create a GUI within MATLAB using GUIDE, and then use the MATLAB Compiler to create a stand-alone GUI application that runs separately from MATLAB.
Create dynamic link libraries (DLLs) — Use the MATLAB Compiler to create dynamic link libraries (DLLs) for your functions, and then link these libraries to other programming environments such as C and C++.
Create COM objects — Use MATLAB Builder for COM to create COM objects, and then use a COM-compatible programming environment (for example, Visual Basic) to create a stand-alone application.
Create Excel add-ins — Use MATLAB Builder for Excel to create Excel add-in functions, and then use the add-in functions with Excel spreadsheets.
Create Java™ classes — Use MATLAB Builder for Java to automatically generate Java classes from MATLAB algorithms. You can run these MATLAB based classes outside the MATLAB environment.