GenePattern: https://www.genepattern.org/modules/docs/ssGSEAProjection/4
Input Files
- Input expression dataset
The GCT file containing the input dataset’s gene expression data (see the GCT file format information). Gene symbols are typically listed in the column with header Name; however, GCT files containing RNAi data may list the gene symbol name in alternative columns. The “gene symbol column name” parameter specifies which of the input GCT file’s columns contains the gene symbols.
The input GCT file’s row identifiers must draw from the same family of gene identifiers (the same ontology or name space, such as HUGO Gene Nomenclature) as those used to identify genes in the gene sets database file (see next item below). Typically these are human gene symbols.
If a GCT file’s row identifiers are probe IDs, and gene sets are defined through a list of human gene symbols, it will be necessary to collapse all probe set expression values for a given gene into a single expression value and use a human gene symbol to represent that gene. The CollapseDataset GenePattern module can make this transformation.
- Gene sets database files
Optionally, one or more GMT or GMX files, each containing a collection of gene set definitions (see the GMT file format and the GMX file format in the GenePattern file formats documentation).
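The probe-to-gene collapse described for the input expression dataset can be sketched as follows. This is a minimal illustration, not the CollapseDataset module's implementation: the function name `collapse_probes`, the `max` and `mean` modes, and the probe IDs are assumptions made for the example, and the module's own options may differ.

```python
from collections import defaultdict


def collapse_probes(rows, probe_to_gene, mode="max"):
    """Collapse probe-level rows into one row per gene symbol.

    rows: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol (hypothetical annotation).
    mode: "max" keeps the per-sample maximum, "mean" the per-sample average.
    """
    grouped = defaultdict(list)
    for probe_id, values in rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # probes without a gene symbol are dropped in this sketch
        grouped[gene].append(values)

    collapsed = {}
    for gene, value_lists in grouped.items():
        columns = list(zip(*value_lists))  # one tuple of probe values per sample
        if mode == "max":
            collapsed[gene] = [max(col) for col in columns]
        elif mode == "mean":
            collapsed[gene] = [sum(col) / len(col) for col in columns]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return collapsed


# Hypothetical probe IDs and values, two samples per row.
rows = {
    "probe_1": [1.0, 5.0],
    "probe_2": [3.0, 2.0],
    "probe_3": [4.0, 4.5],
    "probe_4": [9.0, 9.0],  # no annotation, so it is dropped
}
probe_to_gene = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "BRCA1"}

collapsed = collapse_probes(rows, probe_to_gene, mode="max")
```

After the collapse, every row identifier is a gene symbol, so it can be matched against gene sets defined with human gene symbols.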
Data formats - GeneSetEnrichmentAnalysisWiki
Expression Datasets
An expression dataset file contains features (genes or probes), samples, and an expression value for each feature in each sample. It is a tab-delimited text file in gct, res, pcl, or txt format. For descriptions and examples of each file format, see GSEA file formats.
Because most gene expression data is already in tab-delimited text files, or in spreadsheet and database programs that allow you to export the data into tab-delimited text files, creating expression dataset files for GSEA is relatively easy:
1. Start with a tab-delimited file that contains your gene expression data.
2. Open the file in Excel or a text editor.
3. Make the necessary format changes: compare your current file with the file format described in GSEA file formats; add header rows, remove extra columns, and make any other changes necessary to create a properly formatted file.
4. Save the file as a tab-delimited text file with the appropriate file extension (gct, res, pcl, or txt).
Note: GSEA expects a very specific format for .txt files. See the file formats page on the website for details. Also, be aware that some editors on some platforms automatically append a “.txt” extension to other file types (e.g. “.gct.txt”), which may confuse GSEA during parsing. Make sure to remove the extra .txt extension from the file name before using the file with GSEA.
Note: When you create an expression dataset file, the GSEA team recommends that the file name include the name of the chip used to produce the expression data; for example, all_aml_dataset_hgu95av2.gct.
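As an alternative to editing the file by hand, the steps above can be scripted. The sketch below assumes the GCT 1.2 layout (a `#1.2` version line, a line with the row and column counts, a header line, then one row per feature); check it against GSEA file formats before relying on it. The helper names `strip_extra_txt` and `write_gct`, the sample names, and the gene names are hypothetical.

```python
import os
import tempfile

GSEA_EXTENSIONS = (".gct", ".res", ".pcl")


def strip_extra_txt(filename):
    """Drop a trailing '.txt' that an editor appended after a GSEA extension."""
    lowered = filename.lower()
    for ext in GSEA_EXTENSIONS:
        if lowered.endswith(ext + ".txt"):
            return filename[: -len(".txt")]
    return filename


def write_gct(path, sample_names, rows):
    """Write rows of (name, description, values) as a tab-delimited GCT file.

    A value of None is written as an empty cell (a missing expression value).
    """
    for name, _description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"row {name!r} does not have one value per sample")
    with open(path, "w", newline="") as handle:
        handle.write("#1.2\n")
        handle.write(f"{len(rows)}\t{len(sample_names)}\n")
        handle.write("\t".join(["Name", "Description", *sample_names]) + "\n")
        for name, description, values in rows:
            cells = ["" if v is None else str(v) for v in values]
            handle.write("\t".join([name, description, *cells]) + "\n")


# Hypothetical data: three samples, two genes, one missing value.
sample_names = ["tumor_1", "tumor_2", "control_1"]
rows = [
    ("GENE_A", "na", [12.5, 11.0, 3.2]),
    ("GENE_B", "na", [7.1, None, 6.8]),
]

# The file name includes the chip name, as the GSEA team recommends.
filename = strip_extra_txt("example_dataset_hgu95av2.gct.txt")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, filename)
    write_gct(path, sample_names, rows)
    with open(path) as handle:
        lines = handle.read().splitlines()
```

Writing the counts from `len(rows)` and `len(sample_names)` keeps the second header line consistent with the data, which is an easy thing to get wrong when editing by hand.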
When creating expression dataset files, keep in mind the following:
● Image data. GSEA does not process image data. If you have image data, you must use external software (such as Rosetta Resolver or Stanford Microarray Database) to convert the image data to numeric data.
● RNA-seq data. GSEA does not normalize RNA-seq data. RNA-seq data must be normalized for between-sample comparisons using an external normalization procedure (e.g. those in DESeq2 or Voom).
● cDNA two-color ratio data. See cDNA Microarray Data.
● CEL files. If you are analyzing CEL files, each of which contains data for one sample, you will need to merge the collection of CEL files into a single expression dataset file. You can use the GenePattern module ExpressionFileCreator to merge CEL files into an expression dataset file. Alternatively, you can use tools such as RMAExpress or DCHIP to merge the CEL files and then create your expression dataset file based on that merged file.
● Genes. Each feature (gene or probe) must have a unique identifier. If the expression dataset contains redundant identifiers, GSEA arbitrarily selects one of the redundant features, removes the others and continues the analysis. The analysis report lists the redundant identifiers.
● Samples. Each sample must have a unique identifier. If you have technical replicates, you generally want to remove them by averaging or some other data reduction technique. For example, assume you have five tumor samples and five control samples each run three times (three replicate columns) for a total of 30 data columns. You would average the three replicate columns for each sample and create a dataset containing 10 data columns (five tumor and five control).
● Present/Marginal/Absent Calls. GSEA ignores Present/Marginal/Absent calls. If your dataset contains such calls, do not filter the data based on that information. The GSEA algorithm expects different levels of expression and provides better results when given all of the data.
● Missing expression values. The gct, txt, and pcl file formats support missing expression values; simply leave the cell blank if the expression data is missing. The res file format, which is specific to Affymetrix chips, does not allow missing expression values.
You can run the gene set enrichment analysis against an expression dataset that is missing values. The GSEA software does not impute missing values or filter out genes that have too many missing values; it simply ignores the missing values in its ranking metric calculations. However, too many missing values for a gene may cause the differential expression scores for that gene to be inaccurate. For example, consider a dataset that contains 10 samples in class_A and 15 samples in class_B. Assume that a gene has only 3 values in class_A and all 15 values in class_B. The GSEA software uses the 3 values in class_A and the 15 values in class_B to score the gene by its differential expression. In the signal-to-noise calculation, the mean and variance estimates for the gene are then based on different sample sizes, a situation that is better avoided. (If you wish, you can use external tools to impute missing values or filter out genes that have too many missing values.)
● Filtering based on expression values. For many other analytical algorithms, such as clustering, it makes sense to pre-process a dataset. For example, before running hierarchical clustering, you might remove genes that have low variance across the dataset. This prevents flat genes from driving the clustering result and improves processing time by focusing on a smaller number of interesting genes. The GSEA algorithm does not filter the expression dataset and generally does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list, and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic. Processing time is rarely a factor, as GSEA can easily analyze 22,000 genes with even modest processing power. However, an exception exists for RNA-seq datasets, where GSEA may benefit from the removal of extremely low count genes (i.e., genes with artifactual levels of expression such that they are likely not actually expressed in any of the samples in the dataset).
Although GSEA does not require that you preprocess the expression dataset, it can be used effectively on preprocessed datasets. For example, Monti et al. used a filtered dataset to further analyze genes consistently expressed across two datasets, as described in “Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response” (http://www.bloodjournal.org/cgi/content/full/bloodjournal;105/5/1851).
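Two of the points in the list above can be sketched in code: averaging technical replicates, and the way missing values leave the two class statistics resting on different sample sizes. The sketch assumes replicate columns are adjacent and uses the plain signal-to-noise form (mean_A - mean_B) / (sd_A + sd_B). It is not the GSEA software's implementation, which may apply further adjustments, and the function names are hypothetical.

```python
from statistics import mean, stdev


def average_replicates(values, replicates_per_sample):
    """Average each run of adjacent replicate columns into one value per sample."""
    if len(values) % replicates_per_sample != 0:
        raise ValueError("column count is not a multiple of the replicate count")
    return [
        mean(values[i : i + replicates_per_sample])
        for i in range(0, len(values), replicates_per_sample)
    ]


def signal_to_noise(class_a, class_b):
    """Plain signal-to-noise score that skips missing values (None).

    Returns (score, n_a, n_b) so the caller can see how many values each
    class statistic was actually based on.
    """
    a = [v for v in class_a if v is not None]
    b = [v for v in class_b if v is not None]
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two values per class for a standard deviation")
    noise = stdev(a) + stdev(b)
    if noise == 0:
        raise ValueError("both classes are constant; the score is undefined")
    return (mean(a) - mean(b)) / noise, len(a), len(b)


# Ten samples (five tumor, five control), each run three times: 30 columns.
one_gene_30_columns = [float(i) for i in range(30)]
one_gene_10_columns = average_replicates(one_gene_30_columns, 3)

# A gene with only 3 of 10 values in class_A and all 15 values in class_B.
class_a = [2.0, None, 4.0, None, None, 6.0, None, None, None, None]
class_b = [1.0, 3.0] * 7 + [2.0]
score, n_a, n_b = signal_to_noise(class_a, class_b)
```

Returning the counts alongside the score makes it easy to flag genes whose class_A estimate rests on far fewer values than the class_B estimate, which is the situation the missing-values discussion warns about.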