To run CoGA, use the left sidebar to upload files containing the gene expression data, the phenotype labels, and a collection of gene sets. The data formats are described in detail below.
The expression data file is a tab-delimited text file that contains the gene expression levels of all samples. The first line has the following format:
Name[tab][sample 1 name][tab][sample 2 name][tab] ...
A Description field may be included as the second column; it is ignored. In that case, the first line is:
Name[tab]Description[tab][sample 1 name][tab][sample 2 name][tab] ...
Each subsequent line contains the gene (or probe set) name, the gene description (optional), and one value for each sample.
You can create the gene expression input file with spreadsheet software and then save it as a tab-delimited text file. The final file must have the *.txt extension.
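If you prefer to generate the expression file from a script rather than a spreadsheet, the layout above can be sketched in Python as follows. This is only an illustration: the function name write_expression_file, the sample names, and the toy values are hypothetical and are not part of CoGA.

```python
import csv
import io


def write_expression_file(handle, sample_names, rows, with_description=False):
    """Write an expression matrix in the tab-delimited layout described above.

    rows is an iterable of (name, description, values) tuples. The
    description is written only when with_description is True.
    (Hypothetical helper, not part of CoGA.)
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    header = ["Name"]
    if with_description:
        header.append("Description")
    writer.writerow(header + list(sample_names))
    for name, description, values in rows:
        # Every row needs exactly one value per sample.
        if len(values) != len(sample_names):
            raise ValueError(
                f"{name}: expected {len(sample_names)} values, got {len(values)}"
            )
        prefix = [name, description] if with_description else [name]
        writer.writerow(prefix + [str(v) for v in values])


# Toy data, invented for illustration only.
toy_rows = [
    ("GENE1", "first toy gene", [7.1, 6.8]),
    ("GENE2", "second toy gene", [5.2, 5.9]),
]
buffer = io.StringIO()
write_expression_file(buffer, ["S1", "S2"], toy_rows)
expression_text = buffer.getvalue()
```

To produce a real input file, pass an open file handle (for example, open("expression.txt", "w", newline="")) instead of the in-memory buffer.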
The figure below illustrates an expression input file:
If your dataset is already collapsed to gene symbols (i.e., the row identifiers are gene symbols, and each gene is represented by a single row), select the "Keep the original expression data rows" option and skip to the next section. Otherwise, select the "Collapse dataset to gene symbols" option, and then choose a method to summarize the rows (probe sets) that represent the same gene by a single representative. The collapsing methods are the same as those used in the WGCNA package (Langfelder and Horvath, 2008), which we describe below:
If the "Connectivity based collapsing" option is "True", the collapsing procedure is:
If a gene has exactly two corresponding probe sets, the row is chosen according to the selected collapsing method. If a gene has three or more corresponding probe sets, the pairwise correlations among the rows that represent the same gene are computed, and the row with the highest connectivity (that is, the row most strongly correlated with the others) is selected.
For a detailed discussion of the collapsing methods described above, see Miller et al. (2011).
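The connectivity-based procedure can be illustrated with a simplified Python sketch. It is not the WGCNA implementation: here we assume that connectivity is the plain sum of Pearson correlations between a row and the other rows of the same gene, and we use a "highest mean" rule as a stand-in for the selected collapsing method. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def max_mean(rows):
    """Stand-in collapsing method: index of the row with the highest mean."""
    return max(range(len(rows)), key=lambda i: sum(rows[i]) / len(rows[i]))


def pick_representative(rows, method=max_mean):
    """Return the index of the row chosen to represent one gene.

    Assumption: connectivity is the summed Pearson correlation with the
    other rows, which simplifies the network-based measure used by WGCNA.
    """
    if len(rows) == 1:
        return 0
    if len(rows) == 2:
        # Exactly two probe sets: defer to the selected collapsing method.
        return method(rows)

    def connectivity(i):
        return sum(
            pearson(rows[i], rows[j]) for j in range(len(rows)) if j != i
        )

    # Three or more probe sets: take the most connected row.
    return max(range(len(rows)), key=connectivity)


# Toy probe sets for one gene, invented for illustration only.
probe_rows = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 1, 3, 2]]
chosen = pick_representative(probe_rows)
```

In this toy example the first two rows are strongly correlated with each other and the third is not, so the representative is one of the first two rows.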
After selecting a collapsing method, upload an annotation file that maps the microarray probe set IDs to gene symbols. We recommend using the GSEA microarray annotation files, which are freely available at the Broad ftp site (ftp://gseaftp.broadinstitute.org/pub/gsea/annotations).
The annotation file has three columns separated by tabs. The first line has the following format:
Probe set ID[tab]Gene symbol[tab]Gene title
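As a quick check of an annotation file before uploading it, the three-column layout can be read with a few lines of Python. This is a minimal sketch under the assumption that the first line is the header shown above; the function name read_annotation and the probe IDs are hypothetical.

```python
import csv
import io


def read_annotation(handle):
    """Map probe set IDs to gene symbols from a tab-delimited annotation file.

    The header line is skipped; rows without a gene symbol are ignored.
    (Hypothetical helper, not part of CoGA.)
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    mapping = {}
    for row in reader:
        if len(row) < 2 or not row[1].strip():
            continue
        mapping[row[0]] = row[1].strip()
    return mapping


# Toy annotation content, invented for illustration only.
annotation_text = (
    "Probe set ID\tGene symbol\tGene title\n"
    "probe_001\tGENE1\tfirst toy gene\n"
    "probe_002\tGENE1\tfirst toy gene\n"
    "probe_003\t\tunannotated probe\n"
)
probe_to_gene = read_annotation(io.StringIO(annotation_text))
```

Here two probe sets map to the same gene symbol, which is exactly the situation that the collapsing step resolves.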
The figure below illustrates an annotation file:
The categorical class file identifies the phenotype class that each sample belongs to. It is a text file delimited by blank spaces. The file must contain three lines and have the *.cls extension.
The first line contains the number of samples, the number of classes (phenotypes), and the number 1, in that order. The second line has the "#" symbol followed by the class names. The third and last line contains a class label for each sample, in the same order as the sample columns of the expression file. You can use any symbol for the labels (a number, text, or the same name you have specified on the previous line). The first distinct label that appears on the third line corresponds to the first class named on the second line; the second distinct label corresponds to the second class; and so on.
Below, we show the cls file format:
[number of samples][space][number of classes][space]1
#[space][class 1 name][space][class 2 name]
[sample 1 class][space][sample 2 class][space] ... [sample N class]
Consider that you want to analyze 65 samples from the class AII and 30 samples from the class ODII. If the first columns of the gene expression matrix correspond to the AII microarrays and the last ones to the ODII microarrays, then the cls input could be:
95 2 1 # AII ODII 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Here, "0" indicates the samples belonging to class AII, and "1" indicates the samples belonging to class ODII.
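Typing a long label line by hand is error-prone, so the three lines of the cls file can also be generated. The following Python sketch assumes that the samples of each class occupy consecutive columns of the expression file, as in the example above; the function name make_cls is hypothetical.

```python
def make_cls(class_sizes):
    """Build the text of a cls file.

    class_sizes is a list of (class_name, number_of_samples) pairs, given
    in the same order as the sample columns of the expression file.
    Labels are 0, 1, 2, ... in order of appearance.
    (Hypothetical helper, not part of CoGA.)
    """
    n_samples = sum(count for _name, count in class_sizes)
    first_line = f"{n_samples} {len(class_sizes)} 1"
    second_line = "# " + " ".join(name for name, _count in class_sizes)
    third_line = " ".join(
        str(index)
        for index, (_name, count) in enumerate(class_sizes)
        for _ in range(count)
    )
    return "\n".join([first_line, second_line, third_line]) + "\n"


# The example from the text: 65 AII samples followed by 30 ODII samples.
cls_text = make_cls([("AII", 65), ("ODII", 30)])
```

Writing cls_text to a file with the *.cls extension gives the same three lines as the example above.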
The gene set database file describes the groups of genes that will be analyzed. The columns are tab delimited. Each row corresponds to a gene set. The first column describes the gene set name, the second one contains a gene set description, and the following columns contain the genes that belong to the set. The file must have the *.gmt extension and is illustrated by the figure below:
Several gene set collections in the gmt format are freely available at the Molecular Signatures Database (MSigDB) (http://www.broadinstitute.org/gsea/msigdb/index.jsp) (Subramanian et al., 2005).
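To inspect a gmt file before uploading it, the row layout described above (set name, description, then the genes) can be read with a short Python sketch. The function name read_gmt and the gene set contents are hypothetical.

```python
import io


def read_gmt(handle):
    """Read a gmt file into a dict that maps gene set names to gene lists.

    Each row is: name, description, gene 1, gene 2, ... (tab-delimited).
    Rows with fewer than three fields are skipped.
    (Hypothetical helper, not part of CoGA.)
    """
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        # Drop empty cells left by trailing tabs.
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Toy gene sets, invented for illustration only.
gmt_text = (
    "SET_A\tfirst toy set\tGENE1\tGENE2\tGENE3\n"
    "SET_B\tsecond toy set\tGENE2\tGENE4\t\n"
)
gene_sets = read_gmt(io.StringIO(gmt_text))
```

The resulting dictionary makes it easy to check, for instance, how many genes each set contains.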