WGCNA of ortholog expression for the E5 timeseries molecular project
Today I ran a preliminary WGCNA on ortholog expression for the E5 time series project.
Weighted Gene Co-expression Network Analysis (WGCNA) of Ortholog Expression Across Coral Species
Introduction
Weighted Gene Co-expression Network Analysis (WGCNA) is a systems biology method for understanding gene expression patterns and identifying modules of highly correlated genes. In this analysis, we applied WGCNA to orthologous gene expression data from three coral species to identify gene modules that respond coordinately across species, site and time points in the E5 molecular time series project.
This analysis builds upon our previous orthology identification work, which established 9,800 ortholog groups with representatives across three coral species. By examining co-expression patterns of these orthologs, we can identify conserved transcriptional programs and species-specific responses to environmental conditions.
Study Design
Species and Samples
The analysis includes RNA-seq data from three coral species representing different ecological strategies:
- Acropora pulchra (ACR): Fast-growing, branching coral (40 samples)
- Porites evermanni (POR): Slow-growing, massive coral (38 samples)
- Pocillopora tuahiniensis (POC): Intermediate growth, branching coral (39 samples)
Experimental Design
Samples were collected from Moorea, French Polynesia across four time points (TP1-TP4) representing different environmental conditions. The experiment included:
- Total samples: 117 RNA-seq libraries
- Time points: 4 developmental/environmental time points per colony
- Sites: Manava and Mahana collection sites
- Biological replicates: Multiple colonies per species per time point
Input Data
The analysis used variance-stabilized transformation (VST) normalized counts from the DESeq2 pipeline:
- Input file:
vst_counts_matrix.csv
from analysis 14-pca-orthologs - Genes: 9,800 ortholog groups (OG_XXXXX identifiers)
- Samples: 117 columns representing individual RNA-seq libraries
- Format: VST-normalized expression values across all samples
VST normalization was chosen because it stabilizes variance across the expression range, making the data more suitable for correlation-based analyses like WGCNA.
WGCNA Methodology
Network Construction Parameters
The analysis followed the standard WGCNA workflow with the following key parameters:
1. Data Preparation
- Removed genes with zero variance (only 4 genes removed, retaining 9,796 genes)
- Validated sample-metadata correspondence
- Transposed matrix for WGCNA input (samples as rows, genes as columns)
2. Soft-Thresholding Power Selection
- Network type: Signed network (distinguishes positive and negative correlations)
- Tested powers: 1-10, 12, 14, 16, 18, 20
- Selection criterion: Scale-free topology fit R² ≥ 0.8
- Chosen power: 16 (based on achieving scale-free topology)
The soft-thresholding power transforms the correlation matrix into a weighted adjacency matrix, emphasizing strong correlations while downweighting weak ones. A signed network was used to preserve biological directionality of co-expression relationships.
3. Module Detection
- Method: blockwiseModules (for computational efficiency)
- TOM type: Signed Topological Overlap Matrix
- Correlation type: Pearson
- Minimum module size: 30 genes
- Merge cut height: 0.25 (merges modules with eigengene correlation > 0.75)
- Dynamic tree cut: Used to identify initial modules
4. Trait Association
The analysis examined module correlations with three categorical traits:
- Time point: TP1, TP2, TP3, TP4
- Site: Manava, Maharepa
- Species: Acropora pulchra, Porites evermanni, Pocillopora tuahiniensis
Results
Module Identification
WGCNA identified 14 distinct co-expression modules plus one unassigned module (grey or 0):
module | n_orthologs |
---|---|
1 | 2646 |
2 | 1198 |
3 | 1144 |
4 | 967 |
5 | 884 |
6 | 876 |
0 | 757 |
7 | 706 |
8 | 250 |
9 | 84 |
10 | 63 |
11 | 63 |
12 | 62 |
13 | 50 |
14 | 46 |
Total: 9,796 genes assigned across 15 groups
Module Size Distribution
The modules show a hierarchical distribution typical of biological networks:
- Large modules (>1,000 genes): representing core biological processes
- Medium modules (500-1,000 genes): specialized but common functions
- Small modules (30-250 genes): highly specialized or condition-specific responses
- Grey module (757 genes): Unassigned genes with low network connectivity
Module-Trait Relationships
The analysis generated a comprehensive heatmap showing correlations between module eigengenes (ME) and experimental traits.
This heatmap shows each module on the y axis grouped in a dendrogram. The x axis shows the categorical variables. Each cell shows the correlation of ortholog expression in the module and the variable with non-significant associations shown without text. Red indicates positive correlation and blue indicates negative correlation.
Key findings include:
Temporal Patterns:
- Several modules showed strong time-dependent expression patterns
- Identified warm season (TP1/TP2, e.g., ME8) vs cool season (TP3/TP4, e.g., ME11) modules
Species-Specific Patterns:
- Some modules exhibited species-specific expression
- Species-divergent modules suggest lineage-specific adaptations with some showing positive correlations with 1 species in particular
Site Effects:
- Modules correlated with collection site (Manava vs Maharepa)
- Site-specific modules may reflect local environmental adaptation
Statistical Significance:
- Only correlations with p < 0.05 are displayed in the heatmap
- Benjamini-Hochberg (BH) multiple testing correction applied per trait
- Module eigengene clustering revealed related modules with similar trait associations
Hub Genes and Module Membership
The analysis calculated module membership (kME) for each gene, representing its correlation with its module eigengene. Hub genes (high kME values) are:
- Central to module function: Highly connected within their module
- Potential biomarkers: Representative of module expression patterns
- Regulatory candidates: May coordinate expression of other module members
For each module, the top 20 hub genes were identified based on kME values. These hub orthologs represent conserved regulatory nodes across all three coral species.
We can examine this data in more detail as desired.
Technical Notes
Computational Approach
- R packages: WGCNA, tidyverse, ComplexHeatmap
- Multi-threading: Enabled for improved performance
- Memory management: blockwiseModules used for efficient handling of large datasets
Quality Control
- Sample QC: goodSamplesGenes() function verified data quality (all samples passed)
- Gene filtering: Removed zero-variance genes (minimal - only 4 genes)
- Correlation method: Pearson correlation appropriate for VST-normalized data
- Missing data: All samples had complete expression profiles
Statistical Considerations
- Multiple testing correction: BH adjustment applied per trait to control false discovery rate
- Sample size: 117 samples provides adequate power for correlation analysis
- Module stability: Minimum module size of 30 ensures robust modules
- Merge criterion: 0.75 eigengene correlation threshold balances module granularity
References
The WGCNA method is described in:
Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559.
Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4:17.
Source Code
The complete analysis is documented in the R Markdown notebook:
- Script:
18-ortholog-wgcna.Rmd
- Output directory:
M-multi-species/output/18-ortholog-wgcna/