Time-series proteomics analysis pipeline using ANOVA, RM-ANOVA, LMM, and EMMeans to identify protein abundance changes under osmotic stress. Includes full statistical modeling, pairwise comparisons, and automated CSV outputs.

RimaZ1597/StatisticalModeling_TimeSeriesData

📊 Time-Series Proteomics Analysis Pipeline

Dataset: Selevsek et al., 2015 — DIA/SWATH-MS time-course protein abundance under osmotic stress

Methods: ANOVA · Repeated Measures ANOVA · LMM · LM · EMMeans · Pairwise Comparisons


Introduction

Understanding how cells modulate their proteomic composition in response to environmental challenges is a central question in systems biology. Saccharomyces cerevisiae is a well-established model for studying dynamic responses to osmotic perturbation, which involve coordinated regulation across the proteome. Selevsek et al. (2015) generated a high-resolution temporal dataset using SWATH-MS, capturing protein abundance changes at six post-treatment intervals following NaCl-induced stress.

Building on this resource, the present study applies statistical methods tailored for longitudinal data. The goal is to test whether protein expression varies significantly across timepoints and whether replicate-level variability contributes meaningfully to overall variation, thereby identifying proteins responsive to osmotic stress.

Through a comparative framework involving fixed-effects ANOVA, repeated measures ANOVA, linear mixed-effects models (LMMs), and fixed-effects linear models (LMs), the analysis illustrates how progressively flexible statistical methods address core challenges in longitudinal proteomics, guiding model selection based on variance structure and model fit.


Project Overview

This repository contains a fully reproducible time-series proteomics analysis pipeline implemented in R. The workflow analyzes protein abundance changes in Saccharomyces cerevisiae exposed to osmotic stress over six time points, using univariate statistical models and model-based pairwise comparisons.

The analysis follows the structure of a statistical proteomics report and includes:

  • Data preprocessing & transformation
  • Filtering and random protein selection
  • One-way ANOVA
  • Repeated Measures ANOVA
  • Linear Mixed-Effects Models (LMM)
  • Fallback Linear Models (LM)
  • Model selection using ICC
  • Nested LM comparison
  • Pairwise comparisons (Tukey & EMMeans)
  • Volcano plot for significant contrasts
  • Extraction of significant proteins

All intermediate files (summary tables, model results, p-value tables, EMMeans outputs, etc.) are automatically saved as CSV outputs.


Installation & Requirements

Install Required R Packages

install.packages(c(
  "tidyverse", "lme4", "lmerTest", "ez", "performance", "cluster",
  "ggplot2", "ggVennDiagram", "emmeans"
))

Ensure Files Are in the Working Directory

  • Selevsek2015_DIA_Spectronaut_annotation.csv
  • Selevsek2015.csv
  • TIME_SERIES_DATA_ANALYSIS.Rmd

Analysis Steps

1️⃣ Data Preparation

  • Read metadata & protein abundance matrix
  • Pivot to long format
  • Merge annotation info
  • Remove missing values and low-variance proteins
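The preparation steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the pipeline's actual code: the column names (`Protein`, `Run`, `Intensity`, `Time`, `BioReplicate`), the log2 transform, the row-wise removal of missing values, and the variance cutoff `min_var` are all assumptions made for the example.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the abundance matrix and the annotation file.
# All column names here are hypothetical.
abundance_wide <- data.frame(
  Protein = c("P1", "P2", "P3"),
  run_01  = c(1000, 2000, 500),
  run_02  = c(1500, 2000, NA),
  run_03  = c(4000, 2000, 800),
  run_04  = c(3500, 2000, 900)
)
annotation <- data.frame(
  Run          = c("run_01", "run_02", "run_03", "run_04"),
  Time         = c("T000", "T000", "T030", "T030"),
  BioReplicate = c("R1", "R2", "R1", "R2")
)

# Pivot to long format, attach the annotation, drop missing rows,
# and log2-transform (the transform is an assumption of this sketch).
long <- abundance_wide %>%
  pivot_longer(-Protein, names_to = "Run", values_to = "Intensity") %>%
  left_join(annotation, by = "Run") %>%
  filter(!is.na(Intensity)) %>%
  mutate(Abundance = log2(Intensity))

# Keep proteins whose log2 abundance varies by more than an arbitrary cutoff.
min_var <- 0.01
keep <- long %>%
  group_by(Protein) %>%
  summarise(v = var(Abundance), .groups = "drop") %>%
  filter(v > min_var) %>%
  pull(Protein)

prepared <- filter(long, Protein %in% keep)
```

In this toy input, `P2` is constant across runs and is removed by the variance filter, while the single missing value of `P3` removes only that row.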

2️⃣ Protein Subsampling

Random selection of 150 proteins for efficient modeling.
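A reproducible draw of this kind might look like the sketch below. The seed value and the `all_proteins` vector are placeholders for illustration; the real pipeline would draw from the protein IDs that survive filtering.

```r
# Hypothetical seed; any fixed value makes the random draw reproducible.
set.seed(42)

# Stand-in for the protein IDs that remain after filtering.
all_proteins <- paste0("P", seq_len(1000))

# Sample without replacement, guarding against fewer than 150 candidates.
n_select <- min(150, length(all_proteins))
selected <- sample(all_proteins, n_select)
```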

3️⃣ One-Way ANOVA

  • Per-protein ANOVA
  • P-value distribution summary
  • Full ANOVA table exported
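As a rough sketch of the per-protein ANOVA step, the example below fits `aov(Abundance ~ Time)` separately for each protein in simulated data and collects the F-values and p-values into one table. The column names, the two toy proteins, and the time labels other than T000 and T030 are assumptions for illustration.

```r
set.seed(1)

# Simulated long-format data: 2 proteins x 6 timepoints x 3 replicates.
times <- c("T000", "T015", "T030", "T060", "T090", "T120")
design <- expand.grid(
  Time = times, BioReplicate = c("R1", "R2", "R3"),
  Protein = c("P1", "P2"), stringsAsFactors = FALSE
)
# P1 rises steadily over time; P2 is flat apart from noise.
shift <- ifelse(design$Protein == "P1", match(design$Time, times) - 1, 0)
design$Abundance <- 20 + shift + rnorm(nrow(design), sd = 0.3)
design$Time <- factor(design$Time, levels = times)

# One-way ANOVA for a single protein; returns the Time row of the table.
anova_one <- function(d) {
  tab <- summary(aov(Abundance ~ Time, data = d))[[1]]
  data.frame(F_value = tab[["F value"]][1], p_value = tab[["Pr(>F)"]][1])
}

# Apply to every protein and stack the results.
anova_results <- do.call(rbind, lapply(split(design, design$Protein), anova_one))
anova_results$Protein <- rownames(anova_results)

# Share of proteins with p < 0.05, a simple p-value distribution summary.
share_significant <- mean(anova_results$p_value < 0.05)
```

The resulting table could then be written out with `write.csv()` to produce the exported ANOVA output.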

4️⃣ Repeated Measures ANOVA

  • Biological replicate treated as the subject, with timepoint as the within-subject factor
  • Extraction of p-values, F-values, and model tables
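A repeated measures ANOVA for one protein can be sketched with base R's `aov()` and an `Error()` term. Note that this swaps in base R for the `ez` package listed in the requirements, and it assumes the replicate is the subject and time is the within-subject factor. The data are simulated and the column names are hypothetical.

```r
set.seed(2)

# Simulated data for one protein: 6 timepoints measured in each of 3 replicates.
times <- c("T000", "T015", "T030", "T060", "T090", "T120")
d <- expand.grid(Time = times, BioReplicate = c("R1", "R2", "R3"))
d$Abundance <- 20 +
  0.5 * (as.integer(d$Time) - 1) +      # time trend
  rep(c(-0.4, 0, 0.4), each = 6) +      # replicate-specific offset
  rnorm(nrow(d), sd = 0.2)              # residual noise

# Error(BioReplicate) puts between-replicate variation in its own stratum,
# so Time is tested against the within-replicate residual.
fit <- aov(Abundance ~ Time + Error(BioReplicate), data = d)
within_tab <- summary(fit)[["Error: Within"]][[1]]

rm_result <- data.frame(
  F_value = within_tab[["F value"]][1],
  p_value = within_tab[["Pr(>F)"]][1]
)
```

With six timepoints and three replicates, the Time effect has 5 degrees of freedom and the within-stratum residual has 10, which illustrates how little residual information three replicates provide.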

5️⃣ Linear Mixed-Effects Modeling

  • LMM with random intercepts
  • Check for singular fits
  • Fallback to LM when appropriate
  • Calculate ICC and determine best-fitting model
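The LMM step with its singular-fit fallback could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: the helper `fit_one_protein()` and the column names are hypothetical, and the ICC is computed by hand from the variance components rather than through the `performance` package.

```r
library(lme4)

# Fit a random-intercept LMM for one protein; fall back to a plain LM
# when the random-effect variance is estimated at the boundary (singular fit).
fit_one_protein <- function(d) {
  lmm <- suppressMessages(
    lmer(Abundance ~ Time + (1 | BioReplicate), data = d)
  )
  if (isSingular(lmm)) {
    return(list(model = lm(Abundance ~ Time, data = d),
                type = "LM", icc = NA_real_))
  }
  # ICC = replicate variance / (replicate variance + residual variance)
  vc <- as.data.frame(VarCorr(lmm))
  var_rep <- vc$vcov[vc$grp == "BioReplicate"]
  var_res <- vc$vcov[vc$grp == "Residual"]
  list(model = lmm, type = "LMM", icc = var_rep / (var_rep + var_res))
}

# Simulated protein with a clear replicate effect.
set.seed(3)
times <- c("T000", "T015", "T030", "T060", "T090", "T120")
d <- expand.grid(Time = times, BioReplicate = c("R1", "R2", "R3"))
d$Abundance <- 20 + 0.5 * (as.integer(d$Time) - 1) +
  rep(c(-1, 0, 1), each = 6) + rnorm(nrow(d), sd = 0.2)

res <- fit_one_protein(d)
```

Here the replicate offsets are large relative to the noise, so the fit is not singular and the ICC is close to 1. For a protein with no replicate effect, the random-intercept variance is often estimated as zero and the helper returns the LM instead.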

6️⃣ Final Model Selection

  • Select LMM when ICC ≥ 0.01; otherwise fall back to LM
  • Save p-values, AIC, ICC, and chosen model
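The selection rule can be expressed as a small helper, sketched below. The function `choose_model()`, the example table, and the output file name are hypothetical. The sketch assumes the LMM is kept only when the fit is non-singular and the ICC reaches the 0.01 threshold.

```r
# Choose LMM when the fit is usable and ICC >= threshold; otherwise LM.
choose_model <- function(icc, singular, threshold = 0.01) {
  ifelse(!singular & !is.na(icc) & icc >= threshold, "LMM", "LM")
}

# Hypothetical per-protein summary collected from the modeling step.
summary_tab <- data.frame(
  Protein  = c("P1", "P2", "P3"),
  ICC      = c(0.35, 0.004, NA),
  Singular = c(FALSE, FALSE, TRUE)
)
summary_tab$Chosen <- choose_model(summary_tab$ICC, summary_tab$Singular)

# Save the selection table as CSV (written to a temp directory in this sketch).
out_file <- file.path(tempdir(), "model_selection.csv")
write.csv(summary_tab, out_file, row.names = FALSE)
```

If AIC values are also compared between the LMM and the LM, the LMM is usually refit with maximum likelihood (`REML = FALSE`) first, since REML-based criteria are not comparable across models with different fixed or random structure.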

7️⃣ Pairwise Comparisons

  • Tukey HSD for ANOVA
  • EMMeans for LMM/LM with FDR correction
  • Top proteins with the most significant contrasts
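Both kinds of pairwise comparison can be sketched for a single protein as follows. The data are simulated, the column names are hypothetical, and the EMMeans part uses a plain `lm` fit for simplicity. Note also that `adjust = "fdr"` corrects within the 15 contrasts of one model; whether the pipeline corrects within or across proteins is not stated here.

```r
library(emmeans)

set.seed(4)
times <- c("T000", "T015", "T030", "T060", "T090", "T120")
d <- expand.grid(Time = times, BioReplicate = c("R1", "R2", "R3"))
d$Abundance <- 20 + 0.8 * (as.integer(d$Time) - 1) + rnorm(nrow(d), sd = 0.3)

# Tukey HSD on the one-way ANOVA fit: all 15 pairwise timepoint differences.
tukey <- TukeyHSD(aov(Abundance ~ Time, data = d))$Time
tukey_tab <- data.frame(contrast = rownames(tukey), tukey, row.names = NULL)

# Estimated marginal means per timepoint, then pairwise contrasts
# with Benjamini-Hochberg (FDR) adjustment.
emm <- emmeans(lm(Abundance ~ Time, data = d), ~ Time)
emm_tab <- as.data.frame(pairs(emm, adjust = "fdr"))

# Number of significant contrasts, usable for ranking proteins.
n_significant <- sum(emm_tab$p.value < 0.05)
```

The two tables report differences with opposite sign conventions: `TukeyHSD()` labels contrasts as later minus earlier level (for example `T015-T000`), while `pairs()` reports earlier minus later (`T000 - T015`).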

8️⃣ Volcano Plot

Contrast: T030 vs T000

  • log2FC calculated
  • FDR-adjusted p-values
  • Red = significant proteins
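A volcano plot of this kind might be built as in the sketch below. It assumes abundances are already on the log2 scale, so that log2FC is simply the difference of group means. The data are simulated, and a two-sample t-test stands in for the model-based contrast p-values the pipeline would actually use.

```r
library(ggplot2)

set.seed(5)
n <- 150

# Simulated log2 abundances: 3 replicates per protein at each timepoint.
# The last 50 proteins are shifted upward at T030.
effect <- rep(c(0, 1.5), c(100, 50))
t000 <- matrix(rnorm(n * 3, mean = 20, sd = 0.3), nrow = n)
t030 <- matrix(rnorm(n * 3, mean = 20, sd = 0.3), nrow = n) + effect

volcano_tab <- data.frame(
  Protein = paste0("P", seq_len(n)),
  log2FC  = rowMeans(t030) - rowMeans(t000),
  # Stand-in test; the pipeline's contrasts come from the fitted models.
  p_value = sapply(seq_len(n), function(i)
    t.test(t030[i, ], t000[i, ], var.equal = TRUE)$p.value)
)
volcano_tab$p_adj <- p.adjust(volcano_tab$p_value, method = "BH")
volcano_tab$significant <- volcano_tab$p_adj < 0.05

# Significant proteins in red, all others in grey.
volcano <- ggplot(volcano_tab,
                  aes(x = log2FC, y = -log10(p_adj), colour = significant)) +
  geom_point() +
  scale_colour_manual(values = c(`FALSE` = "grey60", `TRUE` = "red")) +
  labs(x = "log2 fold change (T030 vs T000)",
       y = "-log10 FDR-adjusted p-value")
```

Calling `print(volcano)` or `ggsave()` renders the figure; the underlying `volcano_tab` can be exported as CSV alongside it.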

Output Summary

The pipeline generates:

Statistical Outputs

  • ANOVA p-values and full tables
  • RM ANOVA p-values and F-values
  • LMM vs LM model selection
  • ICC values for repeated measures
  • Nested LM comparison results

Pairwise Results

  • Tukey pairwise tables
  • EMMeans contrast tables
  • Top proteins by significant timepoint changes

Visualizations

  • Venn diagram of significant proteins
  • Model usage heatmap
  • Volcano plot (T030 vs T000)

Result Discussion

  • This study presents a statistical evaluation of time-dependent proteomic changes in S. cerevisiae under osmotic stress, using a repeated-measures design and a layered modeling framework.

  • Initial one-way ANOVA detected significant timepoint effects in 52% of proteins, rejecting for those proteins the null hypothesis that abundance remains constant over time. However, this model treats all observations as independent and does not account for within-subject correlation.

  • Repeated measures ANOVA improved on this by modeling intra-subject variation but failed to detect strong replicate effects. The assumption of compound symmetry and limited power due to only three biological replicates likely contributed to its reduced sensitivity.

  • Linear mixed-effects models (LMMs) offered the most robust analysis, capturing both fixed time effects and random replicate-level variation.

  • Among 47 proteins modeled with LMMs, 89% showed significant temporal changes, with moderate ICC values confirming replicate-specific contributions. For proteins where replicate effects were negligible or non-estimable, fixed-effects linear models (LMs) served as a fallback, identifying significant timepoint effects in 66% of remaining cases.

  • Overall, 71% of proteins showed significant time-dependent expression under at least one model.

  • This underscores the dynamic nature of the proteome in response to stress and validates the use of a model selection strategy guided by ICC and AIC.

  • Future work should consider expanding the number of biological replicates and applying more flexible or hierarchical Bayesian models to better quantify subject-level variance and enhance inference reliability.
