Research Projects

Principled Inference, Statistics, and Graph Neural Networks

Graph Neural Networks (GNNs) extend deep learning to graph-structured data and have become a core tool for learning on relational structures. Despite their rapid adoption, the statistical properties and limitations of GNNs remain poorly understood, especially in scientific domains where reproducibility, interpretability, and robustness are critical.

Figure: A Graph Neural Network block, consisting of an aggregation (convolution) step and a transformation step, repeated to propagate information through the graph.
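The aggregation and transformation steps can be made concrete with a small sketch. The code below is a toy illustration only, not the architecture used in this work: it assumes mean aggregation with a self-loop, a linear map followed by a ReLU, and hypothetical names (`gnn_layer`, `adj`, `feats`, `weight`) chosen for the example.

```python
def gnn_layer(adj, feats, weight):
    """One message-passing block: mean-aggregate neighbours, then transform.

    adj    : dict mapping node -> list of neighbour nodes (toy format, assumed)
    feats  : dict mapping node -> list of floats (node features)
    weight : list of rows with shape (d_in, d_out), a hypothetical learned matrix
    """
    d_in, d_out = len(weight), len(weight[0])
    out = {}
    for node, neighbours in adj.items():
        # Aggregation: average features over the node itself and its neighbours.
        group = [node] + list(neighbours)
        agg = [sum(feats[v][i] for v in group) / len(group) for i in range(d_in)]
        # Transformation: linear map followed by a ReLU nonlinearity.
        out[node] = [
            max(0.0, sum(agg[i] * weight[i][j] for i in range(d_in)))
            for j in range(d_out)
        ]
    return out


# Toy path graph 0 - 1 - 2 with two features per node.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

h1 = gnn_layer(adj, feats, [[1.0, -1.0], [0.5, 1.0]])
# Stacking a second block lets information travel two hops.
h2 = gnn_layer(adj, h1, [[1.0], [1.0]])
```

Each call to `gnn_layer` mixes information from one-hop neighbours, so stacking two blocks lets node 0 see features that originated at node 2.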

Our work aims to develop a principled foundation for GNNs — moving from heuristic architectures to models with performance guarantees, clear interpretability, and reliable uncertainty quantification. This involves:

Uncertainty Quantification: Designing methods to measure the robustness and confidence of GNN predictions, especially under noise or limited labels.
Theoretical Analysis: Studying bias–variance tradeoffs, oversmoothing, and topology-dependent performance.
Model Selection: Developing statistically sound cross-validation and tuning procedures tailored to graph-structured data.
Interpretability: Creating tools to link model outputs to causal or biologically meaningful structures.
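As a toy illustration of the oversmoothing phenomenon mentioned above, the sketch below repeatedly applies mean aggregation (with self-loops, no learned weights) on a small path graph and tracks how far apart the node values remain. The graph, the function names `propagate` and `spread`, and the number of rounds are all assumptions made for the example; it is not the analysis carried out in this work.

```python
def propagate(adj, x):
    """One round of mean aggregation over each node and its neighbours."""
    return {
        v: (x[v] + sum(x[u] for u in nbrs)) / (1 + len(nbrs))
        for v, nbrs in adj.items()
    }


def spread(x):
    """Gap between the largest and smallest node value."""
    return max(x.values()) - min(x.values())


# Path graph 0 - 1 - 2 - 3 with a single informative node.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}

spreads = []
for _ in range(50):
    x = propagate(adj, x)
    spreads.append(spread(x))

# The spread shrinks toward zero: after many rounds the nodes are
# nearly indistinguishable, which is the oversmoothing effect.
print(spreads[0], spreads[-1])
```

Because each round replaces a node's value with a local average, the gap between the largest and smallest value can never grow, and on a connected graph it decays geometrically with depth.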

Structured Estimation and Graph-Constrained Models

Many high-dimensional estimation problems involve latent structure — such as sparsity, grouping, or alignment with a known network — that can be leveraged for more accurate and interpretable inference. We design algorithms for structured dimension reduction and graph-constrained matrix factorization, with provable error bounds and direct applications to biological datasets.

Our methods exploit graph topology or external meta-information to improve estimation of canonical directions, low-rank structure, or latent factors, with a focus on making results reproducible and trustworthy.
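One common way to write a graph-constrained factorization, given here as a generic illustration rather than the specific estimators developed in this work, is to add a Laplacian smoothness penalty to a low-rank fit. All symbols are assumptions introduced for the example: $X \in \mathbb{R}^{n \times p}$ is the data matrix, $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{p \times r}$ are the factors, $A$ is the symmetric adjacency matrix of a known graph over the $n$ rows, $L = D - A$ is its Laplacian, and $\lambda \ge 0$ is a tuning parameter.

```latex
\min_{U, V} \; \| X - U V^\top \|_F^2 + \lambda \, \operatorname{tr}\!\left( U^\top L U \right),
\qquad
\operatorname{tr}\!\left( U^\top L U \right)
  = \frac{1}{2} \sum_{i, j} A_{ij} \, \| u_i - u_j \|_2^2 ,
```

where $u_i$ denotes the $i$-th row of $U$. The penalty is small when nodes joined by an edge have similar latent representations, which is how the graph topology enters the estimate.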

Multimodal Data Integration and Uncertainty Quantification

Biological and environmental datasets increasingly combine diverse data modalities — genomics, transcriptomics, metabolomics, imaging, and environmental covariates — linked through shared samples or spatial context. We develop statistical frameworks such as sparse canonical correlation analysis (CCA), regularized multivariate regression, and probabilistic graphical models to integrate these heterogeneous datasets.
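For reference, a standard textbook formulation of sparse CCA is sketched below; it is not necessarily the variant used in this work. The symbols are assumptions introduced for the example: $\hat\Sigma_{XX}$, $\hat\Sigma_{YY}$, and $\hat\Sigma_{XY}$ are sample covariance and cross-covariance matrices of two modalities measured on shared samples, and $c_1, c_2 > 0$ control the sparsity of the canonical directions $u$ and $v$.

```latex
\max_{u \in \mathbb{R}^{p},\; v \in \mathbb{R}^{q}} \; u^\top \hat\Sigma_{XY} \, v
\quad \text{subject to} \quad
u^\top \hat\Sigma_{XX} \, u \le 1, \;\;
v^\top \hat\Sigma_{YY} \, v \le 1, \;\;
\| u \|_1 \le c_1, \;\;
\| v \|_1 \le c_2 .
```

The $\ell_1$ constraints push most entries of $u$ and $v$ to zero, so each canonical direction involves only a small set of features from each modality.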

A major emphasis is on uncertainty quantification: providing confidence bounds for estimated relationships so downstream conclusions are statistically sound. This is particularly important in applications where experimental validation is costly or time-consuming.
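As a minimal sketch of what a confidence bound for an estimated relationship can look like, the code below computes a percentile bootstrap interval for a Pearson correlation between two variables measured on shared samples. It is a generic illustration under strong assumptions (independent samples, synthetic data, hypothetical function names `pearson` and `bootstrap_ci`), not the procedures developed in this work; dependent or graph-structured data would call for a resampling scheme that respects that structure.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling paired observations."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic paired measurements (assumed data, for illustration only).
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(40)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

lo, hi = bootstrap_ci(xs, ys)
print(f"approximate 95% interval for the correlation: [{lo:.3f}, {hi:.3f}]")
```

Resampling whole sample indices keeps each pair of measurements together, which is what makes the interval reflect uncertainty in the relationship rather than in each variable separately.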

Applications

Thermotolerance in photosynthetic microbes: Integrating genomic, transcriptomic, and metabolomic data from cyanobacteria and Chlamydomonas to predict optimal growth temperatures and identify molecular mechanisms of heat adaptation.
Family networks and child welfare: Modeling kinship structures as complex networks to study how family configurations influence outcomes in protective custody and child services.
Spatial transcriptomics: Applying GNNs and structured estimation to detect spatial patterns in gene expression, uncover cell–cell interactions, and map tissue organization.
Microbial community modeling: Using graph-based latent variable models to study associations in marine and host-associated microbiomes, accounting for spatial and environmental structure.

Selected Past Projects

COVID-19 Modeling

Statistical modeling for pooled testing strategies under correlation and heterogeneity.
Modeling the effect of variability in the reproduction number on epidemic forecasts.
Integrating testing data with surveys to assess transmission risk at live events.
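To give a sense of the pooled-testing problem, the sketch below implements the classical two-stage (Dorfman) baseline, in which a pool is tested once and every member is retested individually only if the pool is positive. It assumes independent infections and a single common prevalence, which is exactly the simplification that the work on correlation and heterogeneity relaxes; the function names and the search range are hypothetical choices for the example.

```python
def expected_tests_per_person(prevalence, pool_size):
    """Expected number of tests per person under two-stage pooling.

    Assumes independent infections with a common prevalence.
    """
    if pool_size == 1:
        return 1.0  # No pooling: one test per person.
    # One pooled test shared by the group, plus individual retests
    # whenever at least one member of the pool is positive.
    prob_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + prob_pool_positive


def best_pool_size(prevalence, max_size=32):
    """Pool size in 1..max_size minimising the expected tests per person."""
    return min(
        range(1, max_size + 1),
        key=lambda k: expected_tests_per_person(prevalence, k),
    )


for p in (0.01, 0.05, 0.20):
    k = best_pool_size(p)
    print(p, k, round(expected_tests_per_person(p, k), 3))
```

Under these assumptions pooling pays off at low prevalence and loses its advantage as prevalence rises, since most pools then test positive and trigger retests.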

Neuroscience

I have worked on several projects in brain connectomics and neuroimaging, developing statistical tools for analyzing functional MRI and other multimodal brain data. These include methods for:

Connectome inference: Developing Bayesian and variational models to infer brain connectivity from noisy, high-dimensional fMRI data.
Network dynamics: Quantifying structural and functional changes in brain networks over time, using graph signal processing and hierarchical clustering methods.
Variability studies: Participating in large-scale reproducibility analyses, including the Nature study on variability in neuroimaging analyses across teams.

Claire Donnat