Recent advances in single cell analysis have enabled the genome-wide measurement of DNA sequence, RNA expression, methylation and protein abundance for thousands to hundreds of thousands of individual cells. Methods that capture or infer spatial or temporal information add further context, creating a detailed, cell-level picture of gene activity and function. Although genome-wide single cell profiling provides unprecedented insights into the biology of complex tissues, analyzing such data on a gene-by-gene basis is extraordinarily challenging because the large number of tested hypotheses lowers statistical power and complicates interpretation. While a gene-level analysis presents similar problems for bulk tissue data, the issues are magnified for single cell data due to increased noise, inflated zero counts and multi-modal distributions. One promising approach for addressing these challenges is gene set testing, or pathway analysis.
Originally developed for bulk tissue data, gene set testing is a hypothesis aggregation method that leverages prior knowledge regarding the functional relationships between genes to test a smaller number of more biologically meaningful hypotheses and thereby improve interpretation, replication and statistical power. By combining the single cell measurements for all genes in a set or pathway, gene set testing can also decrease variance and mitigate the impact of sparsity and multi-modal distributions. Unfortunately, statistical and biological differences between single cell and bulk genomic data make it challenging to use gene set collections and testing methods developed for bulk tissue on single cell data. Despite the motivation for customized techniques, little work has been done to develop gene set testing methods or collections that are specialized for single cell data and, for some important applications such as the analysis of dynamic processes, relevant methods do not yet exist. We will address these limitations by developing a suite of gene set testing methods optimized for the characteristics of single cell gene expression data. To validate these methods, we will use them to identify pathways associated with the development and function of tissue-resident memory CD8 T cells.