Research Overview

Research in the Sartor Lab interfaces with high-throughput statistical analysis and bioinformatics methods to interpret big data with respect to biological knowledge. The laboratory focuses on developing methods and tools for analysis of genome-wide regulatory and epigenomic data, especially applied to cancer and environmental health research.

Recent cancer studies have focused on tumor subtypes and markers of prognosis for HPV-related head and neck cancers. Other recent studies involve developing methods for the analysis and/or interpretation of ChIP-seq data and genome-wide DNA methylation/hydroxymethylation data, and improving our ability to interpret the biological functionality of results. We continue research on understanding regulatory and epigenomic changes in the context of cellular pathways, and on omics data integration.

Gene set enrichment (GSE) methods

Gene set enrichment (GSE) methods facilitate the interpretation of high-throughput experiments, helping investigators bridge the gap from long lists of genes or genomic regions to an understanding of which biological pathways and processes are regulated, and how, from a systems-level perspective.
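As an illustration of the general idea, the simplest form of gene set enrichment is an over-representation test: given a list of "hit" genes, ask whether a pathway's genes appear among them more often than chance would predict. The sketch below uses a one-sided hypergeometric test. It is a generic textbook approach and is not the method implemented in any of the lab's tools; the function name `enrichment_p_value` and the gene identifiers are hypothetical.

```python
from math import comb


def enrichment_p_value(hit_genes, gene_set, universe):
    """One-sided hypergeometric p-value for over-representation.

    Returns P(overlap >= observed) when len(hits) genes are drawn at
    random, without replacement, from the universe of tested genes.
    """
    universe = set(universe)
    hits = set(hit_genes) & universe      # ignore genes outside the universe
    members = set(gene_set) & universe
    overlap = len(hits & members)
    total, n_members, n_hits = len(universe), len(members), len(hits)

    # Count the draws with at least `overlap` gene-set members.
    tail = sum(
        comb(n_members, i) * comb(total - n_members, n_hits - i)
        for i in range(overlap, min(n_members, n_hits) + 1)
    )
    return tail / comb(total, n_hits)


# Toy example with made-up gene names: 20 tested genes, 5 hits,
# all of which fall in a 5-gene pathway.
universe = [f"g{i}" for i in range(20)]
p = enrichment_p_value(universe[:5], universe[:5], universe)
print(f"p = {p:.2e}")
```

A test like this treats every gene as equally likely to be a hit, which is one reason more specialized methods exist for different data types.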

Sets of genomic regions may derive from a variety of sources, including ChIP-seq, ATAC-seq, GWAS SNPs, repetitive regions, or copy number variations. The Sartor Lab has developed multiple methods and tools for GSE analysis: ConceptGen and LRpath for gene expression data, and ChIP-Enrich, Broad-Enrich, and Poly-Enrich for sets of genomic regions. ProxReg complements GSE analysis by testing proximity of regions to enhancers or transcription start sites (TSSs).
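To make the notion of proximity concrete, the sketch below computes the distance from a genomic region to its nearest TSS. It covers only the distance calculation, under simplifying assumptions (one chromosome, region represented by its midpoint, TSS positions already sorted); it is not ProxReg's statistical test, and the function name and coordinates are hypothetical.

```python
from bisect import bisect_left


def distance_to_nearest_tss(region_start, region_end, sorted_tss):
    """Distance in bp from a region's midpoint to the nearest TSS.

    Assumes all coordinates lie on the same chromosome and that
    `sorted_tss` is sorted in ascending order.
    """
    if not sorted_tss:
        raise ValueError("need at least one TSS position")
    midpoint = (region_start + region_end) // 2
    i = bisect_left(sorted_tss, midpoint)
    # Only the TSSs immediately left and right of the midpoint can be nearest.
    neighbours = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(midpoint - tss) for tss in neighbours)


# Made-up TSS positions and regions, for illustration only.
tss_positions = [100, 500, 1000]
for start, end in [(480, 520), (700, 720), (2000, 2100)]:
    print((start, end), distance_to_nearest_tss(start, end, tss_positions))
```

The binary search keeps each lookup logarithmic in the number of TSSs, which matters when many thousands of regions are annotated.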

  • We’ve learned that the optimal method for genomic regions depends on several properties of the data (e.g. number, size, and strength of regions).

  • We’re also studying various approaches to define enhancer regions and assign enhancer-target gene pairs, using gold standard data to optimize the process.

  • Current lab projects involve pathway analysis for single cell (or spatially resolved) molecular data, and studying the regulatory consequences of DNA methylation changes.

Head and Neck squamous cell carcinomas (HNSCC)

Our main biological focus is cancer, specifically head and neck squamous cell carcinomas (HNSCC), with a particular focus on human papillomavirus (HPV)-associated oropharyngeal and oral cavity cancers. Oropharyngeal cancers are interesting because they represent one of the few tumor sites where we can study molecular differences between chemically induced and virally induced tumors (i.e., due to smoking versus HPV).

In collaboration with the Rozek Lab, we study the differences in molecular signatures (genetic, transcriptomic, and epigenomic) between HPV-positive and HPV-negative oropharyngeal and oral cavity tumors, with the goal of identifying novel biomarkers for prognosis and/or treatment. We were the first to characterize two main subtypes of HPV-positive oral cancers (IMU and KRT) in terms of pathways, copy number alterations (CNAs), mutations, and HPV gene expression.

We found that most of the differences can be explained by HPV integration into the host genome (see Figure 2 for associations with HPV integration status), and that HPV integration status appears to be a marker of prognosis, with patients lacking HPV integration having better survival. We are now beginning to focus on the role of the shorter isoforms of the HPV E6 oncogene, E6*, in HPV-positive cancers (Figure 3) and on other potential molecular markers for precision therapy.

In addition, we are studying the progression characteristics of oral dysplasias, prediction of risk based on multi-factorial genomic/clinical variables, and the early molecular events leading to oral cancer.


Epigenetics is defined as the study of heritable traits that arise from mechanisms other than the DNA sequence itself. Epigenomics studies marks such as DNA methylation and modifications to histone tails across the genome. Over the past few years, the field of epigenomics has been revolutionized by a myriad of new high-throughput approaches to assess genome-wide epigenetic marks. These technologies have led to several exciting discoveries of aberrant epigenetic marks in diseases such as cancers, where important epigenetic events often occur early in the carcinogenic process. The marks of interest include hydroxymethylation (5hmC), which is the first step in the DNA demethylation pathway.

We are studying what DNA methylation and 5hmC signatures can tell us about tumors and clinical outcomes. Our lab developed an R package, methylSig, to test for differentially methylated CpGs and regions in whole-genome or reduced representation bisulfite sequencing experiments, and a program, mint, to integrate methylation and 5hmC data.
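As a rough illustration of what testing a single CpG for differential methylation involves, the sketch below converts bisulfite read counts to per-sample methylation levels and compares two groups with a Welch t-statistic. This is a deliberately simple stand-in and is not the statistical model used by methylSig; the function names and read counts are hypothetical, and the statistic is not converted to a p-value here.

```python
from math import sqrt
from statistics import mean, variance


def methylation_levels(counts):
    """Convert (methylated_reads, total_reads) pairs to proportions.

    Samples with zero coverage at this CpG are skipped.
    """
    return [meth / total for meth, total in counts if total > 0]


def welch_t(group_a, group_b):
    """Return (mean difference, Welch t-statistic) for one CpG.

    Assumes at least two covered samples per group and that the two
    groups do not both have zero variance.
    """
    a, b = methylation_levels(group_a), methylation_levels(group_b)
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two covered samples per group")
    diff = mean(a) - mean(b)
    # Standard error built from the between-sample variance in each group.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff, diff / se


# Made-up read counts for one CpG in three tumors and three controls.
tumor = [(8, 10), (9, 10), (7, 10)]
control = [(2, 10), (3, 10), (1, 10)]
diff, t = welch_t(tumor, control)
print(f"difference = {diff:.2f}, t = {t:.2f}")
```

Working from per-sample proportions weights every sample equally regardless of read depth, which is one limitation a count-based model can address.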

A current project aims to understand which epigenetic changes drive transcriptional changes at the gene and pathway level. Beyond cancer, exposure to environmental chemicals can also modify the epigenome, and several such exposures are associated with increased risk of disease later in life. In collaboration with the Dolinoy Lab, we are studying the effects of early-life exposure to chemicals such as lead (Pb), DEHP, and Bisphenol A (BPA) on the epigenome and transcriptome.

Analysis of replicated ChIP-seq or ATAC-seq data

Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is the standard method for identifying in vivo protein-DNA interaction or histone modification sites on a genome-wide scale. ATAC-seq identifies regions of open chromatin.

Both ChIP-seq and ATAC-seq data result in read pile-ups, called peaks, that represent the regions of interest. While early ChIP-seq peak calling programs reported the statistical significance of each region being bound by the protein of interest, little effort was devoted to assessing biological variance or to prioritizing the peaks based on replicates or on the binding profile relative to external annotations. Biological variation is extremely important when performing differential analyses, such as comparing histone modification profiles between groups of individuals.
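To illustrate what a read pile-up looks like computationally, the sketch below finds contiguous runs of positions whose read depth meets a threshold. It is a toy example: real peak callers model the background statistically and, as noted above, should account for replicates. The function name `call_peaks` and the coverage values are hypothetical.

```python
def call_peaks(coverage, min_depth):
    """Return half-open (start, end) intervals where depth >= min_depth.

    `coverage` is a list of per-base read depths along one chromosome.
    """
    peaks, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos                      # a pile-up begins
        elif depth < min_depth and start is not None:
            peaks.append((start, pos))       # the pile-up ended before pos
            start = None
    if start is not None:                    # pile-up runs to the end
        peaks.append((start, len(coverage)))
    return peaks


# Made-up per-base coverage for a ten-base window.
coverage = [0, 1, 5, 6, 7, 1, 0, 8, 9, 0]
print(call_peaks(coverage, min_depth=5))
```

A fixed threshold applied to one sample says nothing about how reproducible a peak is across individuals, which is the gap that replicate-aware analysis is meant to fill.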

We developed the Peak-finding and Prioritization (PePr) pipeline, which accounts for variation among biological samples and learns information from neighboring regions. Potential research projects include taking into account the location of peaks relative to genic and regulatory annotations, using enhancer annotations and gene target predictions, and extending the approach to single cell data.