Software

The Sartor lab develops bioinformatics methods and tools for the analysis of high-throughput genome-wide regulatory and epigenomics data, and focuses on understanding the biological/clinical significance of results.

Sartor Lab GitHub: https://github.com/sartorlab

GSE Suite

Our GSE Suite is a one-stop shop containing methods and resources for gene set enrichment testing for both sets of genomic regions and gene expression data. We now have a comprehensive tool to link regulatory genomic regions (e.g. peaks in enhancers) to their target genes using a variety of methods and databases. Our default approach has been carefully optimized for gene set enrichment testing by objectively ranking over a thousand different approaches.

ChIP-Enrich, Poly-Enrich, and proxReg: Gene Set Enrichment Testing for large sets of genomic regions

ChIP-Enrich and Poly-Enrich test for enrichment of biological pathways in large sets of narrow genomic regions, such as ChIP-seq or ATAC-seq peaks, repetitive region families, etc. (For broad genomic regions, see Broad-Enrich instead.) proxReg is a complementary tool that tests whether the regulation of a gene set tends to come mainly from promoter regions or from enhancer regions. Users can choose among several types of gene sets, including Gene Ontology, KEGG, MeSH, MSigDB, and others. Using an uploaded input file, ChIP-Enrich and Poly-Enrich assign genomic regions to genes based on a chosen "locus definition". The "locus" of a gene is the region from which the gene is predicted to be regulated. We are now adding smart enhancer-gene target links, which we’ve shown perform better than simply assigning each genomic region to the gene with the nearest TSS. ChIP-Enrich uses a logistic regression model to test for association between the presence of at least one peak in a gene and gene set membership, while Poly-Enrich uses a negative binomial regression model to test for association between the number of peaks in a gene and gene set membership. Both empirically adjust for the relationship between the length of the loci (and optionally mappability) and the outcome using a cubic smoothing spline term within the model. Poly-Enrich can also take weighted, or scored, genomic regions. Output includes summary plots, peak-to-gene assignments, and enrichment (and depletion) results including odds ratio, p-value, and FDR for each gene set.
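
To make the simplest locus definition concrete, the following Python sketch assigns each peak to the gene with the nearest TSS and then derives the two per-gene outcomes described above: a presence indicator (the ChIP-Enrich outcome) and a peak count (the Poly-Enrich outcome). It is an illustration only, not the tools' implementation: the function name, gene names, and coordinates are hypothetical, peaks are represented by their midpoints as an assumption, and the regression and spline adjustment steps are not shown.

```python
from bisect import bisect_left
from collections import Counter

def assign_peaks_to_nearest_tss(peaks, tss_by_chrom):
    """Assign each peak to the gene whose TSS is nearest to the peak midpoint.

    peaks: list of (chrom, start, end)
    tss_by_chrom: {chrom: list of (tss_position, gene_name), sorted by position}
    Peaks on chromosomes without annotated genes are skipped.
    """
    assignments = []
    for chrom, start, end in peaks:
        tss_list = tss_by_chrom.get(chrom, [])
        if not tss_list:
            continue
        mid = (start + end) // 2
        positions = [pos for pos, _ in tss_list]
        i = bisect_left(positions, mid)
        # Only the TSSs immediately left and right of the midpoint can be nearest.
        candidates = tss_list[max(i - 1, 0):i + 1]
        _, gene = min(candidates, key=lambda t: abs(t[0] - mid))
        assignments.append((chrom, start, end, gene))
    return assignments

# Hypothetical toy annotation and peaks.
tss_by_chrom = {
    "chr1": [(1000, "GENE_A"), (5000, "GENE_B"), (9000, "GENE_C"), (20000, "GENE_D")],
}
peaks = [
    ("chr1", 900, 1100),
    ("chr1", 1200, 1400),
    ("chr1", 4800, 5000),
    ("chr1", 8000, 8200),
    ("chrX", 10, 20),  # no genes annotated on chrX in this toy example
]

assignments = assign_peaks_to_nearest_tss(peaks, tss_by_chrom)
all_genes = [gene for tss in tss_by_chrom.values() for _, gene in tss]

# Poly-Enrich-style outcome: number of peaks per gene.
counts = Counter(gene for *_, gene in assignments)
peak_counts = {gene: counts.get(gene, 0) for gene in all_genes}

# ChIP-Enrich-style outcome: 1 if the gene has at least one peak, else 0.
has_peak = {gene: int(n > 0) for gene, n in peak_counts.items()}
```

In a full analysis these per-gene outcomes would be modeled against gene set membership, with locus length entering through the smoothing spline term.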

Broad-Enrich: Gene Set Enrichment Testing for large sets of broad genomic regions

Broad-Enrich tests sets of broad genomic regions (e.g., from ChIP-seq data for histone modifications or copy number variations) for enriched biological pathways, Gene Ontology terms, or other gene sets. The pre-defined gene sets are the same as used in LRpath, and can be browsed here. Using an input .bed, .narrowPeak, or .broadPeak file, Broad-Enrich determines the proportion of each gene locus covered by a peak, using a chosen "gene locus definition". The "locus" of a gene is the region from which the gene is predicted to be regulated. Broad-Enrich uses a logistic regression model to test for association between the proportion of each gene locus covered by a peak and gene set membership. It empirically adjusts for the bias due to locus length using a binomial cubic smoothing spline within the logistic model. Output includes summary plots, peak-to-gene assignments, and enrichment (and depletion) results including odds ratio, p-value, and FDR for each gene set.
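
The quantity Broad-Enrich works from, the proportion of a gene locus covered by peaks, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's code: the function names and coordinates are hypothetical, intervals are taken to be half-open as in BED files, and overlapping peaks are merged first so that no base is counted twice.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_proportion(locus, peaks):
    """Return the fraction of the locus (start, end) covered by any peak."""
    locus_start, locus_end = locus
    covered = 0
    for start, end in merge_intervals(peaks):
        overlap = min(end, locus_end) - max(start, locus_start)
        if overlap > 0:
            covered += overlap
    return covered / (locus_end - locus_start)

# Hypothetical locus of length 1000 with three peaks, two of which overlap each other.
locus = (1000, 2000)
peaks = [(900, 1200), (1100, 1300), (1800, 2500)]
proportion = covered_proportion(locus, peaks)  # 300 + 200 covered bases out of 1000
```

Computed for every gene, this proportion would serve as the per-gene value that the logistic model relates to gene set membership.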

Annotatr: Annotation of Genomic Regions to Genomic Annotations

The annotatr Bioconductor package provides an easy way to summarize and visualize the intersection of genomic sites/regions with genomic annotations. Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting genomic annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), or regulatory sequences such as enhancers.
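
The core idea, intersecting regions with annotations and summarizing the result, can be illustrated with a short Python sketch. annotatr itself is an R/Bioconductor package, so nothing below reflects its API: the function names, annotation labels, and coordinates are hypothetical, and intervals are assumed to be half-open.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def annotate_regions(regions, annotations):
    """Return (region, annotation_type) pairs for every overlap found.

    regions: list of (chrom, start, end)
    annotations: list of (chrom, start, end, annotation_type)
    A region overlapping several annotations yields several pairs.
    """
    hits = []
    for chrom, start, end in regions:
        for a_chrom, a_start, a_end, a_type in annotations:
            if chrom == a_chrom and overlaps(start, end, a_start, a_end):
                hits.append(((chrom, start, end), a_type))
    return hits

# Hypothetical toy annotations and regions.
annotations = [
    ("chr1", 0, 1000, "promoter"),
    ("chr1", 500, 1500, "CpG island"),
    ("chr1", 1000, 4000, "intron"),
    ("chr2", 0, 1000, "enhancer"),
]
regions = [
    ("chr1", 900, 1100),   # spans promoter, CpG island, and intron
    ("chr1", 2000, 2100),  # intron only
    ("chr2", 5000, 5100),  # overlaps nothing
]

hits = annotate_regions(regions, annotations)
summary = Counter(a_type for _, a_type in hits)
```

The nested loop is quadratic and only suitable for toy input; real data would call for an interval tree or a sorted sweep.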

PePr: Peak Prioritization Pipeline

PePr is a Python-based analysis pipeline for ChIP-seq experiments with biological replicates. The program accounts for the variation among biological replicates using a negative binomial model, and uses local information to improve estimates of variance. It uses a novel between-sample normalization strategy to account for variable antibody efficiency, and post hoc steps to increase peak resolution and reduce false positives. It can be used either to detect histone modifications or transcription factor binding relative to control data, or for two-group comparisons (i.e., differential binding). With PePr, users do not need to call peaks separately in each sample first; the differential peak calling is all performed in one analysis.

Methylation Integration (Mint) Pipeline

The mint pipeline analyzes single-end reads coming from sequencing assays measuring DNA methylation and hydroxymethylation. The pipeline analyzes reads from both bisulfite-converted assays such as WGBS and RRBS, and from pulldown assays such as MeDIP-seq, hMeDIP-seq, and hMeSeal. Moreover, with data measuring both 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC), the mint pipeline integrates the two data types to classify genomic regions as 5mC, 5hmC, a mixture, or neither.
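
The four-way classification can be pictured with a deliberately simplified Python sketch. It assumes each region already carries two yes/no calls, one per mark, which is a strong simplification: how mint derives and combines its signals is not described here, and the function name, region identifiers, and calls below are all hypothetical.

```python
def classify_region(has_5mc, has_5hmc):
    """Label a region from two boolean calls (assumed inputs, one per mark)."""
    if has_5mc and has_5hmc:
        return "mixture"
    if has_5mc:
        return "5mC"
    if has_5hmc:
        return "5hmC"
    return "neither"

# Hypothetical per-region calls: (5mC signal present, 5hmC signal present).
calls = {
    "region_1": (True, False),
    "region_2": (False, True),
    "region_3": (True, True),
    "region_4": (False, False),
}

labels = {name: classify_region(*flags) for name, flags in calls.items()}
```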

The pipeline is available both as a command-line tool (https://github.com/sartorlab/mint) and as a Galaxy graphical user interface (https://github.com/sartorlab/mint_galaxy). Both implementations require minimal configuration while remaining flexible to experiment-specific needs.

LRpath and RNA-Enrich

LRpath performs gene set enrichment testing using logistic regression, allowing the input data to remain on a continuous scale. RNA-Enrich additionally takes into account gene coverage for RNA-seq data. This web-based tool tests against several annotation databases, including Gene Ontology, multiple pathway databases, metabolite, transcription factor and microRNA target sets, and literature-derived annotations. LRpath performs well with both small and large sample sizes. Additional benefits of using the LRpath program include (1) the ability to perform both “directional” and “non-directional” enrichment tests that allow for two different perspectives and (2) the ability to easily compare and visualize results across multiple studies using LRpath clustering.
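
The basic idea of enrichment testing by logistic regression can be sketched in plain Python: gene set membership is the binary outcome and a continuous significance score is the predictor, so no significance cutoff is needed. This is a minimal sketch under stated assumptions, not LRpath's implementation: the score is taken to be -log10(p-value), the data are invented, the fit uses a hand-written Newton-Raphson loop, and p-values, directional tests, and the coverage adjustment of RNA-Enrich are omitted.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # entries of the (negated) Hessian
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: one entry per gene.
# score = -log10(p-value) from a differential expression test (assumed predictor).
score = [0.1, 0.3, 0.5, 0.8, 1.0, 1.3, 1.7, 2.0, 2.5, 3.0, 3.5, 4.0]
# in_set = 1 if the gene belongs to the gene set being tested, else 0.
in_set = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(score, in_set)
# Odds ratio per one-unit increase in the score; above 1 suggests enrichment.
odds_ratio = math.exp(b1)
```

A positive slope means that more significant genes are more likely to belong to the set. The toy data are intentionally not perfectly separable, since the maximum likelihood estimate would otherwise not be finite.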