The Sartor lab develops bioinformatics methods and tools for the analysis of high-throughput genome-wide regulatory and epigenomics data, and focuses on understanding the biological/clinical significance of results.
Our GSE Suite is a one-stop shop containing methods and resources for gene set enrichment testing for both sets of genomic regions and gene expression data. We now have a comprehensive tool to link regulatory genomic regions (e.g. peaks in enhancers) to their target genes using a variety of methods and databases. Our default approach has been carefully optimized for gene set enrichment testing by objectively ranking over a thousand different approaches.
ChIP-Enrich, Poly-Enrich, and proxReg: Gene Set Enrichment Testing for large sets of genomic regions
Bioconductor package: http://bioconductor.org/packages/release/bioc/html/chipenrich.html
ChIP-Enrich and Poly-Enrich test for enrichment of biological pathways in large sets of narrow genomic regions, such as ChIP-seq or ATAC-seq peaks or repetitive region families. (For broad genomic regions, see Broad-Enrich instead.) proxReg is a complementary tool that tests whether the regulation of a gene set tends to come mainly from promoter regions or from enhancer regions. Users can choose among several types of gene sets, including Gene Ontology, KEGG, MeSH, and MSigDB. Using an uploaded input file, ChIP-Enrich and Poly-Enrich assign genomic regions to genes based on a chosen "locus definition". The "locus" of a gene is the region from which the gene is predicted to be regulated. We are now adding smart enhancer-gene target links, which we’ve shown perform better than simply assigning each genomic region to the gene with the nearest TSS. ChIP-Enrich uses a logistic regression model to test for association between the presence of at least one peak in a gene's locus and gene set membership, while Poly-Enrich uses a negative binomial regression model to test for association between the number of peaks in a gene's locus and gene set membership. Both empirically adjust for the relationship between locus length (and optionally mappability) and the outcome using a cubic smoothing spline term within the model. Poly-Enrich can also take weighted, or scored, genomic regions. Output includes summary plots, peak-to-gene assignments, and enrichment (and depletion) results, including the odds ratio, p-value, and FDR for each gene set.
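To make the "locus definition" idea concrete, here is a minimal Python sketch of the simplest option mentioned above, assigning each peak to the gene with the nearest TSS, and then deriving the two per-gene quantities the models use: presence of at least one peak (ChIP-Enrich) and the number of peaks (Poly-Enrich). It is only an illustration and is not the chipenrich package API; the gene names, coordinates, and the function `nearest_tss_gene` are hypothetical, and a single chromosome is assumed.

```python
from bisect import bisect_left

# Hypothetical toy annotation for one chromosome: (TSS position, gene),
# sorted by position. Real annotations come from a genome build.
TSS = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC"), (20000, "geneD")]
TSS_POSITIONS = [pos for pos, _ in TSS]


def nearest_tss_gene(peak_start, peak_end):
    """Return the gene whose TSS is closest to the peak midpoint."""
    mid = (peak_start + peak_end) // 2
    i = bisect_left(TSS_POSITIONS, mid)
    # Only the TSSs immediately upstream and downstream can be the nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(TSS)]
    best = min(candidates, key=lambda j: abs(TSS_POSITIONS[j] - mid))
    return TSS[best][1]


# Hypothetical peaks as (start, end) intervals.
peaks = [(900, 1100), (2800, 3000), (3200, 3400), (5100, 5300), (8000, 8200)]

# Number of peaks per gene: the kind of count outcome Poly-Enrich models.
peak_counts = {gene: 0 for _, gene in TSS}
for start, end in peaks:
    peak_counts[nearest_tss_gene(start, end)] += 1

# Presence of at least one peak: the kind of binary outcome ChIP-Enrich models.
has_peak = {gene: int(n > 0) for gene, n in peak_counts.items()}

print(peak_counts)
print(has_peak)
```

Genes with no assigned peak (here `geneD`) stay in the table with a count of zero, since the regression is fit across all genes rather than only those with peaks.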
Broad-Enrich: Gene Set Enrichment Testing for large sets of broad genomic regions
PePr: Peak Prioritization Pipeline
Github Site: https://github.com/shawnzhangyx/PePr
PePr is a Python-based analysis pipeline for ChIP-seq experiments with biological replicates. The program accounts for the variation among biological replicates using a negative binomial model, and uses local information to improve estimates of variance. It uses a novel between-sample normalization strategy to account for variable antibody efficiency, and post hoc steps to increase peak resolution and reduce false positives. It can be used either to identify histone modification or transcription factor binding sites relative to control data, or for two-group comparisons (i.e., differential binding). With PePr, users do not need to call peaks separately in each sample first; the differential peak calling is performed in a single analysis.
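As a rough illustration of why a negative binomial model suits replicate data, the sketch below estimates a dispersion parameter from replicate read counts in a window and then averages it over neighboring windows. It assumes the common parameterization variance = mean + dispersion × mean² and a simple method-of-moments estimate; this is not PePr's actual estimator, and the function names, window counts, and `flank` parameter are hypothetical.

```python
from statistics import mean, variance


def nb_dispersion(counts):
    """Method-of-moments dispersion, assuming var = mu + phi * mu**2."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    # Clamp at zero: no overdispersion beyond Poisson-like variation.
    return max(0.0, (var - mu) / (mu * mu))


def local_dispersion(window_counts, index, flank=1):
    """Average the per-window dispersion over nearby windows (illustrative)."""
    lo = max(0, index - flank)
    hi = min(len(window_counts), index + flank + 1)
    return mean([nb_dispersion(c) for c in window_counts[lo:hi]])


# Hypothetical read counts: one list of four replicates per genomic window.
windows = [
    [10, 14, 30, 6],
    [12, 11, 13, 12],
    [8, 9, 40, 7],
]

phi_first = nb_dispersion(windows[0])
phi_local = local_dispersion(windows, 1)
print(phi_first, phi_local)
```

With only a handful of replicates, a single window's estimate is noisy, which is the motivation for borrowing information from nearby windows.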
Methylation Integration (Mint) Pipeline
Github site: https://github.com/sartorlab/mint
The mint pipeline analyzes single-end reads coming from sequencing assays measuring DNA methylation and hydroxymethylation. The pipeline analyzes reads from both bisulfite-converted assays such as WGBS and RRBS, and from pulldown assays such as MeDIP-seq, hMeDIP-seq, and hMeSeal. Moreover, with data measuring both 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC), the mint pipeline integrates the two data types to classify genomic regions as containing 5mC, 5hmC, a mixture of both, or neither.
The pipeline is available both as a command-line tool (https://github.com/sartorlab/mint) and through a Galaxy graphical user interface (https://github.com/sartorlab/mint_galaxy). Both implementations require minimal configuration while remaining flexible to experiment-specific needs.
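The four-way classification described above can be pictured with a small Python sketch. It assumes that upstream steps have already decided, for each region, whether there is evidence of 5mC and of 5hmC; the real pipeline's rules for combining bisulfite and pulldown signals are more involved, and the function name, labels, and region coordinates here are hypothetical.

```python
def classify_region(has_5mc, has_5hmc):
    """Label a region from two per-mark calls (illustrative only)."""
    if has_5mc and has_5hmc:
        return "mixture"
    if has_5mc:
        return "5mC"
    if has_5hmc:
        return "5hmC"
    return "neither"


# Hypothetical regions: (chromosome, start, end, 5mC call, 5hmC call).
regions = [
    ("chr1", 100, 200, True, False),
    ("chr1", 300, 400, True, True),
    ("chr1", 500, 600, False, True),
    ("chr1", 700, 800, False, False),
]

labels = [classify_region(m, h) for _, _, _, m, h in regions]
print(labels)
```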
LRpath and RNA-Enrich
LRpath performs gene set enrichment testing using logistic regression, allowing the input data to remain on a continuous scale. RNA-Enrich additionally takes into account gene coverage for RNA-seq data. This web-based tool tests against several annotation databases, including Gene Ontology, multiple pathway databases, metabolite, transcription factor and microRNA target sets, and literature-derived annotations. LRpath performs well with both small and large sample sizes. Additional benefits of using the LRpath program include (1) the ability to perform both “directional” and “non-directional” enrichment tests that allow for two different perspectives and (2) the ability to easily compare and visualize results across multiple studies using LRpath clustering.
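The idea of keeping the input on a continuous scale can be sketched in a few lines of Python. The example below assumes gene set membership is modeled as a function of a per-gene significance score, here -log10(p-value), and fits a one-predictor logistic regression by Newton-Raphson; a positive slope means more significant genes are more likely to belong to the set. This is a toy sketch, not the LRpath implementation: the p-values, membership labels, and `fit_logistic` are hypothetical, and no covariates or multiple-testing correction are included.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(P(y = 1)) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w              # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * delta = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


# Hypothetical per-gene p-values from a differential expression test,
# and whether each gene belongs to the gene set being tested.
p_values = [0.0001, 0.001, 0.01, 0.02, 0.05, 0.2, 0.3, 0.5, 0.7, 0.9]
in_set = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

significance = [-math.log10(p) for p in p_values]
b0, b1 = fit_logistic(significance, in_set)

# Odds ratio per one-unit increase in -log10(p-value).
odds_ratio = math.exp(b1)
print(b1, odds_ratio)
```

Because no significance cutoff is applied, every gene contributes to the fit, which is the practical difference from tests that first split genes into "significant" and "not significant".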