We develop computational approaches for organizing, visualizing, analyzing, and modeling data from high-throughput molecular profiling experiments, helping experimental systems biologists form rational hypotheses for further experimentation. We analyze high-dimensional data collected for projects that integrate results from multiple layers of regulation (genomics, transcriptomics, and proteomics). Algorithms and datasets are delivered as software so that our methodologies can reach and impact the systems biology research community. Below are some of the software tools we have developed:
Recently Released Resources
Collection of Web-based Applications to Execute Bioinformatics Workflows
Appyters extend Jupyter notebooks to broaden their accessibility by turning Jupyter notebooks into fully functional standalone web-based bioinformatics applications. Each Appyter presents to users an entry form enabling them to upload their data and set various parameters for executing a multitude of bioinformatics analysis pipelines. Once the form is filled, the Appyter executes the corresponding notebook in the cloud, producing a report without requiring the user to interact directly with the code. Appyters can be applied to a variety of workflows including building customized machine learning pipelines, analyzing omics data, and producing publishable figures.
Search Engine for Querying Annotated Sets of Drugs and Small Molecules
A database with a search engine for querying annotated sets of drugs and small molecules for performing drug set enrichment analysis. Utilizing the data within Drugmonizome, we also developed Drugmonizome-ML which enables users to construct customized machine learning pipelines using the drug set libraries from Drugmonizome. To demonstrate the utility of Drugmonizome, drug sets from 12 independent SARS-CoV-2 in vitro screens were subjected to consensus enrichment analysis. Despite the low overlap among these 12 independent in vitro screens, we identified common biological processes critical for blocking viral replication. To demonstrate Drugmonizome-ML, we constructed a machine learning pipeline to predict whether approved and preclinical drugs may induce peripheral neuropathy as a potential side effect.
Kinase Enrichment Analysis Version 3
Infers upstream kinases whose putative substrates are overrepresented in a user-inputted list of genes or differentially phosphorylated proteins. The KEA3 database contains putative kinase-substrate interactions collected from publicly available datasets. Gene sets of putative kinase substrates are used as the primary units of analysis in KEA3. These gene sets are organized in gene set “libraries.” Libraries are supersets of kinase substrate sets that are aggregated based on the database from which they are derived.
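The idea of substrate over-representation can be illustrated with a hypergeometric test, one common way to score enrichment. This is only a minimal sketch: the description above does not state which statistical test KEA3 uses, and the kinase names, gene names, and background size below are hypothetical placeholders rather than KEA3 data.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background):
    """P(X >= overlap) when list_size genes are drawn from a background
    in which set_size genes are putative substrates of one kinase."""
    upper = min(set_size, list_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_kinases(gene_list, library, background):
    """Rank the kinases in one library by enrichment p-value (ascending)."""
    genes = set(gene_list)
    results = []
    for kinase, substrates in library.items():
        overlap = len(genes & substrates)
        p = hypergeom_pvalue(overlap, len(substrates), len(genes), background)
        results.append((kinase, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy library: kinase -> set of putative substrates.
toy_library = {
    "KINASE_A": {"G1", "G2", "G3", "G4"},
    "KINASE_B": {"G5", "G6", "G7", "G8"},
}
ranked = rank_kinases(["G1", "G2", "G3", "G9"], toy_library, background=100)
for kinase, overlap, p in ranked:
    print(kinase, overlap, p)
```

In this toy input, three of the four query genes are substrates of the placeholder KINASE_A, so it ranks first. With several libraries, the same ranking would be computed once per library.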
COVID-19 Drug and Gene Set Library
Collection of Drug and Gene Sets Related to COVID-19 Contributed by the Community
Enables users to view, download, analyze, visualize, and contribute drug and gene sets related to COVID-19 research. To evaluate the content of the library, we compared the results from six in vitro drug screens for COVID-19 repurposing candidates. Surprisingly, we observed low overlap across screens; the candidates that do overlap should receive more attention as potential therapeutics for COVID-19. Overall, the COVID-19 Drug and Gene Set Library can be used to identify community consensus, make researchers and clinicians aware of new potential therapies, enable machine-learning applications, and help the research community work together toward a cure.
Most Popular Resources
Gene-List Enrichment Analysis Tool
Biological Knowledge Engine
A biological knowledge engine built on top of information about genes and proteins from 114 datasets. To create the Harmonizome, we distilled information from original datasets into attribute tables that define significant associations between genes and attributes, where attributes could be genes, proteins, cell lines, tissues, experimental perturbations, diseases, phenotypes, or drugs, depending on the dataset. Gene and protein identifiers were mapped to NCBI Entrez Gene Symbols and attributes were mapped to appropriate ontologies. We also computed gene-gene and attribute-attribute similarity networks from the attribute tables. These attribute tables and similarity networks can be integrated to perform many types of computational analyses for knowledge discovery and hypothesis generation.
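As a rough illustration of how a gene-gene similarity network could be derived from an attribute table, the sketch below scores each pair of genes by the Jaccard index of their attribute sets. The similarity measure is an assumption, since the description above does not name one, and the gene and attribute names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_similarity(attribute_table):
    """Map each unordered gene pair to the similarity of their attributes."""
    return {
        (g1, g2): jaccard(attribute_table[g1], attribute_table[g2])
        for g1, g2 in combinations(sorted(attribute_table), 2)
    }


# Hypothetical attribute table: gene -> set of associated attributes.
toy_table = {
    "GENE_A": {"liver", "kidney", "drug_x"},
    "GENE_B": {"liver", "kidney"},
    "GENE_C": {"brain"},
}
similarity = gene_similarity(toy_table)
for pair, score in similarity.items():
    print(pair, round(score, 3))
```

Swapping the roles of genes and attributes in the table would give an attribute-attribute network in the same way.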
Harmonizome mobile app
All RNA-seq and ChIP-seq Signature Search Space
ARCHS4 provides access to gene counts from HiSeq 2000 and HiSeq 2500 platforms for human and mouse experiments from GEO and SRA. The website enables downloading of the data in H5 format for programmatic access as well as a 3-dimensional view of the sample and gene spaces. Search features allow browsing of the data by metadata annotation, submission of your own up and down gene sets, and exploration of matching samples enriched for annotated gene sets. Selected sample sets can be downloaded into a tab-separated text file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 human reference genome, and mouse samples against the GRCm38 mouse reference genome.
Automated Generation of Interactive Notebooks for RNA-seq Data Analysis in the Cloud
BioJupies is a web server that enables automated creation, storage, and deployment of Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files, their gene expression tables, or fetch data from >9,000 published studies containing >300,000 preprocessed RNA-seq samples.
Submit Biomedical Terms to Receive Ranked Lists of Relevant Genes
Enables researchers to enter arbitrary search terms, to receive ranked lists of genes relevant to the search terms. Returned ranked gene lists contain genes that were previously published in association with the search terms, as well as genes predicted to be associated with the terms based on data integration from multiple sources. The search results are presented with interactive visualizations.
L1000 Characteristic Direction Signature Search Engine
Finds consensus signatures that match users’ input gene lists or input signatures. The underlying dataset is the LINCS L1000 small molecule expression profiles generated at the Broad Institute by the Connectivity Map team. The differentially expressed (DE) genes of these profiles were calculated using the Characteristic Direction method.
Large-scale Visualization of Drug-induced Transcriptomic Signatures
L1000 fireworks display (L1000FWD) is a web application that provides interactive visualization of over 16,000 drug and small-molecule induced gene expression signatures. L1000FWD enables coloring of signatures by different attributes such as cell type, time point, and concentration, as well as drug attributes such as MOA and clinical phase. A signature similarity search finds signatures that mimic or oppose an input of up and down gene sets. Each point on the L1000FWD interactive map is linked to a signature landing page, which provides multifaceted knowledge from various sources about the signature and the drug. Notably, such information includes the most frequent diagnoses, co-prescribed drugs, and age distribution of prescriptions as extracted from the Mount Sinai Health System electronic medical records (EMR). Overall, L1000FWD serves as a platform for identifying functions for novel small molecules using unsupervised clustering, as well as for exploring drug MOA.
eXpression2Kinases (X2K) Web
X2K Web infers upstream regulatory networks from signatures of differentially expressed genes. By combining transcription factor enrichment analysis, protein-protein interaction network expansion, and kinase enrichment analysis, X2K Web produces inferred networks of transcription factors, proteins, and kinases predicted to regulate the expression of the inputted gene list. X2K Web provides the results as tables and interactive vector graphic figures that can be readily embedded within publications.
ChIP-X Enrichment Analysis Version 3
A transcription factor enrichment analysis tool that ranks TFs associated with user-submitted gene sets. The ChEA3 background database contains a collection of gene set libraries generated from multiple sources including TF–gene co-expression from RNA-seq studies, TF–target associations from ChIP-seq experiments, and TF–gene co-occurrence computed from crowd-submitted gene lists.
A Suite of Gene Set Enrichment Analysis Tools for Model Organisms
An expansion of Enrichr for four model organisms: fish, fly, worm and yeast. The modEnrichr suite of tools provides the ability to convert gene lists across species using an ortholog conversion tool that automatically detects the species.
Explore More Ma’ayan Lab Resources
Toolkit to Evaluate the FAIRness of Research Digital Resources
As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.
Repository and Search Engine for Bioinformatics Datasets, Tools and Canned Analyses
Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles.
LINCS Joint Project – Breast Cancer Network Browser
LJP-BCNB visualizes thousands of signatures from six breast cancer cell lines treated with ~100 single molecule perturbations, mostly kinase inhibitors. These perturbations were applied in different concentrations while gene expression was measured at different time points using the L1000 technology. Under the same conditions, the cells were imaged for cell viability. The distance between nodes represents response similarity computed using the cosine distance between the Characteristic Direction vectors of perturbations compared with their appropriate controls.
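The cosine distance used to place nodes can be written in a few lines. The sketch below is a generic implementation, not LJP-BCNB code, and the example vectors are made up rather than taken from L1000 data.

```python
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); ranges from 0 to 2."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Made-up signature vectors over the same ordered set of genes.
signature_1 = [0.5, -0.2, 0.1, 0.0]
signature_2 = [0.4, -0.1, 0.2, 0.1]
print(cosine_distance(signature_1, signature_2))
```

A distance near 0 means two perturbations shift expression in nearly the same direction, a distance near 1 means the responses are unrelated, and a distance near 2 means they are opposite.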
Drug Gene Budger
Identify Drugs and Small Molecules to Regulate Expression of Target Genes
Drug Gene Budger (DGB) is a web-based and mobile application developed to help investigators prioritize small molecules that are predicted to maximally influence the expression of their target gene of interest. With DGB, users enter a gene symbol and specify whether they wish to upregulate or downregulate its expression. The output of the application is a ranked list of small molecules that have been experimentally determined to produce the desired expression effect. The table includes the log-transformed fold change, p-value, and q-value for each small molecule, reporting the significance of differential expression as determined by the limma method.
Web-based Heatmap Visualization and Analysis Tool for High-Dimensional Biological Data
Clustergrammer is a web-based visualization tool with interactive features such as zooming, panning, filtering, reordering, sharing, performing enrichment analysis, and providing dynamic gene annotations. Clustergrammer can be used to generate shareable interactive visualizations by uploading a data table to a website, or by embedding Clustergrammer in Jupyter Notebooks.
Browser Extension for Extracting Differentially Expressed Gene Sets from GEO
A web application and two browser extensions (one for Chrome and another for Firefox) designed to facilitate the extraction of signatures from studies posted on the Gene Expression Omnibus (GEO) database. These signatures are then submitted to Enrichr for downstream functional analysis.
Crowd Extracted Expression of Differential Signatures
Collections of processed gene, drug and disease signatures from GEO.
Gene Expression and Enrichment Vector Analyzer
A web-based system that enables the integrative analysis of aggregated collections of tagged gene expression signatures identified and extracted from GEO. Each tagged collection of signatures is presented in a report that consists of heatmaps of the differentially expressed genes; principal component analysis of all signatures; enrichment analysis with several gene set libraries across all signatures, which we term enrichment vector analysis; and global mapping of small molecules that are predicted to reverse or mimic each signature in the aggregate.
Side Effect Prediction Based on L1000 Data
Serves the results of the predicted ADRs for the drugs and small-molecule compounds profiled in the LINCS L1000 project. A network of predictive ADRs was constructed based on their drug similarity and visualized using a stacked bubble chart. Each drug and ADR has a dedicated page with a list of the relevant predictions and external links.
Principal Angle Enrichment Analysis
Dimensionally Reduced Multivariate Gene Set Enrichment Analysis Tool
Uses the geometrical concept of the principal angle to quantify gene-set enrichment. We find that PAEA outperforms a selection of commonly used gene set enrichment methods, including GSEA. To benchmark PAEA against other enrichment methods, we used real data: we examined the ranking of transcription factors by performing enrichment analysis on gene expression signatures from many studies that knocked down, knocked out, or over-expressed transcription factors, using a library of gene sets created from ChIP-seq data profiling the same transcription factors.
LINCS L1000 Slicr [GSE70138 data only]
A metadata search engine that searches for LINCS L1000 gene expression profiles and signatures matching users’ input parameters. It features download of selected search results as csv files in a zipped folder and visualization of selected results in a 3D scatter plot using PCA or MDS. Slicr consists of three views: the search view, the checkout view and the 3D scatter view.
LINCS Canvas Browser
LINCS L1000 Clustering, Visualization and Enrichment Analysis Tool
A web-based tool that enables users to explore thousands of genome-wide gene expression experiments applied to breast cancer cell lines. The browser visualizes results from L1000 experiments where drugs or endogenous ligands were applied to six human breast cancer cell lines in different concentrations and where expression was measured at different time points. The visualization of the results is organized by cell-line and batch where perturbations that induced similar responses are clustered together on a canvas.
Data Visualization Tool
An interactive HTML5 data visualization tool for interacting with three of the recently published datasets of cancer cell lines/drug-viability studies. DCB uses clustering and canvas visualization of the drugs and the cell lines, as well as a bar graph that summarizes drug effectiveness for the tissue of origin or the cancer subtypes for single or multiple drugs. DCB can help in understanding drug response patterns and prioritizing drug/cancer cell line interactions by tissue of origin or cancer subtype.
Chrome Extension for Data and Paper Citations with Text Importance Highlighting
Functions on specific pages of GEO, PubMed, and DataMed. It has two functions: (1) to create downloadable citations for GEO data and PubMed articles and (2) to highlight the most important sentences in PubMed abstracts in a graded manner (based on the TextRank algorithm).
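To give a sense of how TextRank-style sentence scoring works, here is a minimal sketch: sentences become nodes, shared words define edge weights, and a PageRank-like iteration assigns each sentence a score. The tokenizer, damping factor, iteration count, and example sentences are assumptions for illustration and are not taken from the extension itself.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by log sentence lengths (TextRank-style)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


abstract = [
    "Kinase inhibitors reduce tumor growth in mice.",
    "Tumor growth in mice depends on kinase signaling.",
    "Kinase signaling was measured in tumor samples.",
    "The authors thank many colleagues.",
]
scores = textrank(abstract)
for sentence, score in sorted(zip(abstract, scores), key=lambda p: -p[1]):
    print(round(score, 3), sentence)
```

Graded highlighting could then map each score to a highlight intensity, so sentences that share the most vocabulary with the rest of the abstract stand out most.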
Drug Pair Seeker
Predict and Prioritize Pairs of Drugs
A Java program that attempts to predict and prioritize pairs of drugs using the Connectivity Map dataset. Users can enter lists of up and down differentially expressed genes from their experiments to receive a ranked list of drug combinations that are predicted to either reverse or augment the gene expression state of the cells or tissue of interest using a simple formula.
Perform and Visualize Gene-set and Drug-set Enrichment Analyses
ChIP-X Enrichment Analysis
The database contains manually extracted datasets of transcription-factor/target-gene interactions from over 100 experiments, such as ChIP-chip, ChIP-seq, and ChIP-PET, applied to mammalian cells. We use the database as the prior biological knowledge gene-list library when performing gene-list enrichment analysis of mRNA expression data. The system is delivered as web-based interactive software. With this software, users can input lists of mammalian genes for which the program computes over-representation of transcription factor targets from the ChEA database.
Kinase Enrichment Analysis
A web-based tool with an underlying database providing users with the ability to link lists of mammalian proteins/genes with the kinases that phosphorylate them. The system draws from several available kinase-substrate databases to compute kinase enrichment probability based on the distribution of kinase-substrate proportions in the background kinase-substrate database compared with kinases found to be associated with an input list of genes/proteins.
Grid Analysis of Time-series Expression
A computational software platform for integrated visualization and analysis of expression time-series. Given a high-dimensional time-series dataset, GATE employs a clustering algorithm that creates movies of expression dynamics by assigning individual genes/proteins to hexagons on a hexagonal array and dynamically coloring each hexagon according to the expression level of the molecular species with which it is associated. Additionally, in order to infer potential regulatory control mechanisms from patterns of time-series correlations, GATE allows interactive interrogation of the movies with a wide variety of background knowledge datasets.
Utilizes FANs and a PPI Network to Build Subnetworks that Connect Lists of Human and Mouse Genes
A web-based tool and a database that utilizes 14 carefully constructed functional association networks (FANs) and a large-scale protein-protein interaction (PPI) network to build subnetworks that connect input lists of human and mouse genes. The FANs are created from mammalian gene set libraries where mouse genes are converted to their human orthologs. The tool takes as input a list of human or mouse Entrez gene symbols to produce a subnetwork and a ranked list of intermediate genes that are used to connect the query input list. In addition, users can enter any PubMed search term and then the system automatically converts the returned results to gene lists using GeneRIF. This gene list is then used as input to generate a subnetwork from the user’s PubMed query.
Network Inference from Repeated Observations of Sets
A general method for network inference from repeated observations of sets of related entities. Given experimental observations of sets of related entities, S2N infers the underlying network of binary interactions between these entities by generating an ensemble of networks consistent with the data; the frequency of occurrence of a given interaction throughout this ensemble is interpreted as the probability that the interaction is present in the underlying real network.
Gene Expression Data Analysis
A method to identify upstream regulators likely responsible for observed patterns in genome-wide gene expression. By integrating ChIP-seq/chip and position-weight-matrices (PWMs) data, protein-protein interactions, and kinase-substrate phosphorylation reactions, X2K can better identify regulatory mechanisms upstream of genome-wide differences in gene expression. X2K first infers the most likely transcription factors that regulate the differences in gene expression, then uses protein-protein interactions to connect the identified transcription factors through additional proteins, building transcriptional regulatory subnetworks centered on these factors, and finally uses kinase-substrate protein phosphorylation reactions to identify and rank candidate protein kinases that most likely regulate the formation of the identified transcriptional complexes.
Integrated Analysis of Gene/Protein Lists
A web-based system that allows users to upload and analyze lists of mammalian gene sets in a client-server software application. Within their workspace, users can examine the overlap among the lists they upload, manipulate lists with different set operations, expand lists using existing mammalian networks of protein-protein interactions, co-expression correlations, or background knowledge annotation correlations, and apply simple gene-set enrichment analyses on many gene lists at once against a plethora of prior knowledge datasets.
Tool for Creating Subnetworks from Lists of Mammalian Genes or Proteins
A software tool that can be used to place lists of mammalian genes in the context of background mammalian signalome and interactome networks. The input to the program is a list of human Entrez Gene symbols and background networks in SIG format, while the output includes: (a) all identified interactions for the genes/proteins, (b) a subnetwork connecting the genes/proteins through intermediate components, and (c) a ranking of the specificity of intermediate components to interact with the list of genes/proteins.
Flash-based Network Viewer
Visualization of small to moderately sized biological networks and pathways. FNV can also be used to embed pathways inside PDF files for the communication of pathways in soft publication materials.
Identify Biological Themes from Gene Lists
A word-cloud generator and word-cloud viewer based on WordCram and implemented using Java, Processing, AJAX, MySQL, and PHP. Text is fetched from several sources and then processed to extract the most relevant terms, with weights computed from word frequency.
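Frequency-based term weighting of this kind can be sketched as follows. The stop-word list, the tokenization rule, and the scaling of weights relative to the most frequent term are all assumptions for illustration, since the description above only says that weights are based on word frequency.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the tool's own).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "by"}


def term_weights(text, top_n=5):
    """Return up to top_n (term, weight) pairs, with weight = count / max count."""
    words = re.findall(r"[a-z][a-z0-9-]*", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    if not counts:
        return []
    top = counts.most_common(top_n)
    max_count = top[0][1]
    return [(word, count / max_count) for word, count in top]


example_text = "kinase kinase kinase pathway pathway the cell"
cloud = term_weights(example_text)
for word, weight in cloud:
    print(word, round(weight, 2))
```

In a word cloud, each weight would typically set the font size of its term, so the most frequent term is drawn largest.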
Tool for Converting Flat Files to BioPAX
A command-line Java program that can be used to convert structured text files describing molecular interactions into the BioPAX Level 3 standard format.
Desktop Application for Analysis and Visualization of Large-Scale Cell Signaling Networks
A Windows-based desktop application that implements standard network analysis methods to compute clustering and connectivity distribution and to detect network motifs, and that provides means to visualize networks and network motifs. SNAVI can generate linked web pages from network datasets loaded in text format, and it can also create networks from lists of gene or protein names. SNAVI is a useful tool for analyzing, visualizing, and sharing cell signaling data. SNAVI is free, open-source software.
AJAX Viewer for Signaling Networks
Web-based Viewer of Interactive Cell Signaling Networks
A visualization tool for viewing and sharing intracellular signaling, gene regulation and protein interaction networks. AVIS is implemented as an AJAX-enabled syndicated Google gadget. It allows any webpage to render an image from a text file representation of signaling, gene regulatory, or protein interaction networks.
PubMed Alert Me!
PubMed SDI Software Application
A software utility that allows users to enter a list of PubMed queries. Once a list of queries is configured, the program runs either daily or weekly. It searches PubMed and if it finds new matching published papers, the program sends an e-mail notification with a list of links to the new articles.
Database for Integrating High-content Data Collected from Human and Mouse Embryonic Stem Cells
A mammalian embryonic stem cell (ESC)-specific database created by collecting and integrating data reporting results from various published studies that profiled human and mouse ESCs including: protein-DNA binding interactions extracted from ChIP-seq/chip experiments, gene regulatory interactions from loss/gain-of-function studies followed by genome-wide mRNA expression profiling, protein interactions from immunoprecipitation followed by mass-spectrometry proteomics, a list of potential pluripotency regulators from RNA interference screens, ESC-specific proteins and phosphoproteins with specified phosphosites from proteomics and phosphoproteomics studies, time-course genome-wide mRNA microarray datasets from differentiating mouse ESCs, and histone modification status from genome-wide studies.
Literature-based Protein-Protein Interaction Network
A comprehensive literature-derived biochemical network developed in collaboration with Benny Geiger’s Lab. The network is made of known interactions and cellular components composing the focal adhesion complex in mammalian cells. The Adhesome website provides a reference to and supporting materials for the analysis published in Nature Cell Biology.
Model Representing Signaling Pathways and Cellular Machines in the Hippocampal CA1 Neuron
Consists of cell signaling interactions extracted from literature describing components and interactions in mammalian neurons. This network integrates cell signaling pathways specific to mammalian neurons.
PRE Literature-based Protein-Protein Interaction Network
Consists of literature-based protein-protein interactions extracted from low-throughput experimental studies reporting interactions in mammalian presynaptic nerve terminals.