Gene regulatory network inference and validation using relative change ratio analysis and time-delayed dynamic Bayesian network
© Li et al.; licensee Springer. 2014
Received: 16 January 2014
Published: 16 July 2014
The Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 as a community-wide effort to develop network inference challenges for the rigorous assessment of reverse engineering methods for biological networks. We participated in the in silico network inference challenge of DREAM3 in 2008. Here we report the details of our approach and its performance on the synthetic challenge datasets. In our methodology, we first developed a model called relative change ratio (RCR), which took advantage of the heterozygous knockdown data and null-mutant knockout data provided by the challenge to identify potential regulators of the genes. With this information, a time-delayed dynamic Bayesian network (TDBN) approach was then used to infer gene regulatory networks from time series trajectory datasets. Our approach considerably reduced the search space of TDBN and hence achieved much higher efficiency and accuracy. The networks predicted using our approach were evaluated comparatively along with 29 other submissions by two metrics (area under the ROC curve and area under the precision-recall curve). The overall performance of our approach ranked second among all participating teams.
Keywords: Gene regulatory network (GRN); Dialogue for Reverse Engineering Assessments and Methods (DREAM); Relative change ratio (RCR); Time-delayed dynamic Bayesian network (TDBN)
Recent development of high-throughput technologies such as DNA microarray and RNA-Seq (i.e., next-generation sequencing of RNA transcripts) has made it possible for biologists to simultaneously measure gene expression at a genome scale. High dimensional datasets generated using such technologies provide a system-wide overview of how genes interact with each other in a network context. However, reconstruction of complex networks of genetic interactions and unraveling of unknown relationships among genes based on such high-throughput datasets remain a very challenging computational problem.
Various mathematical methods and computational approaches have been proposed to infer gene regulatory networks (GRN) from DNA microarray data, including Boolean networks, information theory, differential equations, and Bayesian networks. However, the relative performance of these algorithms is not well studied, because a fair comparison requires computational biologists to test them repeatedly on large-scale, high-quality datasets obtained under different experimental conditions and derived from different networks. Unfortunately, experimental datasets of customized size and design are usually unavailable, and most biological networks are unknown or incomplete. Since each of these methods uses different datasets and comparison strategies, it is difficult to systematically validate the interactions predicted by different computational approaches.
Because knowledge of experimentally validated biological networks of gene interactions is limited, simulated data generated artificially from in silico gene networks provide a ‘gold’ standard for systematically evaluating the performance of different genetic network inference algorithms. An in silico network consists of a known network topology, which determines the structure, and a model for each of the interactions among the genes. In such simulated data, all aspects of the networks are under full control, and different types of data and levels of noise are allowed. Many methods have been proposed for creating in silico genetic networks, including continuous, probabilistic, and dynamic approaches.
The strengths and weaknesses of network inference algorithms have rarely been assessed and compared using rigorous metrics. As a community effort to address this deficiency in GRN reconstruction methodology, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) project was initiated in 2006 to catalyze the interaction between experiment and theory, specifically in the area of cellular network inference and quantitative model building (http://www.the-dream-project.org/). One of the key goals of DREAM is the development of community-wide challenges for objective assessment of reverse engineering methods for biological networks. The in silico network inference challenge of DREAM3 was designed to explore the extent to which underlying gene networks of various sizes and connection densities can be inferred from simulated data. To participate in this challenge, we developed a novel approach that combines relative change ratio (RCR) and time-delayed dynamic Bayesian network to deduce GRNs from the synthetic datasets for Escherichia coli and Saccharomyces cerevisiae (budding yeast) provided by the challenge. Among 29 participating teams, the performance of our approach was second only to the best performing method in the 10-node and the 50-node network sub-challenges. Here we present the details of our approach and its performance on the challenge datasets.
Materials and methods
The in silico network inference challenge was structured as three separate sub-challenges with networks of 10, 50, and 100 genes (nodes), respectively. For each sub-challenge, five in silico networks (two for E. coli and three for S. cerevisiae) were created as benchmark or gold standard networks. The rationale for this design was to evaluate the consistency of inference methods in predicting the topology of five independent networks of the same type and size. These benchmark networks were generated by Daniel Marbach of Ecole Polytechnique Fédérale de Lausanne by extracting sub-networks, together with their connection topology, from the currently accepted E. coli and S. cerevisiae GRNs and imbuing the networks with dynamics using a thermodynamic model of gene expression. The in silico ‘measurements’ were generated by continuous differential equations, which were deemed reasonable approximations of gene expression regulatory functions. A small amount of Gaussian noise was added to these values to simulate measurement error.
For each sub-challenge network, three experimental gene expression datasets were simulated for both E. coli and S. cerevisiae: heterozygous knockdown, null-mutants, and time series trajectories. The heterozygous knockdown dataset contained the steady state gene expression levels for the wild-type strain and for the heterozygous knockdown strain (expression of a gene reduced by half) of each gene. The null-mutant dataset contained the steady state levels for the wild-type and the null-mutant (expression of a gene set to zero) strains. The time series trajectory dataset contained time courses of the network recovering from several external perturbations. All of the datasets can be downloaded at the DREAM Project website: http://wiki.c2b2.columbia.edu/dream/index.php/D3c4.
Relative change ratio
A GRN represents the interactions of all genes in the network. For a given GRN structure, a change in the expression level of one gene results in changes in the expression levels of all other genes regulated by this gene. If a gene plays an important role in the GRN (a key gene), its knockout or null-mutation leads to more significant changes in the expression levels of the genes that interact directly with it. Thus, the wild-type, heterozygous knockdown, and null-mutant datasets provide useful information (prior knowledge) that we can use to improve the accuracy of GRN inference. Here we introduce the RCR method to preprocess and analyze the given datasets and to identify the key genes that can be used for further GRN inference. Because the RCR method reveals the relationships between a knockout gene and the genes it influences, it can also be used directly to infer a GRN.
If the absolute change of gene expression values compared to their own reference value is less than a chosen threshold (e.g., 0.05), even though the relative change ratio is more than 0.30, we still consider these genes as noise and remove them from the regulated genes list.
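The filtering rule above can be sketched in a few lines of Python. The RCR formula itself is not given in this text, so the sketch assumes that the ratio for a gene is the absolute difference between its level in the perturbed strain and its wild-type level, divided by the wild-type level. The function name, the dictionary layout, and the example values are hypothetical; only the two cutoffs (0.30 and 0.05) come from the text.

```python
def rcr_regulated_genes(wild_type, perturbed, rcr_threshold=0.30, abs_threshold=0.05):
    """Return {perturbed_gene: [genes flagged as potentially regulated]}.

    wild_type: dict gene -> steady-state level in the wild-type strain.
    perturbed: dict perturbed_gene -> dict gene -> steady-state level in that strain.
    ASSUMPTION: RCR = |perturbed - wild_type| / wild_type (not stated in the text).
    """
    candidates = {}
    for source, levels in perturbed.items():
        targets = []
        for gene, value in levels.items():
            if gene == source:
                continue  # the perturbed gene is not treated as its own target
            reference = wild_type[gene]
            if reference == 0:
                continue  # the ratio is undefined for a zero reference level
            abs_change = abs(value - reference)
            ratio = abs_change / reference
            # Keep the gene only if both the relative and the absolute change are large enough.
            if ratio > rcr_threshold and abs_change >= abs_threshold:
                targets.append(gene)
        candidates[source] = targets
    return candidates


# Hypothetical steady-state levels for a three-gene network with G1 knocked out.
wild_type = {"G1": 0.80, "G2": 0.50, "G3": 0.10}
knockout = {"G1": {"G1": 0.00, "G2": 0.20, "G3": 0.06}}
regulated = rcr_regulated_genes(wild_type, knockout)
```

In this example G2 is kept (ratio 0.6, absolute change 0.30), whereas G3 is discarded as noise: its ratio of 0.4 exceeds 0.30, but its absolute change of 0.04 is below 0.05.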
Dynamic Bayesian network
Kevin Murphy and co-workers implemented a Bayesian network toolbox (BNT), in which structure learning is performed by calling the BNT function learn_struct_dbn_reveal, which uses the REVEAL algorithm.
Time-delayed dynamic Bayesian network
The traditional DBN approach, as implemented by Murphy and co-workers, is limited in effectiveness for two main reasons. The first is its extremely high computational cost. In Murphy's implementation, all the genes in the dataset are considered as potential parents (regulators) of a given target gene. This makes it impossible to model large-scale gene networks, because the computational time increases exponentially when the algorithm tries to find all subsets of parent genes for a given target gene. In our testing, the number of genes usually had to be restricted to fewer than 30; larger gene sets were too time-consuming. The second is that biologically relevant transcriptional time lags cannot be determined in Murphy's BNT, which reduces the inference accuracy of gene regulatory networks.
To address these limitations of traditional DBN, Zou and Conzen introduced a time-delayed dynamic Bayesian network (TDBN)-based analysis method, which can reconstruct GRNs from time series gene expression data. The improved method can dramatically reduce computational time and significantly increase accuracy. According to previous studies, most transcriptional regulators exhibit either an earlier or simultaneous change in expression level when compared to their targets. Requiring this of every candidate limits the potential parents of each target gene and thus dramatically decreases the computational cost. The other improvement by Zou and Conzen is to estimate the transcriptional time lag between potential regulators and their target genes. The time difference between the initial expression change of a potential regulator and that of its target gene represents a biologically relevant time period.
Using the initial expression change of a potential regulator is expected to allow a more accurate estimation of the transcriptional time lag between potential regulators and their targets, because it takes into account the variable expression relationships of different regulator-target pairs. These improvements by Zou and Conzen are based on transcriptional time lags between regulators and target genes, so the method can also be considered a time-delayed DBN and can be used directly to predict networks from time series gene expression data, such as the trajectory time series data in the DREAM3 challenge.
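The two ideas described above, restricting candidate regulators to genes that change no later than the target and measuring the lag between the two initial changes, can be illustrated with a minimal Python sketch. This is not the TDBN implementation. The function names are hypothetical, the "initial expression change" is assumed to be the first time point whose fold change relative to the first measurement crosses a cutoff, and the cutoff values used here are illustrative placeholders only.

```python
def initial_change_index(series, up=1.2, down=0.7):
    """Index of the first time point whose fold change relative to the first
    time point is >= up or <= down; None if the gene never changes.
    ASSUMPTION: the cutoffs 1.2 and 0.7 are illustrative placeholders."""
    baseline = series[0]
    if baseline == 0:
        return None  # fold change is undefined for a zero baseline
    for index, value in enumerate(series):
        fold = value / baseline
        if fold >= up or fold <= down:
            return index
    return None


def potential_regulators(expression, target):
    """Map each candidate regulator of `target` to its time lag (in time steps).

    expression: dict gene -> list of expression values over time.
    A gene is a candidate only if its initial change occurs no later than
    the initial change of the target.
    """
    onset = {gene: initial_change_index(values) for gene, values in expression.items()}
    target_onset = onset[target]
    if target_onset is None:
        return {}
    lags = {}
    for gene, gene_onset in onset.items():
        if gene == target or gene_onset is None:
            continue
        if gene_onset <= target_onset:
            lags[gene] = target_onset - gene_onset
    return lags


# Hypothetical time courses: A changes at step 1, C at step 2, B at step 3.
expression = {
    "A": [1.0, 1.5, 1.6, 1.6],
    "B": [1.0, 1.0, 1.0, 1.4],
    "C": [1.0, 1.05, 0.6, 0.5],
}
regulators_of_c = potential_regulators(expression, "C")
```

Here only A is retained as a potential regulator of C, with a lag of one time step; B is excluded because it changes after C. Pruning the parent set in this way is what shrinks the number of parent subsets that the structure search has to score.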
Inferring networks using a method that combines RCR and TDBN
In this combined method, we first used the simple RCR model to find key genes from the given heterozygous knockdown data and null-mutant knockout data. These key genes have a higher potential than other genes to play critical roles in the simulated GRNs. After the data were preprocessed, we constructed a gene interaction network that indicated potential regulation among the selected key genes. The TDBN method was then used to infer another GRN from the time series trajectory datasets. If a gene interaction existed in both the network inferred by RCR and the network inferred by TDBN, we chose it as a predicted edge in our final inferred network. The predicted networks were assessed against the benchmark networks.
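The combination step amounts to intersecting two sets of directed edges. A minimal Python sketch follows; the function name and the example edges are hypothetical, and each edge is assumed to be a (regulator, target) pair so that direction matters.

```python
def consensus_edges(rcr_edges, tdbn_edges):
    """Keep only the directed (regulator, target) edges found by both methods."""
    return set(rcr_edges) & set(tdbn_edges)


# Hypothetical edge lists from the two inference steps.
rcr_edges = [("G1", "G2"), ("G1", "G3"), ("G4", "G2")]
tdbn_edges = [("G1", "G2"), ("G2", "G1"), ("G4", "G2")]
final_edges = consensus_edges(rcr_edges, tdbn_edges)
```

In this example ("G1", "G3") is dropped because only RCR predicts it, and ("G2", "G1") is dropped because only TDBN predicts it, leaving two consensus edges.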
Results and discussion
Inferred networks as compared with the true networks
In this work, our approach was applied to infer GRNs in three different ways. For in silico networks with 10 genes, the gene regulatory networks were inferred only by the RCR method from steady state data, mainly the gene knockout dataset. For networks with 50 genes, the networks inferred separately using RCR and TDBN were combined into the final networks. For networks with 100 genes, we used only TDBN to reconstruct gene networks from the time series trajectory gene expression dataset. In doing this, we sought to determine which method had better performance in inferring gene regulatory networks.
Performance of network inference from synthetic datasets
The performance of each method was evaluated by two metrics: the area under the precision-recall (AUPR) curve and the area under the receiver operating characteristic (AUROC) curve for the whole set of edge predictions for 15 networks. Precision is a measure of fidelity, whereas recall is a measure of completeness. Recall (R) is defined as $R = C_e / (C_e + M_e)$ and precision (P) as $P = C_e / (C_e + F_e)$, where $C_e$ is the number of correct edges, $M_e$ is the total number of missed edges (missed errors), and $F_e$ is the number of false alarm errors. A missed error is a connection between genes that exists in the true network but that the inference algorithm either misses or predicts with the wrong orientation. A false alarm error is a connection that the inference algorithm creates but that does not exist in the true network.
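As a worked illustration, precision and recall can be computed from two sets of directed edges, assuming the standard definitions R = Ce / (Ce + Me) and P = Ce / (Ce + Fe). The Python sketch below uses a hypothetical function name and hypothetical edges. It also makes one assumption that the text leaves open: a wrongly oriented prediction counts both as a missed error (the true edge was not recovered) and as a false alarm (the reversed edge does not exist in the true network).

```python
def precision_recall(predicted_edges, true_edges):
    """Return (precision, recall) for directed (regulator, target) edges."""
    predicted = set(predicted_edges)
    truth = set(true_edges)
    correct = len(predicted & truth)       # Ce: edges predicted with the right orientation
    missed = len(truth - predicted)        # Me: true edges missed or predicted reversed
    false_alarm = len(predicted - truth)   # Fe: predicted edges absent from the true network
    recall = correct / (correct + missed) if truth else 0.0
    precision = correct / (correct + false_alarm) if predicted else 0.0
    return precision, recall


# Hypothetical four-edge true network and three predictions, one of them reversed.
true_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
predicted_edges = [("A", "B"), ("C", "B"), ("C", "D")]
precision, recall = precision_recall(predicted_edges, true_edges)
```

With these inputs Ce = 2, Me = 2, and Fe = 1, so precision is 2/3 and recall is 1/2.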
A P value is the probability that a given or larger area under the curve value is obtained by random ordering of the T potential network links. An overall P value is the geometric mean of the n individual P values, calculated as $(p_1 \times p_2 \times \cdots \times p_n)^{1/n}$. An overall AUROC P value represents the geometric mean of the five AUROC P values (Ecoli1, Ecoli2, Yeast1, Yeast2, and Yeast3). An overall AUPR P value is the geometric mean of the five AUPR P values.
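The geometric mean of a set of P values is straightforward to compute, but individual values can be extremely small, so the Python sketch below sums logarithms instead of multiplying the values directly. The function name and the example values are hypothetical and are not taken from the challenge results.

```python
import math


def overall_p_value(p_values):
    """Geometric mean of P values, computed in log space to avoid underflow."""
    if not p_values or any(p <= 0 for p in p_values):
        raise ValueError("P values must be a non-empty list of positive numbers")
    mean_log = sum(math.log(p) for p in p_values) / len(p_values)
    return math.exp(mean_log)


# Hypothetical P values for five networks; their geometric mean is close to 1e-6.
example_p_values = [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]
overall = overall_p_value(example_p_values)
```

Because the mean is taken over exponents, a single very small P value pulls the overall value down strongly, which is worth keeping in mind when comparing overall scores across teams.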
To calculate AUPR and AUROC, each predicted network was submitted as a ranked list of predicted edges. The list was ordered according to the confidence of the predictions, so that the first entry corresponded to the edge predicted with the highest confidence. In other words, the edges at the top of the list were believed to be present in the network, and the edges at the bottom of the list were believed to be absent from the network.
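How a ranked edge list turns into the two scores can be sketched as follows. This is not the official DREAM3 scoring script. It assumes that the list contains every potential edge exactly once with no ties, computes AUROC as the fraction of (true edge, false edge) pairs in which the true edge is ranked higher, and approximates AUPR by average precision, which is one common estimate of the area under the precision-recall curve. All names and example edges are hypothetical.

```python
def auroc_and_aupr(ranked_edges, true_edges):
    """ranked_edges: all candidate edges, most confident first (no ties).
    Returns (AUROC, average precision used as an approximation of AUPR)."""
    truth = set(true_edges)
    n_pos = sum(1 for edge in ranked_edges if edge in truth)
    n_neg = len(ranked_edges) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false edge in the list")
    true_seen = 0
    correctly_ordered_pairs = 0  # (true, false) pairs with the true edge ranked higher
    precision_sum = 0.0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in truth:
            true_seen += 1
            precision_sum += true_seen / rank  # precision at this true edge
        else:
            correctly_ordered_pairs += true_seen  # true edges ranked above this false edge
    auroc = correctly_ordered_pairs / (n_pos * n_neg)
    aupr = precision_sum / n_pos
    return auroc, aupr


# Hypothetical ranking of four candidate edges; the 1st and 3rd are true.
ranked_edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
gold_edges = {("A", "B"), ("A", "C")}
auroc, aupr = auroc_and_aupr(ranked_edges, gold_edges)
```

For this ranking, three of the four (true, false) pairs are ordered correctly, giving an AUROC of 0.75, and the average precision is (1 + 2/3) / 2, or about 0.83. A random ordering is expected to give an AUROC near 0.5.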
Table: Assessment metrics for the first set of E. coli and yeast networks inferred using our approach (row and column labels were not preserved; values are listed in their original order)
5.43E-01  7.71E-01  6.71E-01  4.86E-01  1.45E-02  1.55E-02
7.94E-01  9.44E-01  8.62E-01  8.35E-01  5.21E-01  4.61E-01
1.34E-04  2.09E-06  8.57E-55  3.91E-39  2.27E-01  8.91E-01
5.47E-04  1.29E-06  3.19E-20  4.64E-18  2.02E-01  9.60E-01
1.09E-04  2.54E-46  4.83E-03  2.10E-04  8.19E-18  2.13E-02
Role of RCR and TDBN in network inference
Overall performance of our approach for predicting all five sets of networks of different sizes
Impact of RCR threshold on network inference accuracy
In this study, a novel relative change ratio method was proposed to preprocess the null-mutant steady state data in order to find the key genes, which have a higher potential than other genes to play critical roles, and to build GRNs from them. TDBN was then used to infer GRNs from time series trajectory data, and these were combined with the knowledge gained in the initial step. Finally, the inferred networks were evaluated using the AUPR and AUROC metrics over the whole set of edge predictions for each network. The overall prediction results suggest that, in comparison with other participating teams, our approach was able to infer gene regulatory networks from the in silico DREAM challenge data efficiently and accurately. We are confident that the DREAM project will eventually lead the reverse engineering community to resolve technical problems and overcome barriers between research groups on the way to reliable and accurate GRN inference from high dimensional gene expression data.
AUPR: area under the precision-recall curve
AUROC: area under the receiver operating characteristic (ROC) curve
DREAM: Dialogue for Reverse Engineering Assessments and Methods
GRN: gene regulatory network
RCR: relative change ratio
TDBN: time-delayed dynamic Bayesian network
We would like to thank Gustavo Stolovitzky for organizing the DREAM3 challenge and thank Daniel Marbach and his colleagues from the Laboratory of Intelligent Systems of the Swiss Federal Institute of Technology in Lausanne for providing the challenge datasets. This work was supported by the Environmental Quality and Installation Technologies Research Program of the US Army Corps of Engineers under contract #W912HZ-05-P-0145. Permission was granted by the Chief of Engineers to publish this information.
- Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52(1–2):147-167. doi:10.1023/A:1023905711304
- Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007, 5(1):e8. doi:10.1371/journal.pbio.0050008
- Chen T, He HL, Church GM: Modeling gene expression with differential equations. Pac. Symp. Biocomput. 1999, 4:29-40.
- Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 1998, 3:18-29.
- Imoto S, Goto T, Miyano S: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput. 2002, 7:175-186.
- Stolovitzky G, Prill RJ, Califano A: Lessons from the DREAM2 challenges. Ann. N Y Acad. Sci. 2009, 1158(1):159-195. doi:10.1111/j.1749-6632.2009.04497.x
- Mendes P, Sha W, Ye K: Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 2003, 19(2):122-129.
- Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 2009, 16(2):229-239. doi:10.1089/cmb.2008.09TT
- Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21(1):71-79. doi:10.1093/bioinformatics/bth463
- Yu H, Luscombe NM, Qian J, Gerstein M: Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 2003, 19:422-427. doi:10.1016/S0168-9525(03)00175-6
- Stolovitzky G, Monroe D, Califano A: Dialogue on reverse-engineering assessment and methods: the dream of high-throughput pathway inference. Ann. N Y Acad. Sci. 2007, 1115:1-22. doi:10.1196/annals.1407.021
- Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, Santini S, Bernardo MD, Bernardo DD, Cosma MP: A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 2009, 137:172-181. doi:10.1016/j.cell.2009.01.055
- Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G: Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. U S A 2010, 107(14):6286-6291. doi:10.1073/pnas.0913357107
- Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 2010, 5(2):e9202. doi:10.1371/journal.pone.0009202
- Lähdesmäki H, Hautaniemi S, Shmulevich I, Yli-Harja O: Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Process. 2006, 86(4):814-834. doi:10.1016/j.sigpro.2005.06.008
- Friedman N, Murphy K, Russell S: Learning the structure of dynamic probabilistic networks. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI) 1998, 139-147.
- Murphy K: Dynamic Bayesian networks: representation, inference and learning. PhD Dissertation, University of California, Berkeley; 2002.
- Murphy K, Mian S: Modeling gene expression data using dynamic Bayesian networks. Technical report. Computer Science Division, University of California, Berkeley, CA; 1999.