Fastbreak: a tool for analysis and visualization of structural variations in genomic data
© Bressler et al; licensee Springer. 2012
Received: 16 January 2012
Accepted: 24 September 2012
Published: 9 October 2012
Genomic studies are now being undertaken on thousands of samples, requiring new computational tools that can rapidly analyze data to identify clinically important features. Inferring structural variations in cancer genomes from mate-paired reads is a combinatorially difficult problem. We introduce Fastbreak, a fast and scalable toolkit that enables the analysis and visualization of large amounts of data from projects such as The Cancer Genome Atlas.
Keywords: Cancer genomics; Structural variation; Translocation
Genomic analysis of cancer and other genetic diseases is changing from the study of individuals to the study of large populations. This is exemplified by large-scale projects such as The Cancer Genome Atlas (TCGA), a multi-institution consortium working to build a comprehensive compendium of genomic information that promises to reveal the molecular basis of cancer and lead to new discoveries and therapies. Currently, TCGA centers are targeted to undertake the integrated analysis of 20-25 cancer types using more than twenty thousand samples [1, 2]. This endeavor provides investigators with an unprecedented view of the genomic aberrations that define many human cancers. Cancer cells display diverse genetic structure even within a single individual. Analysis of these structural variations (SVs) across thousands of individuals requires tools that execute quickly and minimize systematic bias and errors.
Structural variants can be inferred from mapped mate-pair sequencing data by analyzing read pairs that have unlikely positions or orientations relative to each other, and several methods and applications for this purpose have been presented [5, 6]. However, identifying groups of unlikely reads that support a particular structural variation can involve computations that become combinatorially complex as the number of reads increases. Algorithms such as BreakDancer that make pairwise comparisons between reads have running times that scale nonlinearly with input size and are thus expensive to apply to large data sets consisting of many high coverage genomes. We present Fastbreak, a toolset that has been designed to enable efficient and parallelizable SV analysis of next-generation sequencing data. The algorithm and associated tools are available as open source software at http://code.google.com/p/fastbreak/ and incorporate several features:
Scalable rule-based approach: The system uses a set of rules designed to detect the signatures of SVs in a single pass over the data and accumulate this information in efficient, parallelizable data structures. These rules can be further tailored to focus on the signature of cancer-associated SVs, greatly reducing false positives (see Rules used in sample analysis).
Robust analysis: Because of variations in coverage and quality in the large amounts of data available, the software chains together different tools and statistical methods to identify both statistically anomalous files and those sections of the data that are free from systematic bias (see Robustness of analysis and quality assurance of data).
Visual data mining: The tool incorporates a set of novel visualizations allowing for interactive exploration and the presentation of the results at different scales (see Interactive visual representation).
Table. Running times in minutes for Fastbreak, Fastbreak on Hadoop on a 9-server cluster, and BreakDancer.
Columns: Fastbreak (both passes); Fastbreak on Hadoop (pass 1 + pass 2)
9 GB tumor: 4 + 25
20 GB tumor: 8 + 40
40 GB blood: 9 + 110
The linear scaling and parallelizability of the Fastbreak algorithm are both due to the use of efficient spatial data structures to accumulate counts of the read pairs that satisfy a set of rules in a single pass over the data. A second set of rules is then applied to all of the regions in the spatial data structure to calculate the confidence that structural variation has affected that region. The data structures are implemented for accumulating both one-dimensional (the position of a single read) and two-dimensional (such as the positions of two paired reads) genetic data in coarse (1000 bp) bins. The first set of rules describes what may be considered an abnormal read pair, and the data structure accumulates the density of both normal and abnormal read pairs in one and two dimensions. The second set of rules identifies, classifies and scores possible structural variations based on the size of abnormal read pair clusters and the local coverage as represented in these densities. The rules are described in detail in Rules used in sample analysis.
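The binned accumulation described above can be illustrated with a short sketch. This is not the Fastbreak implementation: the class name `BinnedCounts`, its methods, and the use of sparse dictionaries are assumptions made for illustration. Only the 1000 bp bin width and the idea of one- and two-dimensional counts come from the text.

```python
from collections import Counter

BIN_SIZE = 1000  # bp; the coarse bin width mentioned in the text


class BinnedCounts:
    """Sparse 1-D and 2-D histograms of read-pair positions (hypothetical sketch)."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.normal_1d = Counter()    # (chrom, bin) -> number of normal pairs
        self.abnormal_1d = Counter()  # (chrom, bin) -> number of abnormal pairs
        self.abnormal_2d = Counter()  # ((chrom, bin), (chrom, bin)) -> number of pairs

    def _bin(self, chrom, pos):
        # Map a genomic coordinate to its coarse bin.
        return (chrom, pos // self.bin_size)

    def add(self, chrom1, pos1, chrom2, pos2, abnormal):
        """Record one read pair; constant work per pair, so one pass is linear."""
        a, b = self._bin(chrom1, pos1), self._bin(chrom2, pos2)
        target = self.abnormal_1d if abnormal else self.normal_1d
        for key in {a, b}:  # count the pair once per distinct bin it touches
            target[key] += 1
        if abnormal:
            # Order the two bins so (a, b) and (b, a) share one 2-D cell.
            self.abnormal_2d[tuple(sorted((a, b)))] += 1

    def merge(self, other):
        """Combine counts from another accumulator, e.g. one built on another chunk."""
        self.normal_1d.update(other.normal_1d)
        self.abnormal_1d.update(other.abnormal_1d)
        self.abnormal_2d.update(other.abnormal_2d)
```

Because merging two accumulators is just an addition of counts, separate chunks of a file could in principle be processed independently and combined afterwards, which is one way to read the parallelizability claim above.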
Rules used in sample analysis
The Fastbreak system uses sets of rules designed to detect genetic structural variations in high throughput sequence data. For analysis of glioblastoma (GBM) and ovarian cancer, these rules have been further refined to detect features that occur prevalently in disease (cancer) samples. To detect these structural variations, rules have been developed to identify three different sorts of abnormal read pairs: those with an abnormal distance between mapped positions; those with inconsistent orientation; and those mapped to different chromosomes. The algorithm compiles a list of such abnormal ("odd") reads in a linear-time pass over the data. These reads are stored in a spatial data structure, allowing us to identify groups of similar odd read pairs that meet our criteria in a second linear-time pass over the filtered data. This data structure uses a system of bins that limits the resolution of the data, but provides significant speed advantages.
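As a minimal sketch of the three kinds of abnormal read pair listed above, the function below labels a pair by the first rule it violates. The `ReadPair` record, the label strings, the 1000 bp distance threshold, and the expected strand layout are all illustrative assumptions. In particular, the orientation regarded as normal depends on the sequencing library and is not specified here.

```python
from typing import NamedTuple, Optional


class ReadPair(NamedTuple):
    """Hypothetical minimal record for a mapped read pair."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


MAX_NORMAL_DISTANCE = 1000     # bp; assumed threshold for a normal pair
EXPECTED_STRANDS = ("+", "-")  # assumed layout: leftmost read forward, rightmost reverse


def classify(pair: ReadPair,
             max_distance: int = MAX_NORMAL_DISTANCE,
             expected=EXPECTED_STRANDS) -> Optional[str]:
    """Return a label for an abnormal ("odd") pair, or None for a normal pair."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"
    # Order the two reads by mapped position before checking strands.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if (left_strand, right_strand) != expected:
        return "orientation"
    if right_pos - left_pos > max_distance:
        return "distance"
    return None


def odd_pairs(pairs):
    """Single pass over the data: yield (pair, label) for abnormal pairs only."""
    for pair in pairs:
        label = classify(pair)
        if label is not None:
            yield pair, label
```

Each pair is examined once and independently of every other pair, so the filtering pass is linear in the number of read pairs.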
In mate pair sequence data, the resolution at which breakpoints can be confidently detected is dictated by the longest distance between mate-paired reads that can be considered normal. In the non-disease (blood) samples that were analyzed, only 0.1% of the reads had mapped distances of more than 1000 base pairs. Fastbreak uses this length to define the size of the bins in its internal data structure so that most normal read pairs will fall within a single bin. This eliminates the combinatorial difficulty of identifying clusters of abnormal read pairs and is one of the key optimizations that allow the algorithm’s running time to scale linearly with data size.
The rule set used in the analysis of the cancer samples was developed to identify clusters of abnormal read pairs that appear as part of a signature present in the majority of tumor samples within a set of matched (from the same patient) tumor and blood samples. Fastbreak was used to identify the common signature of abnormal read pairs found in tumor samples. The analysis to determine the best rule set is run separately for each disease, and across all the tumors there was found to be an enrichment of mate-pair distances between 1000 and 7000 bp (see Figure 3). Pair distances greater than 7000 bp are not significantly enriched in tumor samples, indicating that they are caused primarily by random noise or by structural variations present in the normal tissue relative to the reference genome used for mapping. The upper limit of this window is on the same order as the fall off of structural variations longer than 2000 bp observed by Clark et al. through the deep sequencing of a GBM cell line.
For the comparative cancer analysis, the rule system was designed to detect structural variations supported by orientation, chromosome, or pair distance in this 1000-7000 bp window. Because some of these reads will be the product of random noise, an additional analysis is done to determine how many reads and what percent of the total coverage of a region are needed to conservatively identify a structural variation. These rules can be shown to maximize the difference between the distance distributions of tumor and blood samples (Figure 4).
To account for differences in mapping quality of reads, each inferred feature is assigned a score which aggregates the mapping quality assigned to all supporting reads using a probabilistic interpretation, so that the score assigned to the feature is the probability that not all of the reads were mismapped. This provides a score for each identified feature that increases with both the number of supporting reads and their mapping quality. For the analysis presented, we specified that, for us to consider a cluster of abnormal reads a structural variation, the number of reads that show unusual characteristics must be greater than two, and must account for more than 5% of the local coverage. We found that these rules are well suited to the exome sequenced samples that we analyzed, but more or less conservative rules can be used depending on the quality and coverage of the data available.
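The scoring and thresholding just described can be sketched as follows. Two assumptions are made that the text does not state: mapping qualities are taken to be Phred-scaled (so a quality Q corresponds to a mismapping probability of 10^(-Q/10)), and reads are treated as independently mapped. The function names and the definition of local coverage as a simple read count are hypothetical.

```python
import math


def mismap_probability(mapq):
    """Probability that a read is mismapped, assuming a Phred-scaled quality."""
    return 10 ** (-mapq / 10)


def feature_score(mapqs):
    """Probability that not all supporting reads were mismapped.

    Assumes independence between reads: 1 - product of per-read
    mismapping probabilities.
    """
    return 1 - math.prod(mismap_probability(q) for q in mapqs)


def is_candidate(n_abnormal, local_coverage, min_reads=2, min_fraction=0.05):
    """Thresholds from the text: more than two reads and more than 5% of coverage."""
    if local_coverage <= 0:
        return False
    return n_abnormal > min_reads and n_abnormal / local_coverage > min_fraction
```

Under these assumptions the score rises with each additional supporting read and with higher mapping quality, matching the behavior described above. For example, `is_candidate(3, 40)` is true (3 reads, 7.5% of coverage), while `is_candidate(3, 100)` is false (3%).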
Robustness of analysis and quality assurance of data
Fastbreak helps to formulate rules for the identification of biologically relevant features that are robust against false positives due to differences in coverage and other batch effects. In addition to sequencing errors, automated analyses need to remove coverage bias and sample anomalies. To minimize the effects of disparate coverage levels between samples and genes, a biclustering method has been integrated to identify a subgroup of genes and samples with relatively consistent coverage. Erroneous individual samples are removed by use of an internal QA process that analyzes different read groups within a sample to find anomalies. Matched pairs that pass the QA tests are then processed for secondary analysis.
The example analysis here involves the identification of structural variants across different cancer types. The analysis used a data set of 172 GBM patients and 132 ovarian cancer patients. Of these, fewer than 50% (96 GBM samples and 38 ovarian samples) passed the QA test process (see below). The parameters of the tests can be changed to include more patients, either through analysis of fewer chromosome regions, or by lowering the quality/coverage thresholds.
The QA process is designed to identify biases across and within samples, and identify chromosome regions across patients that can be compared. The system identifies regions that have sufficient coverage across patients, so that biases due to coverage depth are minimized. Batch effects can be studied by looking for correlations between coverage and identified features (a generally undesired property), and by comparing across samples (see Figures 1 and 2). As Fastbreak can be optimized to identify features using rules specific to the system under study, and can compensate for differences in coverage, it shows some robustness to changes in conditions and corresponding batch effects.
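One simple way to look for the coverage-feature correlation mentioned above is a Pearson correlation between per-sample coverage and the number of features called in each sample. The sketch below is an illustration only: the text does not say which correlation measure was used, and the function name and input layout are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero across samples would suggest that the number of called features is not simply tracking sequencing depth, whereas a value near 1 would point to a coverage-driven batch effect.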
Table. Most disrupted genes across the ovarian cancer and glioblastoma data sets. Columns: number of disrupted GBM samples; number of disrupted ovarian cancer samples.
Table. Functional enrichment of the most structurally disrupted genes in pooled GBM and ovarian cancer samples. Listed term: guanyl-nucleotide exchange factor.
Interactive visual representation
Gene behaviors within a cancer type can be explored by selecting the OV (ovarian) or GBM (glioblastoma) tab on the left side of the application (see Figure 6). The cancer-specific visualization level shows mutual information distances between genes across all patients as a circular plot. Again, mouse-over events and alternate table views can be used to view the data in more detail.
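To make the quantity behind this view concrete, the sketch below computes the mutual information (in bits) between two genes, each represented as a per-patient vector of discrete states such as 1 for disrupted and 0 for intact. This encoding and the function name are assumptions. The text does not specify how mutual information is converted into a distance, so that step is deliberately left out.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(xs, ys))   # joint counts over patients
    count_x, count_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

Two genes disrupted in exactly the same patients share the maximum information for their marginal distribution, while genes whose disruptions are statistically independent across patients yield a value of zero.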
The third level of visualization allows a user to view structural variations at a specific location for a selected patient sample. Comparisons between tissues (tumor and blood) and patients can be done at this level of the application. Selection of patients and chromosome location can be done in the “data and range selection” window, while selection of parameters specific to the visualization can be altered in the “advanced parameters” window. A depth-first graph traversal of the structural variant data is used as the underlying data of the visualization. Results are drawn as a cyclic tree such that each contiguous region is represented by a pair of orthogonal branches. Gene location is shown along the base and branches of the trees while coverage information is displayed below the tree. The thickness of the branches indicates the number of supporting reads for the particular structural variation event. Mouse-over and click events are also implemented to view more information regarding a specific SV event. The organic structure of this visualization allows the viewer to quickly distinguish between different topologies based on qualitative differences in tree appearances (Figure 5).
The web application described above may be viewed and explored at http://fastbreak.systemsbiology.net. The underlying data and the web application are also available for download on the site.
However, Fastbreak provides only a coarse view of structural variation. It can be used to identify the regions that have been affected by structural variation, but does not attempt to describe precisely what variation has occurred. It is our hope that future tools might use Fastbreak-like data structures and approaches to parallelization to accelerate more precise algorithms.
The algorithm was developed with input and advice from Sheila Reynolds and Jared Roach. All data were acquired from The Cancer Genome Atlas at http://cancergenome.nih.gov/. This work was supported by the National Cancer Institute [U24CA143835]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455: 1061-1068. 10.1038/nature07385
- Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature 2011, 474: 609-615. 10.1038/nature10166
- Collins FS, Barker AD: Mapping the cancer genome. Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies. Sci. Am. 2007, 296(3):50-57. 10.1038/scientificamerican0307-50
- Nowak MA, Komarova NL, Sengupta A, Jallepalli PV, Shih L, Vogelstein B, Lengauer C: The role of chromosomal instability in tumor initiation. PNAS 2002, 99(25):16226-16231. 10.1073/pnas.202617399
- Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Reviews Genetics 2006, 7: 85-97.
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley T, Wilson RK, Li D, Mardis ER: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 2009, 6: 677-681. 10.1038/nmeth.1363
- Clark MJ, Homer N, O’Connor BD, Chen Z, Eskin A, Lee H, Merriman B, Nelson SF: bioDist: Different distance measures. R package version 1.16.0
- Hubert L, Schultz J: Quadratic assignment as a general data-analysis strategy. Br. J. Math. Stat. Psychol. 1976, 29: 190-241. 10.1111/j.2044-8317.1976.tb00714.x
- van Uitert M, Meuleman W, Wessels L: Biclustering sparse binary genomic data. J. Comput. Biol. 2008, 10: 1329-1345.
- Clark MJ, Homer N, O'Connor BD, Chen Z, Eskin A: U87MG decoded: the genomic sequence of a cytogenetically aberrant human cancer cell line. PLoS Genet. 2010, 6(1):e1000832. 10.1371/journal.pgen.1000832
- Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc 2009, 4(1):44-57.
- Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37(1):1-13. 10.1093/nar/gkn923