In this study, we applied the proposed MCSD to classify GBM into four subtypes (pro-neural, neural, classical, and mesenchymal) using multiple types of genetic data from TCGA. High classification accuracy was achieved by using the CS-based technique (i.e., MCSD) together with the combination of multiple datasets. The results from combining two types of genomic data were compared with those from a single type of data. Moreover, the classification performance with and without the MCSD technique was also compared. The comparisons showed that the CS-based combined analysis of multiple types of genetic data could significantly improve the accuracy of detecting GBM subtypes.
Combining different types of genomic data allows us to interpret the information in the datasets comprehensively. The information from miRNA and mRNA is complementary, so a combined analysis can give a better result than an analysis of a single data type. miRNAs are a recently discovered class of small non-coding RNAs that regulate gene expression, and they can be combined with mRNA data for better disease subtyping. However, when no dimension reduction with CS was applied, Table 2 shows that the classification accuracy from the combined analysis was only comparable to that from mRNA expression alone, because of the added redundancy. The classification performance improved significantly after we applied the CS method, indicating that CS may reduce redundancy in the combined datasets and thus improve the classification accuracy.
Informative features/biomarkers selected in this study have also been validated as associated with GBM and have been reported in the literature. In the combined data analysis, 121 features/probes were selected (listed in Additional file 2): 3 miRNA expression probes and 118 mRNA expression probes. Two of the selected miRNA probes represent the same miRNA, “hsa-miR-9” (sequence “TCATACAGCTAGATAACCAA”), which has been validated to be involved in the stemness potential and chemoresistance of GBM cells [27–29] and is known to be specifically expressed during brain neurogenesis. Among the selected mRNA expression probes are four probes of “CD44” and three probes of “ASCL1”. Both genes have been validated as biomarkers for subtyping GBM in multiple genomic studies [9, 30–32], which demonstrates the significance of “CD44” and “ASCL1” in discriminating different subtypes of GBM. Three probes from “THBS1” are also among the 121 selected probes. “THBS1” is a subunit of a disulfide-linked homotrimeric protein that has been shown to play roles in platelet aggregation, angiogenesis, and tumorigenesis. “THBS1” is also a major activator of “TGFB1”, and “TGFB1” expression is associated with GBM. Moreover, “TbRII”, a receptor of “TGFB1”, has been found to have a strong relationship with human malignant glioblastoma cells. Some biomarkers listed in Additional file 2 have not yet been reported; they may be potential biomarkers for GBM and deserve further study.
We also performed Gene Ontology (GO) analyses to determine whether these genes were enriched in specific GO terms (biological processes). The GO terms “antigen processing and presentation” and “lymphocyte mediated immunity” (p = 1.78 × 10⁻⁶), several GO terms related to wound healing [e.g., “response to wounding” (p = 1.26 × 10⁻⁸); “wound healing” (p = 2.44 × 10⁻⁶)], and several related to cell adhesion [e.g., “biological adhesion” (p = 6.53 × 10⁻⁷); “cell adhesion” (p = 6.41 × 10⁻⁷)] showed highly significant enrichment for our selected genes. These results were expected. Taking the “lymphocyte mediated immunity”-related GO categories as an example, lymphocyte-mediated cellular responses play a critical role in the body’s ability to generate an antitumor immune response, and the activation status of lymphocytes is an important determinant of sensitivity to tumor-mediated apoptosis. In addition, according to previous studies, the miRNAs we identified are related to glioblastoma. For example, “hsa-miR-9” was found to inhibit differentiation of glioblastoma stem cells, and its target, the calmodulin-binding transcription activator 1 (CAMTA1), is a tumor suppressor in glioblastoma.
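The text does not state which tool or statistical test produced the GO enrichment p-values. As an illustration only, a common choice for this kind of over-representation analysis is the upper tail of a hypergeometric distribution. The sketch below assumes that test; the function name and its arguments are hypothetical and are not taken from the study.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    Assumed setting (not stated in the paper):
      n_background: number of genes in the background set
      n_annotated:  background genes annotated with the GO term
      n_selected:   genes in the selected list
      n_overlap:    selected genes annotated with the GO term
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    # Sum the probabilities of drawing k annotated genes for every k
    # at least as large as the observed overlap.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value means the selected list contains more genes annotated with the term than would be expected by chance. In practice such p-values are usually also corrected for testing many GO terms at once; the text does not say whether the reported values were corrected.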
To test the stability of the classification results, the samples were randomly reassigned to the training and testing sets ten more times. The number of samples from each subtype in the training and testing sets was kept the same as described in the section “Data collection”. The overall classification rate had an average value of 87.1% with a standard deviation of 4.5%, indicating that the results are rather robust.
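The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the `evaluate` callback (which stands in for training and testing the classifier on one split), and the use of the sample standard deviation are all assumptions.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, n_train_per_class, rng):
    """Randomly split sample indices, keeping a fixed number of training
    samples for each subtype; the remaining samples form the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label, indices in by_class.items():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        k = n_train_per_class[label]
        train.extend(shuffled[:k])
        test.extend(shuffled[k:])
    return sorted(train), sorted(test)


def repeated_evaluation(labels, n_train_per_class, evaluate,
                        n_repeats=10, seed=0):
    """Call the hypothetical `evaluate(train_idx, test_idx)` on repeated
    random splits and return the mean and sample standard deviation of
    the accuracies it returns."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        train_idx, test_idx = stratified_split(labels, n_train_per_class, rng)
        accuracies.append(evaluate(train_idx, test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Fixing the per-subtype counts in every split keeps the class balance identical across repetitions, so the variation in accuracy reflects only which samples fall into the training and testing sets.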
In summary, we have developed a CS-based technique for combining multiple types of genomic data to subtype glioblastoma more accurately. The biomarkers identified with our approach have also been validated or reported in the existing literature, indicating that the integrated approach can provide comprehensive information for better disease diagnosis.