# Identification of thresholds for dichotomizing DNA methylation data

- Yihua Liu
^{1}, - Yuan Ji
^{2}and - Peng Qiu
^{1}Email author

**2013**:8

**DOI: **10.1186/1687-4153-2013-8

© Liu et al.; licensee Springer. 2013

**Received: **1 March 2013

**Accepted: **23 May 2013

**Published: **6 June 2013

## Abstract

DNA methylation plays an important role in many biological processes by regulating gene expression. It is commonly accepted that turning on the DNA methylation leads to silencing of the expression of the corresponding genes. While methylation is often described as a binary on-off signal, it is typically measured using beta values derived from either microarray or sequencing technologies, which takes continuous values between 0 and 1. If we would like to interpret methylation in a binary fashion, appropriate thresholds are needed to dichotomize the continuous measurements. In this paper, we use data from The Cancer Genome Atlas project. For a total of 992 samples across five cancer types, both methylation and gene expression data are available. A bivariate extension of the StepMiner algorithm is used to identify thresholds for dichotomizing both methylation and expression data. Hypergeometric test is applied to identify CpG sites whose methylation status is significantly associated to silencing of the expression of their corresponding genes. The test is performed on either all five cancer types together or individual cancer types separately. We notice that the appropriate thresholds vary across different CpG sites. In addition, the negative association between methylation and expression is highly tissue specific.

## Introduction

DNA methylation plays an important role in cancer through hypermethylation to turn off tumor suppressors and hypomethylation to activate oncogenes [1, 2]. It is widely accepted that DNA methylation is associated with silencing of gene expression [3]. With data from high-throughput array and sequencing technologies, several studies have analyzed the relationship between methylation and gene expression [4–6].

When the relationship between methylation and gene expression is discussed, both are often described as binary signals (i.e., on-off, high-low) [7]. For example, for a gene whose expression can be controlled by the methylation of a CpG site in its promoter region: if the CpG site is methylated, the gene’s expression is typically low; if the CpG site is unmethylated, the expression of the gene can be either high or low, depending on other controlling mechanisms. On the other hand, measurements of methylation and expression obtained using microarrays and sequencing technologies are in continuous values. If we want to interpret the relationship between methylation and gene expression data using the binary language, appropriate thresholds are needed to dichotomize the measurements.

To jointly analyze methylation and gene expression, an ideal dataset would be a large collection of samples for which both data types are available. The Cancer Genome Atlas (TCGA) project provides such data for a large number of cancer samples [8–11]. Moreover, the TCGA samples are derived from multiple cancer and tissue types. The diversity among the samples may enable us to see relationships that cannot be observed in individual tissue types.

In this paper, we downloaded DNA methylation and gene expression data in TCGA. Data for a total of 992 samples were available, covering five cancer types. We extended the StepMiner algorithm [12] to identify thresholds to dichotomize methylation and expression measurements. Hypergeometric test was used to identify CpG sites whose methylation is significantly associated to silencing of expression of their corresponding genes. We observed that appropriate thresholds are highly CpG site specific, and the methylation-expression association for many genes is tissue-type specific.

## Materials and methods

### Methylation and expression data from TCGA

TCGA data portal (https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp) provides three ways for accessing the data. Two of them, ‘data matrix’ and ‘bulk download,’ require investigators to manually select a subset of the data and then automatically collect relevant data files into a compressed.tar file for download. After that, additional effort is needed to parse and assemble the downloaded files into formats useful to programming environments such as Matlab or R. Since TCGA data keep growing and the manual selection can be tedious when multiple data types and disease types are considered, it is difficult to keep track of the manual selections and guarantee reproducibility. Therefore, we chose the third way, ‘open-access http directory,’ which contains links for all individual data files in TCGA (http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/). We created Matlab scripts to programmatically grab methylation and RNA-seq data files for each individual disease type, automatically parse them, and organize them into tab delimited spreadsheets for subsequent analysis. Our scripts for automatically downloading TCGA data are available at http://odin.mdacc.tmc.edu/~pqiu/software/DownloadTCGA/.

Genome-wide methylation measurements were generated using the Illumina Infinium Human DNA Methylation27 array platform (Illumina, Inc., San Diego, CA, USA), which interrogates the methylation status of 27,578 CpG sites in proximal promoter regions of 14,475 genes in the human genome. As of 12 February 2013, methylation data for 2,796 samples across 12 cancer types were available. We downloaded the TCGA level 3 preprocessed data, which are the ratio *M*_{
i
}/(*U*_{
i
}+*M*_{
i
}) for each CpG site *i*. *M*_{i} represents the signal intensity of the methylated probe for CpG site *i*, and *U*_{i} is the signal intensity of the unmethylated probe. Therefore, the numerical range of the data is between 0 and 1. Zero (0) indicates unmethylated, whereas 1 indicates completely methylated. The data contain a small fraction of empty entries, because the corresponding probes either overlap with known single-nucleotide polymorphisms or other genomic variations, or their signal intensities are lower than the background.

TCGA uses several platforms to quantify gene expression, among which the Illumina GA II and HiSeq platforms profiled the largest number of samples. As of 12 February 2013, preprocessed RNA-seq data for 4,108 samples across 11 cancer types were available. The preprocessed data are the RPKM values for 20,532 genes in each sample. Roughly, the numerical range of the data is between 0 and 10^{5}. For each gene, we replaced the zero entries with the minimal non-zero value of this gene across all samples and transformed the data to log scale.

The total number of overlapping samples between the above methylation and expression data was 992. The overlap covered five different cancer types: breast cancer (BRCA, 313 samples), colon and rectal cancer (COAD/READ, 227 samples), kidney renal clear cell carcinoma (KIRC, 208 samples), squamous cell lung cancer (LUSC, 129 samples), and uterine corpus endometrioid carcinoma (UCEC, 115 samples). Our analysis was performed based on these 992 overlapping samples.

### Extend StepMiner for dichotomizing methylation and expression data

*X*with an unknown probability distribution and

*n*independent observations of the random variable

*x*

_{ k },

_{(k=1,2,...,n)}, the algorithm first sorts the observations in ascending order

*x*

_{(1)}≤

*x*

_{(2)}≤...≤

*x*

_{(n)}. Then, the sorted data are fitted by a step function,

*f*(

*i*)=

*μ*

_{1}

*I*(

*i*≤

*t*)+

*μ*

_{2}

*I*(

*i*>

*t*), where

*i*=1,2,...,

*n*and

*I*(.) is an indicator function. Denote the mean of all observations as

*μ*, the deviation of the fitted step function to sample mean as signal $={\sum}_{i=1}^{n}{\left(f\right(i)-\mu )}^{2}$, and the fitting error as noise $={\sum}_{i=1}^{n}{\left(f\right(i)-{x}_{\left(i\right)})}^{2}$. The goodness of fit can be defined by a signal-to-noise ratio (SNR), and the best fit parameters can be found by maximizing SNR $=\frac{\text{signal}}{\text{noise}}$. Operationally, the maximization problem can be solved by exhaustively enumerating all possible integer values for 1≤

*t*<

*n*. For each possible value of

*t*, calculating the ratio is straight forward because

*μ*

_{1}is the mean of the first

*t*observations in the sorted data and

*μ*

_{2}is the mean of the remaining observations. Once the

*t*

^{∗}that maximizes SNR is identified, the threshold for dichotomizing the observation can be defined as $\frac{1}{2}({x}_{\left({t}^{\ast}\right)}+{x}_{({t}^{\ast}+1)})$. The maximum value of signal-to-noise ratio SNR (

*t*

^{∗}) depends on the distribution of

*X*. If

*X*has an extreme bi-modal distribution and its probability density function is a sum of two delta functions, SNR (

*t*

^{∗}) approaches positive infinity. If

*X*follows a uniform distribution, SNR (

*t*

^{∗}) equals to 3. If

*X*follows a Gaussian distribution, SNR (

*t*

^{∗}) is approximately 1.75, regardless of the values of mean and variance of the Gaussian. Two examples using real data are shown in Figure 1, illustrating how univariate StepMiner identifies thresholds for gene expression of ESR1, and cg20253551 which measures the methylation status of a CpG site of that gene.

*x*-coordinate is the rank of this observation, and

*y*-coordinate is the observed value itself. The collection of all observations forms a non-decreasing curve which is fitted by a step function. In order to extend the algorithm to bivariate, we define a non-decreasing surface and fit it with a bivariate step function. Given

*n*observations of two random variable

*X*and

*Y*, (

*x*

_{ k },

*y*

_{ k }),

_{(k=1,2,...,n)}, we assign each observation (

*x*

_{ k },

*y*

_{ k }) to a point in a three-dimensional (3D) space. The

*x*-coordinate of the point is the rank of

*x*

_{ k }with respect to all the observations for

*X*; the

*y*-coordinate of the point is the rank of

*y*

_{ k }with respect to all the observations for

*Y*; and the

*z*-coordinate of the point is

*x*

_{ k }+

*y*

_{ k }. The collection of all points forms a surface, which is fitted by a bivariate step function with six parameters,

where *i* and *j* both range from 1 to *n*.

*x*

_{ k }is the

*i*th smallest among all observed value for

*X*, and

*y*

_{ k }is the

*j*th smallest among all observed value for

*Y*, the data point (

*x*

_{ k },

*y*

_{ k }) is mapped to point (

*i*,

*j*). The rank transformed data are shown in Figure 2b. To form a non-decreasing surface sitting on top of the rank transformed data, a ‘height’

*z*(

*i*,

*j*)=

*x*

_{ k }+

*y*

_{ k }is defined at point (

*i*,

*j*) to which the

*k*th observation is mapped. For an (

*i*,

*j*) point to which no observation is mapped, we define $z(i,j)=\underset{u\le i,v\le j}{max}z(u,v)$. Such a definition guarantees that the surface is non-decreasing, i.e., $z(i,j)\ge \underset{u\le i,v\le j}{max}z(u,v)$. To ensure that

*X*and

*Y*contribute equally, the observations are normalized to zero-mean-unit-variance before defining

*z*. Finally, denote $\mu =\frac{1}{{n}^{2}}{\sum}_{i,j=1}^{n}z(i,j)$, the parameter values of the best fit two-dimensional (2D) step function can be found by optimizing an SNR $=\left({\sum}_{i,j=1}^{n}{\left(f\right(i,j)-\mu )}^{2}\right)/\left({\sum}_{i,j=1}^{n}{\left(f\right(i,j)-z(i,j\left)\right)}^{2}\right)$ in a similar exhaustive search fashion as the one-dimensional (1D) case. Computing

*z*and optimizing SNR on a

*n*×

*n*grid can be time consuming when

*n*is large. For computational efficiency, we approximate the surface on a 50×50 grid. In Figure 2c, the surface

*z*is shown as a heatmap, where blue indicates small value and red indicates large value. Figure 2d shows the points (

*x*

_{ k },

*y*

_{ k },

*z*

_{ k }) and the best fit bivariate step function. The optimal SNR for this example is 5.17. Similar to the 1D case, the maximum value of SNR depends on the joint distribution of

*X*and

*Y*. If the joint probability density function is a sum of two or three delta functions, the optimal SNR approaches positive infinity. If

*X*and

*Y*are independent, the optimal SNR is 3 for uniform distribution and approximately 1.75 for Gaussian distribution.

### Hypergeometric test for methylation controlled genes

The optimal SNR value in StepMiner2D measures the multi-modality of the joint distribution of *X* and *Y*, rather than the association between the two variables. For example, if *X* and *Y* independently follow two bi-modal distributions, although there is no association between the two variables, the optimal SNR can be large. Thus, SNR does not seem to be suitable for evaluating the association between methylation and expression. Here, we are interested in one particular kind of association, whether methylation of a CpG site leads to down-regulation of its corresponding gene expression. After dichotomizing methylation and expression data, the sufficient statistics become counts of points in the four quadrants in Figure 2a. The significance of methylation controlled gene can be intuitively explained as whether the observed count in the upper-right quadrant is significantly less than expected. Popular statistical tests for 2×2 contingency tables, such as Fisher exact and chi-square tests, are designed to evaluate the whether counts are significantly unbalanced but not toward a specific direction. We choose to use hypergeometric test. Let *N* denote the total number of samples; *R* is the total number of methylated samples (sum of points in the upper-right quadrant and the lower-right quadrant); *U* is the total number of samples with high gene expression (total number of points in the two upper quadrants). Condition on *N*, *R*, and *U*, if the methylation and expression are independent, the number of samples in the upper-right quadrant *k* follows a hypergeometric distribution $p\left(k\right)=\left(\genfrac{}{}{0ex}{}{U}{k}\right)\left(\genfrac{}{}{0ex}{}{N-U}{R-k}\right)/\left(\genfrac{}{}{0ex}{}{N}{R}\right)$. To evaluate the significance of the observed count in the upper-right quadrant *K*, we can compute the probability of observing *K* or less points under the assumption that methylation and expression are independent *p* value $={\sum}_{k=0}^{K}p\left(k\right)$. This is a hypergeometric test specifically for evaluating the significance of whether methylation turns off gene expression.

## Results

### Data preprocessing

We preprocessed the TCGA data by filtering out CpG sites with small variance or many missing data points and matching methylation and expression data according to genes. The methylation data we downloaded from TCGA were generated by the Methylation27 array platform, which provided the methylation status of 27,578 CpG sites in 14,475 genes across 992 cancer samples. We excluded CpG sites whose annotated genes are not present in the expression data. We also excluded CpG sites with more than 1% missing data and ones whose methylation beta value is smaller than 0.01 for more than 95% of the samples. After applying these filtering criteria, we obtained a total of 11,189 CpG sites annotated to 7,344 unique genes. For approximately half of the genes, only one CpG site is measured for each gene; data for two CpG sites are available for the majority of the other half; for a very small number of genes, measurements of multiple CpG sites are available. In the subsequent subsections, for the methylation data of each of the 11,189 CpG sites, we extracted the expression data of its corresponding gene and focused our bivariate analysis on features paired according to genes. Preprocessed data and the code for our analysis is available at http://odin.mdacc.tmc.edu/~pqiu/projects/MethExpr/.

### Identification of methylation on-off threshold

*p*value threshold of 0.01 and Bonferroni correction, 2,976 pairs showed significant association, and the ESR1 example in Figures 1 and 2 was among the significant ones. Figure 3 shows a histogram of the identified methylation thresholds for the 2,976 significant associations, where we observed a wide-spread distribution. This result indicates that although the beta value quantification of methylation has a consistent numerical range of [0, 1] across different genes, the appropriate threshold for dichotomizing the beta values is highly CpG site specific.

### Tissue-specific association between methylation and expression

## Conclusions

We performed integrative analysis of methylation and gene expression data of five cancer types in TCGA. First, we pooled samples from all five cancer types together and applied StepMiner2D to identify thresholds for dichotomizing the methylation and expression data. In such a pan-cancer analysis strategy, the diversity and variation among samples allow us to observe positive and negative signals in sufficient number of samples and empower the method to identify the appropriate thresholds. Then, we applied hypergeometric test to identify CpG sites whose methylation is significantly associated to silencing of the expression of their corresponding genes, either using all five cancer types together or using individual cancer types separately. When all five cancer types were examined together, 2,976 CpG sites showed significant negative association with gene expression. However, when samples in different cancer types were considered separately, a much smaller number of significant associations were observed in at least one cancer type. We speculate that the associations only significant in pan-cancer analysis are likely to be induced by tissue differences, whereas significant associations observed in individual cancer types are more likely to reflect regulatory relationships between methylation and gene expression. For future work, there are a few possible extensions. The methylation data used here are generated by the Illumina Methylation 27k platform. TCGA also generates methylation data using the Illumina Methylation 450k platform, which measures roughly 20 times more CpG sites. We plan to redo the analysis using the 450k methylation data, which will enable us to identify more associations between methylation and expression. Moreover, the proposed analysis strategy can also be applied to examine associations among measurements made by other modalities, such as microRNA expression, DNA copy number variation, protein expression, etc.

## Declarations

### Acknowledgements

The authors would like to acknowledge The Cancer Genome Atlas Research Network for providing the methylation and expression data used in this paper. This work was partially supported by TCGA Genome Data Analysis Center grant at the University of Texas MD Anderson Cancer Center (U24 CA143883 02 S1), as well as NIH grants (R01CA163481 and R01CA174385) from the National Cancer Institute.

## Authors’ Affiliations

## References

- Ballestar E:
**An introduction to epigenetics.***Adv. Exp. Med. Biol*2011,**711:**1-11. 10.1007/978-1-4419-8216-2_1View ArticleGoogle Scholar - Jones P, Baylin S:
**The fundamental role of epigenetic events in cancer.***Nat. Rev. Genet*2002,**3**(6):415-428.Google Scholar - Laird P:
**Principles and challenges of genomewide dna methylation analysis.***Nat. Rev. Genet*2010,**11**(3):191-203.View ArticleGoogle Scholar - Li M, Balch C, Montgomery J, Jeong M, Chung J, Yan P, Huang T, Kim S, Nephew K:
**Integrated analysis of DNA methylation and gene expression reveals specific signaling pathways associated with platinum resistance in ovarian cancer.***BMC Med. Genomics*2009,**2:**34. 10.1186/1755-8794-2-34View ArticleGoogle Scholar - Shaknovich R, Geng H, Johnson N, Tsikitas L, Cerchietti L, Greally J, Gascoyne R, Elemento O, Melnick A:
**DNA methylation signatures define molecular subtypes of diffuse large B-cell lymphoma.***Blood*2010,**116**(20):e81-89. 10.1182/blood-2010-05-285320View ArticleGoogle Scholar - Widschwendter M, Jiang G, Woods C, Muller H, Fiegl H, Goebel G, Marth C, Muller-Holzner E, Zeimet A, Laird P, Ehrlich M:
**DNA hypomethylation and ovarian cancer biology.***Cancer Res*2004,**64**(13):4472-4480. 10.1158/0008-5472.CAN-04-0238View ArticleGoogle Scholar - Newell-Price J, Clark A, King P:
**DNA methylation and silencing of gene expression.***Trends Endocrinol. Metab*2000,**11**(4):142-148. 10.1016/S1043-2760(00)00248-4View ArticleGoogle Scholar - Cancer Genome Atlas Research Network:
**Integrated genomic analyses of ovarian carcinoma.***Nature*2011,**474**(7353):609-615. 10.1038/nature10166View ArticleGoogle Scholar - Cancer Genome Atlas Research Network:
**Comprehensive genomic characterization of squamous cell lung cancers.***Nature*2012,**487**(7417):519-525.Google Scholar - Cancer Genome Atlas Research Network:
**Comprehensive molecular characterization of human colon and rectal cancer.***Nature*2012,**487**(7407):330-333. 10.1038/nature11252View ArticleGoogle Scholar - Cancer Genome Atlas Research Network:
**Comprehensive molecular portraits of human breast tumours.***Nature*2012,**490**(7418):61-70. 10.1038/nature11412View ArticleGoogle Scholar - Sahoo D, Dill D, Gentles A, Tibshirani R, Plevritis S:
**Boolean implication networks derived from large scale, whole genome microarray datasets.***Genome Biol*2008,**9**(10):R157. 10.1186/gb-2008-9-10-r157View ArticleGoogle Scholar - Sahoo D, Dill D, Tibshirani R, Plevritis S:
**Extracting binary signals from microarray time-course data.***Nucleic Acids Res*2007,**35**(11):3705-3712. 10.1093/nar/gkm284View ArticleGoogle Scholar - Hinoue T, Weisenberger D, Lange C, Shen H, Byun H, Van Den Berg D, Malik S, Pan F, Noushmehr H, van Dijk C, Tollenaar R, Laird P:
**Genome-scale analysis of aberrant dna methylation in colorectal cancer.***Genome Res*2012,**22**(2):271-282. 10.1101/gr.117523.110View ArticleGoogle Scholar - Qiu P, Zhang L:
**Identification of markers associated with global changes in DNA methylation regulation in cancers.***BMC Bioinformatics*2012,**13**(Suppl 13):S7. 10.1186/1471-2105-13-S13-S7View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.