DNA methylation arrays as surrogate measures of cell mixture distribution


There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and the need for timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology of most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation and, when measured in whole peripheral blood, serves to distinguish cancer cases from controls.


Statistical methods

Let $Y_{0h}$ be an $m \times 1$ vector of methylation assay values, e.g. average beta values from an Infinium bead-array product, corresponding to a purified blood sample consisting of a homogeneous cell population (e.g. monocytes or granulocytes), with the qualitative characterization of cell type (among $d_0$ such types) indicated by a $d_0 \times 1$ covariate vector $w_{0h}$. Here, $h \in \{1,\ldots,n_0\}$, where $n_0$ is the number of specimens and the $m$ individual values correspond to CpG sites on a DNA methylation microarray, possibly pre-selected to correspond to putative differentially methylated regions (DMRs) for distinguishing different cell types. Correspondingly, let $Y_{1i}$ be an $m \times 1$ vector of methylation assay values for the same CpG sites (in the same order) as $Y_{0h}$, but corresponding to a heterogeneous mixture of cells (e.g. peripheral whole blood) from a human subject. Here, $i \in \{1,\ldots,n_1\}$, $n_1$ is the number of target specimens, and $z_{1i}$ is a $d_1 \times 1$ covariate vector representing phenotypes or exposures corresponding to the subject, e.g. $d_1 = 2$ for a simple case/control study without confounders.

Our goal is to understand the associations between $Y_{1i}$ and $z_{1i}$ in terms of associations between $Y_{0h}$ and $w_{0h}$, i.e. to infer changes in mixtures of cell types associated with phenotypes or exposures, using DNA methylation as a surrogate measure of cell mixture. Thus, we have two data sets: $S_0 = \{(Y_{01}, w_{01}),\ldots,(Y_{0n_0}, w_{0n_0})\}$, the set of data from "purified" cell samples effectively representing external validation or gold-standard data, and $S_1 = \{(Y_{11}, z_{11}),\ldots,(Y_{1n_1}, z_{1n_1})\}$, representing surrogate data collected from a target population. To this end, we posit the following linear models:

$$Y_{0h} = B_0 w_{0h} + e_{0h}, \qquad Y_{1i} = B_1 z_{1i} + e_{1i}, \tag{1}$$

where $B_0$ and $B_1$ are, respectively, $m \times d_0$ and $m \times d_1$ matrices and $e_{0h}$ and $e_{1i}$ are error vectors. For simplicity we assume a one-way ANOVA parameterization for $w$, although in Additional file 1 we describe slight generalizations to account for design complications met in practice. We also assume a reasonable regression parameterization for $z$, including an intercept, and for convenience denote the first column of $B_1$ as $\mu_1$, the $m \times 1$ intercept. The error vectors $e_{0h}$ and $e_{1i}$ may reflect independence among arrays $h$ and $i$, or else may have more complex random effects structure accounting for technical effects or biological replication; however, their substructures are incidental to this analysis, with the exception of the fine details of the bootstrap procedure proposed below.

To implement a surrogacy relation, we propose the following linking regression model:

$$B_1 = 1_m \gamma_0^T + B_0 \Gamma + U, \tag{2}$$

where $1_m$ is the $m \times 1$ vector of ones, $\gamma_0$ is a $d_1 \times 1$ vector, $\Gamma$ is a $d_0 \times d_1$ matrix that summarizes associations between the rows of $B_0$ and $B_1$, and $U$ is an $m \times d_1$ matrix of errors. Substituting equation (2) into (1), writing $B_0 = (b_{01},\ldots,b_{0d_0})$ explicitly in terms of its columns and writing $\Gamma^T = (\gamma_1,\ldots,\gamma_{d_0})$, it follows that

$$Y_{1i} = (\gamma_0^T z_{1i})\,1_m + \sum_{l=1}^{d_0} b_{0l}\,\gamma_l^T z_{1i} + U z_{1i} + e_{1i}. \tag{3}$$


To impart a biological interpretation, we assume that the DNA assayed in $S_1$ arises as a mixture of DNA from cell types profiled in $S_0$, with mixture coefficients whose population averages, conditional on $z$, are $\{\omega_1(z),\ldots,\omega_{d_0}(z)\}$, so that

$$E(Y_{1i} \mid z_{1i} = z) = \sum_{l=1}^{d_0} b_{0l}\,\omega_l(z) + \xi(z), \tag{4}$$


where the $m \times 1$ vector $\xi(z)$ represents cell types excluded from consideration among the purified samples in $S_0$, or else non-cell-specific methylation, including alterations at the molecular level in the maintenance of DNA methylation patterns themselves (possibly exposure related, age related, or disease related). It follows from (3) and (4) that the mixture coefficients are recoverable from $\Gamma$, with $\omega_l(z_{1i}) = \gamma_l^T z_{1i}$, provided $\xi(z)$ is orthogonal to the column space of $B_0$. As we discuss in detail in Additional file 1, bias can arise if differences in $\xi(z)$ between distinct values of $z$ have nonzero projection onto the column space of $B_0$, although the magnitude of anticipated biases can be assessed through sensitivity analysis.

It is possible to assign interpretations to the components of variation in (3). Let $SS_o$ represent total variability in $Y_{1i}$, i.e.

$$SS_o = \sum_{i=1}^{n_1} \lVert Y_{1i} - \bar{\mu}_1 \rVert^2, \qquad \bar{\mu}_1 = E(Y_{1i}).$$

From multivariate probability theory it is simple to show that $SS_o = SS_e + SS_v + SS_u$, where

$$SS_e = \sum_{i=1}^{n_1} \lVert e_{1i} \rVert^2,$$

$$SS_v = \sum_{i=1}^{n_1} (z_{1i} - \bar{z}_1)^T \Gamma^T B_0^T B_0 \Gamma (z_{1i} - \bar{z}_1),$$

$$SS_u = \sum_{i=1}^{n_1} \left\{ (z_{1i} - \bar{z}_1)^T U^T U (z_{1i} - \bar{z}_1) + m\,(z_{1i} - \bar{z}_1)^T \gamma_0 \gamma_0^T (z_{1i} - \bar{z}_1) \right\}.$$

$SS_e$ measures variation unexplained by the covariates $z_{1i}$, presumed to represent a mixture of technical noise and unsystematic biological heterogeneity. $SS_v$ measures variability explained by mixtures of profiles in the set $S_0$, while $SS_u$ measures variability in systematic biological heterogeneity that nevertheless remains unexplained by mixtures of profiles in $S_0$, possibly due to some process other than differences in mixtures of cell types. Thus we propose two partial coefficient of determination measures: $R^2_{1,0} = SS_v / SS_o$, which represents the proportion of total variation in $S_1$ explained by $S_0$, and $R^2_{1,1} = SS_v / (SS_o - SS_e)$, which represents the proportion of systematic variation in $S_1$ explained by $S_0$. Note that $R^2_{1,1}$ is poorly defined when $SS_o \approx SS_e$.
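As a minimal numerical sketch of the two partial coefficients of determination (not the authors' code), the Python below computes $SS_v$ from small toy matrices and then forms $R^2_{1,0}$ and $R^2_{1,1}$. The function names, the toy values of $B_0$, $\Gamma$, $SS_o$ and $SS_e$, and the tolerance used to flag $SS_o \approx SS_e$ are all assumptions made for illustration.

```python
def matvec(A, x):
    """Multiply a matrix A (list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def ss_v(B0, Gamma, Z):
    """SS_v = sum_i ||B0 Gamma (z_i - zbar)||^2, i.e. the quadratic form in the text."""
    n = len(Z)
    zbar = [sum(col) / n for col in zip(*Z)]
    total = 0.0
    for z in Z:
        centered = [a - b for a, b in zip(z, zbar)]
        fitted = matvec(B0, matvec(Gamma, centered))
        total += sum(v * v for v in fitted)
    return total


def partial_r2(ss_o, ss_e, ss_v_value, tol=1e-8):
    """Return (R2_{1,0}, R2_{1,1}); the second is None when SS_o is ~ SS_e."""
    r2_total = ss_v_value / ss_o
    systematic = ss_o - ss_e
    r2_systematic = ss_v_value / systematic if systematic > tol * ss_o else None
    return r2_total, r2_systematic


# Toy inputs (hypothetical): m = 3 CpGs, d0 = 2 cell types, d1 = 2 covariates.
B0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Gamma = [[0.5, 0.2], [0.5, -0.2]]          # columns: intercept, case indicator
Z = [[1, 0], [1, 0], [1, 1], [1, 1]]       # two controls, two cases

ssv = ss_v(B0, Gamma, Z)
r2_10, r2_11 = partial_r2(ss_o=0.5, ss_e=0.3, ss_v_value=ssv)
print(ssv, r2_10, r2_11)
```

Returning `None` rather than dividing by a near-zero denominator is one way to make the "poorly defined" case explicit; other conventions would serve equally well.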

Estimation proceeds by applying an appropriate linear model, e.g. ordinary least squares, linear mixed effects models [16], limma [17], or surrogate variable analysis [18,19], to obtain estimates $\hat{B}_0$ and $\hat{B}_1$. Estimates of $\gamma_0$ and $\Gamma$ are then obtained by projecting $\hat{B}_1$ onto the column space of $\tilde{B}_0 = (1_m, B_0)$, as described in detail in the Additional file. Standard errors can be obtained in one of three ways. The simplest estimator, $SE_0$, is the "naive" estimator from simple least-squares theory, ignoring the fact that $\hat{B}_0$ and $\hat{B}_1$ are estimates, i.e. potentially variable. To account for variation in estimating $\hat{B}_1$, a simple alternative is to use a nonparametric bootstrap procedure. For each bootstrap iteration $t$, we sample with replacement from $S_1$ (or sample errors in a manner consistent with a hierarchical experimental design) to obtain $S_1^{(t)}$, producing bootstrap estimates $\hat{B}_1^{(t)}$ from which "single-bootstrap" standard errors $SE_1$ are computed. Finally, it is possible to account for variation in estimating $B_0$ by also bootstrapping $S_0$; because of potentially small sample sizes $n_0$, we propose using a parametric bootstrap. A "double-bootstrap" standard error estimator, $SE_2$, is computed from these two sets of bootstraps. The double-bootstrap has an additional benefit over the single-bootstrap, in that it can be used to assess bias due to measurement error (variability) in $\hat{B}_0$. Estimation details are provided in the Additional file, as are the results of simulation studies.
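The projection step and the single-bootstrap standard error can be sketched as follows. This is an illustrative toy, not the authors' implementation: it assumes a binary case/control covariate (so the ordinary least squares slope for each CpG is a difference of group means), treats $B_0$ as known, and uses simulated data. All function names, the reference profiles, the noise level, and the mixture weights are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def project(B0, y):
    """Least-squares coefficients of y on (1_m, B0): [gamma_0, gamma_1, ..., gamma_d0]."""
    X = [[1.0] + row for row in B0]
    p = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


def case_effect(Y, case):
    """Per-CpG OLS slope for a binary covariate: case mean minus control mean."""
    n1 = sum(case)
    n0 = len(case) - n1
    return [
        sum(y[j] for y, c in zip(Y, case) if c) / n1
        - sum(y[j] for y, c in zip(Y, case) if not c) / n0
        for j in range(len(Y[0]))
    ]


def bootstrap_se(Y, case, B0, n_boot, rng):
    """Single-bootstrap SE: resample subjects with replacement, re-estimate, re-project."""
    n = len(Y)
    draws = []
    while len(draws) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        c = [case[i] for i in idx]
        if 0 < sum(c) < n:  # skip resamples that lack cases or controls
            draws.append(project(B0, case_effect([Y[i] for i in idx], c)))
    p = len(draws[0])
    means = [sum(d[k] for d in draws) / n_boot for k in range(p)]
    return [(sum((d[k] - means[k]) ** 2 for d in draws) / (n_boot - 1)) ** 0.5
            for k in range(p)]


# Simulated toy data: 6 CpGs, 2 cell types, 20 controls and 20 cases.
rng = random.Random(0)
B0 = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7], [0.5, 0.5], [0.3, 0.2]]
case = [0] * 20 + [1] * 20


def simulate(weights):
    return [sum(b * w for b, w in zip(row, weights)) + rng.gauss(0.0, 0.005)
            for row in B0]


# Cases carry a mixture shift of (+0.1, -0.1) relative to controls.
Y = [simulate((0.7, 0.3) if c else (0.6, 0.4)) for c in case]
gamma_hat = project(B0, case_effect(Y, case))
se1 = bootstrap_se(Y, case, B0, 200, rng)
print(gamma_hat, se1)
```

Only the case/control column of $\Gamma$ is estimated here. A double-bootstrap version would additionally redraw $B_0$ from a parametric model fitted to $S_0$ inside the loop, which this sketch does not attempt.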

Beyond bias due to measurement error, which is easily corrected using the double-bootstrap procedure, there are additional sources of potential bias. For example, consider a univariate $z_{1i}$ representing case/control status, where $\delta \equiv \xi(1) - \xi(0) = B_0 \alpha$ for some $d_0 \times 1$ vector $\alpha \neq 0$; i.e. $\delta$ is the mean difference in DNA methylation between a case and a control, contributed by cell mixtures that remain uncharacterized or by non-cell-specific methylation. In such a situation, there would be a bias equal to $\alpha$ in estimating the mixture differences. Additional file 1 provides a detailed analysis of such biases, and proposes a sensitivity analysis procedure for assessing the magnitude of potential bias in a given data set.
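The origin of this bias can be made explicit in a few lines. The derivation below is a sketch that takes the mixture model (4) to have the form $E(Y_{1i} \mid z) = B_0\,\omega(z) + \xi(z)$, with $\omega(z) = (\omega_1(z),\ldots,\omega_{d_0}(z))^T$; the stacked vector $\omega(z)$ and the symbol $\Delta$ are shorthand introduced here for illustration.

```latex
\begin{align*}
\Delta &\equiv E(Y_{1i} \mid z = 1) - E(Y_{1i} \mid z = 0) \\
       &= B_0\{\omega(1) - \omega(0)\} + \{\xi(1) - \xi(0)\} \\
       &= B_0\{\omega(1) - \omega(0)\} + B_0\,\alpha \\
       &= B_0\{\omega(1) - \omega(0) + \alpha\}.
\end{align*}
```

Because $\Delta$ lies entirely in the column space of $B_0$, projecting it onto that space returns the coefficient vector $\omega(1) - \omega(0) + \alpha$ rather than the true mixture difference $\omega(1) - \omega(0)$, so the two differ by exactly $\alpha$.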

While the focus of this paper is analysis of population data, it is possible to use $S_0$ to predict the distribution of leukocytes in a single sample having DNA methylation profile $Y$. Equating the intercept term of $B_1$ in (1) with $Y$ and applying (2), we obtain mixing proportion estimates $\Gamma = (\tilde{B}_0^T \tilde{B}_0)^{-1} \tilde{B}_0^T Y$. Estimates can be further refined with the use of quadratic programming techniques [20], restricting the components of $\Gamma$ so that $\gamma_l \geq 0$ in minimizing $\lVert Y - \tilde{B}_0 \Gamma \rVert^2$ with respect to $\Gamma$. Such individual projections of methylation profiles onto the column space spanned by $S_0$ facilitate the application of the fundamental ideas proposed above to individual, clinically based diagnostic procedures. Note, however, that DNA methylation arrays are typically focused on the comparison of methylated to unmethylated CpG dinucleotides, not on quantifying actual amounts of DNA. Therefore, information on cell mixtures from DNA methylation is limited to distributions, not actual counts, as one might obtain from flow cytometry. Finally, we remark that it is possible to model $z_{1i}$ directly as a function of mixture coefficients $\Gamma$ obtained individually via the constraint $\gamma_l \geq 0$, but the inferential implications are less clear, and we view the proposed approach for populations as more statistically robust.
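A minimal sketch of the constrained projection for a single sample follows. It swaps in projected gradient descent for the quadratic programming techniques cited in the text, and, as a simplifying assumption, it omits the intercept column $1_m$ so that every coefficient is constrained to be non-negative. The reference profiles, the noiseless synthetic sample, and the function name are hypothetical.

```python
def nnls_projected_gradient(B0, y, n_iter=5000):
    """Minimize 0.5 * ||y - B0 g||^2 subject to g >= 0 by projected gradient descent."""
    m, d = len(B0), len(B0[0])
    # Step 1 / trace(B0^T B0): the trace bounds the largest eigenvalue of
    # B0^T B0, so this step size cannot overshoot.
    step = 1.0 / sum(v * v for row in B0 for v in row)
    g = [0.0] * d
    for _ in range(n_iter):
        resid = [sum(b * gi for b, gi in zip(row, g)) - yi
                 for row, yi in zip(B0, y)]
        grad = [sum(B0[j][k] * resid[j] for j in range(m)) for k in range(d)]
        # Gradient step followed by projection onto the non-negative orthant.
        g = [max(0.0, gi - step * gk) for gi, gk in zip(g, grad)]
    return g


# Hypothetical reference profiles: 6 CpGs (rows) by 3 cell types (columns).
B0 = [
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.2, 0.8],
]
true_w = [0.7, 0.3, 0.0]
# Noiseless synthetic sample built as an exact mixture of the reference profiles.
y = [sum(b * w for b, w in zip(row, true_w)) for row in B0]

est = nnls_projected_gradient(B0, y)
proportions = [e / sum(est) for e in est]  # a distribution, not absolute counts
print(est, proportions)
```

Rescaling the coefficients to sum to one reflects the point made above that methylation data inform the distribution of cell types rather than their absolute counts. With real, noisy profiles the constraint matters more than in this noiseless example, where the unconstrained solution is already feasible.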


We describe several examples using existing methylation data sets as benchmarks for validating the proposed methodology, in order to demonstrate its clinical or epidemiological utility. First we describe the validation data set $S_0$ used in all examples. Next we describe a laboratory reconstruction experiment, which validates our fundamental proposition that DNA methylation retains substantial information about cell mixtures. Finally we describe the results of applying our methodology to several different target data sets $S_1$. For the head and neck cancer and ovarian cancer data sets, for which bead chip data were available, a linear mixed effects model with a random intercept for bead chip was used to estimate the corresponding row of $B_1$. For the remaining data sets, no bead chip data were available; consequently, ordinary least squares was used. We used 250 bootstrap iterations for each example and for each of the two bootstrap methods of standard error estimation.

Validation data

All data analyses involve DNA methylation data obtained using Infinium HumanMethylation27 Beadchip Microarrays from Illumina, Inc. (San Diego, CA). We used a subset of $m = 100$ CpG sites on the array, selected as described below. In all of our examples, $S_0$ consisted of 46 white blood cell samples, de-identified specimens that were not subject to human subjects review by an institutional review board (IRB). The sorted, normal, human, peripheral blood leukocyte subtypes were purchased from AllCells, LLC (Emeryville, CA) and were isolated from whole blood using a combination of negative and positive selection with highly specific cell surface antibodies conjugated to magnetic beads; materials and protocols were obtained from Miltenyi Biotec, Inc.

Cell mixture experiment

Proof of the utility of the proposed methods in predicting leukocyte distributions for individual samples requires extensive, detailed reconstruction experiments beyond the scope of the present paper. However, to provide evidence that such experiments are worthwhile and show promise of positive results, we conducted a simple experiment involving six known mixtures of monocytes and B cells and six known mixtures of granulocytes and T cells. The results of this experiment are described below in Results.

Head and neck cancer

Our first target data set $S_1$ consisted of arrays applied to whole blood specimens collected from a random subset of individuals involved in an ongoing population-based case-control study of head and neck cancer (HNSCC): 92 cases and 92 age and sex matched controls. The study was approved by the Brown University IRB, protocol #0707992334. Blood was drawn at enrollment (prior to treatment in 85% of the cases). Mean age among the subjects arrayed in this study was 60 years, and there were 56 females and 128 males, consistent with the higher incidence of the disease in males. Thus, the covariate vector $z$ consisted of an indicator for case/control status, an indicator for male sex, and age (in decades) centered at the mean. A clustering heatmap depicts the raw DNA methylation data in $S_1$.

Ovarian cancer

We next applied our methodology to an ovarian cancer data set [22]. DNA methylation data for blood samples are available from the Gene Expression Omnibus (GEO; Accession number GSE19711). We used only those cases having blood drawn pre-treatment. After removing 4 arrays with a preponderance of missing values, the data set consisted of 272 controls and 129 cases having blood drawn prior to treatment. A clustering heatmap displaying the DNA methylation data appears in the Additional file. In this analysis, $z$ consisted of case-control status, age (categorized in 5-year increments), and 2 bisulfite conversion efficiency measures.

Down syndrome

We also applied our methodology to a trisomy 21 (Down syndrome) data set consisting of 29 total peripheral blood leukocyte samples from Down syndrome cases and 21 controls, as well as 6 T cell samples from cases and 4 T cell samples from controls (GEO Accession number GSE25395). Because of the potential for bias induced by copy number amplification, we excluded 4 CpG sites on Chromosome 21, resulting in $m = 96$ CpG sites used for analysis. A clustering heatmap displaying the DNA methylation data appears in the Additional file.

Finally, we applied our methodology to an obesity data set consisting of 7 lean African-Americans and 7 obese African-Americans (GEO Accession number GSE25301). A clustering heatmap displaying the DNA methylation data appears in the Additional file. In this analysis, $z$ consisted of obesity status.

Additional analyses

If the subject population for which $z = 0$ is sufficiently homogeneous with respect to blood cell distribution to admit sensible characterization of that distribution, then it is possible to recover estimates from $\hat{\Gamma}$. The Additional file reports the results of such an analysis applied to the HNSCC case/control data set. Finally, we conducted an additional analysis in which we took $S_0$ to consist of only samples with pure CD4+ or CD8+ cells and $S_1$ to consist only of samples having the less purified T-lymphocytes. For such $S_1$, there were no covariates, so $z$ consisted only of an intercept.


We conducted extensive simulation studies in order to verify the finite-sample statistical properties of our proposed methodology. Simulation parameters were obtained from the HNSCC data set, and most simulations assumed no sources of biological bias (DNA methylation changes arising from processes not mediated by the profiled leukocytes, including shifts in distribution within cell types not profiled). In each simulation, we specified $S_0$ to consist of 5 B-cell samples, 10 granulocyte samples, 5 monocyte samples, 15 NK samples, 5 general "Pan-T" T-cell samples, 8 specific CD4+ T cell samples, and 2 specific CD8+ T cell samples. Estimates from the external validation set $S_0$, described above, were used for mean methylation profiles among WBC types, using the $m = 100$ most informative CpG sites.