Chinese Restaurant Process Mixture Models

This section is intended to provide the minimally sufficient set of mathematical details to understand the tuning parameters of ChiRP’s regression functions. If you need a refresher on Dirichlet Processes, check out this interactive tutorial.

The Generative Model

The ChiRP package implements the following generative Chinese Restaurant Process (CRP) regression mixture. Each regression function takes in training data $D = \{D_i\}_{i=1:n} = (y_i, x_i)_{i=1:n}$ (and a test set, if desired), where $x_i$ is a $p \times 1$ covariate vector (including an intercept) and $y_i$ is a scalar outcome observed for $n$ subjects indexed by $i$. The regression functions returns, for each subject in the training (and testing, if applicable) data set, a posterior distribution of predicted outcome $y$. It also returns cluster assignments for each subject.

The full probability model is given by \[ \begin{aligned} y_i | x_i, \eta_i & \sim f_y(y_i|x_i, \eta_i) \\ x_i | \theta_i & \sim f_x(x_i | \theta_i ) \\ \eta_i, \theta_i | G & \sim G \\ G & \sim DP(\alpha G_0) \end{aligned} \] Above, $f_y$ is the conditional distribution of the outcome. It is governed by parameters $\eta_i$. Below, we outline three currently supported outcome types: zero-inflated continuous outcomes, continuous outcomes, and binary outcomes.

The joint distribution of the covariates, $f_x$, is governed by $\theta_i$. These parameters are distributed according to an unknown distribution $G$, to which we assign a Dirichlet Process (DP) prior. For compactness, we will sometimes denote $\omega_i = (\eta_i, \theta_i)$. Since draws, $G$, from a DP are discrete, there are bound to be ties among the subject specific parameters $\omega_i$. That is, subjects will tend to cluster together. The number of clusters is not set a priori, but rather infinitely many clusters are assumed.

This induced clustering partitions complex data into more homogenous subgroups - each with their own set of regression parameters. This allows CRP models to obtain flexible fits where parametric models may fail.

Package implementation details:

All models accommodate any number of binary and continuous covariates only. Categorical variables are not supported (yet).
All models assume independence among covariates: $f_x(x_i) = \prod_{j=1}^p f_{x^p}(x_i^p|\theta_i^p)$
- $f_{x^p}$ is set to be Bernoulli for binary covariates and Gaussian, $N(\lambda, \tau)$ for continuous covariates.
Hyperparameters
- Concentration parameter, $\alpha$, is estimated with a $Gam(1,1)$ prior.
- For continuous covariates, we use hyperpriors $\lambda \sim N(\mu, var=\sigma^2)$ and $\tau \sim InvGamma(shape=g_2, rate=b_2)$.
- In an Empirical Bayes fashion, we set $\mu = \bar{x}$ and $\sigma^2 = \text{mu_scale} \cdot var(x)$, where mu_scale is an optional tuning parameter.
- We set $g_2=2$ and $b2 = \text{tau_scale} \cdot var(x)$ for all continuous covariates, where tau_scale is an optional tuning parameter. If tau_scale=1, the prior for the variance is centered around the empirical. For tau_scale>1, we allow for larger variances a priori .

Supported Outcome Models

Zero-Inflated Outcomes with `ZDPMix()`

In this implementation, \[f_y(y_i|x_i, \eta_i) = \pi(x_i'\gamma_i)\cdot \delta_0(y) + (1 - \pi(x_i'\gamma_i) ) \cdot N(x_i'\beta, \phi)\]

Where $\pi(x_i'\gamma_i) = expit(x_i'\gamma_i)$ is the probability of the outcome being zero and $\delta_0(y)$ is the point-mass distribution at $0$. See this paper for methodological details.

The function call looks like this:

zDP_res<-ZDPMix(d_train = d, ## training data.frame() object
                d_test = dt, ## test data.frame() object 
                formula = y ~ L1 + X1 + X2 + X3 + X4 + A,
                ## specify total number of mcmc draws (iter) and burnin (burnin)
                burnin=2000, iter = 3000) 
# this will return iter - burnin posterior draws.

ZDPMix()A Gaussian prior $N_p( \beta_{OLS} , \text{beta_var_scale}\cdot\Sigma )$ is chosen for $\beta$. $\beta_{OLS}$ is a linear regression estimated using observations where $y>0$. $\Sigma$ is a diagonal matrix with estimates parameter variances from this regression along the diagonal. The optional parameter beta_var_scale allows the user to make the prior more or less dispersed around the observed data.

A Gaussian prior $N_p(0 , 2\cdot I_p )$. Note that this is actually fairly flat on the logit scale putting substantial prior probability on odds ratios in the range of $.01-.70$.

Sampling for this model involves a Metropolis step to propose a draw of $\gamma$. The optional tuning parameter prop_z_sigma is a length $p$ vector specifying the variance of the Gaussian proposal distribution used in the Metropolis step.

Finally, we place an Inverse Gamma prior on $\phi \sim InvGamma(g_1, b_1)$ on the data variance. The optional tuning parameter phi_y is a length two vector phi_y=$(g_1, b_1)$ that allows the user to specify this parameters.

Continuous Outcomes `NDPMix()`

In this implementation, $f_y(y_i|x_i, \eta_i) = N(x_i'\beta, \phi)$. All the same applies as in the zero-inflated case above.

Binary Outcomes `PDPMix()`

In this implementation, $f_y(y_i|x_i, \eta_i) = Ber( expit( x_i'\beta))$, where $expit()$ is the inverse logistic function.

Posterior Clustering

The Chinese Restaurant Posterior

Let $c_{1:n}$ represent a vector of cluster indicators for the $n$ subjects. Since the DP prior induces a clustering on patients, we can make inference on $c_{1:n}$. Consider clustering each patient one at a time in a given Gibbs iteration, $t$. At the time we cluster subject $i$, $i-1$ subjects have already been clustered to assignments $c_{1:i-1}^{(t)}$ and subject specific parameters $\omega_{1:i-1}^{(t)}$, then the conditional posterior distribution of $c_i^{(t)}$ is,

\[ p(c_i^{(t)} = k| c_{1:i-1}, \omega_{i:i-1} ) \propto \begin{cases} \frac{1}{\alpha + i -1} \sum_{j<i} I(c_j^{(t)}=k)p(D_i | \omega_j) & \text{for } k \in c_{1:i-1}^{(t)} \\ \frac{\alpha}{\alpha + i -1} \int p(D_i | \omega ) dG_0(\omega) & \text{for } k \not \in c^{(t)}_{1:i-1} \end{cases} \]

These equations characterize the Chinese Restaurant Process posterior. At every Gibbs iterations, $t$, a subject’s cluster assignment can be updated to an existing cluster with probability proportional to the number of subjects in that cluster times how well well subject $i$’s data, $D_i$, fits the likelihood under that cluster’s parameters. However, a subject can also be assigned to a new cluster with probability proportional to $\alpha$ times how well $D_i$ fits the likelihood under a new set of parameters drawn from the prior. If a patient’s observed data is very unique and doesn’t fit well with existing parameters, a new cluster is likely to open up to accommodate this subject. However, for large $n$, as $i\rightarrow n$ the probability of a new cluster opening up becomes less and less likely. Thus, the model naturally prevents overfitting (i.e. every subject getting assigned to their own, singleton cluster).

The functions ZDPMix,NDPMix, and PDPMix all output a $n \times T$ matrix for $T$ post-burnin posterior draws. This matrix gives the cluster label for each subject, at each iteration. This is done for both training and testing datasets. These matrices are stored in the function output as cluster_inds$train and cluster_inds$test.

ChiRP makes posterior inference via MCMC sampling as in Neal 2000. We introduce latent cluster indicators $c_{i:n}$ and alternative between updating these assignments conditional on parameters $\omega_{i:n}$ and then updating parameters $\omega_{i:n}$ conditional on assignments.

Posterior Mode Clustering

The function cluster_assign_mode takes as input either cluster_inds$train or cluster_inds$test and computes the posterior mode cluster assignment associated with each subject while accounting for label switching. See this paper for a more precise reference on the deterministic relabeling procedure implemented in ChiRP.

In short, we use the fact that at every Gibbs iteration we can construct an $n\times n$ adjacency matrix with a $1$ in the $i-j$ entry indicating that subject $i$ and $j$ were clustered together and a $0$ otherwise. We can then take an elementwise posterior mean of these matrices to estimate the posterior mode adjacency matrix where the $i-j$ entry is the proportion of times that subject $i$ and $j$ were clustered together . We then do a search to see which of the posterior adjacency matrices is closest to the posterior mode adjacency matrix in the $L_2$ sense. The indicators from this matrix are used as the posterior mode cluster assignment.

Posterior Predictions

Having observed training data $D$, we may wish to predict an outcome $\tilde y$ given testing covariates $\tilde x \in \mathcal R^p$. Given $T$ posterior draws of $\omega_{1:n}^{(t)}$, indexed by $t$, ChiRP implements a Monte Carlo procedure to obtain $T$ draws from the conditional outcome distribution of test observation with covariates $\tilde x$.

Draw $\tilde c \sim Categorical( p(\tilde x | \omega_1^{(t)}) , p(\tilde x | \omega_2^{(t)}, \dots , p(\tilde x | \omega_n^{(t)}), p(\tilde x | \omega_{new}) )$, where $\omega_{new}$ is a new set of cluster parameters drawn from the prior $G_0$.
Set $\tilde \omega$ to be the parameter corresponding to $\tilde c$ drawn above.
Finally, draw $\tilde y^{(t)} \sim f_y( y | \tilde x, \tilde \omega)$ from the generative model defined at the top of this page.

In this way we get $T$ posterior predictive draws for each subject in the test set. The functions ZDPMix,NDPMix, and PDPMixall output an $n\times T$ matrix of these posterior draws. They are called predictions$train and predictions$test. Each row is the posterior predictive distribution for a given subject. We can take the mean of each row to be a posterior predictive mean estimate. We can use percentiles to form credible intervals around the point estimate.