An Introduction to Mestim

This vignette serves as a short introduction to computing the variance-covariance matrix of a multidimensional parameter using M-estimation and the empirical sandwich variance.

Baby review of M-estimation

Denoting $F$ a probability distribution, a $p$-dimensional M-estimator of $\psi$-type $T$ solves the vector equation \[\int_\mathcal{Z}\psi\big(z, T(F)\big)dF(z)=\boldsymbol{0}.\] In practice, this means that the estimator $T$ has an unbiased estimating function $\psi(z,\theta)=\{\psi_1(z,\theta), \ldots, \psi_p(z,\theta)\}^T$ with solution $\hat{\theta}~(p\times 1)$ solving in $\theta$ the $(p\times 1)$ set of “stacked” estimating equations given by \[ \sum_{i=1}^{n}\psi(Z_i,\theta)=\boldsymbol{0}.\] For a $\psi$-type M-estimator $T$ with estimate $\hat{\theta}$, under suitable regularity conditions for $\psi$, the central limit theorem and Slutsky’s theorem yield \[\sqrt{n}\big(\hat{\theta}-T(F)\big)\xrightarrow{\mathcal{D}}\mathcal{N}\big(0, \Sigma)\] where \[\Sigma=\Bigg[\mathbb{E}\bigg\{\frac{\partial \psi\big(Z,T(F)\big) }{\partial\theta^T}\bigg\}\Bigg]^{-1}\mathbb{E}\Big\{ \psi\big(Z,T(F)\big) \psi^T\big(Z,T(F)\big)\Big\}\Bigg[\mathbb{E}\bigg\{\frac{\partial \psi\big(Z,T(F)\big) }{\partial\theta^T}\bigg\}\Bigg]^{-T}.\] This implies that the $p$-dimensional M-estimator $\hat{\theta}$ is an asymptotically normal, $\sqrt{n}$-consistent, estimator for $T(F)$. See Boos and Stefanski (2013) for a full introduction to M-estimation.

What `Mestim` does

Provided with observed data $(Z_i)_{1≤i≤n}$, a $p$-dimensional vector of estimates $\hat{\theta}$ and a $(p\times 1)$ unbiased estimating function $\psi$, the Mestim package computes the sandwich formula \[\hat{\Sigma}=\Bigg[n^{-1}\sum_{i=1}^n\bigg\{\frac{\partial \psi\big(Z_i,\hat{\theta}\big) }{\partial\theta^T}\bigg\}\Bigg]^{-1} n^{-1}\sum_{i=1}^n\Big\{ \psi\big(Z_i,\hat{\theta}\big) \psi^T\big(Z_i,\hat{\theta}\big)\Big\} \Bigg[n^{-1}\sum_{i=1}^n\bigg\{\frac{\partial \psi\big(Z_i,\hat{\theta}\big) }{\partial\theta^T}\bigg\}\Bigg]^{-T}.\] The estimated asymptotic variance-covariance matrix of $\hat{\theta}$ is $n^{-1} \hat{\Sigma}$, and so, we have \[\hat{\theta} \mathrel{\dot\sim} \mathcal{N}\big(0, n^{-1} \hat{\Sigma}).\] Under the hood, Mestim algorithmically computes the Jacobian matrix of $\psi$; derivatives and outer products in $\hat{\Sigma}$ are then computed in parallel. To compute the asymptotic variance-covariance matrix, the analyst thus only need to provide a list detailing the “stacked” estimating functions in $\psi$. Below, we give examples of growing complexity to illustrate how Mestim can leverage the flexibility of M-estimation theory to calculate asymptotic standard errors (and confidence intervals) for parameter estimates $\hat{\theta}$.

Example 1: Prediction task via logistic regression

Let’s try to compute the asymptotic standard errors of estimated parameters in a logistic regression model. This simple example serves to get familiar with Mestim commands.

Let’s generate synthetic data with two predictors and a binary outcome such that $Z=(X_1,X_2,Y)^T$.

click here to see the data generating process.

Here we use \[X_1 \sim \mathcal{N}(0,1)\] \[X_2 \sim \mathcal{N}(0,3)\] \[Y|X \sim \mathcal{B}\Big(\text{expit}(\beta_1^{0}X_1+\beta_2^{0}X_2)\Big)\] with $\beta_1^{0}=4$, $\beta_2^{0}=6$.

NB: We use superscript $^{0}$ to denote true values of the parameters.

gen_lr_dat <- function(n, seed=123)
{
set.seed(seed)
X_1 <- rnorm(n, sd = 1); X_2 <- rnorm(n, sd = 3) # generate x_1 and x_2
true_betas <- c(4,5) # generate true parameters
X <- model.matrix(~-1+X_1+X_2) # build the design matrix
Y <- rbinom(n, 1, 1/(1 + exp(-X %*% true_betas)) ) # generate Y from X and true_betas
dat  <-  data.frame(X_1=X_1, X_2=X_2, Y=Y) # build a simulated dataset
return(dat)
}

dat <- gen_lr_dat(5000)
head(dat)

##           X_1       X_2 Y
## 1 -0.56047565 -1.482522 0
## 2 -0.23017749  3.382780 1
## 3  1.55870831 -3.440849 0
## 4  0.07050839  4.443056 1
## 5  0.12928774  2.748574 1
## 6  1.71506499  1.005393 1

Let’s fit a logistic regression model and put the estimated parameters in a list.

mod <- glm(Y~-1 + X_1 + X_2, data=dat, family = "binomial")
thetas_hat <- list(thetas_1=coef(mod)[1], thetas_2=coef(mod)[2])

Recall that the estimated parameters $\hat{\theta}=(\hat{\theta}_1, \hat{\theta}_2)^T$ from this logistic regression model jointly solve \[\sum_{i=1}^{n}\bigg(\Big[1+\exp\big\{-{(\theta_1X_{1,i}+\theta_2X_{2,i})\big\}\Big]^{-1}}-Y_i\bigg)X_{1,i} =0\] and \[\sum_{i=1}^{n}\bigg(\Big[1+\exp\big\{-{(\theta_1X_{1,i}+\theta_2X_{2,i})\big\}\Big]^{-1}}-Y_i\bigg)X_{2,i} =0.\] Therefore, we can identify \[\psi_1(Z_i,\theta)=\bigg(\Big[1+\exp\big\{-{(\theta_1X_{1,i}+\theta_2X_{2,i})\big\}\Big]^{-1}}-Y_i\bigg)X_{1,i}\] and \[\psi_2(Z_i,\theta)=\bigg(\Big[1+\exp\big\{-{(\theta_1X_{1,i}+\theta_2X_{2,i})\big\}\Big]^{-1}}-Y_i\bigg)X_{2,i}.\] With that in mind, let’s build a list for the unbiased estimating function $\psi(z,\theta)=\Big(\psi_1(z,\theta), \psi_2(z,\theta)\Big)^T$.

psi_1 <- expression( ((1/(1+exp(-(thetas_1 * X_1 + thetas_2 * X_2)))) - Y) * X_1 )
psi_2 <- expression( ((1/(1+exp(-(thetas_1 * X_1 + thetas_2 * X_2)))) - Y) * X_2 )
psi <- list(psi_1, psi_2)

NB: parameters’ names (here thetas_1 and thetas_2) must be consistent with the previous list.

We are finally ready to pass these arguments to the get_vcov function form the Mestim package.

library(Mestim)
res <- get_vcov(data=dat, thetas=thetas_hat, M=psi)

You can obtain the variance-covariance matrix from a get_vcov result as follows

res$vcov

##            thetas_1   thetas_2
## thetas_1 0.05239025 0.05366863
## thetas_2 0.05366863 0.06795271

click here to verify that the variance-covariance matrix from get_vcov is similar to that of glm.

vcov(mod)

##            X_1        X_2
## X_1 0.05698408 0.06143937
## X_2 0.06143937 0.07963092

This is indeed very close the results inres$vcov.

Asymptotic standard errors are square root of the diagonal elements from the estimated variance-covariance matrix. These are stored in the se attribute.

res$se

##  thetas_1  thetas_2 
## 0.2288892 0.2606774

Example 2: Average treatment effect via outcome regression

Let’s generate synthetic observational data with treatment allocation $A$, continuous outcome $Y$ and a single confounder $X$ such that $Z=(X,A,Y)^T$.

click here to see the data generating process.

Here we use \[X \sim \mathcal{N}(0,1)\] \[A|X \sim \mathcal{B}\Big(\text{expit}(2X)\Big)\] \[\epsilon \sim \mathcal{N}(0,20)\] \[Y|X,\epsilon = \gamma_1^{0} X + \gamma_2^{0} A + \gamma_3^{0} AX + \epsilon\] with $\gamma_1^{0}=4$, $\gamma_2^{0}=3$, and $\gamma_3^{0}=2$.

NB: We use superscript $^{0}$ to denote true values of the parameters.

gen_obs_dat <- function(n, seed=123)
{
set.seed(seed)
X <- rnorm(n) # generate X
A <- rbinom(n, 1, 1/(1 + exp(- 2 * X)) )  # generate treatment allocation A
X_mat <- model.matrix(~ -1 + X + A + A:X) # build the design matrix
true_gammas <- c(4,3,2) 
epsi <- rnorm(n,0,20) # generate gaussian noise 
Y <- (X_mat %*% true_gammas) + epsi # generate observed outcomes 
dat  <-  data.frame(X=X, A=A, Y=Y) # build a simulated dataset
return(dat)
}

dat <- gen_obs_dat(5000)
head(dat)

##             X A          Y
## 1 -0.56047565 0   4.758148
## 2 -0.23017749 0  15.368124
## 3  1.55870831 1   2.018927
## 4  0.07050839 1 -50.422238
## 5  0.12928774 1 -18.163366
## 6  1.71506499 1 -11.819112

In this example, the goal is to calculate standard errors for the outcome regression average causal effect estimator \[\hat{\delta}=n^{-1}\sum_{i=1}^n\mathbb{\hat{E}}(Y|X=X_i, A=1)-\mathbb{\hat{E}}(Y|X=X_i, A=0).\]

For $\mathbb{E}(Y|X, A)$, let’s specify the regression model $m(X, A;\boldsymbol{\gamma})=\gamma_1X + \gamma_2A + \gamma_3AX$ and store the estimated parameters.

m <- lm(Y~ -1 + X + A + A:X, data = dat)
gamma_1_hat <- coef(m)[1]
gamma_2_hat <- coef(m)[2]
gamma_3_hat <- coef(m)[3]

Recall that the estimated parameters $\hat{\boldsymbol{\gamma}}=(\hat{\gamma}_1,\hat{\gamma}_2,\hat{\gamma}_3)^T$ from this linear regression model jointly solve \[\sum_{i=1}^n (Y_i-\gamma_1X_i - \gamma_2A_i - \gamma_3A_iX_i)X_i=0,\] \[\sum_{i=1}^n (Y_i-\gamma_1X_i - \gamma_2A_i - \gamma_3A_iX_i)A_i=0,\] \[\sum_{i=1}^n (Y_i-\gamma_1X_i - \gamma_2A_i - \gamma_3A_iX_i)A_iX_i=0.\] Disregarding the summation sign, we can straightforwardly identify the three first elements of the estimating function $\psi(z,\theta)$. Before building a list detailing the function $\psi$, we need to identify the estimating function of our main parameter of interest $\delta.$ To do so, recall that we can estimate $\delta$ as \[\hat{\delta}=n^{-1}\sum_{i=1}^nm(X_i,1;\hat{\boldsymbol{\gamma}})-m(X_i,0;\hat{\boldsymbol{\gamma}}) \\ =n^{-1}\sum_{i=1}^n\{\hat{\gamma}_1+ \hat{\gamma}_2 \times 1 +\hat{\gamma}_3 \times 1 \times X_i\} - \{\hat{\gamma}_1+ \hat{\gamma}_2 \times 0 +\hat{\gamma}_3 \times 0 \times X_i\} \\ =n^{-1}\sum_{i=1}^n \hat{\gamma}_2 +\hat{\gamma}_3 X_i. \] Let’s first compute it.

delta_hat <- mean(gamma_2_hat + gamma_3_hat*dat$X)

Note that rearranging the last equality we have \[\sum_{i=1}^n \hat{\gamma}_2 +\hat{\gamma}_3 X_i - \hat{\delta} = 0\] which straightforwardly yields the last element of the estimating function $\psi(z,\theta)$. Disregarding the summation sign yields the last estimating function which we can now “stack” with the previous ones. Let’s now build a list detailing the full function $\psi(z,\theta)$.

psi_1 <- expression( (Y - gamma_1*X - gamma_2*A - gamma_3*A*X) * X )
psi_2 <- expression( (Y - gamma_1*X - gamma_2*A - gamma_3*A*X) * A )
psi_3 <- expression( (Y - gamma_1*X - gamma_2*A - gamma_3*A*X) * A*X )
psi_4 <- expression( gamma_2 + gamma_3 * X - delta )
psi <- list(psi_1, psi_2, psi_3, psi_4)

Let’s also store all the estimated parameters $\hat{\theta}=(\hat{\gamma}_1,\hat{\gamma}_2,\hat{\gamma}_3,\hat{\delta})^T$ in a list.

thetas_hat <- list(gamma_1=gamma_1_hat,
                  gamma_2=gamma_2_hat,
                  gamma_3=gamma_3_hat,
                  delta=delta_hat)

Let’s pass the relevant arguments to get_vcov and check results for the variance-covariance matrix.

res <- get_vcov(data=dat, thetas=thetas_hat, M=psi)
res$vcov

##               gamma_1       gamma_2    gamma_3         delta
## gamma_1  1.686258e-01 -5.116353e-17 -0.1686258  2.291608e-05
## gamma_2 -4.167777e-17  2.510135e-01 -0.1497095  2.509786e-01
## gamma_3 -1.686258e-01 -1.497095e-01  0.4228791 -1.496732e-01
## delta    2.291608e-05  2.509786e-01 -0.1496732  2.512757e-01

Let’s see how the results compare with standard errors obtained from the bootstrap.

click here to see the bootstrap procedure

boot_fun <- function(d, i=1:nrow(d)) {
  z<-d[i,]
  mod <- lm(Y~ -1 + X + A + A:X, data = z)
  gamma_1_hat <- coef(mod)[1]
  gamma_2_hat <- coef(mod)[2]
  gamma_3_hat <- coef(mod)[3]
  delta_hat <- mean(gamma_2_hat*1 + gamma_3_hat*1*z$X)
  return( c(gamma_1_hat, gamma_2_hat, gamma_3_hat, delta_hat) )
}
boot_start_time <- Sys.time()
res_boot <- boot::boot(dat, boot_fun, R=999)
boot_end_time <- Sys.time()
paste("Bootstrapping took", round(as.numeric(boot_end_time - boot_start_time), 2), "seconds.")

## [1] "Bootstrapping took 4.07 seconds."

##              [,1]        [,2]       [,3]         [,4]
## [1,]  0.166959339 -0.00255421 -0.1636470 -0.002452947
## [2,] -0.002554210  0.24003558 -0.1433816  0.240057058
## [3,] -0.163647006 -0.14338163  0.4128908 -0.143436074
## [4,] -0.002452947  0.24005706 -0.1434361  0.240449603

This is pretty close to the results in res$vcov that we obtained 47.4 times faster with Mestim.

Example 3: Value estimation for dynamic treatment regime

Let’s generate synthetic observational data for dynamic clinical decisions. We note $S_t$ the observed state at time $t$, $A_t$ the binary action taken at time $t$, and $Y$ the terminal outcome for the sequence where higher values indicate worse disease symptoms. For illustrative purpose, we consider only $T=2$ decision points so that we have data $Z=(S_1,A_1,S_2,A_2,Y)^T.$

click here to see the data generating process.

The data are generated via the following hidden Markov process, where we only get to observe $S_t$, which is a noisy version of the underlying state $X_t$: \[X_1 \sim \mathcal{N}(0,0.1)\] \[X_{t+1}|X_t, A_t \sim \mathbf{1}_{X_t>0} \mathcal{N}(1.1X_t - 0.5A_t, 0.05) + \mathbf{1}_{X_t<0}X_t\] \[S_t|X_t \sim \mathcal{N}(X_t,0.1)\] \[A_1|S_1 \sim \mathcal{B}\Big(\text{expit}(-0.1+\log S_t)\Big)\] \[A_2|S_2,A_1 \sim \mathcal{B}\Big(\text{expit}(0.1+\log S_t + 3A_1)\Big)\]

\[Y|X_3,A_2,A_1 \sim \text{exp}\Bigg(\mathcal{N}\Big(X_3 + \lambda(A_2+A_1),0.1\Big)\Bigg).\] We consider that receiving treatment actions has a side effect penalty of $\lambda=0.1$.

gen_dtr_dat <- function(n, seed=456)
{
set.seed(seed)
expit <- function(x) 1/(1+exp(-x))

X_1 <- rnorm(n, sd=.1)
S_1 <- exp(rnorm(n, mean = X_1, sd = .1))
A_1 <- rbinom(n, size = 1, prob = expit(-.1+log(S_1)))

X_2 <- (X_1>0) * rnorm(n, mean = 1.1*X_1 - .5 * A_1, sd=.05) + (X_1<0) * X_1
S_2 <- exp(rnorm(n, mean = X_2, sd = .1))
A_2 <- rbinom(n, size = 1, prob = expit(.1+log(S_2)+3*A_1))

X_3 <- (X_2>0) * rnorm(n, mean = 1.1*X_2 - .5 * A_2, sd=.05) + (X_2<0) * X_2
Y <- exp(rnorm(n, mean = X_3 + .1*(A_1 + A_2), sd = .1)) #0.1 penalty for treating

dat <- data.frame(S_1=S_1, A_1=A_1, S_2=S_2, A_2=A_2, Y)  
return(dat)
}

dat <- gen_dtr_dat(5000)
head(dat)

##         S_1 A_1       S_2 A_2        Y
## 1 0.8402181   1 0.9292966   1 1.046362
## 2 1.0033476   0 1.0623962   0 1.024897
## 3 1.0889107   0 1.0687287   0 1.152630
## 4 0.8716224   1 0.8938716   1 1.069209
## 5 0.9743708   1 0.9324451   1 1.189594
## 6 1.0222173   1 0.9174528   1 1.195341

Given any treatment action regime $d=\{d_1,\ldots, d_T\}^T$, an estimator for $\mathbb{E}_{Z\sim d}(Y)$ is \[\begin{equation} \tag{1} \hat{\mathcal{V}}_{IPW}(d)=n^{-1}\sum_{i=1}^nY_i\prod_{t=1}^T\frac{\mathbf{1}\big({d_t(H_{t,i})=A_{t,i}}\big)}{\hat{e}_t(H_{t,i})^{d_t(H_{t,i})}\big\{1-\hat{e}_t(H_{t,i})\big\}^{1-d_t(H_{t,i})} } \end{equation}\]

where we use the history notation $H_t=(S_{t},A_{t-1},S_{t-1}, \ldots,S_{1})^T$ and write the relevant generalization of the propensity score as $e_t(H_t)=\mathbb{E}(A_{t}|H_t)$ for clarity.

In this example we consider the regime $\tilde{d}(Z)=\{\tilde{d}_1(H_1)=\mathbf{1}_{S_1>1},\tilde{d}_2(H_2)=\mathbf{1}_{S_2>1}\}^T$. The goal is to calculate standard errors for $\hat{\mathcal{V}}_{IPW}(\tilde{d})$. As $T=2$, we need specify models for $e_1(H_1)$ and $e_2(H_2)$. Let’s specify the parametric regression models \[e_1(H_1;\boldsymbol{\delta})=\text{expit}\big(\delta_1+\delta_2\log S_1)\] and \[e_2(H_2;\boldsymbol{\phi})=\text{expit}\big(\phi_1+\phi_2\log S_2 +\phi_3A_1).\] We fit and store the estimated parameters as follows.

e_1 <- glm(A_1~I(log(S_1)), data=dat, family = "binomial")
delta_1_hat <- coef(e_1)[1]
delta_2_hat <- coef(e_1)[2]

e_2 <- glm(A_2~I(log(S_2)) + A_1 , data=dat, family = "binomial")
phi_1_hat <- coef(e_2)[1]
phi_2_hat <- coef(e_2)[2]
phi_3_hat <- coef(e_2)[3]

As in example 1, recall that for $e_1$ the estimated parameters $\hat{\boldsymbol{\delta}}=(\hat{\delta}_1, \hat{\delta}_2)^T$ jointly solve \[\sum_{i=1}^{n}\Big[1+\exp\big\{-{(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{-1}}-A_{1,i} =0,\] \[\sum_{i=1}^{n}\bigg(\Big[1+\exp\big\{-{(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{-1}}-A_{1,i}\bigg) \log S_{1,i}=0.\] Similarly for $e_2$ the estimated parameters $\hat{\boldsymbol{\phi}}=(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)^T$ jointly solve \[\sum_{i=1}^{n}\Big[1+\exp\big\{-{(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{-1}}-A_{2,i} =0,\] \[\sum_{i=1}^{n}\bigg(\Big[1+\exp\big\{-{(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{-1}}-A_{2,i}\bigg) \log S_{2,i}=0,\] \[\sum_{i=1}^{n}\bigg(\Big[1+\exp\big\{-{(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{-1}}-A_{2,i}\bigg) A_{1,i}=0.\]

Disregarding the summation sign, we can straightforwardly identify the five first elements of the estimating function $\psi(z,\theta)$. Let’s store them before building our final list for $\psi$.

Note that for programming convenience, we recommend to store all relevant variable transformations as columns in the original dataframe.

dat$log_S_1 <- log(dat$S_1) ; dat$log_S_2 <- log(dat$S_2) # For ease of programming

psi_1 <- expression( ((1/(1+exp(-(delta_1 + delta_2 * log_S_1)))) - A_1) * 1 )
psi_2 <- expression( ((1/(1+exp(-(delta_1 + delta_2 * log_S_1)))) - A_1) * log_S_1)

psi_3 <- expression( ((1/(1+exp(-(phi_1 + phi_2 * log_S_2 + phi_3 * A_1)))) - A_2) * 1 )
psi_4 <- expression( ((1/(1+exp(-(phi_1 + phi_2 * log_S_2 + phi_3 * A_1)))) - A_2) * log_S_2)
psi_5 <- expression( ((1/(1+exp(-(phi_1 + phi_2 * log_S_2 + phi_3 * A_1)))) - A_2) * A_1)

To obtain the last element of $\psi$, we need to do algebraic manipulations. Denoting $\mathcal{C}_{\tilde{d}, i}=\prod_{t=1}^2\mathbf{1}\big({\tilde{d}_t(H_{t,i})=A_{t,i}}\big)$ for simplicity, after substitution for $\hat{e}_1(H_1;\boldsymbol{\hat{\delta}})$ and $\hat{e}_2(H_2;\boldsymbol{\hat{\phi}})$, equation (1) yields \[\begin{equation} \hat{\mathcal{V}}_{IPW}(\tilde{d})=\\ n^{-1}\sum_{i=1}^nY_i\mathcal{C}_{\tilde{d}, i} \frac{\Big[1+\exp\big\{-(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{\tilde{d}_1(S_{1,i})} \Big[1+\exp\big\{-(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{\tilde{d}_2(S_{2,i})}} {\bigg(1-\Big[1+\exp\big\{-(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{-1}\bigg)^{1-\tilde{d}_1(S_{1,i})}\bigg(1-\Big[1+\exp\big\{-(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{-1}\bigg)^{1-\tilde{d}_2(S_{2,i})}}. \end{equation}\]

Rearrangements of the equation above yield \[\begin{equation} \sum_{i=1}^nY_i\mathcal{C}_{\tilde{d}, i} \frac{\Big[1+\exp\big\{-(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{\tilde{d}_1(S_{1,i})} \Big[1+\exp\big\{-(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{\tilde{d}_2(S_{2,i})}} {\bigg(1-\Big[1+\exp\big\{-(\delta_1+\delta_2 \log S_{1,i})\big\}\Big]^{-1}\bigg)^{1-\tilde{d}_1(S_{1,i})}\bigg(1-\Big[1+\exp\big\{-(\phi_1+\phi_2 \log S_{2,i}+\phi_3 A_{1,i})\big\}\Big]^{-1}\bigg)^{1-\tilde{d}_2(S_{2,i})}} \\ -\hat{\mathcal{V}}_{IPW}(\tilde{d})=0. \end{equation}\] We can now straightforwardly identify and store the last elements of the estimating function $\psi$.

# The regime we are interested in
dat$d_1 <- dat$S_1>1
dat$d_2 <- dat$S_2>1

# For ease of programming
dat$C_d <- with(dat, d_1==A_1 & d_2==A_2)

# Store the last element of psi
psi_6 <- expression( Y * C_d *
          (  (1+exp(-(delta_1 + delta_2 * log_S_1)))^d_1 * # numerator
             (1+exp(-(phi_1 + phi_2 * log_S_2 + phi_3 * A_1)))^d_2  ) 
          
      /   (  (1-(1+exp(-(delta_1 + delta_2 * log_S_1)))^(-1))^(1-d_1) * # denominator
             (1-(1+exp(-(phi_1 + phi_2 * log_S_2 + phi_3 * A_1)))^(-1))^(1-d_2)  ) 
- V
)

Let’s now build a list detailing the full function $\psi(z,\theta)$.

psi <- list(psi_1, psi_2, psi_3, psi_4, psi_5, psi_6)

Now, let’s compute $\hat{\mathcal{V}}_{IPW}(\tilde{d})$ and stack it in a list of all the estimated parameters $\hat{\theta}=\Big(\hat{\delta}_1,\hat{\delta}_2,\hat{\phi}_1,\hat{\phi}_2,\hat{\phi}_3, \hat{\mathcal{V}}_{IPW}(\tilde{d})\Big)^T$.

For simplicity in computing $\hat{\mathcal{V}}_{IPW}(\tilde{d})$, we recommend to use a slightly modified version of psi_6 as follows.

# Just delete "- V" from in the previous expression
# add _hat as appropriate
ipw_estimator <- expression( Y * C_d *
          (  (1+exp(-(delta_1_hat + delta_2_hat * log_S_1)))^d_1 * # numerator
             (1+exp(-(phi_1_hat + phi_2_hat * log_S_2 + phi_3_hat * A_1)))^d_2  ) 
          
      /   (  (1-(1+exp(-(delta_1_hat + delta_2_hat * log_S_1)))^(-1))^(1-d_1) * # denominator
             (1-(1+exp(-(phi_1_hat + phi_2_hat * log_S_2 + phi_3_hat * A_1)))^(-1))^(1-d_2)  )  
)

# Compute individual contribution and take the average
V_hat <- with(dat, mean(eval(ipw_estimator))) # Other ways to compute this quantity are OK too.

thetas_hat <- list(delta_1=delta_1_hat,
                   delta_2=delta_2_hat,
                   phi_1=phi_1_hat,
                   phi_2=phi_2_hat,
                   phi_3=phi_3_hat,
                   V=V_hat)

Let’s pass the relevant arguments to get_vcov and check results for the standard errors.

res <- get_vcov(data=dat, thetas=thetas_hat, M=psi)
res$se

##    delta_1    delta_2      phi_1      phi_2      phi_3          V 
## 0.02836275 0.19963843 0.03921097 0.22778301 0.12032851 0.03641272

Let’s see how the results compare with standard errors obtained from the bootstrap.

click here to see the bootstrap procedure

boot_fun <- function(d, i=1:nrow(d)) {
  z<-d[i,]
  e_1 <- glm(A_1~I(log(S_1)), data=z, family = "binomial")
  e_2 <- glm(A_2~I(log(S_2)) + A_1 , data=z, family = "binomial")

  delta_1_hat <- coef(e_1)[1]
  delta_2_hat <- coef(e_1)[2]
  phi_1_hat <- coef(e_2)[1]
  phi_2_hat <- coef(e_2)[2]
  phi_3_hat <- coef(e_2)[3]

  ipw_estimator <- expression( z$Y * z$C_d *
            (  (1+exp(-(delta_1_hat + delta_2_hat * z$log_S_1)))^z$d_1 * # numerator
               (1+exp(-(phi_1_hat + phi_2_hat * z$log_S_2 + phi_3_hat * z$A_1)))^z$d_2  ) 
            
        /   (  (1-(1+exp(-(delta_1_hat + delta_2_hat * z$log_S_1)))^(-1))^(1-z$d_1) * # denominator
               (1-(1+exp(-(phi_1_hat + phi_2_hat * z$log_S_2 + phi_3_hat * z$A_1)))^(-1))^(1-z$d_2)  )  
  )

  V_hat <- mean(eval(ipw_estimator))
  return( c(delta_1_hat, delta_2_hat, phi_1_hat, phi_2_hat, phi_3_hat, V_hat) )
}
boot_start_time <- Sys.time()
res_boot <- boot::boot(dat, boot_fun, R=999)
boot_end_time <- Sys.time()
paste("Bootstrapping took", round(as.numeric(boot_end_time - boot_start_time), 2), "seconds.")

## [1] "Bootstrapping took 21 seconds."

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dat, statistic = boot_fun, R = 999)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1* -0.10641382 -0.0002500751  0.02843059
## t2*  0.65733524  0.0080253525  0.20518095
## t3*  0.07486354  0.0005029474  0.03972660
## t4*  1.22872312 -0.0066237516  0.23196144
## t5*  3.12746280  0.0087706975  0.12208167
## t6*  0.83983316  0.0021767814  0.03644645

This is pretty close to the results in res$se that we obtained 190.5 times faster with Mestim.

References

Boos, D. D. and Stefanski, L. (2013). Essential Statistical Inference. Springer, New York. https://doi.org/10.1007/978-1-4614-4818-1.

Tsiatis, A. A., Davidian, M., Holloway, S. T. & Laber, E. B. (2019), Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, CRC Press. https://doi.org/10.1201/9780429192692.

An Introduction to `Mestim`

Applied M-estimation and computation of the empirical sandwich variance.

François Grolleau

December 17, 2022

Baby review of M-estimation

What `Mestim` does

Example 1: Prediction task via logistic regression

Example 2: Average treatment effect via outcome regression

Example 3: Value estimation for dynamic treatment regime

References

An Introduction to Mestim

Applied M-estimation and computation of the empirical sandwich variance.

François Grolleau

December 17, 2022

Baby review of M-estimation

What Mestim does

Example 1: Prediction task via logistic regression

Example 2: Average treatment effect via outcome regression

Example 3: Value estimation for dynamic treatment regime

References

An Introduction to `Mestim`

What `Mestim` does