Variance estimation

Variance estimation for child mortality estimators based on surveys faces two main problems. First, most of the mortality estimators have a complex form. Therefore, no simple analytical solution exists for the variance estimator. Second, because the data most often come from complex surveys, the mortality estimator is not a linear function of the data. Again, an analytical solution does not exist.

There are several ways to deal with this complexity. A common approach, used by many software packages, is Taylor series linearization. It works well if the estimator can be expressed as a ratio. In practice, for mortality estimation, that generally only holds for cohort mortality rates, such as an under-five mortality rate obtained by dividing the number of deaths before age five among children born during a period by the number of births during the same period. However, the synthetic cohort mortality rate commonly used is not a ratio.
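For reference, the linearized variance of a ratio takes the following standard first-order form. The symbols are introduced here for illustration and are not CMRJack notation: $\hat{Y}$ and $\hat{X}$ are the weighted totals of deaths and births, and $\hat{R}=\hat{Y}/\hat{X}$.

\[\widehat{\mathrm{Var}}(\hat{R})\approx\frac{1}{\hat{X}^2}\left[\widehat{\mathrm{Var}}(\hat{Y})-2\hat{R}\,\widehat{\mathrm{Cov}}(\hat{Y},\hat{X})+\hat{R}^2\,\widehat{\mathrm{Var}}(\hat{X})\right]\]

The variances and the covariance of the two totals are linear in the data and can be estimated directly from the design, which is what makes the ratio case tractable.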

CMRJack approximates the variance with a so-called Jackknife estimator. Jackknife estimators belong to a general group of methods based on subsample replication. In finite population sampling, an important idea is that one can think of the variance of an estimator as measuring the spread of the values the estimator would take if one drew a large number of samples from the population. The variance is simply the average of the squared differences between each estimate and the mean of all the estimates calculated. Similarly, a 95% confidence interval can be considered the range within which 95% of the estimates can be found. It is usually impractical to draw many samples from a finite population to estimate variances. Even so, there are survey design strategies that do exactly that.
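The repeated-sampling idea can be illustrated with a small simulation. This is only a sketch: the population, the sample size, and the number of repetitions are made up, and simple random sampling is assumed instead of a real survey design.

```python
import random
import statistics

random.seed(1)

# Hypothetical finite population: 1 marks a death before age five, 0 a survivor.
population = [1] * 800 + [0] * 9200  # true rate 0.08


def sample_rate(n):
    """Mortality rate in one simple random sample of n children."""
    return sum(random.sample(population, n)) / n


n_reps = 2000
estimates = [sample_rate(500) for _ in range(n_reps)]

mean_estimate = statistics.fmean(estimates)
# Variance: average squared distance of each estimate from their mean.
variance = sum((e - mean_estimate) ** 2 for e in estimates) / n_reps

# Range that holds the central 95% of the simulated estimates.
ordered = sorted(estimates)
lower = ordered[n_reps * 25 // 1000]
upper = ordered[n_reps * 975 // 1000 - 1]

print(f"mean = {mean_estimate:.4f}, variance = {variance:.6f}")
print(f"central 95% of estimates: {lower:.3f} to {upper:.3f}")
```

With a real survey only one such sample is available, which is the gap that subsample replication tries to fill.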

The idea behind subsample replication is that the sample represents a good approximation to the finite population of interest. Therefore, by taking several samples from the sample, one can approximate the idea of taking many samples from the finite population. Jackknife estimation of variance entails first calculating the estimator of interest, say the under-five mortality rate, for the whole sample. Then one removes the first sampling unit in the list of sampling units from the calculation and recalculates. Next, one puts the first sampling unit back, removes the second, and recalculates. The process goes on until every sampling unit has been removed once. One then has as many reduced-sample estimates as there are sampling units, plus the original estimate from the whole sample. The variance, with minor corrections, is simply the sum of the squared differences between the complete sample estimate and each reduced sample estimate.
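The delete-one procedure described above can be sketched as follows. This is a minimal, unstratified illustration and not CMRJack's code: the function names and the cluster counts are hypothetical, and a simple deaths-per-birth ratio stands in for a real mortality estimator. The "minor correction" used here is the usual factor $(n-1)/n$.

```python
def ratio(units):
    """Deaths per birth over a list of (deaths, births) sampling units."""
    deaths = sum(d for d, _ in units)
    births = sum(b for _, b in units)
    return deaths / births


def delete_one_jackknife(units, estimator):
    """Return (full-sample estimate, jackknife variance) for unstratified units."""
    n = len(units)
    full = estimator(units)
    # One replicate per sampling unit: drop unit i, recompute on the rest.
    replicates = [estimator(units[:i] + units[i + 1:]) for i in range(n)]
    # Sum of squared differences, scaled by the (n - 1) / n correction.
    variance = (n - 1) / n * sum((r - full) ** 2 for r in replicates)
    return full, variance


# Made-up (deaths, births) counts for five clusters.
clusters = [(3, 40), (5, 55), (2, 38), (6, 60), (4, 47)]
estimate, variance = delete_one_jackknife(clusters, ratio)
print(f"estimate = {estimate:.4f}, standard error = {variance ** 0.5:.4f}")
```

Because the estimator is passed in as a function, the same loop works for estimators that have no closed-form variance, which is the main attraction of the method.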

There are several variations on the Jackknife. The one used by CMRJack is the so-called JKn Jackknife. It has three chief characteristics. First, the sampling units deleted from the calculation for each replicate are primary sampling units, i.e. the first units to be selected in multi-stage samples. For most demographic surveys, the primary sampling unit is the cluster. Thus, it is not the household, an individual woman, or an individual birth. Second, it considers stratification, and third, it allows a variable number of primary sampling units within each stratum.

Implementations of the JKn typically focus on the estimation weights. When a cluster is omitted, the weights of the other clusters in the stratum are adjusted by a factor of $m_h/(m_h-1)$, where $m_h$ is the number of sampled clusters in the stratum. Thus, the sampling units in each replicate get weights according to the following scheme:

\[w_{i(hc)}=\left\{\begin{array}{ll}
w_{i} &\mbox{if observation $i$ is not in stratum $h$}\\
0 &\mbox{if observation $i$ is in psu $c$ of stratum $h$}\\
\frac{m_h}{m_h-1}w_{i} &\mbox{if observation $i$ is in stratum $h$ but not in psu $c$}
\end{array}\right.\]
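The weighting scheme can be written as a short function. This is a sketch under assumptions: the function name, argument names, and the toy data are hypothetical and do not come from CMRJack. It scales the original weight of the remaining observations in the stratum by $m_h/(m_h-1)$.

```python
def jkn_replicate_weights(weights, strata, psus, h, c):
    """Weights for the replicate that drops PSU c of stratum h."""
    # m_h: number of distinct sampled PSUs in stratum h.
    m_h = len({p for s, p in zip(strata, psus) if s == h})
    if m_h < 2:
        raise ValueError("a stratum needs at least two PSUs")
    out = []
    for w, s, p in zip(weights, strata, psus):
        if s != h:
            out.append(w)                    # other strata: unchanged
        elif p == c:
            out.append(0.0)                  # the omitted PSU
        else:
            out.append(w * m_h / (m_h - 1))  # rest of stratum h: scaled up
    return out


# Toy data: six observations, stratum "A" has PSUs 1-3, stratum "B" has PSUs 4-5.
weights = [10, 10, 20, 20, 30, 30]
strata = ["A", "A", "A", "A", "B", "B"]
psus = [1, 1, 2, 3, 4, 5]
replicate = jkn_replicate_weights(weights, strata, psus, "A", 1)
print(replicate)
```

Calling the function once for every PSU in every stratum yields the full set of replicate weights, one set per replicate.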

The variance estimate is then calculated as given below, where $\hat{\theta}$ is the estimate from the full sample and $\hat{\theta}_{hj}$ is the estimate from the replicate in which PSU $j$ of stratum $h$ is omitted.

\[\hat{V}_{JKn}=\sum_{h=1}^{H}\frac{m_h-1}{m_h}\sum_{j=1}^{m_h}(\hat{\theta}_{hj}-\hat{\theta})^2\]
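Once the replicate estimates are available, the formula is a short double sum. The sketch below assumes the replicate estimates have already been computed and grouped by stratum; the function name and the numbers are invented for illustration.

```python
def jkn_variance(full_estimate, replicate_estimates):
    """JKn variance from replicate estimates grouped by stratum.

    replicate_estimates maps each stratum to the list of estimates obtained
    by dropping each of its PSUs in turn (one estimate per PSU).
    """
    total = 0.0
    for estimates in replicate_estimates.values():
        m_h = len(estimates)  # number of PSUs in this stratum
        squared = sum((e - full_estimate) ** 2 for e in estimates)
        total += (m_h - 1) / m_h * squared
    return total


# Invented example: three urban PSUs and two rural PSUs.
replicates = {
    "urban": [0.078, 0.083, 0.079],
    "rural": [0.081, 0.077],
}
variance = jkn_variance(0.080, replicates)
print(f"variance = {variance:.3e}, standard error = {variance ** 0.5:.4f}")
```

Note that the factor $(m_h-1)/m_h$ is applied stratum by stratum, so strata with few PSUs contribute with a smaller multiplier but typically with larger squared differences.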

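DHS uses a slightly different version of the Jackknife that omits consideration of the stratification. Here $m$ is the total number of clusters in the sample, $r$ is the estimate from the full sample, $r_{(j)}$ is the estimate calculated with cluster $j$ omitted, and $r_j$ is the corresponding pseudo-value.

\[\begin{array}{l}
\hat{V}_{JKdhs}=\frac{1}{m(m-1)}\sum_{j=1}^{m}(r_{j}-r)^2\\
r_j=mr-(m-1)r_{(j)}
\end{array}\]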
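The two versions are closely related. As a worked step, assuming the sum in the DHS formula runs over all $m$ clusters and that $r$ is the full-sample estimate, substituting the pseudo-value $r_j=mr-(m-1)r_{(j)}$ gives $r_j-r=(m-1)(r-r_{(j)})$, so that

\[\hat{V}_{JKdhs}=\frac{(m-1)^2}{m(m-1)}\sum_{j=1}^{m}(r-r_{(j)})^2=\frac{m-1}{m}\sum_{j=1}^{m}(r_{(j)}-r)^2\]

This has the same form as the JKn expression for a single stratum ($H=1$) that contains all $m$ clusters, which is the sense in which the DHS version leaves the stratification out.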

The author of CMRJack prefers the stratified version, while DHS, unsurprisingly, prefers the DHS version. In practice, the differences between the two are small, except when a survey allocates the sample unequally across strata. Then the difference may become substantial, and the CMRJack variance will usually be higher. In the CMRJack view, this reflects the realities of the sampling design, since unequal allocation nearly always has a cost in overall variance. In the DHS view, the difference reflects that the JKn estimator of variance becomes unstable.

CMRJack uses the stratified version as the default but the DHS version is also implemented.

The Jackknife as used by CMRJack is also implemented in Westat’s WesVar software (Westat 2002) and in SUDAAN (RTI 2001), and is described by Wolter (1985). A good description can also be found in Valliant and Dever (2017). Lu and Lohr (2022) discuss computation. Pedersen and Liu (2012) discuss variance estimation for child mortality estimates.

Cluster identification and variance

A question that sometimes arises is how to identify clusters. DHS recode files typically contain the cluster identification in the variable V021 in the birth history file. DHS also has a cluster identifier in the household file (HV001) that identifies the sample point for fieldwork purposes, but that should not be used for the estimation of standard errors.

Most DHS, MICS, and similar surveys use a two-stage sampling design. Within each stratum, clusters are selected, and households are chosen from the selected clusters. In a few cases, the sample may have several stages, such as the selection of districts, then clusters within the selected districts, and households within the selected clusters. In such cases, the cluster identifier used in the sampling error estimation should be the Primary Sampling Unit, i.e., the district, rather than the second-stage selection of clusters. The term «ultimate cluster» is sometimes used for all the sample elements that belong to one Primary Sampling Unit.

Using the second-stage clusters for the estimation is wrong because most of the variance accrues from the first-stage selection. A commonly asked question is whether one should also include the variance contribution from the second- and possibly third-stage selections. For some survey designs and some estimators (which do not include most child mortality estimators), it is possible to do so. However, if one does, one will typically find that the far from negligible work expended in correctly specifying the sample results in only a very slight reduction of the variance estimate. Moreover, most demographic survey samples are drawn with a linear systematic probability proportionate to size method, which makes constructing the variance estimator very difficult because the selection of sampling units is not independent. For an extended discussion, see, for example, Brady T. West’s note «Accounting for Multi-stage Sample Designs in Complex Sample Variance Estimation.» The issue is also discussed in Heeringa, West and Berglund 2010:67 as well as in Kish 1965:155.

It is relatively easy to account for finite population corrections, i.e., the reduction of variance when the number of sampling units selected makes up a large proportion of the available sampling units in the frame. However, in most demographic surveys, the effect of such corrections is minimal because the sampling fractions are generally small.
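As a rough illustration, under simple random sampling without replacement (a simplifying assumption; the symbols $n$, $N$ and $f$ are introduced here and are not CMRJack notation), the correction multiplies the variance by $1-f$, where $f$ is the sampling fraction:

\[\hat{V}_{fpc}=(1-f)\,\hat{V},\qquad f=\frac{n}{N}\]

With a sampling fraction of 2%, the standard error is multiplied by $\sqrt{0.98}\approx 0.99$, a reduction of about 1%.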