SAMPLE SIZE CALCULATION

This section describes how to calculate the sample size when the survey situation fits neither the one used for Table 4.9 nor the one used for Table 4.10 in Chapter 4. The sample size calculation applies only to persons, since the most important indicators for end-decade assessment are person-based. Household sample size calculations would require not only a different formula but also very different design effect (deff) values, of 10 or more.

The calculating formula, taking into account the parameters and assumptions discussed in Chapter 4, is given by

n = [4 (r) (1 - r) (f) (1.1)] / [(e^2) (p) (nh)], where (1)

(taking the components in order)

n is the required sample size, expressed as a number of households, for the KEY (rarest) indicator,

4 is a factor to achieve the 95 percent level of confidence,

r is the predicted or anticipated prevalence (coverage rate) for the key indicator, which is based upon the smallest target group (in terms of its proportion of the total population),

f is the deff,

1.1 is the factor necessary to raise the sample size by 10 percent for nonresponse,

e is the margin of error to be tolerated,

p is the proportion of the total population that the smallest group comprises, and

nh is the average household size.

A numerical example is provided to illustrate the calculation.

EXAMPLE (MODERATE-TO-HIGH COVERAGE RATE):

Suppose the target group in your country that comprises the smallest percentage of the total population is one-year-old children (recall that we are purposely excluding the four-month age groups that form the base for the breastfeeding indicators) and that this group comprises 3 percent of the population. Further suppose that their DPT coverage is anticipated to be the lowest of all the indicator coverages, at 50 percent, and that we want the margin of error to be 5 percentage points. If your average household size is 6 persons and we assume the sample deff is moderate, or 1.75, then the values of your parameters will be as follows:

r = 0.5

p = .03

f = 1.75

e = .05

nh = 6

Substituting, you have

n = {4 x 0.5 x (1-0.5) x 1.75 x 1.1} / {(.05)^2 x .03 x 6}

= 4,278.

This is the number of households you would need to survey in order to estimate DPT coverage of about 50 percent with a margin of error of 5 percentage points. Those households would contain about 25,667 persons, of which about 770 would be one-year-old children.
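The calculation above can be sketched in Python as follows. This is a minimal illustration of formula (1), not part of the original manual; the function name `required_households` and its argument names are assumptions chosen to mirror the symbols in the text. The result is rounded to the nearest whole household, as the worked example does.

```python
def required_households(r, p, f, e, nh,
                        confidence_factor=4.0, nonresponse_factor=1.1):
    """Formula (1): households needed to estimate coverage r within margin e.

    The defaults of 4 (95 percent confidence) and 1.1 (10 percent
    nonresponse) are the factors used in the text.
    """
    numerator = confidence_factor * r * (1 - r) * f * nonresponse_factor
    denominator = e ** 2 * p * nh
    return numerator / denominator


# Parameter values from the moderate-to-high coverage example.
nh = 6
p = 0.03
n = required_households(r=0.5, p=p, f=1.75, e=0.05, nh=nh)

# The text derives persons and children from the unrounded n.
persons = n * nh
children = persons * p

print(round(n), round(persons), round(children))  # 4278 25667 770
```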

Formula (1) can be rewritten in shortcut version for easy calculation whenever the values of p, f, e, and nh are fixed at .03, 1.75, .05, and 6, respectively, and when the 95 percent level of confidence and nonresponse adjustment (factors of 4 and 1.1, respectively) are not changed. In that case the shortcut version is given by

n = (17,111) r (1- r). (2)

It is recommended to use the formulas (long or shortcut) instead of Table 4.9 in Chapter 4 if your moderate-to-high prevalence rate is not close to 50 percent, which is the value that the table is based upon. You would have to use the long version (formula 1) if you want to change one or more of the p, e, f, or nh values.

You might also want to consider using the long version if your nonresponse is not expected to be as high as 10 percent, in which case you would substitute for the factor of 1.1 accordingly.

It is recommended that you use the formula instead of Table 4.9 if your lowest-coverage indicator is quite high (for example, 75 percent), because the sample size will be considerably smaller. For an r value of 0.75, for example, n would be 3,208 (short formula).
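As a rough check on the shortcut version, the constant in formula (2) can be recomputed from the fixed parameter values and then applied to several coverage rates. The sketch below is illustrative only; `shortcut_constant` is a hypothetical helper name, and the 5 percent nonresponse case is an assumed value used to show how the factor of 1.1 would be replaced.

```python
def shortcut_constant(p, f, e, nh,
                      confidence_factor=4.0, nonresponse_factor=1.1):
    """Everything in formula (1) except the r (1 - r) term."""
    return confidence_factor * f * nonresponse_factor / (e ** 2 * p * nh)


# Fixed values behind formula (2): p = .03, f = 1.75, e = .05, nh = 6.
k = shortcut_constant(p=0.03, f=1.75, e=0.05, nh=6)
print(round(k))  # 17111

# Sample size falls as r moves away from 0.5.
for r in (0.5, 0.6, 0.75, 0.9):
    print(r, round(k * r * (1 - r)))

n_at_75 = k * 0.75 * (1 - 0.75)

# Assumed illustration: nonresponse expected to be 5 percent, not 10.
k_low_nonresponse = shortcut_constant(p=0.03, f=1.75, e=0.05, nh=6,
                                      nonresponse_factor=1.05)
```

Changing any of p, e, f, or nh changes the constant itself, which is why the long version is needed in those cases.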

Another example is provided for the case where your key indicator has low coverage.

EXAMPLE (LOW COVERAGE RATE):

Suppose your polio coverage is expected to be about 25 percent. In this case you would want your margin of error to be 3 percentage points instead of 5 (so that the confidence interval for the coverage estimate is 22 to 28 percent, as opposed to 20 to 30 percent). The other parameter values are the same as in the first example. Substituting, you would have

n = {4 x 0.25 x (1-0.25) x 1.75 x 1.1} / {(.03)^2 x .03 x 6}

= 8,912

You can readily see that the stricter tolerance for the margin of error, which the low coverage indicator requires, makes the sample size much larger. This is why, in selecting the key indicator upon which to base your sample size, it is important to identify both the smallest target group and, within that group, the indicator that has the lowest coverage.

The shortcut version for calculating sample sizes for different low coverage indicators is given by:

n = (47,531) r (1 - r), (3)

whenever p, e, f, and nh are fixed at .03, .03, 1.75, and 6, respectively.

The formulas should be used instead of Table 4.10 in Chapter 4 if your low coverage indicator has a value that departs significantly from 25 percent, since the latter is the value that Table 4.10 is based upon.
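The effect of the tighter margin of error in the low coverage example can be sketched by evaluating formula (1) at both tolerances. The helper name `households_needed` is an assumption made for this illustration; its defaults are the fixed values quoted in the text.

```python
def households_needed(r, e, p=0.03, f=1.75, nh=6,
                      confidence_factor=4.0, nonresponse_factor=1.1):
    """Formula (1) with p, f, and nh defaulting to the values in the text."""
    return (confidence_factor * r * (1 - r) * f * nonresponse_factor
            / (e ** 2 * p * nh))


r = 0.25
n_loose = households_needed(r, e=0.05)   # margin of 5 percentage points
n_strict = households_needed(r, e=0.03)  # margin of 3 percentage points

# Constant of the low coverage shortcut, formula (3).
k_low = n_strict / (r * (1 - r))

print(round(n_loose), round(n_strict), round(k_low))  # 3208 8912 47531
```

Because e enters the formula squared, moving from 5 to 3 percentage points multiplies the sample size by (5/3)^2, or about 2.8.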

PROCEDURES FOR SAMPLING WITH PPS - OPTION 2

In this section we give an illustration of how to select the first-stage units using pps. The illustration also shows you how to combine systematic pps sampling with geographic arrangement of the sampling frame to achieve implicit stratification.

For the illustration we take Option 2 from Chapter 4, the standard segment design, and we select a national sample. Suppose (1) the standard segment size under Option 2 is to be 500 persons, or about 100 households; (2) census enumeration areas (EAs) are to be the sample frame; and (3) the number of PSUs to be selected is 300. The steps of the first-stage selection, which follow, should be done as a computer operation, although it is possible to do them manually.

Step 1: Sort the file of EAs by urban and rural.

Step 2: Within the urban category, further sort the file in geographic serpentine order according to the administrative subdivisions of your country (for example, province or state, district, commune, etc.).

Step 3: Repeat Step 2 for the rural category.

Step 4: In one column show the census population count of the EA.

Step 5: In the next column compute the number of standard segments, which is equal to the population count divided by 500, and rounded to the nearest integer. This is the measure of size for the EA.

Step 6: Cumulate the measures of size in the next column.

Step 7: Compute the sampling interval, I, by dividing the total cumulant by 300, to one decimal place. In this illustration suppose the total cumulant is 5,281. Then the sampling interval, I, would be equal to 5,281/300, or 17.6.

Step 8: Select a random start between 0 and 17.6. The way to do this, in practice, is to use a table of random numbers and select a three-digit number between 001 and 176 and insert the decimal afterward. Suppose you select 042; then your random start is 4.2. Then your first sample PSU would be the one for which the cumulant measure of size is the smallest value equal to or greater than 4.2.

Step 9: Add 4.2 to I, or 4.2 + 17.6 = 21.8; then your next sample PSU would be the one whose cumulant corresponds to the smallest value equal to or greater than 21.8.

Step 10: Add 21.8 to I, or 21.8 + 17.6 = 39.4; the next sample PSU is the one with cumulant corresponding to the smallest value equal to or greater than 39.4.

Step 11: Continue as above, through the urban EAs followed by the rural ones, until all 300 PSUs have been selected.
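Steps 4 through 11 can be sketched as a short computer routine. The example below is a minimal illustration under stated assumptions, not a prescribed implementation: the function names, the eight-EA frame, and its population counts are hypothetical, and the frame is assumed to be already sorted as in Steps 1 to 3. The interval and random start are held as whole tenths so that comparisons against the cumulants are exact.

```python
import bisect
import random
from itertools import accumulate


def measure_of_size(population, segment_size=500):
    """Step 5: number of standard segments, rounded to the nearest integer."""
    return int(population / segment_size + 0.5)


def systematic_pps(populations, n_psus, segment_size=500,
                   start_tenths=None, rng=None):
    """Return the positions of the selected EAs and the sampling interval."""
    rng = rng or random.Random()
    sizes = [measure_of_size(pop, segment_size) for pop in populations]
    cumulants = list(accumulate(sizes))                    # Step 6
    # Step 7: interval to one decimal place, stored as integer tenths.
    interval_tenths = round(cumulants[-1] * 10 / n_psus)
    # Step 8: random start, e.g. a number from 001 to 176 read as tenths.
    if start_tenths is None:
        start_tenths = rng.randint(1, interval_tenths)
    cumulants_tenths = [c * 10 for c in cumulants]
    selected = []
    for k in range(n_psus):                                # Steps 9 to 11
        target = start_tenths + k * interval_tenths
        # First EA whose cumulant is equal to or greater than the target.
        idx = bisect.bisect_left(cumulants_tenths, target)
        selected.append(min(idx, len(populations) - 1))
    return selected, interval_tenths / 10


# Hypothetical frame of census population counts, already sorted.
frame = [1630, 480, 1180, 760, 2240, 510, 990, 1490]
selected, interval = systematic_pps(frame, n_psus=3, start_tenths=42)
print(interval, selected)  # 6.0 [2, 4, 7]
```

In this sketch an EA with fewer than 250 persons receives a measure of size of zero and can never be selected, and an EA whose measure of size exceeds the interval could be selected more than once; how to handle either case is left open here.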

The two sample PSUs that are depicted in the illustration are EA 003 of commune 01 and EA 002 of commune 03, both in district 01 of province 01. In the case of the first EA, its measure of size is 3, which means that three segments would have to be created, each of roughly 540 persons (1,630 divided by 3), and then one of the segments would be selected at random for listing and subsampling of households. In the second sample EA, two segments would be created, each containing about 590 persons, before selecting one of them at random.
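The segmentation of a selected EA can be sketched as follows. The function name `segment_plan` is hypothetical, the floor of one segment for very small EAs is an assumption added to avoid dividing by zero, and the population of 1,180 for the second EA is inferred from the two segments of about 590 persons rather than stated in the text.

```python
import random


def segment_plan(population, segment_size=500, rng=None):
    """Number of segments, persons per segment, and the segment chosen."""
    rng = rng or random.Random()
    # Measure of size, rounded to the nearest integer (assumed minimum of 1).
    n_segments = max(1, int(population / segment_size + 0.5))
    persons_per_segment = population / n_segments
    chosen_segment = rng.randint(1, n_segments)  # one segment at random
    return n_segments, persons_per_segment, chosen_segment


# First sample EA: 1,630 persons gives 3 segments of about 543 persons.
n_segments, per_segment, chosen = segment_plan(1630)
print(n_segments, round(per_segment), chosen)

# Second sample EA, with an inferred population of 1,180 persons.
n_segments_2, per_segment_2, chosen_2 = segment_plan(1180)
print(n_segments_2, round(per_segment_2), chosen_2)
```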

The illustration demonstrates the many advantages of implicit stratification. First, it is very easy to achieve, merely requiring that the frame of enumeration areas be sorted geographically before the sample is selected systematically with pps. Second, it automatically provides a sample of PSUs that is proportionately distributed by urban and rural and by province (or other geographic subdivisions). For example, if 10 percent of your population is located in province 12, then 10 percent of your sample will also be selected in that province. Third, it can be easily implemented on the computer.
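The proportional distribution described above can be checked on a toy frame. Everything in this sketch is hypothetical: three provinces labelled A, B, and C hold 60, 30, and 10 percent of the total measure of size, every EA has a measure of size of 2, and the frame is already sorted by province.

```python
import bisect
from collections import Counter
from itertools import accumulate

# Hypothetical sorted frame of (province, measure of size) pairs.
frame = [("A", 2)] * 30 + [("B", 2)] * 15 + [("C", 2)] * 5
cumulants = list(accumulate(size for _, size in frame))  # total is 100

n_psus = 10
interval = cumulants[-1] / n_psus  # 10.0
start = 4.2                        # any start up to the interval would do

# Systematic pps: first cumulant equal to or greater than each target.
picks = [bisect.bisect_left(cumulants, start + k * interval)
         for k in range(n_psus)]
counts = Counter(frame[i][0] for i in picks)
print(dict(counts))  # {'A': 6, 'B': 3, 'C': 1}
```

The sample shares of 60, 30, and 10 percent match the shares of the frame without any explicit allocation by province. With real frames the match is approximate, since provinces rarely contain a whole number of sampling intervals.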