Russia Longitudinal Monitoring Survey - Higher School of Economics 2004
Other Household Survey [hh/oth]
The Russia Longitudinal Monitoring Survey (RLMS) is a series of nationally representative surveys designed to monitor the effects of Russian reforms on the health and economic welfare of households and individuals in the Russian Federation. These effects are measured by a variety of means: detailed monitoring of individuals' health status and dietary intake, precise measurement of household-level expenditures and service utilization, and collection of relevant community-level data, including region-specific prices and community infrastructure data.
Data for RLMS have been collected since 1992. Since 1994, the team has collected a new round of data almost every year in the second phase of the project.
The Russia Longitudinal Monitoring Survey (RLMS) is a household-based survey designed to measure the effects of Russian reforms on the economic well-being of households and individuals. In particular, determining the impact of reforms on household consumption and individual health is essential, as most of the subsidies provided to protect food production and health care have been or will be reduced, eliminated, or at least dramatically changed. These effects are measured by a variety of means: detailed monitoring of individuals' health status and dietary intake, precise measurement of household-level expenditures and service utilization, and collection of relevant community-level data, including region-specific prices and community infrastructure data. Data have been collected since 1992.
The repeated cross-section design is far and away the simplest alternative for the RLMS. The sampling is cost efficient, easy to maintain, and easy to update when needed. The design supports both efficient cross-sectional and aggregate longitudinal analyses of change in the Russian household population. Updates to the sample, including a full replenishment of the probability sample of dwelling units, will not seriously disrupt the longitudinal data series.
Kind of Data
Sample survey data [ssd]
Unit of Analysis
Households and individuals.
The scope of the study includes:
- Use of time;
- Health status;
- Medical services;
- Child care;
- Family information;
- Housing conditions;
- Living conditions;
- Transportation and related information;
- Local municipal and other services;
- Cost of food products;
- Farming and animal husbandry.
Producers and sponsors
National Research University Higher School of Economics
Carolina Population Center
University of North Carolina at Chapel Hill
Institute of Sociology RAS
National Research University Higher School of Economics
US National Institutes of Health
In Phase II (Rounds V - XX) of the RLMS, a multi-stage probability sample was employed. Please refer to the March 1997 review of the Phase II sample. First, a list of 2,029 consolidated regions was created to serve as PSUs. These were allocated into 38 strata based largely on geographical factors and level of urbanization but also based on ethnicity where there was salient variability. As in many national surveys involving face-to-face interviews, some remote areas were eliminated to contain costs; also, Chechnya was eliminated because of armed conflict. From among the remaining 1,850 regions (containing 95.6 percent of the population), three very large population units were selected with certainty: Moscow city, Moscow Oblast, and St. Petersburg city constituted self-representing (SR) strata. The remaining non-self-representing regions (NSR) were allocated to 35 equal-sized strata. One region was then selected from each NSR stratum using the method "probability proportional to size" (PPS). That is, the probability that a region in a given NSR stratum was selected was directly proportional to its measure of population size.
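The PPS draw of one region per NSR stratum can be sketched as follows. This is a minimal illustration, not the procedure actually used by the RLMS samplers: the region names, population sizes, and random seed are hypothetical, and the cumulative-total method shown is just one common way to implement a single PPS draw.

```python
import random

def pps_select_one(regions, rng):
    """Pick one region with probability proportional to its population size.

    regions: list of (name, population) pairs.
    Returns the selected name and its selection probability.
    """
    total = sum(size for _, size in regions)
    threshold = rng.random() * total  # uniform point on [0, total)
    running = 0.0
    for name, size in regions:
        running += size
        if threshold < running:
            return name, size / total
    # Guard against floating-point edge cases: fall back to the last region
    name, size = regions[-1]
    return name, size / total

# Hypothetical stratum with three regions (names and sizes are made up)
stratum = [("region_a", 500_000), ("region_b", 300_000), ("region_c", 200_000)]
rng = random.Random(2004)
chosen, prob = pps_select_one(stratum, rng)
print(chosen, round(prob, 3))
```

The returned probability is the quantity whose reciprocal would later enter the analysis weight for households in the selected region.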
The NSR strata were designed to have approximately equal sizes to improve the efficiency of estimates. The target population (omitting the deliberate exclusions described above) totaled over 140 million inhabitants. Ideally, one would use the population of eligible households, not the population of individuals. As is often the case, we were obliged to use figures on the population of individuals as a surrogate because of the unavailability of household figures in various regions.
Although the target sample size was set at 4,000, the number of households drawn into the sample was inflated to 4,718 to allow for a nonresponse rate of approximately 15 percent. The number of households drawn from each of the NSR strata was approximately equal (averaging 108), since the strata were of approximately equal size and PPS was employed to draw the PSUs in each one. However, because response rates were expected to be higher in urban areas than in rural areas, the extent of over-sampling varied. This variation accounted for the differences in households drawn across the NSR PSUs. It also accounted for the fact that 940 households were drawn in the three SR strata--more than the 14.6 percent (i.e. 689) that would have been allotted based on strict proportionality.
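The arithmetic behind these figures can be checked with a short calculation. The numbers come from the paragraph above; the assumption that the inflation uses a single flat 15 percent nonresponse rate is a simplification, since the text notes that over-sampling actually varied by area.

```python
import math

target_households = 4000      # target number of responding households
expected_nonresponse = 0.15   # approximate nonresponse rate from the text
drawn_households = 4718       # households actually drawn into the sample

# Households needed if exactly 15 percent fail to respond (flat-rate assumption)
needed_flat_rate = math.ceil(target_households / (1 - expected_nonresponse))

# Strictly proportional share of the three SR strata: 14.6 percent of the draw
sr_proportional = round(0.146 * drawn_households)
sr_drawn = 940
sr_excess = sr_drawn - sr_proportional

print(needed_flat_rate, sr_proportional, sr_excess)
```

Under the flat-rate assumption the required draw is slightly below the 4,718 households actually drawn, which is consistent with the extent of over-sampling differing across PSUs.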
Since there was no consolidated list of households or dwellings in any of the 38 selected PSUs, an intermediate stage of selection was then introduced, as usual. Professional samplers will recognize that this is actually the first stage of selection in the three SR strata, since those units were selected with certainty. That is, technically, in Moscow, St. Petersburg, and Moscow oblast, the census enumeration districts were the PSUs. However, it was cumbersome to keep making this distinction throughout the description, and researchers followed the normal practice of using the terms "PSU" and "SSU" loosely. Needless to say, in the calculation of design effects, where the distinction is critical, the proper distinction was maintained. The selection of second-stage units (SSUs) differed depending on whether the population was urban (located in cities and "villages of the city type," known as "PGTs") or rural (located in villages). That is, within each selected PSU the population was stratified into urban and rural substrata, and the target sample size was allocated proportionately to the two substrata. For example, if 40 percent of the population in a given region was rural, 40 of the 100 households allotted to the stratum were drawn from villages.
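The proportional allocation in the example above can be written out as a small sketch. The function name is hypothetical, and the rule of one SSU for each 10 households is taken from the description of the rural and urban substrata elsewhere in this section.

```python
def allocate_substrata(households, rural_share):
    """Split a PSU's household allotment proportionately between rural and urban."""
    rural = round(households * rural_share)
    return {"rural": rural, "urban": households - rural}

# The worked example from the text: 40 percent rural, 100 households allotted
allocation = allocate_substrata(100, 0.40)

# One SSU is selected for each 10 households allocated to a substratum
ssus_needed = {name: count // 10 for name, count in allocation.items()}

print(allocation, ssus_needed)
```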
In rural areas of the selected PSUs, a list of all villages was compiled to serve as SSUs. The list was ordered by size and (where salient) by ethnic composition. PPS was employed to select one village for each 10 households allocated to the rural substratum. Again, under the standard principles of PPS, once the required number of villages was selected, an equal number of households in the sample (10) were allocated to each village. Since villages maintain very reliable lists of households, in each selected village the 10 households were selected systematically from the household list. In a few cases, villages were judged to be too small to sustain independent interviews with 10 households; in such cases, three or four tiny villages were treated as a single SSU for sampling purposes.
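Systematic selection of 10 households from a village list might look like the sketch below. The household identifiers, list length, and seed are hypothetical, and a fractional sampling interval with a random start is assumed; the source does not specify how the interval was handled.

```python
import random

def systematic_sample(units, n, rng):
    """Select n units at a fixed interval after a random start."""
    if n > len(units):
        raise ValueError("cannot select more units than the list contains")
    step = len(units) / n              # sampling interval (may be fractional)
    start = rng.random() * step        # random start in [0, step)
    last = len(units) - 1
    # min() guards against a floating-point overshoot past the end of the list
    return [units[min(int(start + i * step), last)] for i in range(n)]

# Hypothetical village household list with 137 entries
household_list = [f"hh_{i:03d}" for i in range(1, 138)]
rng = random.Random(7)
selected_households = systematic_sample(household_list, 10, rng)
print(selected_households)
```

Because the interval is at least one list position, the selected households are distinct and spread evenly across the ordered list.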
In urban areas, SSUs were defined by the boundaries of 1989 census enumeration districts, if possible. If the necessary information was not available, 1994 microcensus enumeration districts, voting districts, or residential postal zones were employed--in decreasing order of preference. Since census enumeration districts were originally designed to be roughly equal in population size, one district was selected systematically without using PPS for each 10 households required in the sample. In the few cases where postal zones were used, one zone was likewise selected systematically for each 10 households. However, where voting districts were used, to compensate for the marked variation in population size, PPS was employed to select one voting district for each 10 households required in the urban sub-stratum.
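Where unit sizes vary markedly, as with voting districts, several units can be drawn at once with systematic PPS. The sketch below is one standard way to do this and is an assumption, since the source does not say which PPS algorithm was applied; the district names and sizes are hypothetical.

```python
import random

def systematic_pps(units, n, rng):
    """Select n units by systematic PPS over cumulative sizes.

    units: list of (name, size) pairs. A unit larger than the sampling
    interval can be selected more than once.
    """
    total = sum(size for _, size in units)
    interval = total / n
    start = rng.random() * interval
    points = [start + i * interval for i in range(n)]
    selected = []
    running = 0.0
    idx = 0
    for name, size in units:
        running += size
        # Take every selection point that falls inside this unit's size range
        while idx < n and points[idx] < running:
            selected.append(name)
            idx += 1
    # Floating-point guard: assign any leftover point to the last unit
    while len(selected) < n:
        selected.append(units[-1][0])
    return selected

# Hypothetical voting districts with uneven populations
districts = [("vd_1", 1200), ("vd_2", 4500), ("vd_3", 800),
             ("vd_4", 3000), ("vd_5", 2500), ("vd_6", 1000)]
households_required = 30
rng = random.Random(11)
chosen_districts = systematic_pps(districts, households_required // 10, rng)
print(chosen_districts)
```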
In both urban and rural substrata, interviewers were required to visit each selected dwelling up to three times to secure the interviews. They were not allowed to make substitutions of any sort. The interviewers' first task was to identify households at the designated dwellings. "Household" was defined as a group of people who live together in a given domicile, and who share common income and expenditures. Households were also defined to include unmarried children, 18 years of age or younger, who were temporarily residing outside the domicile at the time of the survey. If perchance the interviewer identified more than one household in the dwelling, he or she was obliged to select one using a procedure outlined in the technical report. The interviewer then administered a household questionnaire to the most knowledgeable and willing member of the household.
The interviewer then conducted interviews with as many adults as possible, acquiring data about their individual activities and health. Data for the children's questionnaires were obtained from adults in the household. By virtue of the fact that an attempt was made to obtain individual questionnaires for all members of households, the sample constitutes a proper probability sample of individuals as well as of households, without any special weighting. Actually, the fact that we did not interview unmarried minors living temporarily outside the domicile slightly diminished the representativeness of the sample of individuals in that age group.
The multivariate distribution of the sample by sex, age, and urban-rural location compared quite well with the corresponding multivariate distribution of the 1989 census. Of course, because of random sampling error and changes in the distribution since the 1989 census, we did not expect perfect correspondence. Nevertheless, there was usually a difference of only one percentage point or less between the two distributions.
Another way to evaluate the adequacy (or efficiency) of the sample was to examine design effects. An important factor in determining the precision of estimates in multi-stage samples was the mean ultimate cluster (PSU) size. All else being equal, the larger the cluster size, the less precise the estimate. In Rounds I through IV of the RLMS, the average cluster size approached 360--a large number dictated by constraints imposed by our collaborators. Thus, although the sample size covered around 6,000 households, precision was less than we would have liked for a sample of that size. In Rounds I and III of the RLMS, the 95 percent confidence interval for household income was about ±13 percent.
In the Phase II (Rounds V - XX) sample, the situation was considerably better. Although there were only 4,000 households, the mean size of clusters was much smaller than in Phase I. There were 35 PSUs with about 100 households each; even this result was an improvement over the average of 360 in the design of the RLMS Rounds I through IV. However, in the three self-representing areas, the respondents were drawn from 61 PSUs. Recall that Moscow city and oblast, as well as St. Petersburg city, were not sampled but were chosen with certainty. Therefore, the first stage of selection in them was the selection of census enumeration districts. Thus the mean cluster size in the entire sample was about 42, i.e., 4,000/(35+61). Given these much smaller cluster sizes, researchers had reason to expect that precision in this survey would be as good as it was in Rounds I through IV despite the smaller sample size, and this expectation, in fact, turned out to be the case in Rounds V through XIII.
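The effect of cluster size on precision can be illustrated with the common Kish approximation, deff = 1 + (m - 1) * rho, where m is the mean cluster size and rho is the intraclass correlation. The formula is a standard textbook approximation rather than something stated in the source, and the value of rho below is purely hypothetical, not an RLMS estimate.

```python
def design_effect(cluster_size, rho):
    """Kish approximation of the design effect for a clustered sample."""
    return 1 + (cluster_size - 1) * rho

# Figures from the text
phase1_households, phase1_cluster = 6000, 360
phase2_households = 4000
phase2_cluster = phase2_households / (35 + 61)   # 35 NSR PSUs + 61 SR PSUs

rho = 0.05  # hypothetical intraclass correlation (assumption for illustration)

deff_phase1 = design_effect(phase1_cluster, rho)
deff_phase2 = design_effect(phase2_cluster, rho)

# Effective sample size: nominal size divided by the design effect
effective_n_phase1 = phase1_households / deff_phase1
effective_n_phase2 = phase2_households / deff_phase2

print(round(phase2_cluster, 1), round(deff_phase1, 2), round(deff_phase2, 2))
print(round(effective_n_phase1), round(effective_n_phase2))
```

Under this illustrative value of rho, the smaller clusters more than offset the smaller nominal sample; the actual gain depends on the intraclass correlation of the variable being estimated.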
Weights in Descriptive Analysis of RLMS Data.
Analysis weights are essential for unbiased sample-based estimation of RLMS descriptive statistics such as population and subclass means, proportions, and totals. The construction of a descriptive weight for cross-sectional analysis involves a simple sequence of steps:
(1) determine the probability of selection for each sample household;
(2) based on geographic and other known characteristics of sample households, compute an adjustment for nonresponding sample households;
(3) compute a nonresponse-adjusted weight as the product of the reciprocal of the sample selection probability and the nonresponse adjustment.
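The three steps above can be sketched as follows. The adjustment used here, the ratio of selected to responding households within an adjustment cell, is one common choice and is an assumption; the source does not state the exact form of the RLMS adjustment. All probabilities, cells, and counts are hypothetical.

```python
def nonresponse_adjusted_weight(selection_prob, selected, responded):
    """Weight = (1 / selection probability) * (selected / responded) in the cell."""
    if not 0 < selection_prob <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    if responded <= 0:
        raise ValueError("adjustment cell has no responding households")
    base_weight = 1.0 / selection_prob          # step 1: reciprocal of selection probability
    adjustment = selected / responded           # step 2: nonresponse adjustment for the cell
    return base_weight * adjustment             # step 3: product of the two

# Hypothetical adjustment cells: (households selected, households responding)
cells = {"urban": (120, 90), "rural": (80, 72)}

# Hypothetical responding households with their selection probabilities
sample = [
    {"id": "hh_1", "prob": 0.0005, "cell": "urban"},
    {"id": "hh_2", "prob": 0.0004, "cell": "rural"},
]

weights = {
    h["id"]: nonresponse_adjusted_weight(h["prob"], *cells[h["cell"]])
    for h in sample
}
print(weights)
```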
Since the RLMS attempts to interview all individuals within sample households, the selection probability for an individual equals that for his household. An individual in a cooperating household may, however, choose not to give an interview. If data on individuals--both cooperating and not--are known from household listings, the nonresponse adjustment factor in the analysis weight can be computed at the level of the individual. Fortunately, the majority of RLMS nonresponse at the individual level corresponds to noncooperation by the entire household, and the household nonresponse adjustment factor will capture most of the sample attrition loss at both levels.
If recent census data on households and individuals are available, a fourth post-stratification step can be added: scaling analysis weights so that the sum of weights for a defined subpopulation matches the corresponding census proportion (e.g., the weighted sample proportion of females, age 45 and older, in the Moscow/St. Petersburg region matches the corresponding proportion from the most recent census). The post-stratification of analysis weights serves two functions:
(1) it can reduce the sampling variance of weighted estimates; more importantly
(2) it may correct noncoverage biases in the frame used to derive the original sample of dwellings and individuals.
RLMS data sets contain post-stratification weights - weights that adjust not only for design factors but also for deviations from the census characteristics. For households, we have produced post-stratification weights that fit our data to the known distribution of household size and location of residence (urban or rural). For individuals, our weights fit our data to the multivariate distribution of location, age, and gender. Of course, depending on the subject of one's analysis, it might be appropriate to compute post-stratification weights that adjust to other variables, and all analysts are free to compute their own.
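Analysts who compute their own post-stratification weights could follow a sketch like the one below, which rescales weights so that each cell's weighted share matches a census proportion. The cells, records, and census shares are hypothetical, and a single classification variable is used for brevity; the RLMS weights themselves are fitted to multivariate distributions.

```python
def poststratify(records, census_share):
    """Rescale weights so each cell's weighted share equals its census share.

    The total of the weights is preserved when the census shares sum to one.
    """
    total_weight = sum(r["weight"] for r in records)
    cell_weight = {}
    for r in records:
        cell_weight[r["cell"]] = cell_weight.get(r["cell"], 0.0) + r["weight"]
    adjusted = []
    for r in records:
        factor = census_share[r["cell"]] * total_weight / cell_weight[r["cell"]]
        adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted

# Hypothetical sample: urban households are over-represented relative to the census
records = [
    {"id": 1, "cell": "urban", "weight": 1.0},
    {"id": 2, "cell": "urban", "weight": 1.0},
    {"id": 3, "cell": "urban", "weight": 1.0},
    {"id": 4, "cell": "urban", "weight": 1.0},
    {"id": 5, "cell": "rural", "weight": 1.0},
]
census_share = {"urban": 0.73, "rural": 0.27}  # made-up proportions

adjusted = poststratify(records, census_share)
urban_share = (sum(r["weight"] for r in adjusted if r["cell"] == "urban")
               / sum(r["weight"] for r in adjusted))
print(round(urban_share, 2))
```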
There is considerable debate over the value of using weights in multivariate analysis. For example, in estimating linear or generalized linear models, many software programs allow the specification of weights for model fitting. Some statisticians argue that using weights is not necessary if the fixed effects that explain the variation in weights are included in the model. In RLMS data, the household characteristics that explain the greatest variation in weights are the geographic region and the urban/rural character of the civil division in which the dwelling is located. Variation in individual weights will reflect the geographic effects for households as well as differentials due to post-stratification of the sample by major geographic regions, age, and sex. Researchers who are interested in exploring the impact of RLMS weights on a multivariate analysis should consider the following test. Fit the model omitting the weights but including as fixed effects the household (region, urban/rural) or individual (region, urban/rural, age, and sex) characteristics. Without changing the specification, also estimate the model using the analysis weights. Compare the results to see if there are important differences in model parameters and/or interpretation. Differences in the unweighted and weighted versions could be due to added sampling variability introduced by the weighted estimation or could indicate that the model is not correctly specified.
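The comparison described above can be sketched with a small weighted least squares routine. The data are synthetic and noise-free, the weights are hypothetical, and the solver is a plain Gauss-Jordan elimination written for illustration; in practice one would use a statistical package. With the urban/rural indicator included, weighted and unweighted fits agree; when it is omitted, they diverge, which is the signal of misspecification the test looks for.

```python
def weighted_least_squares(X, y, w):
    """Solve the weighted normal equations (X'WX) b = X'Wy by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(wi * row[r] * row[c] for row, wi in zip(X, w)) for c in range(k)]
         for r in range(k)]
    b = [sum(wi * row[r] * yi for row, yi, wi in zip(X, y, w)) for r in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(k)]

# Synthetic data: outcome depends on a covariate x and an urban indicator
rows = [(x, urban) for x in range(1, 9) for urban in (0, 1)]
y = [2.0 + 3.0 * x + 1.5 * urban for x, urban in rows]
ones = [1.0] * len(rows)
weights = [1.0 if urban else 2.5 for _, urban in rows]  # hypothetical analysis weights

# Full specification: intercept, x, and the urban fixed effect
X_full = [[1.0, float(x), float(urban)] for x, urban in rows]
full_unweighted = weighted_least_squares(X_full, y, ones)
full_weighted = weighted_least_squares(X_full, y, weights)

# Reduced specification: the urban fixed effect is omitted
X_reduced = [[1.0, float(x)] for x, _ in rows]
reduced_unweighted = weighted_least_squares(X_reduced, y, ones)
reduced_weighted = weighted_least_squares(X_reduced, y, weights)

print([round(c, 3) for c in full_unweighted], [round(c, 3) for c in full_weighted])
print([round(c, 3) for c in reduced_unweighted], [round(c, 3) for c in reduced_weighted])
```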
At each round, the data contain some households with a sampling weight of zero (0). This value was assigned to households that moved out of the sample area between rounds. They were located and interviewed to provide a group of respondents for longitudinal analyses. They were assigned weights of zero to keep analysts from inadvertently including them in cross-sectional analyses that are intended to be representative of Russia. (Only respondents with non-zero weights are part of the representative sample.)
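In practice, this means filtering on the weight before any cross-sectional estimate. The sketch below uses hypothetical record and field names; the actual variable names should be taken from the questionnaires, which serve as the codebooks.

```python
# Hypothetical household records; "weight" stands in for the sampling weight variable
households = [
    {"id": 101, "weight": 1.12, "income": 5400.0},
    {"id": 102, "weight": 0.0, "income": 9100.0},   # moved out of the sample area
    {"id": 103, "weight": 0.87, "income": 3900.0},
]

# Cross-sectional analyses: keep only the representative sample (non-zero weights)
cross_section = [h for h in households if h["weight"] > 0]

# Longitudinal analyses may use every followed household, including movers
longitudinal = list(households)

# Weighted mean over the representative sample only
weighted_mean_income = (sum(h["weight"] * h["income"] for h in cross_section)
                        / sum(h["weight"] for h in cross_section))
print(len(cross_section), len(longitudinal), round(weighted_mean_income, 1))
```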
Dates of Data Collection
Data Collection Mode
The questionnaires are English-language translations of the original Russian questionnaires. The English versions have been translated as literally as possible. The order of the questions and the layout of the pages have been preserved in the English versions.
The questionnaires are also designed to function as codebooks. The variable names, as they appear in the data sets, are usually listed below or to the left of the questions. If the abbreviation (char) appears with a variable name, then the responses to that question are stored in a character variable. If there is no variable name associated with a particular question, then the responses to that question do not appear in the data set. Some questions in the questionnaires are color coded. Pink means that the question was added. Green indicates changes from the previous round (e.g., year). Gray means that the questions were asked, but the data are not available for public use - the questions were added at the request of the Pension Office and are for their use only.
In Phase II (Rounds V - XX), when questionnaires were returned to local supervisors, those supervisors were required to examine them to locate problems that could best be remedied in the field, e.g., by returning to get key demographic information or cleaning ID numbers so that the roster of individuals located in the household questionnaire matched those on the individual questionnaires from that household. The questionnaires were then transported to Moscow, where yet another ID check was performed.
In Moscow, coders looked through all questionnaires to code so-called "other: specify" responses. However, open-ended questions (e.g., occupation questions) were not coded at this time. Instead, their texts were fully entered as long string variables. Entering the open-ended answers as character variables offered several advantages. First, it allowed data entry to begin immediately, with no delay for coding. Second, it permitted the use of computer programs to assist in coding the string variables. Third, the method allowed any user of the original data sets to recode the character variables to suit his or her purposes without going back to the paper copies of the questionnaires. All data entry was handled in-house using the SPSS data entry program on PCs.
Source: "Russia Longitudinal Monitoring survey, RLMS-HSE", conducted by Higher School of Economics and ZAO "Demoscope" together with Carolina Population Center, University of North Carolina at Chapel Hill and the Institute of Sociology RAS.
RLMS-HSE sites: http://www.cpc.unc.edu/projects/rlms-hse, http://www.hse.ru/org/hse/rlms.
Location of Data Collection
Carolina Population Center, the University of North Carolina at Chapel Hill
Archive where study is originally stored
Carolina Population Center, the University of North Carolina at Chapel Hill
Disclaimer and copyrights
The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.