Data Editing
Once the data had been entered into the computerised database, the content of each data item for the different respondents (persons, households and/or islands) could be checked in detail for consistency. A number of the adjustments carried out are described in the following paragraphs to illustrate their range and scope. The purpose of the data-cleaning operation was to ensure that all information used in the analysis was logically correct and acceptable. Where the information was obviously faulty, and the correct value could not be inferred, the data were generally treated as non-response.
Errors in the data emanated from different sources. The most common problem was that enumerators did not always properly understand or code the information they obtained during household visits. This was the case with household consumption, where the quantities noted down were sometimes inconsistent with the value and the standard unit of measurement. For instance, for a 50 kg bag of rice, the price would have been entered correctly but the quantity given as 1 (bag) rather than 50 (kg) as needed for data consistency. Errors made during the coding process, which resulted in the wrong allocation of the information, were far less frequent.
The data-entry process generated two distinct types of errors. Firstly, there were data transcription (keypunching) errors, which form a normal part of the work and are controlled to the extent possible through edit checks in the data entry programmes. The second type of error related to the structure of the data entry system. Information was sometimes missing in some parts of the questionnaires while other parts, logically linked to those sections, contained data. In some cases, information supplied in different sections was contradictory. This sometimes caused the data entry programme to stop and wait for consistent information. When this was not available, dummy information was entered so that data entry could continue. Such problems should have been, and mostly were, captured in the field, but in cases where they had slipped through, remedial action was necessary after data entry.
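By way of illustration only, the inconsistency in the rice example above (a quantity of 1 bag entered where 50 kg was expected) could be detected by comparing the implied price per unit with a reference price. The following Python sketch is not the procedure used in the survey; the function name, the reference price and the tolerance factor are hypothetical.

```python
def flag_quantity_unit_mismatch(value, quantity, reference_price_per_kg, tolerance=5.0):
    """Return True when value / quantity is implausible against a reference price.

    `reference_price_per_kg` and `tolerance` are illustrative assumptions,
    not figures taken from the survey.
    """
    if quantity <= 0:
        # A zero or negative quantity cannot be reconciled with a positive value.
        return True
    ratio = (value / quantity) / reference_price_per_kg
    # Flag records whose implied unit price is far above or far below the reference.
    return ratio > tolerance or ratio < 1.0 / tolerance


# Hypothetical figures: a 50 kg bag of rice costing 250 currency units,
# with a reference price of 5 per kg.
entered_as_bag = flag_quantity_unit_mismatch(250.0, 1, 5.0)   # quantity coded as 1 (bag)
entered_as_kg = flag_quantity_unit_mismatch(250.0, 50, 5.0)   # quantity coded as 50 (kg)
```

In this sketch the record coded as 1 bag is flagged for review, while the record coded as 50 kg passes.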
The following examples illustrate the cleaning process. The area available per person was calculated from the area of the house and the number of household members. If the result was less than five square feet per person, the information on the size of the house was deleted and the household treated as a non-respondent for this question.
The island questionnaire contained a question relating to the number of trips to the atoll capital or Male’ by dhoni (vessel). There were four possible answers, namely: 'daily', the number of times per week, the number of times per month, and 'frequently'. During cleaning, all data were converted to the number of times per month. Because the focus was on islands from which there is infrequent dhoni traffic to the atoll capital or Male’, the answer 'frequently' was treated as being above the limit set, on a par with 'daily'.
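The two rules just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the programme actually used. The five square feet threshold comes from the text, whereas the conversion factors (30 days and 30/7 weeks per month), the function names and the answer labels are assumptions.

```python
MIN_AREA_SQFT_PER_PERSON = 5.0   # threshold stated in the text
DAYS_PER_MONTH = 30.0            # assumption: conversion factor for 'daily'
WEEKS_PER_MONTH = 30.0 / 7.0     # assumption: conversion factor for weekly counts


def clean_house_area(area_sqft, household_size):
    """Return the house area, or None (non-response) if it is implausibly small."""
    if area_sqft is None or household_size <= 0:
        return None
    if area_sqft / household_size < MIN_AREA_SQFT_PER_PERSON:
        return None
    return area_sqft


def trips_per_month(answer_type, count=None):
    """Convert a dhoni-trip answer to a number of trips per month."""
    if answer_type in ("daily", "frequently"):
        # 'frequently' is placed on a par with 'daily', as described in the text.
        return DAYS_PER_MONTH
    if answer_type == "weekly":
        return count * WEEKS_PER_MONTH
    if answer_type == "monthly":
        return float(count)
    raise ValueError(f"unknown answer type: {answer_type!r}")
```

For example, a house of 20 square feet reported for five members gives four square feet per person, so `clean_house_area(20, 5)` returns `None` and the item is treated as non-response.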
Another question on the island questionnaire asked for the distance to the nearest public telephone on another island. If the island itself had a public telephone, this question was not to be answered. In cases where this information was nevertheless given, it was deleted. Where it was unclear whether the distance had been given in hours, as requested, or in minutes, the distance was checked on the map and corrected as necessary.
For children between one and five years of age, the arm circumference, height and weight were measured. These measurements should have been in centimetres and kilograms. In a number of cases, measurements were clearly reported in inches and adjustments were made accordingly. In other instances, when there was no obvious correction procedure (children reportedly taller than 1.2 metres or less than three kilos in weight, for instance), the information was deleted and treated as non-response.
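A minimal Python sketch of this rule is given below for illustration. The limits of 1.2 metres and three kilograms are those mentioned in the text. The text does not say how measurements in inches were recognised, so the sketch assumes a hypothetical flag set by the editor; the function name and the flag are not part of the original procedure.

```python
CM_PER_INCH = 2.54
MAX_HEIGHT_CM = 120.0   # children reportedly taller than 1.2 metres
MIN_WEIGHT_KG = 3.0     # children reportedly weighing less than three kilos


def clean_child_record(height, weight_kg, height_in_inches=False):
    """Return cleaned (height_cm, weight_kg); None marks non-response.

    `height_in_inches` is a hypothetical flag set when a measurement was
    clearly reported in inches.
    """
    if height is not None and height_in_inches:
        height = height * CM_PER_INCH
    # Implausible values with no obvious correction are deleted.
    if height is not None and height > MAX_HEIGHT_CM:
        height = None
    if weight_kg is not None and weight_kg < MIN_WEIGHT_KG:
        weight_kg = None
    return height, weight_kg
```

A height of 35 recorded in inches would thus be converted to roughly 88.9 centimetres, while a reported height of 130 centimetres or a weight of 2.5 kilograms would be set to non-response.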
An important issue in the analysis of the data is the treatment of non-response. Even though data for a household, its members or the island may be available in general, it is often not available for all the items of information studied. In some cases, this is due to the lack of information in the original questionnaire, while in other cases it may be due to corrections applied in the data editing and cleaning steps described earlier. In all instances, missing information has in principle been treated in the same way.
The basic assumption is that non-response is unbiased; in other words, that the behaviour of non-respondents is the same as that of respondents. This assumption makes it possible to formulate consistent procedures for dealing with data gaps. In the simplest case and at the lowest level of aggregation, this means that the percentage of respondents giving a certain answer is also (assumed to be) the percentage of the total population at that level of aggregation giving that answer. As rates of non-response differ from one area to another, as well as from one data item to another, these percentages have to be converted to the number of persons represented. This allows for the aggregation of the information so that the percentage distributions and average values at higher levels (island, atoll and overall) can be calculated.
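The conversion from percentages of respondents to numbers of persons represented, and the subsequent aggregation, can be sketched in Python as follows. This is an illustration under the unbiased non-response assumption stated above. The function names and the figures in the example are hypothetical, and leaving out units with no respondents at all is an assumption of the sketch.

```python
def persons_represented(population, respondents, positive):
    """Scale the share of respondents giving an answer up to the whole population."""
    if respondents == 0:
        return None  # no basis for an estimate at this level
    return population * positive / respondents


def aggregate_share(units):
    """Combine (population, respondents, positive) tuples into one overall share."""
    total_population = 0
    total_positive = 0.0
    for population, respondents, positive in units:
        estimate = persons_represented(population, respondents, positive)
        if estimate is None:
            continue  # assumption: units without respondents are left out
        total_population += population
        total_positive += estimate
    return total_positive / total_population


# Hypothetical islands: (population, respondents, respondents giving the answer)
islands = [(200, 150, 60), (100, 50, 10)]
island_a = persons_represented(200, 150, 60)   # 40% of 200 persons -> 80.0
island_b = persons_represented(100, 50, 10)    # 20% of 100 persons -> 20.0
overall = aggregate_share(islands)             # 100 of 300 persons, about 0.333
```

Pooling the respondents directly would instead give 70 out of 200, or 35 per cent, because the island with the higher response rate would carry too much weight. Converting to persons represented first avoids this.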
While this procedure is straightforward in principle, it becomes rather complicated when it involves a number of steps which all have their own non-response. This occurs for instance in the calculation of employment, unemployment and underemployment, which are calculated on the basis of the particular responses to a series of questions on the activities of each person above the age of 11 years.
Overall, not only was the sample size large, but major efforts were also made to ensure that the data met qualitative criteria, as reflected in the use of consistency and other checks.