8. Preparing for Analysis
Once data are collected, it is important to edit or ‘clean’ the data.
One should begin by checking that the values recorded for each variable are within the allowable range or set of possible values. If data are typed into a computer by the respondent, an interviewer, or someone else working on the survey, then there is a possibility of data entry errors. Ages under 18 years in a survey of adults, weights and heights of unrealistic sizes, and category indicators that do not correspond to a category used in the survey need to be identified and, if possible, corrected. If it is not feasible to correct errors, then one should delete the errant response so that it is not analyzed with the legitimate data. Ideally, a record should be kept of every edit made to the data set.
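The range checks and edit record described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the variable names, the allowable ranges, and the choice to blank an errant value (rather than correct it or drop the whole record) are all assumptions made for the example, and the values are assumed to be numeric.

```python
import copy

# Hypothetical validation rules: variable name -> test for an allowable value.
# The ranges and category set below are illustrative assumptions only.
RULES = {
    "age": lambda v: 18 <= v <= 110,         # survey of adults
    "height_cm": lambda v: 100 <= v <= 250,  # plausible adult heights
    "region": lambda v: v in {1, 2, 3, 4},   # categories used in the survey
}

def clean(records, rules=RULES):
    """Return (cleaned copy, edit log); errant values are replaced by None."""
    cleaned = copy.deepcopy(records)  # leave the raw data untouched
    log = []
    for row, rec in enumerate(cleaned):
        for var, is_allowed in rules.items():
            value = rec.get(var)
            if value is not None and not is_allowed(value):
                # Record every edit so it can be reviewed or reversed later.
                log.append({"row": row, "variable": var, "old": value, "new": None})
                rec[var] = None
    return cleaned, log

records = [
    {"age": 34, "height_cm": 172, "region": 2},
    {"age": 7, "height_cm": 168, "region": 3},     # age below 18
    {"age": 51, "height_cm": 1720, "region": 9},   # likely typo, unknown category
]
cleaned, edit_log = clean(records)
for entry in edit_log:
    print(entry)
```

Working on a copy and logging the old value alongside the new one keeps the raw file intact and gives the analyst the record of edits that the text recommends.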
Another step in the editing process is to address the issue of missing values in the data set. Some variables will not have values recorded for some respondents because the variables do not pertain to them. For example, it does not make sense to ask nonsmokers about smoking habits or individuals of one gender about health issues that affect only members of the other gender. In the survey instrument, questions that are not relevant to a respondent should be skipped and a “not applicable” code indicating that it was legitimate to skip that question for the respondent should be recorded for such questions. Other values are missing due to accidental skips or a refusal of the respondent to answer. Codes indicating that the response was refused or is unintentionally missing need to be recorded.
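One way to keep the different kinds of missing values distinct is to reserve a separate code for each. The sketch below assumes a hypothetical question on cigarettes smoked per day that is skipped for nonsmokers; the specific code values (-7, -8, -9), the function names, and the use of the string "refused" to mark a refusal are illustrative assumptions, not a standard.

```python
from collections import Counter

# Hypothetical reserved codes; any values outside the valid range would do.
NOT_APPLICABLE = -7   # legitimate skip: the question does not pertain
REFUSED = -8          # respondent declined to answer
MISSING = -9          # accidental skip or otherwise unintentionally missing

LABELS = {
    NOT_APPLICABLE: "not applicable",
    REFUSED: "refused",
    MISSING: "missing",
}

def code_cigs_per_day(smoker, raw_answer):
    """Return the stored value for a 'cigarettes per day' question."""
    if not smoker:
        return NOT_APPLICABLE      # skip pattern: nonsmokers are not asked
    if raw_answer == "refused":
        return REFUSED
    if raw_answer is None:
        return MISSING
    return int(raw_answer)

def summarize(values):
    """Count valid responses and each kind of missing value."""
    return dict(Counter(LABELS.get(v, "valid") for v in values))

coded = [
    code_cigs_per_day(False, None),
    code_cigs_per_day(True, "10"),
    code_cigs_per_day(True, "refused"),
    code_cigs_per_day(True, None),
]
print(coded)
print(summarize(coded))
```

Reserved codes must be excluded before computing means or other statistics; otherwise a value such as -9 would be treated as a legitimate response.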
At this point one may wish to impute, or fill in, plausible values for the missing values. Imputation should be handled with care. Simply filling in the average value from the responding units distorts the distribution of values (for example, it understates variability) and has the potential to dramatically affect statistical analyses. Imputations can be created through statistical modeling and estimation, for example by predicting missing values from an estimated linear regression equation. Alternatively, imputations can be created by randomly selecting donors, that is, respondents with complete data, whose values are donated to the cases with missing values. Methods that use auxiliary variables to match potential donors to cases with missing values can be used to select good donors, since donors that closely match the cases with missing values are likely to yield reasonable imputations. It is always good practice to keep an indicator variable that flags which values in the data set are imputed, so that this information is available when assessing conclusions based on statistical analyses.
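The donor approach can be illustrated with a simple random hot-deck sketch. It assumes that donors are matched exactly on a set of categorical auxiliary variables; the function name `hot_deck`, the variable names, and the data are hypothetical, and practical donor methods often use looser matching rules than exact agreement.

```python
import random

def hot_deck(records, target, match_on, seed=0):
    """Fill missing `target` values from randomly chosen matching donors.

    Donors are respondents with a reported `target` value who agree with
    the incomplete case on every auxiliary variable in `match_on`.
    A flag variable `<target>_imputed` marks which values were filled in.
    """
    rng = random.Random(seed)  # seeded so the imputations are reproducible
    donors = {}
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[v] for v in match_on)
            donors.setdefault(key, []).append(rec[target])

    completed = []
    for rec in records:
        new = dict(rec)
        new[target + "_imputed"] = 0
        if rec[target] is None:
            pool = donors.get(tuple(rec[v] for v in match_on))
            if pool:  # leave the value missing when no donor matches
                new[target] = rng.choice(pool)
                new[target + "_imputed"] = 1
        completed.append(new)
    return completed

records = [
    {"sex": "F", "age_group": "18-34", "income": 42000},
    {"sex": "F", "age_group": "18-34", "income": None},
    {"sex": "M", "age_group": "35-54", "income": 58000},
    {"sex": "M", "age_group": "35-54", "income": 61000},
    {"sex": "M", "age_group": "35-54", "income": None},
    {"sex": "F", "age_group": "55+", "income": None},   # no matching donor
]
completed = hot_deck(records, "income", ["sex", "age_group"])
for rec in completed:
    print(rec)
```

Because each imputed value is an actual reported value from a similar respondent, the imputations stay within the observed range, and the flag variable lets an analyst rerun results with and without the imputed cases.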
A necessary step when preparing data for analysis is to create a clearly documented list of variables and their meanings. For example, if sex is recorded with values of 0 and 1 (or 1 and 2), one needs to know which value means female and which means male. One also needs to know how legitimate skips, missing values, do-not-know responses or refusals, and imputations are indicated. Other information that could be of use to the data analyst should also be recorded. This might include how the respondent was contacted, when the interview took place, who conducted the interview, and who the respondent was (e.g., the target respondent or a proxy reporting on that person's behalf).
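Such a variable list can also be stored in machine-readable form alongside the data. The following is a minimal sketch of a codebook held as a Python dictionary; every variable name, label, and code shown is a hypothetical example rather than a recommended scheme.

```python
# Hypothetical codebook: variable name -> description and value labels.
CODEBOOK = {
    "sex": {
        "label": "Sex of respondent",
        "values": {1: "male", 2: "female"},
    },
    "cigs_per_day": {
        "label": "Cigarettes smoked per day (smokers only)",
        "values": {
            -7: "not applicable (legitimate skip)",
            -8: "refused",
            -9: "missing",
        },
    },
    "income_imputed": {
        "label": "Flag: income value was imputed",
        "values": {0: "reported", 1: "imputed"},
    },
    "respondent_type": {
        "label": "Who answered the interview",
        "values": {1: "target respondent", 2: "proxy"},
    },
}

def decode(variable, value, codebook=CODEBOOK):
    """Return the documented label for a code, or the value itself if none."""
    return codebook[variable]["values"].get(value, value)

print(decode("sex", 2))
print(decode("cigs_per_day", -7))
print(decode("cigs_per_day", 12))
```

Keeping the codebook in one structure means that labels used in tables and reports come from the same source as the documentation, which reduces the chance of a code being misread.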