

Guidelines for Data Set Documentation

We handle and archive numerous data sets from different studies on our Sun server. To ensure the best results for your study, please adhere to our general guidelines for documenting data sets. We appreciate your cooperation!

1. Organization

Typically, data sets are sent to us in the form of a matrix, that is, a table organized into rows and columns.

Creating Workable Data Sets

If the study has data consisting of V variables recorded for each of N patients, then the data set would have N rows, one row per patient, and V columns, one column per variable. If a person has the same data recorded at several different visits, then it's usually easiest for us if the data set is organized to have a separate row for each visit, with the variables including the number and date of the visit as well as the patient ID.
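As an illustration of the one-row-per-visit layout described above, here is a minimal Python sketch that reshapes records holding one column per visit into one record per visit. The variable names (patid, visitnum, visitdate, weight) and the sample values are hypothetical, not taken from any study.

```python
# Hypothetical wide-format records: one row per patient,
# with a separate date and weight column for each visit.
wide_rows = [
    {"patid": "001", "visit1date": "2014-01-10", "weight1": 70.5,
     "visit2date": "2014-07-12", "weight2": 71.0},
    {"patid": "002", "visit1date": "2014-02-03", "weight1": 82.3,
     "visit2date": "2014-08-05", "weight2": 80.9},
]


def to_long(rows, n_visits):
    """Return one record per patient visit, carrying the patient ID,
    the visit number, and the visit date on every record."""
    long_rows = []
    for row in rows:
        for visit in range(1, n_visits + 1):
            long_rows.append({
                "patid": row["patid"],
                "visitnum": visit,
                "visitdate": row[f"visit{visit}date"],
                "weight": row[f"weight{visit}"],
            })
    return long_rows


long_rows = to_long(wide_rows, n_visits=2)
for record in long_rows:
    print(record)
```

Two patients with two visits each yield four rows, and each row can be matched back to its patient and visit without reference to column names.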

The simplest way to create the data set is probably an Excel spreadsheet, but we can handle some other formats. If you are going to send a delimited file, please use spaces or tabs rather than commas or other symbols for delimiters.
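If the data are prepared in a script rather than in Excel, a tab-delimited file can be produced along the following lines. This is only a sketch using Python's standard csv module; the records and variable names are invented for illustration, and the header row is an assumption rather than a stated requirement.

```python
import csv
import io

# Hypothetical records; the variable names are illustrative only.
records = [
    {"patid": "001", "age": 54, "weight": 70.5},
    {"patid": "002", "age": 61, "weight": 82.3},
]


def to_tab_delimited(rows, fieldnames):
    """Return the rows as tab-delimited text, one record per line,
    preceded by a header line of variable names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = to_tab_delimited(records, ["patid", "age", "weight"])
print(text)
```

To write to disk instead, the same writer can be pointed at a file opened with `open(path, "w", newline="")`.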

SAS Data Sets

If you are going to send SAS data sets, you need to create a transport file so that they can be read across systems.

PLEASE NOTE: We currently cannot support SPSS or Stata files.

2. Internal documentation

Every data set must be identified internally in such a way that even if it is mislabeled and misfiled, we know where it came from.

Identification of the Data Sets

We require that every data set include the following variables:

a. Investigator or study (first variable in the record). For example, SALSA (name of study) or JDoe (name of investigator).

b. Form within study. Forms can be identified by number or name. For example, the first annual clinical evaluation might be called CE01, and the form for Dr. Doe's study of surgical closure might be called SURGCLOS.

c. Version of form. This might be a date or a version number. This is critical if you start changing the form around during the study, and not so important if it is a one-time study and you don't change things. For example, it would be very important if you were engaged in a longitudinal study with rolling admissions and found yourself increasing the variables you collect at baseline.

These variables will be identical for every record in the data set.
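Because these identification variables should be present and identical on every record, a quick check before sending a data set can catch mistakes. The following Python sketch assumes the hypothetical variable names study, form, and formversion; substitute whatever names your data set uses.

```python
# Hypothetical names for the three identification variables.
ID_VARIABLES = ("study", "form", "formversion")


def check_identification(rows, id_variables=ID_VARIABLES):
    """Return a list of problems; an empty list means every record
    carries the same, non-empty identification values."""
    problems = []
    for name in id_variables:
        values = {row.get(name) for row in rows}
        if None in values or "" in values:
            problems.append(f"{name}: missing on at least one record")
        elif len(values) > 1:
            problems.append(f"{name}: more than one value {sorted(values)}")
    return problems


good = [
    {"study": "SALSA", "form": "CE01", "formversion": "2", "patid": "001"},
    {"study": "SALSA", "form": "CE01", "formversion": "2", "patid": "002"},
]
bad = [
    {"study": "SALSA", "form": "CE01", "formversion": "2", "patid": "001"},
    {"study": "SALSA", "form": "CE02", "formversion": "", "patid": "002"},
]

print(check_identification(good))
print(check_identification(bad))
```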

Identification of Study Participants

Within a data set, each record also must include information identifying the participants in a way that is consistent across all forms in the study.

Typically this is a study ID assigned at study entry.

Analytic data sets should not contain any personal identifiers. You should have an administrative record linking study ID to personal identifiers, but it should be maintained at the highest possible level of security if it is on your computer system, and in a locked file in an office that locks if it is on paper. The Division of Biostatistics does not need or want access to personal identifier data. Personal identifiers include, but are not limited to, name, address, phone number, SSN, medical record number, and autopsy number.

Other Data Identification

Each record should also include the date the data were collected and the ID of the person who collected the data.

3. External documentation

Each data file must be accompanied by a Code Book.  The code book is a document or file that includes the following information:

  • Name of the data file as it is stored on the computer
  • Name of the code book's author, including contact information
  • Date the code book was last updated
  • Number of records in the data file
  • List of variables, including for each variable:
      • Variable name
      • Location of variable, length of field
      • Allowable range for data
      • Missing data codes
      • Interpretation of values if not obvious (e.g., 1 = male, 2 = female)
      • Comments, including branching and logic
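A code book with the items listed above can also be kept in machine-readable form, which makes it possible to check the data file against it. The Python sketch below is one possible layout, not a required format; the file name, author, date, variables, and the missing code 9 for sex are all hypothetical.

```python
# A hypothetical code book held as a dictionary. Every value here is
# invented for illustration.
codebook = {
    "datafile": "ce01.txt",
    "author": "J. Doe (contact details would go here)",
    "updated": "2014-01-15",
    "nrecords": 3,
    "variables": {
        "educyrs": {
            "range": (0, 30),
            "missing": {98: "don't know", 99: "refused"},
            "comment": "years of formal education",
        },
        "sex": {
            "range": (1, 2),
            "missing": {9: "not recorded"},
            "comment": "1 = male, 2 = female",
        },
    },
}


def validate(rows, book):
    """Compare the data with the code book and return a list of problems:
    a wrong record count, or values that are neither in the allowable
    range nor a declared missing data code."""
    problems = []
    if len(rows) != book["nrecords"]:
        problems.append(
            f"expected {book['nrecords']} records, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        for name, spec in book["variables"].items():
            value = row[name]
            low, high = spec["range"]
            if value in spec["missing"] or low <= value <= high:
                continue
            problems.append(
                f"record {number}: {name}={value} outside {low}-{high}")
    return problems


rows = [
    {"educyrs": 12, "sex": 1},
    {"educyrs": 99, "sex": 2},   # 99 is a declared missing code
    {"educyrs": 45, "sex": 2},   # 45 is out of range
]
print(validate(rows, codebook))
```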

A sample code book is available from the Division of Biostatistics.

4. General principles for creating data sets

Be consistent. If 1=yes, 2=no for question 1, use the same coding for all yes-no questions.

Use missing data codes rather than leaving blanks, except for branched questions. Use the missing values to inform us. For example, if years of education might range from 0 to 30, let 98 mean "don't know" and 99 mean "refused".
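The reason distinct missing data codes are informative can be seen in a small Python sketch: the codes are set aside before any arithmetic, and each reason for missingness is counted separately. The variable name educyrs and the sample values are hypothetical; the codes 98 and 99 follow the example above.

```python
# Missing data codes from the years-of-education example.
MISSING_CODES = {98: "don't know", 99: "refused"}


def summarize(values, missing_codes=MISSING_CODES):
    """Return the mean of the valid values (None if there are none)
    and a count of how often each missing data code occurs."""
    valid = [v for v in values if v not in missing_codes]
    counts = {
        label: sum(1 for v in values if v == code)
        for code, label in missing_codes.items()
    }
    mean = sum(valid) / len(valid) if valid else None
    return mean, counts


educyrs = [12, 16, 98, 8, 99, 99]   # hypothetical responses
mean, counts = summarize(educyrs)
print(mean, counts)
```

Had the codes been left in, the mean would have been badly inflated; had blanks been used, the difference between "don't know" and "refused" would have been lost.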

Record years using all 4 digits. Remember the year 2000 mess that didn't happen.

In general, avoid using 0 as a code; reserve it for the numeric value zero, for example, zero years of formal education.

Use legal SAS data set variable names. If you are making a SAS data set, use names that give some idea of what the variable is. And remember, someone may import your data set into another program, so try to use names that will also be legitimate in S-PLUS, SPSS, etc.

  • Use descriptive variable names. For age, use the variable "age". For weight, use "weight". Avoid generic names like "XYZ".
  • Do not use periods ( . ), underscores ( _ ) or other punctuation in variable names.
  • Names cannot begin with numbers.
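The naming rules above can be checked mechanically. The Python sketch below accepts names made only of letters and digits that start with a letter. The 32-character limit is an assumption based on the SAS limit for variable names and is not stated in these guidelines; the function name is hypothetical.

```python
import re

# Letters and digits only, starting with a letter: no periods,
# underscores, or other punctuation, and no leading digit.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Assumption: 32 characters, the SAS limit for variable names.
MAX_LENGTH = 32


def is_acceptable_name(name, max_length=MAX_LENGTH):
    """Return True if the name follows the naming rules listed above."""
    return (len(name) <= max_length
            and NAME_PATTERN.fullmatch(name) is not None)


for candidate in ["age", "weight", "1stvisit", "visit.date", "visit_date"]:
    print(candidate, is_acceptable_name(candidate))
```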

Department of Public Health Sciences, UC Davis Health System, UC Davis
This page was updated 02 September 2014, 11:09 AM.

Reproduction of material on this web site is hereby granted solely for personal use. No other use of this material is authorized without prior written approval of the UC Regents.

Copyright © 2017 The Regents of the University of California. All rights reserved.