We handle and archive numerous data sets from different studies on our
Sun server. To ensure the best results for your study, please adhere to
our general guidelines for documenting data sets. We appreciate your cooperation.
Typically, data sets are sent to us in the form of a matrix, that is, a table
organized into rows and columns.
Creating Workable Data Sets
If the study has data consisting of V variables recorded for each of N
patients, then the data set would have N rows, one row per patient, and V
columns, one column per variable. If a person has the same data recorded
at several different visits, then it's usually easiest for us if the data
set is organized to have a separate row for each visit, with the variables
including the number and date of the visit as well as the patient ID.
The simplest way to create the data set is probably an Excel spreadsheet,
but we can handle some other formats. If you are going to send a delimited
file, please use spaces or tabs rather than commas or other symbols for delimiters.
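The layout described above (one row per visit, tab-delimited) can be sketched in Python as follows. This is only an illustration: the variable names (patientid, visitnum, visitdate, weight) and all values are hypothetical, not a required schema.

```python
import csv
import io

# Hypothetical visit records: one row per patient visit.
FIELDS = ["patientid", "visitnum", "visitdate", "weight"]
ROWS = [
    {"patientid": "P001", "visitnum": 1, "visitdate": "2001-03-05", "weight": 71.2},
    {"patientid": "P001", "visitnum": 2, "visitdate": "2001-09-10", "weight": 70.4},
    {"patientid": "P002", "visitnum": 1, "visitdate": "2001-03-12", "weight": 84.0},
]

def write_tab_delimited(rows, fields, handle):
    """Write a header line, then one tab-delimited line per visit."""
    writer = csv.DictWriter(handle, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

# Written to an in-memory buffer here; pass an open file to create a real data file.
buffer = io.StringIO()
write_tab_delimited(ROWS, FIELDS, buffer)
text = buffer.getvalue()
```

Patient P001 appears on two lines, one per visit, and each line carries the patient ID together with the visit number and date.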
SAS Data Sets
If you are going to send SAS data sets, you need to create a transport file
so that it will be compatible across systems.
PLEASE NOTE: We currently cannot support SPSS or STATA files.
2. Internal documentation
Every data set must be identified internally in such a way that even if
it is mislabeled and misfiled, we know where it came from.
Identification of the Data Sets
We require that every data set include the following variables:
a. Investigator or study (first variable in the record).
For example, SALSA (name of study) or JDoe (name of investigator).
b. Form within study. These could be identified by number
or name. For example, the first annual clinical evaluation might be called
CE01. Or this might be Dr. Doe's study of surgical closure and might be called
c. Version of form. This might be a date
or a version number. This is critical if you start changing the form
around during the study, and not so important if it is a one-time
study and you don't change things. For example, it would be very important
if you were engaged in a longitudinal study with rolling admissions and found
yourself increasing the variables you collect at baseline.
These variables will be identical for every record in the data set.
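A check along the following lines could confirm, before a file is sent, that the identification variables are present, come first, and are identical in every record. It is a minimal sketch: the field names study, form, and version are hypothetical labels for items a, b, and c, and the version value and study IDs in the example are invented.

```python
def check_identification(records, id_fields=("study", "form", "version")):
    """Return a list of problems found in the identification variables."""
    if not records:
        return ["data set is empty"]
    problems = []
    # Item a says the investigator or study is the first variable in the record.
    first_key = next(iter(records[0]))
    if first_key != id_fields[0]:
        problems.append(f"first variable is {first_key!r}, expected {id_fields[0]!r}")
    for field in id_fields:
        values = {rec.get(field) for rec in records}
        if None in values or "" in values:
            problems.append(f"{field} is missing in at least one record")
        elif len(values) > 1:
            found = sorted(str(v) for v in values)
            problems.append(f"{field} is not identical across records: {found}")
    return problems

# SALSA and CE01 are the examples from the text; the other values are hypothetical.
good = [
    {"study": "SALSA", "form": "CE01", "version": "1", "studyid": "P001"},
    {"study": "SALSA", "form": "CE01", "version": "1", "studyid": "P002"},
]
bad = good + [{"study": "SALSA", "form": "CE02", "version": "1", "studyid": "P003"}]
```

Here `check_identification(good)` returns an empty list, while `check_identification(bad)` reports that the form variable differs between records.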
Identification of Study Participants
Within a data set, each record also must include information identifying
the participants in a way that is consistent across all forms in the
study. Typically this is a study ID assigned at study entry.
Analytic data sets should not contain any personal identifiers.
You should have an administrative record linking study ID to personal
identifiers. It should be maintained at the highest possible level of security
if it is on your computer system, and in a locked file in an office if
it is on paper. The Division of Biostatistics does
not need or want access to personal identifier data. Personal
identifiers include but are not limited to name, address, phone
number, SSN, medical record number, and autopsy number.
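As a rough screen before sending an analytic data set, the column names can be compared against a list of names that suggest personal identifiers. The list below is hypothetical and should be adapted to local naming. A clean result does not prove that no identifiers are present, since the check looks only at column names and not at the values.

```python
# Hypothetical column names that suggest personal identifiers.
IDENTIFIER_NAMES = {
    "name", "firstname", "lastname", "address", "phone",
    "ssn", "mrn", "medicalrecordnumber", "autopsynumber",
}

def find_identifier_columns(columns):
    """Return the column names that match the identifier list, ignoring case."""
    return [col for col in columns if col.lower() in IDENTIFIER_NAMES]

flagged = find_identifier_columns(["studyid", "Name", "age", "SSN"])
```

In this example `flagged` contains "Name" and "SSN", which would need to be removed from the analytic data set and kept only in the administrative linking record.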
Other Data Identification
Each record should also include the date the data were collected and the
ID of the person who collected the data.
3. External documentation
Each data file must be accompanied by a Code Book. The code
book is a document or file that includes the following information:
- Name of the data file as it is stored on the computer
- Name of the code book's author, including contact information
- Date the code book was last updated.
- Number of records in the data file.
- List of variables, including for each variable:
Type of variable, length of field
Range for data
Meaning of values if not obvious (e.g. 1=male, 2=female)
including branching and logic.
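Part of a code book can be drafted mechanically from the data file itself. The sketch below assumes the records are already loaded as a list of dictionaries. The function names, the output layout, and the example values are hypothetical. The meaning of coded values and any branching still have to be written by hand.

```python
def summarize_variable(name, values):
    """Describe one variable: type, field length, and observed range."""
    present = [v for v in values if v is not None]
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return {
        "name": name,
        "type": "numeric" if numeric else "character",
        "length": max((len(str(v)) for v in present), default=0),
        "range": (min(present), max(present)) if numeric else None,
    }

def codebook_skeleton(records, file_name, author, updated):
    """Return code book lines for the items that can be derived from the data."""
    lines = [
        f"Data file: {file_name}",
        f"Author: {author}",
        f"Last updated: {updated}",
        f"Number of records: {len(records)}",
        "Variables:",
    ]
    names = list(records[0]) if records else []
    for name in names:
        info = summarize_variable(name, [rec.get(name) for rec in records])
        entry = f"  {info['name']}: {info['type']}, length {info['length']}"
        if info["range"] is not None:
            entry += f", observed range {info['range'][0]}-{info['range'][1]}"
        lines.append(entry)
    return lines

# Hypothetical file name, author, date, and records.
records = [{"studyid": "P001", "educ": 12}, {"studyid": "P002", "educ": 98}]
skeleton = codebook_skeleton(records, "ce01.txt", "J. Doe, jdoe@example.org", "2001-03-05")
```

The observed range is computed from whatever is in the file, so it includes any missing-data codes. The code book should state the valid range and the missing codes separately.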
4. General principles for creating data sets
Be consistent. If 1=yes, 2=no for question 1, use this coding for all yes/no questions.
Use missing data codes rather than leaving blanks, except for branched
questions. Use the missing values to inform us. For example, if years
of education might range from 0 to 30, let 98 be don't know and 99 be
Record years using all 4 digits. Remember the year 2000 mess that
In general avoid using 0 for codes except for the numeric value zero. For
example, zero years of formal education.
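The value of explicit missing-data codes is that an analysis program can separate them from real values, including a genuine zero. The sketch below uses the years-of-education example, with valid values 0 to 30 and 98 meaning "don't know". Any further codes are left to the study's own code book, and the function names are hypothetical.

```python
EDUC_VALID = range(0, 31)          # valid years of education: 0 to 30
EDUC_MISSING = {98: "don't know"}  # add the study's other missing codes here

def classify(value, valid=EDUC_VALID, missing=EDUC_MISSING):
    """Return ('valid', value), ('missing', reason), or ('error', value)."""
    if value in missing:
        return ("missing", missing[value])
    if value in valid:
        return ("valid", value)
    # Blanks and out-of-range values land here and need to be queried.
    return ("error", value)

def mean_of_valid(values):
    """Average only the valid values so missing codes never enter the arithmetic."""
    valid = [v for v in values if classify(v)[0] == "valid"]
    return sum(valid) / len(valid) if valid else None
```

With this scheme `classify(0)` is a valid zero years of education, `classify(98)` is a documented missing value, and a blank (`None`) or a value such as 45 is flagged as an error.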
Use legal SAS Data Set variable names. If you are making a SAS data set, use names that give some idea of what
the variable is. And remember, someone may import your data set into
another program, so try to use names that will be legitimate in Splus, SPSS,
- Use descriptive variable names. For age, use
the variable "age". For weight, use "weight". Avoid
names like "XYZ".
- Do not use
periods ( . ), underscores ( _ ) or
other punctuation in variable names.
- Variable names cannot begin with numbers.
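The two mechanical rules in this list (no punctuation, no leading digit) can be checked with a short pattern. This is a sketch of those two rules only. It does not judge whether a name is descriptive, and it does not cover length limits or reserved words that a particular program may impose.

```python
import re

# Letters and digits only, starting with a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_acceptable_name(name):
    """True if the name has no punctuation and does not begin with a digit."""
    return NAME_PATTERN.fullmatch(name) is not None

# Hypothetical variable names to screen.
candidates = ["age", "weight", "2ndvisit", "visit_date", "visit.date"]
rejected = [name for name in candidates if not is_acceptable_name(name)]
```

Here `rejected` holds "2ndvisit", "visit_date", and "visit.date", each of which breaks one of the rules above.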