PharmaSUG2011 - Paper CD21
Ensuring Consistent Data Mapping Across
SDTM-based Studies – a Data Warehouse Approach
Annie Guo, ICON Clinical Research, North Wales, PA
ABSTRACT
SDTM is about standardization of clinical trials data. This paper presents a tool that helps ensure consistent data mapping across SDTM-based studies. The tool consists of a series of SAS® programs. The input to the SAS programs comes from three sources: the annotated CRF, the SDTM data set specifications, and the SDTM SAS data sets. The SAS programs run on each study and summarize the information from the input files. The output is a set of standardized SAS data sets per study that serve as a data warehouse storing the metadata and data contained in the SDTM data sets. This data warehouse approach allows direct access to and comparison among existing studies, bypassing the original sources, and provides a reference database that facilitates the programming of new studies.
REQUIREMENTS
Requirements focus on two areas: the data warehouse, and reporting of the data in the data warehouse.
Data Warehouse
On the data warehouse side, the three sources of SDTM metadata and data, i.e., the annotated CRF (aCRF) in PDF, the SDTM data set Specifications (Specs) in Excel, and the SDTM SAS data sets, must all be integrated into a set of SAS data sets for each study. The SAS data sets follow a uniform structure to store SDTM domain names, variable names, and variable values from the three sources. The uniform structure allows for listing and comparison across studies.
Report 1: List of CRFs and associated SDTM domain(s), with hypertext links to aCRF and Specifications
This is a high-level overview of the association between CRF and SDTM. In many cases the name of the CRF determines the SDTM domain it is mapped to. For example, the Concomitant Medications CRF goes to the published Concomitant Medications (CM) domain. However, confusion may arise when it comes to custom domains. For example, does the Human Anti-Human Antibody Samples CRF go to a custom domain, and if so, have we already created one for that CRF? A list like Table 1 answers those questions.
The hypertext links to aCRF and Specs in Table 1 provide direct access to the SDTM documents. This saves time because we do not need to navigate through the folder structure on the server to locate the file and then open the specific PDF page or Excel tab.
Table 1: Report requirement 1
Report 2: List of distribution of SDTM variables on CRF annotations, Specifications and SDTM data sets, including variable values if they are annotated on CRF
The structure of this list is one row per SDTM variable per variable value annotated on aCRF, such as the sample in Table 2. The focus is on the CRF annotations because, in general, not all SDTM variables are annotated on aCRF. However, those variables or variable values that are annotated on aCRF must appear in the Specifications for the study. In addition, the Specifications and the SDTM data sets must have exactly the same SDTM variables within a study. This list points out any deviation from those rules.
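The two rules above can be checked with set operations. The following is a minimal sketch, not the paper's own code: the data set names (acrf_vars, spec_vars, sdtm_vars), their two-column layout, and the sample rows are assumptions made for illustration. In practice the SDTM variable list might come from PROC CONTENTS output instead of datalines.

```sas
/* Hypothetical inputs: one row per domain/variable from each source */
data acrf_vars;
  length domain $8 variable $8;
  input domain $ variable $;
  datalines;
AE AEREL
CM CMDOSTXT
EG EGORRES
;
run;

data spec_vars;
  length domain $8 variable $8;
  input domain $ variable $;
  datalines;
AE AEREL
AE AETERM
CM CMDOSE
EG EGORRES
;
run;

data sdtm_vars;
  length domain $8 variable $8;
  input domain $ variable $;
  datalines;
AE AEREL
AE AETERM
CM CMDOSE
CM CMTRT
EG EGORRES
;
run;

proc sql;
  /* Rule 1: variables annotated on aCRF but absent from Specifications */
  create table acrf_not_in_spec as
    select domain, variable from acrf_vars
    except
    select domain, variable from spec_vars;

  /* Rule 2: variables found in only one of Specifications / SDTM data sets */
  create table spec_sdtm_diff as
    select 'Specs only' as source length=12, domain, variable
      from (select domain, variable from spec_vars
            except
            select domain, variable from sdtm_vars)
    union all
    select 'SDTM only' as source length=12, domain, variable
      from (select domain, variable from sdtm_vars
            except
            select domain, variable from spec_vars);
quit;
```

Any row in either output table is a deviation that the report would need to surface.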
Table 2: Report requirement 2
This list summarizes CRF annotations by SDTM domain and variable, and gives us an idea of the data collected on CRF without having to open and look at the aCRF files. For example, in Table 2, both studies collect adverse event causality, AEREL. There appears to be no annotation for AEREL variable values, so most likely the values follow the pre-printed text on the CRF. For the CM domain, Study 2 has CMDOSE annotated on CRF, so this is probably a numeric data field that collects medication dose. Study 1 has CMDOSTXT annotated instead, so we can guess that the data field on the CRF collects character data covering not only medication dose but also medication unit or other information.
This list also helps identify inconsistency among studies. For example, for EGORRES in Table 2, the variable values for abnormal ECG test results differ according to the CRF annotations. Without actually opening the aCRF to verify, we may guess that Study 1 collects the ECG test result as either Abnormal or Normal, while Study 2 also asks whether an abnormal test result is clinically significant. Another possible cause of the difference is inconsistent mapping between the two studies. In other words, both studies collect Clinical Significance, but Study 1 has that piece of information mapped to SUPPEG, and EGORRES is set to ABNORMAL regardless of Clinical Significance.
Report 3: List of all variable values in SDTM SAS data sets, cross referencing controlled terminology terms in Specifications
The structure of the list is one row per combination of SDTM variable name, variable value, and variable label from SDTM SAS data sets and/or Specifications across studies. Variables subject to controlled terminology are cross-checked to show whether the variable labels and values are consistent with the Specifications.
Table 3: Report requirement 3
The purpose of this list is to show variable value mapping across studies. For example, in Table 3, the variable AEREL has the values MULTIPLE, NOT RELATED, POSSIBLY RELATED and RELATED across the three studies. Broken down by study, Study 1 has all the values except MULTIPLE according to the data set Specifications, and only RELATED was actually collected on the CRF and stored in the AE data set. For Study 2, the only value of AEREL is MULTIPLE. Although Study 1 and Study 2 look slightly different, this could be because the two studies follow different versions of the SDTM IG. So overall there seems to be no discrepancy.
This list can also identify possible data issues in the SDTM SAS data sets. For example, in Table 3, Study 3 seems to have an error. The values of AEREL in the AE data set are NOT RELATED and RELATED, but the Specifications file has MULTIPLE as the only controlled terminology term for AEREL.
Report 4: List of paired variable values for --TESTCD and --TEST variables, and paired QNAM and QLABEL in SDTM SAS data sets, cross referencing value level metadata in Specifications
Test codes and test names in Findings domains, and QNAM and QLABEL in supplemental qualifiers, have a one-to-one relationship. They can be used consistently across studies unless there are study- or sponsor-specific requirements.
The structure of this list is one row per pair of variable values from SDTM SAS data sets and/or the value level metadata in Specifications. Since it displays all the combinations in use, anyone can look up what we have already used for --TESTCD / --TEST and QNAM / QLABEL, and judge whether to stick to the existing convention or create new values.
Table 4: Report requirement 4
This list is also a tool for identifying differences across studies or validating SDTM SAS data sets against Specifications within a study. For example, as the red text in Table 4 shows, Study 3 has used the value EGCS for the variable EGTESTCD, as opposed to the value EGCLSIG used by the other two studies. Another problem in Study 3 is that it has the value INTP stored in the SDTM SAS data set, but the Specifications file has INTRP as the controlled terminology term for the variable EGTESTCD.
For supplemental qualifiers, if a combination of QNAM and QLABEL is specified in Specifications but missing from the SDTM SAS data sets, the list displays ** Absence **, as for the paired values EGCLSP and ECG Clinical Significance Specify in Table 4. This is not necessarily an error, because not all QNAM and QLABEL values defined in Specifications are required to appear in the SDTM SAS data sets. If the CRF data field for that QNAM is completely blank in the raw data, the QNAM / QLABEL is not included in the SUPP-- data set.
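The absence flag can be derived with a match-merge of the two sources. This is a minimal sketch under assumed names: spec_qnam and suppeg_pairs, the status variable sdtm_status, and the 'Present' and 'Not in Specs' labels are hypothetical, while the QNAM / QLABEL values come from the examples in this paper.

```sas
/* Hypothetical value level metadata from Specifications */
data spec_qnam;
  infile datalines dlm='|' truncover;
  length qnam $8 qlabel $40;
  input qnam $ qlabel $;
  datalines;
EGCLSIG|Clinically Significant
EGCLSP|ECG Clinical Significance Specify
;
run;

/* Hypothetical distinct QNAM/QLABEL pairs found in SUPPEG */
data suppeg_pairs;
  infile datalines dlm='|' truncover;
  length qnam $8 qlabel $40;
  input qnam $ qlabel $;
  datalines;
EGCLSIG|Clinically Significant
;
run;

proc sort data=spec_qnam;            by qnam qlabel; run;
proc sort data=suppeg_pairs nodupkey; by qnam qlabel; run;

/* Compare the two sources pair by pair */
data qnam_check;
  merge spec_qnam(in=inspec) suppeg_pairs(in=insdtm);
  by qnam qlabel;
  length sdtm_status $16;
  if inspec and insdtm then sdtm_status = 'Present';
  else if inspec       then sdtm_status = '** Absence **';
  else                      sdtm_status = 'Not in Specs';
run;
```

Merging on both QNAM and QLABEL means a pair whose label differs between the sources shows up as two rows, one absent from the data set and one not in Specs, which makes label mismatches visible as well.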
Report 5: Extended from Report 4, list of paired variable values on --TESTCD / --TEST plus --CAT / --STRESU, and paired QNAM / QLABEL plus QORIG / QEVAL in SDTM SAS data sets, cross referencing value level metadata in Specifications
Table 5: Report requirement 5
This list drills down into Report 4 by adding the category variable --CAT and the standard unit variable --STRESU for SDTM Findings domains, and the qualifier variable origin QORIG and evaluator QEVAL for supplemental qualifier data sets. The structure of the list is one row per combination of the respective four variables.
This list displays detailed information about the paired variables. For example, in Table 5, the data related to LBTESTCD is fairly consistent except for Study 3. Study 3 has two standard units, mmol/L and mg/dL, for the lab test Blood Urea Nitrogen, but only mmol/L is present in the value level metadata for that test in Specifications. This suggests a possible programming issue in Study 3.
For supplemental qualifier data sets, QORIG and QEVAL can be standardized unless there are study- or sponsor-specific requirements. For example, if QEVAL = INVESTIGATOR has been used as the default mapping, we may stick to it rather than using a different value such as SPONSOR, as Study 3 does in Table 5.
DESIGN
Base SAS is the programming environment for both the data warehouse and the reports. It is used to import the CRF annotations and the metadata in Specifications into SAS. For the SDTM SAS data sets of each study, as illustrated in Flowchart 1, PROC FREQ is used to summarize the values of individual variables and of paired variables. The data from the three sources are then merged by SDTM domain, variable and value, and saved in a set of SAS data sets.
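As an illustration of the summarization step, the sketch below runs PROC FREQ on a single variable and on a pair of variables, then reshapes one result into a uniform domain / variable / value layout. It is a sketch under stated assumptions, not the paper's program: the miniature EG data set, the output data set names, and the layout of wh_values are hypothetical.

```sas
/* Hypothetical miniature EG domain */
data eg;
  infile datalines dlm='|' truncover;
  length studyid $8 egtestcd $8 egtest $40 egorres $20;
  input studyid $ egtestcd $ egtest $ egorres $;
  datalines;
STUDY1|INTP|ECG Interpretation|NORMAL
STUDY1|INTP|ECG Interpretation|ABNORMAL
STUDY1|PR|PR Interval|160
;
run;

/* Distinct values of a single variable, with counts */
proc freq data=eg noprint;
  tables egorres / out=eg_egorres(drop=percent);
run;

/* Distinct pairs of test code and test name, with counts */
proc freq data=eg noprint;
  tables egtestcd*egtest / out=eg_testpairs(drop=percent);
run;

/* Reshape the single-variable summary into a uniform structure
   that can later be merged by domain, variable and value */
data wh_values;
  length domain $8 variable $8 value $200;
  set eg_egorres;
  domain   = 'EG';
  variable = 'EGORRES';
  value    = egorres;
  keep domain variable value count;
run;
```

Storing the variable name as data rather than as a column name is what allows summaries from different domains and studies to be stacked and compared.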
Flowchart 1: Data processes across each study
Flowchart 2: Reports across multiple studies
The output SAS data sets are merged across studies, and reports created as illustrated in Flowchart 2.
Reports are created with SAS ODS. Final reports are in Excel to take advantage of its AutoFilter tool.
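One way to produce an Excel-readable report with AutoFilter switched on is the ExcelXP tagset. The paper does not say which ODS destination was used, so this is only a sketch: the report3 data set, its rows, and the output file name report3.xml are hypothetical, and the file is written to the current working directory of the SAS session.

```sas
/* Hypothetical report data: variable values by study */
data report3;
  infile datalines dlm='|' truncover;
  length study $8 variable $8 value $40;
  input study $ variable $ value $;
  datalines;
Study 1|AEREL|RELATED
Study 2|AEREL|MULTIPLE
Study 3|AEREL|NOT RELATED
;
run;

/* ExcelXP writes SpreadsheetML (XML) that Excel opens directly;
   autofilter='all' turns on AutoFilter for every column */
ods tagsets.excelxp file="report3.xml"
    options(sheet_name='Report 3' autofilter='all');

proc print data=report3 noobs;
run;

ods tagsets.excelxp close;
```

In SAS 9.4 and later, the ODS EXCEL destination accepts the same sheet_name and autofilter options and writes a native .xlsx file.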
IMPLEMENTATION OF DATA WAREHOUSE
Annotated CRF → Data Warehouse
The annotations on CRF are created as Comments in the PDF file. They have a consistent format and layout because they follow the SDTM Submission Guidelines. For example, in Table 6, the annotation for a test code can be EGTESTCD = INTP, and for a test result EGORRES = NORMAL. QVAL for a supplemental qualifier is annotated as, for example, EGCLSIG = N in SUPPEG.
To import the annotations into SAS, first save the annotations in an ASCII file. In Adobe Acrobat, in the menu bar, click Comments and Summarize Comments…, and then click Comments Only. This extracts the annotation text to a separate window. Copy all the text, paste it into a text editor, and save it as an ASCII file. For example, in Table 6 the annotations from the PDF file for the 12-LEAD ECG CRF can be saved as the text in the ASCII file on the right.
Table 6: Converting annotations from PDF to ASCII file
Once the annotations are in an ASCII file, they can be read into SAS with a DATA step. Note that in Table 6 only the text in red is CRF annotations, plus Page 12, which is the page number from the PDF file and is part of the PDF Comments. The text is structured because we follow certain rules when creating the annotations. Therefore we can scan the imported text and extract specific information about SDTM domains, variable names, and variable values. Table 7 is the sample SAS code to process the text. The output SAS data set from the sample code is shown in Table 8.
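A minimal sketch of this kind of scan, independent of the code in Table 7, is given here. It assumes one annotation per line in the forms quoted above (VARIABLE = VALUE, VARIABLE = VALUE in SUPPxx, and Page n), reads inline datalines instead of the ASCII file, and derives the domain from the first two characters of the variable name, which is a simplification that does not hold for every SDTM variable. The data set and variable names are hypothetical.

```sas
data acrf_annot;
  infile datalines truncover;
  length line rest $200 domain $8 variable $8 value $100;
  retain page .;
  input line $char200.;
  line = strip(line);
  if line = '' then delete;

  /* Page marker from the PDF comments summary, e.g. "Page 12" */
  if upcase(scan(line, 1, ' ')) = 'PAGE' then do;
    page = input(scan(line, 2, ' '), ?? best12.);
    delete;
  end;

  /* Text before "=" is the variable name, text after is the value */
  variable = upcase(strip(scan(line, 1, '=')));
  rest     = strip(scan(line, 2, '='));

  /* "VALUE in SUPPxx" marks a supplemental qualifier annotation */
  pos = index(upcase(rest), ' IN SUPP');
  if pos > 0 then do;
    value  = strip(substr(rest, 1, pos - 1));
    domain = upcase(strip(substr(rest, pos + 4)));
  end;
  else do;
    value  = rest;
    domain = substr(variable, 1, 2);  /* assumption: 2-letter domain prefix */
  end;

  keep page domain variable value;
  datalines;
Page 12
EGTESTCD = INTP
EGORRES = NORMAL
EGCLSIG = N in SUPPEG
;
run;
```

To read a real file, the datalines would be replaced by an INFILE statement pointing at the saved ASCII file.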
Table 7: Sample SAS code to import ASCII file to SAS
Table 8: CRF annotations saved in SAS data set
Data Set Specifications → Data Warehouse
The sample SDTM data set Specifications are saved in CSV files. They include SDTM domain names, variable names, variable labels, type, length, controlled terminology terms and other required metadata. Table 9 is an example for the EG domain and its supplemental qualifier SUPPEG.
Table 9: Sample data set Specifications in CSV file for EG and SUPPEG
A DATA step combined with a SAS macro is used to loop through all domains and read the individual CSV files into SAS. Table 10 is the sample code. Table 11 is the output SAS data set, and it matches the Specifications in Table 9.
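The looping pattern can be sketched as follows, independent of the code in Table 10. Everything specific here is an assumption: the macro name read_specs, the four-column CSV layout (Variable, Label, Type, Length), one CSV file per domain named after the domain, and the use of the WORK directory as the file location. The sketch first writes two tiny CSV files so that it can run on its own.

```sas
%let specdir = %sysfunc(pathname(work));  /* assumption: CSVs live here */

/* Write two tiny CSV files so the sketch is self-contained */
data _null_;
  file "&specdir/EG.csv";
  put 'Variable,Label,Type,Length';
  put 'EGTESTCD,ECG Test or Examination Short Name,Char,8';
  put 'EGORRES,Result or Finding in Original Units,Char,20';
run;

data _null_;
  file "&specdir/SUPPEG.csv";
  put 'Variable,Label,Type,Length';
  put 'QNAM,Qualifier Variable Name,Char,8';
  put 'QLABEL,Qualifier Variable Label,Char,40';
run;

%macro read_specs(domains=);
  %local i dom;

  /* Start from an empty result on every call */
  proc datasets lib=work nolist nowarn;
    delete spec_all;
  quit;

  %let i = 1;
  %let dom = %scan(&domains, &i, %str( ));
  %do %while (%length(&dom) > 0);

    /* Read one domain's CSV, skipping the header row */
    data spec_&dom;
      infile "&specdir/&dom..csv" dsd firstobs=2 truncover;
      length domain $8 variable $8 varlabel $40 vartype $4 varlen 8;
      input variable $ varlabel $ vartype $ varlen;
      domain = "&dom";
    run;

    /* Stack the domains into one specifications data set */
    proc append base=spec_all data=spec_&dom;
    run;

    %let i = %eval(&i + 1);
    %let dom = %scan(&domains, &i, %str( ));
  %end;
%mend read_specs;

%read_specs(domains=EG SUPPEG)
```

Keeping the domain list as a macro parameter means a new study only needs a different list, not a different program.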
Table 10: Sample SAS code to import Specifications CSV files to SAS
Table 11: SDTM data set Specifications saved in SAS data set
Two other output data sets are created as shown in Table 12 and Table 13. In Table 12 the column Value is the controlled terminology terms extracted from the Specifications, and the column Label is the SDTM variable labels.
Table 13 is the value level metadata extracted from the Specifications, where the column Variable is the target variable names from the Specifications, i.e., EGTESTCD and QNAM, and the column Value stores the values of the two target variables, i.e., INTP and PR, and EGCLSIG, respectively. The column Label in Table 13 is the corresponding test names from EGTEST, i.e., ECG Interpretation and PR Interval, and the qualifier variable label from QLABEL, i.e., Clinically Significant.