PharmaSUG 2013 - Paper DS03

Programming Validation Tips for SDTM prior to using OpenCDISC validator

Dany Guerendo, STATProg LLC, Morrisville, NC


ABSTRACT:

In the years I have been working with the Clinical Data and Standards Consortium (CDISC), primarily using the standard data tabulation model (SDTM), validating the domains I create can prove to be challenging for programmers new to the standards. This paper provides tips and techniques, developed for validating domains created; prior to running into a validation tool like OpenCDISC or WebSDM.

This paper‘s sole purpose is to help facilitate the task of the primary programmer or the validation programmer, if applicable, by automating some of the repetitive tasks occurring when programming SDTM.

It may not be an exhaustive list of options but I hope it serves as guidance document for programmers.

Although I worked mostly with version 3.1.2 of the SDTM Implementation Guide, these tips work well with version

      1. which is now available.

        These programming tips were applied in SAS interactive (Window based version) version 9.2 and 9.3 as well as SAS Enterprise Guide version 3.1 and 5.3.


        INTRODUCTION:

        This paper will describe a method for automating the assignment of SDTM domain labels, for variables as well as datasets.

        It will also provide ways to assign controlled terminology, or check that the controlled terminology is properly applied.

        Lastly, it will show how to determine where a difference in observation counts occurred when validating rather large findings observation class in SDTM.


        SDTM DOMAIN: VARIABLE LABELS AND DOMAIN LABEL

        One simple solution to the frequently encountered issue of mistyped variable labels is to automate the process. Some companies may already have programs or software implemented to handle this; but if you do not, you can create a dataset from the SDTM IG excel spreadsheet listing all domains, also available online by following the links to SDTM standards on the CDISC website: www. CDISC.org. This spreadsheet is available for version 3.1.2 but may not be available for version 3.13. I reformatted the spreadsheet to look like this but it is a personal preference:


        Below is a sample SDTM metadata spreadsheet.


        Version

        Dlabel

        Domain

        Vname

        Vlabel

        Vtype

        Core

        Vorder

        Indomain

        v3.1.3

        Demographics

        DM

        STUDYID

        Study Identifier

        Char

        Req

        1

        D

        v3.1.3

        Demographics

        DM

        DOMAIN

        Domain Abbreviation

        Char

        Req

        2

        D


        v3.1.3


        Demographics


        DM


        USUBJID

        Unique Subject

        Identifier


        Char


        Req


        3


        D


        v3.1.3


        Demographics


        DM


        SUBJID

        Subject Identifier for

        the Study


        Char


        Req


        4


        D


        v3.1.3


        Demographics


        DM


        RFSTDTC

        Subject Reference

        Start Date/Time


        Char


        Exp


        5


        D


        v3.1.3


        Demographics


        DM


        RFENDTC

        Subject Reference

        End Date/Time


        Char


        Exp


        6


        D


        v3.1.3


        Demographics


        DM


        RFXENDTC

        Date/Time of First

        Study Treatment


        Char


        Exp


        7


        D


        v3.1.3


        Demographics


        DM


        RFXSTDTC

        Date/Time of Last

        Study Treatment


        Char


        Exp


        8


        D


        v3.1.3


        Demographics


        DM


        RFICDTC

        Date/Time of

        Informed Consent


        Char


        Exp


        9


        D



        v3.1.3


        Demographics


        DM


        RFPENDTC

        Date/Time of End of

        Participation


        Char


        Exp


        10


        D

        v3.1.3

        Demographics

        DM

        DTHDTC

        Date/Time of Death

        Char

        Exp

        11

        D

        v3.1.3

        Demographics

        DM

        DTHFL

        Subject Death Flag

        Char

        Exp

        12

        D

        v3.1.3

        Demographics

        DM

        SITEID

        Study Site Identifier

        Char

        Req

        13

        D

        v3.1.3

        Demographics

        DM

        INVID

        Investigator Identifier

        Char

        Perm

        14

        D

        v3.1.3

        Demographics

        DM

        INVNAM

        Investigator Name

        Char

        Perm

        15

        D

        v3.1.3

        Demographics

        DM

        BRTHDTC

        Date/Time of Birth

        Char

        Perm

        16

        D

        v3.1.3

        Demographics

        DM

        AGE

        Age

        Num

        Exp

        17

        D

        v3.1.3

        Demographics

        DM

        AGEU

        Age Units

        Char

        Exp

        18

        D


        The last column is used for the sole purpose of selecting the variables that are created in the domain. This allows you to select only the variables that are part of the domain programmed since is possible that some permissible variables (Core =Perm) are not used.

        The display order of the variables is key in the SDTM guidelines so insuring that the order is maintained should be a part of any macros automating the assignment of variable attributes.


        From this spreadsheet we create a SAS dataset. Below is an example of a simple SAS macro to read in the excel spreadsheet you created.


        %macro sdtm();


        filename sdtmct "xxxxxxx/cdisc_v_312.csv";


        *--------------------------------------------------------------------*

        • Read in excel spreadsheet

        • Output permanent dataset with all domains attributes

          *--------------------------------------------------------------------*;

          proc import out= sdtmqc.cdisc_v_312 datafile = sdtmct

          dbms=csv replace; getnames=yes;

          run;


          filename sdtmct clear;


          %mend;

          Once executed this program creates a readily available dataset of all SDTM variable labels. See an example of the output below:


          Sample SDTM attribute dataset.


          image


          Version: The variable allows you to enter several versions of the domain in the spreadsheet. Domain: The variable selecting which domain attributes you need in the run.

          Dlabel: All SDTM dataset labels

          Now having access to this data, a macro can be created to assign variable and dataset labels. The macro below is a fairly simple example of how you can accomplish this task.


          *---------------------------------------------------------------------*

        • Select a the appropriate domain from dataset CDISC_v_312

        • Create a permanent dataset

        • with the appropriate attributes from the input work dataset

          *--------------------------------------------------------------------*;

          %macro attribut(dsin=, libsdtm =, domain =, dssort =);


          data attrib(where=(compress(upcase(domain))="&domain"));

          set &libsdtm..cdisc_v_312(keep =version dlabel domain vname vlabel vtype core vorder indomain);

          if compress(indomain) =:'N' then delete;

          run;


          proc sort data =attrib; by vname;

          run;


          *---------------------------------------------------------------------*

          *Check content of created domain: variable names, labels, type and

          *length

          *--------------------------------------------------------------------*;

          proc sql noprint;

          create table content as

          select libname, upcase(name) as vname length=8, label as wklabel, type as wktype label ="Type as defined in &dsin", length as

          wklength


          quit;

          from dictionary.columns

          where libname ='WORK' and memname =upcase("&dsin") order by vname;


          *---------------------------------------------------------------------*

          *Merge content of created domain with attrib

          *Check attributes in work dataset against attrib

          *--------------------------------------------------------------------*;

          data attrib2;

          merge attrib(in=at) content(in=ct); by vname;

          ** Create length of variables based on work dataset attributes;

          if upcase(vtype) ='CHAR' then clength ='$'||strip(put(wklength,best.)); else clength =strip(put(wklength,best.));


          if at and substr(vname,3) ='SEQ' then clength =strip(put(8,best.)); if at and vname eq 'DOMAIN' then clength ='$'||strip(put(2,best.));

          if at then output attrib2;


          run;


          *---------------------------------------------------------------------*

          Create macro variable of record counts in attribute dataset

          *--------------------------------------------------------------------*;

          proc sql noprint;

          select count(*) into: obscnt

          from attrib2(where=(domain="&domain"))

          ;

          quit;

          %let obs =%sysfunc(compress(&obscnt)); proc sort data =attrib2;

          by vorder;

          run;


          *---------------------------------------------------------------------*

        • Create macro variables of variable names, labels, type and length

          *--------------------------------------------------------------------*;

          proc sql noprint;

          select vname into: varnm1-:varnm&obs from attrib2(where=(indomain ='D'))

          ;

          select vlabel into: lbl1-:lbl&obs from attrib2(where=(indomain ='D'))

          ;

          select vtype into: type1-:type&obs from attrib2(where=(indomain ='D'))

          ;

          select clength into: lgth1-:lgth&obs from attrib2(where=(indomain ='D'))

          ;

          quit;

          *---------------------------------------------------------------------*

        • Create dataset label

        *--------------------------------------------------------------------*;

        proc sql noprint;

        select strip(dlabel) into: dslabel

        from attrib(where=(upcase(domain)="&domain"))

        ;

        quit;


        *---------------------------------------------------------------------*

        Create full attribute text

        *--------------------------------------------------------------------*;

        data _null_;

        %do i=1 %to &obs;

        call symput("attrib" ||strip(put(&i,best.)), "&&varnm&i" || " label=" || "'&&lbl&i'" || " length=" || "&&lgth&i" || " format=" || "&&lgth&i"||".");

        %end; run;


        *---------------------------------------------------------------------*

        Output SDTM dataset with proper variable attributes

        *--------------------------------------------------------------------*;

        data &domain(label ="&dslabel"

        keep =

        %do i=1 %to &obs;

        &&varnm&i

        %end😉 &dsin._;

        attrib

        %do i=1 %to &obs;

        &&attrib&i


        run;

        %end;

        ;

        set &dsin; by &dssort;


        ** Assign domain name; domain =compress("&domain");


        %mend attribut;


        **Sample call;

        *%attribut(dsin=, libsdtm =, domain =, dssort =);

        An example of a demographic domain created using the macro ATTRIBUT.


        image


        QUALITY CHECK ON CONTROLLED TERMINOLOGY

        One of the most common issue we encounter is, when controlled terminology is applied, how do we recognized terms that do not comply?

        One idea is to create a format catalog from the controlled terminology(CT) spreadsheet available from the National Cancer Institute website; looks like this:


        This is a snapshot of the CDISC controlled terminology.



        Code

        Codelist Code

        Codelist Extensible

        (Yes/No)


        Codelist


        Codelist Name

        CDISC Submission Value


        CDISC Synonym(s)

        C66767


        No

        ACN

        Action Taken with Study Treatment

        ACN

        Action Taken with Study Treatment

        C49503

        C66767


        ACN

        Action Taken with Study Treatment

        DOSE INCREASED


        C49504

        C66767


        ACN

        Action Taken with Study Treatment

        DOSE NOT CHANGED


        C49505

        C66767


        ACN

        Action Taken with Study Treatment

        DOSE REDUCED


        C49501

        C66767


        ACN

        Action Taken with Study Treatment

        DRUG INTERRUPTED


        C49502

        C66767


        ACN

        Action Taken with Study Treatment

        DRUG WITHDRAWN


        C48660

        C66767


        ACN

        Action Taken with Study Treatment

        NOT APPLICABLE

        NA


        C17998

        C66767


        ACN

        Action Taken with Study Treatment

        UNKNOWN

        U; Unknown

        C66768


        No

        OUT

        Outcome of Event

        OUT

        Outcome of Event


        ACN is the SDTM code list applied to the AE domain variable, AEACN for action taken. First read the controlled terminology in SAS as a dataset using proc import or data steps. filename sdtmct "xxxxxxxxxxxx/SDTM Terminology.csv";

        proc import out= ctemp1 datafile = sdtmct dbms=csv replace; getnames=yes;

        run;


        filename sdtmct clear;

        Create a SAS dataset from the imported controlled terminology. Table below shows an example of resulting data:

        image

        The dataset was manipulated to populate the “code list extensible” variable for all rows.

        Keeping this column, indicating whether a code list is extensible or not, could prove very practical for people new to the standard. All the code lists appearing in the table above are non-extensible. This means you do not have the flexibility to deviate from what CDISC proposes for that code list.

        RACE is an extensible code list. If you consistently collect a race category that does not appear in the CDISC terminology, you may suggest the value be added. The code list is updated at least once a year so it is a good idea to check frequently if a new one is available as new terms may be added.

        There are, I am sure, many ways to use this data to automate assignment of a format to a variable.

        If access to the database code list is available, one way would be to create a dataset with a column for all values in your raw file and the corresponding CDISC controlled terminology assigned.

        For a demographic data, for example, if the race did not match the CDISC controlled terminology, your data could look something like this:

        RAW_RACE CDISC SUBMISSION VALUE

        BLACK BLACK OR AFRICAN AMERICAN

        CAUCASIAN WHITE

        CHINESE ASIAN

        You can then create a format assigning the expected CDISC controlled terms:

        proc format;

        value $race

        'BLACK' ='BLACK OR AFRICAN AMERICAN' 'CAUCASIAN' ='WHITE'

        'CHINESE' ='ASIAN'

        ;

        run;


        Instead of repeating this step for every variable, you can simply create a format dataset by adding a column in the dataset you already have. Below VALUE show the database code list for race and sex as collected in your data:


        image

        Sample codes for creating a format from a dataset is available on the SAS support website http://support.sas.com Below is my own version for the dataset I have:

        proc sql noprint;

        create table fmttable as

        select unique strip(CODELIST) as fmtname label ='Format Name', strip(CDISC_SUBMISSION_VALUE) as label,

        strip(VALUE) as start, from

        ;

        quit;


        data ctrl;

        set ctfmt end=last; end =start;

        output;

        if last then do; start='**OTHER**'; end =start;


        run;

        label=' '; output; end;


        proc format library=work cntlin=ctrl;

        run; quit;

        The output looks like this:

        image


        The format can then be applied in any data step:

        data dm;

        attrib race length=$40; set dmraw;

        if not missing(raceval) then race =put(raceval,$race.);

        run;


        QC OF SDTM VARIABLES - CODE:

        However, if the data collected should match the controlled terminology (CT) because your data management (DM) team uses the Clinical Data Acquisition Standard Harmonization (CDASH), then you only have to check your SDTM variables against the CDISC controlled terminology.

        Let us use the Adverse Event (AE) domain as an example.

        For example, you can check that AEACN complies with the CDISC CT by following the steps below:

        • Select all possible values of AE.AEACN in the created AE domain then merge with the values in the appropriate controlled terminology, ACN.


proc freq data =AE noprint;

Table AEACN/ missing out=AE_ACTION;

run;


proc sql noprint;

create table AECT as

select aeacn, b.cdisc_submission_value, b.codelist


quit;

from AE_ACTION left join CTDATA(where=(codelist in ('ACN'))) as b on aeacn=b.cdisc_submission_value

;


If you have a one-to-one match then your AE domain has the right values in AEACN, otherwise the output will show differences. See example below:

image


HOW TO QUICKLY CHECK DIFFERENCES IN NUMBER OF RECORDS FOR FINDINGS DOMAINS

One issue often encounters when validating data using parallel programming is a difference in number of records. For findings domain, a quick way to identify where the difference occurred is by narrowing it down to which test category and code differ.

Below is an example for the LB domain:

proc sql noprint;

select lbcat,lbscat,lbtestcd,lbtest, count(usubjid) as count from sourcelb

group by lbcat, lbscat, lbtestcd, lbtest order by lbcat, lbscat, lbtestcd, lbtest

;

quit;


The same result can be achieved with proc freq:


proc freq data =sourcelb noprint;

table lbcat*lbscat*lbtestcd*lbtest / out=qc(keep =lbcat lbscat lbtestcd lbtest count rename=(count=n)) list missing nocum nopercent;

run;


Create a similar dataset for the validation dataset:


proc sql noprint;

create table qs as

select lbcat,lbscat,lbtestcd,lbtest, count(usubjid) as qccount from validlb

group by lbcat, lbscat, lbtestcd, lbtest order by lbcat, lbscat, lbtestcd, lbtest

;

quit;


data miss;

merge qc(in=qc) source(in=dv); by lbcat lbscat lbtestcd lbtest;

run;


Show dataset here:


image

The records where the variables COUNT and QCCOUNT are not equal help you narrow down where the difference comes from.

You can add as many variable levels as needed based on your validation requirements. If stopping at the test level is not enough, add the visit values for example.

The idea is to pin-point where the difference is as opposed to looking for “a needle in a haystack” since datasets, especially laboratory data, can get very large.


CONCLUSION:


This paper has paper provided examples of SAS code that can be used to automate certain validation programming tasks when creating SDTM datasets.

It can also be used as a starting point for programmers, with no automation tools in place, on how to minimize the amount of repeat programming that often comes with creation of SDTM domains.

As with any new process, starting up is often the most cumbersome. Creating a library of domain attributes and labels as well as one of all the controlled terminologies, CDISC and in-house (sponsor defined), would be most useful; and is almost a necessity if the validation task is to become more efficient.

OpenCDISC will check your SDTM domain, and is an excellent tool for validation but automating programming tasks will help reduce the amount of time checking the report for warning about attributes for example.


REFERENCES:

SAS support website: http://support.sas.com CDISC: http://www.cdisc.org

OpenCDISC: http://www.opencdisc.org

National Cancer Institute Website: http://www.cancer.gov/cancertopics/cancerlibrary/terminologyresources/cdisc


ACKNOWLEDGMENTS:

This is the text for the acknowledgments. This is the paper body. This is the paper body. This is the paper body. This is the paper body. This is the paper body.


RECOMMENDED READING:

  • SDTM Implementation Guide: version 3.1.3 and version 3.1.2

  • Study Data Tabulation Model v1.3

  • How to use SDTMIG 3.1.3


CONTACT INFORMATION:

Your comments and questions are valued and encouraged. Contact the author at: Name: Dany Guerendo

Enterprise: STATProg LLC Address: 512 Abbey Fields Loop

City, State ZIP: Morrisville, NC, 27560 Work Phone: 919-423-3560

E-mail: dany.guerendo@statprogllc.com


SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.