
Douglas S. McNair MD PhD
Sr. Vice President, Research
Cerner Corporation
2800 Rockcreek Parkway
Kansas City, MO 64117-2551
Tel: 816.201.0511
dmcnair@cerner.com
www.cerner.com
Contemporary datasets to support observational research regarding safety and efficacy have ‘curses’ both of dimensionality and cardinality. They may include records for millions of persons and values for hundreds of thousands or millions of different variables. The databases range from terabytes (10^12 bytes) to petabytes (10^15 bytes) in size, and relational table scans may take hours or days to execute unless the datasets are organized for massively parallelized processing by hundreds or thousands of servers.
In this context, selecting effective markers with rather moderate effects from a large number of candidate variables in electronic health records and genomics/proteomics data warehouses is a formidable challenge. Model selection in which human experts nominate variables on the basis of a priori plausibility is no longer sufficient: it yields many false negatives and poorly performing models. Model selection by multiple testing of individual model parameters is, under general conditions, an asymptotically consistent selection procedure, but it does not necessarily scale to thousands of dimensions. Selection based on a multiple-testing procedure that controls the false discovery rate (FDR), with construction of a linear score in a ‘training’ dataset and evaluation by the receiver operating characteristic (ROC) curve on independent ‘validation’ dataset(s), is discussed along with other practical alternative approaches. Using existing de-identified, confidentiality-protected data warehouses derived from electronic medical records, it is possible to evaluate scenarios with varying numbers of markers, varying numbers and types of endpoints, varying proportions of effective markers, and varying sample sizes.
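The FDR-then-ROC workflow mentioned above can be illustrated with a minimal sketch. The abstract does not name a specific FDR procedure, so the Benjamini-Hochberg step-up rule is used here as an assumption. The function names (`benjamini_hochberg`, `linear_score`, `auc`) and the idea of passing marker weights as a dictionary are hypothetical choices for illustration, not the lecture's implementation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of markers selected at FDR level q.

    Assumption: Benjamini-Hochberg step-up rule; the abstract only
    says "a multiple-test controlling the FDR".
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is at or below q * rank / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank
    # Step-up: every marker ranked at or below k is selected.
    return sorted(order[:k])


def linear_score(row, weights):
    """Linear score for one record: sum of weight * marker value.

    `weights` maps a marker index to a coefficient that is assumed to
    have been estimated on the training dataset only.
    """
    return sum(w * row[j] for j, w in weights.items())


def auc(scores, labels):
    """Area under the ROC curve on validation data.

    Computed as the fraction of (case, control) pairs in which the case
    has the higher score, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, one would run `benjamini_hochberg` on per-marker p-values from the training dataset, fit weights for the selected markers there, and then pass validation records through `linear_score` and `auc`. The pairwise AUC loop is quadratic in sample size, so at the data volumes described here a rank-based or distributed computation would be needed instead.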