Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics

Wu, Hulin; Yamal, Jose Miguel; Yaseen, Ashraf


The use of Electronic Health Records (EHR)/Electronic Medical Records (EMR) data is becoming more prevalent for research. However, analysis of this type of data has many unique complications arising from how the data are collected and processed and from the types of questions that can be answered. This book covers many important topics related to using EHR/EMR data for research, including data extraction, cleaning, processing, analysis, inference, and prediction, drawing on the authors' many years of practical experience. The book carefully evaluates and compares standard statistical models and approaches with machine learning and deep learning methods, and reports unbiased comparison results for these methods in predicting clinical outcomes from EHR data.

Key Features:

  • Written based on hands-on experience of contributors from multidisciplinary EHR research projects, which include methods and approaches from statistics, computing, informatics, data science and clinical/epidemiological domains.
  • Documents the detailed experience on EHR data extraction, cleaning and preparation.
  • Provides a broad view of statistical approaches and machine learning prediction models to deal with the challenges and limitations of EHR data.
  • Considers the complete cycle of EHR data analysis.

EHR/EMR data analysis requires close collaboration among statisticians, informaticians, data scientists and clinical/epidemiological investigators. This book reflects that multidisciplinary perspective.


  • Hulin Wu, PhD, is the endowed Betty Wheless Trotter Professor and Chair, Department of Biostatistics & Data Science, School of Public Health (SPH), University of Texas Health Science Center at Houston (UTHealth). Dr. Wu also holds a joint appointment as Professor at the UTHealth School of Biomedical Informatics. Dr. Wu received BS and MS training in engineering and a PhD in statistics. He has many years of experience in developing novel statistical methods, mathematical models and informatics tools for biomedical data analysis and modeling. He is the Founding Director of the Center for Big Data in Health Sciences (CBD-HS) and directs the EHR research working group at UTHealth SPH.

  • Dr. Yamal is a tenured Associate Professor in the Department of Biostatistics & Data Science and a member of the Coordinating Center for Clinical Trials at UTHealth School of Public Health. Dr. Yamal has extensive experience in clinical trials, including work with data coordinating centers and service on Data Safety Monitoring Boards for clinical trials in stroke and traumatic brain injury. He has also contributed to statistical methodology for classification problems with nested data, as well as to machine learning applications.

  • Ashraf Yaseen is an Assistant Professor of Data Science at the School of Public Health, UTHealth. He has extensive experience in database design, implementation and management, machine learning, and high-performance computing. In his current research work, Dr. Yaseen is exploring big data integration and deep learning technologies in electronic health records to address clinical and public health questions.

  • Vahed Maroufy, PhD, Assistant Professor, Department of Biostatistics & Data Science, UTHealth School of Public Health. Dr. Maroufy received MSc and PhD training in statistics and has experience in applied and theoretical statistics, including geometry of statistical models, mixture models, Bayesian inference, predictive models using EHR data, and analysis of genetic data in cancer research.