Graduate Program

Technology

Degree Name

Master of Science (MS)

Semester of Degree Completion

Fall 2025

Thesis Director

Toqeer A. Israr

Thesis Committee Member

Ammar Bhutta

Thesis Committee Member

Sean T. Roberts

Abstract

This research presents an automated disease diagnosis framework that uses natural language processing and deep learning to predict diagnosis codes from unstructured electronic health record (EHR) clinical notes. Using the MIMIC-III critical care dataset, clinical narratives such as discharge summaries and physician notes are extracted, validated, and preprocessed to construct a labeled corpus aligned with ICD-9 diagnoses. The study implements a pipeline comprising text cleaning, feature extraction, and supervised learning, and compares baseline models such as Logistic Regression and Bi-LSTM with transformer-based architectures built on BERT. Models are trained with categorical cross-entropy loss and evaluated with standard multi-class metrics, including accuracy, F1-score, and ROC-AUC, while hyperparameters such as learning rate, optimizer configuration, and training epochs are systematically tuned. Experimental results show that BERT-based classifiers substantially outperform the baselines in accuracy and F1-score and handle complex clinical terminology and context robustly. These findings highlight the potential of transformer-based NLP models to enhance clinical decision support and large-scale phenotyping, while also underscoring limitations related to label noise, dataset-specific bias, and truncated input length, and the need for more comprehensive interpretability and multi-label modeling in future work.
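
As an illustration of the loss and metrics named in the abstract, the following is a minimal sketch in plain Python of categorical cross-entropy, accuracy, and macro-averaged F1 for a single-label, multi-class setting. The function names and the toy probabilities are hypothetical and chosen only for illustration; they are not taken from the thesis, and the averaging mode (macro) is an assumption, since the abstract does not state which F1 variant was reported.

```python
import math

def categorical_cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)

def argmax_predictions(probs):
    """Index of the highest-probability class for each example."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (assumed averaging mode)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy example: four notes, three hypothetical diagnosis classes.
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
toy_labels = [0, 1, 2, 1]
toy_preds = argmax_predictions(toy_probs)

loss = categorical_cross_entropy(toy_probs, toy_labels)
acc = accuracy(toy_labels, toy_preds)
f1 = macro_f1(toy_labels, toy_preds, num_classes=3)
```

In practice these quantities would typically come from a library such as scikit-learn or a deep learning framework; the explicit version here only shows what each number measures.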
