Graduate Program
Technology
Degree Name
Master of Science (MS)
Semester of Degree Completion
Fall 2025
Thesis Director
Toqeer A. Israr
Thesis Committee Member
Ammar Bhutta
Thesis Committee Member
Sean T. Roberts
Abstract
This research presents an automated disease diagnosis framework that uses natural language processing and deep learning to predict diagnosis codes from unstructured clinical notes in electronic health records (EHRs). Using the MIMIC-III critical care dataset, clinical narratives such as discharge summaries and physician notes are extracted, validated, and preprocessed to construct a labeled corpus aligned with ICD-9 diagnoses. The study implements a pipeline comprising text cleaning, feature extraction, and supervised learning, and compares baseline models such as Logistic Regression and Bi-LSTM with transformer-based architectures built on BERT. Models are trained with categorical cross-entropy loss and evaluated with standard multi-class metrics, including accuracy, F1-score, and ROC-AUC; hyperparameters such as learning rate, optimizer configuration, and number of training epochs are systematically tuned. Experimental results show that BERT-based classifiers substantially outperform the baselines, achieving higher accuracy and F1-scores and handling complex clinical terminology and context more robustly. These findings highlight the potential of transformer-based NLP models to enhance clinical decision support and large-scale phenotyping. They also underscore limitations related to label noise, dataset-specific bias, and truncated input length, and point to the need for more comprehensive interpretability and multi-label modeling in future work.
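As an illustration of the loss and metrics named in the abstract, the following is a minimal sketch in plain Python, not the thesis's implementation. The function names, the toy probabilities, and the three-class label set are hypothetical, and macro averaging of F1 is an assumption, since the abstract does not state which averaging scheme was used.

```python
import math
from collections import Counter


def categorical_cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-12  # guard against log(0)
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of notes whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: four notes, three hypothetical diagnosis classes (0, 1, 2).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
y_true = [0, 1, 2, 1]
# Predicted class = index of the highest probability in each row.
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]

print("cross-entropy:", categorical_cross_entropy(probs, y_true))
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case the fourth note is misclassified, so accuracy is 0.75 while macro F1 weights each class equally regardless of how many notes it has. That property matters when diagnosis codes are imbalanced.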
Recommended Citation
Sherla, Manas Kumar, "Disease Diagnosis Using Natural Language Processing and Deep Learning" (2025). Masters Theses. 5110.
https://thekeep.eiu.edu/theses/5110