Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications


Date

2024

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Biomedical text document classification is an essential task within Natural Language Processing (NLP), a field in which text classification has applications ranging from sentiment analysis to authorship identification. Despite advancements in traditional machine-learning algorithms like Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist. Recent years have seen a surge in the use of deep learning models to mitigate these issues. This study conducts a comparative analysis of several machine-learning algorithms for classifying biomedical text documents. It uses the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, comprising 7570 biomedical text documents labeled with one of three types of cancer (colon, lung, and thyroid). A preprocessing pipeline of tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated through 5-fold cross-validation. Performance is measured with accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC). Logistic Regression outperforms the other algorithms with an accuracy of 78.3% and an AUC ROC of 88.59%. SVM and Multinomial Naive Bayes follow with lower performance metrics. Hyperparameter tuning further enhances the performance of the algorithms, particularly Logistic Regression. The study contributes a systematic comparison of machine-learning algorithms for biomedical text classification. Logistic Regression emerges as the most effective, underscoring the importance of algorithm selection and hyperparameter tuning in machine-learning applications within this domain.
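The preprocessing steps named in the abstract (tokenization, stop-word removal, TF-IDF vectorization) can be illustrated with a minimal standard-library sketch. The abstract does not say which tokenizer, stop-word list, or TF-IDF variant the authors used, so everything here is an assumption: the `STOP_WORDS` set and the three example documents are made up, and the weighting uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, which is the default form in scikit-learn's `TfidfVectorizer`.

```python
import math
import re
from collections import Counter

# Hypothetical toy stop-word list; the study's actual list is not given in the abstract.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "was", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term at least once.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[term] * idf[term] for term in vocabulary]
        # Guard against an all-zero vector (a document with no remaining tokens).
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors


# Made-up example documents, not taken from the Kaggle dataset.
docs = [
    "Thyroid carcinoma was detected in the thyroid gland",
    "Colon carcinoma screening with colonoscopy",
    "Lung nodules and lung carcinoma in smokers",
]
vocab, vecs = tfidf_vectors(docs)
```

In this toy corpus "carcinoma" occurs in every document and so receives the lowest IDF, while "thyroid" occurs only in the first document and dominates its vector. Vectors of this kind are what a classifier such as Logistic Regression, SVM, or Multinomial Naive Bayes would be trained on within each cross-validation fold.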

Source

Medicine Science

Volume

13

Issue

1
