Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications

Küçük, Ekrem; Çiçek, İpek Balıkcı; Küçükakçalı, Zeynep; Yetiş, Cihan

dc.contributor.author	Küçük, Ekrem
dc.contributor.author	Çiçek, İpek Balıkcı
dc.contributor.author	Küçükakçalı, Zeynep
dc.contributor.author	Yetiş, Cihan
dc.date.accessioned	2024-08-04T19:53:22Z
dc.date.available	2024-08-04T19:53:22Z
dc.date.issued	2024
dc.department	İnönü Üniversitesi	en_US
dc.description.abstract	Biomedical text document classification is an essential task within Natural Language Processing (NLP), with applications ranging from sentiment analysis to authorship identification. Despite advancements in traditional machine-learning algorithms like Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist. Recent years have seen a surge in the use of deep learning models to mitigate these issues. This study aims to conduct a comparative analysis of various machine-learning algorithms for classifying biomedical text documents. The study employs the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, comprising 7570 biomedical text documents labeled into three types of cancer (colon, lung, and thyroid). A preprocessing pipeline involving tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Algorithms including Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated through 5-fold cross-validation. Performance metrics like accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC) are employed. Logistic Regression outperforms the other algorithms with an accuracy of 78.3% and an AUC ROC of 88.59%. SVM and Multinomial Naive Bayes follow with lower performance metrics. Hyperparameter tuning further enhances the performance of the algorithms, particularly Logistic Regression. The study makes a significant contribution to the field of biomedical text classification by systematically comparing machine-learning algorithms. Logistic Regression emerges as the most effective, emphasizing the importance of algorithm selection and hyperparameter tuning in machine learning applications within this domain.	en_US
dc.identifier.doi	10.5455/medscience.2023.10.209
dc.identifier.endpage	174	en_US
dc.identifier.issn	2147-0634
dc.identifier.issue	1	en_US
dc.identifier.startpage	171	en_US
dc.identifier.trdizinid	1244644	en_US
dc.identifier.uri	https://doi.org/10.5455/medscience.2023.10.209
dc.identifier.uri	https://search.trdizin.gov.tr/yayin/detay/1244644
dc.identifier.uri	https://hdl.handle.net/11616/89710
dc.identifier.volume	13	en_US
dc.indekslendigikaynak	TR-Dizin	en_US
dc.language.iso	en	en_US
dc.relation.ispartof	Medicine Science	en_US
dc.relation.publicationcategory	Article - National Peer-Reviewed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.title	Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications	en_US
dc.type	Article	en_US
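
The preprocessing steps named in the abstract (tokenization, stop-word removal, TF-IDF vectorization) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the stop-word list, the toy documents, and the function names `tokenize` and `tfidf` are hypothetical, and the smoothed IDF with L2 normalization is an assumed variant, since the abstract does not say which TF-IDF formula was used.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "to", "was", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: raw term counts times the smoothed IDF
    ln((1 + N) / (1 + df)) + 1, followed by L2 normalization.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy documents made up for illustration; they are not from the Kaggle dataset.
docs = [
    "Thyroid cancer was detected in the thyroid gland.",
    "Lung cancer is associated with smoking.",
    "Colon cancer screening and colon surgery.",
]
vectors = tfidf(docs)
```

In this toy corpus the term "cancer" occurs in every document and so receives a lower weight than class-specific terms such as "thyroid", which is the property that makes TF-IDF features useful to the classifiers the abstract compares.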

Collection

TR-Dizin Indexed Publications Collection
