Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications

dc.contributor.author: Küçük, Ekrem
dc.contributor.author: Çiçek, İpek Balıkcı
dc.contributor.author: Küçükakçalı, Zeynep
dc.contributor.author: Yetiş, Cihan
dc.date.accessioned: 2024-08-04T19:53:22Z
dc.date.available: 2024-08-04T19:53:22Z
dc.date.issued: 2024
dc.department: İnönü Üniversitesi [en_US]
dc.description.abstract: Biomedical text document classification is an essential task within Natural Language Processing (NLP), with applications ranging from sentiment analysis to authorship identification. Despite advancements in traditional machine-learning algorithms such as Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist. Recent years have seen a surge in the use of deep learning models to mitigate these issues. This study conducts a comparative analysis of machine-learning algorithms for classifying biomedical text documents. It uses the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, which comprises 7570 biomedical text documents labeled with one of three cancer types (colon, lung, and thyroid). A preprocessing pipeline of tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated through 5-fold cross-validation, using accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC) as performance metrics. Logistic Regression outperforms the other algorithms with an accuracy of 78.3% and an AUC ROC of 88.59%. SVM and Multinomial Naive Bayes follow with lower performance metrics. Hyperparameter tuning further enhances the performance of the algorithms, particularly Logistic Regression. The study contributes a systematic comparison of machine-learning algorithms for biomedical text classification. Logistic Regression emerges as the most effective, which underscores the importance of algorithm selection and hyperparameter tuning in machine learning applications within this domain. [en_US]
dc.identifier.doi: 10.5455/medscience.2023.10.209
dc.identifier.endpage: 174 [en_US]
dc.identifier.issn: 2147-0634
dc.identifier.issue: 1 [en_US]
dc.identifier.startpage: 171 [en_US]
dc.identifier.trdizinid: 1244644 [en_US]
dc.identifier.uri: https://doi.org/10.5455/medscience.2023.10.209
dc.identifier.uri: https://search.trdizin.gov.tr/yayin/detay/1244644
dc.identifier.uri: https://hdl.handle.net/11616/89710
dc.identifier.volume: 13 [en_US]
dc.indekslendigikaynak: TR-Dizin [en_US]
dc.language.iso: en [en_US]
dc.relation.ispartof: Medicine Science [en_US]
dc.relation.publicationcategory: Article - National Refereed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.title: Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications [en_US]
dc.type: Article [en_US]
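
The preprocessing steps named in the abstract (tokenization, stop-word removal, TF-IDF vectorization) can be illustrated with a minimal sketch. This is not the authors' code: the stop-word list, the example sentences, the function names `tokenize` and `tfidf`, and the particular TF-IDF variant (tf = count / document length, idf = log(N / df)) are all assumptions made for illustration. The record does not state which variant or library the study used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "with", "for", "to"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / df),
    one common textbook variant of TF-IDF (assumed here).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example sentences, not taken from the Kaggle dataset.
docs = [
    "Colon cancer was detected in the patient.",
    "Lung cancer is associated with smoking.",
    "Thyroid cancer responds to treatment.",
]
vectors = tfidf(docs)
```

Under this variant, a term that occurs in every document (here "cancer") receives a weight of zero, while class-specific terms such as "colon" receive positive weights. That behavior is what makes TF-IDF features useful as input to classifiers like those compared in the study.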
