Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications
dc.contributor.author | Küçük, Ekrem | |
dc.contributor.author | Çiçek, İpek Balıkcı | |
dc.contributor.author | Küçükakçalı, Zeynep | |
dc.contributor.author | Yetiş, Cihan | |
dc.date.accessioned | 2024-08-04T19:53:22Z | |
dc.date.available | 2024-08-04T19:53:22Z | |
dc.date.issued | 2024 | |
dc.department | İnönü Üniversitesi | en_US |
dc.description.abstract | Biomedical text document classification is an essential task within Natural Language Processing (NLP), with applications ranging from sentiment analysis to authorship identification. Despite advancements in traditional machine-learning algorithms like Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist. Recent years have seen a surge in the use of deep learning models to mitigate these issues. This study aims to conduct a comparative analysis of various machine-learning algorithms for classifying biomedical text documents. The study employs the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, comprising 7570 biomedical text documents labeled into three types of cancer (colon, lung, and thyroid). A preprocessing pipeline involving tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Algorithms including Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated through 5-fold cross-validation. Performance metrics like accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC) are employed. Logistic Regression outperforms the other algorithms with an accuracy of 78.3% and an AUC ROC of 88.59%. SVM and Multinomial Naive Bayes follow with lower performance metrics. Hyperparameter tuning further enhances the performance of the algorithms, particularly Logistic Regression. The study makes a significant contribution to the field of biomedical text classification by systematically comparing machine-learning algorithms. Logistic Regression emerges as the most effective, emphasizing the importance of algorithm selection and hyperparameter tuning in machine learning applications within this domain. | en_US |
dc.identifier.doi | 10.5455/medscience.2023.10.209 | |
dc.identifier.endpage | 174 | en_US |
dc.identifier.issn | 2147-0634 | |
dc.identifier.issue | 1 | en_US |
dc.identifier.startpage | 171 | en_US |
dc.identifier.trdizinid | 1244644 | en_US |
dc.identifier.uri | https://doi.org/10.5455/medscience.2023.10.209 | |
dc.identifier.uri | https://search.trdizin.gov.tr/yayin/detay/1244644 | |
dc.identifier.uri | https://hdl.handle.net/11616/89710 | |
dc.identifier.volume | 13 | en_US |
dc.indekslendigikaynak | TR-Dizin | en_US |
dc.language.iso | en | en_US |
dc.relation.ispartof | Medicine Science | en_US |
dc.relation.publicationcategory | Article - National Refereed Journal - Institutional Faculty Member | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.title | Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications | en_US |
dc.type | Article | en_US |
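
The workflow summarized in the abstract (tokenization, stop-word removal, TF-IDF weighting, then 5-fold cross-validation of a classifier) can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It uses only the Python standard library, shows only Multinomial Naive Bayes out of the three algorithms compared, and reports only accuracy. The toy corpus, the stop-word list, the smoothed IDF formula, and all function names are assumptions made for this example. The Kaggle dataset is not loaded here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (NOT the Kaggle dataset): 5 short documents per class.
DOCS = [
    ("colon", "colon polyp found in the colonoscopy screening"),
    ("colon", "colorectal tumor of the colon and rectum"),
    ("colon", "bowel resection for colon carcinoma"),
    ("colon", "colon adenoma with intestinal bleeding"),
    ("colon", "colonoscopy for colon cancer screening"),
    ("lung", "lung nodule seen in the chest scan"),
    ("lung", "smoking and lung carcinoma risk"),
    ("lung", "pulmonary lesion in the lung lobe"),
    ("lung", "lung adenocarcinoma with pleural effusion"),
    ("lung", "bronchial biopsy for lung cancer"),
    ("thyroid", "thyroid nodule with papillary features"),
    ("thyroid", "thyroid hormone and gland enlargement"),
    ("thyroid", "papillary thyroid carcinoma of the neck"),
    ("thyroid", "thyroidectomy for thyroid cancer"),
    ("thyroid", "iodine uptake in the thyroid gland"),
]
STOP_WORDS = {"the", "of", "and", "in", "a", "for", "with"}  # assumed list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen during fitting are ignored."""
    counts = Counter(tok for tok in tokens if tok in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes on TF-IDF weights with additive smoothing."""
    totals = {label: defaultdict(float) for label in set(labels)}
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            totals[label][term] += weight
    model = {}
    for label, term_weights in totals.items():
        denom = sum(term_weights.values()) + alpha * len(vocab)
        log_lik = {t: math.log((term_weights.get(t, 0.0) + alpha) / denom) for t in vocab}
        log_prior = math.log(labels.count(label) / len(labels))
        model[label] = (log_prior, log_lik)
    return model


def predict(model, vec):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(w * log_lik[t] for t, w in vec.items())
    return max(sorted(model), key=score)  # sorted() makes ties deterministic


def cross_validate(docs, k=5):
    """k-fold cross-validation; IDF is refit on each training split."""
    accuracies = []
    for fold in range(k):
        test = docs[fold::k]
        train = [doc for i, doc in enumerate(docs) if i % k != fold]
        train_tokens = [tokenize(text) for _, text in train]
        idf = fit_idf(train_tokens)
        vectors = [tfidf(tokens, idf) for tokens in train_tokens]
        labels = [label for label, _ in train]
        model = train_nb(vectors, labels, set(idf))
        hits = sum(predict(model, tfidf(tokenize(text), idf)) == label
                   for label, text in test)
        accuracies.append(hits / len(test))
    return accuracies


scores = cross_validate(DOCS, k=5)
print("fold accuracies:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Refitting the IDF weights inside each fold keeps vocabulary statistics from the held-out documents out of training. Because the toy corpus is ordered by class with five documents each, taking every fifth document gives each fold one document per class. A real run on the 7570-document dataset would need shuffling or explicit stratification instead.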