Predicting lung cancer risk based on artificial intelligence: Leveraging multifactorial inputs for early detection

Authors: Karaaslan, Erol; Güldoğan, Emek; Yağın, Fatma Hilal
Record dates: 2026-04-04; 2026-04-04
Year: 2025
ISSN: 2147-0634
DOI: https://doi.org/10.5455/medscience.2025.06.169
TR Dizin: https://search.trdizin.gov.tr/tr/yayin/detay/1369788
Handle: https://hdl.handle.net/11616/107291

Lung cancer remains the leading cause of cancer-related mortality worldwide, largely because most cases are detected at advanced stages. This study develops and validates multifactorial machine-learning models that integrate demographic, behavioural, psychological, symptom-based and comorbidity variables to identify individuals at high risk of lung cancer. An anonymised dataset of 13,000 subjects (74% lung-cancer positive), obtained from the public "Lung Cancer Patient Records" repository, was pre-processed through recoding, one-hot encoding and stratified train/test partitioning. To address class imbalance, the training subset was balanced with the Synthetic Minority Oversampling Technique (SMOTE). Three supervised algorithms (Logistic Regression, Random Forest and Extreme Gradient Boosting, XGBoost) were tuned via grid search with five-fold stratified cross-validation, optimising the area under the receiver-operating-characteristic curve (AUC). On the independent hold-out set, XGBoost achieved the best discrimination (AUC=0.93), sensitivity (0.95) and F1-score (0.93), followed closely by Random Forest (AUC=0.91). Univariate analyses confirmed significant associations (p<0.001) between lung cancer status and all candidate predictors, with the strongest effect sizes observed for yellow fingers, persistent cough, wheezing, fatigue and peer-pressure-related smoking. The findings demonstrate that incorporating easily elicited clinical symptoms and psychosocial factors alongside traditional risk markers markedly improves early-detection performance over age–smoking models alone.
Because all inputs are non-invasive and low-cost, the proposed model can be embedded in electronic-health-record decision support or mobile triage applications, particularly benefiting resource-limited settings. Future work will focus on external validation across diverse populations, temporal modelling of symptom trajectories and cost-effectiveness analyses to inform risk-tailored low-dose CT screening protocols.

Language: en
Access rights: info:eu-repo/semantics/openAccess
Subjects: Respiratory System; Oncology; Computer Science; Artificial Intelligence
Type: Article
Volume 14, Issue 4, pp. 975-982
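
The abstract states that only the training subset was balanced with SMOTE, leaving the hold-out set untouched. The record does not say which implementation the authors used, so the following is only a minimal standard-library sketch of the core SMOTE idea: each synthetic row is a random point on the segment between a minority-class row and one of its k nearest minority-class neighbours. The function names (`smote`, `balance_training_set`), the parameter defaults and the toy data are all hypothetical and are not taken from the study.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours.

    Simplified illustration only; not the authors' implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority-class rows")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority rows to `base` (Euclidean distance), excluding itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance_training_set(X_train, y_train, k=5, seed=0):
    """Oversample the rarer binary label (0 or 1) until both classes are equal in size."""
    rows = {0: [], 1: []}
    for x, y in zip(X_train, y_train):
        rows[y].append(x)
    minority_label = min(rows, key=lambda label: len(rows[label]))
    deficit = len(rows[1 - minority_label]) - len(rows[minority_label])
    new_rows = smote(rows[minority_label], deficit, k=k, seed=seed)
    # New lists are returned; the original training data is not modified.
    return X_train + new_rows, y_train + [minority_label] * len(new_rows)


# Hypothetical toy training partition: 3 negatives, 6 positives.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
           [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],
           [6.0, 6.0], [7.0, 5.0], [5.0, 7.0]]
y_train = [0, 0, 0, 1, 1, 1, 1, 1, 1]

X_bal, y_bal = balance_training_set(X_train, y_train, k=2)
print(y_bal.count(0), y_bal.count(1))
```

Applying the oversampling after the stratified split, and only to the training rows, keeps synthetic samples out of the evaluation data, which is consistent with the abstract's description of an independent hold-out set.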