Machine Learning Application on Prediction of Male Breast Cancer with PLCO Dataset
DOI:
https://doi.org/10.47611/jsrhs.v10i3.2199
Keywords:
Machine Learning, Male Breast Cancer, PLCO
Abstract
This paper examines how well machine learning models can predict male breast cancer (MBC) from the PLCO dataset. Men are often unaware that they can develop breast cancer, so they are unlikely to seek any assessment of their risk in advance. This study therefore uses the PLCO trial dataset from the National Cancer Institute, which includes features such as age, prostate status, and marital status, for detection. The main purpose of using PLCO data is to identify the risk of MBC as early as possible from information that is inexpensive and easy to collect. The rarity of MBC is what makes it a threat to men who are unaware of the danger. To find the most suitable models for detecting MBC from this non-traditional dataset, several existing models, including decision tree, random forest, DBSCAN, and One-Class SVM, among others, were fitted to the data. Because the dataset is extremely imbalanced, the models were evaluated using a combination of standard accuracy and the area under the receiver operating characteristic curve (AUROC). K-means and logistic regression performed best, with AUC scores of 0.62 and 0.67, respectively. The results suggest that further study is needed on more efficient approaches to common male breast cancer diagnosis or on more advanced models and algorithms.
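As a rough illustration of why accuracy alone is uninformative on extremely imbalanced data, the following minimal sketch compares a trivial "always negative" predictor with a hypothetical risk score on synthetic labels. The 1% positive rate, the Gaussian scores, and the helper names `accuracy` and `auroc` are assumptions made for this example only; they are not the PLCO data or the models evaluated in the paper.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)

# Synthetic labels: 20 positives out of 2000 (an assumed 1% rate,
# chosen for illustration; it is not the MBC prevalence in PLCO).
n_samples = 2000
y = [1 if i < 20 else 0 for i in range(n_samples)]

# Hypothetical risk scores: positives are shifted slightly upward.
scores = [rng.gauss(0.5 if t == 1 else 0.0, 1.0) for t in y]

# Trivial baseline: predict "no cancer" for everyone, with a constant score.
baseline_pred = [0] * n_samples
baseline_scores = [0.0] * n_samples

baseline_acc = accuracy(y, baseline_pred)   # high, despite finding no cases
baseline_auc = auroc(y, baseline_scores)    # 0.5, i.e. no ranking ability
model_auc = auroc(y, scores)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"baseline AUROC:    {baseline_auc:.3f}")
print(f"scored AUROC:      {model_auc:.3f}")
```

The baseline reaches 99% accuracy without detecting a single positive case, while its AUROC stays at 0.5. This is the kind of gap that motivates reporting AUROC alongside accuracy when positives are rare.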
Copyright (c) 2021 Juntao Li; Dr. Ganesh Mani
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.