Clustering breast cancer patients based on their course of treatment
Published: September 6, 2024
Introduction: Clinical cancer registries in Germany are responsible for collecting, processing, and analyzing data on cancer patients [1]. Analyzing data about the course of treatment has the potential to uncover factors that influence treatment success, thereby improving our understanding of the care and treatment of cancer patients. However, due to the complexity of treatment data, this potential is still largely untapped. One way to analyze this complex data is to use machine learning methods to find patterns in the data. In this paper, we describe a method for clustering breast cancer patients based on their treatments.
Methods: To cluster breast cancer patients based on their course of treatment, we need to assess how similar two patients are. We therefore developed a similarity measure that incorporates a patient's surgeries, radiotherapies, systemic therapies, and cancer diagnosis. For example, the similarity between two surgeries is determined by comparing their procedure codes, residual states, and complications. These treatment similarities determine the substitution costs in a modified Levenshtein distance, which takes the order of treatments into account.
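The idea of a Levenshtein distance with similarity-based substitution costs can be illustrated with a minimal sketch. It is not the authors' implementation: the `Treatment` fields, the codes `"BCS"`, `"MAST"`, and `"RT"`, the similarity weights, and the fixed insertion/deletion cost are all assumptions made for illustration, and the diagnosis component of the similarity measure is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    kind: str  # e.g. "surgery", "radiotherapy", "systemic" (illustrative)
    code: str  # hypothetical procedure or therapy code


def treatment_similarity(a: Treatment, b: Treatment) -> float:
    """Toy similarity in [0, 1]; the weights are assumptions, not from the paper."""
    if a.kind != b.kind:
        return 0.0
    return 1.0 if a.code == b.code else 0.5


def course_distance(xs, ys, indel_cost=1.0):
    """Levenshtein distance where substituting x by y costs 1 - similarity(x, y)."""
    # prev[j] holds the distance between the first i-1 items of xs and the first j of ys.
    prev = [j * indel_cost for j in range(len(ys) + 1)]
    for i, x in enumerate(xs, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(ys, start=1):
            substitute = prev[j - 1] + (1.0 - treatment_similarity(x, y))
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


bcs_then_rt = [Treatment("surgery", "BCS"), Treatment("radiotherapy", "RT")]
mast_then_rt = [Treatment("surgery", "MAST"), Treatment("radiotherapy", "RT")]
rt_then_bcs = list(reversed(bcs_then_rt))

print(course_distance(bcs_then_rt, mast_then_rt))  # similar courses, small distance
print(course_distance(bcs_then_rt, rt_then_bcs))   # same treatments, different order
```

Because the edit operations work on sequences, two courses with the same treatments in a different order get a non-zero distance, which is the property the order-aware measure relies on.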
From the similarity measure we derived a distance matrix, which was then reduced to two dimensions using UMAP [2], allowing the data to be visualized in a scatter plot and facilitating the subsequent clustering [3], [4].
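As a rough sketch of the first step, a symmetric distance matrix can be filled from any pairwise distance function. The function `toy_distance` and the example courses below are hypothetical stand-ins, not the measure used in this work.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def pairwise_distance_matrix(
    items: Sequence[T], distance: Callable[[T, T], float]
) -> list[list[float]]:
    """Build a symmetric matrix with a zero diagonal, computing each pair once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def toy_distance(a, b) -> float:
    """Stand-in distance: mismatching positions plus the difference in length."""
    return float(sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)))


courses = [
    ("surgery", "radiotherapy"),
    ("surgery", "systemic"),
    ("surgery",),
]
matrix = pairwise_distance_matrix(courses, toy_distance)
for row in matrix:
    print(row)
```

A matrix of this shape is what the umap-learn package accepts when `UMAP` is constructed with `metric="precomputed"`. For a dataset of 17,822 patients the matrix covers roughly 159 million distinct pairs, so a pure-Python double loop like this one would likely need to be replaced by a vectorized or parallel implementation.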
We applied our approach to a dataset of breast cancer patients diagnosed in 2019 from the cancer registry of North Rhine-Westphalia (n=17,822). We evaluated different clustering methods and hyperparameters based on their silhouette score (SSc) [5], the number of clusters found, and the number of outliers.
The SSc was used to select three clustering results with varying levels of detail for further investigation. These included a result with 12 clusters (SSc=0.68), a result with 53 clusters (SSc=0.66), and a result with 174 clusters (SSc=0.61).
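For readers unfamiliar with the silhouette score [5], the following sketch computes it from a precomputed distance matrix and a list of cluster labels. It is an illustrative implementation under simplifying assumptions: every point belongs to a cluster (outliers are not treated specially), and the four example points on a line are made up.

```python
from collections import defaultdict


def silhouette_score(dist, labels):
    """Mean silhouette value s(i) = (b - a) / max(a, b) over all points."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Four made-up points on a line, forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_score(dist, [0, 0, 1, 1])  # groups match the data
bad = silhouette_score(dist, [0, 1, 0, 1])   # groups cut across the data
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points.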
Results: Our approach successfully separates patients into different clusters based on their course of treatment. We found clusters with typical treatment courses, such as surgery followed by systemic therapy or radiotherapy. Depending on the granularity of the clustering, some clusters were subdivided into more detailed groups, for example by type of surgery into mastectomy and breast-conserving surgery.
Unexpected treatment courses were also discovered. For example, three clusters grouped patients without any recorded treatment (5,177 patients). Such a large number of untreated breast cancer patients is unlikely and indicates data quality issues. Another example is a group of patients with treatment recorded prior to diagnosis (34 patients), which could indicate incorrectly dated data.
Discussion and conclusion: We asked domain experts to evaluate our approach. The experts emphasized the hypothesis-generating qualities of the method, since discussing the results revealed new insights into the data. Examples are conspicuous or unexplained clusters, which may indicate data quality issues or prompt detailed discussions at quality conferences. Overall, the domain experts rated the clustering of treatment data as an interesting and relevant topic.
Currently, the method uses diagnosis-specific assumptions to calculate the similarity of attributes. In the future, we plan to analyze other diagnoses and to investigate whether the necessary similarity measures can be learned automatically using machine learning techniques.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Stegmaier C, Hentschel S, Hofstädter F, Katalinic A, Tillack A, Klinkhammer-Schalke M. Das Manual der Krebsregistrierung. München: W. Zuckschwerdt Verlag; 2019.
2. McInnes L, Healy J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction [Preprint]. arXiv. 2018. DOI: 10.48550/arXiv.1802.03426
3. Herrmann M, Kazempour D, Scheipl F, Kröger P. Enhancing cluster analysis via topological manifold learning. Data Mining and Knowledge Discovery. 2023;38:840–887. DOI: 10.1007/s10618-023-00980-2
4. Allaoui M, Kherfi ML, Cheriet A. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. Image and Signal Processing. 2020;12119:317–325. DOI: 10.1007/978-3-030-51935-3_34
5. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65. DOI: 10.1016/0377-0427(87)90125-7