
Cancer and Data: Exploring the Potential of Analytics in Oncology
4 June 2025
Cancer as the Second Leading Cause of Global Mortality
According to data from the World Health Organization (WHO), cancer was responsible for approximately 9.6 million deaths in 2018, accounting for about one in six deaths worldwide. The most prevalent cancer types in men are lung, prostate, colorectal, stomach, and liver cancer; in women, the most common are breast, colorectal, lung, cervical, and thyroid cancer.
Epidemiological projections are not encouraging: a recent report from the WHO, in collaboration with the International Agency for Research on Cancer (IARC), estimates that by 2050 there will be over 35 million new cases annually—a 77% increase compared to the 20 million reported in 2022. Among the main risk factors contributing to the rising incidence are smoking, harmful alcohol consumption, and obesity.
Cancer claims 10 million lives each year, with late-stage diagnosis being a major driver. Early detection significantly improves survival rates; however, current screening methods are often inaccessible or lack sufficient sensitivity to detect cancer at an early stage.
For example, mammography has an estimated false-negative rate of around 12%. Even with advanced technologies such as 3D digital mammography, cumulative false-positive rates over a decade have been reported to exceed 50%. This issue is even more critical in cancers of the liver, pancreas, or ovaries, where effective population-wide screening methods are still unavailable, resulting in late diagnoses with limited therapeutic options.
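To see how a cumulative false-positive rate above 50% can arise from a much smaller per-screen rate, the following minimal sketch works through the arithmetic. It assumes one screen per year for ten years and treats screens as statistically independent; both are simplifying assumptions that do not come from the studies mentioned above, and the function names are illustrative only.

```python
def cumulative_false_positive(per_screen_rate: float, n_screens: int) -> float:
    """Probability of at least one false positive across n independent screens."""
    return 1.0 - (1.0 - per_screen_rate) ** n_screens


def per_screen_rate_for(cumulative_rate: float, n_screens: int) -> float:
    """Invert the formula: per-screen rate that produces a given cumulative rate."""
    return 1.0 - (1.0 - cumulative_rate) ** (1.0 / n_screens)


if __name__ == "__main__":
    n_screens = 10  # assumption: one screen per year over a decade
    implied = per_screen_rate_for(0.50, n_screens)
    print(f"Per-screen rate implying 50% over {n_screens} screens: {implied:.1%}")
    print(f"Cumulative rate at 7% per screen: "
          f"{cumulative_false_positive(0.07, n_screens):.1%}")
```

Under these assumptions, a per-screen false-positive rate of just under 7% is enough to push the ten-year cumulative rate past 50%. Real screening results are correlated within a patient, so this should be read as an order-of-magnitude illustration rather than an estimate.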
Real-World Data in Cancer Research
In oncology, there is growing interest in the use of real-world data (RWD) as a complement to controlled clinical trials, with the aim of addressing clinical and public health questions that cannot be fully answered through the experimental paradigm alone.
RWD refers to information collected from heterogeneous healthcare sources—such as electronic medical records, administrative databases, or population-based cancer registries—and is typically organized around three core dimensions: patient baseline characteristics, treatment modalities administered, and observed clinical outcomes.
Key challenges in using RWD in oncology include the accurate identification of non-severe adverse events (e.g., peripheral neuropathy, persistent nausea, chronic fatigue) and the reliable attribution of causes of death.
Annual Projection of New Cancer Cases
Annual projections for the number of new cancer cases worldwide show a sustained increase, rising from 20 million cases in 2022 to more than 35 million by 2050. This projection reflects the growing global cancer burden and underscores the urgent need for scalable strategies in prevention, early detection, and treatment.
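As a rough check on the scale of this projection, the sketch below computes the total increase and the average annual growth implied by the two rounded endpoints. It assumes constant compound growth between 2022 and 2050, which is a simplification and not the WHO/IARC projection method; the function names are illustrative.

```python
CASES_2022 = 20e6       # rounded figure quoted in the text
CASES_2050 = 35e6       # rounded figure quoted in the text
YEARS = 2050 - 2022     # 28-year projection horizon


def total_increase(start: float, end: float) -> float:
    """Relative increase between the two endpoints."""
    return end / start - 1.0


def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Constant compound annual growth rate that links start to end."""
    return (end / start) ** (1.0 / years) - 1.0


def linear_annual_increment(start: float, end: float, years: int) -> float:
    """Average number of additional cases per year under linear growth."""
    return (end - start) / years


if __name__ == "__main__":
    print(f"Total increase: {total_increase(CASES_2022, CASES_2050):.0%}")
    print(f"Implied compound growth: "
          f"{annual_growth_rate(CASES_2022, CASES_2050, YEARS):.2%} per year")
    print(f"Linear equivalent: "
          f"{linear_annual_increment(CASES_2022, CASES_2050, YEARS):,.0f} cases per year")
```

With the rounded endpoints, the total increase comes out at 75% and the implied growth at about 2% per year. The 77% cited earlier presumably reflects the unrounded source estimates.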
Limitations of Current Detection Methods
The diagnostic sensitivity of mammography is lower in younger patients, those with dense breast tissue, or those with hormone receptor-negative tumors (i.e., lacking expression of estrogen or progesterone receptors). Additionally, fast-growing cancers may develop between routine screenings, escaping periodic detection.
A 2018 study by Khozin et al. used electronic health records from the Flatiron Health network to analyze treatment outcomes in patients with non-small cell lung cancer (NSCLC) treated with immune checkpoint inhibitors such as nivolumab and pembrolizumab. The analysis incorporated key prognostic variables—including smoking status, PD-L1 expression, EGFR mutations, and ALK rearrangements—that were previously unavailable in other RWD sources.
Since 2020, there has been a marked increase in the use of DICOM images (X-rays, MRIs, CT scans, and PET-CTs) to train machine learning (ML) and deep learning (DL) models. At the same time, the relevance of omics data—particularly genomic data—has grown, enabling molecular stratification and therapeutic personalization.
Variables Across Domains in Cancer-Related Datasets
Patient baseline characteristics: age; sex; date of diagnosis; tumor histology; tumor stage at diagnosis; pathological and molecular tumor details (margin status, extent of lymph node dissection, lymphovascular invasion, grade, molecular markers such as hormone receptors or mutations); socioeconomic status; functional status; comorbidities; organ function (e.g., renal function); height, weight, and/or body surface area; second primary cancer.
Treatments administered: disease extent at time of treatment; treatment intent; surgical procedure; date of surgical admission and/or discharge; radiotherapy dose, fractionation, volume, and technique; chemotherapy drugs, doses, and schedule; treatment start and end dates; specialist consultations.
Clinical outcomes: date of death; cause of death; diagnosis and dates of hospital admission and/or discharge; diagnosis and dates of emergency visits; patient-reported symptoms and quality of life; functional status; use of healthcare resources and/or costs.
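One way to picture how these three domains fit together in a single record is the sketch below. It is a hypothetical, heavily reduced schema: the class and field names are assumptions made for illustration and cover only a handful of the variables listed above, not a standard data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class BaselineCharacteristics:
    age_at_diagnosis: int
    sex: str
    diagnosis_date: date
    histology: str
    stage: str
    comorbidities: List[str] = field(default_factory=list)


@dataclass
class Treatment:
    intent: str      # e.g. "curative" or "palliative"
    modality: str    # e.g. "surgery", "radiotherapy", "chemotherapy"
    start_date: date
    end_date: Optional[date] = None


@dataclass
class Outcomes:
    death_date: Optional[date] = None
    cause_of_death: Optional[str] = None


@dataclass
class PatientRecord:
    patient_id: str  # assumed to be a de-identified key
    baseline: BaselineCharacteristics
    treatments: List[Treatment] = field(default_factory=list)
    outcomes: Outcomes = field(default_factory=Outcomes)

    def days_to_first_treatment(self) -> Optional[int]:
        """Days from diagnosis to the earliest recorded treatment, if any."""
        if not self.treatments:
            return None
        first = min(t.start_date for t in self.treatments)
        return (first - self.baseline.diagnosis_date).days


if __name__ == "__main__":
    record = PatientRecord(
        patient_id="demo-001",
        baseline=BaselineCharacteristics(61, "F", date(2024, 1, 10), "adenocarcinoma", "II"),
    )
    record.treatments.append(Treatment("curative", "surgery", date(2024, 2, 9)))
    print(record.days_to_first_treatment())
```

Keeping the three domains as separate types makes it explicit which variables describe the patient at baseline, which describe care delivered, and which describe what happened afterwards, so derived measures such as time to first treatment can be computed across domains.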
Current Applications in Cancer Detection and Monitoring
Breast cancer: Combining 2D digital mammography with tomosynthesis significantly improves detection rates, reduces repeat imaging, lowers radiation exposure, and optimizes resource use. In women with dense breast tissue, magnetic resonance imaging (MRI) offers greater sensitivity, although at the cost of higher false-positive rates.
Lung cancer: Low-dose computed tomography (LDCT), when performed annually, has shown superior capability in detecting early-stage cancer compared to traditional chest X-rays.
Thyroid cancer: FDG uptake in PET-CT scans has been correlated with tumor aggressiveness, allowing differentiation between well-differentiated and more aggressive variants.
Bone sarcomas: In cases of metastatic Ewing sarcoma, MRI has demonstrated 99% sensitivity in detecting lesions smaller than 7 mm, compared with 62% for PET-CT, which reinforces its value in therapeutic planning.
Bone metastases: Deep learning algorithms trained on bone scintigraphy images have matched the diagnostic performance of nuclear medicine physicians, improving both accuracy and efficiency in analyzing large volumes of studies.
Melanoma: Deep learning models trained on international databases of dermatoscopic images have achieved performance levels comparable to expert clinical judgment in identifying suspicious pigmented lesions.
Imaging, Clinical Data, and Genomics: Toward a Multimodal Approach
Current advances are focused on building multimodal datasets that integrate medical imaging, structured clinical records, and genomic data. This approach enables the development of more robust predictive tools, accelerates diagnosis, and personalizes treatments.
A notable example is the Federated Tumor Segmentation (FeTS) Challenge, where more than 30 centers collaborate to train brain tumor segmentation models without centralizing data. This federated learning approach is particularly relevant for regions with regulatory or infrastructure limitations, such as Latin America.
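The core idea of training without centralizing data can be illustrated with federated averaging, a generic aggregation scheme in which each center shares only model weights and its local sample count. The sketch below is a toy version under that assumption; it is not the aggregation code used in the FeTS Challenge, and the function name and example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Sample-weighted average of model weights.

    Each update is (weights, n_local_samples). Only these values leave a
    center; the underlying images and records stay on site.
    """
    if not updates:
        raise ValueError("at least one update is required")
    total_samples = sum(n for _, n in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


if __name__ == "__main__":
    # Two hypothetical centers with different amounts of local data.
    center_a = ([1.0, 2.0], 100)
    center_b = ([3.0, 4.0], 300)
    print(federated_average([center_a, center_b]))
```

In a full training loop, the averaged weights would be sent back to every center for another round of local training. Weighting by sample count means centers with more data have proportionally more influence on the shared model.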
Artificial Intelligence in Oncology Research and Development
AI applications in oncology extend beyond image diagnosis. Models are also used to:
🔹Predict therapeutic response based on molecular biomarkers.
🔹Correlate histopathological patterns with genomic and clinical data.
🔹Optimize clinical workflows and patient stratification in observational studies or clinical trials.
The use of real-world data (RWD) has proven to be a key tool for understanding complex tumor subtypes. For example, integrating longitudinal clinical data with molecular characteristics has improved stratification and treatment in subtypes such as HR+/HER2- or triple-negative breast cancer.
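As a simplified illustration of receptor-based stratification, the sketch below assigns a breast cancer subtype label from three receptor flags and counts patients per subtype. It assumes receptor status has already been dichotomized into positive or negative; clinical practice relies on assay thresholds and additional markers, so the function and labels here are illustrative only.

```python
from collections import Counter
from typing import Iterable, Tuple


def breast_cancer_subtype(er_positive: bool, pr_positive: bool, her2_positive: bool) -> str:
    """Map receptor flags to a coarse subtype label."""
    hr_positive = er_positive or pr_positive  # hormone receptor positive if either is expressed
    if hr_positive and not her2_positive:
        return "HR+/HER2-"
    if hr_positive and her2_positive:
        return "HR+/HER2+"
    if her2_positive:
        return "HR-/HER2+"
    return "triple-negative"


def stratify(cohort: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """Count patients per subtype in a cohort of (ER, PR, HER2) flags."""
    return Counter(breast_cancer_subtype(*flags) for flags in cohort)


if __name__ == "__main__":
    # Hypothetical cohort, not real patient data.
    cohort = [
        (True, True, False),
        (True, False, False),
        (False, False, False),
        (False, False, True),
    ]
    for subtype, count in stratify(cohort).items():
        print(subtype, count)
```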
In diseases like pancreatic cancer, recent studies have shown that combining histological images with PET-CT and circulating tumor DNA (ctDNA) data enables more comprehensive tumor profiling, opening new possibilities for early diagnosis and personalized medicine.
The Value of Integrated Data
At Cromodata, we understand that the true potential of AI in oncology depends on the quality, diversity, and traceability of available data. That’s why we work to facilitate access to structured, de-identified, and standardized information, helping scientific and technological teams build precise, scalable, and clinically relevant solutions.
Author: Valeria Analia Dávila.
References:
Booth, C. M., Karim, S., & Mackillop, W. J. (2019). Real-world data: towards achieving the achievable in cancer care. Nature Reviews Clinical Oncology, 16(5), 312-325.
Khozin, S., Abernethy, A. P., Nussbaum, N. C., Zhi, J., Curtis, M. D., Tucker, M., ... & Pazdur, R. (2018). Characteristics of real‐world metastatic non‐small cell lung cancer patients treated with nivolumab and pembrolizumab during the year following approval. The Oncologist, 23(3), 328-336.
Juliusson, G., Lazarevic, V., Hörstedt, A. S., Hagberg, O., & Höglund, M. (2012). Acute myeloid leukemia in the real world: why population-based registries are needed. Blood, 119(17), 3890-3899.