Description
Purpose: To systematically assess the impact of batch effects on differential expression of isomiR, mRNA, and small non-coding RNA (sncRNA) across 13 cancer types from The Cancer Genome Atlas (TCGA), and to develop a standardized, batch-controlled analysis protocol for TCGA datasets.
Methods: Data analysis and visualization were performed in Python v3.12, with plots generated using matplotlib and seaborn. The analysis used isomiR, mRNA, and sncRNA sequences from 13 TCGA cancer types. Principal Component Analysis (PCA) was performed with scikit-learn for dimensionality reduction. Differential expression analysis was conducted using DESeq2 under four protocols: 1) primary tumor samples vs. matched normal samples processed in the same batch, which served as the control; 2) all primary tumor samples vs. all normal samples, with batch correction; 3) primary tumor samples randomly down-sampled to the number of normal samples vs. all normal samples, without batch correction; and 4) primary tumor samples randomly down-sampled to the number of normal samples vs. all normal samples, with batch correction.
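The down-sampling step in protocols 3 and 4 and the PCA step could be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the `condition` column, the `normal`/`tumor` labels, and the log2(count + 1) transform are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def downsample_tumor(metadata: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep all normal samples and an equally sized random subset of tumor samples.

    Assumes a hypothetical 'condition' column with values 'normal' and 'tumor'.
    """
    normal = metadata[metadata["condition"] == "normal"]
    tumor = metadata[metadata["condition"] == "tumor"]
    n_keep = min(len(normal), len(tumor))
    # random_state makes the random subset reproducible
    tumor_subset = tumor.sample(n=n_keep, random_state=seed)
    return pd.concat([normal, tumor_subset])


def pca_coordinates(counts: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Project samples (rows) onto the leading principal components.

    Counts are log2(count + 1) transformed first; this transform is an
    assumption of the sketch, not a detail stated in the abstract.
    """
    log_counts = np.log2(counts + 1)
    coords = PCA(n_components=n_components).fit_transform(log_counts)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(coords, index=counts.index, columns=columns)
```

Coloring the resulting coordinates by batch and by sample type (for example with seaborn) is one way to inspect whether samples group by processing batch rather than by biology.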
Results: PCA revealed distinct clustering of normal and primary tumor samples across all 13 cancer types and all RNA molecule classes, indicating the presence of batch effects. Moreover, using differential expression results from normal and primary tumor samples processed within the same batch as the control, we found that batch effects both introduce false-positive differentially expressed molecules and obscure molecules that may be truly differentially expressed. After correcting for batch effects using DESeq2, we observed a greater overlap in differentially expressed molecules between the control and the batch-corrected analyses across all 13 cancer types. Additionally, we identified a set of non-overlapping differentially expressed molecules that appeared exclusively in the control analysis.
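The comparison against the same-batch control can be pictured as simple set arithmetic on the lists of differentially expressed molecules. The sketch below is illustrative only; the function names and the use of the Jaccard index as an overlap measure are assumptions, not details taken from the poster.

```python
def compare_to_control(control: set[str], test: set[str]) -> dict[str, set[str]]:
    """Split differentially expressed molecules by agreement with the control."""
    return {
        # called in both analyses
        "shared": control & test,
        # called only in the test analysis: candidate batch-driven false positives
        "test_only": test - control,
        # called only in the control: candidates masked by batch effects
        "control_only": control - test,
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets as |A ∩ B| / |A ∪ B|; two empty sets count as 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Under these assumptions, a higher Jaccard value for a batch-corrected protocol than for an uncorrected one would correspond to the greater overlap with the control described above.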
Conclusions: TCGA plays a critical role in guiding the development of molecular markers used in clinical research and translational oncology. However, because batch effects in TCGA datasets have not been comprehensively evaluated in the literature, researchers may be misled by batch-driven artifacts. We recommend implementing a batch-controlled protocol when analyzing TCGA datasets to minimize false positives and to identify truly differentially expressed molecules that may be masked by batch effects.
Publication Date
2-2-2026
Keywords
TCGA, mRNA, miRNA, batch effects
Disciplines
Genetic Structures | Medicine and Health Sciences | Oncology
Recommended Citation
Chen, Zhongxuan; Nersisyan, Stepan; Rigoutsos, Isidore; and Londin, Eric, "Systematic Evaluation of Batch Effects in The Cancer Genome Atlas Program Underscores the Need for Batch-Controlled Analysis" (2026). Alpha Omega Alpha Research Symposium Posters. 16.
https://jdc.jefferson.edu/aoa_research_symposium_posters/16


Comments
Presented at the 2026 AOA Research Symposium.