Big Data Pipelines: Curating a Surgical Data Set for Deep Learning
Document Type
Presentation
Publication Date
7-22-2025
Abstract
The MOVER Dataset was curated to support AI researchers in developing predictive models to improve surgical outcomes. The data is comprehensive, but at 4.5 TB, working with it poses logistical challenges. This paper outlines the process of building a custom data engineering tool in Python to address this challenge. The tool was used to systematically create a dataset suitable for deep learning, and the approach is feasible without the support of a large distributed cloud infrastructure. The resulting code was concise, applying 12 different pipelines in just 230 lines of code, and fast, processing 155 GB of data in 14 minutes on a desktop. This approach simplifies data wrangling; however, building the pipeline software was considerably more challenging than performing the transformations ad hoc.
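To illustrate the kind of approach the abstract describes, below is a minimal, hypothetical sketch of a composable pipeline that streams a large file through a sequence of per-chunk transformations rather than loading it into memory at once. The file names, column handling, and step functions are assumptions for illustration only and are not the author's actual code.

```python
# Hypothetical sketch: compose per-chunk transformations into a pipeline and
# stream a large CSV through them without loading the whole file into memory.
# File names and transformation steps are illustrative, not from the source.
from pathlib import Path
import pandas as pd


def normalize_column_names(chunk: pd.DataFrame) -> pd.DataFrame:
    """Lower-case and underscore column names for consistency."""
    chunk.columns = [c.strip().lower().replace(" ", "_") for c in chunk.columns]
    return chunk


def drop_incomplete_rows(chunk: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with missing values."""
    return chunk.dropna()


def run_pipeline(source: Path, destination: Path, steps, chunksize: int = 100_000) -> None:
    """Stream `source` through each step in `steps`, appending results to `destination`."""
    first_chunk = True
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for step in steps:
            chunk = step(chunk)
        chunk.to_csv(
            destination,
            mode="w" if first_chunk else "a",
            header=first_chunk,
            index=False,
        )
        first_chunk = False


if __name__ == "__main__":
    run_pipeline(
        Path("surgical_records.csv"),        # hypothetical input file
        Path("surgical_records_clean.csv"),  # hypothetical output file
        steps=[normalize_column_names, drop_incomplete_rows],
    )
```

Keeping each transformation as a small, single-purpose function is one way a dozen pipelines could remain readable within a few hundred lines while still processing data far larger than available RAM.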
Recommended Citation
Murphy, Sean P. II, "Big Data Pipelines: Curating a Surgical Data Set for Deep Learning" (2025). Master of Science in Health Data Science Capstone Presentations. Paper 9.
https://jdc.jefferson.edu/ms_hds/9
Language
English


Comments
Presentation: 20:42