A guide to providing high-quality, analysis-ready data to scientists while protecting patients' health information by Thomas Yu, Phil Snyder, Drew Duglan, and Solly Sieberts

Published on Oct 22, 2024




Introduction

In 2022, Sage Bionetworks began working on RECOVER, a program aimed at comprehensively understanding and mitigating the long-term impacts of COVID-19.

Sage serves as the Digital Health Data Repository (DHDR) Core for the program. The DHDR is tasked with processing and analyzing digital health data collected from consumer wearables, such as Fitbit and Apple Watch devices, worn by RECOVER study participants who join the RECOVER Digital Health Program.

The hope is that the digital health data can be integrated with other clinical information to find new biomarkers, diagnostics, and treatments for Long Covid patients.

The Digital Health team, in collaboration with the Data Processing & Engineering (DPE) and Governance teams, leads the effort of making this digital health data more accessible for our data consumers for downstream analysis. 

The Challenge:

“How do we provide high-quality, analysis-ready digital health data to our data consumers in a scalable and efficient way, while ensuring that it remains free of protected health information (PHI)?”


Our workflow transforms ingressed data into formats that can be readily queried and reused. These data then move through a de-identification pipeline, resulting in data summaries for researchers that are free of protected health information.


Processing Participant Data

As the DHDR Core, Sage receives daily exports of digital health data, in ZIP archives of NDJSON files, directly from the RECOVER Digital Health Platform (DHP). NDJSON, while versatile, poses challenges in query speed when dealing with a large number of records. Uncompressed, the data total 5TB across tens of billions of records, with information on over 6,000 participants. JSON is also not a format that statistical and computational analysts prefer to work with; they typically prefer tabular data, which is easily read into tables in analysis tools like R or Python. Our task is to turn data spread across hundreds of thousands of individual JSON files into analysis-ready tabular data. 

We’ve therefore chosen the Parquet format. Parquet's columnar storage layout compresses the data and enables better query performance: compressed, we are able to reduce the data size from 5TB to 150GB. Leveraging AWS Glue, our team has been able to efficiently transform the data from JSON to Parquet while ensuring all operations are conducted within a framework that meets the highest federal security standards (FISMA). 

What we’ve gained:

  • Cost Efficiency: The reduction in storage requirements, coupled with faster query times, lowers the overall cost of data analysis.

  • Scalability: AWS Glue's serverless architecture easily scales to accommodate the processing of millions of data points, ensuring that the solution remains viable as the dataset grows. 

  • Data Analysis Speed: With data stored in Parquet format, analytics and machine learning models can ingest and process information much more quickly, accelerating the time to insight.


Protecting Participant Identity

After the data is transformed to Parquet, a series of post-processing and data enrichment tasks are performed by the Digital Health team prior to delivering the data to consumers outside of Sage Bionetworks. 

First, we generate a Parquet dataset which is ready for use in internal and external analyses by processing through our de-identification pipeline. This removes fields containing direct participant identifiers and removes potentially identifiable information from free-text fields. At this stage, we also exclude any data flagged for removal through participant withdrawal. 

This results in a version of the raw data which is suitable for sharing for analysis. The resulting Parquet file can then be shared with analysts within the DHDR, as well as with RECOVER scientists.
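The de-identification steps above can be sketched in pandas. This is a minimal illustration, not our production pipeline: the column names, identifier list, and redaction patterns are all hypothetical stand-ins for the real RECOVER schema and rules.

```python
import re
from typing import Iterable, Set

import pandas as pd

# Hypothetical direct-identifier columns; the actual schema differs.
DIRECT_IDENTIFIERS = ["participant_name", "email", "device_serial"]

# Simple patterns for potentially identifying strings in free text.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def deidentify(df: pd.DataFrame, withdrawn_ids: Set[str],
               free_text_cols: Iterable[str] = ("notes",)) -> pd.DataFrame:
    """Drop identifiers, scrub free text, and honor participant withdrawal."""
    # Exclude data flagged for removal through participant withdrawal.
    df = df[~df["participant_id"].isin(withdrawn_ids)].copy()
    # Remove fields containing direct participant identifiers.
    df = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    # Redact potentially identifying substrings in free-text fields.
    for col in free_text_cols:
        if col in df.columns:
            df[col] = (df[col].astype(str)
                       .str.replace(EMAIL_RE, "[REDACTED]", regex=True)
                       .str.replace(PHONE_RE, "[REDACTED]", regex=True))
    return df
```

Real de-identification involves far more than regex scrubbing, which is why the output is also reviewed by the governance team before release.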

Summarizing the Data

We then generate summarized data for scientists who prefer working with data that has undergone basic pre-processing. The data summarization pipeline consists of multiple steps:

  1. QA/QC (Quality Assurance/Quality Control): This stage involves checks to ensure the quality and accuracy of the data.

  2. Standardization: This step ensures that the data conforms to a standard format or structure, facilitating consistency and compatibility across different wearables.

  3. Digital Biomarkers: This step involves identifying and extracting digital biomarkers (derived measures) from the standardized data, which can be used for various health-related analyses and insights.

  4. Summarization: Finally, the data are summarized to condense information for easier analysis. Chiefly, summarizing data across timescales (e.g. weekly) reduces some of the analysis burden for researchers who are not accustomed to working with the scale of data present from wearable devices.
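As a rough illustration of the final summarization step, the sketch below rolls record-level wearable data up to per-participant weekly summaries with pandas. The column names (`participant_id`, `timestamp`, `heart_rate`, `steps`) are hypothetical examples, not the actual RECOVER data model, and the real pipeline computes many more derived measures.

```python
import pandas as pd


def summarize_weekly(df: pd.DataFrame) -> pd.DataFrame:
    """Roll record-level wearable data up to weekly per-participant summaries.

    Assumes (hypothetical) columns: participant_id, timestamp,
    heart_rate, steps.
    """
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    weekly = (
        df.set_index("timestamp")
          .groupby("participant_id")
          .resample("W")  # weekly bins, labeled by week end
          .agg({"heart_rate": "mean", "steps": "sum"})
          .rename(columns={"heart_rate": "mean_heart_rate",
                           "steps": "total_steps"})
          .reset_index()
    )
    return weekly
```

Aggregates like these are far smaller than the minute-level source data, which is much of what makes them approachable for researchers.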

The final outputs, both the raw and summarized results, are vetted by the governance team to ensure that no PHI is leaked in any of the above deliverables.  

Leveraging the Data to Help Covid Patients

These data summaries are crucial for helping scientists to better understand how to treat, prevent and manage Long Covid, which still affects millions of people worldwide.

The summaries are exported to the RECOVER Data Resource Core (DRC) so that they can be integrated with clinical data from the thousands of participants who are part of the initiative.

Along with the summaries, the Parquet-formatted raw data are also shared in the RECOVER Researcher Gateway powered by BioData Catalyst (BDC), leveraging the GA4GH Data Repository Service (DRS) API to facilitate further analyses.

Using the DRS integration between platforms, we are able to share any datasets within Synapse and the DHDR in an efficient and secure (FISMA-compliant) manner.  
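For readers unfamiliar with DRS, the GA4GH spec defines a small HTTP API: a client fetches a `DrsObject` by ID, then follows one of its access methods to a concrete URL. A minimal sketch of that flow (the host name is a made-up example, and real deployments layer authentication on top):

```python
from typing import Optional

import requests


def resolve_drs_object(base_url: str, object_id: str,
                       token: Optional[str] = None) -> str:
    """Resolve a DRS object ID to a fetchable URL via the GA4GH DRS v1 API."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}

    # 1. Fetch the DrsObject metadata record.
    obj = requests.get(f"{base_url}/ga4gh/drs/v1/objects/{object_id}",
                       headers=headers, timeout=30)
    obj.raise_for_status()
    drs_object = obj.json()

    # 2. Follow an access method to a concrete URL.
    for method in drs_object.get("access_methods", []):
        if "access_url" in method:  # direct URL embedded in the object
            return method["access_url"]["url"]
        if "access_id" in method:   # indirect: ask the server for an AccessURL
            resp = requests.get(
                f"{base_url}/ga4gh/drs/v1/objects/{object_id}"
                f"/access/{method['access_id']}",
                headers=headers, timeout=30)
            resp.raise_for_status()
            return resp.json()["url"]
    raise ValueError(f"No usable access method for DRS object {object_id}")
```

Because the API is platform-agnostic, the same resolution logic works whether the bytes ultimately live in Synapse, BDC, or another FISMA-compliant store.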

The journey of these digital health data at Sage results in unique insights into the burden of disease from the patient’s perspective. The hope is that these data can inform the measurement of clinical trial outcomes as part of RECOVER, ultimately pushing forward new medicines for Long Covid patients. 


Do you need help coordinating your digital health data?

To join our digital health portfolio at Sage, please get in touch at partnerwithus@sagebase.org
