Get clarity on the validity of your files in a way that is simple, scalable, and lightning fast

By Brad Macdonald, Thomas Yu, and Drew Duglan

Published on Jun 17, 2024






In the realm of bioinformatics, the integrity and accuracy of your data are paramount. 

With vast amounts of biological data being generated and analyzed, the need for robust and reliable data validation frameworks is more acute than ever. While tools like JSON Schema Validator and Great Expectations serve well for structured data validation, and FastQC addresses quality control for FASTQ files, there has been a notable gap in comprehensive, opinionated data validation frameworks tailored to common bioinformatics file formats.

This is the problem that DCQC (Data Coordination Quality Control) solves.

What is DCQC?

Developed by the Data Processing and Engineering (DPE) team at Sage Bionetworks, DCQC is an extensible testing framework designed specifically for bioinformatics files hosted on our flagship data platform, Synapse. 

DCQC is composed of two main components, nf-dcqc and py-dcqc, which leverage the power of Nextflow and Python, respectively. 

The framework provides clear and authoritative opinions on the validity of files, ensuring that bioinformatics data not only meet general quality standards, but are also compliant with format-specific requirements.

What do we mean by validation?

Before we dive into some of the features of DCQC, it’s worth understanding what the term “validation” means in the context of the framework, since it encompasses several tiers, each with specific tests and requirements:

Tier 1: File Integrity

This foundational tier ensures that files are whole and accessible. Tests include MD5 checksum verification, format-specific checks like byte consistency, and decompression checks. These are essential as they underpin the reliability of subsequent validations.
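To make this concrete, here is a minimal sketch of what an MD5 checksum check at this tier boils down to, assuming the expected digest is supplied alongside the file's metadata; the function name and signature are illustrative, not the py-dcqc API:

```python
import hashlib
from pathlib import Path

def md5_checksum_passes(path: Path, expected_md5: str, chunk_size: int = 8192) -> bool:
    """Compare a file's MD5 digest against the expected value.

    `expected_md5` would come from the metadata submitted with the file;
    the name and signature here are illustrative, not the py-dcqc API.
    """
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```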

Tier 2: Internal Conformance

At this level, DCQC checks for internal consistency and compliance with the declared file format. This includes validating file formats using available tools, and ensuring that internal metadata conforms to expected schemas, such as OME-XML.
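As a rough illustration of this tier (not the actual py-dcqc implementation), the sketch below checks that a file declared as TIFF actually begins with a TIFF byte-order signature, and that extracted OME-XML metadata at least parses; full conformance would go on to validate against the OME schema itself:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# TIFF files begin with one of two byte-order signatures (little- or big-endian).
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")

def looks_like_tiff(path: Path) -> bool:
    """Confirm that a file declared as TIFF/OME-TIFF starts with a TIFF signature."""
    with path.open("rb") as handle:
        return handle.read(4) in TIFF_SIGNATURES

def ome_xml_is_well_formed(ome_xml: str) -> bool:
    """Check that extracted OME-XML metadata at least parses as XML.

    This is only the first, cheapest step; real conformance checking
    would validate the document against the OME schema.
    """
    try:
        ET.fromstring(ome_xml)
        return True
    except ET.ParseError:
        return False
```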

Tier 3: External Conformance

This tier verifies that file features align with externally submitted metadata. Objective tests at this stage include checking consistency in channel counts, file sizes, and adherence to standards in nomenclature and secondary file availability (e.g., ensuring a CRAI file exists for each CRAM file).
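Two such checks might look like the sketch below, assuming the expected size is supplied in the submitted metadata; the helper names are illustrative rather than the py-dcqc API:

```python
from pathlib import Path

def has_matching_crai(cram_path: Path) -> bool:
    """Check that a CRAM file is accompanied by its CRAI index,
    accepting either the `.cram.crai` or the `.crai` naming convention."""
    return (
        cram_path.with_suffix(cram_path.suffix + ".crai").exists()
        or cram_path.with_suffix(".crai").exists()
    )

def size_matches_metadata(path: Path, reported_size_bytes: int) -> bool:
    """Compare the size on disk against the size reported in the
    externally submitted metadata (the field name is illustrative)."""
    return path.stat().st_size == reported_size_bytes
```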

Tier 4: Subjective Conformance

The most nuanced tier involves evaluations based on qualitative criteria. From detecting sample swaps and scanning for protected health information (PHI) to flagging outliers in file metrics, Tier 4 tests often require expert review to interpret the results meaningfully.

Easily Add New Functionality to Your Quality Control Process

One of the key features of DCQC is that it's "extensible," meaning its design makes contributing new functionality simple and powerful. The result is that DCQC can be applied to a wide variety of use cases.

Each QC test is defined in a repeatable way in Python code. Tests are then organized into “suites” and automatically applied to their designated file type when the workflow runs. Our testing framework also makes it easy to use existing bioinformatics software tools to handle the business logic of tests. 

This means that the user contributing the test does not have to reinvent the wheel each time.
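To show what that pattern looks like in practice, here is a hypothetical sketch of tests organized into a suite, reusing the Tier 1 and Tier 3 helpers sketched above; the class names, method signatures, and suite registry are illustrative and not the actual py-dcqc API:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Type

class QcTest(ABC):
    """One self-contained QC check; subclasses supply the logic for a single test."""
    tier: int

    @abstractmethod
    def run(self, path: Path, metadata: dict) -> bool:
        """Return True if the file passes this test."""

class Md5Test(QcTest):
    tier = 1
    def run(self, path: Path, metadata: dict) -> bool:
        return md5_checksum_passes(path, metadata["expected_md5"])  # Tier 1 sketch above

class CraiExistsTest(QcTest):
    tier = 3
    def run(self, path: Path, metadata: dict) -> bool:
        return has_matching_crai(path)  # Tier 3 sketch above

# A "suite" maps a file type to the tests that apply to it; the workflow
# selects the right suite for each file and runs every test in it.
SUITES: Dict[str, List[Type[QcTest]]] = {
    "CRAM": [Md5Test, CraiExistsTest],
}

def run_suite(file_type: str, path: Path, metadata: dict) -> Dict[str, bool]:
    """Run every registered test for a file type and collect pass/fail results."""
    return {test.__name__: test().run(path, metadata) for test in SUITES[file_type]}
```

Because each test is a small, self-contained unit, adding support for a new check or file type amounts to writing one more test and registering it with the appropriate suite.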

Currently, supported file types include common bioinformatics formats such as BAM, FASTQ, TIFF, and OME-TIFF, in addition to a host of general-purpose structured formats such as CSV, TSV, and JSON. We are now working to add support for further file types and for multi-file QC for use cases such as paired FASTQ files.

Find your Needle in a Haystack in a Fraction of the Time

Another standout feature of DCQC is its scalability. Using Nextflow and the Seqera Platform, DCQC can validate tens of thousands of files in parallel, efficiently pinpointing issues even in massive datasets: a true "needle in a haystack" solution.

We have already used DCQC internally on several of our projects to assess the validity of OME-TIFF images at scale, saving countless hours and headaches for data contributors and analysts.

Why Choose DCQC?

Choosing DCQC for your data validation needs means opting for a framework that is scalable and reliable. DCQC provides actionable feedback on the quality of your dataset before any problems can be encountered by researchers or analysts. Whether you are managing large-scale genomic studies or detailed, format-specific data analyses, DCQC provides peace of mind by ensuring data integrity and reliability.

Are you ready to elevate your data quality?

For more information on DCQC, how you can integrate it into your data validation processes, and how you can help contribute new functionality to meet your needs, visit the GitHub repositories for nf-dcqc and py-dcqc. 

If you want to learn more about the work we do on Data Processing and Engineering, please get in touch at partnerwithus@sagebase.org.