Our Experiences with Cloud Computing Platforms to Aid Data Operations
By Thomas Yu and Drew Duglan
At Sage Bionetworks, our pivot to Snowflake as our internal data warehouse layer has transformed our data operations, resulting in cost savings and enhanced data democratization across the organization.
Identifying Highly Accessed Datasets
One of the most notable results of adopting Snowflake has been our ability to identify datasets that are not only downloaded frequently but are also large (terabytes in size). Before we integrated Snowflake, gaining insight into data usage patterns took considerable time, so data egress costs across the platform went largely unmonitored.
We can now pinpoint datasets with the highest data egress rates daily, allowing us to make informed decisions about transitioning these to AWS Open Data or adding other governance restrictions. This reduces our data storage and transfer costs while keeping the datasets open to the global research community.
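To make the daily ranking concrete, here is a minimal sketch in Python. It is not our production query, which would more likely run as SQL inside the warehouse. The record fields (`dataset_id`, `download_date`, `size_bytes`) and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def top_egress_by_day(download_records, top_n=3):
    """Rank datasets by total bytes downloaded on each day.

    Each record is a dict with hypothetical keys:
    'dataset_id', 'download_date', and 'size_bytes'.
    """
    # day -> dataset -> total bytes downloaded that day
    totals = defaultdict(lambda: defaultdict(int))
    for record in download_records:
        day = record["download_date"]
        totals[day][record["dataset_id"]] += record["size_bytes"]

    ranked = {}
    for day, per_dataset in totals.items():
        # Largest egress first; ties are broken by dataset id for stable output.
        ordered = sorted(per_dataset.items(), key=lambda item: (-item[1], item[0]))
        ranked[day] = ordered[:top_n]
    return ranked
```

Datasets that stay at the top of this ranking day after day are the natural candidates for a move to AWS Open Data or for additional governance restrictions.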
On top of being able to monitor egress costs, we are also able to query the number of users downloading data from Synapse portals. In Q1 of 2024 alone, the AD Knowledge Portal had over 1 PB of data downloaded by over 1,000 unique users! There is a lot more to discover, but this metric is an indication that we are achieving our mission: “To drive a new age of discovery through truly open science and radical collaboration.”
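A quarterly figure like the one above comes down to counting distinct users and summing bytes within a date window. The sketch below illustrates that logic in Python under assumed field names (`user_id`, `download_date`, `size_bytes`). It is not the actual schema or query behind the numbers quoted here.

```python
from datetime import date


def quarter_of(day):
    """Return the calendar quarter (1-4) that a date falls in."""
    return (day.month - 1) // 3 + 1


def summarize_quarter(download_records, year, quarter):
    """Count unique downloaders and total bytes for one calendar quarter.

    Each record is a dict with hypothetical keys:
    'user_id', 'download_date' (a datetime.date), and 'size_bytes'.
    """
    users = set()
    total_bytes = 0
    for record in download_records:
        day = record["download_date"]
        if day.year == year and quarter_of(day) == quarter:
            users.add(record["user_id"])
            total_bytes += record["size_bytes"]
    return {"unique_users": len(users), "total_bytes": total_bytes}


# Made-up sample records, for illustration only.
sample = [
    {"user_id": "u1", "download_date": date(2024, 1, 15), "size_bytes": 500},
    {"user_id": "u1", "download_date": date(2024, 3, 2), "size_bytes": 250},
    {"user_id": "u2", "download_date": date(2024, 4, 1), "size_bytes": 900},
]
q1_summary = summarize_quarter(sample, 2024, 1)
```

Counting users with a set matters here: one user who downloads many files still counts once.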
Democratizing Data
Snowflake's impact at Sage extends beyond cost savings. The platform has been instrumental in democratizing usage metrics across our projects, marking a departure from the traditional report factory model. Before Snowflake, accessing up-to-date usage metrics involved manually generating reports or creating an SQL database that captured six months' worth of data at the time of creation. This often led to delays in decision-making and metrics reporting.
With Snowflake, we've embraced a more dynamic and self-service approach to usage metrics management. Our teams can access metrics with just one click, giving immediate insights into data usage patterns. This shift has fostered a culture of data-driven decision-making, empowering each team member with the information they need to contribute effectively to our mission.
The Snowflake Marketplace also lets us share our datasets. Most recently, we listed the AACR Project GENIE Portal there. The listing currently contains only the dataset's metadata, but we learned it may be possible to build an application in Snowflake that provides the data-sharing features required to share the dataset itself along with the metadata.
Synapse’s Global Reach
Another advantage of integrating Snowflake into our data management ecosystem is the insight it gives us into Synapse’s global usage. Leveraging a publicly available Snowflake Marketplace dataset (IPinfo), we've been able to map Synapse usage on every continent except Antarctica.
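The underlying idea is a lookup from a request's IP address to a country, followed by a per-country count of users. The Python sketch below illustrates it with the standard `ipaddress` module. The range table is a made-up stand-in built from reserved documentation address blocks with arbitrary country labels. It does not reflect the real IPinfo schema, and the event fields (`ip`, `user_id`) are hypothetical.

```python
import ipaddress
from collections import defaultdict

# Hypothetical lookup table: reserved documentation ranges paired with
# arbitrary country codes. A real geolocation dataset is far larger.
SAMPLE_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
    (ipaddress.ip_network("203.0.113.0/24"), "JP"),
]


def country_for_ip(ip_text, ranges=SAMPLE_RANGES):
    """Return the country code for an IP address, or None if no range matches."""
    address = ipaddress.ip_address(ip_text)
    for network, country in ranges:
        if address in network:
            return country
    return None


def unique_users_by_country(access_events, ranges=SAMPLE_RANGES):
    """Count distinct users per country from events with 'ip' and 'user_id' keys."""
    users = defaultdict(set)
    for event in access_events:
        country = country_for_ip(event["ip"], ranges)
        if country is not None:
            users[country].add(event["user_id"])
    return {country: len(ids) for country, ids in users.items()}
```

The linear scan over ranges is fine for a toy table. At warehouse scale this would be a join against the geolocation dataset instead.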
Figure: Number of unique GENIE downloaders per month
In Q1 of 2024, AACR Project GENIE was viewed in over 100 different countries, and data was downloaded by over 250 unique users. This global perspective informs our strategic planning and underscores the appeal and applicability of Synapse in realizing our mission.
Final Thoughts
The strategic deployment of Snowflake as our data warehouse has been transformative for Sage Bionetworks. It has enabled us to manage our data more efficiently, realize significant cost savings, and provide a self-service view into critical usage metrics across our platform. The platform's global usage reflects our commitment to facilitating open, collaborative research.
The journey with Snowflake is just beginning, but its impact is already profound, setting a new standard for data management at Sage.