Sage Bionetworks' take on best practices for developing re-usable research software in open science

by Thomas Yu, Drew Duglan, Robert Allaway, Jineta Banerjee, Brad Macdonald, Bryan Fauble, Nick Grosenbacher, Jenny Medina, Thomas Schaffter, Jessica Britton, Sarah Chan, Milen Nikolov, Adam Taylor, and Sonia Carlson

Published on Feb 21, 2024. DOI 10.21428/4f83582b.eecd6505

As defined by the NIH, research software includes “source code files, algorithms, scripts, computational workflows, and executables that are created during the research process or for a research purpose.” As an organization committed to open science, we understand the importance of providing reproducible and reusable code, and of tracking the provenance of data, which is often linked to executed code. Most of the software we build at Sage Bionetworks (“Sage”) is accessible via our GitHub organization.

What existing standards or criteria do we use to evaluate the openness, FAIRness, quality, and/or security of the software we share or reuse?

Maintaining high-quality, open-source software takes considerable resources, but can have a major impact on the scientific community.

Research software falls into two broad categories: purpose-built scripts with a narrow scope of applications, and reusable software tools with a broader scope of applications. Purpose-built scripts may not need to meet all the criteria that go into building a robust, reusable research software tool, but there are instances where these scripts become (through need, desire, or accident) reused software tools. Retrofitting a script into a reusable tool can often be more difficult than starting from scratch, so it is good practice to follow the minimal requirements of being reusable and reproducible from the outset (https://github.com/paperswithcode/releasing-research-code, https://www.openmodelingfoundation.org/standards/reusability/#minimal-reusability-standards). The broader criteria for robust reusable research software tools include:

Addressing a Problem

All research software should be geared toward addressing a specific set of problems. Before implementation, the use cases and requirements of the software should be thoughtfully documented to ensure that they address the scope and potential usage of the tool.

Version Control

Effective version control, through platforms like GitHub or GitLab, facilitates collaboration and version tracking of the source code.  Developers can collaborate and provide constructive feedback through pull requests while tracking the evolution of the codebase. 

Documentation

Documentation is a crucial companion, offering insights into functionality, usage, and maintenance. The codebase must be established with an initial README.md, LICENSE, and CONTRIBUTING.md, which will facilitate managing expectations regarding usage, sharing, and development. Research software can demand high computing power and a specific environment to execute, so it is imperative to include that information so that others can execute the code during and after development.

Licensing and Attribution

We choose among several licenses depending on the use case. Most of our repositories use the Apache 2.0 license, but some projects use others, such as the BSD 3-Clause license. The BSD 3-Clause license requires redistributions to retain the copyright notice, which preserves attribution to Sage Bionetworks if the code is picked up and used in a commercial environment. For guidance on picking an appropriate license, we encourage people to use https://choosealicense.com/

Testing

Testing of code (unit, integration, etc.) is essential for validating functionality across operating systems and software versions, preventing regressions, and assisting with collaboration. Testing ensures that new code contributions do not change expected behavior elsewhere in the codebase. Further, testing and validation should ensure that the data used for testing the functionality of the software remains uncompromised and maintains its quality. Pairing testing with an automatic code formatter or linter helps the code follow language-specific style guidelines and naming conventions, enhancing code readability and consistency. Adopting robust continuous integration/continuous deployment (CI/CD) pipelines, using GitHub Actions or Jenkins, enables automatic testing, scanning, and building of your codebase.
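As an illustration of the kind of unit test described above, here is a minimal sketch using Python's built-in unittest module. The function normalize_sample_id and its behavior are hypothetical, invented only for this example; they are not part of any Sage codebase.

```python
import unittest


def normalize_sample_id(raw_id: str) -> str:
    """Hypothetical helper: trim whitespace and upper-case a sample identifier."""
    cleaned = raw_id.strip().upper()
    if not cleaned:
        raise ValueError("sample identifier must not be empty")
    return cleaned


class TestNormalizeSampleId(unittest.TestCase):
    """Pins down expected behavior so later contributions cannot silently change it."""

    def test_strips_and_uppercases(self):
        self.assertEqual(normalize_sample_id("  syn123 "), "SYN123")

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            normalize_sample_id("   ")
```

Saved in a file such as test_sample_id.py (an assumed name), the tests can be run locally or in a CI job with `python -m unittest`.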

Security

Assessing code coverage and conducting security scanning are vital components: coverage shows how much of the codebase the tests exercise, and scanning identifies potential vulnerabilities. A growing number of services can scan your code outside of your CI/CD pipeline. SonarCloud, for example, can be used to scan your codebase for code smells, security vulnerabilities, and common pitfalls. OWASP maintains a list of other source code analysis tools (https://owasp.org/www-community/Source_Code_Analysis_Tools). Packages can then be pushed to a tool registry like PyPI or CRAN, which will allow users to easily install your tool.

Maintenance

The re-usability of a software package also relies on the maintainers themselves: if the maintainers are inactive, the codebase goes unmaintained. A good example of why active maintenance is required is the scenario where older versions of programming languages are phased out due to security vulnerabilities. Maintainers should actively update their codebase to work with newer versions of the programming language. Maintenance of a codebase requires time and expense, either from a community or an individual. When maintainers of open-source software do not have the time or resources to devote to software upkeep, they should transparently deprecate or “archive” their software, while keeping the source code publicly available. By doing this, those who may reuse the software are better informed about the risks of using unmaintained software and the level of support they can expect to receive.
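Alongside archiving the repository, deprecation can also be signaled in the code itself. The sketch below uses Python's standard warnings module; the function legacy_summarize and the message text are hypothetical, chosen only to illustrate the pattern.

```python
import warnings


def legacy_summarize(values):
    """Hypothetical function kept only for backward compatibility."""
    warnings.warn(
        "legacy_summarize is deprecated and no longer maintained; "
        "see the project README for alternatives.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this function
    )
    return sum(values) / len(values)
```

Callers still get a result, but anyone running with warnings enabled sees that the function is unmaintained before they come to depend on it.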


Which factors influence our decision to either reuse open-source research software developed by others or develop anew? 

As stewards of open science, we are also constantly looking for open-source technologies to reuse in our day-to-day work. The first step in deciding to reuse open-source research software is to determine whether it solves the problem we are tackling. Other factors, such as the code reuse license, the date of the last code contribution, stars (favorites) on the code repository, number of contributors, ease of use as conveyed by the documentation, and level of security, also inform the decision to reuse existing research software. For example, when working with data that requires HIPAA compliance, it is essential to also work with software and technologies that are HIPAA eligible (e.g. https://aws.amazon.com/compliance/hipaa-compliance/).
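Some of the factors above can be captured as a simple checklist. The following is a rough sketch only: the Candidate record, the accepted-license set, and the thresholds are illustrative assumptions, not Sage policy, and factors such as stars and security posture are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    """Hypothetical record describing an open-source tool under evaluation."""
    name: str
    solves_problem: bool
    license_id: str
    last_commit: date
    contributors: int
    has_usage_docs: bool


# Illustrative assumptions; a real project would set its own values.
ACCEPTED_LICENSES = {"Apache-2.0", "BSD-3-Clause", "MIT"}
MAX_DAYS_SINCE_COMMIT = 365


def reuse_concerns(candidate: Candidate, today: date) -> list[str]:
    """Return human-readable concerns; an empty list means none were flagged."""
    concerns = []
    if not candidate.solves_problem:
        concerns.append("does not solve the problem at hand")
    if candidate.license_id not in ACCEPTED_LICENSES:
        concerns.append(f"license {candidate.license_id} needs review")
    if (today - candidate.last_commit).days > MAX_DAYS_SINCE_COMMIT:
        concerns.append("no recent code contributions")
    if candidate.contributors < 2:
        concerns.append("single maintainer")
    if not candidate.has_usage_docs:
        concerns.append("documentation is missing or thin")
    return concerns
```

A list of concerns, rather than a single score, keeps the final decision with the people doing the landscape survey.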

While we are strong advocates of open-source software, if, after a code landscape survey, we discover that existing open-source solutions do not address specific edge cases or solve our problem, we then look to create new solutions for the community. That said, the build-everything-yourself attitude is now less prevalent; it is instead more versatile to chain together existing technologies available to us and build the truly novel components to solve the problem at hand.

How can we support active research software communities to aid the development of best practices for the sharing and reuse of high-quality research software?

Developing high-quality research software requires time and patience. The scope and requirements of research software should be defined before it is built. Software design artifacts (such as requirements or use cases) should serve as the foundation of any research software development. We should encourage standards for documentation and docstrings, and the usage of tools like SonarCloud to discover and resolve severe security vulnerabilities, such as those in the OWASP Top 10, in research software. If those security vulnerabilities go undiscovered, vulnerable code proliferates unchecked, putting processes that deal with sensitive data at risk.
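To make the point about docstring standards concrete, here is a small sketch of a documented function. The function fraction_missing is hypothetical, and the Args/Returns/Raises layout is one common convention among several rather than a standard prescribed here.

```python
def fraction_missing(values: list) -> float:
    """Return the fraction of entries in ``values`` that are ``None``.

    Args:
        values: A non-empty list of measurements; ``None`` marks a missing entry.

    Returns:
        A float between 0.0 and 1.0.

    Raises:
        ValueError: If ``values`` is empty.

    Example:
        >>> fraction_missing([1.5, None, 2.0, None])
        0.5
    """
    if not values:
        raise ValueError("values must not be empty")
    # Each True counts as 1, so the sum is the number of missing entries.
    return sum(v is None for v in values) / len(values)
```

Because the example is written in doctest form, running `python -m doctest` on the file checks that the documentation still matches the behavior.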

In closing…

Sage Bionetworks is just one of the many organizations in our large ecosystem today that care about open-source research software. The best practices we’ve outlined above represent not only what we’ve learned through our own experiences, but also standards set by different organizations. Here are some other guidelines and standards that inspire us in our daily work.