A scientific approach to troubleshooting 504 gateway timeout errors for Schematic APIs by Lingling Peng
Published on Mar 08, 2024.
Out of all the issues and errors that I have encountered while working on Schematic and its deployment, the “504 gateway timeout” error is my least favorite. Around a year ago, when I was on the Tiger team working to deploy Schematic on Amazon Web Services (AWS), I struggled to understand what was happening behind the “504 gateway timeout”. I remember beginning my investigation by examining AWS dashboards and familiarizing myself with different AWS metrics. I took multiple screenshots to capture different states of Schematic and tried my very best to correlate the timestamp of the 504 error with various graphs while reading through AWS documentation. Despite my frenzied attempts, I never truly understood the underlying issues. Googling “504 gateway timeout on AWS” turned up many discussions and potential solutions, but I couldn’t find anything relevant that could help with Schematic.
Eventually, I accepted timeout errors as an inevitable part of deployment. It seemed like the application was being timed out by obscure settings, and that we should direct all our efforts toward improving function run time. However, improving the run time of the Schematic APIs is not a trivial task. Behind the scenes, Schematic is a Python library that interacts with various services. For example, to generate a manifest for a dataset, Schematic first has to generate a Google Sheet with the correct formatting based on the data model provided, then retrieve the existing manifest from Synapse.org as a data frame, and finally update the Google Sheet with that data frame. This process involves the Google APIs, the Synapse Python client library, and the Synapse APIs (through the Python client). To understand which endpoints were likely to time out, I started a giant Excel spreadsheet to benchmark the performance of various endpoints and experimented to see which ones would return a 504. Through the experiment, I noticed that the 504 errors were not consistent: sometimes an endpoint returned a 200, and sometimes the same endpoint returned a 504. For hours, I thought to myself: how could I ever fix the 504 issues when they cannot even be reproduced reliably?
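To give a concrete flavor of that benchmarking, here is a minimal sketch of the kind of timing loop involved. The base URL, endpoint paths, and repeat count below are placeholders rather than the real Schematic API routes, and my actual results lived in a spreadsheet rather than a CSV file.

```python
import csv
import time

import requests

# Hypothetical base URL and endpoint paths; the real Schematic API routes
# and their parameters differ.
BASE_URL = "http://localhost:3001/v1"
ENDPOINTS = [
    "/manifest/generate",
    "/storage/projects",
]

with open("endpoint_benchmark.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["endpoint", "run", "status_code", "seconds"])
    for endpoint in ENDPOINTS:
        for run in range(5):  # repeat each call, since the 504s were not consistent
            start = time.monotonic()
            response = requests.get(BASE_URL + endpoint, timeout=120)
            elapsed = time.monotonic() - start
            writer.writerow([endpoint, run, response.status_code, round(elapsed, 2)])
```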
As Nginx and uWSGI were added in front of the Schematic APIs to relay requests, I secretly thought this would add more “fuel to the fire”. Even though Nginx and uWSGI make the application more secure and scalable, each additional component brings another layer of complexity, and the extra parameters they introduce make it even harder to pinpoint the exact issue.
For a very long time, I only troubleshot the timeout issues as they appeared and hoped that, after some Googling, I would find a mysterious Nginx or uWSGI setting I could change to fix the problem. For the most part, this strategy worked: I could get away with speeding up some internal Schematic operation or with updating a parameter in Nginx or uWSGI. But had I found all the “culprits” setting the timeout limit? I was not so sure.
Recently, another one of our apps that uses the Schematic APIs ran into a timeout again when calling an endpoint with a list of Synapse projects. This time, I decided to change my approach and unravel the issue. The first question that crossed my mind was: is this a Schematic issue at all? Could it result from some latency on the front-end/client side? To answer that, I called the endpoint in Python with the same list of Synapse projects that the front-end application used and compared the latencies I saw with the latencies from the front-end app. The numbers looked similar, hinting that the majority of the latency indeed originated from Schematic. While looking at the latency of calling this endpoint for each project, two projects stood out: one took around 40 seconds to complete, while the other finished slightly below 60 seconds. This made me ponder: could the timeout be caused by calling multiple projects in a loop, or by specific projects? I removed the project ID with ~60 seconds of latency from the project list and re-ran the remaining list multiple times; each time, the endpoint returned 200 for all requests. To verify that this one project ID was causing the issue, I then called the endpoint in a loop for that project ID alone, and it returned 504 on two out of three runs. This experiment confirmed that only one project ID triggered the 504 issue. I also noticed that whenever the endpoint timed out, it returned the 504 after around 60 seconds.
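The isolation experiment boils down to two steps: time the endpoint per project to find the outlier, then repeatedly call the suspect project on its own. A rough sketch, with a placeholder URL, a made-up query parameter name, and fake Synapse project IDs:

```python
import time

import requests

# Placeholder values: the real endpoint path, query parameters, and
# Synapse project IDs are not reproduced here.
URL = "http://localhost:3001/v1/storage/projects/manifests"
PROJECT_IDS = ["syn11111111", "syn22222222", "syn33333333"]

# Step 1: time the endpoint for each project to spot outliers.
for project_id in PROJECT_IDS:
    start = time.monotonic()
    response = requests.get(URL, params={"project_id": project_id}, timeout=120)
    print(project_id, response.status_code, round(time.monotonic() - start, 1), "s")

# Step 2: call the single suspect project repeatedly to see how often it times out.
suspect = "syn33333333"
statuses = []
for _ in range(5):
    response = requests.get(URL, params={"project_id": suspect}, timeout=120)
    statuses.append(response.status_code)
print("suspect project results:", statuses)  # expect a mix of 200s and 504s
```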
Does this indicate that the Schematic APIs consistently time out whenever an operation’s run time exceeds 60 seconds? How could I verify this hypothesis? Running the same endpoint multiple times wouldn’t help, because the latency of an endpoint is neither consistent nor under my control. So I came up with another way to test my suspicion: I created two test endpoints that do nothing other than sleep, one for 59.9 seconds and one for 60 seconds. As expected, the endpoint that sleeps for 59.9 seconds never timed out, while the one that sleeps for 60 seconds always did. This experiment proved that, for some reason, the Schematic APIs have a maximum run time of 60 seconds.
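The two sleep endpoints are trivial to reproduce. Here is a minimal sketch using Flask (the Schematic APIs are served by a Python web app behind uWSGI and Nginx); the route names are made up for illustration:

```python
import time

from flask import Flask

app = Flask(__name__)

# Hypothetical debug-only routes: each handler just sleeps, so any timeout
# that appears can only come from the layers in front of the application.
@app.route("/debug/sleep/59-9")
def sleep_just_under_a_minute():
    time.sleep(59.9)
    return "finished after 59.9 seconds", 200

@app.route("/debug/sleep/60")
def sleep_a_full_minute():
    time.sleep(60)
    return "finished after 60 seconds", 200

if __name__ == "__main__":
    app.run(port=3001)
```

Because the handlers do nothing but sleep, any 504 they trigger must come from the infrastructure in front of the application rather than from the application logic itself.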
The next question was: what is responsible for the 60-second limit? Instead of conducting a broad Google search with generic terms, I narrowed it down and focused on Nginx parameters that could impose a 60-second timeout. Very quickly, I found a Server Fault post describing the same error I was seeing. When I looked at the official documentation, it began to make sense: uwsgi_read_timeout controls how long Nginx waits for a response from uWSGI, and if the uWSGI server does not transmit any data within that time frame, the connection is closed. I changed the default value of this parameter in the Nginx configuration files, and when I called the storage/projects/manifests endpoint again, it worked. I then built a new Docker image and ran the Schematic APIs in a local Docker container: the operation no longer timed out.
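For reference, this is roughly where the change lives in the Nginx configuration. The server name and uWSGI socket path below are placeholders, not the real deployment values:

```nginx
server {
    listen 80;
    server_name schematic.example.org;  # placeholder

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/run/uwsgi/schematic.sock;  # assumed socket path

        # Nginx closes the connection if uWSGI transmits nothing for this long.
        # The documented default is 60s, which matched the cutoff I was hitting.
        uwsgi_read_timeout 180s;
    }
}
```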
The next step was to deploy my fix to AWS and run a final round of testing. Even though Nginx no longer timed out locally, I still hit a timeout error on AWS, which led me to suspect that the AWS infrastructure was setting a 60-second limit somewhere. To confirm this suspicion, I reused the dummy endpoints I had created earlier. Just like before, the endpoint that sleeps for 59.9 seconds never timed out, while the one that sleeps for 60 seconds always did. This prompted me to ask: what on AWS enforces a 60-second timeout? Since I could now limit my search to whatever sets a “60-second timeout” for services running on AWS Fargate, I quickly found documentation explaining that the Application Load Balancer’s idle timeout defaults to 60 seconds. I manually raised the idle timeout to 180 seconds and retested: this time, the operation finally completed without any timeout errors.
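I made that change by hand, but the same adjustment can be scripted. A hedged sketch using boto3, with a placeholder load balancer ARN:

```python
import boto3

# Placeholder ARN; in practice this comes from the AWS console or from
# the elbv2 describe-load-balancers call.
LOAD_BALANCER_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "loadbalancer/app/schematic-alb/0123456789abcdef"
)

elbv2 = boto3.client("elbv2")

# Raise the Application Load Balancer's idle timeout from the 60-second
# default to 180 seconds so long-running requests are not cut off.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=LOAD_BALANCER_ARN,
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "180"}],
)
```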
Reflecting on this long journey of troubleshooting 504 errors, I attribute much of my success to the mindset of thoroughly understanding the root causes before diving into solutions. When a single symptom might have multiple underlying causes, it is essential to invest time in dissecting and isolating each component so that our hypotheses about the problem are accurate. Once the real issues are confirmed, the appropriate solution becomes much easier to identify.
I hope this article sheds some light on debugging on AWS. As I continue my journey of troubleshooting various errors, I deeply value the lessons from this experience: prioritize understanding an issue over rushing to a fix, and remember that a deeper understanding of the problem lays a solid foundation for a sustainable solution.