Request timeouts do not always indicate a problem in the database. Often, degradation is hidden in the path between the application and the DB.

The problem shows up when database metrics look stable but clients still hit timeouts. In the monitoring data this looks like a contradiction: request latency rises while database time stays flat. The reason is that user experience is shaped not by query execution time in the DB but by the total round-trip delay, which also includes connection pools, load balancers, proxies, and the network. Most database monitoring tools see only the internal execution of the query, so degradation outside the DB stays invisible.

The approach boils down to splitting latency into two parts: database time and everything else. It is convenient to view this as a parent/child model. The parent is the complete round trip, from sending the request to receiving the response. Its children are the execution time in the DB and the external overhead. The key question is which part dominates. If database time grows, it makes sense to optimize queries or scale the DB. If the external component grows, the problem lies in the infrastructure, and SQL optimization will not help. This is a simple but critical diagnostic split.
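The parent/child split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LatencySample` type, the `dominant_growth` helper, and the sample numbers are hypothetical and do not come from any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySample:
    """One observation of the parent span and its database child (hypothetical type)."""
    round_trip_ms: float  # full request time as seen by the application
    db_time_ms: float     # query execution time as reported by the database

    @property
    def overhead_ms(self) -> float:
        # Everything outside query execution: pool, proxy, balancer, network.
        # Clamped at zero because the two clocks are measured independently.
        return max(self.round_trip_ms - self.db_time_ms, 0.0)


def dominant_growth(before: LatencySample, after: LatencySample) -> str:
    """Report which child of the round trip grew the most between two samples."""
    db_delta = after.db_time_ms - before.db_time_ms
    overhead_delta = after.overhead_ms - before.overhead_ms
    if db_delta <= 0 and overhead_delta <= 0:
        return "none"
    return "database" if db_delta >= overhead_delta else "infrastructure"


# Illustrative numbers only: database time barely moves, the round trip does.
before = LatencySample(round_trip_ms=12.0, db_time_ms=10.0)
after = LatencySample(round_trip_ms=60.0, db_time_ms=11.0)
print(dominant_growth(before, after))
```

With these made-up samples, database time grows by 1 ms while the overhead grows by 47 ms, so the helper points at the infrastructure rather than at the queries.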

The implementation is based on correlating two data sources. APM captures the complete round trip, because it runs on the application side and measures the entire request path. Database Monitoring, on the other hand, takes metrics directly from the DB (for example, total_exec_time from pg_stat_statements in Postgres), which count only the execution time of the query and exclude data transfer to the client. The difference between the two measurements is the external overhead. In practice, it is convenient to track this as the ratio of round trip to database time. An increase in this ratio indicates degradation outside the DB. In the scenario discussed, the cause turned out to be PgBouncer: as the load increased, its single-threaded event loop hit the CPU limit, which added latency at the connection pool level.
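A minimal sketch of the ratio calculation follows. It assumes two cumulative snapshots of `calls` and `total_exec_time` (in milliseconds, as pg_stat_statements reports them) taken at the start and end of a window, plus a mean round-trip time for the same window exported from an APM tool. The function names, the dictionary layout, and the `growth_factor` threshold are hypothetical choices made for illustration.

```python
from typing import Mapping, Optional


def mean_exec_time_ms(prev: Mapping[str, float], curr: Mapping[str, float]) -> Optional[float]:
    """Mean per-call execution time between two cumulative snapshots.

    Each snapshot is assumed to hold 'calls' and 'total_exec_time' (ms).
    Returns None when no calls happened or the counters were reset.
    """
    calls = curr["calls"] - prev["calls"]
    if calls <= 0:
        return None
    return (curr["total_exec_time"] - prev["total_exec_time"]) / calls


def round_trip_ratio(apm_mean_ms: float, db_mean_ms: float) -> float:
    """Ratio of the application-side round trip to database execution time."""
    if db_mean_ms <= 0:
        raise ValueError("database time must be positive")
    return apm_mean_ms / db_mean_ms


def overhead_degraded(baseline_ratio: float, current_ratio: float,
                      growth_factor: float = 2.0) -> bool:
    """Flag degradation outside the DB when the ratio grows past an assumed threshold."""
    return current_ratio >= baseline_ratio * growth_factor


# Illustrative snapshots: 1000 calls taking 5000 ms in total during the window.
prev = {"calls": 1000, "total_exec_time": 5000.0}
curr = {"calls": 2000, "total_exec_time": 10000.0}
db_ms = mean_exec_time_ms(prev, curr)

baseline = round_trip_ratio(apm_mean_ms=6.0, db_mean_ms=db_ms)
current = round_trip_ratio(apm_mean_ms=30.0, db_mean_ms=db_ms)
print(overhead_degraded(baseline, current))
```

A ratio is used here instead of the raw difference so that one threshold can be applied to queries with very different execution times. The right threshold depends on the workload and would need to be tuned against real baselines.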

The result of this approach is shorter diagnosis time and fewer false hypotheses. Instead of working through optimizations inside the database, one can immediately localize the layer where the problem lives. In the given case, this made it possible to abandon query tuning and scale PgBouncer instead, by adding an instance and balancing load across instances. Specific improvement metrics are not provided, but shifting the focus from the DB to the infrastructure eliminated the source of the timeouts.
