Outgrowing Postgres: How to identify scale problems
Discover early warning signs that you’ve outgrown PostgreSQL and learn how to keep performance high. This introductory article offers diagnostic techniques and proactive strategies to help you scale and plan the future of your analytics without losing momentum.
When you launched your startup, PostgreSQL seemed like the perfect choice. It's open-source, robust, and comes with a comfortable and dev-friendly SQL dialect. Postgres is the Swiss Army knife of databases; it offers a lot of utility in a compact package, making it an attractive option for startups and established companies alike:
- Familiarity and Ease of Use: You probably used Postgres in CS-101. It's a foundational technology for anyone working with databases, and most developers have worked with it at some point in their career, so it's easy to adopt it for a new project.
- Versatility: From OLTP workloads to basic analytics, Postgres can handle a wide range of tasks with decent performance and cost-effectiveness.
- Rich Ecosystem: Postgres is surrounded by a large ecosystem of extensions and tools to deploy, manage, and extend its functionality.
- ACID Compliance: This is critical for many applications, especially those dealing with financial data, and Postgres has it covered.
- Cost-Effective: As an open-source tech, Postgres can be deployed "for free" without vendor lock-in, making it an economical choice for early stage companies.
Adding to the appeal is the proliferation of serverless and managed options that simplify getting started with Postgres. These solutions take care of the operational complexities of deploying the database and typically offer extra features on top of the DBMS, so developers can focus on building their applications. Some notable managed Postgres options include:
- Supabase: An open-source platform offering a PostgreSQL database with built-in authentication, instant APIs, and real-time subscriptions.
- Neon: A serverless Postgres service with modern features like branching and virtually limitless storage.
- Vercel Postgres: A serverless SQL database designed to integrate seamlessly with Vercel Functions and frontend frameworks.
- Nile: A serverless Postgres solution tailored for multi-tenant SaaS applications, offering built-in tenant virtualization and global deployment capabilities.
When you're building and scaling an early stage startup, choosing a managed service for Postgres makes a lot of sense. You get the database without the complexity, so you can focus on startup things like building a product, nailing your positioning, and scaling revenue.
With all these advantages, it's no surprise that engineers often choose Postgres as their go-to database for everything, including analytics. It seems like the perfect solution: one database to rule them all, handling both your transactional and analytical needs.
However, as you scale and your data grows, you may start to notice some cracks in this one-size-fits-all approach. What began as a smooth, efficient system starts to show signs of strain. Your once-snappy queries take longer to execute. Your users complain about slow-loading dashboards (or worse, they silently churn). Your engineering team spends more time optimizing backend queries and less time building new user-facing features.
These growing pains tell you something important: you're beginning to push Postgres to its limits, especially when it comes to data volume, read performance, and concurrency. They also tell you that you're finding product-market fit, which is great! Still, you need to be aware of the early signs of Postgres scaling problems. Let's talk through them.
The early signs of Postgres scale problems
Below are some of the signs that your Postgres cluster is starting to struggle to handle your application load, with tips for how to identify them.
Slow query performance
Increasing query execution times are a harbinger of Postgres scaling problems; they're one of the first and most obvious signs that your database is struggling to keep up. Queries that once returned results in milliseconds might take one or two orders of magnitude longer. This slowdown may not be immediately apparent at the user level - an increase from 10ms to 100ms is likely to ruffle feathers - but it's better to identify this early and get ahead of it.
This slowdown can be particularly noticeable in complex analytical queries or those involving large datasets. In these cases, your users will feel it.
To identify slow queries, you can use PostgreSQL's built-in logging features. Set log_duration_min_statement
to a reasonable threshold (e.g., 500ms) to log all queries that exceed this duration:
Regularly reviewing these logs (or monitoring them in Grafana) can help you identify problematic queries before they significantly impact your application's performance.
Increasing I/O wait times
As your Postgres instance grows, you might notice an increase in I/O wait times. This occurs when your database server spends more time waiting for disk operations to complete. High I/O wait times can be a sign that your storage system is struggling to keep up with the volume of data being read and written.
You can monitor I/O wait times using tools like iostat
on Linux or the pg_stat_database
view in PostgreSQL:
If you notice a consistent increase in `blks_read` relative to `blks_hit`, it might indicate that your database is struggling to keep frequently accessed data in memory.
Rising CPU utilization
As your database handles more concurrent connections and increasingly complex queries, you might see an increase in CPU usage. While some increase is expected as your workload grows, a sudden or dramatic spike could indicate that your queries are becoming more resource-intensive or that your index strategy needs to be optimized.
Monitor your CPU usage regularly using system tools like top
or htop
. For a Postgres-specific approach, you can query the pg_stat_activity
view:
This query will show you how many connections are in each state (active
, idle
, etc.), giving you insight into your database's workload.
Increased lock contention
With an increase in write operations, you might start seeing more lock contention. This occurs when multiple transactions are trying to modify the same data simultaneously. Excessive lock contention can lead to query timeouts and reduced throughput.
You can monitor lock contention using the pg_locks
view:
This query will show you which processes are blocking others so you can identify potential bottlenecks in your application.
Increasing vacuum frequency
As your database grows, you might notice that VACUUM
operations run more frequently or take longer to complete. This happens because Postgres needs to clean up more dead tuples as your data volume increases.
You can monitor VACUUM
operations using the pg_stat_user_tables
view:
If you see tables with many dead tuples or infrequent VACUUM
operations, you should consider adjusting your aut-vacuum settings or implementing more aggressive cleanup strategies.
How Postgres scale problems impact your app and users
Increased response times
As database performance begins to degrade, your requests made by your application to the database will take longer. Loading states are no friend to a good user experience, no matter the use case.
To measure the impact of database load, monitor your application's response times using tools like New Relic or Datadog. Pay special attention to endpoints that interact heavily with the database.
Higher error rates
Performance degradation can lead to increased error rates in your application. Queries might time out, connections could be dropped, and transactions might fail due to lock timeouts.
Monitor your application logs for database-related errors and set up alerts for sudden spikes in error rates.
Reduced throughput
As your database struggles to keep up with the incoming requests, you might notice a reduction in overall throughput. This means your application can handle fewer requests per second, potentially leading to backlogs and delayed processing.
Use load testing tools like Apache JMeter, Gatling, or Locust to measure your application's throughput under various loads and compare it to historical data.
Degraded user experience
The cumulative effect of these performance issues is a degraded user experience. Users might experience slow page loads, delayed data updates, false data, or application timeouts. This can lead to frustration, support requests, and churn.
Regularly gather user feedback and monitor user behavior metrics to understand how database performance is affecting your end users.
Proactive strategies for scaling Postgres
Put simply: you need to recognize Postgres scaling problems early by staying attuned to the warning signs. But don't wait until you have a problem to develop a strategy to fix it. Instead, put some proactive scaling strategies in place.
Here are some approaches to consider:
Regular performance audits
Schedule regular performance audits of your database. This involves reviewing query performance, analyzing index usage, and identifying resource bottlenecks. Tools like pg_stat_statements
can support these audits:
This query will show you the most time-consuming queries in your database, helping you focus your optimization efforts.
Use a cache
As your data grows, you need to implement technical strategies to minimize I/O operations on the database so that rising data volumes have less of an impact on app response times. You can use the built-in Postgres query cache, application-level caching, or a distributed key-value store like Redis.
For example, Postgres's materialized views can be used to cache the results of complex queries:
Of course, there are no free lunches; materialized views can reduce your latency and scan sizes, but they reduce the freshness of your data. You’ll need processes to periodically invalidate your cache and refresh materialized views depending on how fresh you need your data to be.
Optimize your schema and queries
Regularly review and optimize your database schema and queries. You can denormalize data, create appropriate indexes, or rewrite inefficient queries following these rules for improving SQL query performance.
Use EXPLAIN ANALYZE
in Postgres to understand query execution plans and identify opportunities for optimization:
Pool connections
High-traffic applications can quickly exhaust Postgres' available connections, leading to performance bottlenecks and potential downtime. Connection pooling offers a solution by efficiently managing and reusing database connections. Tools like PgBouncer can significantly reduce connection overhead, allowing your application to handle many concurrent users without overwhelming the database.
Partition data
For large tables, consider partitioning the data. This can significantly improve query performance and make maintenance operations more manageable.
Plan for vertical and horizontal scaling
Have a clear plan for both vertical scaling (increasing resources on your existing server) and horizontal scaling (distributing your data across multiple servers). This might involve upgrading your hardware, implementing read replicas, sharding, or moving to a distributed database solution.
Monitor and alert
Implement comprehensive monitoring and alerting for your database. This should cover not just the database itself, but also the underlying infrastructure. Tools like Prometheus and Grafana can be very effective for this purpose.
Set up alerts for key metrics like CPU usage, disk I/O, and query performance. Here's an example of how you might set up an alert in Prometheus for high CPU usage:
Load test regularly
Conduct regular load tests to understand how your database performs under stress. This can help you identify bottlenecks before they impact your production environment.
Use tools like pgbench
, which comes with Postgres, to simulate different workloads:
This command initializes a test database and then runs a benchmark with 10 clients for 60 seconds.
Looking ahead: What to do when you outgrow Postgres
Scale is a good problem to have. It means you're growing. But you want to be able to continue to do the things that helped you achieve scale: focusing on use cases, building product, and releasing new features. By monitoring key metrics, understanding their impact on your application and users, and implementing proactive scaling strategies, you can stay ahead of potential issues and ensure your database meets your growing needs. As your data volumes increase and your user base expands, you'll encounter more complex issues, particularly when it comes to analytics workloads and managing large-scale data.
In the upcoming articles of this series, we'll go deeper on specific Postgres scaling challenges and their various solutions, for example:
- How to handle growing data volumes
- How to handle increased user concurrency
- How to run OLAP workloads on Postgres
- When to move OLAP workloads off Postgres
- Advanced tuning techniques for squeezing more out of your Postgres
- How to evaluate the right OLAP solution for analytics
- How to optimize and integrate an OLTP + OLAP stack
Stay tuned for more.
Need to move your analytics off Postgres?
Tinybird is data infrastructure for software teams. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more.
You can try Tinybird for free with no time limit. Sign up here.