Scaling Horizontally on PostgreSQL: Citus’s Impact on Database Architecture
What exactly is Citus? Put simply, Citus is an extension for PostgreSQL that allows you to distribute your data and queries across multiple nodes, thus enabling horizontal scaling. This means you can spread your database workload across a cluster of machines rather than relying on a single server.
Why Do I Need to Use Citus?
- Fast Queries for All Tenants: Citus facilitates lightning-fast queries by distributing data across multiple nodes, ensuring that each tenant gets the performance they deserve.
- Sharding Logic in the Database, Not the Application: With Citus, you can implement sharding directly within your database, saving you from the complexities of implementing sharding logic in your application code.
- Hold More Data Than Possible in a Single-Node PostgreSQL: As your data grows, Citus allows you to scale out seamlessly, accommodating large volumes of data that would overwhelm a single PostgreSQL node.
- Scale Out Without Giving Up SQL: Citus enables you to scale out your PostgreSQL database without sacrificing the familiar SQL interface, making it easy for your team to adapt and scale as needed.
- Maintain Performance Under High Concurrency: Even under heavy loads and high concurrency, Citus ensures consistent performance, keeping your applications responsive and reliable.
- Fast Metrics Analysis Across Customer Base: Citus excels in analyzing metrics across your entire customer base, providing valuable insights quickly and efficiently.
When Not to Use Citus?
- Workload Expectations: If you don’t anticipate your workload ever growing beyond a single PostgreSQL node, Citus may be overkill for your needs.
- Offline Analytics: For scenarios where real-time data ingestion and queries aren’t essential, you might not require the capabilities that Citus offers.
- Low-Concurrency Analytics Apps: If your analytics applications don’t need to support a large number of concurrent users, simpler solutions may suffice.
- Data-Heavy ETL Results: If your queries primarily return data-heavy ETL results rather than summaries, Citus may not provide significant benefits.
Before Starting Citus: Key Concepts to Know
Sharding
- Row-Based Sharding: Distributing data at the row level across multiple nodes for improved performance and scalability.
- Schema-Based Sharding: Partitioning data based on schema definitions to achieve optimal distribution and query performance.
Cluster Roles
- Coordinator Node: The coordinator node serves as the central orchestrator in a Citus cluster, responsible for query planning, distribution, and coordination among worker nodes. Understanding the role of the coordinator node is essential for managing cluster operations, optimizing query performance, and troubleshooting issues related to query routing and distribution.
- Worker Node: Worker nodes in a Citus cluster store and process data shards, executing query fragments in parallel to achieve high-performance distributed query processing. Knowledge of worker nodes’ responsibilities is vital for optimizing data distribution, ensuring balanced workload across nodes, and diagnosing performance bottlenecks within the cluster.
High Availability Concepts
- Patroni (High Availability): A tool for managing PostgreSQL high availability, ensuring database resilience and failover protection.
- Consul/Etcd: Service discovery and configuration management tools used to coordinate and manage the Citus cluster.
Familiarity with these key terms provides a solid foundation for effectively deploying and managing Citus clusters. By understanding the underlying principles of Citus’ architecture, you can design scalable and performant database solutions tailored to your application’s requirements. Moreover, knowing these terms enables you to troubleshoot issues, optimize cluster performance, and make informed decisions about cluster configuration and scaling strategies. In essence, mastering these concepts empowers you to harness the full potential of Citus and unlock new levels of scalability and efficiency in your PostgreSQL deployments.
Citus Installation and Configuration
Before Installation: Considerations and Limits
Before diving into the provisioning and installation of Citus, it’s crucial to be mindful of certain limitations and considerations that impact the setup process.
- Single Database Management with Patroni: In a Citus cluster, Patroni is utilized for high availability, but it’s important to note that each Citus cluster can only manage a single database with Patroni. Therefore, attempting to manage multiple databases within a single Citus cluster alongside Patroni is not feasible.
- SSL Implementation for Node-to-Node Communication: By default, Citus utilizes SSL for node-to-node communication. This means that SSL implementation must be completed in PostgreSQL environments to ensure secure communication within the Citus cluster. Prior to installation, ensure that SSL configurations are properly set up and functioning within your PostgreSQL environment.
Provisioning and Installation Tools
Constructing a Citus cluster involves leveraging a combination of tools and platforms to facilitate the deployment and management process. Here are the key components and platforms we’ll be utilizing:
- PostgreSQL: At the core of the Citus cluster lies PostgreSQL, the powerful open-source relational database management system. Citus extends PostgreSQL’s capabilities by enabling horizontal scalability and distributed query processing across multiple nodes.
- Patroni: Patroni plays a pivotal role in ensuring high availability within the Citus cluster. It provides automated failover and resilience mechanisms, allowing for continuous availability of the database cluster. With Patroni, Citus clusters can recover from node failures and maintain data integrity without manual intervention.
- ETCD or Consul: ETCD or Consul serves as the service discovery and configuration management tool within the Citus cluster. These tools facilitate communication between cluster nodes, dynamic configuration updates, and coordination of cluster operations. ETCD or Consul ensures consistency and reliability in distributed environments, enhancing the scalability and resilience of the Citus deployment.
- Citus: Finally, Citus itself is the distributed database extension for PostgreSQL that enables horizontal scaling and distributed query processing. With Citus, you can distribute your data across multiple nodes and parallelize queries to achieve high performance and scalability. By leveraging Citus, you can seamlessly scale your PostgreSQL database to handle large datasets and high concurrency workloads.
By utilizing these tools and platforms in conjunction with PostgreSQL, you can construct a robust and scalable Citus cluster that meets the demands of your application’s workload. With proper provisioning and installation procedures in place, you can harness the full potential of Citus to unlock new levels of performance and scalability in your PostgreSQL deployments.
Deploying a Citus cluster doesn’t have to be daunting. With a clear understanding of the roles and a step-by-step approach, you can have your Citus cluster up and running smoothly. In this guide, we’ll walk through the installation process for both worker and coordinator nodes to streamline your Citus setup.
Installation of Citus
Install Patroni Packages on Every Node
Patroni is a crucial component for achieving high availability and automatic failover in PostgreSQL clusters, including Citus. To install Patroni packages, follow these steps:
apt-get update -y && apt-get install sudo vim curl wget -y
apt-get install python3 -y
apt-get install python3-pip -y
apt-get install python3-psycopg2 -y
pip3 install patroni[consul]
Citus extends PostgreSQL with distributed database capabilities, allowing you to scale out your database horizontally. Here’s how to install Citus on each node:
curl https://install.citusdata.com/community/deb.sh | sudo bash
cd /etc/apt/keyrings
mv citusdata_community-archive-keyring.gpg citusdata_community.gpg
cp citusdata_community.gpg /etc/apt/trusted.gpg.d/citusdata_community.gpg
chmod 644 /etc/apt/trusted.gpg.d/citusdata_community.gpg
vi /etc/apt/sources.list.d/citusdata_community
deb [signed-by=/etc/apt/trusted.gpg.d/citusdata_community.gpg] https://repos.citusdata.com/community/ubuntu/ focal main
deb-src [signed-by=/etc/apt/trusted.gpg.d/citusdata_community.gpg] https://repos.citusdata.com/community/ubuntu/ focal main
sudo apt-get -y install postgresql-15-citus-11.3
Configuring Patroni to manage your Citus and PostgreSQL cluster is a crucial step towards ensuring high availability and automatic failover. By leveraging the patroni.yml configuration file, you can define the settings for each node in your cluster and seamlessly manage the cluster’s operation. Let’s dive into how you can configure patroni.yml to effectively manage your Citus and PostgreSQL deployment.
Configuring patroni.yml for Citus and PostgreSQL
To begin, open the patroni.yml configuration file on each node in your Citus and PostgreSQL cluster. This file contains the settings for Patroni, including parameters specific to Citus integration.
# patroni.yml
citus:
group: X # 0 for coordinator and 1, 2, 3, etc. for workers
database: citus # must be the same on all nodes
In the patroni.yml file, the citus section allows you to specify the Citus-specific configuration options:
group: This parameter indicates the role of each node in the Citus cluster. Set it to 0 for the coordinator node and 1, 2, 3, etc., for worker nodes. Ensure that each node is assigned a unique group number corresponding to its role in the Citus cluster.
database: Specify the name of the Citus database. This must be the same across all nodes in the Citus cluster to ensure proper coordination and communication between nodes.
Finally, we’re starting the patroni services.
sudo systemctl start patroni
+ Citus cluster: distributed-cluster-1 ----------+--------------+-----------+----+-----------+
| Group | Member | Host | Role | State | TL | Lag in MB |
+-------+---------+---------------+--------------+-----------+----+-----------+
| 0 | coordinator1 | 10.0.0.1 | Sync Standby | streaming | 1 | 0 |
| 0 | coordinator2 | 10.0.0.2 | Leader | running | 1 | |
| 0 | coordinator3 | 10.0.0.3 | Replica | streaming | 1 | 0 |
| 1 | worker1-1 | 10.0.0.4 | Leader | running | 1 | |
| 1 | worker1-2 | 10.0.0.5 | Sync Standby | streaming | 1 | 0 |
| 2 | worker2-1 | 10.0.0.6 | Leader | running | 1 | |
| 2 | worker2-2 | 10.0.0.7 | Sync Standby | streaming | 1 | 0 |
| 3 | worker3-1 | 10.0.0.8 | Sync Standby | streaming | 1 | 0 |
| 3 | worker3-2 | 10.0.0.9 | Leader | running | 1 | |
+-------+---------+---------------+--------------+-----------+----+-----------+
To sum up, to install Citus, begin by installing Patroni, PostgreSQL, and Citus on each node. Configure Patroni on coordinator nodes with Citus group 0 and on worker nodes with groups 1, 2, 3, etc. Ensure Consul is installed for service coordination. Start the Patroni service on all nodes. These steps provide a basic setup, but refer to the official documentation for complete guidance and customization options.
Citus vs Traditional PostgreSQL Benchmarks
Database Size
When comparing database sizes between Citus with two shards and a traditional PostgreSQL instance, a clear pattern emerges favoring Citus’s distributed architecture. Utilizing a straightforward query to retrieve database sizes, it becomes evident that the combined size of the Citus shards is substantially smaller, showcasing a 50% reduction compared to the single PostgreSQL instance. In Citus, each shard occupies a modest 29 MB, highlighting the efficiency gained through distributed storage. Conversely, the traditional PostgreSQL instance has a larger footprint, with the database size totaling 54 MB. This significant disparity underscores the scalability benefits of Citus, as it efficiently distributes data across multiple shards, resulting in optimized storage utilization and smaller database sizes.
SELECT datname AS "Database",
pg_size_pretty(pg_database_size(datname)) AS "Size"
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Citus Per Shard
postgres=# SELECT datname AS "Database",
pg_size_pretty(pg_database_size(datname)) AS "Size"
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
Database | Size
-----------+---------
citus | 29 MB
template1 | 7556 kB
postgres | 7492 kB
template0 | 7337 kB
-- Single Node PostgreSQL
postgres=# SELECT datname AS "Database",
pg_size_pretty(pg_database_size(datname)) AS "Size"
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
Database | Size
-----------+---------
citus | 54 MB
template1 | 7525 kB
postgres | 7453 kB
template0 | 7297 kB
Moreover, it’s worth noting that as we scale the Citus cluster by adding new worker nodes, we further decrease the database size per shard, demonstrating the continued efficiency gains and scalability potential of the Citus architecture.
In this blog post, we discussed the architecture and fundamental concepts of Citus, an extension for PostgreSQL designed to enable horizontal scaling across multiple nodes. We explored how Citus distributes data and queries across a cluster of machines, providing scalability and high availability for PostgreSQL databases. By partitioning tables and leveraging distributed query processing, Citus allows for seamless scaling to accommodate growing workloads while maintaining performance. We also highlighted the importance of using Patroni for high availability and automatic failover in Citus clusters. Additionally, we compared database sizes between Citus shards and a traditional PostgreSQL instance, demonstrating the efficiency gains of Citus’s distributed architecture.
Looking ahead, we will explore strategies for optimizing query performance in the next blog post, focusing on achieving efficient and sustainable solutions to meet business requirements.
Embrace the Dark Side as you jump into database optimization! Have questions or need guidance on your journey? Reach out to me on LinkedIn, connect on Twitter or Superpeer.May the Force guide us as we optimize our databases for greater efficiency.
Demir.
References
- https://patroni.readthedocs.io/en/latest/installation.html#installing-on-debian-derivatives
- https://patroni.readthedocs.io/en/latest/citus.html
- https://docs.citusdata.com/en/v12.1/installation/multi_node_debian.html#steps-to-be-executed-on-all-nodes
- https://docs.citusdata.com/en/v12.1/admin_guide/cluster_management.html
- https://www.citusdata.com/blog/2023/03/06/patroni-3-0-and-citus-scalable-ha-postgres/
- https://github.com/hasura/graphql-on-various-pg/blob/master/citus/load_sample_data.sh
- https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/