Unveiling Performance: Benchmarking OLTP and KV Requests in Couchbase vs. PostgreSQL

Hüseyin Demir
Jun 13, 2023 · 10 min read


Hi! In this study, I examine the performance differences between PostgreSQL and Couchbase, looking at query behavior, data size, resource consumption, and related areas. We will dive into the world of OLTP (Online Transaction Processing) and KV (Key-Value) requests: the benchmark compares the performance of Couchbase and PostgreSQL, two popular database platforms, and analyzes the results. We will also walk through generating and preparing the required data before running the benchmark. So, let's explore the world of OLTP and KV requests and uncover the insights gained from this test.

I. Test Setup and Environment

II. Data Generation and Preparation

III. Benchmark Test Execution

IV. Performance Evaluation and Analysis

V. Insights and Findings

VI. Considerations for Choosing the Right Database Platform

Test Setup and Environment

The test environment can be prepared on the Docker platform. I used the following commands to create a Couchbase cluster and a PostgreSQL instance. (Note that the Couchbase containers still need to be initialized and joined into a single cluster, for example through the web UI or couchbase-cli.)

docker run -d --name db -p 8091-8097:8091-8097 -p 9123:9123 -p 11207:11207 -p 11210:11210 -p 11280:11280 -p 18091-18097:18091-18097 couchbase:7.2.0
docker run -d --name db2 couchbase:7.2.0
docker run -d --name db3 couchbase:7.2.0

docker run -d --name db4 couchbase:7.2.0
docker run -d --name db5 couchbase:7.2.0
docker run -d --name db6 couchbase:7.2.0
docker run --name pgs_95 -e POSTGRES_PASSWORD=mysecretpassword -p 5436:5432 -d postgres:15

Data Generation and Preparation

Within the scope of this study, two test environments have been prepared: the six-node Couchbase cluster and the single PostgreSQL instance created above.

Both environments have been populated with a data model consisting of 5 million records, generated in exactly the same way. The steps to generate and populate this data are as follows:

  1. Generate a random/fake dataset using Python.
import csv
from random import randint, uniform, choice
from datetime import datetime, timedelta

# Number of order records to generate
num_records = 5000000

# Generate random customer names
def generate_random_name():
    first_names = ['John', 'Jane', 'Alice', 'Bob', 'Charlie', 'Emma', 'Oliver', 'Sophia']
    last_names = ['Smith', 'Johnson', 'Brown', 'Davis', 'Wilson', 'Taylor', 'Clark']
    return choice(first_names) + ' ' + choice(last_names)

# Generate random email addresses
def generate_random_email():
    domains = ['gmail.com', 'yahoo.com', 'hotmail.com', 'example.com']
    username = ''.join(choice('abcdefghijklmnopqrstuvwxyz') for _ in range(randint(5, 10)))
    domain = choice(domains)
    return f'{username}@{domain}'

# Generate a random order date
def generate_random_date():
    start_date = datetime(2015, 1, 1)
    end_date = datetime(2023, 1, 1)
    time_between_dates = end_date - start_date
    random_number_of_days = randint(0, time_between_dates.days)
    random_date = start_date + timedelta(days=random_number_of_days)
    return random_date.date()

# Generate a random total amount
def generate_random_amount():
    return round(uniform(10.0, 1000.0), 2)

# Generate a random city
def generate_random_city():
    cities = ['New York', 'London', 'Paris', 'Tokyo', 'Berlin', 'Sydney', 'Rome', 'Toronto']
    return choice(cities)

# Generate a random country
def generate_random_country():
    countries = ['USA', 'UK', 'France', 'Japan', 'Germany', 'Australia', 'Italy', 'Canada']
    return choice(countries)

# Generate a random postal code
def generate_random_postal_code():
    return ''.join(choice('0123456789') for _ in range(6))

# Generate orders
orders = []
for i in range(num_records):
    order = {
        'order_id': i + 1,
        'order_item_id': i + 1,
        'customer_name': generate_random_name(),
        'customer_email': generate_random_email(),
        'order_date': generate_random_date(),
        'total_amount': generate_random_amount(),
        'city': generate_random_city(),
        'country': generate_random_country(),
        'postal_code': generate_random_postal_code()
    }
    orders.append(order)

# Define the CSV file path
csv_file_path = 'order_data.csv'

# Write order data to the CSV file
fieldnames = ['order_id', 'order_item_id', 'customer_name', 'customer_email', 'order_date', 'total_amount', 'city', 'country', 'postal_code']
with open(csv_file_path, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(orders)

print(f'{num_records} order records generated and exported to {csv_file_path}.')

Save the file as populate.py and execute it:

python3 benchmark/populate.py

Running this script creates a file named "order_data.csv" in the working directory. In the subsequent steps, we will import this file into both PostgreSQL and Couchbase.

2. The generated data is imported into Couchbase.

Before this operation, it is necessary to create a bucket named “orders” in Couchbase. The cbimport tool has been used to import the data into Couchbase.
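The bucket can be created through the web UI or programmatically. As a minimal sketch, the Couchbase Python SDK's bucket-management API can do it; the 1024 MB RAM quota here is an assumption, and the credentials match the cbimport command below:

# A minimal sketch for creating the "orders" bucket with the Couchbase
# Python SDK. The 1024 MB RAM quota is an assumption; adjust to your cluster.
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator
from couchbase.management.buckets import CreateBucketSettings

cluster = Cluster.connect(
    "couchbase://127.0.0.1",
    ClusterOptions(PasswordAuthenticator("Administrator", "test123")))

# Create the bucket that cbimport will load the CSV rows into
cluster.buckets().create_bucket(
    CreateBucketSettings(name="orders", ram_quota_mb=1024))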

./cbimport csv -c couchbase://127.0.0.1 -u Administrator -p test123 -b orders -d file:///home/demir/Desktop/notes_temp/order_data.csv -g key::%order_id% -t 5
CSV `file:///home/demir/Desktop/notes_temp/order_data.csv` imported to `couchbase://127.0.0.1` successfully
Documents imported: 5000000 Documents failed: 0
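To sanity-check the import and the key::%order_id% key template, you can fetch one document back by its generated key. A minimal sketch, with credentials matching the cbimport command above:

# A minimal sketch: fetch one imported document to confirm the key template
# produced by cbimport (keys like "key::1", "key::2", ...).
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator

cluster = Cluster.connect(
    "couchbase://127.0.0.1",
    ClusterOptions(PasswordAuthenticator("Administrator", "test123")))
collection = cluster.bucket("orders").default_collection()

doc = collection.get("key::1")
print(doc.content_as[dict])  # The first generated order document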

3. The generated data is imported into PostgreSQL.

Prior to this, the database and table must be created in PostgreSQL. The following DDL (Data Definition Language) creates the table, and the COPY command then imports the CSV file:

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    order_item_id SERIAL,
    customer_name VARCHAR(255) NOT NULL,
    customer_email VARCHAR(255) NOT NULL,
    order_date DATE NOT NULL,
    total_amount DECIMAL(10, 2) NOT NULL,
    city VARCHAR(255) NOT NULL,
    country VARCHAR(255) NOT NULL,
    postal_code VARCHAR(20) NOT NULL
);

COPY orders
FROM '/path/order_data.csv'
DELIMITER ','
CSV HEADER;

After the process is completed, we should verify that the table was populated correctly, for example by checking the row count. If the COPY succeeded, the table should contain 5 million rows.
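A minimal verification sketch using psycopg2; the connection parameters and the test_user credentials are the same assumptions used in the benchmark script later in this post:

# A minimal sketch to verify the import; connection parameters follow the
# Docker setup above and the credentials used in the benchmark script below.
import psycopg2

conn = psycopg2.connect(
    host='127.0.0.1',
    port='5436',
    database='orders',
    user='test_user',
    password='test123'
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM orders;")
print(cursor.fetchone()[0])  # Expect 5000000 if the COPY succeeded
cursor.close()
conn.close()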

4. The necessary indexes are added to both Couchbase and PostgreSQL for the queries to be tested.

The following indexes have been created for Couchbase:

CREATE INDEX `adv_country_total_amount` ON `orders`(`country` INCLUDE MISSING,`total_amount`)
CREATE INDEX `adv_customer_name_order_date_total_amount_order_id` ON `orders`(`customer_name`,`order_date`,`total_amount`,`order_id`)
CREATE INDEX `adv_order_date_customer_name_order_id_total_amount` ON `orders`(`order_date`,`customer_name`,`order_id`,`total_amount`)

For PostgreSQL, the following indexes have been created:

CREATE INDEX idx_orders_customer_name ON orders (customer_name) INCLUDE (order_id, order_date, total_amount);
CREATE INDEX idx_orders_country_total_amount ON orders (country, total_amount);
CREATE INDEX idx_orders_covering ON orders (order_date) INCLUDE (order_id, customer_name, total_amount);
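Before benchmarking, it is worth confirming that the planner actually uses these indexes. Here is a minimal sketch with psycopg2 (connection parameters are the same assumptions as elsewhere in this post) that prints PostgreSQL's plan for the customer-name query tested in the next section; Couchbase offers an equivalent EXPLAIN statement in N1QL:

# A minimal sketch: print the PostgreSQL plan for the customer-name query.
import psycopg2

conn = psycopg2.connect(host='127.0.0.1', port='5436', database='orders',
                        user='test_user', password='test123')
cur = conn.cursor()

# A line such as "Index Only Scan using idx_orders_customer_name" would
# indicate the covering index is used instead of a sequential scan.
cur.execute("""
    EXPLAIN
    SELECT order_id, order_date, total_amount
    FROM orders
    WHERE customer_name = 'Oliver Wilson'
""")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()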

Benchmark Test Execution and Performance Evaluation & Analysis

Within the scope of this study, performance tests have been conducted on three different query types as examples. These queries are as follows:

query-1

SELECT order_id, order_date, total_amount
FROM orders
WHERE customer_name = 'Oliver Wilson';

query-2

SELECT country, SUM(total_amount) AS total_revenue
FROM orders
GROUP BY country;

query-3

SELECT order_id, customer_name, order_date, total_amount
FROM orders
WHERE order_date >= '2022-01-01' AND order_date <= '2022-12-31';
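The timing harness itself is not shown above; as a rough illustration, here is a minimal sketch that times query-2 once against each system. The connection parameters and credentials are the same assumptions used in the KV benchmark script later in this post, and a real run should repeat each query many times and discard warm-up iterations:

# A minimal sketch for timing query-2 against both systems.
import time
import psycopg2
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator

# The same statement is valid SQL and N1QL for this query
query = """
SELECT country, SUM(total_amount) AS total_revenue
FROM orders
GROUP BY country
"""

# --- PostgreSQL ---
conn = psycopg2.connect(host='127.0.0.1', port='5436', database='orders',
                        user='test_user', password='test123')
cur = conn.cursor()
start = time.time()
cur.execute(query)
rows = cur.fetchall()
print(f"PostgreSQL: {len(rows)} rows in {(time.time() - start) * 1000:.2f} ms")
cur.close()
conn.close()

# --- Couchbase (N1QL via the Query service) ---
cluster = Cluster.connect(
    "couchbase://127.0.0.1",
    ClusterOptions(PasswordAuthenticator("test", "test123")))
start = time.time()
result = list(cluster.query(query))
print(f"Couchbase: {len(result)} rows in {(time.time() - start) * 1000:.2f} ms")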

Results

In terms of OLTP query performance, these queries run faster in PostgreSQL than in Couchbase when executed through the N1QL (Query) service. In other words, PostgreSQL outperforms Couchbase in this specific scenario.

The difference in performance between the two databases can be attributed to several factors:

  • Data Model: Couchbase and PostgreSQL have different data models. Couchbase is a NoSQL document-oriented database, while PostgreSQL is a relational database. The differences in how they store and retrieve data can impact query performance, especially for complex OLTP queries.
  • Indexing Strategies: PostgreSQL offers various index types, such as B-tree, hash, GIN (generalized inverted index), and GiST (generalized search tree), which can significantly enhance query performance. Couchbase also provides indexing options, including primary, secondary, and full-text indexes. However, PostgreSQL's indexing capabilities and optimization techniques may offer better performance for OLTP queries than Couchbase's N1QL.
  • Data Distribution and Scalability: Couchbase’s distributed architecture is designed for scalability and high availability, which is advantageous for certain workloads and use cases. However, for OLTP queries that rely heavily on transactional consistency and immediate response times, PostgreSQL’s centralized architecture may provide better performance.

It is important to note that these performance differences are specific to the given query and may vary based on the workload characteristics, dataset size, hardware configurations, and database tuning parameters.

When comparing the performance of OLTP queries between Couchbase’s N1QL and PostgreSQL, it is crucial to consider the specific requirements of your application and the trade-offs between different database systems.

KV Performance Tests

A KV benchmark test between Couchbase and PostgreSQL involves evaluating the performance of key-value (KV) requests in both database systems. In the context of databases, a KV request refers to retrieving or manipulating data based on its unique identifier, known as the key.

In Couchbase, a NoSQL database, KV requests are fundamental, as data is stored and accessed using a key-value model. In PostgreSQL, a relational database, the equivalent operation is selecting data via the primary key.

It's important to note that while both Couchbase and PostgreSQL support KV operations, their underlying architectures and data models differ significantly. Couchbase, being a distributed NoSQL database, is designed to handle massive-scale data and high-throughput workloads with a focus on horizontal scalability and low-latency access. PostgreSQL, on the other hand, is a feature-rich relational database that excels in complex data relationships, transactions, and SQL-based querying. Therefore, when comparing KV requests between the two, it's essential to consider the specific use case, data model, and performance requirements to determine the most suitable solution for your needs.

The test scenario involves applying 1000 requests sequentially to either Couchbase or PostgreSQL and measuring the average execution time using Python. Here's how to set up and execute such a test:

1. Establish a connection: In Python, use the appropriate libraries to establish connections to both the Couchbase and PostgreSQL databases.

2. Start the test loop: Set up a loop that iterates 1000 times, representing the 1000 sequential requests to perform.

3. Generate a key: Within each iteration, generate a key for the request. This key could be a sequential number or any other identifier that adheres to your data model.

4. Execute the request: Using the methods provided by the Python libraries for Couchbase and PostgreSQL, perform the request. In Couchbase, this means fetching the document by its key; in PostgreSQL, executing a SELECT by primary key.

5. Collect execution times: Store the execution time of each request in a list or another appropriate data structure.

6. Calculate the average execution time: After the loop completes, sum the individual execution times and divide by the total number of requests (1000 in this case).

7. Analyze the results: Compare the average execution times of Couchbase and PostgreSQL to evaluate how the two systems handle sequential KV requests.

Remember to consider factors such as network latency, hardware capabilities, and any additional configurations specific to Couchbase and PostgreSQL that may impact the test results.

I used the following script to run the KV benchmark test.

import psycopg2
import datetime
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator
import time
import random

# Connect to PostgreSQL
conn = psycopg2.connect(
    host='127.0.0.1',
    port='5436',
    database='orders',
    user='test_user',
    password='test123'
)
cursor = conn.cursor()

# Connect to Couchbase and open the orders bucket
cluster = Cluster.connect(
    "couchbase://127.0.0.1",
    ClusterOptions(PasswordAuthenticator("test", "test123")))
bucket = cluster.bucket("orders")
collection = bucket.default_collection()

# Primary-key lookup used for the PostgreSQL side of the test
query = "SELECT * FROM orders WHERE order_id = %s"

avg_duration = 0
for _ in range(1000):
    # Generate a random order ID (the test samples IDs 1-1000)
    order_id = random.randint(1, 1000)

    # Execute the query and time it
    start_time = datetime.datetime.now()
    cursor.execute(query, (order_id,))
    results = cursor.fetchall()
    for result in results:
        pass
    end_time = datetime.datetime.now()
    duration = (end_time - start_time).total_seconds() * 1000
    avg_duration += duration

avg_duration /= 1000
print("PostgreSQL Average execution time:", avg_duration, "milliseconds")

cursor.close()
conn.close()

# Fetch a single document by key from Couchbase and time the round trip
def collect_key(key):
    start_time = time.time()
    result = collection.get(key)
    end_time = time.time()
    response_time = (end_time - start_time) * 1000  # Convert to milliseconds
    return response_time

total_response_time = 0
iterations = 1000

for i in range(iterations):
    key_id = random.randint(1, 1000)  # Generate a random key ID
    key = f'key::{key_id}'
    response_time = collect_key(key)
    total_response_time += response_time

average_response_time = total_response_time / iterations
print(f"Couchbase Average response time: {average_response_time:.2f} milliseconds")

Indeed, while the performance of KV requests may be comparable between Couchbase and PostgreSQL, Couchbase offers several advantages due to its distributed NoSQL architecture. Here are some of the key advantages of Couchbase:

1. Scalability: Couchbase is designed for horizontal scalability, allowing you to easily scale your database cluster by adding more nodes as your workload grows. This scalability is achieved through a shared-nothing architecture, where data is partitioned and distributed across multiple nodes. As a result, Couchbase can handle large volumes of data and high-throughput workloads with ease.

2. High Availability: Couchbase ensures high availability by employing data replication and automatic failover. It supports configurable data replication, allowing you to define the number of replicas for your data. In the event of a node failure, Couchbase automatically promotes a replica to ensure continuous availability without manual intervention.

3. Integrated Caching: Couchbase incorporates an integrated caching layer, which can significantly boost read performance by storing frequently accessed data in memory. This caching layer reduces the need to fetch data from disk, resulting in faster response times and improved overall performance.

4. Active-Active Replication: Couchbase supports multi-datacenter replication, enabling active-active deployments across geographically distributed locations. This feature allows you to maintain data consistency and provide low-latency access to users in different regions.

These advantages make Couchbase an attractive choice for scenarios requiring high scalability, low-latency access, flexible data models, and easy manageability in distributed environments. However, it’s essential to evaluate your specific use case and requirements to determine if Couchbase aligns with your needs.

Insights and Findings

  • If you expect to handle OLTP-like workloads with good performance using GSI (Global Secondary Index) queries on Couchbase, you may run into unexpected slowdowns compared to PostgreSQL. If the data model and requirements are primarily focused on OLTP queries, PostgreSQL may therefore be preferable to Couchbase's GSI indexes.
  • When it comes to KV-style lookups, there was no meaningful performance difference between PostgreSQL and Couchbase in this test: PostgreSQL fetches the row directly via the primary key, and Couchbase fetches the document directly via its key. Neither approach introduces a performance issue, so the decision should revolve around availability, scaling, and other operational concerns.

Considerations for Choosing the Right Database Platform

In conclusion, when choosing a database platform, it is crucial to evaluate your requirements and select a technology that aligns with them. Factors such as performance, high availability, scalability, and elasticity play a significant role in determining how suitable a platform is for your specific use case. Each platform has its own strengths, so assess how well they meet your needs: whether you prioritize raw performance, the ability to handle large workloads, seamless scalability, or fault tolerance, a thorough analysis of the trade-offs will help you make an informed decision. Ultimately, selecting the right database platform is essential for achieving optimal performance, maintaining high availability, and scaling your application effectively.

If you ever have questions or need advice on database-related matters, feel free to reach out to me on LinkedIn, connect on Twitter or Superpeer.
