PostgreSQL + tmpfs

I have heard a couple of times about using a Temporary File System (tmpfs) to keep data in volatile memory instead of persisting it to a storage device. Intuitively, you would assume this always gives you BLAZINGLY FAST IO, right?

1. Context

Say you have a small database (<500 MB) whose workload leans heavily towards reads.

I thought to myself: what if we spin up a read-only replica on tmpfs via streaming replication?

Something like this:

read-arch.svg

2. Setup

Creating the master database

As the postgres user:

mkdir --parents /var/lib/postgres/16/wal_archive

# Create database
initdb --auth-local=trust /var/lib/postgres/16/master

# Helper function to edit postgresql.conf: uncomment the given
# property and set it to the given value
function set_prop() {
  sed --regexp-extended \
      --in-place \
      "s,^#$1.*$,$1 = $2,gI" \
      /var/lib/postgres/16/master/postgresql.conf
}

# Edit relevant postgresql.conf properties
set_prop wal_level replica
set_prop archive_mode on
set_prop archive_command "'test ! -f /var/lib/postgres/16/wal_archive/%f \&\& cp %p /var/lib/postgres/16/wal_archive/%f'"
set_prop restore_command "'cp /var/lib/postgres/16/wal_archive/%f %p'"
set_prop archive_cleanup_command "'pg_archivecleanup /var/lib/postgres/16/wal_archive %r'"
set_prop primary_conninfo "'host=localhost port=5432 user=replicator'"

# Start RDBMS
pg_ctl --pgdata=/var/lib/postgres/16/master \
       --log=/var/lib/postgres/16/master/log \
       start

# Create replication user
createuser --replication replicator
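
Before moving on, it does not hurt to confirm that the relevant settings are active; a quick check could look like this:

# Optional sanity check: confirm the WAL/archive settings took effect
psql --port=5432 --command="SHOW wal_level;" --command="SHOW archive_mode;" postgres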

Creating the replica with tmpfs

As root:

mkdir --parents /mnt/ramdisk
mount --types tmpfs --options='size=8192m,mode=0700' tmpfs /mnt/ramdisk
chown --recursive postgres:postgres /mnt/ramdisk
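
To double-check that the mount point really is backed by tmpfs and has the expected size, findmnt and df come in handy:

# Optional: verify the ramdisk mount
findmnt --target /mnt/ramdisk
df --human-readable /mnt/ramdisk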

As the postgres user:

pg_basebackup --write-recovery-conf \
              --pgdata=/mnt/ramdisk

# Start the tmpfs replica on port 5433
pg_ctl --pgdata=/mnt/ramdisk    \
       --options='-c port=5433' \
       --log=/mnt/ramdisk/log   \
       start
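
Before benchmarking, it is worth confirming that the replica is actually streaming from the master. Something like this should do:

# On the master: list connected standbys
psql --port=5432 --command="select client_addr, state, sync_state from pg_stat_replication;" postgres

# On the replica: should return true while it is a read-only standby
psql --port=5433 --command="select pg_is_in_recovery();" postgres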

3. Benchmark

First, generate some data on the master instance as the postgres user:

pgbench --port=5432    \
        --initialize   \
        --foreign-keys \
        --scale=30     \
        postgres

see: https://www.postgresql.org/docs/current/pgbench.html

How many records did this generate?

select to_char(count(*), '999,999,990') as accounts from pgbench_accounts;
| accounts   |
|------------|
| 3,000,000  |

We will be using the select-only built-in script for the benchmarks:

-- select-only: <builtin: select only>
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
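
If your pgbench build supports it, the listing above can be printed locally with:

pgbench --builtin=list
pgbench --show-script=select-only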

Now, for some action, run as the postgres user:

# change port to 5433 to run against the replica database
pgbench --port=5432   \
        --time=60     \
        --client=100  \
        --jobs=3      \
        --select-only \
        --no-vacuum   \
        postgres
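
To avoid editing the port by hand between runs, the same benchmark can be looped over both instances:

# Run the select-only benchmark against the master (5432) and the replica (5433)
for port in 5432 5433; do
  pgbench --port=$port  \
          --time=60     \
          --client=100  \
          --jobs=3      \
          --select-only \
          --no-vacuum   \
          postgres
done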

Against the master instance:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 30
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5453634
number of failed transactions: 0 (0.000%)
latency average = 1.100 ms
initial connection time = 74.276 ms
tps = 90905.785932 (without initial connection time)

Against the replica instance:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 30
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5506013
number of failed transactions: 0 (0.000%)
latency average = 1.089 ms
initial connection time = 74.259 ms
tps = 91793.918172 (without initial connection time)

Well, that is a surprise (for me at least). I was expecting the tmpfs replica to outperform the master database by a moderate-to-large margin. It seems the PostgreSQL and Operating System (OS) caches are doing a great job.

Let's try with a bigger dataset (10x the size).

pgbench --port=5432    \
        --initialize   \
        --foreign-keys \
        --scale=300    \
        postgres

Let's check the master database size:

select relname
     , pg_size_pretty(pg_relation_size(cl.oid)) as "size"
  from pg_class cl
 where cl.relnamespace = 'public'::regnamespace
 order by pg_relation_size(cl.oid) desc;
| relname               | size    |
|-----------------------+---------|
| pgbench_accounts      | 3842 MB |
| pgbench_accounts_pkey | 643 MB  |
| pgbench_tellers       | 136 kB  |
| pgbench_tellers_pkey  | 88 kB   |
| pgbench_branches_pkey | 16 kB   |
| pgbench_branches      | 16 kB   |
| pgbench_history       | 0 bytes |
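
If you prefer a single number, pg_database_size reports the total footprint of the database, which should land around the ~4.5 GB mentioned later:

psql --port=5432 --command="select pg_size_pretty(pg_database_size('postgres')) as total_size;" postgres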

Let's run the select-only benchmark again. Against the master instance:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 300
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5146134
number of failed transactions: 0 (0.000%)
latency average = 1.165 ms
initial connection time = 86.478 ms
tps = 85832.914187 (without initial connection time)

and against the replica instance:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 300
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5150581
number of failed transactions: 0 (0.000%)
latency average = 1.164 ms
initial connection time = 76.838 ms
tps = 85874.008596 (without initial connection time)

Well, our PostgreSQL instances are mostly running the default configuration, which means 128 MB of shared_buffers. That is clearly not enough to cache a ~4.5 GB dataset on its own, so the parity with the tmpfs replica must be thanks to the OS page cache. Perhaps if I drop my OS page cache?
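
For reference, the current setting is easy to confirm; a stock initdb setup should report 128MB:

psql --port=5432 --command="SHOW shared_buffers;" postgres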

As root:

sync; echo 1 > /proc/sys/vm/drop_caches
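
Comparing the output of free before and after dropping the cache is illustrative; the buff/cache column should shrink dramatically:

free --human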

Benchmark once again versus the master instance:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 300
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 2037721
number of failed transactions: 0 (0.000%)
latency average = 2.943 ms
initial connection time = 84.288 ms
tps = 33976.877020 (without initial connection time)

Aha! Roughly a third of the previous transactions per second (TPS).

Let's run it two more times in a row and watch the OS cache do its job:

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 300
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 4475925
number of failed transactions: 0 (0.000%)
latency average = 1.341 ms
initial connection time = 70.050 ms
tps = 74598.040075 (without initial connection time)

pgbench (16.1)
transaction type: <builtin: select only>
scaling factor: 300
query mode: simple
number of clients: 100
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5070696
number of failed transactions: 0 (0.000%)
latency average = 1.183 ms
initial connection time = 79.571 ms
tps = 84538.387397 (without initial connection time)

OK, now we are back to our starting TPS.

4. Closing thoughts

I don't know anything about other platforms such as Windows and macOS (well, I know almost nothing about Linux either…), so I will frame my thoughts with a Linux + PostgreSQL combo in mind.

For long-running applications, it appears to be better to just let PostgreSQL and Linux handle caching. While you could still benefit from a more predictable TPS figure with tmpfs or ramfs, I would argue that a regular setup is simpler to operate.

Now, for short-lived applications, such as running your integration tests, it may be a splendid option. I shall explore this topic in a later entry on this blog.

Also, please, tune your RDBMS. If you are clueless like me, the bare minimum can be found here: https://pgtune.leopard.in.ua.
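
As a rough sketch of how such recommendations get applied (the values below are placeholders, not advice; substitute whatever pgtune suggests for your hardware):

# Example only: apply tuned values via ALTER SYSTEM (placeholder numbers)
psql --port=5432 --command="ALTER SYSTEM SET shared_buffers = '2GB';" postgres
psql --port=5432 --command="ALTER SYSTEM SET effective_cache_size = '6GB';" postgres

# shared_buffers only takes effect after a restart
pg_ctl --pgdata=/var/lib/postgres/16/master restart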

If you want to contact me about this post, do so via this discussion.