PostgreSQL Replication and High Availability in Your Homelab
A single PostgreSQL instance is a single point of failure. If the disk dies, the container crashes at 3 AM, or you need to run maintenance, every service that depends on that database goes down with it. For a homelab running Nextcloud, Gitea, Grafana, and a handful of other services against the same Postgres instance, that's a bad Saturday morning.
PostgreSQL has built-in streaming replication that continuously ships WAL (Write-Ahead Log) records from a primary to one or more standby servers. Add Patroni on top for automatic failover, and pgBouncer in front for connection pooling, and you have a setup that survives node failures without manual intervention. It's the same architecture used in production at companies running millions of transactions per day, scaled down to homelab proportions.
Understanding PostgreSQL Replication
PostgreSQL offers several replication modes. For a homelab HA setup, streaming replication is what you want.
| Replication Type | Description | Use Case |
|---|---|---|
| Streaming (async) | Standby continuously receives WAL from primary. Slight lag possible. | Default, good for most homelabs |
| Streaming (sync) | Primary waits for standby confirmation before committing. Zero data loss. | When you can't lose a single transaction |
| Logical replication | Replicates specific tables/databases. Both sides are writable. | Selective replication, version upgrades |
| WAL archiving | Ships completed WAL files to storage. | Backup and point-in-time recovery |
Asynchronous streaming replication is the sweet spot for homelabs. It provides continuous replication with minimal performance impact on the primary. You might lose a fraction of a second of transactions during an unplanned failover, which is acceptable when your workload is Nextcloud file metadata and Gitea repositories, not financial transactions.
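If one of your databases genuinely can't tolerate that window, synchronous replication is a configuration change rather than a rebuild. A minimal sketch, assuming at least one standby is already streaming and you run this on the primary as the postgres superuser ('ANY 1 (*)' means any one connected standby must acknowledge each commit):
# Require at least one standby to confirm every commit
sudo -u postgres psql -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (*)';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"
Keep in mind that with this set, commits block whenever no standby is connected, which is exactly the wrong behavior for a two-node homelab limping along in a degraded state.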
Basic Streaming Replication Setup
Before adding Patroni, it's worth understanding how raw streaming replication works. This knowledge helps when debugging issues later.
Primary Server Configuration
On your primary PostgreSQL node, edit postgresql.conf:
# Replication settings
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
hot_standby = on
# Recommended for replication monitoring
track_commit_timestamp = on
# Listen on all interfaces
listen_addresses = '*'
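Changing wal_level only takes effect after a full restart, not a reload. Roughly this, though the service name varies by distribution and PostgreSQL version:
# Apply the new settings and confirm them
sudo systemctl restart postgresql
sudo -u postgres psql -c "SHOW wal_level;"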
Create a replication user:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replication-password';
Add the standby to pg_hba.conf:
# Allow replication connections from standby
host replication replicator 10.0.0.0/24 scram-sha-256
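pg_hba.conf is only re-read on reload, so apply the change before moving on to the standby. Something like:
# Reload to pick up the new pg_hba.conf entry (or run SELECT pg_reload_conf(); from psql)
sudo systemctl reload postgresql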
Standby Server Setup
On the standby, create a base backup from the primary:
# Stop PostgreSQL on the standby
sudo systemctl stop postgresql
# Remove existing data directory
sudo rm -rf /var/lib/postgresql/16/main/*
# Take a base backup from the primary
sudo -u postgres pg_basebackup \
-h 10.0.0.10 \
-U replicator \
-D /var/lib/postgresql/16/main \
-Fp -Xs -P -R
The -R flag creates standby.signal and configures primary_conninfo in postgresql.auto.conf, which tells PostgreSQL to start as a standby and connect to the primary for WAL streaming.
Start the standby:
sudo systemctl start postgresql
Verify replication is working on the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
You should see the standby listed with a state of streaming.
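On the standby side you can confirm it's in recovery and actively receiving WAL. A couple of checks along these lines, run against the standby (column names as of PostgreSQL 13+):
# Should return 't' on a standby
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# Shows the WAL receiver connection back to the primary
sudo -u postgres psql -c "SELECT status, sender_host, flushed_lsn FROM pg_stat_wal_receiver;"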
Patroni: Automatic Failover
Raw streaming replication doesn't handle failover. If the primary dies, the standby keeps running but in read-only mode. Someone (you, at 3 AM) has to manually promote the standby and reconfigure applications to point to the new primary.
Patroni automates this. It's a cluster manager for PostgreSQL that uses a distributed consensus store (etcd, Consul, or ZooKeeper) to elect a leader, monitor health, and orchestrate automatic failover. When the primary becomes unreachable, Patroni promotes the most up-to-date standby and updates the consensus store so clients know where the new primary is.
Architecture
A minimal Patroni setup needs:
- 2-3 PostgreSQL nodes (each running Patroni as a sidecar)
- etcd cluster (3 nodes for consensus -- can share hosts with PostgreSQL)
- pgBouncer or HAProxy (to route connections to the current primary)
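One way to lay the project out on disk, matching the build: ./patroni reference in the compose file below (directory names are just a suggestion):
pg-ha/
├── docker-compose.yml        # etcd, Patroni/PostgreSQL nodes (pgBouncer gets added later)
└── patroni/
    ├── Dockerfile            # postgres:16 + Patroni
    └── patroni.yml           # baseline Patroni config baked into the image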
Docker Compose Setup
Here's a complete three-node setup with etcd:
# docker-compose.yml
services:
etcd1:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd1
hostname: etcd1
environment:
ETCD_NAME: etcd1
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd1:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd1:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
etcd2:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd2
hostname: etcd2
environment:
ETCD_NAME: etcd2
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd2:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd2:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
etcd3:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd3
hostname: etcd3
environment:
ETCD_NAME: etcd3
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd3:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd3:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
pg-node1:
image: patroni-pg:latest
build: ./patroni
container_name: pg-node1
hostname: pg-node1
environment:
PATRONI_NAME: pg-node1
PATRONI_SCOPE: pg-ha-cluster
PATRONI_ETCD3_HOSTS: "'etcd1:2379','etcd2:2379','etcd3:2379'"
PATRONI_RESTAPI_CONNECT_ADDRESS: pg-node1:8008
PATRONI_POSTGRESQL_CONNECT_ADDRESS: pg-node1:5432
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_REPLICATION_USERNAME: replicator
PATRONI_REPLICATION_PASSWORD: rep-pass
PATRONI_SUPERUSER_USERNAME: postgres
PATRONI_SUPERUSER_PASSWORD: postgres-pass
volumes:
- pg-node1-data:/var/lib/postgresql/data
networks:
- pg-ha
pg-node2:
image: patroni-pg:latest
build: ./patroni
container_name: pg-node2
hostname: pg-node2
environment:
PATRONI_NAME: pg-node2
PATRONI_SCOPE: pg-ha-cluster
PATRONI_ETCD3_HOSTS: "'etcd1:2379','etcd2:2379','etcd3:2379'"
PATRONI_RESTAPI_CONNECT_ADDRESS: pg-node2:8008
PATRONI_POSTGRESQL_CONNECT_ADDRESS: pg-node2:5432
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_REPLICATION_USERNAME: replicator
PATRONI_REPLICATION_PASSWORD: rep-pass
PATRONI_SUPERUSER_USERNAME: postgres
PATRONI_SUPERUSER_PASSWORD: postgres-pass
volumes:
- pg-node2-data:/var/lib/postgresql/data
networks:
- pg-ha
volumes:
pg-node1-data:
pg-node2-data:
networks:
pg-ha:
driver: bridge
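Bring the stack up and make sure etcd has quorum before looking at Patroni. Something like this, using the container names above:
docker compose up -d --build
# All three etcd members should report healthy
docker exec etcd1 etcdctl \
  --endpoints=http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
  endpoint health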
Patroni Configuration
Each node needs a patroni.yml. In the Docker setup above, the PATRONI_* environment variables take precedence over the corresponding file settings, but here's the full file for reference:
# patroni.yml
scope: pg-ha-cluster
name: pg-node1
restapi:
listen: 0.0.0.0:8008
connect_address: pg-node1:8008
etcd3:
hosts:
- etcd1:2379
- etcd2:2379
- etcd3:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: "on"
max_wal_senders: 5
max_replication_slots: 5
wal_log_hints: "on"
initdb:
- encoding: UTF8
- data-checksums
postgresql:
listen: 0.0.0.0:5432
connect_address: pg-node1:5432
data_dir: /var/lib/postgresql/data
authentication:
superuser:
username: postgres
password: postgres-pass
replication:
username: replicator
password: rep-pass
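One thing this minimal file leaves out: Patroni writes pg_hba.conf from a pg_hba list if you provide one; without it you're left with initdb's defaults, which only allow local connections, so the standby's replication connection gets rejected. Adding something like this under the postgresql section fixes that (0.0.0.0/0 is lazy; tighten it to the Docker network's subnet if you prefer):
  pg_hba:
    - host replication replicator 0.0.0.0/0 scram-sha-256
    - host all all 0.0.0.0/0 scram-sha-256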
Patroni Dockerfile
# patroni/Dockerfile
FROM postgres:16
RUN apt-get update && apt-get install -y \
python3 python3-pip python3-venv \
&& python3 -m venv /opt/patroni \
&& /opt/patroni/bin/pip install patroni[etcd3] psycopg2-binary \
&& apt-get clean
COPY patroni.yml /etc/patroni.yml
# PostgreSQL refuses to start as root, so run Patroni as the postgres user
USER postgres
ENTRYPOINT ["/opt/patroni/bin/patroni", "/etc/patroni.yml"]
Checking Cluster Status
Patroni provides a CLI tool called patronictl:
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Expected output:
# + Cluster: pg-ha-cluster -------+---------+----+-----------+
# | Member   | Host     | Role    | State   | TL | Lag in MB |
# +----------+----------+---------+---------+----+-----------+
# | pg-node1 | pg-node1 | Leader  | running |  1 |           |
# | pg-node2 | pg-node2 | Replica | running |  1 |         0 |
# +----------+----------+---------+---------+----+-----------+
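patronictl can also perform a controlled switchover, which is the polite way to move the leader before rebooting a node for maintenance. A sketch (recent Patroni versions use --leader; older releases call the same flag --master):
# Hand leadership to pg-node2, then patch or reboot pg-node1 at your leisure
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml \
  switchover pg-ha-cluster --leader pg-node1 --candidate pg-node2 --force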
pgBouncer: Connection Pooling
PostgreSQL forks a new process for every client connection. With 20 services each opening 10 connections, that's 200 backend processes consuming memory. pgBouncer sits in front of PostgreSQL and multiplexes client connections onto a smaller pool of actual database connections.
# Add to docker-compose.yml
pgbouncer:
image: bitnami/pgbouncer:latest
container_name: pgbouncer
environment:
PGBOUNCER_DATABASE: "*"
PGBOUNCER_PORT: "6432"
PGBOUNCER_POOL_MODE: transaction
PGBOUNCER_MAX_CLIENT_CONN: "200"
PGBOUNCER_DEFAULT_POOL_SIZE: "20"
PGBOUNCER_MIN_POOL_SIZE: "5"
POSTGRESQL_HOST: pg-node1
POSTGRESQL_PORT: "5432"
POSTGRESQL_USERNAME: postgres
POSTGRESQL_PASSWORD: postgres-pass
ports:
- "6432:6432"
networks:
- pg-ha
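From a client's point of view nothing changes except the port: point services at pgBouncer on 6432 with the same credentials. For example, with a hostname that resolves to the pgBouncer container in your setup:
# Same user and database as before, just routed through the pooler
psql "host=pgbouncer.lan port=6432 user=postgres dbname=postgres" -c "SELECT 1;"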
Pool modes explained:
| Mode | Description | Compatibility |
|---|---|---|
| session | Connection allocated per client session. Released when client disconnects. | Full compatibility. Least efficient. |
| transaction | Connection allocated per transaction. Released between transactions. | Most features work. No SET commands, prepared statements need care. |
| statement | Connection allocated per statement. | Very restrictive. Only simple queries. |
Transaction pooling is the best balance for most homelab workloads. If a specific application needs session-level features (like LISTEN/NOTIFY), give it a direct connection to PostgreSQL instead of routing through pgBouncer.
Routing to the Current Primary
When Patroni fails over, the primary changes. Your services shouldn't need reconfiguration. Patroni exposes a REST API on each node:
- GET /primary returns HTTP 200 on the current primary
- GET /replica returns HTTP 200 on standbys
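You can poke these endpoints directly to see the behavior (container hostnames from the compose file above; Patroni answers 200 on the matching role and 503 otherwise):
curl -s -o /dev/null -w '%{http_code}\n' http://pg-node1:8008/primary
curl -s -o /dev/null -w '%{http_code}\n' http://pg-node2:8008/primary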
You can use HAProxy to health-check these endpoints and route traffic:
# haproxy.cfg
listen postgres-primary
bind *:5432
mode tcp
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-node1 pg-node1:5432 check port 8008
server pg-node2 pg-node2:5432 check port 8008
Or, if you're already running pgBouncer, use a script that queries the Patroni API and updates pgBouncer's target host on failover. Patroni supports callback scripts that fire on role changes.
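As a rough sketch of that second approach, here's what an on_role_change callback might look like. Treat it as an outline: the pgbouncer.ini path, the pgbouncer hostname, and the reload-via-admin-console step are all assumptions about your environment (for instance, a config shared over a volume), not something the stock images provide.
#!/bin/bash
# Wired up in patroni.yml under postgresql.callbacks.on_role_change.
# Patroni invokes callbacks as: <script> <action> <role> <scope>
ACTION="$1"
ROLE="$2"

# Only react when this node has just been promoted
# (older Patroni versions report the role as "master", newer ones as "primary")
if [ "$ACTION" = "on_role_change" ] && { [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; }; then
    # Hypothetical: repoint pgBouncer's wildcard database entry at this node...
    sed -i "s/host=[^ ]*/host=$(hostname)/" /etc/pgbouncer/pgbouncer.ini
    # ...and ask pgBouncer to re-read its config (assumes postgres is listed in admin_users)
    psql -h pgbouncer -p 6432 -U postgres -d pgbouncer -c "RELOAD;"
fi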
Testing Failover
Never trust a failover setup you haven't tested. Here's how:
# Check current cluster state
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Insert test data on the primary
docker exec pg-node1 psql -U postgres -c \
"CREATE TABLE failover_test (id serial, ts timestamp default now());
INSERT INTO failover_test DEFAULT VALUES;"
# Simulate primary failure
docker stop pg-node1
# Wait 30 seconds for failover
sleep 30
# Check cluster state - pg-node2 should be Leader now
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Verify data is intact
docker exec pg-node2 psql -U postgres -c "SELECT * FROM failover_test;"
# Bring the old primary back - it will rejoin as a replica
docker start pg-node1
sleep 15
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
After you bring pg-node1 back, Patroni uses pg_rewind to resynchronize its data directory with the new primary's timeline, and the node rejoins the cluster as a replica. This is why use_pg_rewind: true matters in the Patroni configuration.
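The TL column in patronictl list increments on every promotion, and patronictl history shows the same information as a log of past timeline switches, which helps when you're reconstructing what happened overnight:
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml history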
Monitoring Replication
Add these queries to your Grafana dashboards:
-- Replication lag (run on primary)
SELECT
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication;
-- Replication slot status
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
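Byte-based lag is what Patroni uses for failover decisions, but a time-based number is often easier to read on a dashboard. A small addition you can run on a standby:
-- Approximate replay lag in seconds (run on a standby)
SELECT
  CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
       THEN 0
       ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
  END AS replay_lag_seconds;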
Set alerts for:
- Replication lag exceeding 10 MB (or whatever threshold matches your RPO)
- Replication slots becoming inactive (indicates a standby is down)
- pg_stat_replication returning no rows (all standbys disconnected)
Resource Considerations
A minimal two-node Patroni cluster with etcd requires:
| Component | Per Node | Total (2 PG + 3 etcd) |
|---|---|---|
| RAM | 512 MB (etcd) + 1 GB (PostgreSQL minimum) | ~4 GB |
| CPU | 1 core (etcd) + 1-2 cores (PostgreSQL) | 5-7 cores |
| Disk | 1 GB (etcd) + database size (PostgreSQL) | Varies |
| Network | Low bandwidth, but low latency matters | 1 Gbps sufficient |
For homelabs, you can co-locate etcd and PostgreSQL on the same nodes. Three nodes running both etcd and PostgreSQL is the most compact HA setup, though two PostgreSQL nodes with three lightweight etcd instances also works.
When HA Postgres Is Worth It
If you're running a single server, adding replication adds complexity without much benefit -- just run good backups instead. But if you have two or three machines and your services depend heavily on PostgreSQL, the setup described here gives you automatic failover, connection pooling, and the peace of mind that a hardware failure won't take your entire homelab offline. The initial setup takes an afternoon. The alternative is being the human failover mechanism at 3 AM.