PostgreSQL Replication and High Availability in Your Homelab
A single PostgreSQL instance is a single point of failure. If the disk dies, the container crashes at 3 AM, or you need to run maintenance, every service that depends on that database goes down with it. For a homelab running Nextcloud, Gitea, Grafana, and a handful of other services against the same Postgres instance, that's a bad Saturday morning.
PostgreSQL has built-in streaming replication that continuously ships WAL (Write-Ahead Log) records from a primary to one or more standby servers. Add Patroni on top for automatic failover, and pgBouncer in front for connection pooling, and you have a setup that survives node failures without manual intervention. It's the same architecture used in production at companies running millions of transactions per day, scaled down to homelab proportions.
Understanding PostgreSQL Replication
PostgreSQL offers several replication modes. For a homelab HA setup, streaming replication is what you want.
| Replication Type | Description | Use Case |
|---|---|---|
| Streaming (async) | Standby continuously receives WAL from primary. Slight lag possible. | Default, good for most homelabs |
| Streaming (sync) | Primary waits for standby confirmation before committing. Zero data loss. | When you can't lose a single transaction |
| Logical replication | Replicates specific tables/databases. Both sides are writable. | Selective replication, version upgrades |
| WAL archiving | Ships completed WAL files to storage. | Backup and point-in-time recovery |
Asynchronous streaming replication is the sweet spot for homelabs. It provides continuous replication with minimal performance impact on the primary. You might lose a fraction of a second of transactions during an unplanned failover, which is acceptable when your workload is Nextcloud file metadata and Gitea repositories, not financial transactions.
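If one of your databases genuinely can't tolerate that window, synchronous replication is a configuration change rather than a rebuild. A minimal sketch, assuming at least one standby is already streaming and you run this on the primary as the postgres superuser ('ANY 1 (*)' means any one connected standby must acknowledge each commit):
# Require at least one standby to confirm every commit
sudo -u postgres psql -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (*)';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"
Keep in mind that with this set, commits block whenever no standby is connected, which is exactly the wrong behavior for a two-node homelab limping along in a degraded state.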
Basic Streaming Replication Setup
Before adding Patroni, it's worth understanding how raw streaming replication works. This knowledge helps when debugging issues later.
Primary Server Configuration
On your primary PostgreSQL node, edit postgresql.conf:
# Replication settings
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
hot_standby = on
# Recommended for replication monitoring
track_commit_timestamp = on
# Listen on all interfaces
listen_addresses = '*'
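Changing wal_level only takes effect after a full restart, not a reload. Roughly this, though the service name varies by distribution and PostgreSQL version:
# Apply the new settings and confirm them
sudo systemctl restart postgresql
sudo -u postgres psql -c "SHOW wal_level;"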
Create a replication user:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replication-password';
Add the standby to pg_hba.conf:
# Allow replication connections from standby
host replication replicator 10.0.0.0/24 scram-sha-256
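pg_hba.conf is only re-read on reload, so apply the change before moving on to the standby. Something like:
# Reload to pick up the new pg_hba.conf entry (or run SELECT pg_reload_conf(); from psql)
sudo systemctl reload postgresql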
Standby Server Setup
On the standby, create a base backup from the primary:
# Stop PostgreSQL on the standby
sudo systemctl stop postgresql
# Remove existing data directory
sudo rm -rf /var/lib/postgresql/16/main/*
# Take a base backup from the primary
sudo -u postgres pg_basebackup \
-h 10.0.0.10 \
-U replicator \
-D /var/lib/postgresql/16/main \
-Fp -Xs -P -R
The -R flag creates standby.signal and configures primary_conninfo in postgresql.auto.conf, which tells PostgreSQL to start as a standby and connect to the primary for WAL streaming.
Start the standby:
sudo systemctl start postgresql
Verify replication is working on the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
You should see the standby listed with a state of streaming.
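On the standby side you can confirm it's in recovery and actively receiving WAL. A couple of checks along these lines, run against the standby (column names as of PostgreSQL 13+):
# Should return 't' on a standby
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# Shows the WAL receiver connection back to the primary
sudo -u postgres psql -c "SELECT status, sender_host, flushed_lsn FROM pg_stat_wal_receiver;"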
Patroni: Automatic Failover
Raw streaming replication doesn't handle failover. If the primary dies, the standby keeps running but in read-only mode. Someone (you, at 3 AM) has to manually promote the standby and reconfigure applications to point to the new primary.
Patroni automates this. It's a cluster manager for PostgreSQL that uses a distributed consensus store (etcd, Consul, or ZooKeeper) to elect a leader, monitor health, and orchestrate automatic failover. When the primary becomes unreachable, Patroni promotes the most up-to-date standby and updates the consensus store so clients know where the new primary is.
Architecture
A minimal Patroni setup needs:
- 2-3 PostgreSQL nodes (each running Patroni as a sidecar)
- etcd cluster (3 nodes for consensus -- can share hosts with PostgreSQL)
- pgBouncer or HAProxy (to route connections to the current primary)
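One way to lay the project out on disk, matching the build: ./patroni reference in the compose file below (directory names are just a suggestion):
pg-ha/
├── docker-compose.yml        # etcd, Patroni/PostgreSQL nodes (pgBouncer gets added later)
└── patroni/
    ├── Dockerfile            # postgres:16 + Patroni
    └── patroni.yml           # baseline Patroni config baked into the image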
Docker Compose Setup
Here's a complete three-node setup with etcd:
# docker-compose.yml
services:
etcd1:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd1
hostname: etcd1
environment:
ETCD_NAME: etcd1
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd1:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd1:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
etcd2:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd2
hostname: etcd2
environment:
ETCD_NAME: etcd2
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd2:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd2:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
etcd3:
image: quay.io/coreos/etcd:v3.5.16
container_name: etcd3
hostname: etcd3
environment:
ETCD_NAME: etcd3
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd3:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS: http://etcd3:2379
ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: pg-etcd-cluster
networks:
- pg-ha
pg-node1:
image: patroni-pg:latest
build: ./patroni
container_name: pg-node1
hostname: pg-node1
environment:
PATRONI_NAME: pg-node1
PATRONI_SCOPE: pg-ha-cluster
PATRONI_ETCD3_HOSTS: "'etcd1:2379','etcd2:2379','etcd3:2379'"
PATRONI_RESTAPI_CONNECT_ADDRESS: pg-node1:8008
PATRONI_POSTGRESQL_CONNECT_ADDRESS: pg-node1:5432
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_REPLICATION_USERNAME: replicator
PATRONI_REPLICATION_PASSWORD: rep-pass
PATRONI_SUPERUSER_USERNAME: postgres
PATRONI_SUPERUSER_PASSWORD: postgres-pass
volumes:
- pg-node1-data:/var/lib/postgresql/data
networks:
- pg-ha
pg-node2:
image: patroni-pg:latest
build: ./patroni
container_name: pg-node2
hostname: pg-node2
environment:
PATRONI_NAME: pg-node2
PATRONI_SCOPE: pg-ha-cluster
PATRONI_ETCD3_HOSTS: "'etcd1:2379','etcd2:2379','etcd3:2379'"
PATRONI_RESTAPI_CONNECT_ADDRESS: pg-node2:8008
PATRONI_POSTGRESQL_CONNECT_ADDRESS: pg-node2:5432
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_REPLICATION_USERNAME: replicator
PATRONI_REPLICATION_PASSWORD: rep-pass
PATRONI_SUPERUSER_USERNAME: postgres
PATRONI_SUPERUSER_PASSWORD: postgres-pass
volumes:
- pg-node2-data:/var/lib/postgresql/data
networks:
- pg-ha
volumes:
pg-node1-data:
pg-node2-data:
networks:
pg-ha:
driver: bridge
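Bring the stack up and make sure etcd has quorum before looking at Patroni. Something like this, using the container names above:
docker compose up -d --build
# All three etcd members should report healthy
docker exec etcd1 etcdctl \
  --endpoints=http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
  endpoint health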
Patroni Configuration
Each node needs a patroni.yml. In the Docker setup above, the PATRONI_* environment variables take precedence over the corresponding file settings, but here's the full file for reference:
# patroni.yml
scope: pg-ha-cluster
name: pg-node1
restapi:
listen: 0.0.0.0:8008
connect_address: pg-node1:8008
etcd3:
hosts:
- etcd1:2379
- etcd2:2379
- etcd3:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: "on"
max_wal_senders: 5
max_replication_slots: 5
wal_log_hints: "on"
initdb:
- encoding: UTF8
- data-checksums
postgresql:
listen: 0.0.0.0:5432
connect_address: pg-node1:5432
data_dir: /var/lib/postgresql/data
authentication:
superuser:
username: postgres
password: postgres-pass
replication:
username: replicator
password: rep-pass
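One thing this minimal file leaves out: Patroni writes pg_hba.conf from a pg_hba list if you provide one; without it you're left with initdb's defaults, which only allow local connections, so the standby's replication connection gets rejected. Adding something like this under the postgresql section fixes that (0.0.0.0/0 is lazy; tighten it to the Docker network's subnet if you prefer):
  pg_hba:
    - host replication replicator 0.0.0.0/0 scram-sha-256
    - host all all 0.0.0.0/0 scram-sha-256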
Patroni Dockerfile
# patroni/Dockerfile
FROM postgres:16
RUN apt-get update && apt-get install -y \
python3 python3-pip python3-venv \
&& python3 -m venv /opt/patroni \
&& /opt/patroni/bin/pip install patroni[etcd3] psycopg2-binary \
&& apt-get clean
COPY patroni.yml /etc/patroni.yml
# PostgreSQL refuses to start as root, so run Patroni as the postgres user
USER postgres
ENTRYPOINT ["/opt/patroni/bin/patroni", "/etc/patroni.yml"]
Checking Cluster Status
Patroni provides a CLI tool called patronictl:
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Expected output:
# + Cluster: pg-ha-cluster -------+---------+----+-----------+
# | Member   | Host     | Role    | State   | TL | Lag in MB |
# +----------+----------+---------+---------+----+-----------+
# | pg-node1 | pg-node1 | Leader  | running |  1 |           |
# | pg-node2 | pg-node2 | Replica | running |  1 |         0 |
# +----------+----------+---------+---------+----+-----------+
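patronictl can also perform a controlled switchover, which is the polite way to move the leader before rebooting a node for maintenance. A sketch (recent Patroni versions use --leader; older releases call the same flag --master):
# Hand leadership to pg-node2, then patch or reboot pg-node1 at your leisure
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml \
  switchover pg-ha-cluster --leader pg-node1 --candidate pg-node2 --force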
pgBouncer: Connection Pooling
PostgreSQL forks a new process for every client connection. With 20 services each opening 10 connections, that's 200 backend processes consuming memory. pgBouncer sits in front of PostgreSQL and multiplexes client connections onto a smaller pool of actual database connections.
# Add to docker-compose.yml
pgbouncer:
image: bitnami/pgbouncer:latest
container_name: pgbouncer
environment:
PGBOUNCER_DATABASE: "*"
PGBOUNCER_PORT: "6432"
PGBOUNCER_POOL_MODE: transaction
PGBOUNCER_MAX_CLIENT_CONN: "200"
PGBOUNCER_DEFAULT_POOL_SIZE: "20"
PGBOUNCER_MIN_POOL_SIZE: "5"
POSTGRESQL_HOST: pg-node1
POSTGRESQL_PORT: "5432"
POSTGRESQL_USERNAME: postgres
POSTGRESQL_PASSWORD: postgres-pass
ports:
- "6432:6432"
networks:
- pg-ha
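From a client's point of view nothing changes except the port: point services at pgBouncer on 6432 with the same credentials. For example, with a hostname that resolves to the pgBouncer container in your setup:
# Same user and database as before, just routed through the pooler
psql "host=pgbouncer.lan port=6432 user=postgres dbname=postgres" -c "SELECT 1;"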
Pool modes explained:
| Mode | Description | Compatibility |
|---|---|---|
| session | Connection allocated per client session. Released when client disconnects. | Full compatibility. Least efficient. |
| transaction | Connection allocated per transaction. Released between transactions. | Most features work. No SET commands, prepared statements need care. |
| statement | Connection allocated per statement. | Very restrictive. Only simple queries. |
Transaction pooling is the best balance for most homelab workloads. If a specific application needs session-level features (like LISTEN/NOTIFY), give it a direct connection to PostgreSQL instead of routing through pgBouncer.
Routing to the Current Primary
When Patroni fails over, the primary changes. Your services shouldn't need reconfiguration. Patroni exposes a REST API on each node:
- GET /primary returns HTTP 200 on the current primary
- GET /replica returns HTTP 200 on standbys
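You can poke these endpoints directly to see the behavior (container hostnames from the compose file above; Patroni answers 200 on the matching role and 503 otherwise):
curl -s -o /dev/null -w '%{http_code}\n' http://pg-node1:8008/primary
curl -s -o /dev/null -w '%{http_code}\n' http://pg-node2:8008/primary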
You can use HAProxy to health-check these endpoints and route traffic:
# haproxy.cfg
listen postgres-primary
bind *:5432
mode tcp
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-node1 pg-node1:5432 check port 8008
server pg-node2 pg-node2:5432 check port 8008
Or, if you're already running pgBouncer, use a script that queries the Patroni API and updates pgBouncer's target host on failover. Patroni supports callback scripts that fire on role changes.
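As a rough sketch of that second approach, here's what an on_role_change callback might look like. Treat it as an outline: the pgbouncer.ini path, the pgbouncer hostname, and the reload-via-admin-console step are all assumptions about your environment (for instance, a config shared over a volume), not something the stock images provide.
#!/bin/bash
# Wired up in patroni.yml under postgresql.callbacks.on_role_change.
# Patroni invokes callbacks as: <script> <action> <role> <scope>
ACTION="$1"
ROLE="$2"

# Only react when this node has just been promoted
# (older Patroni versions report the role as "master", newer ones as "primary")
if [ "$ACTION" = "on_role_change" ] && { [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; }; then
    # Hypothetical: repoint pgBouncer's wildcard database entry at this node...
    sed -i "s/host=[^ ]*/host=$(hostname)/" /etc/pgbouncer/pgbouncer.ini
    # ...and ask pgBouncer to re-read its config (assumes postgres is listed in admin_users)
    psql -h pgbouncer -p 6432 -U postgres -d pgbouncer -c "RELOAD;"
fi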
Testing Failover
Never trust a failover setup you haven't tested. Here's how:
# Check current cluster state
docker exec pg-node1 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Insert test data on the primary
docker exec pg-node1 psql -U postgres -c \
"CREATE TABLE failover_test (id serial, ts timestamp default now());
INSERT INTO failover_test DEFAULT VALUES;"
# Simulate primary failure
docker stop pg-node1
# Wait 30 seconds for failover
sleep 30
# Check cluster state - pg-node2 should be Leader now
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
# Verify data is intact
docker exec pg-node2 psql -U postgres -c "SELECT * FROM failover_test;"
# Bring the old primary back - it will rejoin as a replica
docker start pg-node1
sleep 15
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml list
After you bring pg-node1 back, Patroni uses pg_rewind to resynchronize its data directory with the new primary's timeline, and the node rejoins the cluster as a replica. This is why use_pg_rewind: true matters in the Patroni configuration.
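The TL column in patronictl list increments on every promotion, and patronictl history shows the same information as a log of past timeline switches, which helps when you're reconstructing what happened overnight:
docker exec pg-node2 /opt/patroni/bin/patronictl -c /etc/patroni.yml history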
Monitoring Replication
Add these queries to your Grafana dashboards:
-- Replication lag (run on primary)
SELECT
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication;
-- Replication slot status
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
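Byte-based lag is what Patroni uses for failover decisions, but a time-based number is often easier to read on a dashboard. A small addition you can run on a standby:
-- Approximate replay lag in seconds (run on a standby)
SELECT
  CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
       THEN 0
       ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
  END AS replay_lag_seconds;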
Set alerts for:
- Replication lag exceeding 10 MB (or whatever threshold matches your RPO)
- Replication slots becoming inactive (indicates a standby is down)
- pg_stat_replication returning no rows (all standbys disconnected)
Resource Considerations
A minimal two-node Patroni cluster with etcd requires:
| Component | Per Node | Total (2 PG + 3 etcd) |
|---|---|---|
| RAM | 512 MB (etcd) + 1 GB (PostgreSQL minimum) | ~4 GB |
| CPU | 1 core (etcd) + 1-2 cores (PostgreSQL) | 5-7 cores |
| Disk | 1 GB (etcd) + database size (PostgreSQL) | Varies |
| Network | Low bandwidth, but low latency matters | 1 Gbps sufficient |
For homelabs, you can co-locate etcd and PostgreSQL on the same nodes. Three nodes running both etcd and PostgreSQL is the most compact HA setup, though two PostgreSQL nodes with three lightweight etcd instances also works.
When HA Postgres Is Worth It
If you're running a single server, adding replication adds complexity without much benefit -- just run good backups instead. But if you have two or three machines and your services depend heavily on PostgreSQL, the setup described here gives you automatic failover, connection pooling, and the peace of mind that a hardware failure won't take your entire homelab offline. The initial setup takes an afternoon. The alternative is being the human failover mechanism at 3 AM.