Storage Replication with DRBD for High Availability
When you have data that can't go down — your database, your file server, your critical VMs — you need that data on more than one machine. DRBD (Distributed Replicated Block Device) solves this at the block level. It mirrors a partition or logical volume from one server to another in real time, essentially creating a network RAID 1 between two nodes.
DRBD operates below the filesystem layer. Your application writes to what looks like a normal block device. DRBD intercepts those writes and replicates them to the peer node over the network. If the primary node fails, the secondary already has an identical copy of the data and can take over immediately.
This isn't exotic enterprise technology. DRBD is open source, included in the Linux kernel since 2.6.33, and runs on commodity hardware. For a homelab aiming at real high availability, it's one of the most reliable approaches available.

When DRBD Makes Sense
DRBD is ideal when you need:
- Active-passive failover for VMs, databases, or file services
- Synchronous replication where data consistency matters more than performance
- Simple two-node HA without the complexity of a distributed storage system like Ceph
DRBD is less suitable for:
- Scaling beyond two nodes (DRBD 9 supports more, but the sweet spot is two)
- Large-scale distributed storage (use Ceph or GlusterFS instead)
- Situations where eventual consistency is acceptable (use rsync or Syncthing)
Prerequisites
You need two Linux servers (physical or virtual) with:
- A dedicated network connection between them (ideally a direct link or VLAN, kept separate from your regular LAN traffic)
- An unused partition or logical volume of the same size on each node
- Matching DRBD versions on both nodes
For this guide, we'll use:
- node1: 192.168.10.1 with /dev/sdb1
- node2: 192.168.10.2 with /dev/sdb1
- A dedicated replication network (192.168.10.0/24 in our example); a quick pre-flight check follows below
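This is a minimal sanity check from node1, assuming the addresses and device names above (adjust to your own layout):
# The backing partition exists and isn't mounted or already formatted
lsblk -f /dev/sdb1
# The replication link to node2 is reachable
ping -c 3 192.168.10.2
# Both partitions are the same size (run on each node and compare)
sudo blockdev --getsize64 /dev/sdb1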
Installation
On Debian/Ubuntu:
sudo apt update
sudo apt install -y drbd-utils
On Fedora/RHEL:
sudo dnf install -y drbd drbd-utils
Load the kernel module:
sudo modprobe drbd
echo drbd | sudo tee /etc/modules-load.d/drbd.conf
Verify on both nodes:
cat /proc/drbd
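The tool and module versions should line up on node1 and node2. A quick way to compare (run on both nodes; the exact output format differs between DRBD 8 and 9):
# Userspace tooling version
drbdadm --version
# Kernel module version
modinfo drbd | grep -w version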
Configuring a DRBD Resource
DRBD resources are defined in configuration files under /etc/drbd.d/. Create a resource file on both nodes — the file must be identical on each.
Create /etc/drbd.d/data.res on both nodes:
resource data {
  net {
    protocol C;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  disk {
    resync-rate 100M;
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.1:7789;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.2:7789;
    meta-disk internal;
  }
}
Key settings:
- protocol C — Synchronous replication. A write is only confirmed after both nodes have it. This is the safest option and the right choice for most homelab HA setups. Protocol A (asynchronous) and B (semi-synchronous) trade safety for performance.
- after-sb-* — Split-brain recovery policies. More on this below.
- resync-rate — Limits bandwidth during initial sync or recovery. Set this to match your dedicated link's capacity.
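One easy mistake to catch early: the names in the on node1 { ... } and on node2 { ... } blocks must match each node's actual hostname (uname -n), or drbdadm won't recognize the local node. You can have drbdadm parse the file and echo back what it understood before going further:
sudo drbdadm dump data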
Initializing the Resource
Run these commands on both nodes:
# Create DRBD metadata
sudo drbdadm create-md data
# Bring up the resource
sudo drbdadm up data
At this point, both nodes are connected but neither has valid data. You need to designate one as the initial primary. On node1:
# Force node1 as the initial sync source
sudo drbdadm primary --force data
This triggers a full sync from node1 to node2. Monitor progress:
watch cat /proc/drbd
Or with the newer tool:
sudo drbdadm status data
You'll see something like:
data role:Primary
  disk:UpToDate
  node2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:34.50
For a 500 GB disk over a 1 Gbps link, the initial sync takes roughly 70-80 minutes. Over 10 GbE, it's under 10 minutes. Don't interrupt the sync — let it complete before putting the resource into production.
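If you're scripting the setup, you can block until the initial sync finishes. A rough sketch that polls the status output (field names can vary slightly between DRBD versions):
# Wait until no device reports Inconsistent any more
while sudo drbdadm status data | grep -q Inconsistent; do
  sleep 30
done
echo "Initial sync complete"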
Creating a Filesystem
Once sync completes, create a filesystem on the primary node:
sudo mkfs.ext4 /dev/drbd0
Mount it:
sudo mkdir -p /mnt/data
sudo mount /dev/drbd0 /mnt/data
Write some test data:
echo "Hello from DRBD" | sudo tee /mnt/data/test.txt
You can only mount the filesystem on the primary node. The secondary node has the raw block data but doesn't mount it — that's the active-passive model.
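One practical note: don't add /dev/drbd0 to /etc/fstab as a normal boot-time mount, because the node may come up as secondary and the mount will fail. If you want a convenience entry, mark it noauto; a sketch (mount options are up to you):
# /etc/fstab entry, mounted manually or by a cluster manager, never at boot
/dev/drbd0  /mnt/data  ext4  noauto,defaults  0  0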
Failover
To switch which node is primary (planned maintenance, for example):
On the current primary (node1):
sudo umount /mnt/data
sudo drbdadm secondary data
On the new primary (node2):
sudo drbdadm primary data
sudo mount /dev/drbd0 /mnt/data
cat /mnt/data/test.txt # Should show "Hello from DRBD"
The data is there, byte-for-byte identical. This is the core of DRBD-based HA: the secondary always has current data, and promotion is instantaneous because there's no data to copy.
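The switchover is easy to wrap in a small helper on the node giving up the primary role. A sketch; the script name is a placeholder and the paths are this guide's example values:
#!/bin/bash
# demote-data.sh: release the DRBD resource so the peer can take over
set -e
sudo umount /mnt/data
sudo drbdadm secondary data
echo "Demoted. On the peer, run: drbdadm primary data && mount /dev/drbd0 /mnt/data"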
Split-Brain Recovery
Split-brain happens when both nodes think they're primary, usually because the network link between them goes down and an operator (or automated system) promotes the secondary. Now both nodes have divergent data.
The after-sb-* policies in our config handle common scenarios automatically:
- after-sb-0pri — Neither node is primary. discard-zero-changes keeps the changes from whichever node modified data; the unchanged node resyncs from it (if both nodes changed data, DRBD disconnects instead).
- after-sb-1pri — One node is primary. discard-secondary drops the secondary's changes.
- after-sb-2pri — Both nodes are primary. disconnect stops replication so you can fix it manually.
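You'll usually notice a split-brain from the kernel log and from the connection state dropping to StandAlone. A quick check (the exact message wording varies by version):
sudo dmesg | grep -i split-brain
sudo drbdadm status data   # look for connection:StandAlone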
If automatic recovery can't resolve it, you'll need to manually choose which node's data to keep:
# On the node whose data you want to DISCARD:
sudo drbdadm disconnect data
sudo drbdadm secondary data
sudo drbdadm -- --discard-my-data connect data
# On the node whose data you want to KEEP (only needed if it's in StandAlone):
sudo drbdadm connect data
The node with discarded data will resync from the survivor.
Dual-Primary Mode
DRBD can run with both nodes as primary simultaneously. This requires a cluster-aware filesystem like GFS2 or OCFS2 that handles concurrent writes with distributed locking. Regular filesystems like ext4 or XFS will be corrupted almost immediately if both nodes mount them at once.
Enable it in the resource config:
resource data {
  net {
    allow-two-primaries;
  }
  ...
}
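After editing the file on both nodes, apply the change and promote the second node. A sketch, assuming the resource from earlier is already up and in sync:
# On both nodes: re-read the changed configuration
sudo drbdadm adjust data
# On the node that is still secondary: promote it
sudo drbdadm primary data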
Then format with a cluster filesystem:
sudo mkfs.gfs2 -p lock_dlm -t cluster_name:data -j 2 /dev/drbd0
Dual-primary is needed for live migration of VMs between nodes (both need simultaneous access to the VM's disk). If you're using Proxmox or similar with DRBD-backed shared storage, this is the configuration you'll end up with.
Integration with Pacemaker/Corosync
For automated failover, pair DRBD with Pacemaker and Corosync. These clustering tools monitor node health and automatically promote the secondary if the primary fails.
Install the cluster stack:
sudo apt install -y pacemaker corosync
Configure Corosync for your two nodes, then create Pacemaker resources for DRBD:
sudo pcs resource create drbd_data ocf:linbit:drbd \
  drbd_resource=data \
  op monitor interval=15s
sudo pcs resource promotable drbd_data \
  promoted-max=1 promoted-node-max=1 \
  clone-max=2 clone-node-max=1
sudo pcs resource create fs_data ocf:heartbeat:Filesystem \
  device=/dev/drbd0 directory=/mnt/data fstype=ext4
sudo pcs constraint colocation add fs_data with drbd_data-clone INFINITY with-rsc-role=Promoted
sudo pcs constraint order promote drbd_data-clone then start fs_data
Now Pacemaker handles promotion, mounting, and failover automatically. If node1 goes down, Pacemaker promotes node2 and mounts the filesystem within seconds.
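To verify the automation end to end, put the active node in standby and watch the resources move (the standby subcommand lives under pcs node on current releases and under pcs cluster on older ones):
sudo pcs node standby node1
sudo pcs status      # fs_data and the promoted DRBD role should land on node2
sudo pcs node unstandby node1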
Performance Tips
Use a dedicated network for replication. DRBD traffic can saturate a link during initial sync or heavy write workloads. A separate VLAN or direct cable between nodes keeps replication traffic off your main network.
Match your resync rate to your link speed. Don't set resync-rate higher than your network can handle — it won't go faster but can cause congestion.
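On DRBD 8.4 and later you can also override the rate at runtime instead of editing the config, which is handy for letting an off-hours resync run flat out. A sketch based on the options documented in the DRBD user guide; check them against your version:
# Temporarily fix the resync rate for resource "data" and disable the dynamic controller
sudo drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M data
# Revert to the values from the configuration file
sudo drbdadm adjust data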
Use SSDs for the backing device. DRBD adds latency to every write (the network round-trip for synchronous replication). Starting with fast storage minimizes the impact.
Monitor with Prometheus. The DRBD exporter (drbd_exporter) exposes replication state, sync progress, and connection status as Prometheus metrics.
DRBD isn't flashy. It doesn't have a web UI or a marketing page with animations. It's a kernel module that copies blocks between machines, and it's been doing that reliably for over two decades. For a two-node homelab HA setup, it's hard to beat.