High Availability with Patroni & etcd
Complete guide for setting up a PostgreSQL high-availability cluster using Patroni for automatic failover and etcd for distributed configuration storage.
Overview
This guide covers:
- Patroni: PostgreSQL cluster management tool with automatic failover
- etcd: Distributed key-value store for cluster state
- HAProxy: Load balancing across PostgreSQL nodes
- Watchdog: Prevention of split-brain scenarios
Architecture
                    ┌─────────────┐
                    │   HAProxy   │
                    │ 192.168.1.50│
                    │  Port: 5000 │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
  │  Node 1   │      │  Node 2   │      │  Node 3   │
  │  Patroni  │◄────►│  Patroni  │◄────►│  Patroni  │
  │ PostgreSQL│      │ PostgreSQL│      │ PostgreSQL│
  │   etcd    │      │   etcd    │      │   etcd    │
  └─────┬─────┘      └─────┬─────┘      └─────┬─────┘
        │                  │                  │
        └──────────────────┴──────────────────┘
                 etcd consensus cluster
Prerequisites
All three nodes should have:
- Ubuntu/Debian-based Linux
- Network connectivity between nodes
- Root or sudo access
- Same PostgreSQL version
Part 1: Install etcd on All Nodes
etcd provides the distributed consensus needed for cluster coordination.
Install etcd
On all three nodes:
# Download etcd
ETCD_VER=v3.5.9
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GITHUB_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
# Download and extract
curl -L ${DOWNLOAD_URL} -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp
# Install binaries
sudo cp /tmp/etcd-${ETCD_VER}-linux-amd64/etcd* /usr/local/bin/
sudo chmod +x /usr/local/bin/etcd*
# Verify installation
etcd --version
Configure etcd
Create etcd configuration directory:
sudo mkdir -p /etc/etcd
sudo mkdir -p /var/lib/etcd
Create etcd systemd service:
sudo nano /etc/systemd/system/etcd.service
On Node 1 (192.168.1.51):
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target
[Service]
Type=notify
User=etcd
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=node1
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.51:2380
Environment=ETCD_LISTEN_PEER_URLS=http://192.168.1.51:2380
Environment=ETCD_LISTEN_CLIENT_URLS=http://192.168.1.51:2379,http://127.0.0.1:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.51:2379
Environment=ETCD_INITIAL_CLUSTER_TOKEN=patroni-cluster
Environment=ETCD_INITIAL_CLUSTER=node1=http://192.168.1.51:2380,node2=http://192.168.1.52:2380,node3=http://192.168.1.53:2380
Environment=ETCD_INITIAL_CLUSTER_STATE=new
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
On Node 2 (192.168.1.52):
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target
[Service]
Type=notify
User=etcd
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=node2
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.52:2380
Environment=ETCD_LISTEN_PEER_URLS=http://192.168.1.52:2380
Environment=ETCD_LISTEN_CLIENT_URLS=http://192.168.1.52:2379,http://127.0.0.1:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.52:2379
Environment=ETCD_INITIAL_CLUSTER_TOKEN=patroni-cluster
Environment=ETCD_INITIAL_CLUSTER=node1=http://192.168.1.51:2380,node2=http://192.168.1.52:2380,node3=http://192.168.1.53:2380
Environment=ETCD_INITIAL_CLUSTER_STATE=new
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
On Node 3 (192.168.1.53):
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target
[Service]
Type=notify
User=etcd
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=node3
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.53:2380
Environment=ETCD_LISTEN_PEER_URLS=http://192.168.1.53:2380
Environment=ETCD_LISTEN_CLIENT_URLS=http://192.168.1.53:2379,http://127.0.0.1:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.53:2379
Environment=ETCD_INITIAL_CLUSTER_TOKEN=patroni-cluster
Environment=ETCD_INITIAL_CLUSTER=node1=http://192.168.1.51:2380,node2=http://192.168.1.52:2380,node3=http://192.168.1.53:2380
Environment=ETCD_INITIAL_CLUSTER_STATE=new
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
Create etcd User
On all nodes:
sudo useradd -r -s /bin/false etcd
sudo chown -R etcd:etcd /var/lib/etcd
Start etcd
On all nodes:
sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
sudo systemctl status etcd
Verify etcd Cluster
On any node:
# List cluster members
etcdctl member list
# Check the health of every member
etcdctl endpoint health --cluster
Expected output:
http://192.168.1.51:2379 is healthy: successfully committed proposal: took = 1.234ms
http://192.168.1.52:2379 is healthy: successfully committed proposal: took = 1.567ms
http://192.168.1.53:2379 is healthy: successfully committed proposal: took = 1.890ms
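As an optional sanity check, confirm that writes reach the cluster and can be read back; the key name below is arbitrary:
# Write a test key, read it back, then clean up
etcdctl put /sanity/ping "pong"
etcdctl get /sanity/ping
etcdctl del /sanity/ping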
Part 2: Install PostgreSQL on All Nodes
Install PostgreSQL on all three nodes:
# Add PostgreSQL repository
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
# Import repository key
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
# Update and install
sudo apt-get update
sudo apt-get install -y postgresql-15 postgresql-contrib-15
# Stop PostgreSQL (Patroni will manage it)
sudo systemctl stop postgresql
sudo systemctl disable postgresql
Part 3: Install Patroni on All Nodes
Patroni manages the PostgreSQL cluster and handles automatic failover.
Install Dependencies
On all nodes:
sudo apt-get install -y python3-pip python3-dev libpq-dev
# Install Patroni with etcd v3 support (etcd 3.5 serves the v3 API; the legacy v2 API is disabled by default)
sudo pip3 install 'patroni[etcd3]' psycopg2-binary
Configure Patroni
Create Patroni configuration directory:
sudo mkdir -p /etc/patroni
On Node 1 (192.168.1.51):
sudo nano /etc/patroni/patroni.yml
scope: postgres-cluster
namespace: /db/
name: node1

restapi:
  listen: 192.168.1.51:8008
  connect_address: 192.168.1.51:8008

etcd3:
  hosts: 192.168.1.51:2379,192.168.1.52:2379,192.168.1.53:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_connections: 100
        max_wal_senders: 10
        wal_keep_size: 256MB
        max_prepared_transactions: 0
        max_locks_per_transaction: 64
        track_commit_timestamp: "off"
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 127.0.0.1/32 scram-sha-256
    - host replication replicator 192.168.1.0/24 scram-sha-256
    - host all all 0.0.0.0/0 scram-sha-256
  users:
    admin:
      password: admin_password
      options:
        - createrole
        - createdb
    replicator:
      password: replicator_password
      options:
        - replication

postgresql:
  listen: 192.168.1.51:5432
  connect_address: 192.168.1.51:5432
  data_dir: /var/lib/postgresql/15/main
  bin_dir: /usr/lib/postgresql/15/bin
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    unix_socket_directories: '/var/run/postgresql'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
On Node 2 (192.168.1.52):
Create /etc/patroni/patroni.yml with the same content, but change:
- name: node2
- restapi.listen: 192.168.1.52:8008
- restapi.connect_address: 192.168.1.52:8008
- postgresql.listen: 192.168.1.52:5432
- postgresql.connect_address: 192.168.1.52:5432
On Node 3 (192.168.1.53):
Create /etc/patroni/patroni.yml with the same content, but change:
- name: node3
- restapi.listen: 192.168.1.53:8008
- restapi.connect_address: 192.168.1.53:8008
- postgresql.listen: 192.168.1.53:5432
- postgresql.connect_address: 192.168.1.53:5432
Make sure to change the passwords in production! The passwords shown here are examples only.
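Before wiring Patroni into systemd, it is worth validating each node's file. A minimal check, assuming Patroni 2.x or newer (which ships a --validate-config option):
# Validate the Patroni configuration
patroni --validate-config /etc/patroni/patroni.yml
# At minimum, confirm the file parses as YAML
python3 -c "import yaml; yaml.safe_load(open('/etc/patroni/patroni.yml'))"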
Create Patroni Systemd Service
On all nodes:
sudo nano /etc/systemd/system/patroni.service
[Unit]
Description=Patroni PostgreSQL High-Availability Manager
After=syslog.target network.target etcd.service
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=always
[Install]
WantedBy=multi-user.target
Set Permissions
On all nodes:
sudo chown postgres:postgres /etc/patroni/patroni.yml
sudo chmod 600 /etc/patroni/patroni.yml
Start Patroni
Important: Start Patroni one node at a time, waiting for each to initialize:
# On Node 1 first
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
sudo systemctl status patroni
# Wait 30 seconds, then on Node 2
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
# Wait 30 seconds, then on Node 3
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
Verify Cluster Status
On any node:
patronictl -c /etc/patroni/patroni.yml list
Expected output:
+ Cluster: postgres-cluster -------+---------+----+-----------+
| Member | Host          | Role    | State   | TL | Lag in MB |
+--------+---------------+---------+---------+----+-----------+
| node1  | 192.168.1.51  | Leader  | running |  1 |           |
| node2  | 192.168.1.52  | Replica | running |  1 |         0 |
| node3  | 192.168.1.53  | Replica | running |  1 |         0 |
+--------+---------------+---------+---------+----+-----------+
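You can also ask each node's Patroni REST API for its own view of its role and state; this is the same API HAProxy polls for health checks in Part 4. jq is only used for readability, and depending on the Patroni version the role is reported as master or primary:
# Query node1's Patroni REST API
curl -s http://192.168.1.51:8008/patroni | jq '{role: .role, state: .state}'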
Part 4: HAProxy Setup
HAProxy distributes client connections across the PostgreSQL cluster.
Install HAProxy
On a separate server (or on one of the nodes):
sudo apt-get install -y haproxy
Configure HAProxy
sudo nano /etc/haproxy/haproxy.cfg
global
    maxconn 100
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen postgres_write
    bind *:5000
    option httpchk GET /
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 192.168.1.51:5432 maxconn 100 check port 8008
    server node2 192.168.1.52:5432 maxconn 100 check port 8008
    server node3 192.168.1.53:5432 maxconn 100 check port 8008

listen postgres_read
    bind *:5001
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 192.168.1.51:5432 maxconn 100 check port 8008
    server node2 192.168.1.52:5432 maxconn 100 check port 8008
    server node3 192.168.1.53:5432 maxconn 100 check port 8008
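Before (re)starting the service, check the configuration for syntax errors:
# Validate haproxy.cfg without starting HAProxy
haproxy -c -f /etc/haproxy/haproxy.cfg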
Start HAProxy
sudo systemctl enable haproxy
sudo systemctl restart haproxy
sudo systemctl status haproxy
Test HAProxy
View stats dashboard:
http://192.168.1.50:7000
Connect to primary (write):
psql -h 192.168.1.50 -p 5000 -U admin postgres
Connect to replicas (read):
psql -h 192.168.1.50 -p 5001 -U admin postgres
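To confirm the routing, ask PostgreSQL which server answered and whether it is in recovery; port 5000 should report false (primary) and port 5001 true (replica):
# Which backend did we land on, and is it a replica?
psql -h 192.168.1.50 -p 5000 -U admin -d postgres -c "SELECT inet_server_addr(), pg_is_in_recovery();"
psql -h 192.168.1.50 -p 5001 -U admin -d postgres -c "SELECT inet_server_addr(), pg_is_in_recovery();"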
Part 5: Testing Failover
Manual Failover
Promote a replica to leader:
# On any node
patronictl -c /etc/patroni/patroni.yml switchover
You'll be prompted to confirm the switchover.
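For scripted switchovers, patronictl can take the participants on the command line instead of prompting. The flag name for the current leader varies across Patroni versions (--master, --leader, or --primary), so check patronictl switchover --help on your installation first:
# Non-interactive switchover from node1 to node2 (flag name depends on Patroni version)
patronictl -c /etc/patroni/patroni.yml switchover --leader node1 --candidate node2 --force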
Simulate Node Failure
Stop Patroni on the current leader:
# On the leader node
sudo systemctl stop patroni
Patroni will automatically promote a replica within 30 seconds (TTL setting).
Check cluster status:
patronictl -c /etc/patroni/patroni.yml list
The former replica should now be the leader.
Restore Failed Node
Start Patroni again:
sudo systemctl start patroni
The node will automatically rejoin as a replica.
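If the old leader has diverged too far for pg_rewind to repair, it may refuse to rejoin. In that case you can rebuild it from the current leader; note that this wipes the member's local data directory:
# Reinitialize a diverged member (here node1) from the current leader
patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster node1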
Part 6: Monitoring
Check Cluster Status
# List cluster members
patronictl -c /etc/patroni/patroni.yml list
# Show detailed history
patronictl -c /etc/patroni/patroni.yml history
# Check configuration
patronictl -c /etc/patroni/patroni.yml show-config
Check etcd Cluster
# Member list
etcdctl member list
# Cluster health
etcdctl endpoint health --cluster
# Get cluster configuration
etcdctl get --prefix /db/
Monitor HAProxy
View HAProxy stats:
http://192.168.1.50:7000
Check HAProxy logs:
sudo tail -f /var/log/haproxy.log
PostgreSQL Monitoring
On any node:
# Check replication status
psql -U postgres -c "SELECT * FROM pg_stat_replication;"
# Check if node is in recovery
psql -U postgres -c "SELECT pg_is_in_recovery();"
# Check replication lag
psql -U postgres -c "SELECT client_addr, state, sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;"
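As a starting point for alerting, the same query can be wrapped in a small check that fails when any replica lags beyond a threshold; the 16 MB value below is only an example:
#!/bin/bash
# Exit non-zero if any replica lags more than ~16 MB behind (run on the leader)
LAG=$(psql -U postgres -At -c "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)::bigint FROM pg_stat_replication;")
if [ "$LAG" -gt 16777216 ]; then
    echo "Replication lag is ${LAG} bytes"
    exit 1
fi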
Automation with autobase
For automated deployment of PostgreSQL HA clusters, consider using the autobase project by Vitaly Kukharik.
About autobase
Repository: https://github.com/vitabaks/autobase
autobase is an Ansible-based automation tool for deploying PostgreSQL high-availability clusters with:
- Patroni for automatic failover
- etcd/Consul/ZooKeeper for DCS
- HAProxy for load balancing and PgBouncer for connection pooling
- Optional monitoring with Prometheus/Grafana
Features
- One-command cluster deployment
- Support for cloud providers (AWS, GCP, Azure, DigitalOcean)
- Automatic SSL certificate generation
- Backup configuration (pgBackRest, WAL-G)
- Monitoring stack setup
- Rolling updates
Quick Start with autobase
# Clone repository
git clone https://github.com/vitabaks/autobase.git
cd autobase
# Install requirements
pip3 install -r requirements.txt
# Edit inventory
nano inventory
# Configure variables
nano vars/main.yml
# Deploy cluster
ansible-playbook deploy_pgcluster.yml
Troubleshooting
Patroni Won't Start
Check logs:
sudo journalctl -u patroni -f
Common issues:
- PostgreSQL data directory already exists (remove it for a fresh install)
- Wrong permissions on /etc/patroni/patroni.yml
- etcd not accessible
- Port conflicts (5432, 8008); quick checks for these last two follow below
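# Is etcd reachable from this node?
etcdctl endpoint health
# Is something else already listening on the PostgreSQL or Patroni ports?
sudo ss -tlnp | grep -E ':(5432|8008)\b'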
etcd Connection Failed
Check etcd is running:
sudo systemctl status etcd
etcdctl endpoint health
Test connectivity:
curl http://192.168.1.51:2379/health
Split-Brain Scenario
If cluster shows multiple leaders:
- Stop Patroni on all nodes:
sudo systemctl stop patroni
- Clear etcd cluster data:
etcdctl del --prefix /db/
- Start Patroni one node at a time:
# Node 1
sudo systemctl start patroni
# Wait 30s
# Node 2
sudo systemctl start patroni
# Wait 30s
# Node 3
sudo systemctl start patroni
HAProxy Shows All Nodes Down
Check Patroni REST API is accessible:
curl http://192.168.1.51:8008/
Should return HTTP 200 for leader, 503 for replica.
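To check all three nodes at once from the HAProxy host:
# Print the HTTP status each Patroni REST API returns (200 = leader, 503 = replica)
for ip in 192.168.1.51 192.168.1.52 192.168.1.53; do
    echo -n "$ip: "
    curl -s -o /dev/null -w '%{http_code}\n' http://$ip:8008/
done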
Verify HAProxy can reach nodes:
sudo tcpdump -i any port 8008
Best Practices
- Use Odd Number of Nodes: 3 or 5 nodes for quorum
- Monitor etcd Health: Critical for cluster consensus
- Set Appropriate TTL: Balance between fast failover and stability
- Use Synchronous Replication: For zero data loss (a configuration sketch follows this list)
- Regular Backup Testing: Patroni is not a backup solution
- Monitor Replication Lag: Alert on excessive lag
- Document Failover Procedures: Train your team
- Test Failover Regularly: In development/staging
- Use Watchdog: Prevent split-brain scenarios (advanced)
- Secure Communications: Use SSL/TLS in production
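Synchronous replication in a Patroni cluster is enabled through the DCS configuration rather than by editing postgresql.conf directly. One way to turn it on, sketched here and assuming a patronictl version whose edit-config supports the -s/--set option:
# Enable synchronous replication cluster-wide via the DCS
patronictl -c /etc/patroni/patroni.yml edit-config --set synchronous_mode=true --force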
Security Enhancements
Enable SSL for PostgreSQL
Generate certificates:
# On all nodes
sudo -u postgres openssl req -new -x509 -days 365 -nodes \
-text -out /var/lib/postgresql/15/main/server.crt \
-keyout /var/lib/postgresql/15/main/server.key \
-subj "/CN=postgres"
sudo chmod 600 /var/lib/postgresql/15/main/server.key
Update patroni.yml:
postgresql:
  parameters:
    ssl: "on"
    ssl_cert_file: '/var/lib/postgresql/15/main/server.crt'
    ssl_key_file: '/var/lib/postgresql/15/main/server.key'
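After updating patroni.yml on each node, ask Patroni to re-read its configuration; ssl is a reload-level PostgreSQL parameter, so a restart should not be needed:
# Reload Patroni (and PostgreSQL) configuration on all members
patronictl -c /etc/patroni/patroni.yml reload postgres-cluster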
Secure etcd with TLS
Generate certificates for etcd:
# Create CA and certificates
# See etcd security documentation
Update etcd service to use TLS.
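As a sketch of what that involves, the TLS settings map to these environment variables in the etcd unit file; the certificate paths are placeholders, and the http:// URLs in the existing Environment= lines must be switched to https:// as well:
# Example TLS settings for /etc/systemd/system/etcd.service (paths are placeholders)
Environment=ETCD_CERT_FILE=/etc/etcd/pki/server.crt
Environment=ETCD_KEY_FILE=/etc/etcd/pki/server.key
Environment=ETCD_TRUSTED_CA_FILE=/etc/etcd/pki/ca.crt
Environment=ETCD_CLIENT_CERT_AUTH=true
Environment=ETCD_PEER_CERT_FILE=/etc/etcd/pki/peer.crt
Environment=ETCD_PEER_KEY_FILE=/etc/etcd/pki/peer.key
Environment=ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/pki/ca.crt
Environment=ETCD_PEER_CLIENT_CERT_AUTH=true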
Restrict Network Access
Use firewall rules:
# Allow only cluster nodes
sudo ufw allow from 192.168.1.0/24 to any port 2379 # etcd client
sudo ufw allow from 192.168.1.0/24 to any port 2380 # etcd peer
sudo ufw allow from 192.168.1.0/24 to any port 5432 # PostgreSQL
sudo ufw allow from 192.168.1.0/24 to any port 8008 # Patroni REST API
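If HAProxy runs on its own host (192.168.1.50 in this guide), also allow the client-facing ports there before enabling the firewall; adjust the source network to wherever your applications actually connect from, and make sure SSH stays allowed:
# On the HAProxy host
sudo ufw allow OpenSSH
sudo ufw allow from 192.168.1.0/24 to any port 5000   # writes (primary)
sudo ufw allow from 192.168.1.0/24 to any port 5001   # reads (replicas)
sudo ufw allow from 192.168.1.0/24 to any port 7000   # HAProxy stats
sudo ufw enable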