Kubernetes simplified backup
Introduction
Backups are important, unless you enjoy spending a lot of time recreating whatever was lost - if that is even possible.
My kubernetes cluster is mostly stateless, with any state stored outside of the cluster on dedicated storage.
So to make backing up the cluster easier, I initially started doing full machine backups of all nodes, but that seemed silly since I can recreate a node in 5-10 minutes with my PXE setup, which boots and installs Rocky Linux plus all the prerequisites required for the machine to work as a kubernetes node.
When the machine is up and running it's just a matter of following my own cook book, and then the machine should be part of the cluster.
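For reference, the "join the cluster" part of that cook book boils down to something like this - a rough sketch assuming a standard kubeadm setup, where the endpoint address, token and hash are placeholders generated on the control plane:
# On an existing control-plane node: print a fresh join command
sudo kubeadm token create --print-join-command
# On the newly installed node: run the printed command, which looks roughly like
sudo kubeadm join 192.168.1.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>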
So “all” I need is a backup of the cluster configuration, which is stored inside the cluster itself, in etcd.
Backup script
I found this page, from which I copied most of the code, and that gave me most of my control node backup script:
#!/bin/sh
export MAILTO='[email protected]'
export PBS_PASSWORD='cc535e84-b425-4bb4-8575-d8cb886d0e2f'
DIR='/root/backup'
cd /root

# Start from a clean backup directory
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi
mkdir "$DIR"

# Copy the kubernetes/docker/kubelet configuration (bind-mounted under /k8s)
cp -rx /k8s/config "$DIR"
cp -rx /k8s/dockerconfig "$DIR"
cp -rx /k8s/kubeletconfig "$DIR"

# Snapshot etcd into the same directory, using etcdctl from the etcd image
sudo docker run --rm -v "$(pwd)/backup:/backup" \
    --network host \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd:3.4.3-0 \
    etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save /backup/etcd-snapshot-latest.db

# Ship the directory off to the Proxmox Backup Server, using the hostname as backup id
BACKUP_ID=$(hostname)
/usr/local/sbin/proxmox-backup-client backup root.pxar:/root/backup --backup-id "$BACKUP_ID" --repository '[email protected][email protected]:backup'
Explained simply - it creates /root/backup, copies the configuration files into that directory, and then launches a Docker container that snapshots the etcd database and saves it into the same directory.
This allows proxmox-backup-client to back up the important parts of the node to my backup server, where I can get to it easily in case I lose both of my control plane nodes.
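If I want to be a little paranoid, the snapshot can be sanity-checked before it is shipped off - an optional extra, not part of the script above, using the same etcd image:
sudo docker run --rm -v /root/backup:/backup \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd:3.4.3-0 \
    etcdctl snapshot status /backup/etcd-snapshot-latest.db --write-out=table
It prints the snapshot's hash, revision, key count and size, which is enough to spot an empty or truncated file.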
This script runs on my primary control-plane node as a cron job - so I always have a daily backup of the cluster.
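For completeness, the cron part is nothing fancy - something along these lines, assuming the script is saved as /root/bin/k8s-backup.sh (the path and schedule are of course just examples):
# /etc/cron.d/k8s-backup - run the cluster backup every night at 02:30 as root
30 2 * * * root /root/bin/k8s-backup.sh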
Restore script
To restore, I would need to re-initialize the machine with my PXE boot and apply the configuration/settings as per my setup guide - and then I would have to restore the kubernetes configuration and restore the etcd data inside the cluster.
Something similar to:
#!/bin/sh
DIR='/root/backup'
export PBS_PASSWORD='cc535e84-b425-4bb4-8575-d8cb886d0e2f'
BACKUP_ID=$(hostname)

# Pull the chosen snapshot back from the Proxmox Backup Server into /root/backup
/usr/local/sbin/proxmox-backup-client restore host/$BACKUP_ID/2022-04-27T15:37:39Z root.pxar "$DIR" --repository '[email protected][email protected]:backup'
That will give me my backed up data located in /root/backup
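The timestamp in the restore path selects one specific snapshot; to see which snapshots exist on the server I can list them first. A rough sketch - the exact listing subcommand varies between proxmox-backup-client versions, so check proxmox-backup-client help if it differs:
export PBS_PASSWORD='cc535e84-b425-4bb4-8575-d8cb886d0e2f'
/usr/local/sbin/proxmox-backup-client snapshot list --repository '[email protected][email protected]:backup'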
Then I can simply reverse my backup from above:
Keep in mind that this is how my bind mounts are made, so all my docker/kubernetes information is below /k8s:
/k8s/docker /var/lib/docker none nofail,bind 0 0
/k8s/config /etc/kubernetes none nofail,bind 0 0
/k8s/dockerconfig /etc/docker none nofail,bind 0 0
/k8s/kubeletconfig /var/lib/kubelet none nofail,bind 0 0
sudo cp -rx $DIR/config /k8s
sudo cp -rx $DIR/dockerconfig /k8s
sudo cp -rx $DIR/kubeletconfig /k8s
Then we restore etcd:
sudo mkdir -p /k8s/etcd
# Restore the snapshot (it lands in /default.etcd by default) and move the resulting
# member directory into /var/lib/etcd, which is bind-mounted to /k8s/etcd on the host
sudo docker run --rm \
    -v "$(pwd)/backup:/backup" \
    -v /k8s/etcd:/var/lib/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd:3.4.3-0 \
    /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-latest.db' ; mv /default.etcd/member/ /var/lib/etcd/"
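As a quick sanity check (my own addition, not strictly required), the restored data should now be visible on the host as an etcd member directory containing snap and wal subdirectories:
sudo ls /k8s/etcd/member
# should list: snap  wal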
With etcd restored into the correct location in the filesystem, I can grab my cluster-init.yml file from my git repository and run the cluster init:
sudo kubeadm init --config ~/cluster-init.yml --upload-certs --ignore-preflight-errors=DirAvailable--var-lib-etcd
The --ignore-preflight-errors=DirAvailable--var-lib-etcd
argument simply tells kubeadm not to complain about the already existing, non-empty /var/lib/etcd directory, so it re-uses the restored etcd data instead of initializing a new, empty etcd.
If everything works as expected, the cluster should be up and running again and I should be able to re-add nodes etc.
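To round it off, this is roughly how I would verify that the restored control plane is healthy - a sketch, assuming the usual kubectl workflow; the worker nodes can then be re-added with a fresh kubeadm join command:
# Use the restored admin kubeconfig and check the cluster state
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
kubectl get pods --all-namespaces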
Leave comments below if you want to add something.