Kubernetes simplified backup

Wed, Apr 27, 2022 3-minute read


Backups are important, unless you enjoy spending a lot of time recreating what was lost, assuming that is even possible.

My Kubernetes cluster is mostly stateless, with any state stored outside the cluster on dedicated storage.

I initially backed up the cluster by doing full machine backups of all nodes, but that seemed silly: I can recreate a node in 5-10 minutes with my PXE setup, which boots and installs Rocky Linux plus all the prerequisites the machine needs to work as a Kubernetes node.

When the machine is up and running, it's just a matter of following my own cookbook, and then the machine should be part of a cluster.

So “all” I need is a backup of the cluster configuration, which is stored inside the cluster itself, in etcd.

Backup script

I found this page and copied most of the code from it, which gave me most of my control-plane node backup script:

export MAILTO='[email protected]'
export PBS_PASSWORD='cc535e84-b425-4bb4-8575-d8cb886d0e2f'

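cd /root

# Assumed value: the rest of this script reads and writes /root/backup
DIR=/root/backup

# Start from an empty backup directory
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi
mkdir "$DIR"

# Copy the configuration directories, staying on one filesystem (-x)
cp -rx /k8s/config "$DIR"
cp -rx /k8s/dockerconfig "$DIR"
cp -rx /k8s/kubeletconfig "$DIR"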

sudo docker run --rm -v $(pwd)/backup:/backup \
    --network host \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd:3.4.3-0 \
    etcdctl --endpoints= \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save /backup/etcd-snapshot-latest.db

/usr/local/sbin/proxmox-backup-client backup root.pxar:/root/backup --backup-id $BACKUP_ID --repository '[email protected][email protected]:backup'

Explained simply: it creates /root/backup, copies the configuration files into that directory, and then launches a Docker container that snapshots the etcd database and saves it into the same directory.
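Since the upload would happily store an empty or missing snapshot, a small sanity check between the snapshot step and the upload could be useful. This is only a minimal sketch: the function name snapshot_ok is made up for illustration, and it checks nothing more than that the file exists and is not empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of the backup script above):
# succeed only if the snapshot file exists and is non-empty.
snapshot_ok() {
    snapshot=$1
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi
    return 0
}
```

It could be called as `snapshot_ok /root/backup/etcd-snapshot-latest.db || exit 1` just before the proxmox-backup-client line, so a failed snapshot stops the script instead of being uploaded.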

This allows proxmox-backup-client to back up the important parts of the node to my backup server, where I can get to it easily in case I lose both of my control-plane nodes.

This script runs as a cron job on my primary control-plane node, so I always have a daily backup of the cluster.
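For anyone who wants to set up something similar, a daily entry in /etc/cron.d format could be generated as sketched here. The function name, the script path and the time of day are all assumptions for illustration; the actual cron setup is not shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/cron.d entry that runs a script
# once a day as root.
# Field order: minute hour day-of-month month day-of-week user command
daily_cron_line() {
    hour=$1
    minute=$2
    script=$3
    printf '%s %s * * * root %s\n' "$minute" "$hour" "$script"
}
```

For example, `daily_cron_line 2 30 /root/k8s-backup.sh > /etc/cron.d/k8s-backup` would schedule a (hypothetical) /root/k8s-backup.sh at 02:30 every day.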

Restore script

To restore, I would need to re-initialize the machine with my PXE boot and install the configuration and settings as per my setup guide. Then I would have to restore the Kubernetes configuration and the etcd data inside the cluster.

Something similar to:

export PBS_PASSWORD='cc535e84-b425-4bb4-8575-d8cb886d0e2f'
/usr/local/sbin/proxmox-backup-client restore host/$BACKUP_ID/2022-04-27T15:37:39Z root.pxar $DIR --repository '[email protected][email protected]:backup'

That will give me my backed-up data in /root/backup.
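The restore command above uses a fixed snapshot timestamp. If the newest snapshot is wanted instead, it could be picked from a list of snapshot paths as in this sketch. The function name latest_snapshot is made up, and how the list is produced (for example from the snapshot listing of proxmox-backup-client) is left open here. It relies on the UTC timestamps in the paths sorting correctly as plain text.

```shell
#!/bin/sh
# Hypothetical helper: read snapshot paths on stdin, one per line, in the
# form <type>/<id>/<timestamp>, keep those starting with the given
# "<type>/<id>" prefix and print the newest one.
# Timestamps like 2022-04-27T15:37:39Z sort chronologically as text.
latest_snapshot() {
    prefix=$1
    awk -v p="$prefix/" 'index($0, p) == 1' | LC_ALL=C sort | tail -n 1
}
```

The printed path could then be passed to the restore command in place of the hard-coded one.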

Then I can simply reverse the backup from above. My bind mounts are set up like this, so all my Docker and Kubernetes data lives below /k8s:

/k8s/docker            /var/lib/docker                            none    nofail,bind     0 0
/k8s/config            /etc/kubernetes                            none    nofail,bind     0 0
/k8s/dockerconfig      /etc/docker                                none    nofail,bind     0 0
/k8s/kubeletconfig     /var/lib/kubelet                           none    nofail,bind     0 0

sudo cp -rx "$DIR/config" /k8s
sudo cp -rx "$DIR/dockerconfig" /k8s
sudo cp -rx "$DIR/kubeletconfig" /k8s
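The copied files only show up under /etc/kubernetes, /etc/docker and /var/lib/kubelet if the bind mounts listed above are active. A check along these lines could confirm that. It is a sketch only: both function names are made up, and it assumes the bind entries live in an fstab-style file and that the util-linux mountpoint tool is installed.

```shell
#!/bin/sh
# Hypothetical helper: read fstab-style lines on stdin and print the
# mount point (field 2) of every entry whose options (field 4) include "bind".
bind_targets() {
    awk '$1 !~ /^#/ && $4 ~ /(^|,)bind(,|$)/ { print $2 }'
}

# Hypothetical helper: print every bind target from the given fstab file
# that is not currently a mount point.
unmounted_bind_targets() {
    bind_targets < "$1" | while read -r target; do
        mountpoint -q "$target" || echo "$target"
    done
}
```

Running `unmounted_bind_targets /etc/fstab` should print nothing when all bind mounts are in place.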

Then we restore etcd:

sudo mkdir -p /k8s/etcd

sudo docker run --rm \
    -v $(pwd)/backup:/backup \
    -v /k8s/etcd:/var/lib/etcd \
    --env ETCDCTL_API=3 \
    k8s.gcr.io/etcd:3.4.3-0 \
    /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-latest.db' && mv /default.etcd/member/ /var/lib/etcd/"
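Before moving on, it may be worth confirming that the restore really left an etcd data directory behind. A restored etcd member keeps its data in member/snap and member/wal, so a minimal check could look like this. The function name etcd_data_restored is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given directory contains the
# member/snap and member/wal directories of an etcd data directory.
etcd_data_restored() {
    data_dir=$1
    [ -d "$data_dir/member/snap" ] && [ -d "$data_dir/member/wal" ]
}
```

For the layout used here, that would be `etcd_data_restored /k8s/etcd || echo "etcd restore looks incomplete" >&2`.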

With etcd restored into the correct location in the filesystem, I can grab my cluster-init.yml file from my git repository and run kubeadm init:

sudo kubeadm init --config ~/cluster-init.yml --upload-certs --ignore-preflight-errors=DirAvailable--var-lib-etcd

The --ignore-preflight-errors=DirAvailable--var-lib-etcd argument tells kubeadm not to fail its preflight check when the etcd data directory already exists, so it re-uses the restored data instead of starting from an empty etcd.

If everything works as expected, the cluster should be up and running again and I should be able to re-add nodes etc.
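One way to check that the cluster is really back is to look for nodes that are not Ready. This sketch filters the output of `kubectl get nodes --no-headers`, where the second column is the node status. The function name not_ready_nodes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose status does not start with "Ready".
# A status such as "Ready,SchedulingDisabled" still counts as Ready.
not_ready_nodes() {
    awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}
```

Used as `kubectl get nodes --no-headers | not_ready_nodes`, it prints nothing once every node has reported Ready.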

Leave comments below if you want to add something.