Shivering-Isles GitOps Infrastructure

This repository contains the Kubernetes objects that flux syncs and deploys, as well as the terraform definitions to set up the base infrastructure.

Note: Glue code to make the base infrastructure a usable Kubernetes cluster is still missing.

Usage

To use the repository properly, you'll need the koolbox-CLI. To install it, run make cli.

The next step is to set up the configuration using make cli-config. This opens an editor to write the environment file for koolbox. Put all the required access tokens here, along with any additional variables for terraform:

HCLOUD_TOKEN=<hetzner token>
CLOUDFLARE_EMAIL=<your cloudflare email>
CLOUDFLARE_API_TOKEN=<your cloudflare API token>
TF_VAR_dns_domain=<DNS base domain you want to use>
TF_VAR_dns_zone_id=<dns zone ID on Cloudflare>
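Before deploying, it can help to confirm that none of these variables were left empty. The following is a minimal sketch, assuming it is run inside koolbox where the environment file has been loaded; the function name check_koolbox_env is hypothetical and not part of this repository.

```shell
# Report which variables from the environment file are unset or empty.
# Prints one line per missing variable and returns non-zero if any are missing.
# check_koolbox_env is an illustrative name, not something koolbox provides.
check_koolbox_env() {
  missing=0
  for name in HCLOUD_TOKEN CLOUDFLARE_EMAIL CLOUDFLARE_API_TOKEN \
              TF_VAR_dns_domain TF_VAR_dns_zone_id; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_koolbox_env && make deploy would then only start the deployment once all five variables are set.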

Switch into koolbox using the koolbox command. All further commands are run inside koolbox.

Generate an ssh key for your koolbox container using ssh-keygen -t ed25519 inside the container. Upload the public key (cat ~/.ssh/id_ed25519.pub) to your project on Hetzner.

Note: The deployment will set up all hosts with all SSH keys that are uploaded to the project on Hetzner.

Deploy the infrastructure using make deploy. This will boot up the entire infrastructure on Hetzner Cloud and set up all DNS entries on Cloudflare. Wait for all machines to boot and reboot after cloud-init.

Note: You might have to run make deploy twice due to how Hetzner's terraform module works…
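Waiting for the machines to come up after make deploy can be scripted with a small retry helper. This is a sketch only: retry is a hypothetical helper, not part of koolbox, and an answering SSH port is just a rough signal, since the hosts reboot once more after cloud-init.

```shell
# Retry a command until it succeeds.
# Usage: retry <max attempts> <seconds between attempts> <command...>
# Returns 0 on the first success, 1 once all attempts have failed.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (not run here): poll node01 every 10 seconds, up to 30 times,
# until it accepts an SSH connection.
# retry 30 10 ssh -o ConnectTimeout=5 "node01.${TF_VAR_dns_domain}" true
```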

🏗️ Automation work from here still in progress 🏗️

With the infrastructure set up, it's time to deploy Kubernetes. To do that, bootstrap Kubernetes using ssh node01.${TF_VAR_dns_domain} kubeadm init --control-plane-endpoint "api.${TF_VAR_dns_domain}:6443" --upload-certs --pod-network-cidr "192.168.0.0/16". Store the control-plane and worker node join commands it prints for later. Then also enable the kubelet on the node permanently using ssh node01.${TF_VAR_dns_domain} systemctl enable kubelet.service.

Now join the other nodes into the cluster using ssh nodeXX.${TF_VAR_dns_domain} <kubeadm join command from above>, and enable their kubelet on boot using ssh nodeXX.${TF_VAR_dns_domain} systemctl enable kubelet.service. Replace nodeXX with each further node you deployed and want to add to your cluster.
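The per-node join step can be wrapped in a loop. The sketch below assumes three hypothetical variables that are not part of this repository: NODES (a space-separated list of node names), JOIN_CMD (the join command printed by kubeadm init) and DRY_RUN. By default it only prints the ssh commands so they can be reviewed first.

```shell
# Join each node in $NODES and enable its kubelet on boot.
# NODES, JOIN_CMD and DRY_RUN are illustrative names, not provided by koolbox.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
join_nodes() {
  for node in $NODES; do
    host="${node}.${TF_VAR_dns_domain}"
    for cmd in "$JOIN_CMD" "systemctl enable kubelet.service"; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "ssh ${host} ${cmd}"
      else
        ssh "${host}" "${cmd}"
      fi
    done
  done
}

# Example (not run here), with placeholder node names:
# NODES="node02 node03" JOIN_CMD="<kubeadm join command from above>" join_nodes
```

Set DRY_RUN=0 once the printed commands look right.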

The next step is to fetch the admin credentials/config using scp -4 node01.${TF_VAR_dns_domain}:/etc/kubernetes/admin.conf /root/.kube/config.

Optional: If you didn't deploy any worker nodes (the default), you have to untaint your master nodes to allow workloads on them using kubectl taint nodes --all node-role.kubernetes.io/master-. On Kubernetes 1.25 and newer, kubeadm uses the taint node-role.kubernetes.io/control-plane instead.

In order to make hetzner-CSI available in your cluster, deploy the secret from koolbox to the cluster using:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-csi
  namespace: kube-system
stringData:
  token: $HCLOUD_TOKEN
EOF

Note: This secret might just be stored in your gitops secrets, but for the sake of completeness it's mentioned here.

Finally, bootstrap fluxcd in your cluster. For SI-GitLab this would look like this:

export GITLAB_TOKEN=<project access token able to write the API and repository>
flux bootstrap gitlab \
  --hostname=git.shivering-isles.com \
  --ssh-hostname=git.shivering-isles.com:2222 \
  --ssh-key-algorithm ed25519 \
  --owner=<your user / team> \
  --repository=<your repository name> \
  --path=clusters/<your cluster name>

🏗️ Automation work until here still in progress 🏗️

Play around with things. Once you're done, you can clean up the whole mess using make destroy.

Note: Sadly, the Hetzner Cloud terraform module again isn't the greatest. Therefore make destroy might fail because the firewall is still in use, servers still have volumes attached, or the like. In this case, please remove the label selectors from the firewall rules and delete the volumes and servers by hand. Run make destroy once more to make sure everything was cleaned up properly.

Ideas & ToDo's

This toolchain is still under development. Before it is used in production, there are still some things left to do:

  • Automate infrastructure deployment
  • Provide CLI container that contains all tools
  • Automate overlay network deployment
  • Use encrypted overlay network (wireguard)
  • Automate cluster monitoring deployment
  • Automate ingress-controller deployment
  • Automate policy enforcement (kyverno) deployment
  • Provide a fully encrypted storage class (rook)
  • Automate ingress-controller default certificate deployment
  • Automate ingress-controller configuration for proxy-protocol
  • Automate hetzner cloud integration deployment (hetzner-cloud-controller-manager)
  • Document usage and thoughts in repository and blog posts
  • Automate deployment of Kubernetes
  • Automate flux bootstrap
  • Automate flux OpenPGP bootstrap
  • Enforce SELinux on the deployed machines (Currently conflicts with Rook)
  • Encrypt root filesystems for all nodes
  • Remove default storage class "hcloud-csi"
  • Integrate Renovatebot with this repository to manage updates
  • Migrate apps to gitops and Kubernetes
  • Move to immutable base-system
  • Automate system upgrades using Kubernetes
  • Automate system configuration using Kubernetes
  • Integrate with hcloud-dynfw
  • Automate deployment of cluster autoscaler

Assumption

Building smaller, more minimalistic, plain Kubernetes clusters will be cheaper than OpenShift with OKD and more stable, since etcd doesn't have to write a ton of data to disk and there aren't two API servers running that take up to 3GB of RAM per master node.

The goal is still to manage everything GitOps style, but more iteratively, slowly grinding the way forward before clusters become productive.

Original assumptions / Lessons Learned

This repository was originally focused on a setup based on OpenShift, OKD to be specific. Therefore some installations and settings might be based on the expectation of OKD's default setup instead of going the plain Kubernetes way of inventing everything ourselves.

Sadly this previous assumption didn't hold up. OpenShift on Hetzner Cloud resulted in quite annoying downtimes during upgrades. While the origin of the problem was not fully determined, it was proven that severe spikes in etcd fsync latency of up to 600ms played a major role in it.

Tools

To handle things properly, try to get the following tools (all included in koolbox):

  • kubectl
  • flux
  • sops (for secret handling)
  • helm (just for the sake of completeness and validation)
  • terraform
  • make
  • git