Shivering-Isles GitOps Infrastructure
===

This repository contains the Kubernetes objects that [flux](https://fluxcd.io) syncs and deploys, as well as the terraform definitions that set up the base infrastructure.
**Note**: *Glue code to make the base infrastructure a usable Kubernetes cluster is still missing.*

Usage
---

To use this repository you need the `koolbox` CLI, which you can install by running `make cli`.

Next, set up the configuration using `make cli-config`. This opens an editor for the `koolbox` environment file. Put all required access tokens there, along with any additional variables for terraform:

```
HCLOUD_TOKEN=<hetzner token>
CLOUDFLARE_EMAIL=<your cloudflare email>
CLOUDFLARE_API_TOKEN=<your cloudflare API token>
TF_VAR_dns_domain=<DNS base domain you want to use>
TF_VAR_dns_zone_id=<dns zone ID on Cloudflare>
```
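Before deploying, it can help to confirm that these variables actually reached your shell. The following is a minimal sketch, assuming the environment file is exported into the `koolbox` session; `check_koolbox_env` is a hypothetical helper name and not part of `koolbox`.

```shell
# Hypothetical helper (not part of koolbox): report which variables from the
# environment file above are unset or empty in the current shell.
check_koolbox_env() {
  missing=0
  for name in HCLOUD_TOKEN CLOUDFLARE_EMAIL CLOUDFLARE_API_TOKEN \
              TF_VAR_dns_domain TF_VAR_dns_zone_id; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Returns 0 only if every variable is set and non-empty.
  return "$missing"
}
```

Running `check_koolbox_env && make deploy` would then refuse to deploy while anything is missing.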

Switch into koolbox using the `koolbox` command. All further commands are run inside koolbox.

Generate an SSH key for your `koolbox` container using `ssh-keygen -t ed25519` inside the container. Upload the public key (`cat ~/.ssh/id_ed25519.pub`, assuming the default file name) to your project on Hetzner.

**Note:** *The deployment will set up all hosts with all SSH keys that are uploaded to the project on Hetzner.*

Deploy the infrastructure using `make deploy`. This boots the entire infrastructure on Hetzner Cloud and sets up all DNS entries on Cloudflare. Wait for all machines to boot and reboot after cloud-init.

**Note:** *You might have to run `make deploy` twice due to how Hetzner's terraform module works…*
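If you would rather not re-run the command by hand, a small retry wrapper is one option. This is only a sketch; `retry` is a hypothetical helper defined here, not something the repository provides.

```shell
# Hypothetical helper: run a command up to N times and stop at the first success.
retry() {
  attempts=$1
  shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed: $*" >&2
    n=$((n + 1))
  done
  # Every attempt failed.
  return 1
}
```

Inside koolbox, `retry 2 make deploy` would then run the deployment a second time only if the first run fails.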

🏗️ Automation work from here still in progress 🏗️

With the infrastructure set up, it's time to deploy Kubernetes. Bootstrap Kubernetes using `ssh node01.${TF_VAR_dns_domain} kubeadm init --control-plane-endpoint "api.${TF_VAR_dns_domain}:6443" --upload-certs --pod-network-cidr "192.168.0.0/16"`. Store the control-plane and worker node join commands it prints for later. Then enable the kubelet on the node permanently using `ssh node01.${TF_VAR_dns_domain} systemctl enable kubelet.service`.

Now join the other nodes into the cluster using `ssh nodeXX.${TF_VAR_dns_domain} <kubeadm join command from above>`, and enable their `kubelet` on boot using `ssh nodeXX.${TF_VAR_dns_domain} systemctl enable kubelet.service`. Replace `nodeXX` with each further node you deployed and want to add to your cluster.

The next step is to fetch the admin credentials/config using `scp -4 node01.${TF_VAR_dns_domain}:/etc/kubernetes/admin.conf /root/.kube/config`.

Optional: If you didn't deploy any worker nodes (the default), you have to untaint your master nodes to allow workloads on them using `kubectl taint nodes --all node-role.kubernetes.io/master-`.
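With more than a couple of nodes, generating the per-node commands can save some typing. The sketch below only prints the commands so you can review them first. `join_commands` and the `JOIN_CMD` variable are hypothetical names, and `JOIN_CMD` is assumed to hold the join command stored earlier.

```shell
# Hypothetical helper: print the join and kubelet-enable commands for each node.
# Assumes JOIN_CMD holds the kubeadm join command printed by `kubeadm init`.
join_commands() {
  : "${JOIN_CMD:?set JOIN_CMD to the kubeadm join command}"
  domain=$1
  shift
  for node in "$@"; do
    echo "ssh ${node}.${domain} ${JOIN_CMD}"
    echo "ssh ${node}.${domain} systemctl enable kubelet.service"
  done
}
```

For example, `join_commands "$TF_VAR_dns_domain" node02 node03` prints four lines; the node names are placeholders for whatever you deployed. Once the output looks right, you can run the lines by hand or pipe them to `sh`.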

In order to make hetzner-CSI available in your cluster, deploy the secret from koolbox to the cluster using:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-csi
  namespace: kube-system
stringData:
  token: $HCLOUD_TOKEN
EOF
```

**Note:** *This secret might just be stored in your gitops secrets, but for the sake of completeness it's mentioned here*

Finally, bootstrap fluxcd in your cluster. For SI-GitLab this would look like this:

```
export GITLAB_TOKEN=<project access token able to write the API and repository>
flux bootstrap gitlab \
  --hostname=git.shivering-isles.com \
  --ssh-hostname=git.shivering-isles.com:2222 \
  --ssh-key-algorithm ed25519 \
  --owner=<your user / team> \
  --repository=<your repository name> \
  --path=clusters/<your cluster name>
```

🏗️ Automation work until here still in progress 🏗️

Play around with things. Once you're done, you can clean up the whole mess using `make destroy`.

**Note:** *Sadly, again, the Hetzner Cloud terraform module isn't the greatest. Therefore `make destroy` might fail because the firewall is still in use, servers still have volumes attached, or the like. In this case, please remove the label selectors from the firewall rules and delete the volumes and servers by hand. Run `make destroy` once more to make sure everything was cleaned up properly.*

Ideas & ToDo's
---

This toolchain is still under development. Before it will be used in production there are still some things left to do:

- [x] Automate infrastructure deployment
- [x] Provide CLI container that contains all tools.
- [x] Automate overlay network deployment
- [x] Use encrypted overlay network (wireguard)
- [x] Automate cluster monitoring deployment
- [x] Automate ingress-controller deployment
- [x] Automate policy enforcement (kyverno) deployment
- [x] Provide a fully encrypted storage class (rook)
- [ ] Automate ingress-controller default certificate deployment
- [ ] Automate ingress-controller configuration for proxy-protocol
- [ ] Automate hetzner cloud integration deployment ([hetzner-cloud-controller-manager](https://git.shivering-isles.com/github-mirror/hetznercloud/hcloud-cloud-controller-manager))
- [ ] Document usage and thoughts in repository and blog posts
- [x] Automate deployment of Kubernetes
- [ ] Integrate OIDC-based authentication
- [x] Automate flux bootstrap
- [ ] Automate flux OpenPGP bootstrap
- [ ] Enforce SELinux on the deployed machines (Currently conflicts with Rook)
- [ ] Encrypt root filesystems for all nodes
- [ ] Remove default storage class "[hcloud-csi](https://git.shivering-isles.com/github-mirror/hetznercloud/csi-driver)"
- [ ] Integrate [Renovatebot](https://git.shivering-isles.com/shivering-isles/renovate-bot) with this repository to manage updates.
- [ ] Migrate [apps](https://git.shivering-isles.com/shivering-isles/infrastructure/) to gitops and Kubernetes
- [ ] Move to immutable base-system
- [ ] Automate system upgrades using Kubernetes
- [ ] Automate system configuration using Kubernetes
- [ ] Automate Kubernetes upgrades
- [ ] Integrate with [hcloud-dynfw](https://git.shivering-isles.com/sheogorath/hcloud-dynfw)
- [ ] Automate deployment of [cluster autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/hetzner)

Assumption
---

Building smaller, more minimalistic, plain Kubernetes clusters will be cheaper than OpenShift with OKD and more stable, since etcd doesn't have to write a ton of data to disk and there aren't two API servers running that take up to 3GB of RAM per master node.

The goal is still to manage everything GitOps style, but more iteratively, slowly grinding the way forward before clusters go into production.

Original assumptions / Lessons Learned
---

> This repository is focused on a setup based on OpenShift, [OKD](https://okd.io) to be specific. Therefore some installations and settings might be based on the expectation of OKD's default setup instead of going the plain Kubernetes way of inventing everything ourselves.

Sadly, this previous assumption didn't hold up. OpenShift on Hetzner Cloud resulted in quite annoying downtimes during upgrades. While the origin of the problem was not fully determined, it was shown that severe spikes in etcd fsync latency of up to 600ms played a major role in it.

To handle things properly, try to get the following tools (all included in `koolbox`):
- flux
- [sops](https://github.com/mozilla/sops/releases/) (for secret handling)
- [helm](https://helm.sh/) (just for sake of completeness and validation)
- [terraform](https://terraform.io/)
- make