wf.r8z.us

Reader

Read the latest posts from wf.r8z.us.

from "Steve's Place"

I've recently completed my CKA certification after a multi-year journey of learning and experimenting. I started with minikube and went through all the various iterations, up to and including OpenShift (OKD) 4.10. I maintain a cluster in my lab to keep my hand in and extend and expand my knowledge. Frankly, for my small applications, Proxmox's LXC containers are a superior solution, but I need K8S chops in my day job, and it's fun in a perverse sort of way. It's entirely overkill – as is my entire home lab – but you don't need to have the homelab illness I have to operate a kubernetes cluster.

minikube

If you have the RAM and the CPU power, minikube is a great intro. It spins up a tiny single-node k8s cluster by default, though you can scale it to multiple nodes if your computer has the gas to pull it off. Great way to get your feet wet, or to do local development of apps and packages. You could technically use it for a real k8s cluster, but it's probably not the best solution for non-development workloads.
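If you want to see the multi-node side of it, minikube will do that with a couple of flags. A minimal sketch (the node count and resource sizes are just what I'd reach for, adjust to taste):

# spin up a two-node cluster with a bit of headroom per node
minikube start --nodes=2 --cpus=2 --memory=4g

# check that both nodes registered
kubectl get nodes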

'Vanilla K8S'

My next step was a cluster of VMs in Proxmox running Debian and “k8s from scratch”. Just stepping through the manual install process, even with kubeadm, and learning about CNIs and config details was critically useful. This is a great learning process and will help you understand the details around kubelet and etcd that you'll need if you wanna get certified. Pretty easy with a single master, but HA vanilla K8S is harder; you'll need proxy chops to configure your API proxy if you want more than one master. You'll also need a lot of machines, or you'll need to enable workloads on your masters. I found that the three-member etcd cluster itself put significant load on my VMs, and DR was a PITA. A power failure killed my etcd instances and I eventually deleted that cluster and started over.
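For reference, the kubeadm flow for that first control-plane node looks roughly like this; a sketch only, and the endpoint name is a placeholder for whatever DNS name your API proxy or VIP answers to:

# first control plane node; k8s-api.lab:6443 is a hypothetical API proxy endpoint
sudo kubeadm init --control-plane-endpoint "k8s-api.lab:6443" --upload-certs

# then install a CNI (Calico, Flannel, etc.) and join the other nodes with the
# 'kubeadm join ...' command that init prints out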

Openshift/OKD

Next was Openshift via OKD. OKD's relationship to Openshift is somewhat similar to the relationship between Fedora and RHEL: it's the open-source core that Red Hat adds proprietary tools and support to and calls Openshift. Lemme just say Openshift is a HOG. You'll need five machines for HA (VMs work, but they have to be pretty robust). You'll need a lot of network setup, including DNS and probably VLANs. Openshift offers a lot of canned features that are not in vanilla K8S, from a web console to an operator store, security and routing advancements, etc. There's an Openshift analog to minikube called CodeReady Containers that lets you run a minimal OKD 4 single-node cluster on your local box for development and learning. I've not spent much time with it, tbh.
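If you want to kick the tires without the five-machine footprint, CRC boils down to a couple of commands (you'll need a pull secret from Red Hat's console first; this is about as far as I've taken it):

# one-time host setup, then bring up the single-node cluster
crc setup
crc start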

The biggest problem I had with OKD was that every time I had a power outage I had to rebuild the cluster. It wasn't hard, because I'd created an ansible playbook that did everything: I'd kick off “build-cluster” and it would delete the old virtual machines, create new ones, configure them, install OKD, and then I'd restore the backup. But it was a PITA. Also, I wanted my CKA, and Openshift was teaching me bad habits (not the least of which was thinking in oc instead of kubectl).

Microk8s (from canonical)

So I went looking for a “light” K8S version. Microk8s looked interesting. I liked the very lightweight database (dqlite) and the low idle load I got from deploying it. The built-in options are awesome; you can just enable stuff like istio, the k8s dashboard, metallb, observability, traefik – all sorts of stuff. Easy-peasy. This was great for several months. Then I had a power outage that crashed the whole system. I restored from node backups and unfortunately could never quite get the cluster happy again. It had something to do with the k8s backing store (dqlite in this instance), but I was unable to find a solution, though I found quite a few posts from people in the same position: unable to recover their microk8s cluster despite having good, recent (like, six hours old) backups.
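The addon experience really is a one-liner per feature; roughly like this (the metallb address range is just an example from my own lab numbering):

# enable the dashboard and a metallb address pool
microk8s enable dashboard
microk8s enable metallb:10.10.20.100-10.10.20.120

# see what's enabled and what's available
microk8s status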

K3S

So recently I decided to give k3s a go. I made this decision because several products I'm familiar with run k3s under the hood and have been pretty reliable, and I like the fact that it can use an external database – in my case mariadb – that I can back up separately and eventually even make into an HA configuration. The documentation isn't as clear and clean as microk8s's, and I had a few false starts along the way, but once you sort it out it's pretty straightforward. It'll run in very small VMs (though for my home lab I have a four-node cluster of VMs with 8 vCPUs and 16+ GB RAM each). I've only been running k3s for a couple of weeks, but so far it's been pretty tight; I had a power drop during a storm and lost two nodes, but they came back online with no trouble. It recovers rapidly from a lost node, and it has very simple HA config (though my DB isn't yet HA, I have four API nodes that work just fine).
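The external-datastore install is basically one line per server node; a sketch, with obviously made-up credentials and hostnames (mariadb speaks the mysql protocol, so the mysql:// form works):

# run on each server node
curl -sfL https://get.k3s.io | sh -s - server \
  --token=SOME_SHARED_SECRET \
  --datastore-endpoint="mysql://k3s:changeme@tcp(db.lab:3306)/k3s"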

Ultimately, every step in that path was a learning experience important to my understanding of the overall k8s system. You may or may not want to recapitulate that sequence, depending on how deep you need your k8s knowledge to be, and how deep your existing infrastructure knowledge is to start with. But if you wanna skip to the end, I don't blame you. I'll be adding some articles on specific solutions I've come up with in my lab, but here are a few of the resources I referenced while building out my own k3s cluster:

 
Read more...

from "Steve's Place"

The Problem

We all know we need to back up stuff that's important to us. We've all been told this before, many times. For some classes of information, our providers (google, apple, etc) have decided to help us with built-in cloud storage like iCloud or Google Photos. They aren't bulletproof, but they're useful and you've got images in two places. No longer must you lose your history if you lose your phone.

Even so, I'm not a trusting soul, so I back my images up locally and in the cloud. If there's a meltdown in an Apple or Google datacenter tonight, I'll still be able to get to my images (and the music I own, for example).

There's another class of data: Stuff That's Not Automagically In The Cloud – like the piles of random shit on my laptop and the works in progress that never get backed up to the cloud.

I also run a fairly significant test lab here. It's three enterprise-class servers and a handful of cast-off PCs pressed into 'server duty' over the years. The enterprise servers are set up as a virtualization environment (ProxMox).

ProxMox does a bang-up job of backing up its own children – VMs and LXC containers – sufficiently well done that I can restore them to a different machine and it works just fine.

This leaves me with miscellaneous scattered systems to back up and this is where my problems lie. So many applications have “use docker” as their recommended deployment strategy. I'm on board with that to some extent; containers are a big win in some regards. They also remove the need for bare-metal restores of servers and filesystems if you properly isolate the storage.

My Solutions

Infrastructure

I've got a TrueNAS server in my network to manage the storage for backups. It will, incidentally, run some random apps, but it's important to note (ironically) that it doesn't back those apps up usefully, so don't use it for anything that's not a throwaway.

Mobile devices

I use iCloud backup for my phone and my iPad. It's just too convenient. If someone grabs my phone and throws it in the river, I can have a new phone with all my old info in just a couple of hours. Same with my iPad. If you are concerned about data safety, you'll have to do something more manual, like plug your phone or tablet into a computer and manually back up.

Laptop and Desktop

I use Time Machine as my Mac client for backups to the TrueNAS server. It can be fiddly and slow over the wire; if I were to switch to a local SSD it'd be much faster, I suspect. Still, it's robust enough.

Virtualization Lab

Proxmox VE has a robust backup mechanism with scheduling and retention, and it will happily restore an entire VM or container to any PVE node. It's happy to use NFS as the transport to the TrueNAS server, and it's been fairly 'set-and-forget' for me.
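Under the hood it's vzdump, which you can also run by hand; a sketch (the VMID and storage name are from my setup, yours will differ):

# snapshot-mode backup of VM 101 to the NFS-backed storage, zstd-compressed
vzdump 101 --storage truenas-nfs --mode snapshot --compress zstd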

In my k8s cluster I use velero for namespace backups. I haven't done a full disaster-recovery scenario, but I have deleted apps and namespaces and successfully restored them to the same cluster. Velero uses Minio on TrueNAS as its S3 object-storage back end.
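The day-to-day of that is pretty simple; something like the following, where the namespace and backup names are just examples:

# back up one namespace, then (after deleting the app) put it back
velero backup create blog-backup --include-namespaces blog
velero restore create --from-backup blog-backup

# see what you've got
velero backup get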

Miscellany

There are a LOT of backup utilities for linux. I've been screwing with several, ranging from hand-scripted rsync jobs to burp to borg to ... you name it. I've finally sorta settled on restic.

restic does incremental snapshots and deduplication, so your storage can contain many versions without actually holding many copies. That's a win from both a storage perspective and a speed perspective. So far it's been great for backing up directories. It uses encryption and has a raft of back-ends it can use (I currently use restic's sftp backend to the TrueNAS server).

Additionally, I like restic because of the low barrier of entry. apt install -y restic and you're ready to back shit up. You can back it up locally or to a remote system with several backends available: local, sftp, S3, Minio, Backblaze B2, and more.

Need a repo? Set up your ssh creds on your backup server for password-less login and go to town.

restic init \
   --repo sftp:user@10.10.10.10:/home/user/restic

It'll ask you for the password you wanna use, then for confirmation, and you're cooking with microwaves. Don't forget that password – there's no coming back from that.

Then you can set a couple of environment variables:

export RESTIC_REPOSITORY=sftp:user@10.10.10.10:/home/user/restic
export RESTIC_PASSWORD=somesecurepassword
# alternatively, a file containing the password:
export RESTIC_PASSWORD_FILE=/home/user/.conf/resticp.txt

I have multiple repos, so I created some environment files.

File for repo #1 (.repo1)

export RESTIC_REPOSITORY=sftp:user@10.10.10.10:/home/user/restic
export RESTIC_PASSWORD_FILE=/home/user/.conf/resticp.txt

File for repo #2 (.repo2)

export RESTIC_REPOSITORY=sftp:user@10.10.10.20:/home/user/restic
export RESTIC_PASSWORD_FILE=/home/user/.conf/resticp2.txt

Now I can . .repo2 and then restic snapshots to see what's what:

 ➤ restic snapshots
repository fc56ded2 opened successfully, password is correct
ID        Time                 Host        Tags        Paths
------------------------------------------------------------------------------
4eab9583  2023-09-14 21:52:04  labbox                 /home/user/Documents
------------------------------------------------------------------------------
1 snapshots

Then I can backup a directory with:

 ➤ restic backup ./Desktop
repository fc56ded2 opened successfully, password is correct
no parent snapshot found, will read all files

Files:         284 new,     0 changed,     0 unmodified
Dirs:          140 new,     0 changed,     0 unmodified
Added to the repo: 47.006 MiB

processed 284 files, 63.214 MiB in 0:02
snapshot 5df6aaf1 saved

and check it out with:

➤ restic snapshots
repository fc56ded2 opened successfully, password is correct
ID        Time                 Host        Tags        Paths
------------------------------------------------------------------------------
4eab9583  2023-09-14 21:52:04  labbox                 /home/user/Documents
5df6aaf1  2023-09-14 22:21:24  labbox                 /home/user/Desktop
------------------------------------------------------------------------------
2 snapshots

So say there's a change:

$ echo "This is a new file"> Desktop/newfile.txt

$ restic backup ./Desktop
repository fc56ded2 opened successfully, password is correct
using parent snapshot 5df6aaf1

Files:           1 new,     0 changed,   284 unmodified
Dirs:            0 new,     1 changed,   139 unmodified
Added to the repo: 2.624 KiB

processed 285 files, 63.214 MiB in 0:00
snapshot 6a4f7265 saved

$ restic snapshots
repository fc56ded2 opened successfully, password is correct
ID        Time                 Host        Tags  Paths
------------------------------------------------------------------------
4eab9583  2023-09-14 21:52:04  labbox           /home/user/Documents
5df6aaf1  2023-09-14 22:21:24  labbox           /home/user/Desktop
6a4f7265  2023-09-14 22:28:13  labbox           /home/user/Desktop
------------------------------------------------------------------------
3 snapshots

Now, if we want the unchanged one back (Ignore the fact that we could just delete newfile.txt for the sake of the illustration), we can delete our Desktop folder and restore from snapshot.

 ➤ restic restore 5df6aaf1 -t ~/
repository fc56ded2 opened successfully, password is correct
restoring <Snapshot 5df6aaf1 of [/home/user/Desktop] at 2023-09-14 22:21:24.563663716 -0500 CDT by user@labbox> to /home/user/

Et Voila!

$  ls Desktop/
'Old Firefox Data'   computer.desktop   network.desktop   trash-can.desktop   user-home.desktop

No pesky newfile.txt in this newly restored directory!

But wait, there's more! If you want to see what's changed between two snapshots:

restic diff 5df6aaf1 6a4f7265
repository fc56ded2 opened successfully, password is correct
comparing snapshot 5df6aaf1 to 6a4f7265:

+    /Desktop/newfile.txt

Files:           1 new,     0 removed,     0 changed
Dirs:            0 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:      1 new,     0 removed
Tree Blobs:      2 new,     2 removed
  Added:   2.624 KiB
  Removed: 2.235 KiB

Or you can make sure your repo is consistent and solid:

$ restic check
using temporary cache in /tmp/restic-check-cache-4275756157
repository fc56ded2 opened successfully, password is correct
created new cache in /tmp/restic-check-cache-4275756157
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[0:00] 100.00%  3 / 3 snapshots
no errors were found

We can maintain the snapshots in the repo with restic, as well:

$ restic forget --keep-last 3 --prune

This tells restic to forget all but the latest three snapshots and prune the data they referenced. You can use more complex qualifiers too, like --keep-daily or --keep-weekly, for instance.

Restic has a much deeper feature set than is covered here, but these few commands will get you on the road to CLI backup easily enough. It's easy to script for cron jobs, and with incremental snapshots and built-in deduplication, you can run it every hour if you want to.
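If you want the cron version, a minimal sketch might look like this (paths, repo file, and retention numbers are all just examples; adjust for your own layout):

#!/bin/sh
# /usr/local/bin/restic-home.sh – hypothetical hourly backup job
set -e
. /home/user/.repo1   # pulls in RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE
restic backup /home/user/Documents /home/user/Desktop
restic forget --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --prune

# crontab entry (crontab -e):
# 0 * * * * /usr/local/bin/restic-home.sh >> /var/log/restic-home.log 2>&1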

There are ... many backup utilities out there with varying degrees of complexity and flexibility. Some of them (like velero) use restic in the background. I've come to like it quite a bit. But if it's still not your cup of tea, I still wanna leave you with a bit of advice: Back your shit up. Do it now. Don't wait.

 
Read more...

from "Steve's Place"

Proxmox is a virtualization server that I have become quite fond of because LXC containers are first-class citizens. Easy to back up, light on CPU load, and you can change CPU/RAM on the fly. Backups basically snapshot and copy, though you can pause as well. It runs on everything I've put it on so far, and it clusters.

technical debt

“Technical debt” is a phrase we geeks use to describe the situation where we built something and it evolved on the fly and we later learned that we should have done it differently but are in a situation where we can't afford (time, materials, data, whatever) to go back and do it right. So we often end up hacking at stuff and doing it even wronger in order to avoid that heavy lift of going back and doing it right.

What I'm about to describe is 'technical debt'. It's one of those “don't try this at home, kids” things, and I'm describing it for folks who may find themselves in the same exact situation of technical debt.

SPOF

Yes, I realize that this process renders one node of my Proxmox cluster a “SPOF” (single point of failure). And that this compromises any concept of “HA” one might gain from a “cluster”. But this is a home lab and I have a standalone NAS for backups. There's little mission critical going on here other than stuff I can back up and restore. So my risk exposure is just a bit of downtime, no more. Most of what runs on my cluster is dogfooding as I learn Kubernetes and BGP and the like, while I run my Fediverse services (like this blog, my PixelFed, my Mastodon and Pleroma) and other useful stuff like hastebin, flashpaper, and picoshare. If you think you need five nines of uptime, you probably shouldn't be hosting your own stuff in your basement like I am :D

Virtualization

I figured out a long time ago that I could get old enterprise servers for cheap and stick them in a room in my basement (my “datacenter”) and have significant server power for peanuts. The current incarnation of this concept is a pair of HP DL385p boxes with 32 Opteron cores each and 128G RAM each. These boxes cost me roughly $250 each, delivered, and the kill-o-watt attached to their PSUs says they cost me about $6/month. (I have fairly cheap power). I DARE you to find a VPS for $6 a month – or even an order-of-magnitude more – that has 64 cores and 256GB RAM. ( that bit is just there for all the folks that always say “Why not use a VPS?” )

The one drawback of enterprise servers is they're picky. Picky about RAM, picky about all kinds of things, but particularly picky about hard drives. These machines have... eight? 2.5” drive bays per box, but if you put anything but HP drives in 'em they complain. What's WORSE is that if you put more than FOUR unrecognized drives in 'em, the chassis fans all kick into “JET ENGINE MODE”. This is unacceptable even in my little lab. As it is, the brief thirty seconds of JEM we get on reboot is enough to wake you up if power bounces in the middle of the night.

Before these boxen I ran VMWare's free edition, but it didn't like the Opteron processors on these machines and I wasn't going to fight with it, so I switched to proxmox and I'm glad. I can see how VMWare's enterprise tools might be valuable to ... an enterprise ... but they're useless for my little lab.

So I put four random 2.5” drives I had laying around into the server and installed Proxmox and created storage willy-nilly. If you're just starting out, be aware that it supports shared storage like Ceph and if you can buy identical drives for three nodes, by all means, close this page immediately and go do that very thing. When I can afford to buy eight 2 or 4TB SATA SSDs, I'll probably back this whole cluster up and rebuild it that way, but until then...

Storage

I have this drive bay I bought years ago that fits into three full-sized CD drive slots and holds five drives vertically. It had been laying around the house, but when I decided I needed more storage in my life I ordered four (lol) HGST 8TB drives and put the drive bay in my ex-Windows box (a Ryzen 7). I threw the four drives into four of the five bays, hammered it all into my old Cooler Master case (it's an awesome case, but I don't need a Windows box anymore), installed Proxmox and added it to the cluster, then made the four drives into a 24TB RAIDZ pool.
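Proxmox will build that pool for you in the UI, but the shell equivalent is roughly this (device names are placeholders, and RAIDZ on four 8TB disks nets you about 24TB usable):

# four-disk RAIDZ pool named 'tank' – use /dev/disk/by-id/ paths in real life
zpool create tank raidz \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd

# register it with Proxmox as VM/container storage
pvesm add zfspool tank-storage --pool tank --content images,rootdir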

If you're paying attention, you'll probably guess that the storage is only accessible from that proxmox node, not the others. And you'd be right. But I'm just crazy enough to violate RFC1925 and add YET ANOTHER layer of indirection, and also to move the problem to a different part of the network. What did I do? I downloaded TrueNAS Core, of course. I then created a virtual machine and installed TrueNAS on it, and gave it a big chunk of that storage to work with. I THEN created an NFS share on that TrueNAS server.

Storageception

At this point you might want to close your laptop in disgust and walk away as you start to comprehend the technological abomination I'm describing. I wouldn't blame you. I came pretty close myself, while writing this. But we're just to the juicy part, so here we go.

I then went to the Proxmox UI in the Datacenter and added NFS storage to all nodes, pointing to the TrueNAS VM that I had just configured. It happily added the storage, it showed up on all three nodes, et voila! I have RAIDZ shared storage.
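The CLI equivalent, if you prefer it, is a one-liner; the storage name, IP, and export path below are stand-ins for my own values:

# add the TrueNAS NFS export as shared storage for the whole datacenter
pvesm add nfs truenas-nfs \
  --server 10.10.10.50 \
  --export /mnt/tank/pve \
  --content images,rootdir,backup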

To test it, I created a container on one node using the NFS storage, then right-clicked the container and told it to migrate to another node. It promptly did so. MISSION ACCOMPLISHED.

Ultimately, this does not present more risk than simply running a single-node external NAS server, but it adds the benefit of using the other six Ryzen cores for compute rather than simply storage. My other NAS is an old 2012 Mac Pro with 12 Xeon 3.46GHz cores, and I find myself running services on that to avoid wasting the compute power on serving files, so this is kind of a shortcut.

You do have to make sure that your “NAS” VM comes up first, but you can make that happen in Proxmox fairly easily.
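In my case that just means giving the TrueNAS VM a start-on-boot flag and an early startup order; something like this (VMID 100 is hypothetical):

# start the NAS VM automatically, first in line, and give it time to come up
qm set 100 --onboot 1 --startup order=1,up=120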

(Image: Proxmox Cluster)

 
Read more...

from "Steve's Place"

I like encrypted tunnels, VPNs, whatever you wanna call 'em. I've always been fascinated by encapsulation in a technical sense, despite the wisdom of RFC1925. I've deployed L2TP, PPTP, IPSec, OpenVPN, and others. Recently, I started playing with wireguard. I had a conceptual block early on and it took me a bit to figure it out. Once I did, it was one of those facepalm moments like “Oh, crap, this is STUPID SIMPLE”.

In my testing with iperf3, wireguard is the fastest of the crew to boot. I've got a set of test servers running in a wireguard mesh. It's actually pretty cool stuff: an IPv6 mesh with RFC4193 ULA address space, and it's very fast – not significantly slower than line rate for unencrypted v6 packets. Now I read that the Calico CNI supports wireguard as a first-class encrypted in-cluster K8S transport. This is awesome! I gotta try it!
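For the record, turning it on in Calico appears to be a one-line patch to the Felix configuration; I haven't run it in my own cluster yet, so treat this as a sketch from the docs:

# enable node-to-node WireGuard encryption in Calico
calicoctl patch felixconfiguration default --type='merge' \
  -p '{"spec": {"wireguardEnabled": true}}'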

Side Quest: k8s

So I've deployed k8s with calico and BGP, plus a BIRD bgp daemon on a linux box that also runs an nginx reverse proxy. This allows nginx to load balance to the pods referenced by a “headless service” – one that only identifies pods and doesn't do network config like building NodePorts/ClusterIPs/LoadBalancers. You configure nginx to proxy to servicename.namespace.svc.cluster.local and tell it to ask the cluster's coredns for the address. That gets you a round-robin result – not ideal for all endpoints, but great for others.
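Sketched out, that looks something like this – the service name, namespace, port, and CoreDNS address are all hypothetical placeholders for whatever your cluster actually uses:

# a headless service ("clusterIP: None") that just tracks the pods behind it
kubectl -n web create service clusterip myapp --clusterip="None" --tcp=8080:8080

# on the nginx box, the relevant proxy config looks roughly like:
#   resolver 10.43.0.10 valid=10s;               # the cluster's CoreDNS service IP
#   set $backend myapp.web.svc.cluster.local;    # using a variable forces re-resolution
#   proxy_pass http://$backend:8080;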

The wireguard CNI would offer a similar solution, I think. I have to try it to be sure, but I think you can just add it to the wireguard mesh and et voila! Encrypted and tunneled! Pretty nifty.

Main Quest: Encrypted Mesh

Then I discovered Tailscale. Tailscale uses wireguard – but so does a lot of other stuff, like Ubiquiti's Teleport. Tailscale is a commercial product with a 'free personal tier' that allows you 20 nodes, one subnet router, and some limits on functionality. It's STUPID SIMPLE to make it work initially, and there are a few things you can learn in one afternoon to expand its functionality. It's a full-time tunnel, but by default you only talk to other nodes on the mesh via TS. However, one node can publish routes, you can accept those routes, and any traffic you send to them will be forwarded through the TS mesh. For example, one node (my TrueNAS node) publishes my internal RFC1918 v4 range, and any of my other nodes can access that network by accepting those routes.
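The subnet-router bit is just flags on tailscale up; roughly like this, with an example RFC1918 range swapped in:

# on the node that publishes the route (my TrueNAS box in this case)
tailscale up --advertise-routes=192.168.10.0/24

# on any node that wants to use it (after approving the route in the admin console)
tailscale up --accept-routes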

A node can also advertise itself as an “exit node”; my other nodes can choose to use it, and that sends all their traffic through that node. If it has internet access, that means your internet access goes through that node. And it doesn't matter where you are, as long as you have internet connectivity. Your friend's house, the coffee shop on the corner... as soon as you choose “use exit node” and pick the exit node, your traffic is tunneled through that node, and encrypted. I have two exit nodes configured. My NAS is on the tailscale mesh, as are my laptops and my proxmox servers.
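Same story for exit nodes – advertise on one end, select on the other (again, just the flags as I understand them, with a placeholder for the node name):

# on the node offering itself as an exit (needs IP forwarding enabled)
tailscale up --advertise-exit-node

# on the laptop at the coffee shop, send everything through it
tailscale up --exit-node=exit-node-name-or-ip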

Get back home, or decide you don't need the security of the encrypted tunnel anymore? Just stop using the exit node. You can still ssh to your tailscale nodes from anywhere you have internet connectivity, but there are no 'ports' open to the world.

Side Quest: “Self hosted” alternatives

I like to host my own solutions. I have a Mastodon instance, a Pleroma instance, a WriteFreely instance (this one), a Pixelfed instance, a Friendica instance, etc. I like knowing the network inside and out and having full control of it. Since Tailscale is a commercial solution I don't own, I had to look for alternatives. I found three: Project Nebula, Headscale, and spinning one's own mesh.

Nebula and Headscale are not bad projects, but Nebula doesn't offer comprehensive IPv6 solutions (last I checked), and Headscale doesn't have an iOS client. They offer some of the features of Tailscale, but are not nearly as simple to get going (of course, since you have to build the system yourself). If Nebula ever gets full IPv6 functionality in both overlay and underlay, I might switch. Or if Headscale gets an iOS client... :D

Spinning your own mesh isn't hard, but is tedious. Building all the config files and designing the network is a spreadsheet job, and I hate spreadsheet jobs.

fin

Anyway, if you have any use for a VPN in your private life or hobby world, I highly recommend giving Tailscale a spin. No, I'm not getting paid; it's just really cool. If you don't care about IPv6, give Nebula and Headscale a look; they might be a good solution for you. And if you're really into spreadsheet jobs, build your own mesh!

 
Read more...

from "Steve's Place"

You really need more than one DNS server. DNS is such a vital service that when – not if – your local DNS cache dies you'll be dead in the water and puzzled for some time before you FINALLY realize that the cat turned off the Raspberry Pi running PiHole or Blockblock.

The problem, however, is replication. If you only use blocklists, you can just set 'em both up to use the same block list and Bob's your uncle. But if, like me, you use the local dnsmasq features of pihole for “split DNS” functionality, you need a bit more.

rqlite

rqlite is a distributed database that uses Raft consensus for leader election and stores its data in databases readable by sqlite3. So I got the brilliant idea (LOL) of setting up rqlite replication between the piholes. I fought with it for a bit and eventually got it to replicate, but once replication was working I couldn't get rqlite to notice changes to the DB and replicate 'em, even though I used the -on-disk flag.

So then I considered re-factoring pihole to use the rqlite libraries directly. While researching this idea, I came across gravity-sync.

gravity-sync

gravity-sync is a set of scripts that sync the important databases in the /etc/pihole directory. It's smart, so it only replicates changes, and it seems reliable enough. It has automation, so you can synchronize the databases every fifteen minutes (which I do). This means you need to make your changes on the primary DNS first, or you'll get differing answers from the two servers for up to fifteen minutes. I've sometimes resorted to doing a manual sync after a change, but so far that's been the only issue I've encountered with gravity-sync.
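For reference, the day-to-day commands are roughly these (from memory, so double-check against the gravity-sync docs):

# one-time interactive setup pointing at the peer pihole
gravity-sync config

# pull/push changes by hand, or let it run on a schedule
gravity-sync pull
gravity-sync auto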

Conclusion

Documentation on gravity-sync is good, installation is simple; it's my recommendation until such a time as I feel ambitious enough to fork pihole and refactor it to use rqlite or dqlite for databases.

 
Read more...