Normally, you have to announce you’re doing a thing before you can write a post-mortem when it fails. I didn’t announce this to anyone but myself, but I feel it’s good practice to analyze the facts and prepare to move forward.
I don’t normally find myself in a situation where a plan does not come together, but here we are. I will have to take a step back and try this again with a different solution.
The Idea
I recently put together a plan to replace my aging HP ProLiant 1U server with something more cost effective and, hopefully, a little more resistant to failure. I took a good look at my environment, and the majority of what I run in production is Docker containers. I also have things spread across multiple Docker VMs so that if one were to fail, I could recover fairly quickly. That also means I’m picking and choosing which Docker service goes where any time I want to stand up a new service.
The Docker VMs are broken into four environments.
- Core services: reverse HTTP proxy, home lab utilities, VPN ingress, Nova Trade Wars Admin Panel, etc.
- API services: I self-host Appwrite, and it is the core of all of my web projects, so it gets its own dedicated VM.
- Projects: Anything else that isn’t core functionality.
- Testing: Anything that is for testing purposes and is not ready for a production environment and/or may not ever see production.
This works for the most part, but it isn’t resilient to any type of failure, nor does it scale. If I want to beef up a project’s infrastructure, I simply can’t.
The Solution?
I acquired three HP EliteDesk i5 mini PCs and one HP ProDesk i5 mini PC. The design goal was simple: since the majority of my production services already ran in Docker, I could, in theory, migrate everything to Docker Swarm and let the swarm figure out placement and availability. That would solve most of the issues identified above.
The ProDesk is the beefiest of the four, as it supports up to 32GB of RAM and has both a SATA drive and an NVMe slot. That makes it the best candidate to run Proxmox, hosting the Docker Swarm manager node as well as other VMs for apps that aren’t meant to run in Docker. The three EliteDesk PCs support up to 16GB of RAM but only a single SATA drive. That’s not much of a problem, since with how Swarm should work, they shouldn’t be storing much on their own disks anyway.
Some Swarm vs vanilla Docker things to note:
- Swarm uses overlay networks and an ingress routing mesh: a port published by a service is reachable on every node, and the mesh routes each request to a node actually running a task. In my design, the manager node is the entry point and routes traffic to the correct worker node.
- Docker Compose files are not 100% compatible with Docker Swarm: some options are ignored by docker stack deploy, and Swarm-specific settings live under a deploy key.
- Docker containers are tasks of services, and the swarm decides which node runs each task. Because of this, things like bind mounts need to resolve correctly on any node in the swarm.
- If you plan on having bind mounts, some type of shared storage is best, whether a network mount or an S3-compatible service. I chose NFS for the sake of simplicity (a minimal sketch follows this list).
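To make the last two points concrete, here is a minimal stack file sketch showing both a Swarm-only deploy section and a named volume backed by NFS, so any node can mount the same data. The NFS server address and export path are placeholders, not my actual setup:

```yaml
version: "3.8"
services:
  app:
    image: nginx:alpine
    deploy:                 # Swarm-specific section; plain docker-compose ignores it
      replicas: 1
    volumes:
      - appdata:/usr/share/nginx/html   # named volume, resolvable on any node

volumes:
  appdata:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.10,rw"    # placeholder NFS server address
      device: ":/export/appdata"   # placeholder export path
```

Each node mounts the export itself the first time a task lands on it, which is what lets the scheduler place the service anywhere.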
Small Test
I stood up a very small test on the ProDesk. I created two identical Ubuntu Server 22.04 VMs, each with 2GB of RAM and Docker installed. One became the manager and one the worker. I deployed a test Nginx service, and everything worked as intended. I then deployed the Nova Trade Wars website with two replicas to test that functionality. That worked as expected too, though I honestly didn’t know which node was replying, as I hadn’t built that into the website code.
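For reference, the test was roughly the following stack file, deployed from the manager with docker stack deploy (the names and published port here are made up for illustration):

```yaml
version: "3.8"
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"   # published through the routing mesh: every node answers on 8080
    deploy:
      replicas: 2   # the swarm spreads the two tasks across available nodes
```

Because the routing mesh load-balances published ports across replicas, there’s no obvious way from the client side to see which node served a given request unless the app itself reports it, which is exactly why I couldn’t tell the replicas apart.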
Reverse Proxy
Up until this point, I’ve used Nginx Proxy Manager to take care of all my reverse proxy needs. It’s worked extremely well and has been great for what I needed. I think it would have worked for this application too, but I wanted to make the transition to Traefik so that I wouldn’t need to manage the deployment of a service and the setup of its reverse proxy settings separately; with Traefik, the routing can be declared right inside the Docker Swarm config.
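Here’s a sketch of what that looks like, assuming Traefik v2 with its Docker provider running in Swarm mode; the image, router name, hostname, and network are all placeholders:

```yaml
version: "3.8"
services:
  site:
    image: registry.example.com/site:latest   # placeholder image
    networks:
      - traefik-public
    deploy:
      labels:   # in Swarm mode Traefik reads labels from the service, so they live under deploy
        - "traefik.enable=true"
        - "traefik.http.routers.site.rule=Host(`site.example.com`)"
        - "traefik.http.services.site.loadbalancer.server.port=80"

networks:
  traefik-public:
    external: true   # overlay network shared with the Traefik service
```

With this, standing up a service and wiring it into the reverse proxy become a single deploy step instead of two separate chores.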
This is where things started to get weird.
That’s Not Right
After successfully deploying the Nova Trade Wars website and getting Traefik in place as the HTTP reverse proxy, things seemed to be on the up and up. Or so I thought.
Enter: the WordPress blog. Not this WP blog, another one for a small project I support. It would be an easy kill, or that was my logic. By this point I had set up the NFS share and rolled Swarm out to all of the mini PCs, and every node could access persistent data without a problem. The moment I stood up the WordPress site was the moment I realized something had gone terribly wrong. At first I thought it was a node being slow, so I scaled up to three worker nodes. Nothing changed. The site timed out at the Traefik router, and when I exposed the port directly to my LAN, it didn’t get any better. Other services could access the share just fine, and the speed was great.
Ok. Not the end of the world, let’s just use my Portainer UI and drop into a shell to inspect some things.
Well, that’s not good. Also, the clock is ticking; I need critical services back up and running sooner rather than later. WordPress runs fine as long as it isn’t trying to interact with the NFS share through a bind mount.
After spending countless hours troubleshooting WordPress, I gave up and moved on to Appwrite, which I’m sure I could have gotten working. But at some point you have to look a broken, messy project in the eyes and admit that it isn’t going to work, at least not in the way I envisioned it working. After having core services down for entirely too long, I decided to execute my rollback procedures.
Lessons Learned
I spent a lot of time figuring out which scalable infrastructure was right for me. I really thought Docker Swarm was going to be it, and it still may be. However, deployments like these can’t stretch on forever; I need my key services running to keep up my normal routines.
Some of the things I learned from this:
- I was not prepared enough to roll everything out; I needed a deeper understanding of the features and nuances of Swarm.
- I was not prepared for the amount of downtime I incurred; I did not budget time for large obstacles.
- I did have a good rollback plan in place and was able to restore my original setup in about 20 minutes.
Moving Forward
I will take some more time to research K8s, K3s, and Swarm, and see what fits my purpose and is most likely to be the best option for my environment.
This fight isn’t over yet, but I’m taking a break until I’ve done more research and gotten more practice setting up these services.