This semester has been such a shitshow that I haven't had any time to work on the site, much less write posts. When I wasn't studying for all the lab classes and failed subjects, I was busy finishing my project at the company I worked for so I could finally quit and focus on passing uni tests and exams.

Looking at the git log, there were a fair number of commits. But they were almost exclusively backend-related or quick fixes meant to keep the infrastructure from falling apart. This week I've fixed two annoying issues that made the cluster periodically (as of late, daily) shit itself. I use LXC for containerization, and I've noticed that from time to time some containers tend to lose the IP address assigned to them by the DHCP server. It turned out that LXC macvlan and IPv6 do not play well together: the router assigned both a v4 and a v6 address to every container, but the v4 one would eventually disappear. The solution was to disable the IPv6 DHCP server on the router - it's not like I needed it in the first place, as the whole internal network is IPv4-only.

The second achievement was fixing the storage issue that made the reverse proxy server freak out. My sites are self-hosted on a wide variety of ARM shitboxes with little internal storage, so I have to extend them with external HDDs and SSDs. The proxy server's SATA SSD is connected through a JMicron JMS578 (though it advertises itself to the kernel as a JMS579) USB-SATA bridge. External HDD enclosures using this chip are cheap as fuck, and for a reason: the JMS578 has a notorious spindown bug where it stops working after some arbitrary time (usually a few minutes) has elapsed without write operations. In practice this meant that the storage the containers were running from got remounted read-only from time to time. I can only guess at the exact sequence of events, because by the time I noticed, the syslog was flooded with totally unhelpful error messages.
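For anyone stuck with this chip who can't reflash it, a possible stopgap (not what I ended up doing) is to never let the bridge see an idle window: force a periodic write from cron. The mountpoint below is an example:

```
# /etc/cron.d/ssd-keepalive - touch the disk every minute so the JMS578
# never goes long enough without a write to trigger the spindown bug
* * * * * root touch /mnt/ssd/.keepalive && sync
```

It's a band-aid, not a fix - the proper solution is the firmware reflash described below.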

This went on for months, with the frequency climbing from once every few weeks to once every few hours. That was the point where I stopped the server and took out the enclosure to find a solution. Luckily the JMS578 can be safely reflashed with an official tool to get rid of this behavior, so following this guide did the trick for me. The server's been running without a hitch ever since.

Edit 1: The containers of the jpdream cluster lost their IPs a few hours after I posted this. I did some digging, and it turned out IPv6 had nothing to do with it. The problem lay within the container images: the ones I use ship without a DHCP client. That's why the containers came up without an IP address when I created them, and why I had to put dhclient eth0 into the delivery script. What I didn't know is that dhclient is a single-shot program, i.e. not a fully-fledged DHCP client: it does not renew IP addresses. So after 6 hours the IP issued by the DHCP server expired and I was back where I started. The solution was to install dhcpcd, an actual DHCP client, in every container. The IP addresses have already survived container restarts; let's see if they survive the first lease expiration as well.
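The per-container fix boils down to something like this (a Debian-based image is assumed here; the package name and service manager vary by distro):

```shell
# inside each container
apt-get install -y dhcpcd5        # packaged as plain "dhcpcd" on some distros
systemctl enable --now dhcpcd
# dhcpcd stays resident and renews the lease before it expires,
# unlike the one-shot "dhclient eth0" the delivery script used to fire
```

The same two lines also went into the delivery script, so freshly deployed containers get a real DHCP client from the start.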

As for how the fuck this happened: a few months ago I switched to my custom prebuilt containers, which didn't come with dhcpcd preinstalled. Before that, the delivery script downloaded a fresh image on every clean deployment and installed/compiled everything.