Why my website was down for a week.

On the 5th of July 2022, my server went offline for about a week. I went to my Proxmox Hypervisor Dashboard and found it wouldn't load. Instead, I was met with errors like this:

Error: systemd-journald[528]: Failed to write entry (22 items, 757 bytes), ignoring: Read-only file system

Day 1

Immediately, I restarted the server and was then met by fsck (a filesystem check). This was worrying, as I only have two drives, and they run in RAID 1 for redundancy. The possibility that both had failed and dropped into read-only mode was a terrifying prospect. It could mean that all my data was lost, or at minimum that I would need to back everything up somewhere else and buy new drives to put it on. Not only is that a problem that could take over a week to fix, it's also a fairly expensive one.

However, the priority was to back everything up, and that's what anyone going through a drive failure should do first: get a copy of as much data as possible somewhere safe. Only then does it make sense to start prodding at the drives or thinking about replacements.
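For anyone in the same situation, the kind of thing we were attempting looks roughly like this. It's only a sketch: it assumes the data partition can still be mounted read-only and that another machine is reachable over SSH, and the device name, mount point, and hostname are all placeholders.

# Mount the data partition read-only so nothing else writes to it
mount -o ro /dev/sdb1 /mnt/rescue

# Copy the VM and container images to another machine over SSH
rsync -avP /mnt/rescue/images/ root@backup-host:/srv/proxmox-rescue/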

Here I just want to thank @BeyaMale who guided me on what to do and helped me attempt a backup.

However, we were unable to get any backup going and I decided to sleep on it.

Day 2

After sleeping on it, I began to suspect that it wasn't in fact my hard drives that had failed. You see, the only other storage device in my server is a 16GB USB stick, which holds the Proxmox installation and is what the server boots from. This was much more likely to have failed, especially since it was just a cheap £10 USB stick off Amazon. On top of this, in the Array Configuration Utility (ACU), there were no warnings on the drives other than that they weren't authenticated, which was simply because I didn't have any caddies.

The warnings given by the drives

As you can see, there are no warnings or errors indicating that the drives have failed. And my suspicions were right: after booting a disk-management operating system, I was able to mount and write to my disks, though I had some issues with my USB stick.

Here I want to say thank you to @realdeadbeef who suggested this idea and helped prove it.

With this new knowledge, I could either try to recover possibly corrupted data from my USB stick, or get a new drive or USB stick and create a fresh Proxmox installation on it.

I decided to create a new installation. I had an old 80GB hard drive lying around, so I plugged it in. To my surprise, it had HP's Sea of Sensors support (even though it was an old laptop hard drive from a Dell laptop). Anyhow, I could now use this for the new Proxmox installation. Once that was installed, I mounted my other hard drive.
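Mounting it came down to something like this; the device name is an assumption from memory, but the /mnt/sdb1 mount point matches the path in the import command below.

# Mount the old data drive so its VM images are reachable
mkdir -p /mnt/sdb1
mount /dev/sdb1 /mnt/sdb1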

Day 3-4

So what was left to do was to manually recreate every VM and LXC container and then import its hard drive. I used this command to clone the virtual hard drive for a VM:

qm importdisk 105 /mnt/sdb1/images/105/vm--disk-0.qcow2 <target-storage> --format qcow2

I then assigned the drive in the GUI and made it the bootable drive.
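The CLI equivalent would be roughly the following; I did this through the GUI, so treat it as a sketch, and the storage name (local-lvm) and disk name are assumptions for VM 105.

# Attach the imported disk to the VM and boot from it
qm set 105 --scsi0 local-lvm:vm-105-disk-0
qm set 105 --boot order=scsi0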

Importing virtual hard drives into LXC containers was a little harder. Instead of being a qcow2 file, they were raw files, and I couldn't use a command-line tool like qm importdisk. So instead, I created a dummy drive with my new LXC container, then removed the hard drive the container had created and replaced it with the original one, so that the path to the container's hard drive pointed at the original file.
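In practice that meant editing the container's config file under /etc/pve/lxc/ so its rootfs entry referenced the original raw volume instead of the dummy one. Something along these lines, where the container ID, storage name, and size are illustrative:

# /etc/pve/lxc/106.conf
rootfs: local:106/vm-106-disk-0.raw,size=8G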

Finally, I rescanned all the drives by running

pct rescan

This automatically resized drives that were too small or too big and flagged any unused drives, which I then went through and deleted.

Other problems

One of the biggest headaches was networking. I had to re-reserve the IPs for all of my VMs so that they kept their old internal addresses. I had particular trouble with the IPs for my Media Server VM and my Raspberry Pi. Even now, my proxy server can't resolve my Media Server and I am struggling to access the network drive on my Raspberry Pi. (This has since been fixed.)

In conclusion

Go with redundant, reliable storage media. Obviously a £15 USB stick off Amazon doesn't cut it. A reliable medium is less likely to fail in the first place, and making it redundant means that even if one device fails, the other carries on, so you still have functioning storage while you replace the failed one.

Another big thing that I will be working on soon is making sure there are NO warnings on the server. The existing ones initially confused us about which drives had failed: because both my hard drives had 'Not Authenticated' warnings, we assumed they had SMART warnings (meaning they had completely failed). If I had spent the extra money on caddies so the drives were authenticated, there would have been nothing to misread, and any warning that did appear would point to a genuine problem.

I hope you learnt something from my troubles, and I hope that I don't have any issues like this in the future. Hopefully everything will be fully functional soon.
