2007/04/07 00:42:29

The week from hell

I started the week already needing a vacation, looking forward to a short week followed by a 4-day weekend. Last week was busy and ended with a long evening of taking everything down to replace a cache controller on a SAN. Unfortunately, despite a good start this week of hiding at work and getting things done for 3 hours, it just went downhill.

The problems started with runaway stuff filling up a disk (of course on a volume that happened to be missed when setting up monitoring). Then there's the fun issue of Linux drivers turning SCSI_STATUS_BUSY (retry) into BUS_BUSY (abort) to fix some specific bug. Unfortunately, when that is hit with the contention and delays that the combination of ESX and a shared SAN causes, devices end up locked in read-only mode. Neither the filesystem nor the database engines like that much, and the "fix" is currently a hack in the HBA driver (to roll back the vendor's change).

Then came finding random corruption in a database (on Windows this time, unrelated to the Linux driver issue). After trying to track that down, it turned out the timing correlated with a drive issue behind a SAN controller, which apparently leaked through the redundancy without being caught. Because the problem started on the weekend and replaying logs requires downtime, recovery meant replaying transaction logs from Saturday through yesterday, which took hours in itself, not counting the staging/testing and pulling backup copies and restoring files at various states.

Then I had a spam filter randomly die. It turns out the CPU fan failed, so it would run for 5-10 minutes and then go into thermal protect. Luckily, once I knew the problem I was able to safely dequeue the mail without issue, and after I sent a screenshot and checked the motherboard model (for which they had me ignore the "Warranty void if removed" sticker), a replacement part was on the way. It was not the instant swap-out I expected from most reports in support forums, but swapping a fan and heatsink is quicker and easier for me than a backup/restore. The replacement arrived the next day, along with a replacement warranty sticker and an extra one (which amused me), although I'm not thrilled with the design of the replacement part even though it seems to work.

All of this was worked around 7 hours of meetings and the usual daily stuff. Plus I ended up rescheduling my Monday vacation day, between not really getting the whole weekend anyway and the practicalities of scheduling. Hopefully that turns into next weekend being 4 days, though. I did end up with a few minutes of extra time while at the mall over lunch on Thursday, which resulted in toy shopping (which will be another post), so at least there was something fun this week. I'm just glad there was one fewer day for things to go wrong, and it's now the weekend.