Hey everyone, and happy new year!

Sorry about that super long downtime there. Yesterday (Sunday) morning at 10:03AM PST our server suffered a physical hardware failure, apparently a power supply failure. Unfortunately, even though we opened a ticket with our hosting vendor (OVH) a few minutes later and they claim to have 24/7 support, nobody looked at our ticket until this morning, when their phone support lines opened and I called them.

They’ve now replaced a defective power supply and we’re back online, after ~26 hours of being offline. Some pretty disappointing response times, to put it nicely.

We’re planning to move away from OVH at the end of this month, onto proper enterprise-grade hardware that we own and control. This will give us a HUGE boost in server resources and allow us to scale for the foreseeable future, while also giving us the control to resolve problems like this much quicker. Expect another follow-up post about this in the next couple of weeks once I’ve put together the migration plan.

Timeline:

  • Jan 5th 10:03am PST - We get alerts that the server is non-responsive.
  • Jan 5th 10:05am PST - I pull up the console via IPMI and it’s completely non-responsive. Attempting to power the server off / on, or do anything else, does not work.
  • Jan 5th 10:15am PST - Initial support ticket created with OVH. I followed up a couple times over the next few hours, and got no response.
  • Jan 6th 6:32am PST - Called OVH, gave them the case number and asked them to investigate.
  • Jan 6th 7:34am PST - I get notified they’ll start their “intervention” in 15 minutes.
  • Jan 6th 11:04am PST - Called them again; the tech is still working on it and they’ll get back to me with an update.
  • Jan 6th 11:34am PST - “I was informed by our data centre technician that there is an issue with the power supply unit for the rack on which your server resides. Your server will come back online once they have replaced the power supply.”
  • Jan 6th 12:17pm PST - We’re back up finally!
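
For anyone who wants to check the "~26 hours" figure against the timeline above, here is a quick sketch in Python. The year is not stated in this post, so `YEAR = 2025` is an assumption (it does not affect the result), and the variable names are just illustrative labels for the timeline entries.

```python
from datetime import datetime

# Assumption: the post does not state the year; it does not change the durations.
YEAR = 2025

# Timestamps copied from the timeline (all PST, so naive datetimes are fine here).
went_down = datetime(YEAR, 1, 5, 10, 3)       # first alerts
ticket_opened = datetime(YEAR, 1, 5, 10, 15)  # initial OVH ticket
first_call = datetime(YEAR, 1, 6, 6, 32)      # phone call to OVH
back_up = datetime(YEAR, 1, 6, 12, 17)        # server back online


def hours_minutes(delta):
    """Split a timedelta into whole hours and leftover minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


outage = hours_minutes(back_up - went_down)
unanswered = hours_minutes(first_call - ticket_opened)

print(f"Total outage: {outage[0]}h {outage[1]}m")
print(f"Ticket sat before the phone call: {unanswered[0]}h {unanswered[1]}m")
```

By this math the outage ran a bit over 26 hours, and the ticket sat for roughly 20 of them before the phone call.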

Edit on Jan 7th @ 8:40am PST: We just had another outage of about an hour. Investigating with OVH.

  • Ace T'Ken
    2 days ago

    Shit. I didn’t read your comment as a joke initially and almost had a mini-rant about how we know if a server 3 provinces away is down for 5 minutes and go into a red alert.

    That was dumb of me. Carry on.

    • Em Adespoton
      1 day ago

      Sorry, that humour is from my 3 decades of asking people “what do you mean it failed? WHAT WEREN’T WE LOGGING THAT IT COULD HAVE SAT IN THAT STATE FOR 24 HOURS WITH NOBODY TELLING ME???”

      Unfortunately I’ve been at my current company long enough that I start work with hundreds of improperly tuned notifications now.

      It’s enough to have left me a bit fatalistic about the many ways monitoring can be screwed up both through good intentions and through ignorance and inattention.

      • Ace T'Ken
        1 day ago

        Totally get that. It’s what made me spin up my company instead of staying a tech lead at massive MSPs. They don’t give a fuck about procedures and properly taking care of systems, they care about making sure the client signs the next contract. That is all.

        We do a full onboarding at every client and make sure every profile on every desktop and every server, switch, printer, and router are fully updated and up to spec. Everything is fucking perfect. It’s why we only sit on (checks RMM) 12 tickets for over 1000 seats at any given time. NOTHING waits and we know everything inside and out. Some clients pay for perfectionism and adore it.

        So yeah. Long-winded way of saying I totally appreciated the humour once I pulled my head out of my ass and in fact realized that it was humour.