As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r@tucson.social · 1 year ago

That is exactly what we do. The problem is that, as a managed service offering, it is on us to scale in response to these alerts.

I think people are misunderstanding my original post. When I say that a customer cluster will go into stop writes, that does not mean it is not functional. It is an entirely intended function of the database, there so that no important data is lost or overwritten.

The problem is more organizational: we have a 5-minute SLA to respond to these types of events, and they can happen on any random customer impulse.

I don’t have a problem with customers that can correctly project their load and let us know in advance. Those are my favorite customers. But they’re not most of our customers.

As for automation: as I exhaustively detailed in another response, we do have another product that does this a lot better, and it’s the one that we are marketing far more heavily. The one where I’m feeling all the pain is our enterprise-level managed service offering, which goes to customers that have “special requirements.” That usually means they will never get automation as robust as the other product line's.