I remember reading comments about how the site was still running fine after firing so many people. “What do they do?”
People fail to understand that large projects have inertia. He could have shuttered all Twitter offices, fired all employees, and only paid the server bills, and the website would probably continue to function just fine for a few months.
But as a devops/SRE, I've found this whole saga awesome to watch.
Didn’t they stop paying their AWS bills for a while?
Back in March it was reported they weren’t paying their AWS bill. Two weeks ago it was reported they weren’t paying their GCP bill either.
And often the tipping point is invisible. Some small routine or service degrades, but outwardly everything still works fine… there is just more strain on the services and clients that use that service, causing them to slowly degrade over the next few hours, days, or weeks, which in turn puts more strain on the services that call those services, and so on.
Until one day the system is so degraded that major things start breaking. It seems like it came out of nowhere, but the initial failure happened weeks ago and has been cascading since then.
Once a system hits that point it’s often not enough to just fix the initial problem because so much of the ecosystem around it has been thrown out of whack.
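To make that “invisible strain, then sudden collapse” dynamic concrete, here is a toy sketch (nothing to do with Twitter's actual stack; every number and the hour-by-hour step are invented purely for illustration) of how client retries can turn a small, quiet capacity loss into a runaway failure:

```python
# Toy model of cascading degradation: one dependency quietly loses a little
# capacity, clients retry failed calls, the retries add load, and the extra
# load pushes the failure rate higher each hour. All numbers are made up.

BASE_TRAFFIC = 900        # steady organic load, requests/sec
DEGRADED_CAPACITY = 850   # capacity after a quiet failure (e.g. a dead replica)
RETRIES_PER_FAILURE = 2   # each failed call is retried twice by clients

def failure_rate(load, capacity):
    """Roughly: near-zero failures under capacity, climbing sharply above it."""
    if load <= capacity:
        return 0.01
    return min(1.0, 0.01 + (load - capacity) / capacity)

load = BASE_TRAFFIC
for hour in range(1, 7):
    rate = failure_rate(load, DEGRADED_CAPACITY)
    print(f"hour {hour}: offered load ~{load:.0f} req/s, failure rate ~{rate:.0%}")
    # Failed requests come back as retries stacked on top of the organic
    # traffic, so next hour's offered load grows with this hour's failure rate.
    failed = load * rate
    load = BASE_TRAFFIC + failed * RETRIES_PER_FAILURE
```

The numbers are meaningless on their own; the point is the shape. The service looks merely a bit unhealthy for a while, then retries alone swamp whatever headroom is left, which is why fixing the original small failure late in the game isn't enough by itself.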
See the film Passengers for an example of cascade failures from systems trying to cover for each other.
The Expanse has a whole b-plot about an artificial ecosystem going through cascade failure in one of its arcs.
As a way-too-seasoned web developer who appreciates working alongside great SREs, I've found this pretty interesting. I'm honestly surprised more hasn't gone wrong, but maybe that's yet to come. Since they're (I imagine) losing users instead of growing, they might actually avoid running into the future scaling issues that were looming.