Stability Patterns & Antipatterns

Video: Michael T. Nygard presents his talk at GOTO 2016 (YouTube ID VZePNGQojfA)

Nygard's talk draws on his experience with an Ops team he joined in the early 2000s. While debugging production stability problems he started to see trends, and he began classifying them in the hope that others could avoid these classes of problem. He published them in a book titled Release It!, and this talk feels like a tour of a few of the bigger ones.

These slides are also available as PDF

Availability is the probability that a system is running at any given time.

Stability is an architectural characteristic, producing availability despite faults and errors.

A fault is an incorrect internal state in your application. A system can be fault tolerant, repairing its internal state and carrying on. A fault-intolerant system takes the opposite view: once you're in an error state, no state is as clean as your initial state, so just shut down. A language with exceptions is generally fault tolerant.
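As a rough illustration of the two stances, here is a minimal Python sketch. The Ledger class, its invariant, and the function names are all hypothetical and not from the talk; the point is only that the same detected fault can be handled by repair or by shutdown.

```python
class CorruptState(RuntimeError):
    """Raised when the component decides its state cannot be trusted."""


class Ledger:
    """Hypothetical component whose invariant is: total == sum(entries)."""

    def __init__(self):
        self.entries = []
        self.total = 0

    def has_fault(self):
        # A fault: the cached total no longer matches the entries.
        return self.total != sum(self.entries)


def repair_and_continue(ledger):
    """Fault-tolerant stance: repair the internal state and carry on."""
    if ledger.has_fault():
        ledger.total = sum(ledger.entries)  # rebuild from the source of truth
    return ledger


def crash_on_fault(ledger):
    """Fault-intolerant stance: refuse to continue with a bad state."""
    if ledger.has_fault():
        raise CorruptState("invariant broken, shutting down for a clean start")
    return ledger
```

In the second case something outside the component (a process supervisor, for example) is assumed to bring it back up in its initial state.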

An error is when a fault becomes visible to the user. If a fault is not visible, then it's no big deal.

A failure is when the system doesn't respond or loses availability. That's the state we most want to avoid.

Stability Antipatterns

Integration points are the #1 risk to stability. Avoid or contain them. It's impossible to engineer away every problem that can occur, so expect failures at every integration point you have and deal with them there, so they don't propagate to the rest of the system and take it down.
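One way to picture "containing" an integration point is to put a time limit and a fallback around every remote call. The sketch below is an assumption-laden example, not something shown in the talk: call_contained, slow_service, and the pool size are made-up names and values, and it uses only Python's standard concurrent.futures module.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small, fixed pool so a misbehaving dependency cannot consume unbounded threads.
_pool = ThreadPoolExecutor(max_workers=4)


def call_contained(remote_call, timeout_s, fallback):
    """Run remote_call, but never wait longer than timeout_s and never raise."""
    future = _pool.submit(remote_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a call that is
        # already running keeps its worker thread until it returns.
        future.cancel()
        return fallback
    except Exception:
        # Any other failure at the integration point stops here.
        return fallback


def slow_service():
    """Hypothetical stand-in for a remote dependency that responds too slowly."""
    time.sleep(0.3)
    return "late answer"
```

With these definitions, call_contained(slow_service, 0.05, "fallback") gives up after the timeout and returns the fallback instead of hanging the caller. A real system would also want timeouts on the underlying socket or client library, since this wrapper only bounds how long the caller waits.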

Nygard gives an example of a database where long-established connections were being dropped by the firewall. This was hard to diagnose and was not something the system expected. To avoid this problem you can enable "dead connection detection", which checks whether the client is still alive and cleans up if it is not.
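"Dead connection detection" is a database-side setting. A related technique at the socket level is TCP keepalive, which sends periodic probes on an idle connection so that a dead peer is noticed and intermediate devices see traffic. The sketch below is a swapped-in illustration rather than the fix described in the talk; enable_keepalive and its default timings are assumptions, and the fine-grained options are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive probes for an otherwise idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These tuning options are platform-specific, so set each one only if
    # this platform's socket module defines it.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

The idle time would need to be shorter than the firewall's idle timeout for the probes to keep the connection's state alive; that timeout is site-specific and not given in the talk.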

Another example is people abusing inputs. For instance, a user may send an infinite stream of open element tags to blow out memory. Don't trust your inputs.
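A minimal sketch of not trusting inputs, assuming the input is XML and using Python's standard xml.parsers.expat module: cap the payload size and the nesting depth, and abort as soon as either limit is crossed. The function name, exception name, and limit values are hypothetical.

```python
import xml.parsers.expat


class InputTooDeep(ValueError):
    """Raised when the document nests deeper than we are willing to accept."""


def parse_with_limits(data, max_depth=100, max_bytes=1_000_000):
    """Parse XML bytes, rejecting oversized or too deeply nested input.

    Returns the number of elements seen.
    """
    if len(data) > max_bytes:
        raise ValueError("input exceeds max_bytes")

    depth = 0
    count = 0

    def on_start(name, attrs):
        nonlocal depth, count
        depth += 1
        count += 1
        if depth > max_depth:
            # Raising inside the handler aborts the parse immediately.
            raise InputTooDeep("nesting deeper than %d" % max_depth)

    def on_end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return count
```

This version takes the whole payload at once. For a genuinely unbounded stream, the same checks would have to be applied while reading in chunks, counting bytes as they arrive and feeding each chunk to the parser, so that the limits trip before memory does.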