anti-decay 2: health monitoring and notification

yesterday i fixed an issue that was created a long time ago for a postmortem

the solution is to generate a metric and send email alerts if a condition is not satisfied

this kind of condition checking is like an annual health check

it can spot health issues before they cause disasters

g has many tools for health checking, and probably every mission-critical system has a corresponding health monitoring sub-system

there are two kinds of monitoring:

1) black box or external monitoring

2) white box or internal monitoring

For the first approach, no code modification and no internal knowledge of the system are required. It's like black box testing: the monitor has no code coupling to the system. The second approach is like white box testing: it relies on knowledge of the system's internals, typically metrics that the system itself exposes.
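
to make the difference concrete, here is a minimal sketch in python. everything in it is an assumption for illustration: `KeyValueService` is a made-up toy system, and `black_box_probe` and `white_box_snapshot` are hypothetical helper names, not part of any real monitoring tool. the black box probe only uses the public interface, the way an outside client would. the white box snapshot reads counters that the system maintains about itself.

```python
import time
from typing import Dict, Optional


class KeyValueService:
    """Hypothetical toy system being monitored (an assumption for this sketch)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        # White box: counters the system keeps about its own internals.
        self.internal_metrics: Dict[str, int] = {"requests": 0, "misses": 0}

    def put(self, key: str, value: str) -> None:
        self.internal_metrics["requests"] += 1
        self._store[key] = value

    def get(self, key: str) -> Optional[str]:
        self.internal_metrics["requests"] += 1
        if key not in self._store:
            self.internal_metrics["misses"] += 1
            return None
        return self._store[key]


def black_box_probe(service: KeyValueService) -> Dict[str, object]:
    """External check: touches only the public put/get interface."""
    start = time.monotonic()
    service.put("probe-key", "probe-value")
    ok = service.get("probe-key") == "probe-value"
    return {"probe_ok": ok, "probe_latency_s": time.monotonic() - start}


def white_box_snapshot(service: KeyValueService) -> Dict[str, int]:
    """Internal check: copies the counters exported by the system itself."""
    return dict(service.internal_metrics)
```

note that the probe needs no change to `KeyValueService`, while the snapshot only works because the counters were added to the system's code.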

both approaches follow the same flow:

1) collect raw metrics and save data to somewhere

2) query data and generate higher level metrics

3) condition checking and notification

4) the above process is run periodically or continuously
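
the four steps above can be sketched in a few lines of python. this is only a rough sketch under assumptions, not how any particular monitoring tool works: `MetricStore`, `error_rate`, `check_and_notify`, `run_once` and `run_periodically` are hypothetical names, the raw metric is assumed to be 1.0 for a failed check and 0.0 for a successful one, and `notify` is any callable that takes a message (in a real setup it could send an email).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MetricStore:
    """In-memory stand-in for wherever the raw metrics get saved."""
    samples: List[Tuple[float, float]] = field(default_factory=list)

    def save(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))

    def query_since(self, since: float) -> List[float]:
        return [v for t, v in self.samples if t >= since]


def error_rate(values: List[float]) -> float:
    """Higher level metric: fraction of failed checks (0.0 if no data)."""
    return sum(values) / len(values) if values else 0.0


def check_and_notify(rate: float, threshold: float,
                     notify: Callable[[str], None]) -> bool:
    """Fire a notification when the condition is not satisfied."""
    if rate > threshold:
        notify(f"error rate {rate:.2f} exceeds threshold {threshold:.2f}")
        return True
    return False


def run_once(store: MetricStore, collect: Callable[[], float],
             notify: Callable[[str], None], now: float,
             window_s: float = 60.0, threshold: float = 0.5) -> bool:
    store.save(now, collect())                            # 1) collect and save
    rate = error_rate(store.query_since(now - window_s))  # 2) query and aggregate
    return check_and_notify(rate, threshold, notify)      # 3) check and notify


def run_periodically(store: MetricStore, collect: Callable[[], float],
                     notify: Callable[[str], None],
                     interval_s: float, iterations: int) -> int:
    """4) repeat the process; returns how many alerts were fired."""
    alerts = 0
    for _ in range(iterations):
        if run_once(store, collect, notify, time.time()):
            alerts += 1
        time.sleep(interval_s)
    return alerts
```

for example, `run_periodically(MetricStore(), lambda: 0.0, print, interval_s=1.0, iterations=3)` runs three checks one second apart and prints nothing, because the error rate never goes above the threshold. passing `now` into `run_once` explicitly is a design choice that keeps the first three steps testable without waiting on a real clock.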


