I’m a big fan of Netdata; it’s part of my standard deployment. I add custom configs depending on which services run on which servers. If there’s an issue, it sends me an email and posts to a Slack channel.
Next step is an InfluxDB backend to keep more history.
I also use Monit to restart certain services in certain situations.
Here they’re pushing the “must be within 60 miles from the office” trope; I bet they’d say to drive in if it’s after hours.