A lot of people here are suggesting metrics that are easy to collect but nearly useless for troubleshooting a problem, or even detecting it.

CPU and memory are the easiest and most obvious metrics to collect, but the least relevant.

If nobody’s looked at any metrics before on the server fleet, then basic metrics have some utility: you can find the under- or over-provisioned servers and fix those issues… once. After that, the well very quickly runs dry. Unfortunately, everyone will have seen this method “be a success” and will then insist on setting up dashboards or whatever. This might find one issue annually, if that, at great expense.

In practice, modern distributed tracing or application performance monitoring (APM) tools are vastly more useful for day-to-day troubleshooting. These tools can find infrequent crashes and expired credentials, correlate issues with software versions or users, and on and on.
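To make "correlate issues with software versions" concrete, here is a minimal sketch of the idea, not how any particular APM product implements it. The event shape (`version`, `ok`) and the function name `failure_rate_by_version` are hypothetical, made up for illustration.

```python
from collections import Counter

def failure_rate_by_version(events):
    """Return the fraction of failed requests for each app version.

    `events` is an iterable of dicts with a hypothetical shape:
    {"version": "1.5.0", "ok": False}
    """
    totals = Counter()
    failures = Counter()
    for event in events:
        totals[event["version"]] += 1
        if not event["ok"]:
            failures[event["version"]] += 1
    # A version with no failures gets 0.0 because Counter defaults to 0.
    return {version: failures[version] / totals[version] for version in totals}

# Hypothetical request events, as an app might emit them.
sample_events = [
    {"version": "1.4.0", "ok": True},
    {"version": "1.4.0", "ok": True},
    {"version": "1.5.0", "ok": True},
    {"version": "1.5.0", "ok": False},
]
rates = failure_rate_by_version(sample_events)
```

A CPU graph cannot tell you that one release fails far more often than the previous one. Per-request data tagged with the version can.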

I use Azure Application Insights because of its native integration with Azure, but New Relic and Datadog are also fine options.

Some system admins might respond to suggestions like this with “Other people manage the apps!”, not realising that therein lies their failure. Apps and their infrastructure should be designed and operated as a unified system: autoscale on metrics relevant to the app, monitor health relevant to the app, collect logs relevant to the app, and so on.
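As a rough sketch of scaling and health checks driven by app-level metrics instead of CPU: everything below is an assumption for illustration, including the metric names (`queue_depth`, `p95_latency_ms`, `failed_order_rate`), the thresholds, and the function names. A real setup would feed values like these into whatever autoscaler the platform provides.

```python
import math
from dataclasses import dataclass

@dataclass
class AppMetrics:
    queue_depth: int          # orders waiting to be processed (hypothetical metric)
    p95_latency_ms: float     # 95th-percentile request latency
    failed_order_rate: float  # fraction of orders that failed, 0.0 to 1.0

def desired_replicas(m, current, per_replica_queue=50, max_latency_ms=800.0,
                     min_replicas=2, max_replicas=20):
    """Pick a replica count from app metrics; all thresholds are assumed values."""
    # Enough replicas to drain the backlog at the assumed per-replica capacity.
    target = math.ceil(m.queue_depth / per_replica_queue)
    # If users already see slow responses, go at least one above the current count.
    if m.p95_latency_ms > max_latency_ms:
        target = max(target, current + 1)
    # Clamp to the allowed range.
    return max(min_replicas, min(max_replicas, target))

def is_healthy(m, max_failed_order_rate=0.01):
    """Health is defined by what the customer experiences, not by CPU load."""
    return m.failed_order_rate <= max_failed_order_rate
```

With these assumed thresholds, a backlog of 500 orders asks for 10 replicas no matter how idle the CPUs look, and a 5% failed-order rate marks the app unhealthy even when every server reports green.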

Otherwise, when a customer calls about their failed purchase order, the only thing you can respond with is: “From where I sit everything is fine! The CPUs are nice and cool.”