Here are a few truths, the most valuable ones I know on this subject:
Your primary monitoring should try to imitate a customer as closely as possible, covering every major function a customer needs.
Prefer functional tests over diagnostic tests. Functional tests tell you whether a system is doing a job. Diagnostics only tell you whether a system thinks it’s doing its job.
Determine log retention in consultation with legal, and enforce it. Enforced retention makes disk-full outages less likely and costs more predictable.
Alerts should have tunable noisiness. Don’t turn the noisiness down until you’re sure you know what’s noise.
Spend your money monitoring what costs you money. Metrics you don’t use are a wasted expense.
But: Data is cheap. Get a lot of it. Constantly think of ways to turn it into information that you can use to your profit.
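
To make the difference between a functional test and a diagnostic concrete, here is a minimal sketch in Python. Everything in it is hypothetical and exists only for illustration: the `BrokenStore` and `WorkingStore` classes stand in for a real service, `health()` stands in for a diagnostic endpoint, and `functional_probe` is an invented name for a check that does what a customer would do (write a value, read it back, compare).

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProbeResult:
    ok: bool
    latency_seconds: float
    detail: str


def functional_probe(
    put: Callable[[str, str], None],
    get: Callable[[str], Optional[str]],
) -> ProbeResult:
    """Imitate a customer: store a value, read it back, and compare."""
    key = f"probe-{uuid.uuid4()}"  # unique key so probes never collide
    expected = "canary"
    start = time.monotonic()
    try:
        put(key, expected)
        actual = get(key)
    except Exception as exc:  # any failure a customer would also see
        return ProbeResult(False, time.monotonic() - start, f"error: {exc}")
    elapsed = time.monotonic() - start
    if actual != expected:
        return ProbeResult(
            False, elapsed, f"read back {actual!r}, expected {expected!r}"
        )
    return ProbeResult(True, elapsed, "round trip succeeded")


class BrokenStore:
    """Hypothetical service that reports healthy but silently drops writes."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def health(self) -> str:
        # Diagnostic: the system's own opinion of itself.
        return "OK"

    def put(self, key: str, value: str) -> None:
        # Simulated bug: the write is lost.
        pass

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class WorkingStore(BrokenStore):
    """Same hypothetical service with the write path actually working."""

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


if __name__ == "__main__":
    for store in (BrokenStore(), WorkingStore()):
        result = functional_probe(store.put, store.get)
        print(type(store).__name__, store.health(), result.ok, result.detail)
```

Both stores answer "OK" to the diagnostic, but only the functional probe notices that one of them is not doing its job. In a real deployment the `put` and `get` callables would wrap whatever customer-facing operation matters most, and the recorded latency could feed an alert threshold.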