The rest was assembling the right team to look at it and even then it took almost two days to nail down what was causing the problem. (Lack of proper tooling to find the issue quickly kept causing us delays.)
We're in the early stages of getting Dynatrace implemented and that would've found the issue in minutes for us, even in our very large (20,000+ servers, thousands of microservices and a highly segmented & secured) environment. Broad based agreement on that conclusion.
Dude, if you are adopting Dynatrace, know that I am a pro in that tool. Example: I’ve written console apps that poll Dynatrace’s (vast) db and pull out peak requests per hour for all our services, finding the spikes so we can look at resiliancy. Ditto response times. If you guys need me, I ain’t cheap but I am beyond excellent.
And that’s just one aspect of my skillset. You guys would be very happy with the Laz. :)
Oh and BTW, having a sense WHERE the problem lies is a skillset in and of itself. You have that instinctual ability, clearly. That is extremely valuable.