Over the years I’ve found that almost every enterprise monitoring tool I’ve worked with (and there are a few dozen of them) puts a significant load on the targets it monitors.
One tool causes extensive hard parsing on its targets, another calculates free space for each tablespace in a very inefficient way, a third has its own set of issues, and so on.
It is extremely rare to find a monitoring tool with very low overhead that still does its primary job well.
The second major issue with monitoring tools is deciding what to monitor.
Every monitoring solution I’ve worked with takes the approach of monitoring everything.
Since each component of your enterprise architecture exposes thousands of metrics, such extensive monitoring not only adds significant load on the target system but also makes the results difficult to interpret (you can’t see the forest for the trees).
I prefer a single metric that always works as a starting point for more detailed troubleshooting, instead of thousands of them.
The third issue is setting thresholds properly across the thousands of metrics at your disposal.
If I have only one metric I can rely on, setting its threshold properly is not a problem.
The fourth major issue is interpreting the results.
The most difficult part is understanding clearly what is a cause and what is a consequence; only technology experts can interpret the results correctly.
I’m a big fan of simplicity, which is why I want a tool whose output can be interpreted even without a deep technical understanding of how the monitored system works.
The last major issue with monitoring tools is how to properly evaluate the importance and business impact of an alert.
In the vast majority of cases I’ve found that, for all the reasons above, monitoring tools generate hundreds or even thousands of unnecessary alerts, making it difficult to react to the alerts that actually matter.
In this post I’ll present my own approach, proven numerous times in the field.
Since I couldn’t find a tool that satisfies all the points listed above, I created my own, with only one metric, called “user experience”.
I can monitor user experience either in real time or historically.
The following picture shows what it looks like.
On the X-axis you can see the date range. The left Y-axis (blue) shows values in centiseconds (1 cs = 10 ms) with a range from 0 to 1 cs, while the right Y-axis (red) uses the same unit but a different range, in this case 0 to 5 cs; both ranges are generated dynamically from the input values.
Observe the two lines: the blue one represents the average user response time and corresponds to the left (blue) axis, while the red one represents the maximum value within each small sub-interval and corresponds to the right (red) axis.
The whole graph is interactive: hovering over a point shows its exact values, and I can zoom in, or hide and show the average and maximum series.
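The two plotted series can be sketched in a few lines; the sample data, bucket size, and function name below are my own illustration, not the author's actual implementation.

```python
from collections import defaultdict

# Hypothetical raw samples: (minute_offset, response_time_cs) pairs.
# In the real tool these would come from the monitored database.
samples = [(0, 0.3), (0, 0.5), (1, 0.4), (1, 2.5), (2, 0.2), (2, 0.6)]

def aggregate(samples):
    """Build the two plotted series: average response time per time bucket
    (blue line, left axis) and maximum per bucket (red line, right axis)."""
    by_bucket = defaultdict(list)
    for t, v in samples:
        by_bucket[t].append(v)
    buckets = sorted(by_bucket)
    avg = [sum(by_bucket[b]) / len(by_bucket[b]) for b in buckets]
    mx = [max(by_bucket[b]) for b in buckets]
    return buckets, avg, mx

buckets, avg, mx = aggregate(samples)
# Each axis range is derived from its own series, as described above.
left_range = (0, max(avg))
right_range = (0, max(mx))
```

Keeping the maximum on a separate axis is what makes a single short spike visible without flattening the average line.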
Now you can see what I mean by something simple, easy to understand and interpret, with almost no impact on the monitored system: I monitor only one metric, and the most important one at that, user experience.
Unlike many performance reports (e.g. AWR, ASH, ADDM) there is a timeline (the X-axis), so I can track a performance issue along the time dimension (AWR and similar reports use two points, a begin and an end time, and calculate the difference of various metrics between them).
With my approach I get not only real-time data for the user experience metric but its history as well, which makes it easy to set a threshold for that metric (e.g. 25 ms is a normal response time, 100 ms is a warning, and anything above 300 ms is something you need to react to immediately).
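A single-metric threshold check is then trivial to express. This is a minimal sketch; the function name and the exact cut-offs between levels are my assumptions, since the text only gives the 25/100/300 ms reference points.

```python
def classify(resp_ms: float) -> str:
    """Map average user response time (ms) to an alert level, using the
    reference points from the article: ~25 ms is normal, 100 ms is a
    warning, and anything above 300 ms needs an immediate reaction."""
    if resp_ms > 300:
        return "critical"   # react immediately
    if resp_ms >= 100:
        return "warning"
    return "normal"
```

With one metric there is exactly one threshold to tune, instead of thousands.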
Furthermore, since I’m not showing OS statistics, latches, or other technical metrics that would obscure the performance picture, and my single metric always expresses user experience, I have a clear view of what is going on.
There are many enterprise monitoring tools available, such as Oracle Enterprise Manager Cloud Control (popular for Oracle-based systems).
Although Enterprise Manager Cloud Control and similar tools are great, they are also very complex and require constant care and maintenance, especially when installing, upgrading, and patching.
Enterprise monitoring tools also impose a lot of overhead through the agent collectors that need to be deployed on each data source / target you monitor (memory, CPU, a lot of hard parsing, unshareable cursors, etc.), and the tools themselves are resource hungry.
My approach (and the tool I created) requires no installation, no maintenance, and no care at all, with almost negligible overhead.
Here is another picture showing the same metric over the same time period, but on a different database.
If you know that two databases communicate intensively, whether through a database gateway, JMS, a message broker, or some other channel, comparing the two interactive graphs lets you quickly figure out what is going on when you experience an issue.
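One way to make that side-by-side comparison quantitative is to correlate the two response-time series over the same time buckets. This is my own sketch, not part of the author's tool; the sample series are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation of two equally long response-time series;
    a value near 1.0 suggests the two databases degrade together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical avg response times (cs) for the same buckets on two
# databases that communicate intensively: both spike in bucket 3.
db_a = [0.3, 0.4, 2.1, 0.5]
db_b = [0.2, 0.3, 1.8, 0.4]
```

A high correlation between the two graphs is a strong hint that the slowdown on one database is driven by, or shared with, the other.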
Summary:
When designing an enterprise solution, among the many factors that drive key decisions, one is very often overlooked: simplicity.
In this article I’ve demonstrated what a simple approach means in the case of monitoring, but you can apply the same principle anywhere else.