I've spent the last couple of years at CERN building monitoring infrastructure for grid systems. Recently a colleague asked about a taxonomy of monitoring, and I dredged up the original taxonomy [PDF] we came up with three years ago. It seems to have stood the test of time, and I think it is still useful as a way to structure your metrics and decide which tools should gather, store and process them. I want to revisit some of the concepts raised there.
Local and remote monitoring
The first concept examined in the paper was that of local and remote monitoring systems. It was logical then to split monitoring into local and remote, meaning either on the site/data center where the services resided, or external to it. In 2007 most monitoring services were local, integrated into the fabric - for example Nagios, Ganglia, Cacti and so on. The exception tended to be lightweight 'ping' style services for HTTP. Now, with the rise of the cloud, I see a shift here: some PaaS monitoring offerings such as Circonus are remote by default, with the ability to add a local 'enterprise monitoring' component. This makes sense especially for things you want to do analytics or complex event processing on.
A nice split emerges in such a system: the local monitoring component is responsible for the operation of the underlying infrastructure, while the remote monitoring provides rich information on the performance of the application, often in terms of components that the business as well as operations understand. Again, in the cloud this can be supplied by the IaaS provider - the acquisition of Cloudkick by Rackspace points to the IaaS players trying to build out expertise and services in this area. Amazon is also trying to grow here with its recent offering of free basic CloudWatch monitoring, along with alarms on top of the monitoring metrics.
The new players
Since 2007, new open-source monitoring offerings have also come to market, often as offshoots from internal projects. They include:
- Reconnoiter (rebundled as a PaaS offering as Circonus, and also part of the Cloudkick infrastructure). Based on OmniTI's experience of running very large websites.
- Graphite. Scalable storage and display of performance metrics.
- OpenTSDB. Open-sourced version of StumbleUpon's internal monitoring system. Built on top of HBase and able to scale to billions of data points.
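One thing that makes tools like Graphite easy to integrate is how simple the ingestion side is. As a rough illustration, here is a minimal Python sketch of feeding a datapoint to Graphite over its plaintext protocol (one line per datapoint: metric path, value, Unix timestamp, sent to TCP port 2003 by default); the hostname here is a placeholder, not a real endpoint:

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Format one datapoint as a Graphite plaintext protocol line:
    '<metric path> <value> <unix timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.example.com", port=2003):
    """Open a TCP connection to the Graphite carbon listener and
    send a single formatted datapoint. Host is a placeholder."""
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example of the wire format produced:
# format_metric("site.web01.requests", 42, 1234567890)
#   -> "site.web01.requests 42 1234567890\n"
```

OpenTSDB's telnet-style `put` command is similarly line-oriented, which is part of why both tools are easy to hook into existing collection scripts.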
These are interesting because they focus on higher-level monitoring functions, such as trending and fault detection through complex correlation of metrics, and because they move past RRD for the storage and display of performance data. There's still a lot of work to do on these projects - documentation is sparse, and packaging and integration in distributions is weak - but they are exploring interesting directions for the storage, visualisation and analysis of metrics.
I'll follow up in separate postings on some of the other areas we looked at: status versus performance metrics, and the categorization of metrics.