Tuesday 11 January 2011

Giving structure to your monitoring [Part 1: Local and remote monitoring]

I've spent the last couple of years at CERN building monitoring infrastructures for grid systems.  Recently a colleague asked about a taxonomy of monitoring, and I dredged up the original taxonomy [PDF] we came up with three years ago.  It seems to have stood the test of time, and I think it's still useful as a way to structure your metrics and decide which tool should gather, store and process them.  I want to revisit some of the concepts raised there.

Local and remote monitoring

The first concept examined in the paper was that of local and remote monitoring systems.  It was logical then to split monitoring into local and remote, meaning either on the site/data center where the services resided, or external to it.  In 2007 most monitoring services were local, integrated into the fabric - Nagios, Ganglia, Cacti and so on.  The exception tended to be lightweight 'ping'-style services for HTTP.  Now, with the rise of the cloud, I see a shift here - some PaaS monitoring offerings like Circonus are remote by default, with the ability to add a local 'enterprise monitoring' component.  This makes sense, especially for metrics you want to run analytics or complex event processing on.

A nice split emerges in such a system: the local monitoring component is responsible for the operation of the underlying infrastructure, while the remote monitoring provides rich information on the performance of the application, often in terms of components that the business as well as operations understand.  Again, in the cloud this can be supplied by the IaaS provider - the acquisition of Cloudkick by Rackspace points to the IaaS players trying to build out expertise and services in this area.  Amazon is also trying to grow here with its recent offering of free basic CloudWatch monitoring, along with alarms on top of the monitoring metrics.

The new players

Since 2007, new open-source monitoring offerings have also come onto the market, often as offshoots of internal projects.  They include:

  • Reconnoiter (rebundled as a PaaS offering, Circonus, and also part of the Cloudkick infrastructure). Based on OmniTI's experience of running very large websites.
  • Graphite. Scalable storage and display of performance metrics.
  • OpenTSDB. Open-sourced version of StumbleUpon's internal monitoring system. Built on top of HBase and able to scale to billions of data points.

These are interesting as they focus more on higher-level monitoring functions, such as trending and fault detection through complex correlation of metrics, and on moving past RRD for the storage and display of performance data.  There's still a lot of work to do on these projects - documentation is sparse, and packaging and integration in distributions is weak - but they are exploring interesting directions for the storage, visualisation and analysis of metrics.
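To give a flavour of how these newer systems take in data, here's a minimal Python sketch that pushes a data point into Graphite via carbon's plaintext protocol - a newline-terminated 'metric.path value timestamp' line sent to TCP port 2003 by default.  The host and metric path here are placeholders of my own, not anything prescribed by Graphite:

import socket
import time

def send_metric(path, value, host="localhost", port=2003):
    # Carbon's plaintext protocol: '<metric path> <value> <unix timestamp>',
    # one newline-terminated line per data point.
    line = "%s %f %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# e.g. record the current load average under a made-up metric path
send_metric("site.webserver01.loadavg", 0.42)

Graphite then takes care of storage (in its whisper database) and graphing - exactly the layer where RRD used to be the only game in town.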

I'll follow up in separate postings on some of the other areas we looked at - status and performance metrics and categorization of metrics.

Monday 10 January 2011

Resolutions for 2011

2011 is set to be a year of big change for me as I leave CERN after 8 years.  So I thought it was a good year to write down some resolutions (for the first time ever!).  I think it's going to be important for me not to be too mono-focussed on any one area this year, so here are some resolutions for Body, Speech and Mind:

Body

In general I need to work a bit more on mountain fitness - I ran 1:42 for a half marathon this year; it'd be nice to get the training in to pull that under 1:40.

  • Run 1080 kilometres (I did ~950 last year)
  • Enter 7 races (I did 3 last year)
  • Complete 21 ski tours (~3 last year)
  • 21 other 'mountain days' - trail running, climbing, hiking

Speech

This year I want to write more, both code and text.  Especially since I won't have morning-coffee colleagues to talk to (other than my cat!), I'm going to try and publish more on this blog and Twitter.

  • 49 blog entries
  • 108 tweets
  • 7 new open-source projects on GitHub

Mind

  • Learn a new language (Tibetan, Hindi?) to basic conversational level
  • Learn a new computer language (Erlang?)

Let's see in 12 months how I did...

Thursday 6 January 2011

4 years of development activity in under 2 minutes

We started the WLCG Grid Monitoring Subversion repository in June 2007.  For nearly four years it's been home to the majority of the tools used to monitor the WLCG infrastructure.  Now, courtesy of Gource, here's a two-minute visualisation of all the check-ins over that period.

Stats: 38 committers, 4898 commits, 450K LoC...
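For anyone wanting to do the same with their own repository, the recipe is roughly the following (recent versions of Gource can read Subversion XML logs directly; the repository URL below is just a placeholder):

# dump the full history as XML, then feed it to gource
svn log -r 1:HEAD --xml --verbose --quiet https://example.org/svn/repo > repo.log
gource repo.log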


WLCG Monitoring SVN repository visualisation from James Casey on Vimeo.

New ActiveMQ yum repository (and 5.4.2 RPMs)

I got around to setting up a new yum repository to store my ActiveMQ RPMs.

You can find it at http://packages.platform14.net/repo/activemq/.  Currently I've got RHEL/CentOS 5 packages there.  I'll put up a Koji instance soon and add Fedora 13/Fedora 14 packages too.  They all build out of my GitHub repository: https://github.com/jamesc/apache-activemq

To enable the repository, here's a .repo file:

[activemq-centos]
name=activemq-centos
failovermethod=priority
baseurl=http://packages.platform14.net/repo/activemq/centos/5/$basearch/
enabled=1
gpgcheck=0


[activemq-source]
name=activemq-source
failovermethod=priority
baseurl=http://packages.platform14.net/repo/activemq/centos/5/SRPMS/
enabled=1
gpgcheck=0

It currently contains the latest 5.4.2 RPMs.
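To use it, drop the file into /etc/yum.repos.d/ and install as usual.  I'm assuming here that the binary package is simply named activemq - check the repository listing if yours differs:

# assuming the package is named 'activemq'
yum install activemq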

Tuesday 14 September 2010

Fudge in Python

I thought Fudge was an interesting format for messaging when Kirk Wylie first mentioned it last year.  Since then, C and C++ libraries have appeared alongside the original Java and .NET implementations.  Fudge, as well as being a strongly-typed, extensible, self-describing format, has some nice ideas, such as taxonomies, that look like a good fit for some of our messaging use-cases, such as sending GLUE records over messaging.  In GLUE, keys are long, and a taxonomy would significantly reduce the byte count on the wire.  The typed nature also gives us a big advantage over LDAP, and would save on lots of type conversions.
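Taxonomies are worth a quick illustration.  A taxonomy is a mapping from field names to short ordinals that sender and receiver agree on out-of-band, so only the ordinal has to travel with each field.  The sketch below is my own toy example of the idea, not the real Fudge wire encoding, and the GLUE-style key names are just for illustration:

import struct

# Hypothetical taxonomy: long GLUE-style keys mapped to 16-bit ordinals.
TAXONOMY = {
    "GlueCEStateRunningJobs": 1,
    "GlueCEStateWaitingJobs": 2,
}
REVERSE = dict((v, k) for k, v in TAXONOMY.items())

def encode_field(name, value):
    # Pack one integer field as (ordinal: uint16, value: int32), big-endian.
    return struct.pack(">Hi", TAXONOMY[name], value)

def decode_field(data):
    ordinal, value = struct.unpack(">Hi", data)
    return REVERSE[ordinal], value

# 6 bytes on the wire instead of a 22-character key name plus the value.
print(decode_field(encode_field("GlueCEStateRunningJobs", 42)))

The real encoding is richer than this (per-field type bytes, variable-width values, and names as a fallback for fields not covered by the taxonomy), but the saving on the wire comes from exactly this substitution.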

Another interesting usage (once you have a Python implementation) would be automatic translation from Django models to Fudge messages.  This could work very well with django-celery as a custom serializer.

With some time on my hands, I decided to take a crack at a Python implementation of the specification.  The specification is short and detailed enough to work from, but diving into the reference implementations is highly recommended if you have any doubts about what should be done.

A few days' work has led to a basic implementation up and running, with all basic types working (except for the date types).  There's still a lot missing, including recursive Fudge messages, taxonomies, context objects, streaming interfaces and better accessors for fields in a Fudge Message object.  Most of these are higher-level functionality layered on top of the basic encoding/decoding.

The code is now up on GitHub; expect changes over the next few days as I flesh out the implementation.

Thursday 3 June 2010

(Some) Java apps run faster on fewer cores...

From Paul Tyma: Mailinator(tm) Blog: How I sped up my server by a factor of 6

An odd effect: reducing the number of cores available to the JVM improved message throughput.