I have been catching up on some reading. Today’s reading was Monitoring and Troubleshooting Microsoft.com, a really interesting article on how Microsoft have structured their organisation and technology to tackle the operation of one of the world’s busiest Internet sites.
A few things struck me.
They obviously have the same monitoring problems as the rest of us:
“Left to their default configurations, most monitoring systems generate an excessive number of alerts that become like spam to administrators. Especially with large systems, it is important for organizations to carefully define what should be monitored and what events or combination of events should be raised to the attention of operations personnel. An organization must also plan to learn from the data collected. As with alert planning, this aspect of the solution is a significant undertaking. It requires creating data retention and aggregation policies, and combining and correlating all of the data into a data warehouse from which administrators can generate both predefined and impromptu reports.”
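The alert-suppression and correlation policy described above could be sketched very roughly like this. This is purely illustrative and not from the article; the class name, threshold, and window values are all invented:

```python
from collections import defaultdict

class AlertFilter:
    """Suppress duplicate alerts; escalate only when a threshold is hit
    inside a time window. A hypothetical sketch -- real monitoring systems
    apply far richer correlation and retention rules than this."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.events = defaultdict(list)  # alert key -> recent timestamps

    def record(self, key, timestamp):
        """Record an alert; return True if it should reach an operator."""
        # Keep only occurrences still inside the sliding window
        recent = [t for t in self.events[key] if timestamp - t < self.window]
        recent.append(timestamp)
        self.events[key] = recent
        return len(recent) >= self.threshold

f = AlertFilter(threshold=3, window_seconds=300)
print(f.record("disk_full", 0))    # False - first occurrence, suppressed
print(f.record("disk_full", 60))   # False - still below threshold
print(f.record("disk_full", 120))  # True - three alerts within the window
```

The point of even a toy filter like this is that the "what should be escalated" decision becomes an explicit, reviewable policy rather than a default configuration.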
But they have got to a point where:
“The overall system processes over 60,000 alerts a day, conducts approximately 11.5 million availability tests a day, parses 1.7 terabytes of IIS log data a day, and collects 185 million performance counters a day at a sampling rate of 45 seconds. However, to reach this degree of monitoring sophistication was a long process and required significant effort and cross-organizational coordination.”
I’m not sure whether those numbers indicate ‘monitoring sophistication’ or not.
The other thing that struck me was Microsoft’s ability to leverage internal resources and to operate a continuous-improvement methodology that genuinely improves things. Both are incredibly difficult in large organisations.
“After implementing and stabilizing the asset management and reactive monitoring systems, the focus of the operations team shifted to proactive testing of applications and defining proactive monitoring events.”
“The testing process also helps to determine what events are meaningful, and what corrective actions are appropriate in the case of those events. All of the information learned from transactional and stress testing is thoroughly documented as part of the release management process of the Microsoft Solutions Framework (MSF) that many of the development teams use.”
“The operations team wants to create a common eventing and logging class, based on recommendations from the Microsoft Patterns and Practices group, with deep application tracing.”
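A common eventing and logging class of the kind mentioned in that last quote might look something like the following sketch. To be clear, this is my own illustration, not the operations team’s design: the class name, event IDs, and fields are all invented, using only Python’s standard logging module:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(message)s")

class AppEventLog:
    """Hypothetical shared eventing/logging helper: every component logs
    structured events through one class, so operations can parse and
    correlate them consistently."""

    def __init__(self, component):
        self.log = logging.getLogger(component)

    def event(self, event_id, message, **fields):
        # Structured event: an id plus key=value pairs on every line
        detail = " ".join(f"{k}={v}" for k, v in fields.items())
        self.log.info("event_id=%s %s %s", event_id, message, detail)

    @contextmanager
    def trace(self, operation):
        # "Deep application tracing": emit begin/end events with timing
        start = time.perf_counter()
        self.event(1000, "begin", op=operation)
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.event(1001, "end", op=operation, seconds=round(elapsed, 3))

events = AppEventLog("checkout")
with events.trace("load_basket"):
    pass  # application work goes here
```

The value is less in the code than in the convention: if every team emits events the same way, the log parsing and correlation described earlier becomes tractable.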
It’s very easy to implement something and then leave it alone because it’s working, that is, until it stops working. That’s when the problems start, because people expect things to be as they left them when they implemented them, and they never are. Changes occur; the best thing you can do is make sure the changes contribute to improvement rather than to service entropy.