Monitoring and Troubleshooting Microsoft.com

[Image: Jimmy gets stranded]

I have been catching up on some reading; today’s reading was Monitoring and Troubleshooting Microsoft.com, a really interesting article on how Microsoft have constructed their organisation and technology to tackle the operation of one of the world’s busiest Internet sites.

A few things struck me.

They obviously have the same monitoring problems as the rest of us:

“Left to their default configurations, most monitoring systems generate an excessive number of alerts that become like spam to administrators. Especially with large systems, it is important for organizations to carefully define what should be monitored and what events or combination of events should be raised to the attention of operations personnel. An organization must also plan to learn from the data collected. As with alert planning, this aspect of the solution is a significant undertaking. It requires creating data retention and aggregation policies, and combining and correlating all of the data into a data warehouse from which administrators can generate both predefined and impromptu reports.”

But they have got to a point where:

“The overall system processes over 60,000 alerts a day, conducts approximately 11.5 million availability tests a day, parses 1.7 terabytes of IIS log data a day, and collects 185 million performance counters a day at a sampling rate of 45 seconds. However, to reach this degree of monitoring sophistication was a long process and required significant effort and cross-organizational coordination.”

I’m not sure whether those numbers indicate ‘monitoring sophistication’ or not.
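They do at least give a sense of scale. A quick back-of-envelope check on the quoted figures (all the input numbers are from the article):

```python
# Back-of-envelope arithmetic on the figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400
SAMPLING_INTERVAL_S = 45                 # quoted sampling rate, in seconds
COUNTER_SAMPLES_PER_DAY = 185_000_000    # quoted performance counters per day

samples_per_counter = SECONDS_PER_DAY / SAMPLING_INTERVAL_S   # 1,920 per day
distinct_counters = COUNTER_SAMPLES_PER_DAY / samples_per_counter

print(f"{samples_per_counter:.0f} samples per counter per day")
print(f"roughly {distinct_counters:,.0f} distinct counters tracked")  # ~96,354
```

In other words, the headline figure implies somewhere in the region of 96,000 distinct counters being sampled continuously, which is sophistication of a sort.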

The other thing that struck me was Microsoft’s ability to leverage internal resources and to operate a continuous improvement methodology that genuinely improved things. Both are incredibly difficult in large organisations.

“After implementing and stabilizing the asset management and reactive monitoring systems, the focus of the operations team shifted to proactive testing of applications and defining proactive monitoring events.”

and

“The testing process also helps to determine what events are meaningful, and what corrective actions are appropriate in the case of those events. All of the information learned from transactional and stress testing is thoroughly documented as part of the release management process of the Microsoft Solutions Framework (MSF) that many of the development teams use.”

and

“The operations team wants to create a common eventing and logging class, based on recommendations from the Microsoft Patterns and Practices group, with deep application tracing.”

It’s very easy to implement something and then leave it alone because it’s working, at least until it stops working. That’s when the problems start: people expect things to be as they left them when they implemented them, and they never are. Changes occur; the best thing you can do is make sure those changes contribute to improvement rather than to service entropy.

Microsoft 'Motion'

[Image: Jimmy and Grandad struggle to get back into the house]

Channel 9 today has a video on Motion, which was one of the topics at the Architecture Insight Conference.

There are also a couple of ARCasts.

If you are an IT Architect then Motion will be of interest to you. If you’re a technical person it won’t be.

Motion is about building a bridge between business architecture and IT architecture, and it does this by building a bridge between business services and IT services. That’s right: Microsoft doing business architecture.
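As a purely hypothetical sketch of what that bridge implies (the service names are invented for illustration, not taken from Motion), each business service becomes traceable to the IT services that realise it:

```python
# Hypothetical illustration only: Motion’s real models are not public,
# and these service names are invented.
business_to_it_services = {
    "order-fulfilment": ["crm-app", "delivery-scheduling", "partner-portal"],
    "invoicing": ["sap-finance", "document-generation"],
}

def it_services_for(business_service: str) -> list[str]:
    """Trace a business service down to the IT services that realise it."""
    return business_to_it_services.get(business_service, [])

print(it_services_for("order-fulfilment"))  # the bridge, in data-structure form
```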

It looks really interesting as an approach, but there isn’t that much collateral available online today because it’s still in incubation and ‘Motion’ is still a code name.


What is Architecture? The Return

[Image: A family outing, with Grandad driving. Oh dear!!!]

I commented on a piece by Michael Platt the other day looking at the definition of architect.

Since then Steve has commented.

Craig Andera has commented on Michael’s initial post and Michael has responded.

Michael has also added another document.

The chasing of definitions can be a wonderful tool for procrastination, and I’m in danger of doing just that, so I’m not going to comment any more. If I ever produce something as wonderful as Sir Christopher Wren (Architect) or as useful as Sir Joseph Bazalgette (Engineer) I’ll be more than happy.

“My definition of an expert in any field is a person who knows enough about what’s really going on to be scared.” P. J. Plauger, Computer Language, March 1983

Getting Control of the Infrastructure: Autonomic, WSDM, DSI, SDM, etc.

[Image: Grandad finds a snow drift]

In my previous post I talked about the problem of the complex infrastructure.

Is there anything going on in the industry to try and resolve these issues?

One of the first things that should be clear is that this isn’t an issue for a single company, and thankfully a number of companies are working together to resolve it (Microsoft, IBM, HP).

As with most technologies that are early in their development cycle, many names are being used and there is no clear taxonomy yet. Most people seem to recognise the term ‘Autonomic’, which was originally conceived by IBM (I think), but each vendor has its own initiative, and they are also coming together under the ‘WS-DM’ banner. The problem with the word ‘Autonomic’ is that it has another perfectly good use in biology. I’m not sure that WS-DM helps either, as it links the issue to Web Services, which is a bit limiting when the major elements are infrastructure, and infrastructure does lots of things which aren’t really Web Services.

The basic concept is that a service and all of its elements can be described, starting with the business requirements and working down into the technical requirements; Microsoft call this the Service Definition Model. Each of the service elements is then told to follow the document: if the document updates, they update. Likewise, changes made to the elements are assessed against the document and can only be applied if they don’t have an impact; once applied, they update the document.
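That description is abstract, so here is a minimal sketch of the pattern (the class and field names are my own invention, not the real SDM schema): a desired-state document that elements reconcile against, with proposed changes assessed before they are applied.

```python
# Hypothetical sketch of the desired-state pattern described above;
# none of these names come from the actual Service Definition Model.
from dataclasses import dataclass, field

@dataclass
class ServiceDefinition:
    """The document: the desired configuration of every service element."""
    desired: dict[str, dict] = field(default_factory=dict)

@dataclass
class Element:
    """An infrastructure element that is told to follow the document."""
    name: str
    actual: dict = field(default_factory=dict)

    def reconcile(self, definition: ServiceDefinition) -> None:
        # If the document has been updated, the element updates to match it.
        self.actual = dict(definition.desired.get(self.name, {}))

def propose_change(definition: ServiceDefinition, element: Element,
                   change: dict, has_service_impact: bool) -> bool:
    """Changes are assessed against the document and applied only if they
    have no impact; accepted changes then update the document itself."""
    if has_service_impact:
        return False  # rejected: the change would affect the service
    element.actual.update(change)
    definition.desired[element.name] = dict(element.actual)
    return True
```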

The technology has a long way to go, but the concept seems to work.

————

One of the questions the CFO did ask, one of the few repeatable ones, was this:

“What changed to cause the problem? It was working fine, so it must have been caused by a change.”

It’s one of those questions that cuts through to the issue, and “who knows” is the real response. There are lots of changes going on all the time: patches, fixes, configuration. If someone did change something, how were they supposed to know it was impacting a service that was using the element of the infrastructure they were changing?
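Answering it needs two things we rarely have: a complete change log, and a map from each infrastructure element to the services that depend on it. A minimal sketch of that lookup, with the dependency data and change log invented for illustration:

```python
# Hypothetical sketch: correlate logged changes with a service dependency map.
service_dependencies = {
    "order-fulfilment": {"firewall-3", "sql-cluster-1", "crm-app", "wan-link-2"},
    "payroll": {"sap-finance", "sql-cluster-2"},
}

change_log = [
    {"element": "firewall-3", "change": "rule update", "when": "02:10"},
    {"element": "sql-cluster-2", "change": "security patch", "when": "03:40"},
]

def changes_affecting(service: str) -> list[dict]:
    """Every logged change that touched an element this service depends on."""
    deps = service_dependencies.get(service, set())
    return [c for c in change_log if c["element"] in deps]

# “What changed?” now has a precise answer instead of “who knows”.
print(changes_affecting("order-fulfilment"))
```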

Until we can answer this question categorically and precisely, and preferably with the answer “nothing that’s had an impact”, we haven’t finished.

Tags: WSDM

Getting Control of the Infrastructure: The Problem

[Image: Grandad decides to sweep up]

Once upon a time a developer sat down and started work on an application he had been commissioned to write. He looked at the specification: it required screens, which required inputs. So he started coding. He decided what the data file structure should look like and created it on the local computer, he decided what the screens should look like and created them, and he decided on the logic and coded it. There was no network involved, there was no database involved, and there was only ever going to be one client on one computer. The whole project required a team of one.

But that was then and this is now:

An organisation has a process they want to automate; it’s a completely new process because they have already automated all of the others. A group of architects get together to try and understand the data inputs, the work-flow, the security requirements, the resilience requirements, the flexibility requirements, the extensibility requirements, the interfaces to other applications and the diverse client base. They decide that this process can be automated by linking together a number of existing systems and by using a browser-based interface which will be developed using an off-the-shelf application. The people who require access to the interface include people from a variety of different partners and suppliers, each of them with a variety of browsers. In order for the process to complete, a plethora of applications, databases, networks and servers are involved. This service makes this customer the bulk of their money; it is critical to them, and the faster it runs, the faster they get paid.

About nine months after the service has been created it develops a problem: it’s running slow and people are starting to notice. The CFO makes an angry call to the person responsible for the service.

“Where is the problem and why haven’t you fixed it?” is the question the CFO starts with.

“Well,” says the Service Provider, “this is the first I have heard of a problem. Who contacted you?”

“John from our delivery partner phoned me to say that they were only getting requests for deliveries after the due date and that it wasn’t their fault that parcels were being delivered late” the CFO retorts.

“I’ll get the team together to assess the likely cause” says the Service Provider a little sheepishly, though trying to sound bold and assertive.

“You have until 14:00” says the CFO.

Following the call, the Service Provider gets together the team of people he regards as responsible for the technical elements of the service. They each involve others they think should also be involved, and they each contact the vendors who deliver the software or hardware they are responsible for. They each take a look at their elements of the service.

At 14:00 the Service Provider has his follow-up call with the CFO and delivers this report:

“I have formed an investigative team to try and get to the root cause of this problem, I am still waiting for a representative from our local networks team, but I already have representatives from the internal SAP team, the Windows team, the SQL Server team, the firewall team, the wide area network team, the storage team, the backup team, the batch processing team, the CRM team, the Oracle team, the work-flow team, the identity management team, the desktop team, the AS400 team, the directory team, the email team and the UNIX team. I have also managed to free up one of the original architects to try and get his overall view of the issue. Furthermore a number of the vendors have offered their help.

They have each assessed their element of the service and can find no issues. Everything is working as they would expect it to work.”

I think you can imagine the CFO’s response.

This story is based on a caricature of a real situation I have personally been involved in, and I don’t believe it is at all overstated. The modern infrastructure and application mix is very complicated.

Current monitoring and management techniques don’t recognise the service in the same way as the person using it does: they recognise all of the elements, but they don’t put them together as a whole.
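A sketch of that gap (the component names and SLA threshold are invented for illustration): every element-level check can report healthy while the service, measured the way a user experiences it, is failing.

```python
# Hypothetical illustration: element-level checks vs. an end-to-end check.
component_checks = {
    "sql-cluster-1": "ok",
    "crm-app": "ok",
    "firewall-3": "ok",
    "wan-link-2": "ok",
}

def components_healthy(checks: dict[str, str]) -> bool:
    """What the element teams see: every box is green."""
    return all(status == "ok" for status in checks.values())

def service_healthy(response_seconds: float, sla_seconds: float = 2.0) -> bool:
    """What the user sees: a synthetic end-to-end transaction against an SLA."""
    return response_seconds <= sla_seconds

print(components_healthy(component_checks))     # True: "no issues found"
print(service_healthy(response_seconds=9.0))    # False: the service is slow
```

Until the monitoring rolls the elements up into that second view, “everything is working as we would expect” and “the service is running slow” will keep coexisting.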

Is anyone working on an answer?