Incident management failures | IT Operator view

The ideal world

R&D, Support Service and IT Ops collaborate
in the same room, on the same incident.

Investigation gets easy and fast.
Root cause gets identified in no time.
Resolution is just a matter of hours.

The reality

R&D, Support Service and IT Ops each work in their own space.

Multiple support levels also coexist: L1, L2…

IT Ops are the application guardians.
Knowledge of the hosted application – business or technical – is optional.
Investigation procedures should ideally be documented.

The production

The production environment is a well-guarded stronghold.

That is expected for security reasons. No discussion.

You do not let strangers access it easily, not even colleagues.

So data access is under strict control: procedures must be followed.

Non-production environments exist to reproduce issues.
But it is often difficult to keep them in sync with production,
and many factors can make an incident impossible to reproduce:
specific application usage, external connectivity, race conditions…

The geography

The actors are not always in the same geographical region.
Servers may be hosted in the US, support in India, R&D in Europe.

Communication gets slower because of the different time zones.
You’ll get your answer tomorrow morning.

Cultural differences can also affect communication.

The means

IT operators, support people and developers do not think the same way.
They basically do not speak the same language.

The tools at their disposal are very different:
– Production tools are monitoring-oriented and very expensive.
– Customer support often has home-made investigation tools.
– R&D tools are powerful but not suited to production environments.

“I do not understand your request, as I do not know the hosted application.
And running that command is out of the question.
Let me provide you with a screenshot of what I observe on prod…”

The organizations

Different companies can be involved: barriers go up.
That means different internal process flows.

“I need my manager’s approval first to prioritize your ticket.”

Team rotation is also a reality in support and production.
And staff turnover cannot be ignored in the medium term.

“Sorry, but your support contact has unfortunately left the company.
Yes, I know, he was very knowledgeable about the product.”

The Ping Pong game

And finally there are the interactions, known as the ping pong game.

Successive round trips, each taking time to obtain partial information,
in the hope that R&D gets the right ball to catch the issue.

Some people play for time, asking useless questions
while waiting for knowledgeable people to appear.

IT Ops see the signs and impacts of an incident, with some delay.
The challenge is to react in the right way: support guidance is vital here.
Unfortunately, the ping pong game makes it very slow.
Investigation procedures should be as light as possible.
A risky restart is often the solution.