The ideal world

R&D, Support and IT Ops collaborate
in the same room, on the same incident.

Investigation gets easy and fast.
Root cause gets identified in no time.
Resolution is just a matter of hours.

The reality

R&D, Support and IT Ops each live in their own space.

And multiple support levels coexist: L1, L2...

R&D is the final link.
Developers know the application internals and are therefore vital to resolving the incident, whose cause may be incorrect use of the application or a defect in the code.

The production

The production environment is a well-guarded stronghold.

Expected for security reasons. No discussion.

You do not let strangers access it easily, not even colleagues.

So data access is under strict control: procedures must be followed.
Exceptions exist to let R&D investigate when things get really critical, but by then it is often too late, especially when the issue cannot be reproduced at will.

The geography

The actors are not always in the same geographical region.
Servers hosted in the US, support in India, R&D in Europe.

Communication gets slower because of the different time zones.
You’ll get your answer tomorrow morning.

Cultural differences can also affect communication.

The means

IT operators, support people and developers do not think the same way.
They basically do not speak the same language.

The tools at their disposal are very different:
 – Production tools are monitoring-oriented (APMs) and very expensive.
 – Customer support often has homemade investigation tools.
 – R&D tools are powerful but not suited to production environments.

“I do not understand your request as I do not know the hosted application.
  And running that command is out of the question.
  Let me provide you with a screenshot of what I observe on prod...”

 

The organizations

Different companies can be involved: barriers go up.
It also means different internal processes.

“I need my manager’s approval first to prioritize your ticket.”

Team rotation is also a reality in support and production.
And staff turnover cannot be ignored in the medium term.

“Sorry, but your support contact has unfortunately left the company.
Yes, I know, he was very knowledgeable about the product.”

The Ping Pong game

And finally come the interactions themselves, known as the ping pong game.

Successive round trips, each taking time and yielding only partial information,
in the hope that R&D will eventually get the right ball and catch the issue.

Some people play for time, asking useless questions
while waiting for knowledgeable people to appear.

R&D struggles to get the full picture and provide the right fix.
It’s also very challenging to reproduce any exotic issue.

Several of these failure factors are probably at play in your organization.

What do we need, then?

How can we make incident management fast and efficient?