As Bank of England governor Mervyn King calls for an investigation into RBS Group's IT failings, a former senior IT manager at RBS Group, and a banking industry IT veteran, writes exclusively for Computing about what may have caused the 10 days of problems that have affected millions of customers at RBS, NatWest and Ulster Bank. He wishes to remain anonymous.
As a former IT manager at RBS, I have found it interesting to see the various directions in which the finger of blame has pointed for the group’s recent “technical glitch”. Some have blamed declining coding standards, others outsourcing, others an over-reliance on the internet. However, the likely causes of this – and related problems in banking systems generally – are somewhat more mundane.
When I first moved to RBS (taking over an established team and systems) the most serious issue I found was poor source control; nobody knew what code was running in the production system. In fact, there was a high degree of confusion regarding where the code was in general. Some was stored on individual developers' PCs.
If you do not have code under control it makes it close to impossible to do valid testing. It also makes supporting the system or fixing bugs a rather hit-and-miss affair. Sadly, not knowing what code was running in production was not a problem unique to my team – or to RBS.
Releases going into an unknown environment are dangerous enough, but a further (and related) problem I found was the uncontrolled way that releases were being made.
Not long into my role, I had to declare a complete freeze on all changes to my systems (even bug fixes) because of the damage my (inherited) team was doing.
In principle, there were strict controls around releasing code, with every code change requiring sign-off from an authorised business user and an IT manager. But in practice, anyone could release whatever they liked whenever they wanted.
No one was really responsible, because around a dozen IT people and a dozen business people could sign off a change. Where responsibility is diffused that much, any change can get signed off.
Reviewing all proposed changes personally, I quickly learned that half the changes (averaging 10 a week) were to fix things broken in previous releases.
Closely tied to source control are release and environment management. In many ways these are some of the dullest areas of the software development lifecycle and RBS regarded them as commodity services that could easily be moved offshore to its development centre in India.
However, environment management (setting up the versions of the system for development and testing) and release management (moving the new versions of code to production) are critically important and need to be done by people who understand the systems they are working on. Poorly managed environments can slow down development and invalidate testing.
Release management ultimately needs to make sure the right version of the program is released (ideally the version that has been tested). Poor practices around release management meant that, in some RBS teams, there was no guarantee the code released was the code that had been tested.
Documentation is another area regarded as dull, but which can be a major source of failure. One area RBS IT got right was producing the diagrams showing how the bank’s many systems are connected. I have passed through many banks that have failed to get this basic piece of housekeeping right, even where they had large teams of architects theoretically guiding the development of the infrastructure.
However, large swathes of infrastructure across the industry have little if any documentation. Even where teams are forced to produce a disaster recovery plan, this can turn into a box-ticking exercise, producing plans that are impossible to follow.
Testing is one of the most crucial parts of the software lifecycle. It is much talked about and has large amounts of money spent on it, but I still get the feeling – in bank after bank – that it is not really taken seriously.
At RBS, testing was aggressively moved offshore because the testers there were both cheaper and almost always computer science or engineering graduates. But the reality of testing is that testers need to understand what they are testing from both a functional and a business perspective.
At RBS – and other banks – large test teams add little value if they are assembled quickly with no time to understand what they are testing.
One large RBS test team was even forbidden to raise bugs directly with the vendor because they did not understand enough about the normal operation of the system to distinguish between a genuine bug and functionality they had not seen before.
None of these areas is as exciting as rolling out the latest Agile techniques, trying to achieve CMMI level 5 or rolling out shiny new systems, but they are the things that, if neglected, can turn the smallest bug into a major system failure.
If you are a banking CIO in search of excitement you can actually magnify these problems by simultaneously re-engineering the whole systems infrastructure (as RBS did in its ill-fated, £550m, North Star programme) and aggressively moving every possible technical function offshore.