Africa has the potential to be a leader in open source distributed computing, says Standard Bank Hadoop expert

But it's held back by a lack of available skills

South Africa's Standard Bank faces all of the familiar financial-sector challenges around data security, governance and compliance, plus a few more besides. The country's new data protection law, PoPI - based on the EU GDPR - is due to land next month. Meanwhile, data security is an increasing concern, and volumes of data continue to grow inexorably. Unlike in other parts of the world, finding a home for archived data in the cloud is not yet an option.

While Amazon and Microsoft have plans for a footprint in South Africa, they currently operate no data centres in the country, and regulations forbid the movement of personal data across borders.

A few years ago the need for a central store to hold the rising volumes of data led the bank to look at Hadoop.

"Our original PoC in 2015 was to offload data, so a data lake not an analytics platform, it was to free up space," said information scientist Brad Smith. "Now we're starting to look at Hadoop as the basis of analytics."

Smith and his colleague, Hadoop administrator Ian Pillay, head up the big data group in the bank tasked with setting up the Hadoop data lake. The bank's IT team is keen on adopting as much open source software as possible, both for the cost and for the community aspects - being able to contribute to the code, make their own tweaks and receive support in return. This informed their choice of vendors.

"We chose Hortonworks because of their flexibility and openness, and because they offered the right sort of training and support," Smith said.

Ideally, he added, they would use only open source tools to ease the integration woes, but the real world gets in the way.

"We are about 90 per cent open source but we still get stuck with proprietary and will do for the foreseeable future because the business gets what the business wants, but we do push back where we can," he said.

Of more serious concern, though, is the skill shortage in areas like data science.

"The biggest issue in South Africa is the skillset - it's non-existent. So you've got guys doing computer science coming in and trying to do all the data science," Smith said.

Such is the scarcity of talent that training people internally in data science or Spark just leads to shortages lower down. "It's a real problem," Pillay added.

Standard Bank operates in 25 different countries including eight in Africa and has 15 million retail customers. The Hadoop-based storage and analytics system is intended to provide real-time responsiveness in business areas from marketing to fraud detection. It is a single-instance, multi-tenant data lake that currently serves many of the bank's African branches, although this may need to change when the new data protection laws come in with their geolocation restrictions.

"There might be a requirement for them to run their own stacks as a sort of data fabric," Pillay explained.

Currently the emphasis is on storage: offloading data from the bank's archives and applications, with ingestion handled rapidly by the Hortonworks Data Flow (HDF) system. But, provided they can find the staff, Smith and Pillay are now looking to build up their analytics capability by combining the streaming data with that stored in the data lake.

Skills shortages aside, Smith is bullish about Africa's potential to be a leader in the field of distributed computing for the same reason that mobile commerce took off so quickly there: the ability to leapfrog a generation of technology.

"Africa is a continent to itself, but I think you'll see some amazing things come through, because they don't have legacy systems so the adoption rates of distributed computing technologies, specifically open source, are skyrocketing," he said.