Vertu: 'If you spend £10,000 on a phone it has to work'

Cloud architect Rob Charlton explains how analysing operational data takes the guesswork out of releasing new luxury smartphones

Vertu is a UK manufacturer of high-end handsets. For high end, read the highest of ends. The smartphone equivalent of a Bentley, the Signature model in titanium and red gold livery will set you back a cool £13,500 in Selfridges at the time of writing.

Unsurprisingly, the buyers of such phones are notably unforgiving of any failure. Cloud architect Rob Charlton put it very simply.

"If you spend £10,000 on a phone it has to work," he told the audience at Computing's DevOps Summit yesterday.

At this level, any failure is much more likely to be down to software than to hardware. Vertu provides its own apps: one lets you speak directly to a personal concierge to organise your life, and another lets you bid for prizes including dinner with President Obama. But failures in the software are likely to originate lower down in the stack. The low-level systems, the Linux kernel and Android comprise 35 million lines of code between them, and there is yet more piled on top of that.

"I checked the source tree on our current product and there are 60 million lines of code in it," Charlton said. "That's a lot to go wrong. The phone might reboot, the camera might take fuzzy photos, the battery might run out quickly or overheat."

Five years ago it was clear that Vertu's IT team needed to be more agile and responsive, to improve throughput and to become better at nipping problems in the bud during the testing phase. The company began an IT transformation process: moving on-premises infrastructure to the Amazon cloud, transitioning development to DevOps, automating manual processes first with Puppet and then with Ansible, and later using Splunk to analyse operational data generated by the phones.

During the development of a new model, a software agent that collects relevant metrics is run on the device. The data from the agent is uploaded to a server in AWS and then fed into a cluster running Splunk Enterprise. Feedback from this system helps the testers to pinpoint the causes of any failure.

"We look at how long has it been on, did it crash, what's the battery life like, all sorts of things," explained Charlton.

"We're constantly trawling through that data and if we see a signature of a phone crash, then rather than waiting for the person to report it, an email is sent directly to the crash analysis team."

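The idea of matching a crash signature and alerting the analysis team can be sketched in a few lines of Python. This is an illustration only, not Vertu's implementation, which Charlton described as running on Splunk. The signature patterns, the `find_crashes` and `build_alert` helpers and the recipient address are all assumptions made for the example, and the alert is built but never sent.

```python
import re
from email.message import EmailMessage

# Illustrative patterns only: the article does not say which
# signatures Vertu actually looks for in its device data.
CRASH_SIGNATURES = {
    "kernel_panic": re.compile(r"Kernel panic"),
    "app_crash": re.compile(r"FATAL EXCEPTION"),
}


def find_crashes(device_id, log_lines):
    """Return (device_id, signature_name, line) for every matching log line."""
    hits = []
    for line in log_lines:
        for name, pattern in CRASH_SIGNATURES.items():
            if pattern.search(line):
                hits.append((device_id, name, line.strip()))
    return hits


def build_alert(hits, recipient="crash-analysis@example.com"):
    """Build (but do not send) an email summarising the detected crashes.

    The recipient address is a placeholder, not a real mailbox.
    """
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"{len(hits)} crash signature(s) detected"
    msg.set_content(
        "\n".join(f"{dev} [{name}] {line}" for dev, name, line in hits)
    )
    return msg
```

In a real pipeline the alert would be handed to a mail server or a ticketing system; the point is that the tester never has to file the report by hand.
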
Product managers get more visibility too. A dashboard lets them monitor the entire system, which takes the guesswork out of timing the launch of a new product.

"They can see how many people are testing, what they've tested, who is on an old version of the software," Charlton said. "Suddenly, they can get a really good picture of the software and the main metric we use is the mean time between failures, or MTBF, which is used to assess the maturity of the software."

He continued: "This data has absolutely transformed the way we make our products. We are now data led. We can get an actual figure out saying the MTBF has now reached this key level and so we know we can launch, rather than saying, well, there aren't many bugs now so I think it's OK."
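
MTBF is conventionally calculated as total operating time divided by the number of failures observed in that time. A minimal Python sketch of the launch check Charlton describes might look like the following. The function names, the sample figures and the threshold are hypothetical, since the article does not give Vertu's actual "key level".

```python
def mtbf_hours(uptime_hours, failure_count):
    """Mean time between failures: total operating hours / number of failures.

    uptime_hours is a list of operating hours, one entry per test device.
    With no failures observed, the MTBF is reported as infinity.
    """
    total = sum(uptime_hours)
    if failure_count == 0:
        return float("inf")
    return total / failure_count


def ready_to_launch(uptime_hours, failure_count, threshold_hours):
    """True once the measured MTBF reaches the agreed threshold (an assumed value)."""
    return mtbf_hours(uptime_hours, failure_count) >= threshold_hours
```

For example, three test phones that ran for 100, 250 and 150 hours and crashed four times between them give an MTBF of 500 / 4 = 125 hours. That would clear a 100-hour threshold but not a 200-hour one.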