When HMRC's data analytics team stepped up at the UK IT Industry Awards in November to collect the Best Big Data project award for its Connect data warehousing and analysis project, it was the culmination of an initiative conceived as far back as 2005, when the Inland Revenue was merged with Customs & Excise.
The merger had been the long-overdue catalyst for a rethink of the way in which both organisations handled fraud detection. While the kind of data warehouse-based fraud detection systems commonly used in the banking industry were not unknown to the taxman, the implementations had been heavily siloed.
For example, there was a system for detecting VAT fraud, but it drew solely on data from the VAT system; likewise, there was a separate system for analysing self-assessment tax returns. Yet key markers for fraud might come from anywhere, such as a mismatch between the figures a company provided in its corporate tax return and the trading it claimed underpinned its VAT return.
"There were numerous systems in use, but none of them was integrated," says Mike Hainey, head of the risk and intelligence service data analytics team at HMRC. "To 'risk assess' on a broader front, you would have to dip in and out of these marts and have some very skilled people make the joins between the information sets. That was not very effective," he says.
Furthermore, many more subtle markers of fraudulent activity simply couldn't be pursued because staff in the Enforcement and Compliance department were unable to "play" with the data. It could take weeks or months for specialist computer staff to put together the data mart that compliance staff needed to investigate a new area, or to investigate in a new way.
As a result, HMRC was unable to get a truly rounded picture of individual taxpayers or businesses - a genuine, single "view" of the "customer".
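The kind of cross-system check described above, comparing what a company reports in one return with what it reports in another, can be sketched in a few lines. This is purely illustrative and not HMRC's method: the function name, the input layout (turnover keyed by a company identifier) and the 10% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    company_id: str
    vat_turnover: float
    ct_turnover: float
    relative_gap: float


def find_turnover_mismatches(vat_returns, ct_returns, tolerance=0.10):
    """Flag companies whose two reported turnover figures disagree.

    vat_returns / ct_returns: hypothetical dicts mapping a company id to the
    turnover implied by its VAT return and its corporate tax return.
    tolerance: assumed relative gap above which a company is flagged.
    """
    flagged = []
    for company_id, vat_turnover in vat_returns.items():
        ct_turnover = ct_returns.get(company_id)
        if ct_turnover is None:
            continue  # no matching record in the other system
        larger = max(abs(vat_turnover), abs(ct_turnover))
        if larger == 0:
            continue  # both figures are zero, nothing to compare
        gap = abs(vat_turnover - ct_turnover) / larger
        if gap > tolerance:
            flagged.append(Mismatch(company_id, vat_turnover, ct_turnover, gap))
    # Largest discrepancies first, so they can be reviewed first.
    return sorted(flagged, key=lambda m: m.relative_gap, reverse=True)
```

The comparison itself is trivial; the hard part in a siloed estate is the join on a shared company identifier, which is exactly what required "very skilled people" before the data sets were brought together.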
The all-seeing eye
The merger of the two revenue-collecting departments offered the opportunity to put together a business case for a more sophisticated approach, using the latest data warehousing technology and analysis tools.
First, the organisation put together a group made up of people from both "sides" of HMRC, as well as Aspire, the organisation that runs HMRC's outsourced IT, and services company Detica.
The group analysed the software options on the market before approaching vendors to conduct a pilot phase. This involved delving more deeply into the data for a particular - unnamed - county, analysing data on both individuals and companies to assess how effective the approach might be country-wide.
The aim was to identify the real-world "views" that would enable HMRC to make most sense of the data from a fraud and evasion point of view. Entities were put together so that the data could be analysed in different ways. So, says Hainey, an individual would be one entity, a family would be another entity, and a company another.
"So we identified real-world entities in which data clusters around, and then looked at the commonality in those areas that link those entities together," he says.
From that, it would be much easier to trace someone who was the director of a number of companies, along with his family connections and, say, the companies of which his wife is a director, as well as any family trusts - the data, in other words, could be clustered around these entities.
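One way to picture this clustering is as a graph in which entities are nodes and known relationships (a directorship, a marriage, a trust) are links. The sketch below is an assumption about how such a view could be modelled, not a description of the Detica tooling; the entity labels and function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Turn (entity, entity) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def cluster_around(links, start):
    """Return every entity reachable from `start` through any chain of links."""
    graph = build_graph(links)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk outwards from the starting entity
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical links of the kind described in the article.
links = [
    ("person:director", "company:A"),
    ("person:director", "company:B"),
    ("person:director", "person:spouse"),
    ("person:spouse", "company:C"),
    ("person:spouse", "trust:family"),
    ("person:other", "company:D"),
]
cluster = cluster_around(links, "person:director")
```

Starting from the director, the walk picks up his companies, his spouse, her company and the family trust, while the unrelated person and company stay outside the cluster.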
"We started to see low-hanging fruit early on in certain areas and in terms of spotting certain trends and patterns," says Hainey. "That made us realise that it was actually delivering a quality product in the area of spotting fraud indicators that we previously hadn't seen because we were suddenly aligning other data sets that, combined, was telling us a different story."
The pilot project was not only a success, he adds; it also highlighted a number of areas where HMRC could achieve an almost immediate yield. Indeed, what was learnt in the pilots was fed back into the "business" almost immediately so that action could be taken.
It also helped to build a strong business case, not only for the system but also for the ongoing running costs that would be incurred. The case then had to be put to an internal investment committee for approval.
The front-end of the system comprises what HMRC calls the Integrated Compliance Environment (ICE), a graphical tool from Detica that enables investigators to put together information around entities, and the Analytical Compliance Environment (ACE) that analysts and statisticians use to put together risk profiles and interrogate large volumes of data.
At the back-end, SAS Institute provides the data warehousing, while DAN - the Data Acquisition and Networking system - provides the extraction, transformation and loading (ETL) capabilities for taking the data items in different formats and transforming them into a structure that the data warehouse can store and make available for analysis.
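The ETL step can be illustrated with a minimal sketch: records arrive from different systems in different shapes and are rewritten into one common structure before being stored. Nothing here reflects how DAN is actually built; the feed names, field names, date formats and the list standing in for the warehouse are all assumptions for the example.

```python
from datetime import datetime
from decimal import Decimal


def from_vat_feed(raw):
    """Hypothetical VAT feed: UK-style dates, amounts in pounds as strings."""
    return {
        "entity_id": raw["vat_reg_no"].strip(),
        "source": "vat",
        "period_end": datetime.strptime(raw["period_end"], "%d/%m/%Y").date(),
        "amount_pence": int(Decimal(raw["turnover_gbp"]) * 100),
    }


def from_self_assessment_feed(raw):
    """Hypothetical self-assessment feed: ISO dates, amounts already in pence."""
    return {
        "entity_id": raw["utr"].strip(),
        "source": "self_assessment",
        "period_end": datetime.strptime(raw["tax_year_end"], "%Y-%m-%d").date(),
        "amount_pence": int(raw["income_pence"]),
    }


# One transform per source system; each emits the same target structure.
TRANSFORMS = {
    "vat": from_vat_feed,
    "self_assessment": from_self_assessment_feed,
}


def extract_transform_load(feeds, warehouse):
    """Transform every raw record and append it to `warehouse` (a plain list here)."""
    for source, rows in feeds.items():
        transform = TRANSFORMS[source]
        for raw in rows:
            warehouse.append(transform(raw))
    return warehouse
```

Once every record shares the same fields, analysts can query across sources without knowing how each originating system formats its data.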
A feast of data
In addition to being able to get a single view of "the customer", the system is able to incorporate a wide range of data from other sources too, says Hainey. "It's departmental data at one end of the spectrum, commercial data - bought-in information."