NAB is steadily reducing its reliance on overnight batch processes to get data from source systems into a central data platform, while also reducing its operational costs.
Speaking at the Databricks Data+AI Summit in San Francisco, data platforms executive Joanna Gurry provided the most comprehensive view of Ada, the bank’s second-generation data platform, since it first broke cover in late 2022.
Ada, which runs Databricks on AWS at its core, replaces several large data platforms, including the NAB Data Hub (NDH) and a 26-year-old Teradata environment considered “intrinsic” to a number of banking operations.
Data made its way into these previous environments via batch processes run each night, presenting a “kind of batch schedule hell” for some source system owners.
“Having to extract files very late in the day or overnight means that their operational teams have to be very attentive during the night to make sure that those massive files that they're extracting and sending over to the data lake and the data warehouse arrive intact,” Gurry said.
The process also meant that a large amount of duplicate data was scooped up and uploaded every night.
One of the key changes with Ada - or, more specifically, with a tool in its stack provided by Fivetran - is that it’s allowing the bank to “move away from file-based ingestion” and be more selective in what gets extracted and replicated in the centralised data platform.
The bank is using Fivetran’s data connectors to interface with source systems, and its “change data capture” functionality to work out what’s new in each source system and relay that across to Ada in real time.
That, in theory, should be considerably more efficient for the bank - and Gurry’s appearance at the summit quantified that efficiency for the first time.
“The couple of benefits that arise from that approach are certainly around the volume of data that you need to store, so we're seeing reductions in terms of cloud, processing and storage costs,” Gurry said.
“Within the first year we've seen ingestion costs fall by about 50 percent.
“The performance is also really good - we’ve seen a 30 percent increase in the performance of machine learning models as well as just ad hoc queries that are written in SQL, and in some cases it's much faster than that as well.
“So, the benefits are certainly starting to accrue.”
Gurry said that source system owners backed the use of Fivetran once they realised that the technology would not have a performance impact on the source system that it was constantly checking for changes.
“The Fivetran tools hook onto their systems and [the changed data] just trickles through during the day and overnight,” she said. “It's a great relief to them.”
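As a rough illustration of the difference Gurry describes, the sketch below contrasts a nightly full extract with change data capture. It is a toy model only: the `Change` record, the `full_extract` and `apply_changes` helpers and the sample account rows are all hypothetical, and it does not represent Fivetran's connectors or NAB's pipelines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One recorded change in a (hypothetical) source system."""
    op: str                      # "insert", "update" or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row contents; None for deletes


def full_extract(source: Dict[int, dict]) -> List[dict]:
    """Batch style: copy every row, whether or not it changed."""
    return [dict(row) for row in source.values()]


def apply_changes(table: Dict[int, dict], changes: List[Change]) -> None:
    """CDC style: replay only the recorded changes onto a table."""
    for change in changes:
        if change.op == "delete":
            table.pop(change.key, None)
        else:
            table[change.key] = dict(change.row)


# Five made-up account rows, already loaded once into the target platform.
source = {k: {"id": k, "balance": 100 * k} for k in range(1, 6)}
target = {k: dict(v) for k, v in source.items()}

# Changes that happen in the source system during the day.
changes = [
    Change("update", 2, {"id": 2, "balance": 250}),
    Change("insert", 6, {"id": 6, "balance": 600}),
    Change("delete", 1),
]
apply_changes(source, changes)   # the source system itself changes
apply_changes(target, changes)   # CDC relays only those changes

rows_sent_by_batch = len(full_extract(source))  # every row is re-sent
rows_sent_by_cdc = len(changes)                 # only the changes are sent
print(rows_sent_by_batch, rows_sent_by_cdc, target == source)
```

In this toy case the target stays in step with the source while fewer records are moved, which is the effect behind the reduction in duplicate data; real savings depend on how much of each source system actually changes from day to day.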
Bringing in more data
NAB is aiming to have data from “over 200” source systems represented within Ada.
“In one year, we've loaded 80 systems; partway through our second year we're already up to 120 systems, and we're heading for what will be more than 200,” Gurry said.
The ingestion plan is accessible to staff via NAB’s intranet.
“We publish exactly where we're up to on every single source ingestion and when it's going to land, so that's all internally available to anyone working in the business,” Gurry said.
Technical progress is tracked using a dashboard featuring pictorial illustrations of five kangaroos.
“Every source system, before it's finished and we say that we're done and we move on to the next set, has to meet these five ‘kangaroo’ standards.
“We have a dashboard and it's got the five kangaroos on it, and you can actually click through each one and see who signed off on the requirements, who ran the test report, where is the test report, how were the test results run, and all of the screenshots to say this thing ran 14 days straight after we loaded it in, with no interruption and that the pipelines are stable and that we were able to exit hypercare.
“It's actually great for our own teams because they can track if they're dependent on any loads to finish ahead of when they start their work, and the auditors like it because it's all self-serve … and they can grab anything anytime they want.
“I think that's been one way that we've really gained trust, not just with our own team, but with our stakeholders and our user community as well.”
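The "14 days straight" exit condition Gurry mentions could be expressed as a simple check over a pipeline's run history. The sketch below is an assumption about how such a rule might look, not NAB's dashboard logic: the `stable_for` function, the run-log structure and the dates are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def stable_for(run_log: Dict[date, bool], load_date: date, days: int = 14) -> bool:
    """Return True if a successful run is recorded for every one of the
    `days` consecutive days starting on load_date (a missing day counts
    as an interruption)."""
    return all(
        run_log.get(load_date + timedelta(days=offset), False)
        for offset in range(days)
    )


# Hypothetical run history: one entry per day, True meaning the run succeeded.
load_date = date(2024, 6, 1)
clean_log = {load_date + timedelta(days=i): True for i in range(14)}

# Same history, but with a failed run on the sixth day.
interrupted_log = dict(clean_log)
interrupted_log[load_date + timedelta(days=5)] = False

print(stable_for(clean_log, load_date))        # clean run history
print(stable_for(interrupted_log, load_date))  # history with an interruption
```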
Metadata capture
Another known part of the Ada program of work is centrally documenting business metadata, such as data dictionary definitions, classifications and data lineage, in Databricks Unity Catalog.
This is important work, particularly where source systems are older and “some of the knowledge around these systems has dissipated” over time.
“There's hardly anyone in the bank that actually knows what's inside some of these databases. Even the application teams that run them might not have a data dictionary.
“So, we're building a new metadata repository within Unity Catalog. I would say it's probably the first time that we've had these [hundreds of] systems with all of their business metadata and the technical metadata documented in one place.”
AI springboard
Gurry noted that data is “an inexhaustible resource”, with many opportunities for reuse as long as there is data platform infrastructure in place to support it.
“If you build it and load it once, you’ve really set yourself up for success with reuse across digital pipelines, traditional BI, regulatory reporting but also [created] the platform you need for machine learning,” she said.
“For us, [Ada] is also the springboard for enabling all the exciting GenAI use cases that we're pursuing at the moment.”
Gurry raised several use cases that were also published in an article earlier this month. She called out one example not on that list: “We're doing a tone-of-voice alignment where we're using GenAI to prompt changes to campaigns and [personalised] communications to further align to NAB’s tone-of-voice.”
Ry Crozier attended the Databricks Data+AI Summit in San Francisco as a guest of Databricks.