ANZ Banking Group has spent the past two years stripping out internal “gatekeepers” that slowed the progress of new code, introducing automation and ending the need for developers to work marathon 18-hour Saturday nights to get code into production.
Details of this project have been drip-fed from the bank over the past year, including at last month’s regional Red Hat forum, where the bank picked up a vendor award.
The project started as a way to “increase the speed of software releases”, Red Hat said, and saw the bank migrate “more than 25 services into containers within nine months” of 2017.
The technology stack underpinning that is Red Hat’s OpenShift container platform as well as a range of other Red Hat technologies.
“ANZ reports it reduced deployment time from hours to minutes, allowed development teams to deploy their own code into production, automated configuration, and enabled zero downtime deployments,” Red Hat said.
A presentation given by the bank in August - which has only just been made public - provides one of the most detailed looks yet inside the project.
Platform engineering technology area lead Mike Berry said the project owed its earliest origins to industry hype surrounding Docker.
“Back in 2016 we were thinking we should do ‘Docker in production’ ... because we think this will be great,” Berry said.
“Then we started to talk about requirements. We had a look at a few different ways to do that - Rancher was big at the time, we had a look at Docker Data Center, but then we also looked at OpenShift and we decided to go with that.
“One of the reasons for that is OpenShift is kind of like a Red Hat-branded version of Kubernetes, so if you want to do things with OpenShift and with Kubernetes, it’s kind of the same thing, but Red Hat gives you a few extra bells and whistles to do that.
“And then if you go look at the code base right now for Kubernetes, the number one [contributor] is Google putting lots of commits into it, but the number two company is actually Red Hat and so they have some experience there.”
Berry knew that the Docker and Kubernetes ‘cool’ wasn’t going to be enough to sell the project so his team looked at “other reasons” to start using it.
“For us our reasons were faster deployments, particularly with zero downtime; increased stability for our environments so that if you lose a server it’s not the end of the world; more production-like environments - it’s hard if you’re hand building bits and pieces here and there to make them all the same; and greater end to end control of the application lifecycle,” he said.
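In Kubernetes terms, the zero-downtime goal Berry describes usually maps to a rolling-update strategy on a Deployment. A minimal sketch of what that looks like - the service name, replica count and registry are illustrative, not ANZ’s actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service        # illustrative name, not an ANZ service
spec:
  replicas: 3                  # losing one pod is "not the end of the world"
  selector:
    matchLabels:
      app: example-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below full capacity during a deploy
      maxSurge: 1              # bring one new pod up before an old one is removed
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
      - name: app
        image: registry.example.com/example-service:1.0.1
        readinessProbe:        # traffic only shifts once the new pod is ready
          httpGet:
            path: /healthz
            port: 8080
```

With `maxUnavailable: 0` and a readiness probe, the platform only retires an old pod after its replacement is serving, which is what makes a business-hours deployment safe.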
Removing the blockers
ANZ had problems with “gatekeepers” in various internal teams slowing the introduction of new code into production.
Berry cited an example of a job that had to run on five servers every morning from 2am.
“To achieve that goal in our organisation - and fortunately this organisation does not exist any more, this is last year, but at the time this was all genuine 100 percent - you had [to engage all] these different teams,” he said.
“You could fill out a form to engage them, or send them an email or phone call or maybe have a Skype conversation.”
The first two months of the process would be “wasted trying to get five servers to run a job”.
“For the next two months we had to do things like find charge codes, determine whether a project management person was assigned to this, and deal with all kinds of issues that had nothing to do with technology,” Berry said.
“So we get to the end of this and there’s two phone calls because I can tell you these teams do not like to phone each other, there’s four different forms that you fill in to engage with people, there’s 17 conversations on Skype, there’s 106 emails and 62 business days - I’m not even counting weekends or the Christmas break - and not a single server has been completed”.
Berry said he couldn’t foist all the blame for the delays on other internal teams.
“To be fair we did change our requirements during this process. We said we don’t want five servers, we want nine servers. So part of it’s my fault,” he conceded.
Painting by numbers
But the process was overly complex, and Berry was able to convince the business of that by graphically representing the process - and its huge delays - to ANZ management.
“If you’ve got data in a spreadsheet, turn it into a nice picture and then your managers will understand what you’re talking about, and then they’ll agree this is not right and we’ll change this,” he said.
In summary, that graph showed that when an infrastructure request was lodged, it went “to a pool of people where eventually it gets to someone who actually looks at it, and then what happens if it’s not quite right - it goes through again and again. Eventually it comes out the other end, maybe at some point with what you wanted.”
“The more people and teams in the way from here to here ... the longer it takes,” Berry said.
“That was the motivation for us behind [deploying] Kubernetes because I can get to a developer and say, ‘you just do everything you want to into this container, and then I’m going to build a system that automatically pushes that through. You don’t need to talk to people about that thing, it’s going to be automatic’.”
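A pipeline of the shape Berry describes - developer pushes a container, automation does the rest - is commonly expressed as CI configuration. A hedged sketch in GitLab-CI-style YAML (the article doesn’t say which CI tool ANZ uses, and the stage, image and service names here are illustrative):

```yaml
stages:
  - build
  - deploy

build-image:
  stage: build
  script:
    - docker build -t registry.example.com/example-service:$CI_COMMIT_SHA .
    - docker push registry.example.com/example-service:$CI_COMMIT_SHA

deploy:
  stage: deploy
  script:
    # No human gatekeeper in the path: the rollout is fully automated,
    # and the audit trail is the pipeline and cluster logs.
    - kubectl set image deployment/example-service
        app=registry.example.com/example-service:$CI_COMMIT_SHA
    - kubectl rollout status deployment/example-service
```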
Berry noted that shifting to container-based deployment was not just a technology challenge but also a cultural one.
“We had to talk to teams and say, ‘look we have the new technology, you’re not the gatekeeper anymore, your job is to build automation’,” Berry said.
“If you want to audit our system, go to the logs, you can look at them anytime you want, but don’t stop us and have meetings about having to approve things.”
Other lessons learned
Berry used his presentation to deal with a number of lessons that ANZ had learned from its Kubernetes and container orchestration experience.
One of those was to keep it simple - the bank started out deploying only two services onto Kubernetes, which proved to be the right call when ANZ ran into logging problems once in production.
“One does not simply deploy everything to Kubernetes and hope for the best,” Berry said.
“We were going a bit fast with this, and this was our post production live problem.
“When we first set up logging it wasn’t great and there was all kinds of issues with throughput … Most logs made it through but not all, and if you work in a bank you really want all of them.
“So our report card was ‘needs improvement’. The lesson is you need to plan for logging. Logging doesn’t magically happen - it won’t be automatic.”
For ANZ, not having migrated everything across to Kubernetes at once gave it some breathing space.
“When you just have two services logging it’s really not that bad. But we had more things ready, and if we had just deployed them all and went live then it would have been much worse, and much harder to fix,” Berry said.
“There’s a lot of new things in this ecosystem to learn, and you’ll only learn by doing it. Logging was the one surprise for us.”
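Planning for logging in Kubernetes usually starts with applications writing structured logs to stdout, so a cluster-level collector can pick them up rather than logs living and dying inside the container. A minimal Python sketch of that baseline - illustrative only, the article doesn’t describe ANZ’s actual logging stack, and the logger name is hypothetical:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, so a cluster-level log
    collector (e.g. a Fluentd or fluent-bit DaemonSet) can parse it."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


def make_logger(name="example-service"):
    # Containers should log to stdout, not local files: the container
    # is ephemeral, so anything written to its filesystem is lost
    # when the pod is replaced.
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("job started on %s servers", 5)
```

Getting every line from stdout to durable storage reliably - the throughput problem Berry describes - is the collector’s job, and is exactly the part that “won’t be automatic”.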
Time to represent
Berry said the bank also introduced Kubernetes with strong internal representation on both the development and operations sides of the business.
“You need some people on your side that are willing to fight for this new thing,” Berry said.
“Things that are new and exciting in technology organisations that are big and clearly complex [don’t experience] success without some good allies on your side.”
Containerisation also brought some technical challenges, and Berry said a reasonable amount of education and training had been required.
This included dispelling internal thinking that “containers are ... little virtual servers”.
“Particularly for containers, they should be immutable - you never ever change the container, you don’t jump in if it’s not working well and add something to fix it. You get your image, you find something wrong, you change your image, you redeploy it, and now you have a new container running,” Berry said.
“The other [characteristic of containers is they’re] ephemeral. Ephemeral means containers can be stopped, destroyed, or replaced at any point in time.”
Berry said that understanding the characteristics of containers, and the differences from virtual servers, was important to developers knowing what they should and shouldn’t containerise.
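The distinction Berry draws can be sketched as a workflow. These commands are illustrative only (they assume a cluster, a registry and a deployment name that are not ANZ’s) and need a live environment to run:

```shell
# Immutable fix workflow: never patch a running container.
# Wrong: exec in and hot-fix - the change dies with the (ephemeral) container:
#   kubectl exec -it example-service-abc123 -- vi /app/config.ini

# Right: change the image, rebuild, redeploy.
docker build -t registry.example.com/example-service:1.0.2 .
docker push registry.example.com/example-service:1.0.2
kubectl set image deployment/example-service \
    app=registry.example.com/example-service:1.0.2
# Kubernetes replaces the old pods with new ones running the fixed image.
kubectl rollout status deployment/example-service
```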
ANZ did some hands-on virtual training to reinforce that; the virtual nature - via videos, phone calls and Slack - was necessary since ANZ’s developers are spread across Melbourne, Sydney, Bangalore and Chengdu.
“We have all our people on a phone call and Slack where one of my team will make a broken CI [continuous integration] build and give it to everybody and get them to make it work,” Berry said.
“We talk through how you would troubleshoot that.
“Because it’s not the running that teaches you anything about automation because it just runs. It’s how do you fix it if it breaks [that is where you learn].”
Big speed gains
The results of the Kubernetes deployment and automated code deployment have been pronounced.
ANZ’s developers have gone from pulling all-nighters on weekends to being able to get code into production during business hours in about an hour.
“This is still super common but if you were at a large company you might do the old 18-hour Saturday night to deliver a thing into production,” Berry said.
“Some people like those things - we’ll all get together, get some takeaway and deploy some software. I’d rather be at home.
“Now we do deployments in an hour but that’s during business hours and we verify and check it. That’s pretty good.”
The bank now also “rarely requires” either planned or unplanned outages of its dev and test environments.
“Because people would change things in a test environment so much, we would always have ‘it’s a planned outage, it’s fine’. I’m like, no, it’s still offline. Just because you planned it doesn’t make it better,” Berry said.
“And so even though we have very meticulously planned outages, they’re still outages.
“Those have gone away now. Even if we do have an outage - because it’s not a perfect world, things crash - in the new world, or at least in Kubernetes, just that element crashes, not the whole thing. It’s better.”
New environments used to “take ages” to create and validate, and ANZ Bank has gotten that process down from months to “same day”.
Berry was particularly pleased that the bank could now react quickly to issues it experienced, though he noted this wasn’t specifically thanks to Kubernetes; rather, the project formed part of wider efforts to improve continuous integration and continuous delivery (CI/CD).
“To me the most important thing is that we had a security issue we found, we pushed that through the full cycle in one day,” he said.
“If you have a really important thing that needs to go through, we have a system that allows that same day.”