Macquarie Bank spent some of the past two years finding ways to beat “diseconomies of scale” it experienced when scaling up the number of teams deploying applications into the cloud.
Kubernetes platform owner and associate director Jason O’Connell told the recent OpenShift Commons Gathering in Boston that standardising the way deployments to OpenShift were run allowed the bank to onboard more teams and applications without consuming too much of the platform team’s time or other internal resources.
Macquarie Bank’s retail digital banking platforms run on a platform-as-a-service that consists of Red Hat’s OpenShift container platform hosted on AWS infrastructure.
The bank is also testing an OpenShift-like alternative in Google’s new Anthos service, a move that may be related to its previously stated desire to run across multiple clouds, including Google Cloud Platform (GCP).
Two years ago, the bank was migrating its first applications to OpenShift - and therefore into the cloud - “so our whole focus at that time was getting into production and making sure we had everything stable and running,” O’Connell said.
“After that, we wanted to move to scaling out OpenShift and offering it for every team in the organisation to make sure that any team could use it to migrate to the cloud.
“So OpenShift became a core part of our cloud migration strategy.”
However, as the bank put more into the cloud, certain costs - such as those incurred by the platform team in supporting and migrating an application into the cloud - started to rise at a time when the bank had expected them to fall.
“When we first went live in production [with OpenShift], we were getting quite good at onboarding new applications so our cost was coming down,” O’Connell said.
“But things started slowing down as we onboarded more and more teams.
“Initially, we were working together with product teams to onboard their applications. Later, as we expanded out to the organisation, we were dealing with teams that aren't even in the same building, let alone the same country, and that are less mature in their understanding of [containers].
“Everything was new for them, and so what it means is we slowed down a lot.
“This is a diseconomy of scale. What we want really is an economy of scale here. We want to make it that we can onboard 10-20 teams, or go from 300 applications to 500 applications, seamlessly and with no friction.
“It shouldn’t need the platform team ... extra resources in order to onboard more and more applications. So that's what we're aiming for.”
O’Connell said that the process of migrating and deploying applications into the cloud was already heavily automated, but that the automation was being done on a “per application” basis.
“The teams used to run their own deployment scripts, and things got very messy. Rather than everyone doing things differently, we started to say 'you have to do things the same way',” he said.
“What we really wanted is reuse. We don't just want to automate - we want to automate it once for everyone, and have teams reuse those scripts.
“So we said there's going to be one way that you do deployments, and we're going to write those scripts.”
One thing that helped the change was a review that found most applications being put into the cloud were broadly similar.
“Although all teams think that they're doing things differently, we actually realised that 90 percent of the applications we run are Spring Boot microservices, and they're actually very, very similar,” O’Connell said.
“So we had to force teams to standardise.”
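As an illustration only - not the bank’s actual tooling - a reusable deployment script for a fleet of near-identical Spring Boot microservices might look something like the sketch below, which uses the Kubernetes Python client to stamp out one standard Deployment shape. The application name, image, ports, probe paths and resource figures are assumptions.

```python
# Hypothetical sketch only: one reusable deployment helper for every
# Spring Boot microservice, instead of per-application scripts.
# Names, image, ports, probe paths and resource figures are illustrative.
from kubernetes import client, config


def build_spring_boot_deployment(app_name: str, image: str,
                                 replicas: int = 2) -> client.V1Deployment:
    """Return a Deployment with the standard shape every team reuses."""
    container = client.V1Container(
        name=app_name,
        image=image,
        ports=[client.V1ContainerPort(container_port=8080)],
        # Standard Spring Boot actuator probes so every service is
        # health-checked the same way.
        readiness_probe=client.V1Probe(
            http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080)),
        liveness_probe=client.V1Probe(
            http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080)),
        # Explicit requests/limits so resource controls (quotas, chargeback)
        # have something to act on.
        resources=client.V1ResourceRequirements(
            requests={"cpu": "250m", "memory": "512Mi"},
            limits={"cpu": "500m", "memory": "1Gi"}),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": app_name}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": app_name}),
            spec=client.V1PodSpec(containers=[container])),
    )
    return client.V1Deployment(api_version="apps/v1", kind="Deployment",
                               metadata=client.V1ObjectMeta(name=app_name),
                               spec=spec)


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    deployment = build_spring_boot_deployment(
        "payments-api", "registry.example.com/payments-api:1.0.0")
    client.AppsV1Api().create_namespaced_deployment(
        namespace="payments", body=deployment)
```

The point of a helper like this is that a new team supplies only a name and an image; everything else - probes, ports, resource requests - comes from the shared standard.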
The bank also created standardised “capability services” that all teams can use in their projects.
“These are things like secrets management, which we're migrating to [HashiCorp] Vault, chatbots, CI/CD with Jenkins, Knative - these are core capabilities,” O’Connell said.
“In OpenShift it's very easy to install some of these tools in five minutes, and development teams and product teams want to do this, but ... it means we'll lose standardisation and lose control, and things can become messy.
“So we say that any capability service our team runs, we build it properly, we make it multi-tenanted, we address security/controls/risk, and we build it for everyone to use.”
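As a rough illustration, an application consuming a centrally run Vault capability service might read its secrets along the lines of the sketch below, which uses the hvac Python client; the Vault address, Kubernetes auth role, mount point and secret path are all assumptions.

```python
# Hypothetical sketch only: an application reading credentials from a
# centrally run, multi-tenanted Vault capability service. The Vault URL,
# auth role, mount point and secret path are illustrative assumptions.
import os

import hvac


def get_database_credentials(app_name: str) -> dict:
    """Fetch an application's secrets from a shared Vault KV v2 store."""
    vault = hvac.Client(
        url=os.environ.get("VAULT_ADDR", "https://vault.internal.example.com"))

    # On OpenShift/Kubernetes, the pod can authenticate with its service
    # account token rather than a long-lived static token.
    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
        vault.auth.kubernetes.login(role=app_name, jwt=f.read())

    secret = vault.secrets.kv.v2.read_secret_version(
        mount_point="apps",           # shared mount managed by the platform team
        path=f"{app_name}/database")  # per-application path (illustrative)
    return secret["data"]["data"]     # KV v2 nests the payload under data.data


if __name__ == "__main__":
    creds = get_database_credentials("payments-api")
    print(sorted(creds.keys()))  # never print the secret values themselves
```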
O’Connell also highlighted the importance of controls in creating a scalable OpenShift-based environment.
At Macquarie Bank, controls governed functions like user access management and resource management.
“Being a bank, we need controls on all of these things and the controls you need are [at] different levels,” he said.
“I can't go into all the controls, except that if you want a scalable platform, any control you put in, you need to focus on the developer experience.
“You need your control to be frictionless, automated and self-managed [because] if you block the deployment on a release day for a team, and they don't know why, they're going to come to the platform team - my team - and cause a big noise and then there'll be a lot of work to get them unblocked.”
O’Connell provided an example of the amount of work that went into defining a single control - around teams that request more CPU and memory for their applications.
Some aspects of the control focused on “disincentives” for teams that asked for more system resources without addressing the resource-hungry aspects of their applications.
“We say, ‘if you're over your capacity, we're going to prevent your deployment, so locking you down until you bring it [back] under’ - that's a disincentive to force [a team] to adhere to the control,” O’Connell said.
“To ensure that they optimise, we need chargeback. So if they ask for more CPU and memory, they get charged. If you don't put in chargeback, they just ask for more and more and more.
“We need them to be able to self-manage so we give them the ability to auto cleanup in nonprod, so they can say ‘these developer environments, they're going to be cleaned up nightly’, or ‘these environments last longer’ and so on.”
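A nightly clean-up of opted-in developer environments could, as a sketch, be a scheduled job along these lines; the label names, and the choice to delete whole namespaces rather than scale them down, are assumptions rather than the bank’s implementation.

```python
# Hypothetical sketch only: a nightly job that removes non-production
# projects whose owners have opted in to automatic clean-up.
# The label keys and values are illustrative assumptions.
from kubernetes import client, config


def cleanup_nonprod_environments() -> None:
    config.load_incluster_config()  # e.g. run as a Kubernetes CronJob
    core = client.CoreV1Api()

    # Teams opt in by labelling their environments for nightly clean-up.
    namespaces = core.list_namespace(
        label_selector="env=nonprod,cleanup=nightly")

    for ns in namespaces.items:
        name = ns.metadata.name
        print(f"Deleting opted-in non-prod namespace: {name}")
        core.delete_namespace(name=name)


if __name__ == "__main__":
    cleanup_nonprod_environments()
```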
O’Connell said that if a team actually did need more compute, the control meant that the allocation could be “self-approved so they don't need to involve [the platforms] team.”
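The blocking and self-approval pieces of that control map naturally onto per-project resource quotas. The sketch below shows roughly how a quota might be set and later raised when a team self-approves more capacity; the project name, quota name and figures are illustrative, not the bank’s actual setup.

```python
# Hypothetical sketch only: a per-project ResourceQuota that blocks new
# deployments once a team exceeds its agreed capacity, plus re-use of the
# same helper when a team self-approves an increase. Names and figures
# are illustrative assumptions.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def set_project_quota(namespace: str, cpu: str, memory: str) -> None:
    """Create or update the capacity quota for one team's project."""
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-capacity"),
        spec=client.V1ResourceQuotaSpec(hard={
            "requests.cpu": cpu,
            "requests.memory": memory,
        }))
    core = client.CoreV1Api()
    try:
        core.create_namespaced_resource_quota(namespace=namespace, body=quota)
    except ApiException as exc:
        if exc.status != 409:  # 409 = quota already exists, so replace it
            raise
        core.replace_namespaced_resource_quota(
            name="team-capacity", namespace=namespace, body=quota)


if __name__ == "__main__":
    config.load_kube_config()
    # Initial allocation: pods that would push total requests past this are
    # rejected by the quota admission controller, i.e. the deployment is blocked.
    set_project_quota("payments", cpu="4", memory="8Gi")
    # Later, a self-approved increase (which would also feed chargeback).
    set_project_quota("payments", cpu="6", memory="12Gi")
```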
Teams were able to see their current compute usage on a self-service dashboard.
“We also have email and we're just building out chatbot alerting so that when they do approach their limits, we're alerting them so that they know that they're not going to be blocked in production, they're given enough warning before then,” O’Connell said.
“So just for this one control, to make it seamless and frictionless, we need to actually build out six different components and applications just to manage the control of resource usage.”
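The warning piece alone gives a sense of that effort. A minimal version might poll quota usage and notify a team as it nears its limit, as in the sketch below; the 80 percent threshold and the notification stub standing in for email and chatbot alerts are assumptions.

```python
# Hypothetical sketch only: warn a team as it approaches its resource quota
# so a production deployment is never blocked without notice. The 80 percent
# threshold and the notify() stub are illustrative assumptions.
from kubernetes import client, config
from kubernetes.utils import parse_quantity

WARN_THRESHOLD = 0.8  # warn at 80 percent of the hard limit


def notify(namespace: str, resource: str, used: str, hard: str) -> None:
    # Stand-in for the email and chatbot integrations described in the talk.
    print(f"[{namespace}] {resource} at {used}/{hard} - approaching quota")


def check_quota_usage(namespace: str) -> None:
    core = client.CoreV1Api()
    for quota in core.list_namespaced_resource_quota(namespace).items:
        hard = quota.status.hard or {}
        used = quota.status.used or {}
        for resource, limit in hard.items():
            current = used.get(resource, "0")
            if float(parse_quantity(current)) >= WARN_THRESHOLD * float(parse_quantity(limit)):
                notify(namespace, resource, current, limit)


if __name__ == "__main__":
    config.load_kube_config()
    check_quota_usage("payments")
```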
The bank said it now has 42 teams that have collectively deployed 306 applications to the cloud via OpenShift.