Trust Bank: Building for scale while enhancing the customer experience

Good morning. My name is Rae Rai, and I'm the CIO of Trust Bank, the first digital bank in Singapore. It's a pleasure to be here today, and thank you very much for making it down here to listen to this talk.

With me today we have Xavier, a principal architect at AWS who has been with us since day one, and Scott, our head of cloud platforms, who has also been with us since the beginning.

We're going to talk to you a lot about our architecture. Xavier will introduce you to the native AWS solutions that we use, and Scott will take you through how we innovate on top of them.

And at the end of this session, we have a video for you that shows how it all comes together and how we create these experiences.

A bit of background about myself: I started with about 10 years in manufacturing, way back in 1991, then spent 20 years in financial services in Australia, and then I moved to Singapore to set up the bank.

When I moved to Singapore, the first question I was asked was: does Singapore need another bank? Quite frankly, Singapore is over-banked. Ninety-eight percent of the population is banked; they all have access to banks. Some of the biggest global banks operate in Singapore, and they have been investing a lot in their digital experiences. So there was a lot of work to do in order to catch up.

But let's see what happened since we launched on the 1st of September 2022. As you can see, the purple line depicts where we are. We started with about 100,000 users in the first 10 days, and in 200 days we had half a million customers. Today we are one of the fastest-growing digital banks in the world in terms of market penetration: we have over 12% of the population of Singapore. The other lines you see are banks that have been operating for a while, and they show how those banks have grown over the past few years.

A little more about Trust Bank. Trust Bank is a partnership, roughly 60% owned by Standard Chartered, which I'm sure you've heard of and which has been operating for about 165 years. The other partner is NTUC FairPrice, which has been operating for about 50 years in the Singapore ecosystem. They have groceries, insurance, and food, and one of the largest loyalty and rewards programmes in the country, with about a million interactions with the ecosystem every day. That's about 20% of the population of Singapore.

And we are the first digital-native bank of Singapore.

So let's talk about the foundations of our success. We'll cover three key ones.

The first one is, of course, scale. When we started, the cost of operating was obviously high because you have fewer customers; as you grow exponentially, your cost of operations per customer comes down. We only have about 100 engineers, so we have to be very, very efficient. And as Werner Vogels said, everything fails all the time, so you have to design for failure and be available all the time. The SLA set by the regulators in Singapore is extremely high: you have to maintain 99.95% for the entire year. And of course, you want your architecture to be always on.

The second is that you've got to be real time. Digital customers today have an expectation of instant gratification; Uber and Airbnb set the standard. You have to be real time and available all the time, and most importantly, stale data is of no use to your customers. In our case, we make sure we use real-time data to be transparent: we know if you have completed, say, three transactions, so you get your bonus interest. Everything we do in the bank is real time. Customers have a habit of instant gratification, and the architecture needs to support that.

Last but not least is speed of execution. You must have heard the famous quote that vision without execution is hallucination. You have to have the ability to execute, and to execute at speed and scale. As you execute fast, you start building experimentation capital for your organization. For us, that means shifting left: you shift security left, you move fast, but as a bank you cannot fail in production. So if you are going to fail, fail early: shift left and fail early in your life cycle.

Now, exponential organizations have a tenacious ability to execute. For example, Amazon releases hundreds of changes per second; I believe they've mentioned this in public. That is what differentiates an exponential organization from one that just about executes and grows in a linear way.

So I'll hand this over to Scott and Xavier, who will talk more about that, and I'll come back towards the end to summarize and take you through some of the customer experiences that we've built.

Oh yes, I forgot one thing, sorry: the architecture. Let me cover the architecture before I hand it over.

The core of the architecture is built around Thought Machine, which is our beating heart. Thought Machine generates events, so the arteries of the entire bank are pumping data and events all the time, and we have services that listen to these events and can react, adapt, and respond to them in real time. Thought Machine is a headless core bank; you interact with it only through APIs. It does one thing and does it really well: ledgers. Everything to do with journal posting is done by Thought Machine, and everything else is externalized outside the core. Around it we have microservices that are designed to listen asynchronously and that are segregated by domain.

So we are very much aligned with domain-driven design, and that enables us to move really fast and evolve. For example, transactions come through all the time; as you enrich and categorize them, you can build all of these services on the side and evolve without disrupting any other microservice that is working in the ecosystem.

With that, we are 100% containerized. We work with a lot of partners, and we make sure our partners follow a container-first principle, which helps bring our cost of operations down.

Again, this is really important: we have fewer than 100 engineers operating the entire platform, so anything repeatable has to be automated, made easy, and made efficient.

So without further ado, I'll hand this over to Xavier to take us through the next section, and Scott will help as well. Thank you.

OK. So let's go into the more technical part of the presentation. We will review the design choices and technical innovations for the different components, and to start we'll go with compute.

Trust Bank has chosen EKS as its core compute platform. As you can see on the slide, they have nine EKS clusters in production, configured for resilience: with nine clusters the blast radius is limited, and the applications are deployed across three availability zones. On top of that, they use static stability, meaning they deploy additional capacity so they can withstand the sudden loss of a node or an availability zone.

Scalability is another top priority for the bank, so they use the Horizontal Pod Autoscaler to automatically increase or decrease the number of pods based on metrics like CPU, memory, or Kafka consumer lag. For example, the Thought Machine core banking components are automatically scaled from 400 to 600 pods for daily peak load processing.
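
As a rough illustration of that kind of scaling policy, here is a minimal sketch of a CPU-driven HorizontalPodAutoscaler applied with the official Python kubernetes client. The deployment name, namespace, and thresholds are hypothetical, not Trust Bank's actual configuration, and scaling on Kafka consumer lag would normally go through an external metrics adapter (for example KEDA) rather than the resource metric shown here.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

# Hypothetical HPA: keep between 400 and 600 replicas, targeting 70% CPU utilization.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "core-banking-workers"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "core-banking-workers",
        },
        "minReplicas": 400,
        "maxReplicas": 600,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="core-banking", body=hpa
)
```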

Thank you. So yes, we have nine EKS clusters in production as of now, but this month we're deploying additional clusters into production, and into 2024, as part of an architectural upgrade to our cluster design. This is because Trust has a north star of achieving zero-downtime deployments to production across the entire stack on EKS, especially during business hours and for business-critical workloads.

To achieve this outcome, we realized we needed to treat our EKS clusters as cattle rather than pets; we needed them to be more ephemeral. Hence the redesign.

Today I wanted to talk about three key enhancements we've made to the EKS design. The first is the use of a service mesh that works multi-cluster and cross-cluster. The second is improving the way we scale and manage our nodes in EKS. And the last is tooling that provides automated recommendations for right-sizing our workloads.

Let's talk about the service mesh first. For our new cluster design we're making use of Solo.io's Gloo Mesh product suite. It's built on top of Istio and Envoy, and it gives us some higher-order constructs to help manage both the mesh itself and the workloads that run on it.

Up on the screen here are two examples of what a multi-cluster mesh capability gives us. The first, at the top, is a blue/green setup where 100% of the traffic routes through your primary cluster. That frees you up to destroy and redeploy the secondary cluster as you wish; once you've verified it's in a working state, you can cut the traffic back over, which is great from an operational point of view.

The second is an active-active configuration through the mesh. This allows us to take one particular workload and run it simultaneously across two clusters, and thanks to the mesh's capabilities we get more advanced traffic management and failover. If a request from downstream flows into our application layer and one of the clusters is not able to serve it, the mesh can transparently reroute the request to the other cluster and hopefully return a successful response back downstream to the consumer.

The second thing to talk about is how we scale and manage our nodes. Since 2021, when we set up EKS and set up the bank, we had been using the Kubernetes Cluster Autoscaler, and we were finding it was taking north of 10 minutes to scale out a node.

It also meant we had limited ability to leverage Spot, so we turned to changing the autoscaler that we use. Just out of interest, hands up in the audience if you're already leveraging Karpenter for your EKS clusters. Can I get a show of hands? OK, interesting.

The advantage of switching to Karpenter, an open-source project maintained by Amazon, is that we found it has a more direct method of interacting with the cluster to manage these scaling events. When there are unschedulable pods, Karpenter kicks in and uses data points such as your pod resource requirements, the instance types you've specified, and even the instruction set architecture the pods are compatible with.

Through this more direct integration, we found we could spin up nodes in about 30 seconds and have them ready for workload deployment in under a minute, which is obviously a big improvement for us.

The other nice thing is that Karpenter also hooks into the AWS Price List API dynamically to calculate the most cost-effective set of nodes to deploy for your cluster.

Another interesting aspect is that you can group your workloads on the same cluster using node pools. This allows you, for example, to separate your workloads by architecture: you can have a set of nodes for x86 and a set of nodes for ARM, and leverage Graviton instance types to get the cost and energy-savings benefits of Graviton.

There are some additional Karpenter features that we're not aggressively using in production yet but plan to; however, we're already making use of them in non-production. This is important for us because about 60% of Trust Bank's monthly Amazon bill is in our non-production environments, so if we can realize those cost savings early, that's hugely beneficial.

Now, thanks to the speed at which we can scale nodes, Spot is back on the table for us. Karpenter also has some interesting capabilities such as consolidation, where it calculates the number of nodes you actually need, and if you have too many provisioned, it moves your workloads onto a smaller number of nodes and tears down the extraneous ones, obviously saving cost.

The last feature we're leveraging is a TTL on the nodes. Say you set a TTL of one day: Karpenter will automatically spin up a new node, evict and move the pods over to the new instance, and then tear down the old one. This is very beneficial for us from an operational excellence perspective, because we no longer need a human to maintain these node fleets, and from a security perspective, because we're constantly re-rolling and repaving our nodes.
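
To make the node-management ideas above concrete, here is a minimal sketch of a Karpenter NodePool (v1beta1 API) that combines the pieces just described: an ARM/Graviton-only pool, Spot plus On-Demand capacity, consolidation, and a one-day node TTL. The pool name, node class, and limits are illustrative assumptions, not Trust Bank's real configuration, and it is applied here through the Python kubernetes client.

```python
from kubernetes import client, config

config.load_kube_config()

# Illustrative Karpenter NodePool: Graviton (arm64) nodes, Spot or On-Demand,
# consolidated when underutilized and recycled every 24 hours.
graviton_pool = {
    "apiVersion": "karpenter.sh/v1beta1",
    "kind": "NodePool",
    "metadata": {"name": "graviton-pool"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {"name": "default"},  # assumes an EC2NodeClass named "default"
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
                    {"key": "karpenter.sh/capacity-type", "operator": "In",
                     "values": ["spot", "on-demand"]},
                ],
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenUnderutilized",  # pack workloads onto fewer nodes
            "expireAfter": "24h",                        # node TTL: repave nodes daily
        },
        "limits": {"cpu": "1000"},
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1beta1", plural="nodepools", body=graviton_pool
)
```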

The third aspect of our EKS uplift is the platform team providing tooling that gives automated right-sizing recommendations for our workloads. On the left you can see a very simplified example of some over-provisioned pods. We use two open-source projects: one is called Goldilocks and the other is the Vertical Pod Autoscaler, or VPA, and in particular we use VPA's recommender mode. It hooks into the Kubernetes Metrics Server and, based on current and past performance characteristics, calculates the ideal resource requests for each pod.

As an example, up on screen is a demonstration of Goldilocks, which provides a user interface on top of VPA. This allows our engineering teams to go in and easily see what their right-sizing recommendations are. The key point is that we've chosen to focus on dynamic horizontal scaling; for vertical scaling we manually curate the changes using this tooling.
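
As a sketch of what recommender-only VPA looks like in practice, the snippet below creates a VerticalPodAutoscaler with updateMode "Off" (so it only recommends and never evicts) for a hypothetical deployment, then reads back the recommendations that a UI like Goldilocks surfaces. The namespace and workload names are made up.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# VPA in recommendation-only mode for a hypothetical "txn-enrichment" deployment.
vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "txn-enrichment-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "txn-enrichment"},
        "updatePolicy": {"updateMode": "Off"},  # recommender mode: no automatic pod eviction
    },
}
api.create_namespaced_custom_object(
    group="autoscaling.k8s.io", version="v1", plural="verticalpodautoscalers",
    namespace="payments", body=vpa,
)

# Later: read the recommendations computed from Metrics Server history.
obj = api.get_namespaced_custom_object(
    group="autoscaling.k8s.io", version="v1", plural="verticalpodautoscalers",
    namespace="payments", name="txn-enrichment-vpa",
)
for rec in obj.get("status", {}).get("recommendation", {}).get("containerRecommendations", []):
    print(rec["containerName"], "target:", rec["target"])  # e.g. {'cpu': '250m', 'memory': '512Mi'}
```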

The other benefit of optimizing your workloads is that it harmonizes with Karpenter's consolidation feature: as we optimize our workloads, we reduce the number of nodes required to run them. Thank you.

Let's move to the next chapter: databases. I will focus in particular on how and why Trust Bank decided to migrate from RDS PostgreSQL to Aurora PostgreSQL. Trust Bank has more than 120 logical databases storing terabytes of data. At the beginning, they were using RDS PostgreSQL in a multi-AZ configuration - primary in one availability zone, standby in another availability zone, and then read replicas to isolate read-only requests such as analytical queries.

After six months, the bank decided to migrate to Aurora PostgreSQL. So what were the drivers for this migration? First one, the opportunity to improve resiliency. Second one, opportunity to optimize cost. And the third one, the need to improve IOPS. In fact, the batch processing time was increasing with the growth of the business.

Before reviewing the impacts of the migration, I will quickly present the migration process. Trust has chosen a migration process which focuses on data integrity. For the preparation, we deployed an Aurora replica. And the source of this read replica was the existing RDS PostgreSQL database.

For the cutover, they paused customer transactions, promoted the Aurora replica to primary, and then switched the configuration over to use Aurora. After that, they ran a set of tests to validate the new configuration, and based on the results a go/no-go decision had to be taken. The tests were successful, so it was a go, and they were able to restart transactions. If a no-go decision had been taken, they would have been able to roll back by reversing the configuration without impacting customer transactions.

Overall, this process had limited downtime but was mostly focused on data integrity. Let's dive a little deeper into Amazon Aurora resiliency. Aurora uses a purpose-built, log-structured storage system. Conceptually, the storage engine is a SAN which spans multiple availability zones. Data is stored in 10-gigabyte logical blocks called protection groups, and each protection group is replicated 6 times across 6 storage nodes - 2 in the first availability zone, 2 in the second AZ, and 2 in the third AZ. For a write to be considered successful, 4 of the 6 storage nodes need to acknowledge it.

This architecture makes Aurora highly fault tolerant: it can handle the loss of up to 2 copies of the data without affecting write availability, and the loss of up to 3 copies without affecting read availability. On top of that, self-healing mechanisms have been implemented: the storage nodes and disks are continually scanned for errors and automatically replaced if needed.

So resiliency was one of the big reasons Trust Bank has chosen to migrate to Aurora. Let's talk about the cost impacts of the migration to Aurora. The migration happened during the months of May and June. As you can see, there is a cost bubble between these two months and this is expected because during these two months, the two systems were running in parallel.

What is interesting is to see that after the migration, the cost was reduced by 18% and this cost reduction was achieved while at the same time, the number of active customers grew by 30%. So how can we explain this cost reduction? First, it's linked to the migration to Graviton CPUs. As you know, Graviton CPUs are custom ARM processors built by AWS and these processors can offer up to 40% better price-performance compared to existing x86 instances.

The second reason is that with Aurora, Aurora replicas are used as standbys. So you don't have to pay for unused capacity.

And the last thing is that, for Trust Bank, the cost of Aurora I/O was 65% lower than the provisioned IOPS of RDS for PostgreSQL.

So let's talk now about the impact of the migration on the IO performance. As we have seen before, the storage system of Aurora is completely different to the more traditional storage system of RDS PostgreSQL. What is the impact on Trust Bank's performance?

This screenshot shows the end-to-end latency for payment processing before and after the migration. The red line shows the 99th percentile latency before the migration. And you can see that the SLA was breached, and this was happening even with a high number of provisioned IOPS. And it can have an impact on users because maybe the payment transaction will fail.

The green line shows the 99th percentile latency after the migration. And you can see it's always below 4 seconds. Overall, Trust Bank considers that the max number of IOPS was multiplied by 10 with Aurora. And the consequence of that is that the max p99 latency for payment processing was divided by three.

So to sum up the impact of the migration - better resiliency, better IO performance, and an 18% cost reduction.

Let's talk about two topics together because they're closely related - our real time event streaming platform and our data lake.

For Trust, since 2021 we've had a distributed microservices architecture, and facilitating this has been the Kappa architectural pattern. What this advocates is for your canonical data source to be an append-only, immutable log - in our case serviced by Confluent Kafka running on Kubernetes - and utilizing Kafka Connect for real-time event streaming into our data lake.

The other aspect is that the pattern advocates for a single technology stack to be used end to end, and we apply this both at the microservices layer, in this event-driven section, and in our data lake.

One aspect that we embraced was data schema automation and management. This helps with governance and automation to facilitate the management of data through the platform.

For Trust, the bulk of our messaging uses the Avro data serialization format and this gives us a few benefits. First of all, it's based on schemas which allows us to create a strongly typed contract between the microservices that are publishing and consuming these messages.

And second, it's compatible with the Schema Registry that runs on Confluent Kafka, so at runtime these schemas can be referenced both for serializing and deserializing. This also helps with managing schema evolution on the platform: it gives us a safe way to govern and manage changes to our data schemas over time, either preserving backwards compatibility or, when we can't, giving us a process for managing that as well.
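
A minimal sketch of what that strongly typed contract looks like from a producer's point of view, using the confluent-kafka Python client with an Avro serializer backed by Schema Registry. The schema, topic, and endpoints are illustrative assumptions, not Trust Bank's actual contracts.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# Hypothetical Avro contract for a card-transaction event.
schema_str = """
{
  "type": "record",
  "name": "CardTransaction",
  "namespace": "bank.payments",
  "fields": [
    {"name": "transaction_id", "type": "string"},
    {"name": "amount_cents", "type": "long"},
    {"name": "merchant_category", "type": ["null", "string"], "default": null}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "https://schema-registry.internal:8081"})

producer = SerializingProducer({
    "bootstrap.servers": "kafka.internal:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(schema_registry, schema_str),
})

# The value must match the registered schema, or serialization fails fast.
producer.produce(
    topic="payments.card-transactions",
    key="txn-0001",
    value={"transaction_id": "txn-0001", "amount_cents": 2590, "merchant_category": "groceries"},
)
producer.flush()
```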

All of this is driven by a code based pipeline and centralized contracts repository that all of the domain teams will use for their bounded context to define their data schemas. These get pushed through an automation pipeline to do some validation, produce some artifacts that are usable not only at runtime in the Schema Registry but also during dev/test when the developers are engineering for these new data types.

And lastly, as mentioned, all of these microservices will be publishing and consuming these messages through Confluent Kafka which will then be streaming them to the data lake via Kafka Connect.

Thank you. So Trust Bank stores hundreds of terabytes in its data lake. As you will see it's mostly built using AWS native data services.

The input to the data lake is the Kafka messages, which we stream in real time to S3 using Kafka Connect.
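
As an illustration of that streaming path, here is a sketch of an S3 sink connector definition submitted to the Kafka Connect REST API. The connector name, topic, bucket, and Connect endpoint are placeholders; the config keys are the standard ones for the Confluent S3 sink connector.

```python
import requests

connector = {
    "name": "payments-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "payments.card-transactions",
        "s3.bucket.name": "trust-datalake-raw",   # placeholder bucket
        "s3.region": "ap-southeast-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "flush.size": "1000",                      # write an object every 1000 records
        "tasks.max": "2",
    },
}

resp = requests.post("http://kafka-connect.internal:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json()["name"], "created")
```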

Once the data is there, Apache Flink processes it: for example, the metadata is sent to the AWS Glue Data Catalog, and sensitive data is tokenized, with Amazon MSK being used as a token vault.

For business intelligence, the business users are using both Tableau and QuickSight. So both these tools are leveraging Amazon Athena to query the data stored in S3.
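
For a feel of how those queries run, here is a minimal boto3 sketch that executes an Athena query over the S3 data and prints the results. The database, table, and results bucket are hypothetical names.

```python
import time
import boto3

athena = boto3.client("athena", region_name="ap-southeast-1")

query_id = athena.start_query_execution(
    QueryString=(
        "SELECT merchant_category, count(*) AS txns "
        "FROM card_transactions WHERE dt = '2023-11-01' "
        "GROUP BY merchant_category ORDER BY txns DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "payments_lake"},
    ResultConfiguration={"OutputLocation": "s3://trust-athena-results/adhoc/"},
)["QueryExecutionId"]

# Poll until the query completes, then print the result rows (skipping the header row).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:
        print([col.get("VarCharValue") for col in row["Data"]])
```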

The last important component is machine learning. So I'm presenting here the new ML platform which is going to be in production in Q1 2024.

So the data science environment is Amazon SageMaker and we are using EMR to do the training and the inference. The machine learning models are used by Trust Bank to provide a smarter experience for the customers, for example, with automatic recommendation or fraud detection.

So Scott is now going to present the data governance.

Thank you. So, as mentioned before on the microservices side, the Kappa pattern advocates a single technology stack for end-to-end data streaming and processing. We extend this into the data lake as well by investing in a strong data platform that all of our teams can make use of.

It's our belief that having this strong focus on automation and capability end to end both from the microservices side and the data lake side has allowed Trust Bank since 2021 to continue building products and services for our customers without slowing down unduly due to increasing complexity problems that you might face in a platform.

In fact, I think Trust Bank has the most extensive set of products or services available after one year of operation of any bank.

One way we achieve this is by adopting the principles and practices of a data mesh in our data platform. You can see some of those features listed on the left.

The first two are concerned with ensuring that the domain teams themselves are responsible for managing the data they use within their bounded context and for managing the life cycle of that data.

The bottom two are about what the data platform team is responsible for building, which is providing that end-to-end self-service data platform for the bank.

This starts, as I mentioned, all the way back on the left with that centralized contracts repository; messages then flow through to the data lake and are available to be consumed for analytical purposes.

And the data platform team does all the heavy lifting for our domain teams. So every time we want to onboard, say, a new domain or a new bounded context, we can take away a lot of the heavy lifting required to let them do so.

This includes things like least privilege attribute based access control for the various constructs here. It'll provision the Glue metadata tables, it'll provision the Athena infrastructure.

And another entire piece is we have an ETL framework and pipeline set up which allows the domain teams to focus on engineering their ETL jobs. And we take care of the infrastructure.

This is important to support this kind of growth that we mentioned before. Because what Trust is experiencing is not only a vertical growth in terms of the number of customers that we onboard over time, but also a sort of horizontal growth with the number of products and services that they can consume.

OK. For the next couple of sections, we'll cover some use cases around reliability, observability, and the developer experience, and how they've been important for us.

Ok. Let's talk about resiliency testing. So Trust being a bank means we're regulated. We have strict SLAs both imposed on us and by ourselves. In fact, Singapore banks are allowed by the regulator less than 4 hours of unplanned downtime over a 12 month rolling period.

So not only do we have to architect for resiliency, but we also have to test and verify it as well. And of course, like most organizations, we've got a multi-account AWS organizational structure, which means that when we want to perform some more end to end resiliency tests, that means we have to orchestrate this solution across multiple AWS accounts.

So Trust turned to AWS Fault Injection Service (FIS) to help us achieve this goal. Some of the considerations behind this decision were integration with other AWS capabilities such as Service Control Policies and IAM.

We needed it to work with EKS, which is where the bulk of our microservices are hosted. We want the actions we compose for resiliency to be executable either in isolation or combined together into a more end-to-end test scenario.

And, as mentioned, we needed multi-account capabilities, which we had to solve for ourselves since FIS is currently designed for single-account execution. This is where we had to build on top.

And also we wanted this tool to provide us a pathway for further chaos engineering type initiatives into 2024.

So let's have a look at what this looks like.

OK. You can see we've got three AWS accounts in this simplified representation, with EKS clusters in each account performing various duties that we want to coordinate in our resiliency test.

In this particular example, we're using the network disruption action of FIS to simulate the loss of an availability zone - a severe failure scenario.

Before we commence the test, we hook into our performance test automation to start firing a number of requests per second through the platform, so we can measure, using our observability platform, what steady state looks like.

Then, through some lightweight orchestration, we hook into FIS in each of the accounts to trigger the experiments we've designed there, executed simultaneously.
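
A minimal sketch of that lightweight orchestration: assume a role in each workload account and start the FIS experiment template that was pre-created there. The account IDs, role name, and template IDs are placeholders.

```python
import boto3

ACCOUNTS = {
    "111111111111": "EXTxxxxxxxxxxxx1",  # e.g. channels account
    "222222222222": "EXTxxxxxxxxxxxx2",  # e.g. payments account
    "333333333333": "EXTxxxxxxxxxxxx3",  # e.g. core banking account
}

sts = boto3.client("sts")

def start_experiment(account_id: str, template_id: str) -> str:
    """Assume the test-orchestrator role in the target account and start its FIS experiment."""
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/resiliency-test-orchestrator",
        RoleSessionName="az-loss-gameday",
    )["Credentials"]
    fis = boto3.client(
        "fis",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return fis.start_experiment(experimentTemplateId=template_id)["experiment"]["id"]

# Kick off the experiments (near-)simultaneously across the accounts.
for account_id, template_id in ACCOUNTS.items():
    print(account_id, start_experiment(account_id, template_id))
```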

And then afterwards, when the test is completed, we keep running the performance testing in order to measure a return to steady state and validate the resiliency of our platform.

During this whole time, the observability integration in the platform collects various telemetry and feeds it back to the engineers, who record their findings. They may need to adjust the tests as a result of those findings, because bear in mind that what we're doing is a simulation.

And so the simulation itself can have some quirks that you may need to work around depending upon your platform.

OK, let's move on to another reliability use case, which is integration with some of our partners. Fortunately, some of them are also on AWS, but in this particular case their primary platform is in a different region from ours.

What we wanted was a private network connection with our partner, pushing as many of the responsibilities as possible below the line to AWS as the cloud service provider, so we don't have to take care of them ourselves.

This involves working with the partner to convince them, perhaps, to change their de facto solutions and adopt some cloud-native options.

The considerations here are: as a benefit, you're leveraging the highly available managed services of AWS; you want the network traffic that travels globally to be encrypted; and you want the end-to-end connection to be private.

And all of this, of course, we want to be able to orchestrate via infrastructure as code. So in our case, that would be Terraform and we push it through a pipeline.

So let's see what that solution would look like.

So over on the left, we've got, you know, our networking account and on the right, we've got the third party. At the top, we've got our region and the bottom, we've got their region.

So the first thing we do is we create what I would refer to as a satellite VPC in our account in their region. And this is used only to accept the AWS PrivateLink that the partner offers to us to connect the two regions together.

In our account, we use VPC inter-region peering which provides encrypted traffic across the AWS backbone. So this means that for egress and end to end, it's a private encrypted network communication.

Likewise, the partner reciprocates and mirrors this solution back our way: they have a satellite VPC on their side, they accept our PrivateLink offering, and that allows incoming traffic from the partner to follow the same sort of route, where we can inspect and route the traffic upstream into our platform as we wish.

And as you can see, we've deliberately separated this part of network from the rest of our network.
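
For illustration only, here is a boto3 sketch of the two building blocks just described: an interface endpoint in the satellite VPC consuming the partner's PrivateLink service, and inter-region VPC peering back to the home networking VPC. In practice this is defined in Terraform and pushed through a pipeline; all IDs, regions, and the partner service name are placeholders.

```python
import boto3

HOME_REGION, PARTNER_REGION = "ap-southeast-1", "ap-south-1"
home_ec2 = boto3.client("ec2", region_name=HOME_REGION)
sat_ec2 = boto3.client("ec2", region_name=PARTNER_REGION)

# 1. Interface endpoint in the satellite VPC, consuming the partner's PrivateLink service.
sat_ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0satellite0000000",
    ServiceName="com.amazonaws.vpce.ap-south-1.vpce-svc-0partner00000000",
    SubnetIds=["subnet-0aaa", "subnet-0bbb"],
    SecurityGroupIds=["sg-0partnerlink0000"],
)

# 2. Inter-region peering from the satellite VPC back to the networking VPC at home;
#    traffic on the peering is encrypted on the AWS global backbone.
peering = sat_ec2.create_vpc_peering_connection(
    VpcId="vpc-0satellite0000000",
    PeerVpcId="vpc-0networking000000",
    PeerRegion=HOME_REGION,
)["VpcPeeringConnection"]

home_ec2.accept_vpc_peering_connection(
    VpcPeeringConnectionId=peering["VpcPeeringConnectionId"]
)
```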

Oh, I should mention before we move on that this solution was all done as infrastructure as code, and since we set it up in 2021 it has pretty much been set and forget. That's been wonderful for us; we haven't had to maintain it.

Let's move on to observability. When we started in 2021, we started from scratch, so we've been through a bit of a journey, and I would peg us somewhere in the "run" category of this maturity journey here at the end of 2023 and moving into 2024.

In the beginning, in the crawl phase, it's mainly about selecting your observability tools, deploying them into your platform, and hooking them into your infrastructure to collect your telemetry: your logs, your metrics, your traces. You set up your runbooks and select some sort of real-time alerting tool like PagerDuty, for example.

At that point you start to move into what I would call the walk phase of the maturity model. Here you might be integrating your runbooks into your alerts, you're starting to perform various types of testing on your platform, and you start to standardize your metrics on the four golden signals.

That's latency, error rate, traffic, and saturation. This allows you to start defining error budgets around the SLOs and SLAs for your microservices and other systems.
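
As a quick worked example of what an error budget means in practice, the 99.95% availability figure mentioned earlier translates into the following allowances (the monthly split is just arithmetic, not a regulatory number):

```python
def error_budget_minutes(slo: float, days: float) -> float:
    """Allowed downtime in minutes for a given availability SLO over `days` days."""
    return (1.0 - slo) * days * 24 * 60

for label, days in [("30-day month", 30), ("rolling year", 365)]:
    print(f"{label}: {error_budget_minutes(0.9995, days):.1f} minutes of error budget")

# Output:
#   30-day month: 21.6 minutes of error budget
#   rolling year: 262.8 minutes of error budget
```

That 262.8 minutes is roughly 4.4 hours, in the same ballpark as the regulator's limit of under 4 hours of unplanned downtime mentioned earlier.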

Finally, you move into the run phase. The interesting thing that happens here is that not only are you deepening your awareness of your system's health, which is what this diagram represents, but you're starting to gain an understanding of your business health in your real-time observability platform.

So this manifests in two ways. One is by being able to measure journey analytics, those end to end transactions that represent your customer flows through your platform. This is especially useful for understanding how latency impacts your system at the journey level.

And the second is that you start to see a left shift of the customer behaviour metrics you might previously have relied on an ETL pipeline to produce as business intelligence reports. You can start to surface some of these metrics in your observability layer as well, which is very beneficial.

So let's look at an example of leveraging journey based analytics for improving your system health.

Here we have, in our observability tool, transactional metrics: traces that represent customer transactions. In this case it's a customer onboarding flow, and each one of these bars is an individual transaction.

These bars are broken up by the amount of latency each microservice contributes to the total. What you can see is a lavender section in each of these traces representing one single microservice that dominates the majority of the latency for that particular transaction.

Through the observability tool you can click and drill down; the traces can be instrumented with information like the database query that was used. In this case, we've isolated it to a service that's making a database query.

We're able to extract that query and perform some analysis. In this case, you might identify that there's a missing index on your database schema, so you push that change through your change pipeline, leveraging your performance test automation to validate it in non-production first.

And finally, when you get it into production, you can use the same statistical analysis to validate the effect of the change after the release. In this case, we've gone from almost a second of database latency for one service to less than 10 milliseconds by adding the missing index.
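
A minimal sketch of that kind of fix with psycopg2: inspect the slow query's plan, then build the missing index without blocking writes. The table, column, and connection details are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("dbname=payments host=aurora-cluster.internal user=ops")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block
cur = conn.cursor()

# 1. Confirm the suspicion: look for a sequential scan in the plan of the slow query.
cur.execute("EXPLAIN ANALYZE SELECT * FROM transactions WHERE customer_id = %s", ("c-123",))
for (line,) in cur.fetchall():
    print(line)

# 2. Add the missing index without taking write locks, then re-check the plan and p99 latency.
cur.execute(
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_transactions_customer_id "
    "ON transactions (customer_id)"
)
```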

OK. Let's move on to the next section, which is developer experience - hugely important for a platform. One of the other north stars that Trust has is a shift-left strategy, and this applies to the way we deliver changes to production: we want teams to be able to produce change independently while still maintaining stringent control gates over the process.

Typically with software delivery you'll find there are two distinct loops when it comes to making changes. There's the inner loop, which is your software changes themselves: development, testing, packaging and so forth. And then there's an outer loop that sits on top of it, where your security, risk, and regulatory requirements are imposed - essentially a risk-mitigation exercise to validate the software change process. What we found at Trust was that this outer loop was a significant contributor to lead time for change,

a DORA metric that we track for our microservice delivery. So the considerations here are that we want to invest in more automation in our pipelines - what we refer to as our golden pipelines in this context. Today we'll talk about microservices, but there's nothing stopping you from defining golden pipelines for all of your contexts. Through this automation effort, these pipelines incorporate golden gates: some of the manual checks previously required as part of that outer loop can now be integrated and standardized as a golden gate in the pipeline. This helps us with continuous verification of our changes.

Let's have a look at what that looks like in one of our tools. Here we're using Harness for continuous delivery. The first thing you can see is the end-to-end delivery pipeline summarized at the top; that's where your gates are going to be. The second aspect is that, for microservices, we can choose to standardize the golden pipeline by implementing a canary deployment process. For a particular change to a microservice, we deploy a new pod - this is Kubernetes - into the cluster and either split or route a small percentage of traffic to that canary pod, which starts generating telemetry for us to hook into.

So we can do some automated verification here. If the change fails some of the key gates you've set around that analysis, the pipeline can automatically roll back the canary deployment. The other aspect, if you're leveraging an observability tool that provides it, is that you can use machine learning to perform anomaly detection on the logs generated by the canary release. In this example, it's flagging some erroneous log messages that are not present in the currently deployed version. This gives our engineering teams a final go/no-go before rolling the canary out fully or rolling back.

And the last point is about mitigating the effort required to maintain that outer loop: you can integrate with whatever ticketing or workflow systems you're using. In this case, for example, you could update the Jira ticket, change its status, maybe close it out. That saves the engineers from having to remember to do this and obviously increases the quality of the artifacts generated with each release.

The last aspect of developer experience I wanted to touch on is the fact that running a cloud platform and a microservices platform together obviously increases the cognitive load on each engineering team working within it. What we really want here is a single place for engineers to go to navigate and understand the entire platform. The considerations here are fairly obvious.

New engineers joining a team can now onboard much more effectively. It increases collaboration between the squads, because at the end of the day this is about providing visibility and ownership of the various systems we have within the platform. Now you know who to go to, you can consult the documentation, and you can do all of this in an essentially self-service capacity.

Lastly, we wanted a tool that would allow us to build more value-add capabilities on top: we can aggregate some of the infrastructure a service uses into the view, and we can also provide things like application scorecards. For Trust, we've used Backstage. Backstage is an open-source project started at Spotify that is now incubating in the CNCF, and we host it ourselves on EKS on our own platform. It's nice because it's very composable via its various integrations and plugins. What it gives us is a user interface on top of a graph database into which you can aggregate various sources of truth to provide a unified view, all hyperlinked and navigable, so you can easily click through and see what's going on.

Here's a simple screenshot example of a particular microservice. I'm looking at a view which shows the upstream and downstream dependencies of that service: what Kafka topics it makes use of, what data schemas it uses, and what APIs it produces and consumes, depending on your needs.

For Trust, as you can see, this aggregation shows we're dealing with four API types, 22 data pipelines are cataloged in there, and we've got over 100 microservices across 29 bounded contexts owned by 14 domains.

For the last section, let's look at a security use case that Trust has built on top of AWS native services. No matter how self-healing and automated your systems are, sooner or later an operator is going to need to go into a database. Normally you'll have some kind of DB jump host infrastructure that you maintain. This is often long-lived, which immediately puts it under your security and compliance controls, meaning you have to patch and maintain that long-lived infrastructure, and the cost runs 24/7.

For Trust, we wanted to try something a little different and see if we could achieve a more ephemeral solution to this use case. The considerations you can see here: we want non-persistent virtualization; we want to integrate both identity-based and network-based security concerns into the solution; and of course all the activity logging - you want session recording, and because it's virtualized you can put some data loss prevention controls around it. And we want to manage this via infrastructure as code and hopefully end up with a cost-optimized solution for this use case.

Let's have a look at what that looks like. Over on the right there's a workload account; in this example we have an RDS database, maybe an Aurora cluster with a reader and a writer. Over on the left we have the infrastructure for the AppStream stack. The first thing that happens is that once you raise your just-in-time request and go through the requisite approvals - four-eyes and so on - the operator is authorized to federate into the AppStream infrastructure via our identity provider and AWS IAM Identity Center for SSO. Once they do this, it spins up the on-demand AppStream infrastructure, which has events you can configure to trigger custom processing.

In this case, we've hooked a Lambda into the start-session event of AppStream. What that does is open a network path between the AppStream infrastructure and the database. It then creates and configures a temporary read-only database user whose identity is linked to the user that federated in, which is obviously great for activity logging and access reviews after the fact. It can also configure the database connection: since it knows this is a read-only request, it connects to the reader instance rather than the writer instance of the cluster, which helps further mitigate any operational issues.
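
A minimal sketch of what such a start-session Lambda might look like. The event shape, environment variables, secret name, and the pre-created "readonly" group role are all assumptions for illustration; this is not Trust Bank's actual implementation.

```python
import os
import boto3
import psycopg2

ec2 = boto3.client("ec2")
secrets = boto3.client("secretsmanager")

def handler(event, context):
    operator = event["detail"]["userName"]        # assumed field carrying the federated identity
    appstream_sg = os.environ["APPSTREAM_SG_ID"]  # security group of the AppStream fleet ENIs
    db_sg = os.environ["DB_SG_ID"]                # security group protecting the Aurora cluster

    # 1. Open a temporary network path: allow the AppStream SG into the DB SG on 5432.
    ec2.authorize_security_group_ingress(
        GroupId=db_sg,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": appstream_sg}],
        }],
    )

    # 2. Create a short-lived read-only role tied to the federated identity. The DDL runs
    #    against the writer (cluster) endpoint; the operator's session then uses the reader.
    #    Password/IAM auth for the new role and identifier sanitizing are omitted for brevity.
    admin_pwd = secrets.get_secret_value(SecretId="ops/aurora-admin")["SecretString"]
    conn = psycopg2.connect(host=os.environ["CLUSTER_ENDPOINT"], dbname="payments",
                            user="ops_admin", password=admin_pwd)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(f'CREATE ROLE "jit_{operator}" LOGIN VALID UNTIL \'tomorrow\' IN ROLE readonly')
    conn.close()
    # The end-session Lambda mirrors this: drop the role and revoke the ingress rule.
```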

Once the operator has completed and executed whatever queries they came for, we go through an end-session, clean-up operation: we delete that temporary database user and close the network path between AppStream and the database. And this whole time, logs are being streamed to the observability platform and session recordings are uploaded to S3.

Our finding is that AppStream is a very programmable solution that you can build your use cases on top of. This is just one particular use case, but we're also planning to use it as an RDP-style solution for some of the COTS systems that have web interfaces, so you can use it as a virtualized interface for those too.

I'll now hand over back to Rae.

Thanks, Scott. So in summary, we talked about how we scale our platforms, what supports our infrastructure, and its resilience. Then we talked about the real-time experiences we create using eventing. And last but not least, we talked about speed of execution, the developer experience, and continuous verification. I hope you found that interesting.

Now, before I get to a quick video that shows you the cool experiences and how we differentiate ourselves in the market, I'd just like to quickly say that because of the experiences we've created, we've been able to lower our cost of acquisition by more than 80% compared to anyone else in the industry today, and 70% of our acquisitions come from referral programs.

These are the products we've released as a new bank. You'll find we've been able to innovate and release products much faster due to our entire delivery life cycle: the time from idea to value has been compressed thanks to our infrastructure and our engineering practices.

With this, our onboarding time, compared to others in the industry, is approximately three minutes. As you can see, others also have some pretty good experiences; however, we are definitely leading the way when it comes to setting the benchmark.

Now, I do have a video to play before you go; hopefully you'll find it interesting. When we launched, a customer told us this story: he was standing in line at the grocery shop, got a referral from a friend while in the queue, downloaded the app, applied for an account, and used his card at checkout, all without leaving the queue. How cool is that? It's all about customer advocacy; like I mentioned, 70% of our customers have been great advocates for us, and all of this tech has made it possible.

So let me quickly take you through this video. Thank you very much for coming here today.
