Mobile Banking: 1,600 Microservices Behind the Scenes?


This article is based on material from overseas, shared by Matt Heath & Suhail Patel.

Partial transcript compiled by Mr. Lao G.

Editor: Qiang Ge

My name is Suhail. I am joined by Matt. We're both engineers at Monzo. We spend our time working on the underlying platform powering the bank.

We think that all of the complexity of scaling our infrastructure, making sure that servers are provisioned and databases are available, should be dealt with by a specific team, so that engineers working on the product can focus on building a great bank and not have to worry about their infrastructure.


4 million Monzo bank accounts


Heath: Five years ago, a group of people decided to build a bank from the ground up. There are a lot of reasons for that. We wanted to build something a bit different: a way for people to manage their money easily and simply.

That means you're competing in a space with, honestly, quite a large number of big, very well-established banks. When we were trying to work out how to approach that problem, we wanted to build something flexible, something that could let our organization flex and scale as we grew. The real question is, with that in mind, where do you start?

I occasionally suggest this would be the approach: you just open your text editor, start with a new bank file, and go from there.

Five years ago, I didn't know anything about banking. It's an interesting problem to start from a place where you need to work out and understand the domain, and also work out how you're going to provide the technology underneath it.


With those four main things in mind, we wanted to work out, what are the technology choices that we will make to drive those things. We made a few quite early on. 

We use Go as our primary programming language. There are lots of reasons for that. Ultimately, as languages go, it's quite simple. It's statically typed. It makes it quite easy for us to get people on board. If you're using a language that not many people know, you have to get people up to speed on it. Honestly, if you're working in a company with quite a large framework, you already have that problem: you have to get people to understand how your toolset works, how your framework works, and how they can be effective within your organization.

Go also has some interesting properties, such as a backwards compatibility guarantee. We've been using Go from the very early versions of Go 1. Every time a new version of Go comes out, that guarantee means we can recompile our code and get all of the improvements. The garbage collector, for example, has improved by several orders of magnitude over the time we've had our infrastructure running. Every time, we recompile, test that everything still works, and get those benefits for free.

The other thing we chose early on was an emphasis on distributed technologies. We didn't want a world where you have one really resilient system, a second backup system, and a big lever that you pull but don't pull very often. Because if you don't exercise those failover modes, how can you know that they work reliably?

We wanted to pick distributed technologies from very early on. We use Cassandra as our database. Back in 2015 Kubernetes wasn't really an option, so we actually used Mesos. Then a bit later in 2016, we revised that, looked around, and it was clear that Kubernetes was the emerging market leader. 

Before we expanded into our current account, we switched over to Kubernetes. What we were after was an abstraction: an abstraction over the underlying infrastructure for the engineers building banking systems on top of it.

I think the first version of Kubernetes we ran in production was version 1.2.

For anyone who has used those versions of Kubernetes, that was an interesting time. There were many benefits to moving to Kubernetes. We actually saved loads of money quite quickly. 

We had lots of machines that were running Jenkins worker pools, and loads of other things that we couldn't easily run on our Mesos cluster. By moving to Kubernetes, we could use the spare capacity on the cluster to do all of our build jobs and various other things. We could more tightly pack our applications. That saved us a load of money. We shut down loads of other infrastructure.


We started off in the early days with a really basic product. We didn't even have debit cards to start with. Then, slowly from that point, we've iterated and added more and more features.

We've added Pots so that you can organize your money: you can pay directly out of them, you can save money through them, and you can pick how to do that in the app. Or you can have your salary paid a day early; you'll get a prompt in the app if you're eligible. Then you can sort that into Pots so all of your money for bills is segregated straight away. You just never see it: your bill money goes over here, and you pay your bills straight away.

All of these features are provided by an API. This part is relatively straightforward: we have many product features, and many aspects of our API that we need to build.


Clearly, as time has gone on, our company has taken on more and more, which means the number of systems we build has grown. Today we have 1,600 microservices, all of them small. We use bounded contexts, and each service is tightly focused on what it does. That keeps us flexible, because different groups can operate a small part of this codebase. As our organization grows and we gain more teams, each can focus on a smaller and smaller area.

These systems power everything in our bank: payment networks, moving money, maintaining the ledger, fighting fraud and financial crime, and providing world-class customer support. All of these are systems we have built as services on our infrastructure.

That is how we went from a relatively simple product to a system that looks very complex at first glance. In this particular diagram, the regions are color-coded by the team that owns and maintains those systems. You can see services owned by different teams across the company.


Patel: You want to add a microservice? Where do you get started? You start with a blank canvas. This is the surface area that engineers are typically exposed to. They put their business logic in a well-defined box. The surrounding portion makes sure that it works and is production ready, and provides all the necessary interfaces and integrations with the rest of our infrastructure. One of our biggest decisions as an organization was our approach to writing microservices for all of our business functions. Each of these units, or microservices, of business logic is built on a shared core. Our goal is to reduce, as much as we can, the variance of each additional microservice we add. If a microservice gets really popular, we can scale it independently.

Engineers are not rewriting core abstractions like marshaling of data, or HTTP servers, or integration with metric systems for every new service that they add. They can rely on a well-defined and well-tested and well-supported set of libraries, and tooling, and infrastructure that we provide.

ps: Behind the Business Layer there is a large amount of RPC handling and data-service handling. How do you decouple it?

When we make an improvement or fix a bug in the shared library layer, every service can benefit, usually without needing a single line of code change in the business logic. Here's an example, where we made some improvements to reduce the CPU time spent unmarshaling data between Cassandra, our primary datastore, and Go, which we use to write all of our microservices. Some of our services saw a significant CPU and latency drop. This work has cascading, global improvements across the platform. It's a free speed improvement for anyone working on business logic, and everyone loves a free speed improvement.


How can we compose services together to form a cohesive product, or offering, or service? We take a problem and subdivide it into a set of bounded contexts. The whole premise behind this is the single responsibility principle: take one thing, do it correctly, and do it well. Each service provides a well-defined interface. Ideally, we have safe operations. Consider that if you were going to expose this interface to the public world, what tunable parameters would you want to expose? You don't want to expose every particular knob, because that means you might have lots of different permutations that you need to support.
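To make the idea of a well-defined interface with a deliberately small surface area concrete, here is a minimal sketch in Go. The `AccountService` interface and its toy in-memory implementation are invented for illustration; the point is the narrow set of safe operations, not the specific methods:

```go
package main

import (
	"errors"
	"fmt"
)

// AccountService is a hypothetical, deliberately narrow interface:
// a few safe operations rather than every tunable knob of the
// underlying storage.
type AccountService interface {
	Open(owner string) (id string, err error)
	Balance(id string) (pence int64, err error)
}

// inMemoryAccounts is a toy implementation for illustration only.
type inMemoryAccounts struct {
	balances map[string]int64
	next     int
}

func newInMemoryAccounts() *inMemoryAccounts {
	return &inMemoryAccounts{balances: map[string]int64{}}
}

func (a *inMemoryAccounts) Open(owner string) (string, error) {
	a.next++
	id := fmt.Sprintf("acc_%d", a.next)
	a.balances[id] = 0
	return id, nil
}

func (a *inMemoryAccounts) Balance(id string) (int64, error) {
	b, ok := a.balances[id]
	if !ok {
		return 0, errors.New("no such account")
	}
	return b, nil
}

func main() {
	// Callers depend only on the interface, never the implementation.
	var svc AccountService = newInMemoryAccounts()
	id, _ := svc.Open("jane")
	b, _ := svc.Balance(id)
	fmt.Println(id, b) // acc_1 0
}
```

Because consumers see only the interface, the team owning the service can change storage or internals without breaking anyone.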

As a particular example, here's a diagram of all the services that get involved when you tap your Monzo card at a payment terminal. Quite a few distinct components are involved in real time when you make a transaction, each contributing to the decision on whether a payment should be accepted, rejected, or something in between. All of this needs to work as one cohesive unit to provide that decision. Part of that is calling key services, like our account service, which deals with the abstraction of accounts all across Monzo: not just bank accounts, but accounts as a whole, as a single abstraction. There is also the ledger service, a single entity responsible for tracking all money movements across Monzo, no matter the currency or environment.

This diagram is actually the maximal set of services. In reality, not every service gets involved in every invocation of every transaction. Many of these are there to support the complexity of receiving payments, for example. There is different validation and work we need to do to support chip-and-PIN versus contactless, versus swiping your card if you're in the U.S., or, occasionally, if you're in the UK and the card terminal is broken. A service will only get called if it needs to be involved with a particular type of transaction. This is still really complex, because accepting payments is really complex. Why do we have such granularity? We want to break down the complexity and minimize the risk of change. For example, if we want to change the way contactless payments work, we're not affecting the chip-and-PIN system or the magstripe system, so we can fall back to those if we get it wrong.
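The idea that a validator is only pulled in for the payment types it concerns can be sketched as follows. This is a toy model, not Monzo's payment logic: the entry modes, limits, and checks are made up, but it shows how a contactless-specific rule can be added or changed without touching the other paths:

```go
package main

import "fmt"

// Check is one service's contribution to a payment decision.
type Check func(amountPence int64) bool

// checksFor returns only the validators relevant to this entry mode,
// mirroring how a service is only called for transaction types it
// actually handles. The limits here are invented for illustration.
func checksFor(entryMode string, balancePence int64) []Check {
	base := []Check{
		func(amt int64) bool { return amt > 0 },              // sanity check
		func(amt int64) bool { return amt <= balancePence },  // account/ledger check
	}
	if entryMode == "contactless" {
		// A hypothetical contactless-only limit; chip-and-PIN and
		// magstripe paths never see this check.
		base = append(base, func(amt int64) bool { return amt <= 10000 })
	}
	return base
}

func authorise(entryMode string, amountPence, balancePence int64) bool {
	for _, c := range checksFor(entryMode, balancePence) {
		if !c(amountPence) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(authorise("contactless", 4500, 100000))   // true
	fmt.Println(authorise("contactless", 20000, 100000))  // false: over contactless limit
	fmt.Println(authorise("chip_and_pin", 20000, 100000)) // true
}
```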

In short, it still comes down to dividing services by single responsibility.


In September 2019, a piece of EU legislation called Strong Customer Authentication came into force, aiming to improve payment security and reduce the amount of fraud. The regulation centers on adding a two-factor authentication layer at payment time: to make a successful payment, the customer needs two out of three factors. Like most other banks, we had to change our payment flows to comply.

Because of this requirement, the strong customer authentication service was reinforced: the authentication logic is not scattered around, but is provided entirely within this one service.
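The two-of-three-factors rule at the heart of Strong Customer Authentication is simple to express in code. A minimal sketch (the struct and field names are my own, not the service's API):

```go
package main

import "fmt"

// Factors a customer can present under Strong Customer Authentication:
// knowledge (something you know, e.g. a PIN), possession (something you
// have, e.g. the phone), inherence (something you are, e.g. a fingerprint).
type Factors struct {
	Knowledge, Possession, Inherence bool
}

// scaSatisfied reports whether at least two of the three factors
// were successfully presented, as the regulation requires.
func scaSatisfied(f Factors) bool {
	n := 0
	for _, ok := range []bool{f.Knowledge, f.Possession, f.Inherence} {
		if ok {
			n++
		}
	}
	return n >= 2
}

func main() {
	fmt.Println(scaSatisfied(Factors{Knowledge: true, Possession: true})) // true
	fmt.Println(scaSatisfied(Factors{Possession: true}))                  // false
}
```

Keeping this rule in one service means every payment flow gets the same compliance logic, rather than each flow re-implementing it.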


At the same time, the architecture evolves. For example, our ID-generation service has barely changed in three to four years.

Sometimes we find we split something out too early: the business never grew to need it, and we had to change two places instead of one, so we merged them back together.

We also pay attention to optimizing and decommissioning services as old business lines are retired. For example, we once had a prepaid card product that no longer exists. If that logic still intruded into some processing paths, it would be a real headache.


I think the iterative process is something we focus on at Monzo generally, both from an infrastructure perspective and from a product perspective. We're particularly fond of a cartoon by Henrik Kniberg about trying to build a better product iteratively.

Even at the very beginning, if we had tried to build a bank without enough customer feedback, we might well have ended up with something that didn't truly meet customers' needs. Instead, we keep talking to people, make small changes, and keep iterating, rather than building on wishful thinking.

ps:

Jobs said he never did market research. If Henry Ford had done market research before inventing the car, the answer would surely have been that consumers wanted a faster horse-drawn carriage. In other words, consumers don't know what they want; a company has to rely on its own intuition to create great products and lead consumer demand.

But in reality, there aren't that many great products born purely of creative vision. MVP emphasizes rapid iteration and feedback.


Patel: There's been a few instrumental components that have allowed this ecosystem to flourish at Monzo. We've talked about how we compose microservices and how we develop a set of robust libraries. 

The other key layer is our core platform. The team we work on focuses on providing components like Kubernetes, so that we can host, deploy, and develop containers; Cassandra for data storage; etcd for distributed locking; and components like Prometheus for instrumentation.

We provide these components as services so that engineers can focus on building a bank rather than having lots of different teams doing individual operational work with many different components. Even with these components that we've specified, we provide well-defined interfaces and abstractions rather than surfacing the full implementation details about each of these components.


One key superpower we've been able to leverage is reducing the barrier of deployments. Engineers can ship to production from their very first week. Just today, right about now, we would have had hundreds of deployments of various services all across Monzo. Once code goes through automatic validation and gets peer reviewed, and is approved and merged into the mainline, it's ready to be deployed to production. We've built a bespoke deployment tool called Shipper, which handles all of the complexities like rolling deployments in Kubernetes and running migrations in Cassandra. 

It deals with services that look unhappy, and with deployments going bad, so that you can roll them back. All this means we can build and roll out changes in minutes using a single command. Every engineer at Monzo is empowered to do this. Engineers shouldn't be expected to know complex things like Kubernetes and Cassandra: they don't have to hand-write YAML or CQL commands to deploy their services.


Even really simple, core things like service naming are standardized. Nobody is deploying a service with an innuendo name; each service is well described by its name. Service structure, meaning how files are laid out and where particular files go within your code, is all standardized. The vast majority of services use a standardized service generator, so all of this code and sub-structure is generated up front. No matter what team I go into, I know where I can find the database code. It will be in the dao folder.

I know where I can find the routing logic. It will be in the handler folder. Queue consumers will be in the consumer folder. This allows for much easier collaboration and onboarding for engineers onto different teams.

At Monzo, engineers move around teams really often. We are really a flexible and growing organization. Having this standardization across all the teams is really important. Once you get used to the structure in one area, you can be a power user across the entire repository, across all of our services.


If you're working in a language like Go, you can build parsers, understand your existing code, and extract information from it. Go provides this right from the standard library. Because we've standardized our service structure, we've been able to build tooling that operates across all of our services. For example, the tool on-screen, called service query, can print out all of the API endpoints for a given service, pulled straight from the code. Even if an endpoint hasn't been well defined in the Protocol Buffers, which is definitely an anti-pattern, the tool can extract that information directly from the code.
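A tool like the service query described above can be built almost entirely on Go's standard library. The sketch below parses Go source with `go/parser` and lists its top-level functions; a real tool would walk a whole service directory and match handler naming conventions, which are assumptions here:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// listFuncs parses Go source text and returns the names of its
// top-level function declarations, using only the standard library.
func listFuncs(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "service.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
	}
	return names, nil
}

func main() {
	// Hypothetical handler file from a generated service.
	src := `package handler

func HandleListPots() {}
func HandleCreatePot() {}
`
	names, _ := listFuncs(src)
	fmt.Println(names) // [HandleListPots HandleCreatePot]
}
```

The same `go/ast` traversal underpins static analysis in CI: instead of printing names, a check walks the tree and flags violations.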

We can use the same tooling to do static analysis and validation when you submit a pull request. The cognitive overhead for an engineer peer-reviewing a change, making sure it is safe and backwards- and forwards-compatible, is delegated to automated tooling. We've reduced the risk of engineers shipping breaking changes when they deploy their code. Violations are automatically detected and can be rectified during the pull request.


Every single Go service using our libraries gets a wealth of metrics built for free. Engineers can go to a common fully templated dashboard, type in their service name, and within the first minute of deploying a new service, have up to date visualizations and metrics about how many HTTP calls they're making. How many Cassandra calls they might be making. How many locks they are taking, CPU information. A wealth of information. This also feeds into automated alerting. If a team has deployed a service, and has not quite figured out the correct thresholds, they can fall back on automated alerting, which we already have, so that if a service is really degrading and causing potential impact, the automated alerting will catch that beforehand. Alerts are automatically routed to the right team which owns the service. When a service is first built, before it's even merged into the mainline, each service has to have a team owner assigned to it. 

This is categorized specifically in a code owner's file, which is monitored and automated by GitHub. 

This means that we have good visibility and ownership across our entire set of services.


Similarly, we've spent a lot of time on our backend unifying our RPC layer, so that when one service calls another, they communicate in a consistent way. This means that trace IDs and context parameters are passed across service boundaries. From there, we can use technologies like OpenTracing and OpenTelemetry, and open-source tools like Jaeger, to provide rich traces of each hop. Here, you can narrow down how long each hop took, and the dependencies on external services and systems. We've baked Cassandra and etcd integration right into the library so that we can visualize all of that in Jaeger. It's not just about RPCs: you also want to trace your queries to the database; what actual query was made, and how long did it take? Sometimes engineers want to follow a request path through service boundaries and see logs in a unified view. By having consistent trace IDs that are propagated, we can tag logs automatically on our backend, which makes querying what happened across service boundaries really easy. You can log information and see in detail what every single request went through.

There is nothing unique about our platform, which makes this exclusive to Monzo. We leverage the same open-source tools like Prometheus, Grafana, The Elastic Stack, and OpenTelemetry to collect, aggregate, and visualize this data. You can do the same on your platform.


Heath: By standardizing on that small set of technology choices, we can, as a group, collectively improve those tools. 

Engineers can focus on the business problem at hand. 

Our overall underlying systems get progressively better over time. That means we can focus on the business problem. We don't have to think about the underlying infrastructure all the time. At the same time, our platform teams can continuously work on that, and raise that bar of abstraction continuously, so that as we go, things get easier.


Patel: Breaking down the complexity into bite-sized chunks means that each service is simpler and easy to understand. The granularity and ownership of services reduces the contention between teams, while risk is reduced as we can make small, isolated changes to specific sections of our systems. All of this is in aid of reducing the barriers to make changes. It allows us to serve our customers better, which is ultimately what we want to do as an organization. We want engineers to feel empowered to work on new and innovative functionality and deliver a better product to customers.


