JPMorgan Chase: One data platform for reporting, analytics, and ML

Colin Marden: "21, if you could rapidly reliably and repeatedly deploy a data platform with advanced analytics and machine learning capabilities with a click of a single button, would that offer your business benefit? The answer to that question is yes, then you're in the right room. Welcome everyone. My name's Colin Marden. I'm an AWS Solutions Architect and I'm aligned to JP Morgan today. I'm here to tell the story of FSI 3171 Data Platform for Reporting Analytics and ML.

I'm joined by Lana Kosa and Ravi Rau, who will tell their story. Before we do that, let's take a quick look at the agenda. We'll start with Lana giving us a brief description of exactly what Asset Management does, the business challenges, and the reasons for the migration to the cloud. From there, we'll transition to Ravi, who will share the details of his architecture and some of the technical challenges he faced along the way.

Finally, Ravi and Lana will come together for a quick recap and share the lessons learned from their journey. So without further ado, let's hand over to Lana.

Lana Kosa: "Thank you, Lana. Good morning. Very excited to be here. I'm Lana Koska. I've been in JP Morgan for 10 years. I run global business intelligence analytics team for asset management and private bank distribution business. I also am responsible for client data for asset management which includes all the account data, client master and all the client transactions.

I've spent over two decades of my career in data and analytics, building all kinds of data structures and solutions.

Now, before we dive into technology, a quick view of JP Morgan Asset Management, which is a division of JP Morgan Chase, the diversified global financial firm. We are one of the largest asset managers in the world, with $2.9 trillion in assets under management. We have 1,200 investment professionals who service millions of clients in all the major markets of the world. Our clients range from large institutions and asset owners to retail intermediaries and financial advisors, and we offer over 600 core and alternative investment strategies, market insights, and research to help our clients reach their long-term investment goals.

Now, data and our culture around data are critical for our business. This is George Gatch, the CEO of JP Morgan Asset Management; he said as much in a recent town hall. JP Morgan Chase invests $15 billion every year to fuel almost 60,000 engineers. Technology and data are at the core of our business and an important part of our business evolution. They power every aspect of what we do, not just our day-to-day business functions, but also the velocity, scale, and technical innovation we need to stay competitive.

Now, that was JP Morgan Asset Management. A little bit more about our area of focus: Ravi and I are part of Client Technology. We serve our distribution business. We build a data and analytics platform called AMIQ, which was originally built for sales and marketing. Since then, we have expanded it to other areas and other users across our organization. Our user group is diverse; they have various demands and they interact with our platform in very unique ways. Let me give you a few examples.

We have sales. Salespeople build relationships, cover our clients, and invest on behalf of our clients. They use our data for everything from looking at a client 360 profile and daily sales, to preparing agendas for their daily calls, to finding new opportunities or managing risks.

Then there's the marketing team. They use our data for lead nurturing and to understand marketing spend and event ROI. They also use our data to understand market trends so they can create content that's more personalized for our clients.

And we have clients, who actually use our data to look at their portfolios, find new opportunities, manage risks, and compare themselves to their peers. All these different users have one thing in common: they depend on our data.

Now, how do we enable all these different people across the organization to interact with our data through one consistent platform? Here's how we stack AMIQ. We have infrastructure and tools that allow us to run a range of different programs; this is where Ravi is going to dive in in a few minutes.

We have data. We source data from internal and external sources with different granularity, different interfaces, different structures, all of that. We cleanse all this data, normalize it, model it, bring it all together, and prepare it for analytics and business intelligence tools.

Then we have a suite of analytics and tools that allow our users to act on our data. That includes different analytical models that provide anything from business insights to time-series trend analysis to more advanced analytic solutions using machine learning.

And the last layer is the delivery mechanism. We support different distribution mechanisms: we democratize our data through our self-service business intelligence platform, we allow our users to connect to our data through SQL editors or notebooks, and we also embed our analytic solutions into our internal applications.

Now, let me share with you one of the products we've built on the AMIQ platform that would not be possible, or would be too difficult, if we were not on the cloud. Our salespeople cover a lot of clients. They used to spend several hours every morning preparing for client calls: they would go through different reports, different touch points, Excel spreadsheets; they would organize all that data, bring it together, and prioritize it, and that would basically be their morning preparation time. As you can imagine, the process was very inefficient and very manual.

Our data-driven recommendation engine does just that: it provides a daily call list that prepares sales to reach out to our clients with the best product or solution at the right time. So how did we do it? From left to right, we source all kinds of different data: transaction data, digital data, news, market data, product performance data. We organize all this data into one cohesive data set with a common denominator, and then we synthesize it into the list using signals we extract from the data.

Those signals range from the explicit, like a client's preference or behavior, to the more advanced, like collaborative filtering, pattern recognition, statistical analysis, or sentiment analysis. All this data is then synthesized into a list where every signal is weighted, and the list is prioritized based on the weight of the signal and its timeliness.
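As an illustration of that weighting idea, here is a minimal sketch of how signals could be scored and ranked. The signal names, weights, and 30-day decay are assumptions for the example, not the model described here.

```python
# Hypothetical illustration of weighting signals and ranking clients by score and timeliness.
# Signal names, weights, and the 30-day decay are assumptions, not the actual model.
from datetime import date

SIGNAL_WEIGHTS = {"explicit_preference": 1.0, "recent_redemption": 0.8,
                  "content_engagement": 0.5, "peer_pattern": 0.4}

def score(signals, today=None):
    """signals: list of dicts like {"name": ..., "observed": date, "strength": 0..1}."""
    today = today or date.today()
    total = 0.0
    for s in signals:
        age_days = (today - s["observed"]).days
        timeliness = max(0.0, 1.0 - age_days / 30.0)      # older signals count for less
        total += SIGNAL_WEIGHTS.get(s["name"], 0.1) * s["strength"] * timeliness
    return total

def daily_call_list(clients_signals):
    """clients_signals: dict of client_id -> list of signals; highest-scoring clients first."""
    return sorted(clients_signals, key=lambda cid: score(clients_signals[cid]), reverse=True)
```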

This list is delivered daily to our salespeople. They can provide feedback in real time, and we use that feedback for continuous improvement of the recommendation engine.

Now, this solution did not just make our salespeople more effective and allow them to spend more time with our clients. It also personalized the experience for our clients, helping them reach their investment goals.

Some quick metrics about our platform: we are at five petabytes and still growing. We ingest hundreds of millions of new records every day, and we transform billions of records into data sets that feed over 600 different applications, dashboards, and reports that our users rely on daily.

Our team is about 60 technologists organized into four pods. We have core engineers who look after infrastructure and tools. We have data engineers who source, organize, and optimize data sets and prepare the data for analytical use cases; they work closely with our business partners, who help them interpret the data.

We have data scientists who derive data insights and develop algorithms and models, and we have application developers who build applications to deliver our data to users and clients.

Now, what motivated us to go to the cloud? I'm sure it's very similar for those of you already on a cloud journey. For us, it was scale: we needed to scale our infrastructure fast. Velocity: we needed to support new strategies and new ideas by leveraging new services on the cloud. And we also needed to build a versatile platform that can support all these different use cases.

Think of it this way: we have some users that need access to a data lake, some users that need a more dimensional model, some users that need analytical data sets, time series, bespoke analytical data sets, all of that. We ended up building and using different data infrastructure and data services to accomplish that. It was obviously very inefficient; we were copying data everywhere.

So we needed a solution that could tackle all of them. We needed a platform that lets us ingest data into the platform super fast, transform it, and also be ultrafast at extracting data.

Before I hand it over to Ravi, one thing I want to mention is that we are in the financial industry, which is highly regulated. JP Morgan Chase serves half of US households, and we have 300,000 employees operating in our environment. Because of that, not every single cloud service is available to us, and some of them are limited in their features.

With that in mind, I want to hand it over to Ravi, who's going to talk about our journey to the cloud, some tradeoffs we had to make, and how we built our modern architecture.

Ravi Rau: "Thank you, Lana. I am Ravi Rau. I'm the lead architect and engineer in JP Morgan Asset Management. I've been with the firm for five years. My background is in building software infrastructure and data platforms in the financial industry and I'm really excited to show you what we built in the AMIQ data platform.

All right, this is the systems architecture of our platform, going left to right. We use this platform to ingest thousands of data sets daily from both internal and external data sources. Then we run computations and data transformations across billions of records daily. And then we deliver this data to our various stakeholders, who access it through reporting tools and web and mobile applications.

There were a few key decision points we made in this platform that I'd like to share. First is the choice of AWS components. The colored boxes are the AWS components that we used natively. The combination of S3, EMR, and Redshift was critical to making our platform very scalable and to supporting a variety of use cases.

We do classic data engineering across structured, unstructured, and semi-structured data, and we also use the same stack to run ML algorithms on a daily basis. Remember the recommendation engine Lana talked about earlier? That recommendation engine is a combination of all these data engineering and ML jobs.

One specific process in the recommendation engine is a matching process. Before we got onto AWS, that matching process was taking several days to run, and in some cases it never finished. After we got onto AWS and started using EMR, we are able to run that matching algorithm within a few hours, daily.

Clearly, products like the recommendation engine have brought a lot of value to our business once the data was made available.

The second component I'd like to focus on is Redshift. Redshift has enabled us to deliver the data we generate from EMR to a variety of reporting tools. The majority of the reporting tools we have need a SQL database to work effectively, and Redshift, being a database very well suited for read-heavy use cases, has helped us run over 6,000 reports every day across our business.

The second decision point was around certain capabilities that we built internally: the non-colored boxes that you see. The reason we did this was twofold. First, not all the features we needed to run our platform were available in AWS native components.

Second, as a large firm, there are already a lot of applications in the firm that we need to interoperate with. So for the cloud migration to be successful, we had to maintain continuity between the AWS components and the existing applications in the firm.

We built many of these internal capabilities to bring and merge the two together.

All right, I'll go through each of the three pillars of our platform in detail: ingestion, transformation, and reporting.

Ingestion. There are two key goals when ingesting data. First, we need to be able to support a wide variety of file formats. On our platform, we support popular formats like CSV and Excel as well as more esoteric ones like COBOL and binary. We parse all of these various file formats into a common format that we use across our data platform; we chose Parquet as that common data format.
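As a rough sketch of that normalization step, assuming PySpark and placeholder bucket paths (COBOL and binary feeds would go through custom parsers before landing in the same format):

```python
# Minimal PySpark sketch of normalizing one raw feed into Parquet; bucket paths are
# placeholders, and COBOL/binary feeds would go through custom parsers before this step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-to-parquet").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-raw-bucket/vendor_a/trades.csv"))

# land every feed in the same columnar format used across the platform
raw.write.mode("overwrite").parquet("s3://example-curated-bucket/trades/")
```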

The second goal when ingesting data is the need to connect to a variety of data sources. Because we are connecting to both internal and external data sources, some of them are accessible via APIs, while in other cases data may come to you as files. With external data sources, you also have to take into account the variety of authentication mechanisms those sources require, and in some cases you may have to open firewalls to connect to them.

All of this complexity around data formats and connectivity is a lot for a data engineer to deal with by themselves every day. So we created an ingestion framework that abstracts out all of these complexities and gives our data engineers a low-code environment to operate in.

Our ingestion framework is a configuration-driven, metadata-based framework through which we can ingest all this variety of sources and also adapt to metadata changes as they happen. Because we are ingesting data from thousands of sources, it's entirely possible that a data provider drops a column or adds a column on the fly. You don't want to break your process every time that happens. So we adapt to those changes whenever possible, and if we get a backward-incompatible change, we notify our users.
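A hypothetical sketch of how such a metadata check might distinguish absorbable changes from backward-incompatible ones; the column names and the compatibility rule are assumptions, not the actual framework.

```python
# Sketch of schema-drift handling, assuming the expected column list is stored as
# metadata per feed. Column names are illustrative.
def reconcile_schema(incoming_cols, expected_cols):
    """Return added/dropped columns and whether the change is backward compatible."""
    added = [c for c in incoming_cols if c not in expected_cols]
    dropped = [c for c in expected_cols if c not in incoming_cols]
    backward_compatible = not dropped          # new columns can be absorbed; drops cannot
    return added, dropped, backward_compatible

added, dropped, ok = reconcile_schema(
    incoming_cols=["client_id", "trade_dt", "amount", "channel"],
    expected_cols=["client_id", "trade_dt", "amount"])
if ok:
    print(f"Absorbing new columns: {added}")
else:
    print(f"Backward-incompatible change, notify the feed owners: dropped={dropped}")
```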

Another consideration when thinking about ingestion is change data capture. You want to maintain the history of your data and capture the changes that happen in it, for a variety of reasons. One could be meeting regulatory requirements for keeping data. A second could be training the ML models you're building. And a third could be operational efficiency: you want to know how records change over time.

Typically, change data capture is built using relational databases. But given that we were already using EMR heavily, we wanted a solution that works on EMR. So we built a PySpark-based pipeline which is able to capture changes and maintain history over billions of records in a very scalable fashion.
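A minimal sketch of what a snapshot-based, PySpark change-capture step could look like, with hypothetical paths and columns; it is not the actual pipeline, and it omits expiring previously-current rows for brevity.

```python
# A minimal PySpark sketch of snapshot-based change data capture (type-2 style history).
# Paths, keys, and columns are hypothetical; expiring superseded rows is omitted for brevity.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-history").getOrCreate()

history = spark.read.parquet("s3://example-curated/accounts_history/")          # has is_current flag
snapshot = spark.read.parquet("s3://example-raw/accounts_snapshot/dt=2024-01-01/")

current = history.filter("is_current = true").select("account_id", "balance")

# keep only rows that are new or whose tracked attribute changed since the last version
changed = (snapshot.alias("n")
           .join(current.alias("o"), "account_id", "left")
           .filter(F.col("o.balance").isNull() | (F.col("n.balance") != F.col("o.balance")))
           .select("n.*")
           .withColumn("effective_dt", F.current_date())
           .withColumn("is_current", F.lit(True)))

changed.write.mode("append").parquet("s3://example-curated/accounts_history/")
```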

All right, we cannot talk about data ingestion without talking about data quality. That is the thing that takes up a lot of time for all data engineers. So over time, as our data engineers used our platform, we looked for opportunities for automation to help with data quality management.

We introduced automations to do simple things like trimming white space or getting rid of special characters. We use ML techniques to do data type prediction automatically. We remove superfluous prefixes and standardize the schema of the data we're consuming. And we also use ML to detect anomalies in the data.
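A rough sketch of the simpler automations, assuming PySpark; the cleanup rules shown are illustrative, and the ML-based type prediction and anomaly detection are not shown.

```python
# Illustrative PySpark cleanup pass: standardize column names, trim whitespace, and drop
# characters outside printable ASCII. Real rules would be more nuanced than this.
from pyspark.sql import functions as F

def clean(df):
    # normalize column names (lowercase, underscores instead of spaces)
    for old in df.columns:
        df = df.withColumnRenamed(old, old.strip().lower().replace(" ", "_"))
    # trim whitespace and strip special characters from string columns
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.regexp_replace(F.trim(F.col(name)), r"[^\x20-\x7E]", ""))
    return df
```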

Overall, with all these capabilities, we built a data ingestion framework that's very low code and enables our data engineers to quickly and effortlessly ingest data.

All right, now we have the data ingested. Let's look at how we run computations and transformations on the data.

The key goals with data transformation are scalability, since we need to process billions of records using classic data pipelines as well as ML models, and support for a diversity of data structures: structured, unstructured, and semi-structured data.

The choice of EMR was very crucial for this. We chose EMR because it comes from Hadoop's pedigree, which has proven itself to work very well for data processing over the last two decades. We use EMR across the board for all of our data processing jobs.

Now, once the data is processed on EMR, we need to deliver it to a variety of consumption use cases. As I mentioned earlier, we have reporting, web and mobile applications, and of course data scientists running ML jobs.

A reporting application typically works very well when it can run SQL statements over a data store, so we needed a SQL-based data store. Similarly, the web and mobile applications, at least in our case, were mostly search-based, so ideally they would sit on a data store that is well suited for search.

So how do we get data from S3 to these two data stores? What we're doing is actually copying the data processed on EMR over to Redshift and OpenSearch. You might be thinking: wait, that is copying data. You are duplicating data, you're increasing storage costs, you are probably creating latencies. But that's not really the case.

In reality, first of all, if you were to copy data into a database over a JDBC connection, forget it: it will not be able to load billions of records, it will be too slow. However, for Redshift, AWS provides an optimized COPY command through which you can load data from S3 really, really fast. That was a game changer.
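As an illustration of that bulk-load pattern, a hedged sketch of a COPY from S3 into Redshift issued through psycopg2; the cluster endpoint, table, bucket, and IAM role are placeholders.

```python
# Hedged sketch of the S3-to-Redshift bulk load using the COPY command, issued here via
# psycopg2. The cluster endpoint, table, bucket, and IAM role are placeholders.
import psycopg2

copy_sql = """
    COPY analytics.daily_positions
    FROM 's3://example-curated-bucket/daily_positions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
    FORMAT AS PARQUET;
"""

with psycopg2.connect(host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="analytics", user="loader", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)       # Redshift pulls the Parquet files from S3 in parallel
```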

Similarly, with OpenSearch, there are optimized bulk APIs available to copy this data really fast.
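A similar sketch for the OpenSearch side, using the opensearch-py bulk helper; the endpoint, index, credentials, and record shape are placeholders.

```python
# Sketch of bulk-indexing processed records into OpenSearch with the opensearch-py
# bulk helper. Endpoint, index name, credentials, and record shape are placeholders.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "example-domain.us-east-1.es.amazonaws.com", "port": 443}],
                    http_auth=("user", "password"), use_ssl=True)

def actions(records):
    for r in records:
        yield {"_index": "client-activity", "_id": r["event_id"], "_source": r}

records = [{"event_id": "1", "client_id": "C-42", "event": "page_view"}]
helpers.bulk(client, actions(records))      # batches documents into _bulk requests
```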

Another decision we made was that any change to data would only occur in S3; S3 is our primary data store. That took care of any consistency problems you could have if the data were changing in multiple places. So all data changes happen only in S3 for us. The third point is cost: storage is not really that expensive anymore, and compute is probably the bigger expense in the grand scheme of things. So that worked out very well for us.

And the last one is ML. ML jobs run by data scientists, whether on SageMaker or TensorFlow, all work very well accessing data that sits on S3. So either people use notebooks to access the data, or they submit jobs onto an EMR cluster to run larger or more complex ML jobs; all of them work very well in this stack.

So what you can see is that we used a combination of these components to address a wide variety of use cases. That's been our theme: we wanted to build a platform that addresses a wide variety of use cases.

All right, as we built this platform, you would assume we hit a lot of edge cases, and we did. When you are running thousands of jobs on EMR, somewhere in a cloud data center there are a bunch of machines, the EC2 instances, where those jobs run.

There is a physical CPU, a memory stick, and a hard disk; those are physical devices and they will fail. There are always those 5% and 1% failure rates. You will not hit them when you have a few jobs running on EMR, or for that matter on any AWS infrastructure, but you will hit them when you scale massively.

Let me give three examples. One: sometimes EMR clusters don't spin up. Two: EC2 instances have a hardware failure and just shut down. And three: S3 slowness. If you're doing lots and lots of reads and writes to S3, you will see S3 slowness errors; I'm sure some of you have seen them.

Our solution is to implement retry mechanisms tailored to each of these three cases. With S3 slowness, we have a trailing retry mechanism which continues to retry over a certain period of time. And with the EC2 failures, we detect them and auto-resize the cluster.
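A minimal sketch of such a retry wrapper, assuming boto3; the error codes, retry budget, and backoff settings are illustrative rather than the production mechanism.

```python
# Sketch of a backoff-and-retry wrapper around an S3 write to absorb "SlowDown"-style
# throttling errors, using boto3. The retry budget and error codes are illustrative.
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_with_retry(bucket, key, body, max_attempts=8):
    for attempt in range(max_attempts):
        try:
            return s3.put_object(Bucket=bucket, Key=key, Body=body)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("SlowDown", "RequestTimeout") or attempt == max_attempts - 1:
                raise
            time.sleep(min(2 ** attempt, 60))   # exponential backoff, capped at a minute
```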

When you're building platforms for massive scale, you need to start thinking about these edge cases.

All right. Remember that for data ingestion we said we built a lot of automation to help with data quality. We did something similar in the transformation pipelines.

When we looked at transformation pipelines, we noticed that every pipeline typically has a few filters, a few mappings, and a bunch of joins. So we thought: OK, can we somehow externalize this? What we did is build a rules engine, based on an open source package, that essentially externalizes many of the logical elements that data engineers typically build.

The clear benefit is that data engineers as well as business users can now specify the logic they want through configuration and a UI, and off they go: they run their pipelines. This lets them do it without going through a release cycle, and they are able to see their results pretty quickly.
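As a hypothetical illustration of what externalized rules could look like when applied with PySpark; the rule format and column names are assumptions, not the actual open-source engine mentioned.

```python
# Hypothetical example of externalizing filters and mappings into configuration and
# applying them with PySpark. The rule format is illustrative, not the actual engine.
rules = {
    "filters": ["channel = 'digital'", "event_dt >= '2024-01-01'"],
    "mappings": {"clicks": "click_count", "region_cd": "region"},
}

def apply_rules(df, rules):
    for predicate in rules.get("filters", []):
        df = df.filter(predicate)                        # SQL-style filter expressions
    for old, new in rules.get("mappings", {}).items():
        df = df.withColumnRenamed(old, new)
    return df

# events_df = apply_rules(spark.read.parquet("s3://example-curated/web_events/"), rules)
```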

Let me give you some examples. Many of our business users are in sales and marketing, and we have a lot of digital assets across our firm where we see a lot of signals, such as clicks, scroll rates, or how far a client watched a video published on our digital assets.

Many of these events are relatively simple to identify, and using our rules engine, our business users are able to consume all of that data and identify the events relevant to them.

All right. So we looked at ingesting data, we looked at transformation of data. So now we have results that need to be delivered to our business. Let's see how we do that.

All right, data distribution. The challenges with data distribution are similar to the ingestion challenges: you have to connect to a variety of data consumers, and you have to support the variety of data formats they consume in.

Again, our choice of Redshift and OpenSearch enables us to address a variety of these consumption use cases. Our reporting tools connect directly to Redshift, and mobile and web applications connect to OpenSearch. In some cases, when we are sending data externally and the external consumer prefers to access the data via APIs, that API traffic also goes to OpenSearch.

When we are distributing data, we also have to distribute it via bulk processes. For example, if we're sending files to our data consumers, we send those files via the AWS Transfer Family product, and consumers that are already on AWS can use S3 sharing to get direct access to our data.

The challenges with distribution are more around the resiliency of your consumers. With both ingestion and distribution of data, you are dealing with external parties, and their systems may not be resilient enough.

If, let's say, you are sending data to someone and the endpoint or system receiving the data is down, you will have a failure, and again, you will start seeing these when you scale massively. To account for these, we again looked into various retry mechanisms to address this type of resiliency issue.

Another example of a resiliency issue is when you're dealing with intermediary authentication providers; they may also be down. How you do your retries depends on where the failure is. So you want to monitor these failures constantly, and in some cases adopt new retry mechanisms or escalate to your data consumers or data producers so that they can address the resiliency issues at their end.

All right. Now we've talked about all three pillars of our platform. If you noticed, we mentioned several times that we built internal capabilities across our stack to improve interoperability with our existing applications within the firm.

Let me talk about one such capability. Given this is a data platform, one common pattern is a process that ingests certain data, runs a computation, and creates new data that gets picked up by the next pipeline. So you are, in a way, chaining various pipelines.

Creating each pipeline and executing it on an EMR cluster is what we call orchestration. We built a framework that allows users to define their pipelines and specify the size of the EMR cluster on which they want to execute them. The orchestration framework takes that definition, creates an EMR cluster, submits the code that needs to run on it, runs the job, and keeps track of the jobs that are running.

Once all the jobs are done, it shuts down the EMR cluster, and throughout the whole time it publishes telemetry that people can monitor to see what's happening. What you see in this picture is just that: all the steps I described are done through the Lambda functions and SQS queues shown here. So even though this is a capability we built ourselves, we are obviously using AWS products to build it, and it still runs on AWS.
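A simplified sketch of that transient-cluster pattern using boto3; the instance sizes, roles, and script location are placeholders rather than the production framework, which adds the Lambda/SQS plumbing and telemetry described above.

```python
# Simplified sketch of the orchestration idea: spin up a transient EMR cluster, submit a
# Spark step, and let the cluster shut itself down when the steps finish. Names, sizes,
# and roles are placeholders, not the production framework.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-pipeline-run",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,            # terminate once the steps are done
    },
    Steps=[{
        "Name": "transform",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar",
                          "Args": ["spark-submit", "s3://example-code/pipeline.py"]},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```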

Another thing that's important here is the start-batch and end-batch steps. That is where the interoperability with our existing applications comes in: we need to know when certain batch pipelines have run and when certain pipelines have completed.

The orchestration framework sends that telemetry to the existing applications that handle it for us. When we built this, we had to make a choice about how to build it: do we build a monolithic application, or do we build a serverless solution?

At the time, we decided to go with serverless. It was something new for us, but we wanted to try it out, and I have to say, serverless is awesome. We have never had to worry about scaling this application since we built it. It started out running a few pipelines a day and runs thousands of pipelines today, and we never had to worry about scaling.

After our learnings, we are very keen on adopting serverless across our stack and throughout the rest of our applications.

All right. We've talked about our entire platform and all the various steps in it. Now, given that JP Morgan is a big firm, there are other teams that need something similar or are probably building something similar.

Like any software engineers, we wanted to make our platform reusable. The example I have is a sister team, Private Bank, which serves high-net-worth individuals. It has very stringent privacy and control guardrails, and it has a very similar use case.

In order to make our platform available to them, we needed to decide how to go about it. One option was to have them clone our repo and deploy the code onto their AWS accounts. But we were thinking: OK, is there a better solution than this?

We wanted it to be as simple as installing a package on your desktop. So what we did was package the infrastructure as code and our frameworks together into a bundle, and use the CI/CD stack within our firm to take that package and deploy it to target AWS accounts.

So the Private Bank team was able to get our entire platform, all its capabilities, and all our learnings over the years in making it more resilient, immediately, as soon as the CI/CD pipeline ran and deployed the platform to their account. This is definitely very efficient in terms of overall resources, because you're not redoing everything when another team needs it.
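A hedged sketch of the cross-account deployment idea, assuming a CloudFormation template and an assumable deployment role; the role ARN, template URL, and stack name are placeholders, and the firm's actual CI/CD tooling is internal.

```python
# Hedged sketch of the cross-account deployment idea: assume a role in the target account
# and create a CloudFormation stack from the packaged template. The role, template URL,
# and stack name are placeholders; the firm's actual CI/CD tooling is internal.
import boto3

creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::210987654321:role/example-platform-deployer",
    RoleSessionName="platform-deploy")["Credentials"]

cfn = boto3.client("cloudformation",
                   aws_access_key_id=creds["AccessKeyId"],
                   aws_secret_access_key=creds["SecretAccessKey"],
                   aws_session_token=creds["SessionToken"])

cfn.create_stack(StackName="amiq-data-platform",
                 TemplateURL="https://example-bucket.s3.amazonaws.com/platform.yaml",
                 Capabilities=["CAPABILITY_NAMED_IAM"])
```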

All right, a couple more things I'd like to mention. First is cost: you have to watch your costs, otherwise your bills will run very high very soon. And you should keep up with upgrades. Let me give an example.

Our EMR unit cost has come down significantly over the last few years, and that's because of our upgrade to Graviton. We also started using S3 Bucket Keys to reduce our KMS cost. So overall, over the last few years, our per-unit cost with EMR is significantly lower.
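A small sketch of the Bucket Keys change, assuming boto3; the bucket name and KMS key ARN are placeholders.

```python
# Sketch of enabling S3 Bucket Keys so SSE-KMS objects share a bucket-level data key and
# hit KMS far less often. Bucket name and KMS key ARN are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="example-curated-bucket",
    ServerSideEncryptionConfiguration={"Rules": [{
        "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms",
            "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example-key-id"},
        "BucketKeyEnabled": True}]})
```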

Another consideration is migrating to serverless. We've seen the benefits of serverless across the board: the applications are extremely scalable. And yes, it does take time to take a monolithic application and make it serverless, but we believe it's worth the effort, because you get massive scale and it's very, very cost-effective.

All right. With that, I will hand it back to Lana.

Lana Kosa: OK, so to quickly summarize what we just discussed: we talked about JP Morgan Asset Management and how complex our business is. We talked about the different types of users we have on our platform. Ravi dove into the architecture in detail and talked about tradeoffs, some challenges, and how we solved them.

We also talked about how we made our platform reusable and were able to deploy it for Private Bank in no time.

Some benefits of this platform: it has been live for three years now. We were able to cut costs in half. Number one was removing all the data duplication. Remember, we had different data infrastructures and had to copy data to solve different use cases; we removed all of that.

We also automated and streamlined different processes, reduced our maintenance costs, and optimized our resource usage. Number two, we can deliver to the business much faster now. Onboarding new data now takes a few hours, whereas before it used to take weeks, sometimes even months.

When we have a new idea or new strategy and we want to prototype something, it's much faster for us to stand up a new service or enable the capabilities we need. And more importantly, we enable innovation at scale.

We've built a whole suite of products on this platform, from optimizing our coverage model, to helping our business with new strategies, to personalizing the client experience. And with that, we can show a very measurable impact to our business: $10 billion in additional sales a year.

Now, some lessons learned. When you are either migrating your platform to the cloud or building a new platform, don't think only about what you need today; think about the future. Talk to your business users, talk through what-if scenarios. What if tomorrow you need different data to be ingested? What are your SLAs? How are you going to maintain your platform, and where are you going to spend your time maintaining it? How do you build components that are reusable or easy to replace?

Lesson two is data quality. I can't stress enough: prioritize data quality. Your users will not adopt your platform, and they're not going to use it, if they don't trust your data. So build data quality controls into your entire data pipelines.

And the last one: everything is moving so fast, everything is changing. So keep evaluating your platform, keep looking for opportunities to make things better and more efficient. I'm sure that's why you're all here: to get inspired and to learn from each other.

And with that, I want to thank AWS for the opportunity to present, and thank you for joining this session. Ravi and I will be available for any questions. Have a great day at re:Invent.

Oh, and please take the survey.
