How JPMC and LexisNexis modernize analytics with Amazon Redshift

Welcome to the session titled "How JPMorgan Chase and LexisNexis are Modernizing Analytics with Amazon Redshift." My name is Manan Goel. I'm a Principal Product Manager with the Amazon Redshift team.

I'm part of the team that's building Amazon Redshift, our cloud-native data warehousing service. Today, we have a really special treat for you. We have two of our customers here to talk about their journey in terms of how they are modernizing data warehousing and analytics with Amazon Redshift.

In terms of the agenda for today, I'm going to kick things off and provide some broad context on what we are hearing from customers around data and analytics, and how Redshift helps them manage the explosion of data they are seeing.

I'll bring in Anand Singh from JP Morgan and he's going to talk about how they modernized analytics within their environment with Redshift. Then we'll have LexisNexis come in and towards the end, we'll have some time for questions and answers as well.

To kick things off, the explosion of data is a trend that pretty much every customer across every industry is seeing in their organizations. We've been hearing about it for quite some time - the volume, variety, velocity, and veracity of data just keeps expanding.

I have a quote here that says in one hour now, more data is going to be generated than was generated in an entire year just a couple of years ago. So needless to say, there's a lot of data coming around - the volume of it, the velocity of it, the veracity of it, the forms of it just continuously keep expanding.

This represents both an opportunity as well as a challenge. In terms of the challenge, if you don't have the right infrastructure and technology to manage this data, then you can experience performance issues, security breaches, and cost overruns.

But on the flip side, imagine if you had the right infrastructure and technology to manage this data - what a phenomenal opportunity it is to learn more about your customers, understand their behavior, dramatically improve customer experiences, design new products and services, and outsmart your competition.

But the key is having the right technology and infrastructure to manage the data. That's where Amazon Redshift comes in.

Redshift gives customers the ability to bring anywhere from terabytes to petabytes of both structured and unstructured data together and analyze it with scale and performance. It gives you the ability to run complex SQL-based analytics on top of this data.

As we work on expanding Redshift, we always work backwards from customer requirements. There are three main areas we are innovating in:

  1. Giving you the ability to analyze all your data across data warehouses, databases, and data lakes.

  2. Improving price performance - we continue to see 5-7x better price performance compared to alternatives.

  3. Ease of use, security, and reliability - we want to make Redshift more mission critical.

As a result, we see tremendous adoption from customers across every industry using Redshift for analytics.

With that, I'd like to invite Anand Singh from JP Morgan Chase to take us through their journey modernizing analytics with Redshift.

They can create a materialized view, have it auto refresh, and Redshift makes sure that qualifying queries automatically pick up those materialized views. All of it happens behind the scenes, seamlessly. That is one thing we really liked, because it becomes your trusted ally in a big way. It was very helpful for letting us focus on value-added work, because coming from the traditional warehousing world, we were always talking about sort keys and distribution keys and trying to optimize them by hand. Here, Redshift uses machine learning to do that for you in an autopilot mode. So that's very important.
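For illustration, here is a minimal sketch of an auto-refreshing materialized view. The cluster endpoint, credentials, and the `mv_daily_sales`/`sales` names are hypothetical, and psycopg2 is just one common way to connect to Redshift, since it speaks the PostgreSQL wire protocol.

```python
import psycopg2  # Redshift accepts standard PostgreSQL drivers

# AUTO REFRESH YES tells Redshift to keep the view current incrementally;
# eligible queries against the base table can be rewritten to use it.
DDL = """
CREATE MATERIALIZED VIEW mv_daily_sales
AUTO REFRESH YES
AS
SELECT sale_date, region, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date, region
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="etl", password="...")
conn.autocommit = True  # keep the DDL out of an explicit transaction
conn.cursor().execute(DDL)
```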

And then obviously, in terms of performance, as Manan spoke about, there are so many things that have been done. We operate one cluster with a variety of workloads: the sales application wants to access data in two or three seconds, there are batch operations going on, ad hoc queries going on, and enterprise reporting happening on top of the same cluster. So workload isolation, prioritizing those queues, and having all of it handled by Redshift has really worked well. With that we also get short query acceleration, so short queries don't sit behind the long-running ones.

We also have query monitoring rules, which make sure that nothing hogs your system. That's very important too. And then concurrency scaling is a feature that was spoken about, and we use it quite a lot. Because our cluster runs every day, we accrue credits, roughly 30 hours a month, and Redshift uses them to spin up extra clusters for you and tear them down. So with spiky workloads, you don't have to suffer from a performance perspective. That's very important.
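As a rough sketch of what such a query monitoring rule can look like, applied through a WLM parameter group: the parameter group name, queue concurrency, and the two-hour threshold here are assumptions for illustration, not JPMC's actual settings.

```python
import json
import boto3

# One manual WLM queue with a rule that aborts any query running longer
# than two hours (7200 s), so a single runaway query can't hog the cluster.
wlm = [{
    "query_concurrency": 5,
    "rules": [{
        "rule_name": "abort_runaway_queries",
        "predicate": [{"metric_name": "query_execution_time",
                       "operator": ">", "value": 7200}],
        "action": "abort",
    }],
}]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-params",  # hypothetical parameter group
    Parameters=[{"ParameterName": "wlm_json_configuration",
                 "ParameterValue": json.dumps(wlm)}],
)
```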

The last two points are very important for a financial firm. Data governance matters a lot to us: it shouldn't slow you down, but you need the policies in place as code and a granular way to grant entitlements, because salespeople are understandably very concerned about who can see their data. And last but not least is being a good cost custodian. We want to make sure we have good ways to save some dollars for the firm.

So we have gone with three-year Reserved Instances, and that has been really helpful for amortizing the cost over the next three years. You can also make sure you separate hot and cold data: model the data you actually need in Redshift, while most of the cold data can go to S3, where you can define lifecycle rules so it transitions down to colder tiers such as Glacier. If you do this properly, you can save a lot of cost.
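A sketch of that lifecycle-rule idea under assumed names: the bucket, prefix, and the 90/365-day tiering schedule are hypothetical choices for illustration.

```python
import boto3

# Keep hot data modeled in Redshift; let S3 tier the cold archive down
# to cheaper storage classes automatically as it ages.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-cold-data",
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-down-cold-data",
        "Filter": {"Prefix": "warehouse-archive/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 90,  "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 365, "StorageClass": "GLACIER"},      # deep archive
        ],
    }]},
)
```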

Now, I just want to talk about our journey to AWS. 2015-16 is when we started a purposeful exploration at JPMorgan, as a financial firm, to start thinking about how we could go to the cloud. At that point, the regulatory landscape was still emerging and there was slightly less trust. In 2017 is when we went on the first bold pursuit of taking two applications, across the entire bank, to the cloud, and the application I'm talking about today was one of them. Because of data locality rules in so many countries, it was not optimal: we were not able to aggregate the data, because some of it had to stay on premises, and that didn't work for us.

We had to take a step back, pull out, and then finally we took 18 months of transformation to rearchitect our system. Sometimes it can be lift and shift for some use cases, rehosting for others, but for some you have to rearchitect to really exploit the full potential of the cloud, and that's what happened with us.

The only thing I would like to say is that success is a journey, not a destination. We have to constantly recalibrate, pivot, and make sure we are learning while getting to the destination. That has really been the mantra we have been operating with.

Now, if I'm talking about outcomes, none of this would have been possible without the help from AWS; it has been a really close partnership, and we have been working with them day in and day out. Apart from the technical aspects of going to the cloud and the advantages we get, we really wanted to look at it from the business lines: what are the tangible business outcomes we are talking about? I want to mention two. First, making information available to our entire sales force a few hours earlier makes a huge difference. It's a force multiplier, because that time can be spent by all those people as more air time with clients. Second, you are able to run machine learning models right on top of Redshift through its connectivity with SageMaker. You can segment clients and look at how you can do personalization, and those are some of the possibilities we are trying to unlock with Redshift.

Now, on the right-hand side, I really want to focus on one thing based on our learning from this large-scale, complex migration. Obviously, you need very solid enterprise tooling available for your bank to migrate workloads faster. Resilience is very important, and I will cover what we did on resilience in a second, and governance and security we spoke about. But what has been really important is the skills of the future: investing in human capital. That's the make or break. What we did was double down on certifications to make sure our people got the right knowledge, because they have the best context in terms of their understanding of the platform and the business. If you can take them on that journey of learning those skills, it works wonders, and it did for us.

Now, in terms of the architecture, there are a couple of things I really want to mention. We live in a hybrid model right now: a few applications are on premises, and we have gone to AWS with most of the compute and reporting, though some reporting is still on premises. So it's very important that they are well connected. Anything on premises is connected via a Direct Connect link, which is very fast and doesn't traverse the internet. We also connect all the VPCs via Transit Gateways so they can talk to each other. On the Amazon side, we access Redshift via Redshift-managed VPC endpoints, which is very important because they use AWS PrivateLink, so again you don't traverse the internet. It also matters from a resiliency perspective: if something happens, the cluster is able to relocate seamlessly behind the scenes to another Availability Zone, and that is only practical with managed VPC endpoints. We don't want to hand out the IP address of the leader node and then make every client repoint to a new one. With the endpoint, failover becomes a seamless journey.
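A sketch, with assumed identifiers, of creating a Redshift-managed VPC endpoint. Clients connect through this PrivateLink-backed endpoint instead of the leader node's IP, so a relocation to another AZ doesn't require repointing anything.

```python
import boto3

redshift = boto3.client("redshift")
redshift.create_endpoint_access(
    ClusterIdentifier="sales-analytics-cluster",  # hypothetical cluster
    EndpointName="sales-analytics-endpoint",
    SubnetGroupName="consumer-vpc-subnets",       # subnet group in the consumer VPC
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],
)
```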

The second thing I want to talk about is data gravity, which becomes very prominent as we move a lot of data: once your data is in the cloud, the applications slowly move to the cloud too, and all the reporting moves with it. So the way we interact with and share data at JPMorgan is via the data mesh, and there is a detailed blog post on it if you're interested. The whole point is that as a domain expert you own your data and your data product, and you make it available through central Lake Formation, which itself can be federated at various levels: the line-of-business level or the enterprise level. We share data with each other without any copying, which takes care of some of the anti-patterns of data gravity.

The data mesh is the way things connect; it's kind of the connective tissue. But where there are Redshift clusters, we have also been actively using data sharing, so that if sales and marketing want to do some calculation together, or profitability needs data from revenue, those clusters can access each other's data directly. That's important. And the last thing is everything as code, which we started from day one: infrastructure as code, policy as code, all connected via the CI/CD pipeline, so any change is well audited within JPMorgan's control plane and we can scale with confidence.
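A minimal sketch of that data-sharing setup between two clusters; the share, schema, cluster hosts, and namespace GUIDs are all hypothetical placeholders.

```python
import psycopg2

# Run on the producer (sales) cluster: expose a schema/table, then grant
# the share to the consumer cluster's namespace. No data is copied.
PRODUCER_STATEMENTS = [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA sales",
    "ALTER DATASHARE sales_share ADD TABLE sales.revenue",
    "GRANT USAGE ON DATASHARE sales_share "
    "TO NAMESPACE '11111111-2222-3333-4444-555555555555'",
]

# Run on the consumer (marketing) cluster: mount the share as a database.
CONSUMER_STATEMENTS = [
    "CREATE DATABASE sales_from_share FROM DATASHARE sales_share "
    "OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'",
]

def run(host, statements):
    conn = psycopg2.connect(host=host, port=5439, dbname="dev",
                            user="admin", password="...")
    conn.autocommit = True
    for stmt in statements:
        conn.cursor().execute(stmt)

run("sales-cluster.example.com", PRODUCER_STATEMENTS)
run("marketing-cluster.example.com", CONSUMER_STATEMENTS)
```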

Now, I've spoken a lot about performance, features, operational efficiency, and being a good cost custodian. What about the things we still want? In terms of programmatic integration, we have been working closely with AWS on a few items. Native sequence numbers are one; we use identity columns for now, but that's something that is coming. We are also testing global temp tables with AWS so we can parallelize, because we do a lot of ELT. Hopefully, as per yesterday's keynote, when the Aurora databases are able to connect with Redshift, we can get into the zone of zero-ETL; that's still a dream. And we'd like the COPY command to take more parameterization so we can further improve and optimize.
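For context, a sketch of the identity-column workaround mentioned above, with a hypothetical table. The point of the comment is why native sequences remain on the wish list.

```python
import psycopg2

# IDENTITY(seed, step) auto-generates surrogate keys. Note: in Redshift the
# generated values are unique but not guaranteed to be gap-free or strictly
# sequential, which is why native sequence support is still desired.
DDL = """
CREATE TABLE trade_events (
    event_id BIGINT IDENTITY(1, 1),
    trade_ts TIMESTAMP,
    amount   DECIMAL(18, 2)
)
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="etl", password="...")
conn.autocommit = True
conn.cursor().execute(DDL)
```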

But what is more important is looking towards the future. Serverless is something we definitely want to go after. Today our failover is to a different Availability Zone, and we are working with AWS to see how we can be cross-region; that is another aspect of resiliency we are working on. From an insights perspective, you can run ML models because of Redshift's connectivity with SageMaker, but what was very interesting yesterday was the Spark announcement: you can use Spark to talk to Redshift directly, and our data scientists write a lot of Spark code for complex use cases, so that is another thing we are looking forward to. At some point we want to do forecasting with QuickSight Q on top of it, given its connectivity to Redshift. Data exchange is another thing we want to look at, because we would like to commercialize some of our data products, and that is going to be very handy. And last but not least, Amazon DataZone is something we definitely want to explore, to see how it can become a way to distribute data with a layer of control plane sitting on top of it.

With that, I would really like to thank you; I'm glad we are all here. If you have any questions, we'll take them towards the end or after the session. Thank you very much.

Thanks, Anand. JPMorgan Chase has been a great partner for us. We're really excited about what they have done with Redshift, and we look forward to working with them in the future on some of the new capabilities that are coming out.

Next up, I'd like to invite Deepak Senthilkumar. Deepak is with LexisNexis, a market leader in legal analytics, and they went through their own journey of data warehouse modernization with Redshift. I'll turn it over to Deepak to talk about that journey. Thank you.

One layer is the raw space, where we collect the data in the source system's native format with absolutely no transformation: just the raw database or raw data structures as they exist in the source. Then we apply massaging to it: we handle lookup values, normalize the data, aggregate it, generate the facts we want, and pass it to the data warehouse, that being Redshift.

And at the end you can see the consumption layer. Primarily, all our analysts use DBeaver for running queries, we have started to work with Databricks for our machine learning use cases and a notebook type of environment, and our primary reporting solution is Tableau. Between layers C and D, data storage and data consumption, we are also looking to see if we can add a semantic layer.

That's what I was saying: this is not set in stone, it's evolving. The reason we are looking at a semantic layer is that when business users want to query data, they write complex queries to get the results they need, and depending on who is running the query or what filters are used, it can give different results. How many customers does LexisNexis have? That's kind of a golden question, and depending on whom you ask or which dials you adjust, the answer is going to be different. So we are looking to have a single source of truth, so you get the same answer consistently. And at the bottom, it's not clearly evident, but at the letter E is where we have our data governance, security, compliance, and privacy; that's a work in progress.

We are looking at certain open source tools, and we are also interested in the Amazon DataZone product that was announced yesterday. For security and compliance we have multiple solutions, and for GDPR and CCPA data retention requirements we have custom solutions that we have built.

Initially I was asking the question: how many of you are looking to move to Redshift, or considering moving from your current warehouse to another warehouse? That's the journey we started back in 2020. It all started with a renewal quote from our on-premises vendor; our hardware was reaching end of life, and based on that we had to either buy new hardware or look at other options.

When we were looking at it, we noticed that hardware end of life would mean buying new hardware. The current solution was not scalable because it was fixed hardware on premises. When we had things like month-end jobs, when we had to generate invoices, or when we wanted to run certain financial reports at end of year, that's when we needed scale, and we could not scale up or down elastically because the hardware was fixed.

And because of the way the architecture had evolved over the last couple of decades, onboarding new sources was taking a lot of time, and the production and operational support, database backup, database recovery, the DBA-side work needed to manage indexes and so on, was taking a heavy toll on us. That's when we started to evaluate multiple vendors, and after a very detailed POC with around 20 to 25 use cases we wanted to cover, we decided to proceed with Redshift.

So what were the benefits of this whole migration? First and foremost, as I was saying, the cost challenge is always there. We were able to operate at one fifth of the cost of our on-premises vendor, and the maintenance and production support needed was much less than what we had earlier.

From an architecture standpoint, we had previously been using an ETL pattern: extract from the source, do the transformation, then load it into the warehouse, and a lot of time was spent building those ETL jobs. We changed our approach to an ELT pattern: extract from the source, load it into the warehouse, and then do the transformation in the warehouse using SQL. That has proved tremendously successful, and we were able to onboard new sources much faster.
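A sketch of that ELT pattern; the bucket, IAM role, and table names are hypothetical. The key idea is that the load step is a dumb, fast COPY, and all the business logic lives in SQL inside the warehouse.

```python
import psycopg2

# Load: pull raw files straight from S3 into a staging schema, untransformed.
EXTRACT_LOAD = """
COPY raw.orders
FROM 's3://analytics-raw/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS PARQUET
"""

# Transform: shape the raw rows into the warehouse model using plain SQL.
TRANSFORM = """
INSERT INTO dw.fact_orders
SELECT o.order_id, o.customer_id, o.order_date,
       SUM(o.line_amount) AS order_total
FROM raw.orders o
GROUP BY o.order_id, o.customer_id, o.order_date
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="etl", password="...")
conn.autocommit = True
for stmt in (EXTRACT_LOAD, TRANSFORM):
    conn.cursor().execute(stmt)
```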

And as I was saying earlier, we had multiple business units, and each business unit had ended up creating its own flavor of a warehouse in multiple different technologies. Once we created the central infrastructure, we started to migrate the other business units, across other geographies and product lines, into it. There is synergy in not extracting from the same source multiple times, there is a bit of governance, and we are able to leverage what other teams are doing. There are a lot of benefits with that.

OK, so now we are operating at one fifth of the cost, but what about performance? Performance actually went up. We got almost a 3x gain in most of our reporting queries and ETL jobs compared to what we had on premises, and the number of production support tickets for failed ETL jobs, data issues, and data errors dropped by almost 40%. So there were significant gains with this migration. And with all the data in one place and all the business units migrating to the central infrastructure, getting to a 360-degree view of the customer, seeing the full picture of a customer, is much easier.

So how did we go about this? This is kind of my favorite quote, attributed to Abraham Lincoln: if you have six hours to chop down a tree, spend the first four hours sharpening the axe. That was our approach to the whole migration.

Here are a couple of timelines. The first timeline is the preparation work we did, and the second is the implementation. The top line runs from 2020 into 2021, and the bottom line runs from early 2021 to the end of 2021; the implementation was essentially all within 2021.

With respect to the prep work, what did we do? We decided on three vendors with whom we were going to do a POC, we listed all the use cases we wanted to cover, and we defined our approach to performance benchmarking and pricing negotiations. The first three months were spent with all three vendors, running parallel POCs and grading them on all those use cases and on how complex the migration would be.

Then from September through November, we finalized our architecture and built our own home-grown ELT tool, which made things a little easier for us. It's a very lightweight tool where we just build an adapter for any data source. If the source is JDBC-compliant, we take the JDBC-compliant driver and connect it with S3. If it was a source like Salesforce or Zendesk or the ServiceNow ticketing system, anything that requires APIs, we followed the same pattern but with lightweight adapters. The job is always the same: read from the source and load it into S3.
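This is not their actual tool, but a minimal sketch of the adapter pattern described above: read a table from a reachable source and land it in S3 as CSV, from where a Redshift COPY can pick it up. All names and the choice of psycopg2 as the stand-in source driver are hypothetical.

```python
import csv
import io
import boto3
import psycopg2  # stand-in driver; each source type gets its own adapter

def extract_to_s3(source_dsn, table, bucket, prefix):
    """Dump one source table to s3://bucket/prefix/table.csv."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    with psycopg2.connect(source_dsn) as conn:
        with conn.cursor() as cur:
            # table comes from trusted pipeline config, never user input
            cur.execute(f"SELECT * FROM {table}")
            writer.writerow(col.name for col in cur.description)  # header row
            for row in cur:
                writer.writerow(row)
    boto3.client("s3").put_object(Bucket=bucket,
                                  Key=f"{prefix}/{table}.csv",
                                  Body=buf.getvalue().encode("utf-8"))

extract_to_s3("host=src.example.com dbname=crm user=ro password=...",
              "accounts", "analytics-raw", "crm")
```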

Once the tool was built, we looked at our current warehouse and asked what was really needed and what the usage patterns were. Where there was a long tail, we tried to cut it off so we could focus on a reduced migration scope rather than migrating everything we had. And one big deal for us was that we needed case-insensitive collation, meaning our end users don't worry about case in their filters.

That was a very big deal for us, and it was not in Amazon Redshift back then. So we worked with the Redshift product team and got it added as an absolute must for us to go into this migration, and they were very supportive. As part of the migration, the product team and the account team worked with us, took our requirements, and added new features to support our migration.
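A sketch of the case-insensitive collation feature mentioned above, so that filters match regardless of case; the database and table names are hypothetical.

```python
import psycopg2

STATEMENTS = [
    # database-level default collation
    "CREATE DATABASE analytics COLLATE CASE_INSENSITIVE",
    # or per-column collation on an individual table
    """CREATE TABLE customers (
           customer_name VARCHAR(200) COLLATE CASE_INSENSITIVE
       )""",
]

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="admin", password="...")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
for stmt in STATEMENTS:
    conn.cursor().execute(stmt)
# With this, WHERE customer_name = 'acme corp' also matches 'Acme Corp'.
```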

We did all of this before we started the migration, not while we were already on the migration path. Another big item was the AWS Schema Conversion Tool, which is used to convert procedures, database objects, tables, and views from another database technology to Redshift. And there were gaps in the tool.

So we worked with that product team as well to address those gaps, and ended up with what I would say was a 50-60% workable solution with the Schema Conversion Tool. We still had to do a lot of manual cleanup, and we built some Python automation to parse the code and apply fixes so that it was compliant with Redshift.

So there was work in progress back in January 2021, but Redshift has since improved the Schema Conversion Tool, and many of the things we did manually have started to be added. One good example is merge statements: we were heavily using merge statements to do upserts, and that was not supported, so we had to convert everything into an update plus an insert. Native MERGE was announced just a couple of days ago, so in the future we will be using it.
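To make the conversion concrete, here is a sketch of both forms under hypothetical table names: the update-plus-insert workaround they migrated to, and the newly announced MERGE that expresses the same upsert in one statement.

```python
import psycopg2

# Pre-MERGE workaround: UPDATE existing rows, then INSERT the new ones,
# committed together as one transaction.
LEGACY_UPSERT = [
    """UPDATE customers
       SET    name = customers_stage.name,
              updated_at = customers_stage.updated_at
       FROM   customers_stage
       WHERE  customers.customer_id = customers_stage.customer_id""",
    """INSERT INTO customers (customer_id, name, updated_at)
       SELECT customer_id, name, updated_at
       FROM   customers_stage s
       WHERE  NOT EXISTS (SELECT 1 FROM customers c
                          WHERE c.customer_id = s.customer_id)""",
]

# The equivalent single MERGE statement (recently announced):
MERGE_UPSERT = """
MERGE INTO customers
USING customers_stage
ON customers.customer_id = customers_stage.customer_id
WHEN MATCHED THEN UPDATE
    SET name = customers_stage.name, updated_at = customers_stage.updated_at
WHEN NOT MATCHED THEN INSERT
    VALUES (customers_stage.customer_id, customers_stage.name,
            customers_stage.updated_at)
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="etl", password="...")
cur = conn.cursor()
for stmt in LEGACY_UPSERT:
    cur.execute(stmt)
conn.commit()  # both statements commit together
```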

Then around February of 2021 is when we came up with our complete inventory list of all the objects we needed to migrate. That inventory list was huge, about 85,000 objects. And that's when we started.

Once we started the migration, we began with the environment setup: we created a vanilla, empty Redshift cluster, moved the database objects over first, and then started moving the data. Then we gradually migrated all the procedural SQL, and once that was done and the data was already there, we were syncing the on-premises warehouse and Redshift on a daily basis so people could actually see the data without waiting for us to finish the ETL jobs.

As soon as we had the data, we started to migrate the users. We created all the users from our on-premises warehouse in Redshift and started to train them; I'll talk about how we addressed training in a subsequent slide. While the training was going on, the ETL migration was happening in parallel, and that wrapped up around July.

Then we focused on migrating all our reports and dashboards. Once we did that, bugs were raised, which we had to address. Eventually, we turned off the on-premises warehouse as a trial for a couple of days and watched to see who complained. We kept repeating that, and eventually we said, OK, now we are going to completely shut down the on-premises warehouse.

It says here that we finished by October, but a few business units were running late, so there were a few situations where we had to let them keep using the on-premises system. We completely shut it down around February of this year.

What was our approach? The engineering work to do the migration, I would say, was only about 30% of the project's effort. The bigger effort was actually communicating with our stakeholders, analysts, and users.

We had user-group kickoff meetings, we sent constant countdown reminders telling people "this is coming up, your IDs will be turned off, you need to move to the new system," and we set up office hours every day where people could log in and ask questions.

It was a very stressful project in general, because we had a short window: ten months is pretty small for a migration of this scale. So we wanted to gamify it a little, because that helped relieve the stress. We ran a couple of contests, a bug bounty and a "give up your ID early and you're entered into a lucky draw," things like that, which reduced the stress for people.

We used the Kanban method for the whole project. One very conscious decision was that we did not use the query editor feature that comes with Redshift; we went with DBeaver because it is open source, and we contributed back to the community. We ended up adding around 20 features to DBeaver to support Redshift, things like file import and fixes for bugs in the way it handled non-schema-binding views.

So there are multiple fixes we made. And training, that's the most critical part. The SQL that people were so used to in the old system is going to be different with Redshift; it's not exactly the same. So we created a very detailed four-hour session covering how you did things in the old system and how you are going to do them in the new system, from both a SQL perspective and a process perspective, the regular operations people are used to performing.

For sophisticated or advanced users, we had an abbreviated 90-minute version. On the engineering side, we created sub-teams with clear focus: there was a team focused on migrating reports, and a team specifically focused on getting the database objects done. We also leveraged AWS ProServe, who acted as a liaison between us and the AWS product teams, which was very helpful for this whole project, and we were constantly tracking status.

We had daily reports, monthly reports, all of that. Now, what about the learnings from this whole thing? First and foremost, we could have had a longer runway considering the scale; then again, wishing you had more time is how you always feel with these migrations. On QA and test strategy, I think we had a bit of a gap: as we were migrating the reports, people were finding issues.

What we used to do was reflect the on-premises data back into the cloud warehouse and then fix the ETL behind the scenes. Having that backup option helped us go with this approach, but it's a very risky one. I think we ended up using a lot of our business users and analysts as our QA on this project, and that's partly why we gamified it: so they were incentivized to find issues rather than complain about them.

If you are migrating from a different technology to Redshift, pay very close attention to how time zones work, how timestamps work, and how number rounding happens. One thing we noticed after the migration was that in financial reporting, our revenue for the month was off by tens of thousands of dollars. That was because the way rounding works in Redshift versus the previous on-premises system is different, especially when you do typecasting.
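As an illustration of this class of bug (not Redshift's exact semantics): two perfectly reasonable rounding modes that differ by a fraction of a cent per row can diverge by real money across millions of rows.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

cent = Decimal("0.01")
value = Decimal("0.125")  # e.g. a value produced by a typecast
rows = 1_000_000

half_up   = value.quantize(cent, rounding=ROUND_HALF_UP)   * rows  # 0.13 per row
half_even = value.quantize(cent, rounding=ROUND_HALF_EVEN) * rows  # 0.12 per row

print(half_up - half_even)  # 10000.00 -- a material gap from rounding alone
```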

We learned that. Also, the way Redshift handles isolation: a lot of you are probably familiar with maintaining a status-tracking table where multiple ETL jobs update their status. That's a place where you have to handle locks very carefully. Redshift actually launched snapshot isolation about six months ago, but when we were migrating we had to pay close attention, because serializable was the default. Having said all that, it all ended well; it was a very successful project for us.
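A sketch of that isolation choice, with hypothetical database and table names. Under the default SERIALIZABLE level, concurrent ETL jobs hammering a shared status table can hit serialization errors; the later-added SNAPSHOT level avoids many of those conflicts.

```python
import psycopg2

# Change the isolation level of a database you are NOT currently connected
# to (here we connect to "dev" and alter "etl_db").
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="admin", password="...")
conn.autocommit = True  # ALTER DATABASE runs outside a transaction
conn.cursor().execute("ALTER DATABASE etl_db ISOLATION LEVEL SNAPSHOT")

# The kind of status-table write the jobs contend on, for reference:
# UPDATE etl.job_status
# SET state = 'RUNNING', started_at = GETDATE()
# WHERE job_name = 'load_invoices';
```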

We are saving multiple millions of dollars with this migration. As part of this journey, we trained around 275 users and migrated 400-plus users, more than 1,000 reports, ETL jobs in the thousands, database objects in the thousands, and a lot of data. What's more important: when we went live last year, we had around 450 users; today we have around 650. Usage has grown tremendously in the last year because of how easy this ecosystem is to use and how quickly we can onboard new sources.

What's in the future? We are looking closely at row-level security, Serverless, and data sharing. We have started to use data sharing, and it's been a very good feature for us to adopt. Serverless is something we are looking at closely, especially for our dev and non-production workloads, and in cases where certain analysts want to do complex analysis, Serverless is an option we are looking at there as well.

We are currently using data sharing at a very minimal scale, but what we really want to do is set up separate reporting infrastructure and separate ETL infrastructure, and use data sharing as the mechanism so that ETL jobs don't disturb the reporting or analysis layer. It will be a workload-isolation setup, and we are working towards that.

With that, I'd like to say thank you, and I'll pass it back to Manan. If you have any questions, we'll be waiting outside in the lobby.
