FINRA CAT: Overcoming challenges when big data becomes massive

Leah Crawford (Principal Customer Solutions Manager, AWS): Hello, welcome. Can you guys hear me? Great. Alright. Well, my name is Leah Crawford and I'm a Principal Customer Solutions Manager with AWS. For the past two years, I've had the pleasure of supporting the FINRA account. And during that time, I've learned about FINRA, their systems and their scale.

Today, I'm really excited because I get to share some of that information that I've learned over the past two years with you and particularly I get to share with you the story of FINRA's Consolidated Audit Trail, also known as CAT. And we're gonna be referring to it as CAT kind of throughout this presentation.

Now, some of you in the audience actually may have heard of FINRA CAT before. Can I get a show of hands? Who's heard of FINRA CAT before? Oh wow. Ok. That's a good bit of you. Great. Well, if you've heard of it before, then you likely already know that its purpose is to create a single source of truth for all exchange listed equities and options trading activity for US markets. And the single source of truth powers regulatory analytics that FINRA and other regulatory organizations use to identify fraudulent and manipulative activity which keeps our investors safe and protects our markets. So it's incredibly important, it's a very important application.

But whether or not you've heard of FINRA CAT before, today's session is gonna give everybody a deeper dive into how FINRA architected CAT, how they've solved and managed the massive and growing scale of FINRA CAT, and how they've partnered with AWS to be successful.

With me today, we have Scott Donaldson and Steven Diamond. Scott is FINRA CAT's Chief Technology Officer and Steven Diamond is a Senior Director at FINRA responsible for CAT engineering and operations. Both Scott and Steven have seen the evolution of CAT firsthand and their stories are truly unique and inspiring. So I'm really grateful that they agreed to be here with us for today's session.

I'm gonna start by telling you about CAT's core architectural pattern, big data, and this is to help you and the audience relate the stories and the specifics that you're gonna hear from Scott and Steven to the systems that you're managing and you're building. But then Scott's gonna tell us about the history of FINRA CAT, how FINRA architected CAT, some of the exogenous challenges that they've had to solve for when building FINRA CAT, and how they've solved them.

Then Steven's gonna come up and he's gonna tell us about some of the specific keys to success and implementations, the optimizations that they've implemented, that have helped them achieve almost 100% SLA attainment in 2023.

Then Scott's gonna come back and tell us about the impact that these optimizations have had on this system and also the markets.

But before we get into all of that, one of the first things you need to understand when we think of FINRA CAT is the massive scale that FINRA manages. It's been said at AWS before that FINRA doesn't do big data, FINRA does massive data. But some of you in the audience may be skeptical and you may be saying, what are we really talking about here? So let me be specific, FINRA CAT manages nearly 700 petabytes of data and will likely be at an exabyte scale within the next three years.

And while we probably can't agree on a single definition of what qualifies as big data, I think we can agree that that's not big, that's massive. And what's more this data is not only massive, it's incredibly complex and unpredictable. When we talk about financial markets in general, we use words to describe them. We use words like volatile, uncertain, turbulent and my personal favorite over the past couple of years, unprecedented. And the same is true for this activity, for the data that this activity produces.

And what FINRA CAT has built to manage the volatility and uncertainty of financial markets is truly unprecedented. To be successful, they've had to solve for the unpredictable, build for the future, and scale beyond their wildest expectations. And they've done it all with AWS.

AWS's suite of services that power big data analytics has enabled FINRA CAT's journey. First, Amazon EMR offers open-source frameworks such as Apache Spark to power FINRA CAT's big data processing pipelines. Second, Amazon EC2 and its Graviton instance types power FINRA CAT's workloads. Now, you're going to hear more about FINRA's Graviton journey later on from Steven. But what you should know is that AWS's continuous investment in in-house compute performance is leveling up FINRA's workloads.

Third, Amazon S3 offers unlimited scale and automates optimization making it the easy choice for CAT's unpredictable volumes. Additionally, with KMS backed encryption and automated replication, S3 enables security and resilience.

Finally, AWS Lambda enables set it and forget it automation and event driven processing and it's the connective tissue that's underpinning FINRA CAT, generating efficiencies and reducing waste throughout every process.
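To make that pattern concrete, here is a minimal sketch of the kind of event-driven glue Lambda provides: a handler fired by S3 object-created events that fans each newly arrived file out to a downstream work queue. The bucket, queue URL, and message shape are hypothetical, not FINRA's implementation.

```python
# Illustrative sketch only: a minimal S3-triggered Lambda handler of the kind
# described above. The queue URL and message shape are hypothetical.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events"  # hypothetical

def handler(event, context):
    """Fan each newly arrived S3 object out to a downstream work queue."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"processed": len(records)}
```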

Now, these are the core services that make up FINRA CAT, but they combine in a unique and compelling architecture that makes CAT perform. And to tell you more about the specifics, I'll hand it over to Scott.

Scott Donaldson (Chief Technology Officer, FINRA CAT): Thank you, Leah. My name is Scott Donaldson. I'm the Chief Technology Officer of FINRA CAT. FINRA CAT is a subsidiary of FINRA itself, which is the Financial Industry Regulatory Authority. And our goal in FINRA is investor protection and market integrity. Our job is to help monitor the markets and enforce the rules and make sure that they are fair and equal.

To that end, technology is absolutely a huge portion of what FINRA does. It makes up almost a third of the organization, processing data across all of the various exchanges and industry firms and managing large amounts of data.

When we look at the Consolidated Audit Trail, or the CAT, before we get into some of the technical nuts and bolts, I want to give a little bit of history. Going all the way back to 2010, there was the flash crash. Not long after that, there was an examination of the markets, and the SEC proposed a new rule, Rule 613, that instructed the exchanges to form a consortium for the Consolidated Audit Trail and work together to select a plan processor to actually build it.

A plan was built out in 2016, and ultimately FINRA CAT was selected as the plan processor in 2019. We hit the ground running in March of 2019.

Now the CAT brings together all of this information - every originating order, every quote, every route, every cancel, every execution, every partial execution across everything in the US equities and options markets. That includes dark pools and ATSs as well as the exchanges. And from every broker dealer whether it be retail, institutional or proprietary trading.

Now, when we look back at the markets, generally what we have seen is that they grow, in both dollar volume and event volume, typically about 20-25% a year. That has been the historical norm that we had observed, and as we took over and started building the CAT in early 2019, we were in that realm.

But not long after, a significant event occurred that not only impacted CAT but impacted everyone in this room and everyone around the world - the pandemic. With that, there was unprecedented trading: the US equities markets jumped almost three standard deviations above the 10-year historical average, and over two standard deviations in terms of contract volume and dollar share of what was going on within the markets during this time.

So while we had just begun building the CAT and implementing it on a multi-year sort of build out plan, all of a sudden the volumes just exponentially increased on us. And to that end, we were literally building the plane, flying the plane and upgrading it and adjusting it while we were actually flying it.

So to that end, there are numerous challenges that we have within the Consolidated Audit Trail. First and most important is performance - there are very exacting plan requirements, processing requirements, and service level agreements, where every day we have to link over a trillion events.

We piece that information together. We have to find the errors in all of this data, report them back out, and then allow firms to correct them. To do that, we need very, very fast compute and a big data platform.

One of our key principles is also being dynamic, in that we only provision enough compute to do the work that we need. We don't want to over-provision to process the amount of data we have.

We also need, because this is a regulatory system, the quality, accuracy, and efficacy of the data to be 100%. We need high-quality data. To that end, the system identifies errors, finds incongruities in the data, reports them back to the firms, and allows them to fix and correct the data - not only within the standard three-day correction window. We can get late-reported trades at any point in time over the reporting horizon, which can span multiple years. We have had in excess of 800 trade dates reported to us on a single day. In big data processing terms, 99.9% of the data is for a single trade date, but you still have a smattering of older data, so it's a highly skewed environment.

And lastly, how do we do all of this in a very cost-efficient manner? It requires a lot of space and a lot of compute.

Now, when we look at the CAT overall, we are processing billions of records every day. We bring in the information from the exchanges and from the industry, we run it through a set of validations and process flows, and we report the errors back out to the firms to allow them to correct those errors and increase the quality of the data.

With all of the accepted data, we then run it through linkage processing, which has to piece together all of the chains. The chains can range from three links to over 10 million links depending on the data. And again, within this big data system, one of the issues is that there is nothing really normal about the CAT data. It is not distributed normally.

We have issues that trade heavily on certain days. You have everything from mom-and-pop firms to large market makers in terms of their volume, and on top of that a high amount of data skew inside the data itself. That makes balancing the workloads across these clusters very, very difficult, and we've had to adjust the algorithms accordingly.
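FINRA's actual skew-handling algorithms aren't described in this talk, but a common Spark technique for this kind of hot-key skew is key salting: spread a heavily traded key across many reducers by adding a random salt. Below is a minimal PySpark sketch under a hypothetical schema (an `order_id` key, a large skewed events table, and a smaller orders table). On newer Spark versions, adaptive query execution (spark.sql.adaptive.skewJoin.enabled) can also mitigate many of these cases automatically.

```python
# A minimal key-salting sketch for skewed Spark joins -- a common technique for the
# kind of skew described above, not FINRA's actual algorithm. Paths and column
# names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()
SALT_BUCKETS = 32  # tune to the observed skew

events = spark.read.parquet("s3://example-bucket/events/")   # large, skewed side
orders = spark.read.parquet("s3://example-bucket/orders/")   # smaller side

# Give the skewed side a random salt, and replicate the smaller side across all
# salts, so a single hot key is spread over SALT_BUCKETS tasks instead of one.
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
salted_orders = orders.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = salted_events.join(salted_orders, on=["order_id", "salt"]).drop("salt")
```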

Ultimately, that data is stored and then made available to all of the regulators - the SEC, Nasdaq, Cboe, BOX, all of the various regulators and entities. We provide a data service to them that allows them to run their patterns and analytics and look for market integrity issues, as well as making sure that the rules are being followed.

When we look at the CAT overall, it is a large analytics engine, right? It provides that semantic validation, it provides the feedback. It's a process of literally over 200 jobs that take the information, process it, validate it, reconcile the corrections across this. And like with any big system, there are many tradeoffs, we look at efficiency, performance improvement and also cost.

And when we look at some of the tenets within our architectural principles: for us, data is first. It's always about the data and the integrity of the data, and that was the reason S3 is our only source of record. We may have caches of information sitting in RDS or in DynamoDB or other places, but the only source of truth for us is always S3.

Our goal is also to only bring up the amount of compute that we need - be scalable, be dynamic. When a job is done, the compute goes away and we move on to the next set of jobs in the sequence. To that end, we use things like Lambda and EMR for our validation, and we take advantage of both the Spot and On-Demand markets based on the timing of the data in the SLA.

So we've been very dynamic: we're opportunistic in using the Spot market to get cost-effective compute, and when we need to meet SLAs and timing, we'll spin up On-Demand capacity. For that On-Demand usage, we'll use reservations or compute savings plans.
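As a hedged illustration of that pattern, the sketch below launches a transient EMR cluster whose core capacity is On-Demand (for the SLA-bound work) and whose task capacity is Spot, falling back to On-Demand if Spot can't be filled in time. The names, instance types, counts, roles, and subnet are placeholders rather than FINRA's configuration.

```python
# A hedged sketch of a transient EMR cluster mixing Spot and On-Demand capacity via
# instance fleets. All names, sizes, roles, and subnets are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-validation-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "Ec2SubnetIds": ["subnet-0123456789abcdef0"],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate when the work is done
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m6g.2xlarge"}],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 10,  # guaranteed capacity for SLA-bound work
                "InstanceTypeConfigs": [{"InstanceType": "r6gd.8xlarge"}],
            },
            {
                "InstanceFleetType": "TASK",
                "TargetSpotCapacity": 40,      # opportunistic, cheaper capacity
                "InstanceTypeConfigs": [
                    {"InstanceType": "r6gd.8xlarge"},
                    {"InstanceType": "r6gd.4xlarge"},
                ],
                "LaunchSpecifications": {
                    "SpotSpecification": {
                        "TimeoutDurationMinutes": 20,
                        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                    }
                },
            },
        ],
    },
)
print(response["JobFlowId"])
```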

On the data processing side, we have a very large amount of compute. This is the core of the data integration, where we do the linkage processing and then the value-added entitlements and enrichments to the data. Within this, we are processing the data, crunching it all up, and then storing it back out, ultimately finding all of these different chains.

And on any given day, we identify 65 to 150 billion chains across the various events that we are producing. So when you look at the numbers, as Leah was saying, some of them are staggering. When we first started, we were processing 100 billion events a day; that quickly jumped to 200, 300, 400 billion in very short order. Our recent peak, just a little over two weeks ago, was over 665 billion events reported to us on a single day. And again, most of that is for a single trade date, but every day you also get a smattering of 100, 200, 500 trade dates reported to us. So it's highly, highly skewed in that regard.

We have to spin up a massive amount of compute. 99.5% of our compute is completely transient: clusters run from minutes to hours - some of them 18-20 hours - with an average lifespan of about four or five hours for the various jobs.

To that end, we spin up, on any given day, an average of 120,000 to 150,000 compute nodes, and we've had peaks of nearly 300,000. On average, we have about 20,000 to 25,000 compute nodes running at any point in time, and we've had peaks of over 50,000 compute nodes running at a single time in the us-east-1 region.

So now that you have a little bit of context about the size, the scale, and the complexity of what we have to build and run within the CAT, I'm going to turn it over to Steve Diamond, who's going to walk through some more of the technical details.

Steven Diamond (Senior Director, FINRA): Thank you, Scott. I'm Steven Diamond. I'm a Senior Director of Technology at FINRA, and my job is to help build all of this with a team, many of whom are here. I'm going to go through a scenario from one part of what Scott described: what we call the linker. It takes all of these events that have come in from the broker-dealers and the exchanges and links them together to find, essentially, the lifecycle of an order and a trade.

We get over 100,000 files every day by 8am from all of these broker-dealers and exchanges, and we validate them as they come in. Then, essentially at 8am when we're done validating, we start the process that will tell each of those exchanges and broker-dealers what errors we have found looking across their data. This is not the simple field-level check - not "did they put a number where a letter was supposed to be" - this is looking across the data: have they forgotten something? So now you're having to look for missing records, which is obviously not something you can do easily. There are a lot of patterns and a lot of processing of that data to look for that.
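The real linkage rules are far richer than this, but the flavor of "looking for missing records" can be illustrated with a left-anti join: under a hypothetical, much-simplified schema, find route events whose originating new-order event was never reported.

```python
# An illustrative sketch of "looking for missing records" via a left-anti join.
# The schema (event_type values, order_id, firm_id) and path are hypothetical and
# far simpler than the real CAT linkage rules.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing-record-sketch").getOrCreate()

events = spark.read.parquet("s3://example-bucket/accepted-events/")  # hypothetical path

new_orders = events.filter(events.event_type == "NEW_ORDER").select("order_id")
routes = events.filter(events.event_type == "ROUTE").select("order_id", "firm_id")

# Routes whose originating order was never reported: candidates for
# "you appear to have forgotten something" feedback to the submitting firm.
orphan_routes = routes.join(new_orders, on="order_id", how="left_anti")
orphan_routes.groupBy("firm_id").count().show()
```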

What makes part of this challenging for us is that we have a noon SLA to give this feedback - and by noon SLA, I mean at 12:01 we miss, at 12:12 we miss; if we miss by one second, we miss. It's kind of nasty. And it's a large volume of data: Scott said we're getting 600 billion records in a day. Not all of them fit into this process, but up to 400 billion in a day do.

In addition, we have a four-day window that we look at for corrections - looking to see if someone submitted something a day late that matches up with something that came the day before, to see if they fixed an error or if there's still an error. That can get us to over a trillion records in one process looking for those problems.

In addition, there are two basic processes within that. Part of it is within a firm or within an exchange: have they submitted what we expect from what they send? And then we start going across firms, and across is really where it gets comically more complicated.

The other challenge is that as we put all of this into a cluster - and as Scott said, EMR and Spark are our primary processing for this - there's a lot of data shuffling going on. It's not as simple as one query. It's multiple data shuffles looking for multiple patterns of where different types of data are missing in this large pile of data. So at times we have hundreds of terabytes of data shuffling between nodes. It's a lot of data moving around a very large system.
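For context, these are the kinds of shuffle-related Spark settings that start to matter once shuffles reach that scale. The values below are placeholders for illustration, not FINRA's tuning.

```python
# Hedged example of shuffle-related Spark settings relevant at large shuffle sizes.
# The specific values are placeholders, not FINRA's configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    # More shuffle partitions keeps each task's spill files and network transfers smaller.
    .config("spark.sql.shuffle.partitions", "8000")
    # Adaptive execution coalesces small partitions and splits skewed ones at runtime (Spark 3+).
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Compress shuffle blocks to trade CPU for network and disk IO.
    .config("spark.shuffle.compress", "true")
    .config("spark.io.compression.codec", "zstd")
    .getOrCreate()
)
```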

So where were we when we started doing this? It wasn't good - you can see the little sad face on the slide. We did not make the SLA a lot of the time. It was kind of brutal at times. A few things were going on. One, we were building the system - as Scott said, building the airplane while we're trying to fly it.

In addition, we had higher volumes than we expected. And as you're building, you're still learning: how does the system behave, how do the end users behave? All of that was very hard for us at the start. We even had problems where, because of errors, we had to run our jobs twice or start again from the beginning, which is very painful.

Our basic environment at that point was EMR 5 - 5.8, I think, to be honest - Spark 2, primarily c5.18xlarge instances with EBS volumes. And we were running clusters of 1,200 to 1,400 nodes. So it was a lot, and it was expensive and big.

As Scott mentioned, we had volume growing 100% one year, 40% the next. This year was calmer, but as we learned a few weeks ago, we had a big spike and that's been maintained. So volume is very unpredictable.

The other thing we had going on was that we were still building. We were building incrementally - not just for us but for our customers, the exchanges and the broker-dealers. Initially, they didn't have to submit everything, and then as we went further, they had to submit more, and then more. So some of our releases increased volume by 30% - and that's 30% in addition to what the market was doing. We just had explosive growth.

We had one release at the end of 2021 that was 40% growth overnight because of that one release. It wasn't market volume; that was just what the rules now said they had to submit. So it made things very difficult to manage. It was not a pleasant time.

So the real question is: what did we do to help address this?

Actually, before I go there, let me cover one other thing that was affecting us. One of our bigger problems was that we had a lot of "bad nodes" - I'm putting that in quotes, and I'll explain what it is. It was mostly three types of things that caused us problems: fetch failures, inconsistent task performance, and stuck tasks. When you're running a 1,400-node cluster, stuff is going to happen, and we were trying to figure out why, how to recognize it, and how to address it.

And because we were running so many nodes in such a large cluster, we were also hitting problems on the AWS side as things happened. We're in us-east-1, and this process only runs in one AZ, so we're hitting probably every server of that instance type in that region. That can be very painful at times.

So it was common to land on hardware problems or just blips. A "bad node" - sometimes it was actually bad, sometimes it was just a blip.

And for us, that was a 15-to-20-minute impact at best, sometimes worse. Other times, if it killed or disturbed the process, we'd have to restart it. We didn't do much checkpointing, because we have essentially a four-hour SLA, and at our volume, checkpointing to S3 takes too long - we wouldn't make the SLA if we did it all the time, and you can't predict when to checkpoint. So we didn't checkpoint, and of course, if we failed, we started from the beginning. We were doing that a fair bit, which meant we didn't make the SLA and it cost a lot more - run it twice, twice the cost.

So what did we do? We looked at why we were having the problems. With our engineering team and our big data research team, we started running more tests: where were we bound? We were IO bound. That was our biggest problem. As I described, we're doing Spark shuffles of hundreds of terabytes. That means I'm writing to disk on all those nodes and sending all that data over the network. That gets to be a problem - you just slow down over time.

So we looked in particular at NVMe disks - instead of network-attached disks, instances that have directly attached disks, no more EBS volumes. We looked at three different instance types, and you can see from the graph: the one on the left with the higher number is what we were running, the C5s. With the three different instance types, all with directly attached disks, we got a 30% improvement right off the top.

In addition to that, to move to those instance types we had to upgrade EMR and upgrade Spark, and we got even more performance. So suddenly we were getting much better performance - at least in testing.

So then it was, OK, which do we use? We chose Graviton. We worked with AWS to get capacity, because at that point we're now running 400 Graviton nodes instead of 1,200 - but 400 Graviton nodes are not usually just lying around. So we had to work with AWS to get capacity to run that, and I'll go into a bit of how we did that with them later.

And with that, we started getting much faster performance. I said 30% performance improvement; we probably did better with the upgrade as well - probably more like a 45-50% improvement.

The other thing we gained was a lot fewer bad nodes. You can see it from this graph - the nice little pink line: before, really bad; after, much better. And it's gotten even better since then.

That saved us a lot of time. Honestly, it also made all of us feel better - a lot less stress watching every day, wondering if we're going to make the SLA, watching a node fail and losing 15-20 minutes.

It doesn't take much to impact our SLA. By that time we'd also implemented automation to better identify bad nodes and pull them, so we recover faster. But once we had the Gravitons, it was less of an issue.
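FINRA's bad-node automation isn't shown in the talk, but one way to approximate the idea is to poll the YARN ResourceManager REST API on the cluster master and flag nodes whose state or health report looks off, then alert on them or pull them. Everything below - the host name, the heuristics, and the follow-up action - is an assumption for illustration only.

```python
# A hedged sketch of one way to spot suspect nodes: poll the YARN ResourceManager
# REST API and flag nodes that are not RUNNING or that report a health issue.
# The host name and heuristics are hypothetical; this is not FINRA's automation.
import requests

RM_URL = "http://emr-master.example.internal:8088/ws/v1/cluster/nodes"  # hypothetical host

def find_suspect_nodes(url=RM_URL):
    nodes = requests.get(url, timeout=10).json()["nodes"]["node"]
    suspects = []
    for node in nodes:
        if node["state"] != "RUNNING" or node.get("healthReport"):
            suspects.append({
                "host": node["nodeHostName"],
                "state": node["state"],
                "healthReport": node.get("healthReport", ""),
            })
    return suspects

if __name__ == "__main__":
    for suspect in find_suspect_nodes():
        print(suspect)  # in practice: alert, then decommission/replace the instance
```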

But one lesson from this: this graph is actually 2021 to 2022. Just about six to eight weeks ago, we had a relatively significant change in behavior - we started having bad nodes again. We hadn't changed any code; we were still on the same Graviton machines.

It just started happening. I think over a three-to-four-week period we had 10 bad nodes, and we hadn't had any for months before that. We opened tickets, opened more tickets, kept pushing, and eventually it came out that it was actually one bad server with a network device problem. But AWS didn't find it until they correlated all those tickets and found the bad node - they had to add monitoring they didn't have in order to find that one case.

So the lesson from that recent experience is: open tickets. Seriously, they want you to open tickets. Yesterday I happened to meet some Graviton developers and talked about this - they need you to open tickets. If you have problems, open tickets; they can't fix things they don't know about.

For us, we have enough scale that we may see the pattern ourselves. But if you don't open your tickets, they won't see the pattern across the two or three of you who hit it. So open tickets - and keep pushing on your TAMs and everyone else: what's going on, what's the answer? Because the initial answer we got was "not a hardware problem." Well, it was - but it was very, very localized. We just had enough scale to hit it.

All right. So I'm going to talk a little about how we scaled and then after that, I'll talk about cost.

We've gone through most of this already: we have unpredictable volumes, and with big data it is hard to scale linearly. It's a distributed system, so you have lots of different pieces involved - network, disk, compute - and we're shuffling data everywhere. Adding more capacity adds more complexity; that means more data going across the network, and you're limited there, or you're limited in IO. So it's hard to achieve linear scalability. The work we did made it better, but we're not there.

The other scaling challenge is compute availability for an SLA. Whether Spot or On-Demand, you can't just go to the market and get 1,600 nodes all the time, or 400 nodes whenever you want them. Often you can, but you can't count on it; there will be days when it won't happen. We had that happen a few weeks ago, when one of our other jobs couldn't find the nodes - it wasn't an SLA-driven one, so we were OK. But it will happen that you won't get what you need, so you have to have plans to manage that.

OK, so what are some solutions to these? Some of the basics: always look to rework your code. There are always performance improvements you can make, whether that's improving scalability through parallel processing or just looking at where you have challenges in your code.

Scott mentioned that we had an SLA issue a few weeks - or months - ago now, when somebody submitted lots of data with lots of errors. We actually made the SLA that day, but we found a latent performance defect where we were using an ArrayList instead of a Set, which made everything slower. It almost cost us the SLA that day. A simple code change - it had probably been there for three years. We'd never hit that case; it was an edge case we'd never performance-tested. So there's always code you can change.
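The defect described was in Java (an ArrayList where a Set belonged), but the principle is language-agnostic: membership checks against a list scan every element, while a hash-based set is effectively constant time. A tiny Python analogue with made-up data shows how large the gap gets on a hot path.

```python
# Illustrative only: the list-vs-set membership cost behind the defect described
# above, using made-up identifiers. A list scans elements (O(n) per lookup); a
# hash-based set is ~O(1) per lookup. At scale, that difference decides an SLA.
import time

error_ids_list = [f"evt-{i}" for i in range(100_000)]
error_ids_set = set(error_ids_list)
lookups = [f"evt-{i}" for i in range(0, 100_000, 50)]  # 2,000 lookups

start = time.perf_counter()
hits_list = sum(1 for e in lookups if e in error_ids_list)   # linear scan per lookup
list_secs = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(1 for e in lookups if e in error_ids_set)     # hash lookup per lookup
set_secs = time.perf_counter() - start

print(f"list: {list_secs:.3f}s  set: {set_secs:.5f}s  same result: {hits_list == hits_set}")
```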

Instance fleets: at least for us, with EMR, instance fleets are our option for defining what instances we use. Don't depend on one instance type - if you do, you're going to be in trouble at some point, especially if you're only using On-Demand or Spot. Have multiple options; you want to stay a generation ahead, you want different choices, just for more flexibility. What we do in particular to get those 400 nodes is an ODCR, an On-Demand Capacity Reservation. You can create one at any time, assuming you can find the capacity. Once you create it, it's generally your capacity until you free it up - and we actually keep ours forever.

We grab those 400 nodes and we don't give them back - that's our way of keeping control over our capacity. Now, of course, there are costs associated with that. If I'm paying On-Demand 24/7, 365 days a year, that's kind of expensive. So how do I maximize my usage and lower my cost? I'll talk about that too, because that's an obvious question you should have.
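Creating such a reservation is a single EC2 API call. The sketch below uses a hypothetical instance type, AZ, and count, and keeps the reservation open-ended, as described.

```python
# A hedged sketch of creating an On-Demand Capacity Reservation (ODCR) via the EC2
# API. Instance type, AZ, and count are placeholders, not FINRA's reservation.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

reservation = ec2.create_capacity_reservation(
    InstanceType="r6gd.8xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=400,
    InstanceMatchCriteria="open",   # matching instances launched in this AZ run against it automatically
    EndDateType="unlimited",        # keep it until explicitly released, as described above
)
print(reservation["CapacityReservation"]["CapacityReservationId"])
```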

All right, the other thing we do is continuous optimization, and we'll get into cost as part of this. It's always looking at performance, cost, and back to stability: where can you make improvements, or has the external environment changed - and then circling back again. For us, it's how do we manage that? We can always spend money to make things faster. The question is: will the performance gain drive the cost down enough to pay for it?

So, performance optimizations - I won't say simple anymore. As we said, optimize the code; you always want to keep doing that. Keep upgrading; you have to stay current. That's partly just the model of AWS: if you're on EMR 5, go to EMR 6. You'll gain as you upgrade, and you'll have more instance types to choose from over time - more choices. The underlying open-source software - Spark, the OS, the Java version, whatever you're using - is probably going to have, in addition to performance improvements, bug fixes that improve stability. So you keep gaining. Every time we upgrade we usually gain; you might not always gain, but at least you get the bug fixes and you're staying current. And for the security people, you get more security fixes too, so that makes them happy.

Also look outside compute. Sometimes it's not just your compute or your software - it's your storage, your file types, your compression. We primarily used ORC for our storage; we're slowly moving to Parquet for better performance. That's going to take a long time, but we're doing it strategically where we can. File formats, compaction - all things you can do to make things better.
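A format-conversion and compaction pass of the kind mentioned can be as simple as the sketch below: read a partition of many small ORC files and rewrite it as fewer, larger Parquet files. The paths, partition, and target file count are hypothetical.

```python
# A hedged sketch of a format-conversion and compaction pass: read small ORC files,
# rewrite them as fewer, larger Parquet files. Paths, partition column, and file
# counts are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-to-parquet-compaction").getOrCreate()

df = spark.read.orc("s3://example-bucket/events-orc/trade_date=2023-11-15/")

(
    df.repartition(64)        # collapse many small files into ~64 larger ones
      .write
      .mode("overwrite")
      .parquet("s3://example-bucket/events-parquet/trade_date=2023-11-15/")
)
```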

And I said cost optimization, so let me spend some time on this one. One thing we talked about is the ODCR. With the ODCR, we combine it with what's called a Compute Savings Plan. A Compute Savings Plan is essentially a commitment with AWS where you agree to buy a certain amount of compute - it doesn't matter what kind, what server type, it's just compute. You commit to a certain spend per hour for the life of the contract, which is one or three years, and in return you get not quite Spot, but pretty close to Spot pricing. So you can save a lot of money, and you have total flexibility in what you use. It's not tied to the ODCR, but because I have an ODCR, I know I'm going to use that compute - I've already committed to it. So the Savings Plan clearly covers it, and that just dropped my ODCR costs significantly.

And we look at all our compute - where we have peaks and valleys through the week, weekend processing and everything else - and we look for the line: if we commit to the Savings Plan, will we use it enough that it's worth the savings? Because obviously if you only use it 50% of the time, maybe it doesn't work. But if you use it 75-80% of the time, the numbers can work.
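That break-even reasoning can be made concrete with a back-of-envelope calculation. The 30% discount below is a made-up figure purely to illustrate the logic; actual savings plan rates depend on term, payment option, and instance family.

```python
# Back-of-envelope break-even check for a compute savings plan commitment.
# The discount and utilization figures are hypothetical, only to illustrate the reasoning.
on_demand_rate = 1.00               # normalize: $1 per hour of compute bought On-Demand
sp_discount = 0.30                  # assumed savings plan discount vs. On-Demand
sp_rate = on_demand_rate * (1 - sp_discount)

# If you only use the committed hour a fraction `u` of the time, each useful hour
# costs sp_rate / u; the commitment beats On-Demand whenever u > (1 - discount).
break_even = 1 - sp_discount
print(f"break-even utilization: {break_even:.0%}")

for utilization in (0.50, 0.75, 0.80, 1.00):
    cost_per_useful_hour = sp_rate / utilization
    verdict = "worth it" if cost_per_useful_hour < on_demand_rate else "not worth it"
    print(f"utilization {utilization:.0%}: ${cost_per_useful_hour:.2f}/useful hour -> {verdict}")
```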

Some other things we did, all around the ODCR. ODCRs have a feature which I don't think is really well known: you can suspend an ODCR. We can call an API to suspend the ODCR, saying we're not using it. Up to an hour later, Amazon takes it and puts it into the Spot pool, and we get Spot credits - not equal to the cost of the ODCR, but some money back in Spot credits. Then, whenever we want, with an hour's warning, we call an API and say give it back, and within an hour - it's actually less than an hour, but an hour is what they commit to - that ODCR is fully ours again. For whatever time it spent suspended, we get Spot credits, and we have a fair amount of Spot usage, so it pays for a nice amount.

I'll give an example: Thanksgiving weekend. All our processing is based on market days, so around Thanksgiving there's a lot less processing - Thanksgiving Day and the following weekend, not a whole lot. So we can suspend it for a day, a day and a half, and get Spot credits. It's a nice little feature - just a small one that we worked with AWS to implement, among other things.

Amazon Athena: for other things - this is more on the query side - we have huge savings moving services from EMR to Amazon Athena. I mean huge. We have one service that over the years cost a few hundred thousand dollars, and it now costs $100 a month on Athena, where it formerly ran on EMR. So just huge savings.
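For reference, running a query through Athena instead of standing up a cluster looks roughly like the sketch below; the database, table, and results location are hypothetical.

```python
# A hedged sketch of running a query through Amazon Athena instead of spinning up
# an EMR cluster. Database, table, and output location are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT trade_date, count(*) AS events
FROM cat_reporting.accepted_events          -- hypothetical table
WHERE trade_date >= date '2023-11-01'
GROUP BY trade_date
ORDER BY trade_date
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cat_reporting"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```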

And the last one is S3 Intelligent-Tiering. If you haven't used it, use it. It basically automatically shifts your data from near-immediate access to lower-cost access tiers, but from your point of view you have access to all of it the same way - there's no performance hit, it just costs less. And of course, if you access data that has moved down to the archive instant access tier, which is the lowest automatic tier, it moves back up and you start paying more for it. But if you don't, it's still there and available - and we have requirements that mean we have to keep data available. So this lets us hold data at a cheaper cost even though no one is touching it and it's four or five years old in our case, just sitting there quietly. We save a huge amount of money. If you're not using it, that should be the first thing you take away from this session - go save yourself a lot of money. There's no reason not to use it that I can see.
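One hedged way to opt existing data into Intelligent-Tiering is a lifecycle rule that transitions objects to the INTELLIGENT_TIERING storage class (new objects can simply be written there directly); the bucket and prefix below are placeholders.

```python
# A hedged sketch of opting existing objects into S3 Intelligent-Tiering via a
# lifecycle rule. Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-cat-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "accepted-events/"},
                # Transition on day 0 so tiering decisions are handed to S3 immediately;
                # new objects can also be written with StorageClass="INTELLIGENT_TIERING".
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```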

All right, now Scott gets to come back and finish this off.

Scott Donaldson (Chief Technology Officer, FINRA CAT): Thank you, Steven. So the impact of the improvements we made, both in compute and storage, has been pretty significant. On the compute side, looking at our overall processing volume: as Steven noted, when we used to run on C5s, we were spinning up clusters of 1,200 to 1,600 compute nodes. After we moved to the Graviton2-based r6gd instances with the NVMe disks, we were able to drop our cluster sizes down to 400 to 500 compute nodes to process the same amount of data in the same amount of time. That reduced our total compute time by over 70%, and it has allowed us to avoid literally tens of millions of dollars - roughly $10 million in annual savings - of compute cost through that efficiency gain.

Along with that, when we plot compute time versus data volume, you can see a few things we've done as we've improved over time. Over on the left of the scatter chart, you see a steeper curve: as the volume went up, the incremental or unit cost to move to a higher data volume was steeper - it required more compute to process the same amount. And we had higher degrees of variability inside of that, which generally was due to data skew conditions, sometimes very unique ones.

So as we continued to optimize the code, identify those scenarios, and optimize the process, we gradually made more and more changes. Two things you notice as we move through time: one, the slope of the curve greatly improved, so the unit cost to process more data got lower at higher data volumes; and two, the predictability - the clustering of the points - became much tighter. So now we are much more predictable, we're much more confident in making our various SLAs and plan requirements, and we can still scale up to larger and larger volumes at a much more predictable price point.

So what does this mean from an overall dollar perspective? If we look back at where we were in January of 2021 versus where we are now: one of the KPIs we track on an ongoing basis is the compute cost to process a billion events end to end - from the original data collection and validation all the way through final linkage and producing the final data sets across all of our different jobs. We've reduced that compute, in its entirety, by over 50% from literally just over two years ago.

We have a very similar story on the storage side. By taking advantage of S3 Intelligent-Tiering and moving down to archive instant access, we've reduced our unit cost - what it costs to store a billion events, per petabyte - by over 65%. That alone has saved us literally millions of dollars every year.

When we look at our current footprint, we're at about 680 petabytes. Of that, over 45% sits in archive instant access and another 40% sits in deep archive, so only about 15-20% sits in either frequent access or infrequent access. And the only thing we generally keep in S3 Standard is what I would consider workspace - as the jobs are running, we use it for scratch space. That's the only portion of our data that we don't put into Intelligent-Tiering, because it's basically just scratch space.

And what does that mean for regulation? One, we now have a very rich, robust, and high-quality data set for all of the regulators - whether that's the SEC, FINRA, or any of the exchanges - that allows them to conduct more thorough examinations and improve their market regulation. Best execution and front running are examples of the types of patterns they use this data for. They also leverage it for market manipulation, insider trading, and fraud, trying to identify patterns of behavior. And ultimately, this all becomes feedback: now that we have years' worth of data, they can use it to inform both economic and regulatory policy, to see what changes they may want to propose to make the markets fairer or more transparent.

A couple of examples of recent criminal cases that the SEC brought: there was a $47 million front-running scheme by one broker-dealer, and they leveraged the CAT to identify the trading behavior. Front running is when a firm basically jumps ahead of a customer order to buy or sell in front of that customer. It was an ongoing scheme, and they recognized that pattern.

There was also a market manipulation scheme where eight individuals were using social media to pump various stocks, resulting in over $100 million in illicit gains. And again, they used the CAT data to help build the case that was then brought against them.

Jumping back to the tech: we're not done yet. As I said, we're planning for what's coming next. We just hit 665 billion events, and we've had 16 of our top 20 days in the last three weeks because of market activity. We're quickly planning for a trillion-event day, knowing that we'd need to be able to process that or even beyond.

So we've been taking a very hard look at some of the other options - namely EKS, in terms of Kubernetes, or EMR on EKS, to allow us to more easily and linearly scale the cluster. With the EMR clusters right now, you do it in chunks: you add nodes of a certain size, and then you add more. The idea with EKS and, I believe it's Karpenter, is that we can lay in some more rules and make that a little more predictive. We're also looking at Graviton3, and now, with the new announcements around Graviton4, we're going to begin taking a look at those and see what they do to our overall price performance.
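For a sense of what that direction looks like, submitting a Spark job through EMR on EKS is a single API call against a virtual cluster; the IDs, role, release label, and entry point below are placeholders.

```python
# A hedged sketch of submitting a Spark job via EMR on EKS. The virtual cluster ID,
# execution role, release label, and entry point are placeholders.
import boto3

emr_containers = boto3.client("emr-containers", region_name="us-east-1")

response = emr_containers.start_job_run(
    virtualClusterId="abc123examplevirtualcluster",
    name="linker-sketch",
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/jobs/linker.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=50 "
                                     "--conf spark.executor.memory=32G",
        }
    },
)
print(response["id"])
```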

As Steven mentioned, we're constantly looking at our data processing platforms. We've predominantly been a Spark data platform - keeping up to date and taking advantage of new capabilities. There was a big improvement made, I think last year, dealing with IO and data shuffle operations. So we're always taking advantage of that and working through the open-source community, but also keeping our eyes open for new data processing platforms that we might be able to take advantage of.

And then lastly, there's the huge catalog: we literally have tens of millions of files laid out in tables and partitions across this data. Managing that, and the entropy across it, while all of this data sits in S3 is a massive undertaking - it's a massive data footprint to keep track of, and there are multiple copies of it. When we look at the optimizations Steven was talking about, making an optimization that's faster for compute may actually have a detrimental effect on the query side: it may be more efficient to process, but then trying to query that data doesn't fit the necessary query patterns. So we're constantly looking at that - at how we store and manage the data across the various S3 storage tiers - and S3 Intelligent-Tiering is something we continue to take a hard look at. There are a few links here to some additional resources that you might find useful.

And with that, on behalf of Leah and Steven, I'd like to say thank you.
