Fidelity Investments: Building a scalable security monitoring tool

李白的朋友王维

Article tags: aws, Amazon Web Services, technology, artificial intelligence, re:Invent 2023, generative AI, cloud services

Article link: https://blog.csdn.net/just2gooo/article/details/135120160

Peter Ciaccio: Good afternoon, everyone and welcome to our lightning talk today. My name is Peter Ciaccio and I am a Customer Solutions Manager here at AWS and I am thrilled to introduce my partner Eric Moore, a Cloud Specialist who's gonna run you through our scalable solution. Eric over to you.

Eric Moore: Thank you, Peter. Hope everybody can hear me. OK. So as Peter said, my name is Eric Moore. I'm here representing Fidelity Investments. I've been with Fidelity now for a little over six years, helping out with their cloud security.

So as the title says, we're talking about building a scalable cloud security monitoring tool, but the title is a little bit deceptive, because as you'll come to find, it's not just a monitoring tool. We do a little bit more than just monitoring: we're detecting, we're responding, we're even doing a little bit of preventing. So we've got a jam-packed agenda here.

We'll go on with the next slide. So we're gonna try to cram through all of this stuff here. Hopefully, we're not going to run out of time. So let's go ahead and jump right in.

So what led us here? We've been heading towards the cloud; we've been in the cloud now for five-plus years. If we're just talking about AWS resources, we've got 1,500-plus AWS accounts. If we were going to talk about all the clouds that we have, we're looking at around 3,000. Now, just in our 1,500-plus AWS accounts, we have 8 million-plus resources spanning every service you can imagine.

So how do you go about actually securing all that? That's a lot of CloudTrail. That's a lot of events constantly coming and going. And as you can see here on the next bullet point, we have the SDLC, the software development life cycle, separated out into different environments. We have our sandbox, development, QA, and production. So as you can imagine, we have different rules and regulations in each environment, and different criteria that we're going to be holding people accountable to as they move from environment to environment.

So we started out small. We started out using our Service Control Policies, which do a lot of preventing, right? We used those up pretty quickly, and at the scale that we're at now, we're still using them, but we can't quite use them for all the scenarios that we have. Now, in the tool that we're going to be talking about today, we've got probably around 2,000 or so different security scenarios that we look at across all of our different environments, detecting, responding to, and even preventing some of those. So we're going to go over some of those today.

So this is a term I'm sure you guys have all had crammed down your throats for the past couple of years, on every sales call you've ever been on: a single pane of glass. There's a lot behind that phrase. What does it actually mean? When we're talking about a single pane of glass here at Fidelity, it's really a one-stop-shop tool for all of your security needs. I can see all of my resources within the tool. I can see them in real time as they're coming and going. I can see a lot of metadata about them. I can see the compliance status. I can see if we've taken any actions on them in the past, that kind of thing. So that's really the context when we're talking about a single pane of glass here.

So right here is the UI from the tool that we're talking about today. As you can see, we have a lot of different categories for resource types up at the top, and each one of those has columns broken down by the different resource types. Now, here at AWS, you know, you have your EC2s, your S3, but whenever you see a board like this, it's cloud agnostic. We have all of our CSPs, all of our cloud service providers, funneling into this tool. So every name that we have for a resource has to kind of make sense for all of them.

So here we're on a tab called Compute. Up at the top, you can see the total values that we have. This is all scoped to just AWS resources again. And we're focused on our instance resource here; that would be our EC2s. We were close to 30,000 when this screenshot was taken, so it's quite a few resources out there. When we hover over that, you can see down here in the bottom picture the metadata that we are collecting about our specific resources.

So if we were to click on any one of these resource types under any one of these categories, we're going to see all of these columns down here with all of the information that we have harvested about those particular resource types. Now, the information we're harvesting is going to be anything that comes from a list call, a describe call, a get call, anything you can pull through the API with something like boto3.
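As a rough illustration of that harvesting step, here is a minimal sketch that flattens one EC2 `DescribeInstances` response page into metadata rows. The row layout (`resource_id`, `resource_type`, and so on) and both function names are assumptions made for this example, not the schema or code of the tool described in the talk.

```python
from typing import Any, Dict, Iterator, List


def flatten_instances(page: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one DescribeInstances response page into flat metadata rows."""
    rows: List[Dict[str, Any]] = []
    for reservation in page.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            rows.append(
                {
                    "resource_id": inst["InstanceId"],
                    "resource_type": "instance",  # cloud-agnostic name (assumed)
                    "state": inst.get("State", {}).get("Name"),
                    "public_ip": inst.get("PublicIpAddress"),  # None if not public
                    "tags": {t["Key"]: t["Value"] for t in inst.get("Tags", [])},
                }
            )
    return rows


def harvest_instances(region: str) -> Iterator[Dict[str, Any]]:
    """Page through EC2 in one region. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so flatten_instances runs without it

    ec2 = boto3.client("ec2", region_name=region)
    for page in ec2.get_paginator("describe_instances").paginate():
        yield from flatten_instances(page)


# Hand-written sample in the shape of a DescribeInstances response.
sample_page = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "State": {"Name": "running"},
                    "PublicIpAddress": "203.0.113.10",
                    "Tags": [{"Key": "env", "Value": "sandbox"}],
                }
            ]
        }
    ]
}
rows = flatten_instances(sample_page)
print(rows[0]["resource_id"], rows[0]["public_ip"])
```

Keeping the flattening separate from the API call means the same row shape could be fed by harvesters for other providers.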

So we have all of these different resource types. It's all funneled into this single pane of glass. Now, what are the underlying services that actually got us to a view like this? There's quite a few.

So starting from, I think the most important one that we're all familiar with is IAM - Identity and Access Management, right? That is how you pretty much do anything in the cloud, you take your roles, you apply your policies to them, that lets you decide what resources you're going to act on and what you're going to be able to do to those resources. So that is the critical first step in having a scalable cloud security tool.

So we have IAM. Next up, we're specifically talking about Kubernetes here, but really the fundamentals behind Kubernetes are what we want to get to. We're talking about containers and clusters and the whole orchestration behind that. So if you're going to have a scalable security tool that is able to handle 8 million-plus resources across just this one CSP that we have, you have to have something that you can lift and shift and move all around in containers, and the clusters that EKS and services like ECS provide are really the orchestration that you need.

So you can kind of think of EKS as a big technological octopus. It's got all these different tentacles out there, and it's got a big old brain that's telling it what to do. All of these tentacles are reaching out into your clouds, pulling back information, and responding to events. You can equate those tentacles to the tasks that are going out there and doing their thing. And of course, all of that is running on auto scalers behind the scenes, a very robust orchestration system with a service like Kubernetes.

So you can rest assured that you're always going to have that scalability and that reliability for your most critical security application. Now, aside from that, you have to have a source of truth for all of the information, and what is the source of truth in AWS? Well, that's CloudTrail.

So if you're familiar with that, that's where all of your API calls are being logged; anything and everything that happens in your cloud is going into CloudTrail. So what do you do with all that? You can imagine, at a scale like what we're working with at Fidelity, we have quite a few CloudTrails that are producing events for us.

So we have a global CloudTrail system and what we've ended up doing is we're able to funnel all of these CloudTrail events in real time into our tool and we're able to detect and respond to everything in real time in the cloud.

Now, where do we put all that CloudTrail information before it actually lands in our tool, before our EKS tentacles are pulling it in? Well, that's where things like CloudWatch, SQS down at the bottom, and S3 come in. You have to have some type of storage system out there to hold all of your information and have it be readily accessible.

Now, all of this information, at the end of the day, is going to a queue, and you have it in S3, but that's not the end-all-be-all spot for your solution. So what we have is an RDS database here, running on MySQL. All of the CloudTrail events, all of the information that's being harvested from your particular clouds, is being put into this database, which allows us to constantly query it.
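The queue-to-database hop can be sketched as follows. This is only an illustration under stated assumptions: the message body is assumed to carry CloudTrail records under a `Records` key (the layout of CloudTrail log files), the `cloudtrail_events` table is invented for the example, and the standard library's `sqlite3` stands in for the MySQL-on-RDS database mentioned in the talk so the snippet runs anywhere.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cloudtrail_events (
    event_id   TEXT PRIMARY KEY,
    event_time TEXT,
    account_id TEXT,
    source     TEXT,
    name       TEXT,
    actor_arn  TEXT,
    raw        TEXT
)
"""


def to_row(record: dict) -> tuple:
    """Pick the queryable fields out of one CloudTrail record."""
    return (
        record["eventID"],
        record["eventTime"],
        record.get("recipientAccountId"),
        record["eventSource"],
        record["eventName"],
        record.get("userIdentity", {}).get("arn"),
        json.dumps(record),  # keep the full event for later inspection
    )


def store_events(conn: sqlite3.Connection, message_body: str) -> int:
    """Store every record in one queue message; returns records seen."""
    records = json.loads(message_body).get("Records", [])
    conn.executemany(
        "INSERT OR IGNORE INTO cloudtrail_events VALUES (?, ?, ?, ?, ?, ?, ?)",
        [to_row(r) for r in records],
    )
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

body = json.dumps(
    {
        "Records": [
            {
                "eventID": "11111111-2222-3333-4444-555555555555",
                "eventTime": "2023-10-01T12:00:00Z",
                "recipientAccountId": "111111111111",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "RunInstances",
                "userIdentity": {"arn": "arn:aws:iam::111111111111:user/example"},
            }
        ]
    }
)
store_events(conn, body)
store_events(conn, body)  # a redelivered message must not create a duplicate

count = conn.execute("SELECT COUNT(*) FROM cloudtrail_events").fetchone()[0]
name = conn.execute("SELECT name FROM cloudtrail_events").fetchone()[0]
print(count, name)
```

Keying on the CloudTrail `eventID` makes the insert idempotent, which matters because standard SQS queues deliver at least once.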

So we're able to see all the information going in. We can build query filters on top of it, which we'll talk about a little bit later. And that is really the foundation of the tool, allowing you to detect and respond to these events.

Now, of course, this is a security tool. So everything needs to be encrypted and that's where KMS comes in, that's going to encrypt everything at rest for you. And that's going to be sitting on your RDS. It's going to be sitting on your CloudTrails, your CloudWatch logs, your buckets, anything that's going to be storing your data at rest. And you also want to make sure that everything in transit is also being encrypted. And when we're using services like S3, that is inherently TLS by default. So you don't have to worry about enforcing transit there.

So, moving on to the next slide, let's talk a little bit more about IAM and the centralized permission management that we have for this particular tool. We have a concept called GRAP, which stands for Global Roles and Policies, and this is a repository-based role management system.

Now, there are a lot of different repositories out there that you can use. I'm sure you're all familiar with GitHub, Bitbucket, that kind of thing. So that's really the concept that we're trying to push here. It needs to be repository based for your roles and policies. It needs to be templatized. You need to have it in a repository so that you have access control around everybody who is looking at it, and all of the source control around that information.

And so once you have it in these particular repositories, you can use deployment pipelines. In our particular case, we're using CloudFormation template stacks in our pipeline to deploy the roles out there. So any time we have a new account that gets created within our ecosystem, it goes through the standard creation pipeline, which involves our GRAP to pull out the roles, goes through the CloudFormation templates, and actually deploys those roles into the accounts.
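To make the idea of a templatized role concrete, here is a minimal sketch that builds a CloudFormation template for a cross-account role as a Python dictionary. Everything specific in it is an assumption for illustration: the function name, the role name, the central tool account ID, and the choice of the AWS-managed `SecurityAudit` policy. The talk does not say what permissions the real role carries.

```python
import json


def security_tool_role_template(
    tool_account_id: str, role_name: str = "security-tool-role"
) -> dict:
    """Build a CloudFormation template with one cross-account IAM role."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Baseline role assumed by the central security tool",
        "Resources": {
            "SecurityToolRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "RoleName": role_name,
                    # Trust policy: only the (hypothetical) tool account
                    # may assume this role.
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Principal": {
                                    "AWS": f"arn:aws:iam::{tool_account_id}:root"
                                },
                                "Action": "sts:AssumeRole",
                            }
                        ],
                    },
                    # Read-only audit permissions, chosen for the example.
                    "ManagedPolicyArns": [
                        "arn:aws:iam::aws:policy/SecurityAudit"
                    ],
                },
            }
        },
    }


template = security_tool_role_template("999999999999")
print(json.dumps(template, indent=2)[:80])
```

A pipeline would commit the rendered JSON to the repository and deploy it as a stack into each new account, so no engineer has to create the role by hand.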

So any new account that we create in our environment is going to get the standard set of roles, with our security tool role being one of those. There are a lot of different types of pipelines that you can use. There's Concourse, there's Jenkins, but the core of the matter is you should have something that is standardized through a pipeline. You shouldn't have your engineers out there having to click through and do everything by hand; that just introduces more potential for human error.

So again, you have to have a source of truth for everything. So we talked about, you know, our sandbox environment dev QA production, all that good stuff. So we're using AWS Organizations as the source of truth. So as all of these new accounts are getting created and brought into our security tool, we're able to label them appropriately, right? Because we want to act on sandbox resources a little bit different, maybe a little harsher than how we're acting on production resources.

So we have to make sure that all of the accounts as they get inducted into our tool are labeled properly. The last thing you would want is a production account being labeled a sandbox in your tool when you might have something, you know that's going out there and deleting resources.
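One way to guard against that mislabeling is to fail closed: refuse to onboard an account whose environment cannot be determined, rather than falling back to a default. The sketch below assumes the environment is recorded as an `environment` tag on the account in AWS Organizations; the tag key, the allowed values, and the function names are hypothetical.

```python
from typing import Dict

VALID_ENVIRONMENTS = {"sandbox", "dev", "qa", "prod"}


def label_account(account_id: str, tags: Dict[str, str]) -> str:
    """Return the environment label, or raise instead of guessing."""
    env = tags.get("environment", "").strip().lower()
    if env not in VALID_ENVIRONMENTS:
        # Never default to "sandbox": destructive actions may run there.
        raise ValueError(
            f"account {account_id} has no valid environment tag: {env!r}"
        )
    return env


def fetch_account_tags(account_id: str) -> Dict[str, str]:
    """Read account tags from AWS Organizations. Needs boto3 and credentials."""
    import boto3  # imported lazily so label_account runs without it

    org = boto3.client("organizations")
    # Reads the first page only, which is enough for a handful of tags.
    response = org.list_tags_for_resource(ResourceId=account_id)
    return {t["Key"]: t["Value"] for t in response["Tags"]}


label = label_account("111111111111", {"environment": "Prod"})
print(label)
```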

So moving on to the detect and respond elements of the tool: we talked about RDS and how all of the information is kind of landing in there as the sweet spot. Once it's in RDS, we can apply query filters. In our case, a lot of our stuff is built on Python, so we're using SQLAlchemy, but really the core of the matter is that the information has to be stored in an area where you can add query filters to it.

It has to be readily accessible, and the integrity has to be there. So RDS really made a lot of sense for our particular purpose here. Once it's in RDS and we're applying these SQLAlchemy query filters on there, we're able to save the views within the tool, so we don't have to keep going back and recreating things on the fly. And once we have these saved views, we can organize them. I can say, hey, show me all of the controls that I have for sandbox, show me all of the controls that I have for my QA accounts, and it makes organizing things a whole lot easier.
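A saved view can be thought of as a named, reusable filter with an environment and a severity attached. The sketch below is a hedged approximation: the talk's tool uses SQLAlchemy against MySQL on RDS, whereas this example swaps in the standard library's `sqlite3` so it runs without extra packages, and the `resources` table, the `SavedView` class, and `run_view` are all invented for illustration.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SavedView:
    name: str
    environment: str   # which accounts the control applies to
    severity: str      # e.g. "critical" views could be routed to the SOC
    where: str         # trusted SQL fragment written by security engineers
    params: Tuple = ()


def run_view(conn: sqlite3.Connection, view: SavedView) -> List[str]:
    """Return the IDs of resources matched by a saved view."""
    sql = (
        "SELECT resource_id FROM resources "
        f"WHERE environment = ? AND ({view.where}) ORDER BY resource_id"
    )
    return [row[0] for row in conn.execute(sql, (view.environment, *view.params))]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources "
    "(resource_id TEXT, resource_type TEXT, environment TEXT, public_ip TEXT)"
)
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?, ?)",
    [
        ("i-aaa", "instance", "sandbox", "203.0.113.10"),
        ("i-bbb", "instance", "sandbox", None),
        ("i-ccc", "instance", "prod", "198.51.100.7"),
    ],
)

views = [
    SavedView(
        name="instance with public ip",
        environment="sandbox",
        severity="critical",
        where="resource_type = ? AND public_ip IS NOT NULL",
        params=("instance",),
    ),
]

# "Show me all of the controls that I have for sandbox."
sandbox_controls = [v for v in views if v.environment == "sandbox"]
matches = run_view(conn, sandbox_controls[0])
print(matches)
```

Because the filter is stored as data rather than rebuilt each time, the same view can be listed per environment, tagged with a severity, and tracked over time.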

You can set the severity levels of these particular views. In our particular case, we have critical scenarios that we send to our Security Operations Center, our SOC team. So if you're out there creating public buckets, well, our SOC team, our SWAT team in the cloud, is going to come for you. And that all comes from the severity levels that we have set.

And another good thing is of course, seeing the trends over time and with a tool like what we have now where we have the ability to create the filters ourselves. It gives you this customizability aspect of it where you can get stuff that's native out of the box from the particular vendor. But you can also add things in what we call plugins. So we're out here creating all of these different plugins for the tool. It can be detective query filters. It can be responsive actions, it can be custom harvesters that are going out there if they're not getting something to you natively that you're expecting. And by doing that, it allows you to expand your reach and really do anything you need in the cloud.

With this particular tool, we have yet to come across a scenario that we are not able to detect and respond to, or even prevent. So, the template scanning: I could do a whole presentation on just template scanning. This is the preventative stuff that we have going on, the linting tools that we have. There's a module that we have within this single pane of glass. All of our development teams out there, we try to get them to templatize everything, right? By templatizing things, you can of course run them through these linters. And before anything even gets deployed in your cloud account, you're able to pretty much detect if there are going to be any issues.

So now that we have all of our engineers templatizing everything and scanning it through, we have very little actual non-compliance, because we're catching it before it even happens, and we have the ability with all of our custom filters to add to and subtract from our template scanning at any given point. We can create warnings for folks as they're scanning, and we can go as far as blocking their deployment pipelines. If they have it set up with something like Jenkins and Terraform, they send their Terraform to us and we run it through our linting. We say, oh no, you've got a public IP you're using with your EC2. That's a big no-no. And we stop their pipeline from actually deploying the resources. So it's pretty nifty.
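A pre-deployment check of this kind can be sketched against the JSON form of a Terraform plan (the output of `terraform show -json`), which lists planned resources under `resource_changes`. The two rules, their names, and the warn/block levels below are assumptions for the example and are not Fidelity's actual rule set. One known limitation: values that are only known after apply are not present in `after`, so a real linter would also need to handle those.

```python
from typing import Any, Dict, List


def scan_plan(plan: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return findings for planned EC2 instances in a Terraform plan."""
    findings: List[Dict[str, str]] = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_instance":
            continue
        after = (change.get("change") or {}).get("after") or {}
        address = change.get("address", "<unknown>")
        if after.get("associate_public_ip_address") is True:
            findings.append(
                {"address": address, "rule": "instance-public-ip", "level": "block"}
            )
        if not after.get("tags"):
            findings.append(
                {"address": address, "rule": "instance-missing-tags", "level": "warn"}
            )
    return findings


def should_block(findings: List[Dict[str, str]]) -> bool:
    """A pipeline would stop the deployment when this returns True."""
    return any(f["level"] == "block" for f in findings)


sample_plan = {
    "resource_changes": [
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {
                "after": {
                    "associate_public_ip_address": True,
                    "tags": {"owner": "team-a"},
                }
            },
        },
        {
            "address": "aws_instance.batch",
            "type": "aws_instance",
            "change": {"after": {"associate_public_ip_address": False}},
        },
    ]
}
findings = scan_plan(sample_plan)
print(findings, should_block(findings))
```

Separating warnings from blocking findings mirrors the two outcomes described above: a notice to the engineer, or a stopped pipeline.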

So we can do all that on the detective side, and then we have just as robust a response side as well. We have these things called reaction triggers, which are kicking off the automation that we have. There are a couple of different triggers depending on what's actually happening in your cloud account. You can have resources being created, you can have resources being modified, you can have resources being destroyed. All of those events are going to translate into CloudTrail. It's ultimately going to flow into your RDS database, and your tasks are going to see that change happen and react to it.

So you could have automation set up that says, hey, every time I see an EC2 get destroyed, I want to go in and check a couple of different records, do this and that, and maybe send out some emails to folks, that kind of thing. So the options are really limitless when it comes to the responses that you can do. You can see some of the types of actions that we're doing now at Fidelity: we're deleting resources, we're modifying resources, we're opening notifications to folks, whether that's an email or an incident ticket or a service request, and we're creating resources. But the custom aspect of the tool is what we're really trying to push here.
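The trigger-to-action wiring can be illustrated with a small registry that maps a (trigger, resource type) pair to one or more actions. This is a sketch under assumptions: the trigger names `created` and `destroyed`, the `on` decorator, and the two sample actions are invented here, and the actions return a description of what they would do instead of touching any cloud resource.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Action = Callable[[dict], str]

# (trigger, resource_type) -> actions to run, in registration order
_REGISTRY: Dict[Tuple[str, str], List[Action]] = defaultdict(list)


def on(trigger: str, resource_type: str) -> Callable[[Action], Action]:
    """Decorator that registers an action for a reaction trigger."""
    def register(action: Action) -> Action:
        _REGISTRY[(trigger, resource_type)].append(action)
        return action
    return register


@on("created", "instance")
def terminate_if_public(event: dict) -> str:
    if event.get("public_ip"):
        return f"terminate {event['resource_id']}"
    return "no-op"


@on("destroyed", "instance")
def notify_owner(event: dict) -> str:
    return f"notify owner of {event['resource_id']}"


def react(trigger: str, event: dict) -> List[str]:
    """Run every action registered for this trigger and resource type."""
    actions = _REGISTRY.get((trigger, event["resource_type"]), [])
    return [action(event) for action in actions]


results = react(
    "created",
    {"resource_type": "instance", "resource_id": "i-aaa", "public_ip": "203.0.113.10"},
)
print(results)
```

New responses plug in by registering another function, which is the kind of customizability the plugin model is meant to provide.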

When you're looking for a tool, whether you're building one yourself or you're out there in the wild trying to buy one, you want to make sure that you can customize it. Because if everything that you're getting is just native and out of the box, you're going to be at the mercy of the vendor, essentially. So if you've got something critical that you need covered and they're not quite doing it for you, it's really, really nice to have that custom aspect, where you can go to your engineers and say we've got a very particular use case, and they can have something spun up for you in sometimes a day, sometimes even less.

So we'll move on to a specific example event that I have here. I spun this stuff up in October of this year. I'm pretty sure we're all familiar with a public EC2. So what we have at the top here is the CloudTrail view. Here at the bottom, you can see a RunInstances call. That was me creating the resource with a public IP in one of our accounts. Shortly after that, within a five-minute window, you see a TerminateInstances call happen. We've got some redacted data in here; that's the blacked-out stuff, and that's just our user name showing the role of the particular tool that's actually doing the termination.

So that's what it looks like in CloudTrail. But in our particular tool down here, that's our UI. This is pretty much what we can refer to as event-driven harvesting. As the information is happening, as I'm creating the resources and RunInstances is happening in CloudTrail, that exact call is being translated to the tool. You can see we have a RunInstances call down there within the tool. And because we have all of our information flowing in here, within that five-minute window you can even see we had another event happen within the account. That's just business as usual.

So if you were to look at this menu at any given point, you would see basically everything you would see in CloudTrail, for all of our accounts across all of our environments, flowing into this single pane of glass. Now, as you can see, the top call there is our TerminateInstances, matching the call that we had in our original CloudTrail up top, showing that within that five-minute window, we were able to detect that we had a public instance, respond to it, and terminate it.

Now, we don't just want to detect and respond to events. One of the things that we have to do in security is try to educate people. We don't want them making the same mistakes over and over and over again, right? It's a vicious cycle. So in this particular case, this is a screen showing the automation of the tool. On the right-hand side, you can see we have query filters and we have actions. Part of the remediation that we were doing for this example was a custom action that I wrote personally, called auto remediation email notification. This has custom routing that's specific to Fidelity and all the tools that we have. But the general gist of it is that whenever we're responding to these events, we try to send a notification message to the person who actually did it. That way, we're not going to product managers, we're not going to the directors or anybody above; we're going straight to the source, straight to the engineer who's actually creating the event, trying to educate them at the source.

So here, this particular automation that we have is "instance with public IP". We have a little extra verbiage there just to keep ourselves organized in our environment. But as you can see here, we have our query filters. We have three that are in play right here. The first one is instance with public IP attached; that one's pretty straightforward. We have another one about resource life cycle state. In our particular case, we're making sure that our instance is in a running state before we act on it. If the resource is in a deleting state, there's no point in trying to delete it again, right? And then of course, we have extra filters in here called "resource not" and "cloud account ID". Now, that's a whole can of worms in and of itself, because that's an exclusion. You can imagine when you're dealing with 8 million resources, 1,500 accounts, and 2,000-plus security scenarios, you're going to have some outliers, right? You're going to have some people that need to have a public EC2 or need to have a public bucket. So you have to have a whole exemption process. And if I get invited back next year, I can do a whole talk on exemptions just in and of themselves. But like I say, I only have 30 minutes here, so let me keep flying through this stuff.
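The way those three filters combine can be shown with plain predicates. The sketch below is illustrative only: the resource dictionaries, the function names, and the exemption format (a set of account ID and resource ID pairs) are assumptions, and the real tool expresses these as query filters against its database.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Resource = Dict[str, object]
Filter = Callable[[Resource], bool]


def has_public_ip(resource: Resource) -> bool:
    return bool(resource.get("public_ip"))


def is_running(resource: Resource) -> bool:
    # No point acting on something that is already shutting down.
    return resource.get("state") == "running"


def not_exempt(exemptions: Set[Tuple[str, str]]) -> Filter:
    """Build an exclusion filter from approved (account, resource) pairs."""
    def check(resource: Resource) -> bool:
        return (resource["account_id"], resource["resource_id"]) not in exemptions
    return check


def apply_filters(resources: Iterable[Resource], filters: List[Filter]) -> List[Resource]:
    """Keep only resources that satisfy every filter."""
    return [r for r in resources if all(f(r) for f in filters)]


resources = [
    {"account_id": "111111111111", "resource_id": "i-aaa",
     "state": "running", "public_ip": "203.0.113.10"},
    {"account_id": "111111111111", "resource_id": "i-bbb",
     "state": "shutting-down", "public_ip": "203.0.113.11"},
    {"account_id": "222222222222", "resource_id": "i-ccc",
     "state": "running", "public_ip": "198.51.100.7"},
]
exemptions = {("222222222222", "i-ccc")}  # approved public instance

to_remediate = apply_filters(
    resources, [has_public_ip, is_running, not_exempt(exemptions)]
)
print([r["resource_id"] for r in to_remediate])
```

Only the running, non-exempt public instance is left for the remediation action to act on.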

This is an example event of what we have. This is the email that I did receive after I created my public EC2. Here we have the cloud account, and we have some redacted information, but we have the general gist of what we did. We talk about the actual auto-remediated resource: what was the name of it? We have specific IDs in here, the cloud account, and the ARNs. We give them an event reference for all of our internal information. And then we give them an actual description of the security event and what we did. So that just goes along with the education that we're trying to do.
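A notification with those fields could be assembled along these lines, using the standard library's `email.message.EmailMessage`. The field names in the `event` dictionary, the sender address, and the wording are assumptions for illustration; the Fidelity-specific routing mentioned in the talk is not modeled, and actually sending the message is left out.

```python
from email.message import EmailMessage


def build_notification(
    event: dict, sender: str = "cloud-security@example.com"
) -> EmailMessage:
    """Build the email sent to the engineer who triggered the event."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = event["actor_email"]  # straight to the engineer, not a manager
    msg["Subject"] = (
        f"Auto-remediated: {event['resource_type']} {event['resource_id']}"
    )
    msg.set_content(
        "\n".join(
            [
                f"Cloud account: {event['account_id']}",
                f"Resource: {event['resource_id']}",
                f"ARN: {event['arn']}",
                f"Event reference: {event['event_ref']}",
                "",
                f"What happened: {event['description']}",
                f"Action taken: {event['action']}",
            ]
        )
    )
    return msg


sample_event = {
    "actor_email": "engineer@example.com",
    "resource_type": "instance",
    "resource_id": "i-aaa",
    "account_id": "111111111111",
    "arn": "arn:aws:ec2:us-east-1:111111111111:instance/i-aaa",
    "event_ref": "EVT-0001",
    "description": "An instance was launched with a public IP address.",
    "action": "The instance was terminated.",
}
msg = build_notification(sample_event)
print(msg["Subject"])
```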

And so the last slide here that we have is the after-effects of this. We talked about seeing trends over time, we talked about building reports and identifying problem areas. But what this tool is really allowing us to do is provide efficient remediation at scale. We're increasing our security awareness and educating all of our developers in real time, and it's just a win-win for everybody involved.

So with that being said, we just have some team recognition here. There's a famous quote I'm sure you're all familiar with: Rome wasn't built in a day. But this is what we say around the office: Rome wasn't built on a 9 to 5. We have guys all over the world, men and women completely dedicated. They eat, sleep, and breathe security. That's all we do. We've got multiple locations here in the US, and we're over in Ireland, India, and China. So we're always following the sun. Somebody's always awake watching this stuff, making sure that all of our data is nice and secure.

So with that being said, again, I'm Eric Moore of Fidelity Investments. Appreciate you all spending your time.
