Enhance your app’s security & availability with Elastic Load Balancers

Hello, everyone. Welcome to Enhance Your App Security and Availability with Elastic Load Balancing. I'm John Zobrist. I lead customer success for ELB. With me is Sathya Ramaseshan, who leads product.

Today, we're going to cover two topics - availability and security. We're going to follow this format:

  • We'll talk about how we do it internally at AWS
  • Then what we've built into the products - how you can leverage them for similar things

Throughout the presentation, we're gonna use these icons:

  • 💡 - A section where we're sharing internal details on how we think about things
  • 🆕 - A new feature we've launched in the last year
  • 💭 - Design considerations you can apply in your own architectures
  • 🔗 - Links to documentation (we'll pause so you can scan the QR codes)

By the end, you should have a good idea of how we think about availability and security as a service, and how we've built that into the products for you to leverage. You'll also see new features we've launched over the last year.

Let's jump right in.

The QR code here is a link to our security best practices site specific to Elastic Load Balancing. We also have an availability best practices guide coming soon.

We're going to cover three areas for availability: scaling, health, and operations.

Elastic Load Balancers scale to support virtually any size workload. We'll talk about how we do that, and how our systems work. One thing we do is provision redundancy - we'll explain how you can do the same.

Application Load Balancer scales first up, and then out. When traffic increases, it will replace smaller node sizes with larger ones. Once we reach the largest node size, we start scaling out - adding more and more nodes in each AZ.

The key things to know:

  • We scale up aggressively - triggering scaling in seconds, delivering capacity in minutes
  • We scale down cautiously - avoiding scaling down during periodic drops in traffic

We also scale on multiple dimensions - adjusting for almost every aspect of traffic.

NLB and Gateway Load Balancer use a different scaling system, since they leverage AWS Hyperplane. Hyperplane lets us distribute traffic transparently using one IP address to many different hosts.

The main benefits are:

  • Static IP address - doesn't change for life of LB
  • Persistent flows - failures don't impact existing flows
  • Transparent scaling - we can swap out hosts without disrupting flows

This happens regularly as traffic scales up and down.

The difference from ALB is NLB scales independently per AZ. Hyperplane is very zonal - it stays within a zone. So NLB will scale transparently in each zone based on local traffic. Still aggressive up, cautious down.

When deciding how much to overprovision, we aim to have at least 1 AZ worth of extra capacity. This helps handle surges and failures. If an issue occurs, the other zones already have capacity to take the traffic.

You can do the same with your workloads:

  • Lower thresholds to scale up sooner
  • Longer cooldown periods to avoid scaling down too quickly

For example:

  • CPU threshold 35%
  • Target 50 requests per instance
  • Scale down at 15% CPU

This will likely provision ~33% more capacity than you need, but helps handle surges and zone failures.
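If you're using EC2 Auto Scaling behind the load balancer, here's a rough sketch of what those settings could look like with boto3. The group name, resource label, and exact numbers are placeholders, not a recommendation:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group name; substitute your own.
ASG_NAME = "my-web-asg"

# Track a conservative CPU target so the group scales up well before saturation.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-up-early-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 35.0,  # scale up sooner than a typical 70% target
    },
    # Time a new instance takes to warm up before it counts toward the metric.
    EstimatedInstanceWarmup=300,
)

# Alternatively (or additionally), track requests per target behind the ALB.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-on-requests-per-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>  (placeholder)
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/abcdef1234567890",
        },
        "TargetValue": 50.0,
    },
)
```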

Now let's talk availability features we've built in:

  • Health checks
  • Target group fail open
  • DNS failover
  • Cross-zone load balancing
  • Anomaly detection

Just like you health check your targets, we health check all LB nodes constantly. When we detect a failure, we replace the node, restoring traffic quickly.

For your target groups, you configure:

  • Frequency of checks
  • Number of checks for healthy and unhealthy
  • Interval between checks

Things to consider for health checks:

  • Only include dependencies that would make the request invalid if they failed
  • HTTP checks are better than TCP - validate app not just network
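For reference, here's a minimal sketch of setting those health check parameters on an existing target group with boto3 (the ARN, path, and numbers are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical target group ARN and health check path.
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890"

elbv2.modify_target_group(
    TargetGroupArn=TG_ARN,
    HealthCheckProtocol="HTTP",       # HTTP validates the app, not just the network path
    HealthCheckPath="/healthz",       # keep the handler's dependencies minimal
    HealthCheckIntervalSeconds=10,    # how often each load balancer node probes the target
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=3,          # consecutive successes before "healthy"
    UnhealthyThresholdCount=3,        # consecutive failures before "unhealthy"
)
```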

A feature we launched last year is the target group fail open threshold. It's a threshold on healthy targets: when the percentage of healthy targets in a target group drops below it, the load balancer fails open and routes to all targets, even unhealthy ones.

This helps in overload scenarios - failures don't concentrate load further.

For example, with 50% threshold and 10 targets:

  • 1 unhealthy (9 healthy) - route only to the healthy targets
  • 6 unhealthy (40% healthy) - fail open and route to all targets

This can help recovery without bad customer experience, since some targets may still respond.

This is most useful for overload scenarios, as opposed to cascading failures.
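If you want to try this, the fail-open threshold is a target group attribute. Here's a sketch with boto3; the attribute key is my reading of the target group health feature, so double-check it against the current target group attributes documentation:

```python
import boto3

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890"

# Fail open (route to all targets) once fewer than 50% of targets are healthy.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[
        {
            "Key": "target_group_health.unhealthy_state_routing.minimum_healthy_targets.percentage",
            "Value": "50",
        },
    ],
)
```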

So consider changing the default from 100%. When your target group is failing open, the next layer in front of it that helps route around impairments is DNS. Every ELB, internal and external, has a Route 53 health check that you get for free with the service, and that health check removes any IPs from DNS if they become unhealthy or fail their health check.

There's a separate threshold that controls this DNS failover. If you configure it lower, a zonal impact may not trigger DNS failover for just the zone it's in. With cross-zone on, a target group problem that starts in one zone can push you over the threshold everywhere, so all of the zones start failing open, and when all of the zones fail open they're all failing their Route 53 health checks. At that point you really need another endpoint or place to direct that traffic.

So if you've got another ALB, maybe in another region, that you've already configured in Route 53 to fail over to, then having cross-zone on with this lower threshold can help. However, if you don't have anywhere to fail over to, you may want to consider turning cross-zone off. The primary benefit is that if you have a problem in one zone, the other zones continue to receive traffic and only the impaired zone is weighted out.

This is useful for failing away from an availability zone. The default for this threshold, which is separate from the fail-open threshold, is 100%. We recommend changing it to something lower just to get out of a zone, and again, this works best with cross-zone off unless you have another place to fail over to, like another load balancer in another region or another service.
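Here's a similar sketch for the DNS failover threshold, with cross-zone load balancing turned off at the target group level. Again, the attribute keys are my reading of the docs and the values are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890"

elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[
        # Fail the zone's Route 53 health check once fewer than 50% of its targets are healthy.
        {
            "Key": "target_group_health.dns_failover.minimum_healthy_targets.percentage",
            "Value": "50",
        },
        # Keep traffic zonal so only the impaired zone is failed away from.
        {"Key": "load_balancing.cross_zone.enabled", "Value": "false"},
    ],
)
```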

Let's talk quickly about Application Load Balancer with cross-zone off. This is a feature we launched last year, and it's just what it says: you can turn off cross-zone load balancing. That means when traffic comes into a zone, it won't leave that zone through the load balancer - the load balancer nodes in that zone only send traffic to targets in the same zone.

The advantage is zonal isolation and an easy way to fail away at the front door if there's a zonal failure. The key thing to ask yourself is: do I have enough capacity in every zone, and are all of the resources and dependencies my targets need available in every zone? Ideally they are, you have the same number of targets in all zones, they scale the same way, and then you can use cross-zone off.

If you don't have that, then with cross-zone off you may get an imbalance where some targets receive more traffic than they can handle because there are fewer of them in a zone. Cross-zone off also integrates really well with the Route 53 Application Recovery Controller zonal shift feature. Zonal shift lets you tell Route 53 to remove a zone from a load balancer's DNS.

We're effectively doing this all the time, because if those Route 53 health checks start failing, traffic fails away from the same IPs. But with zonal shift you trigger it yourself - for testing, or because other alarms are telling you to get out of the zone. It gives you an easy way, using Route 53's highly available control plane, to send a signal that says: get us out of this zone completely for this load balancer.
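As a sketch, starting a zonal shift from code could look like this; the ARN, zone ID, and duration are placeholders:

```python
import boto3

zonal_shift = boto3.client("arc-zonal-shift")

# Hypothetical load balancer ARN; the zonal shift is started against the LB resource.
ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef"

# Shift traffic away from one AZ for 30 minutes; the shift expires automatically.
zonal_shift.start_zonal_shift(
    resourceIdentifier=ALB_ARN,
    awayFrom="use1-az1",          # the zone ID to drain
    expiresIn="30m",
    comment="Elevated 5xx observed only in this AZ",
)
```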

While we're talking about zones, we want to cover a feature we launched just this year: Network Load Balancer zonal affinity. It's just like it sounds - the zone the client is in will be the zone whose IP the Route 53 DNS record returns for that request. So when your clients are in EC2 and are zonal, they can connect to your NLB with either 100% zonal affinity, or you can pick 85%.

The default is no affinity - DNS records are returned with 0% zonal affinity. If you have a use case where you want lower latency, or want to save on costs for cross-zone traffic, this can help keep things in the zone. It still works with all of our DNS health checks and with Route 53 zonal shift: you can still fail away from a zone and traffic will shift to the other zones.

The 85% configuration is often the most useful because it keeps 15% of traffic crossing zones, so you won't be surprised if something happens and you fail away from a zone - you'll already have 15% of your traffic going cross-zone. 100% is a good setting too if you really want to keep things isolated to the zone. Definitely check that out.
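Zonal affinity is a load balancer attribute on the NLB. Here's a sketch, assuming the attribute key and values I remember from the launch (verify against the NLB attributes documentation):

```python
import boto3

elbv2 = boto3.client("elbv2")
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/1234567890abcdef"

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=NLB_ARN,
    Attributes=[
        # "partial_availability_zone_affinity" keeps ~85% of resolutions in-zone;
        # "availability_zone_affinity" is 100%, "any_availability_zone" is the default.
        {"Key": "dns_record.client_routing_policy",
         "Value": "partial_availability_zone_affinity"},
    ],
)
```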

So the next thing we want to talk about is anomalies. So health checks are great for detecting hard failures. We know something broke. We know our dependencies aren't working. We're going to fail our health check or we're unreachable, something crashed. But a lot of failures are gray failures where there's something underperforming or not doing as well as it should be.

Internally at AWS, we've got a system that looks at all of the nodes in a load balancer and compares them against each other. If one of them is an outlier, it proactively replaces that node with a new healthy one. When we do these replacements, and the same applies when we scale, we use a graceful replacement strategy: we launch the new node, put it in DNS, pull the old node out of DNS, and let it drain until it has had zero connections for five minutes. So a node can drain for a potentially long time.

This system has saved us from so many gray failures and helped all of our customers have a higher availability posture. We thought it would be awesome if we could do the same for their targets. So this year we launched Automatic Target Weights for ALB. What it does is monitor target responses; for targets that send errors back or that we're unable to connect to, it shifts traffic away from them.

Let's say one back end starts returning 500 errors and the others are not: we rapidly shift traffic away from it using a multiplicative decrease, so its weight scales way down. This is not a health check, and it's not going to cause the health check to fail. We also never go to zero, because we still need to send some requests to that target so that when it recovers, we can detect it and send it more traffic again.

So if this target starts recovering, we do a slow additive increase. So we're not going to ramp it straight back up to 100% of the distribution, but we'll slowly add more and more as long as it's continuing to pass valid traffic and valid responses back.

The cool thing is that the detection is enabled for all routing algorithms on all ALB target groups, which means if you have an ALB today with a target group, you have two metrics. One tells you whether we detected an anomalous host: go look at your zonal metrics for the target group, and if you see the anomalous host count metric above zero, you can investigate why the ALB thought that host was having a problem. It might lead you to find and replace that one host, or to root cause what happened.

If you're seeing that, you may want to turn the feature on so that routing is based on the weight of the target; it would have shifted traffic away from those targets. We'll do this for up to 50% of the targets, and the key thing to know is that unhealthy targets count toward that limit. So if half of your targets are unhealthy, we won't do any anomaly mitigation and won't shift traffic away from the other half, because your available capacity is already reduced and we don't want to reduce it further.

When we do shift traffic away from a target, we emit a metric for the number of targets we didn't send requests to - that's the mitigated host count. In testing and with internal users, it's been an awesome success in places where health checks aren't enough to detect and replace an impaired target.
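To tie that together, here's a sketch of turning on the mitigation and checking the anomaly metric with boto3. The attribute keys, metric names, and dimensions are my reading of the launch and worth verifying:

```python
import datetime
import boto3

elbv2 = boto3.client("elbv2")
cloudwatch = boto3.client("cloudwatch")

TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890"

# Turn on anomaly mitigation; it requires the weighted_random routing algorithm.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "weighted_random"},
        {"Key": "load_balancing.algorithm.anomaly_mitigation", "Value": "on"},
    ],
)

# Check whether any hosts were flagged as anomalous over the last hour.
now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="AnomalousHostCount",
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/abcdef1234567890"},
        {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)
print(resp["Datapoints"])
```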

So here's a QR code to the What's New post that goes into this in further details. I will pause a second for that.

We also do this on Network Load Balancer, and it's slightly different. Because we're using Hyperplane, we have a little more visibility into the hosts as the traffic goes through them. The health checks for your NLB and Gateway Load Balancer actually go through the same ENI and the same path as your traffic, and our health check system tracks that traffic through all of the internal Hyperplane hops.

If some of those hops aren't passing their internal health checks, we shift traffic out of that zone by failing that same Route 53 health check, which removes the AZ from DNS and helps your clients route to the healthy AZs. If all AZs in your NLB are unhealthy, the DNS record fails open - which is really only useful when you have another zone, another load balancer, or another resource configured in Route 53 to take that traffic.

"We prefer that strongly. Uh we only alert when we need a human to come and actually do something. And a key thing before we're going into production with a system is we write run books and we review these run books in our weekly operational meetings.

If something happened that we didn't cover in those and the run books need to be easy to grok or understand at 3 a.m. because you imagine you're a developer, you're in the middle of the night, get alerted for the system. You've got to have something that you can actually understand and reason about.

With operational visibility, we break this into three areas. We've got canaries, which send artificial traffic or artificially use the system and show us that things are working. We've got metrics, which show how good our customer experience is, as well as how big our volume is and how much work we're doing. And then we've got alerting, where we page or alert operators to come and help fix things.

Our first canary is the data plane canary. For ELB, we have data plane canaries in all regions and all zones, and every data plane canary in a region sends traffic to every zone, for every load balancer type, for every type of traffic: IPv6, IPv4, TCP, UDP, HTTP, HTTPS, everything. This system is non-mutating - we don't change the load balancers under test, and anything that does change has to be transparent to the test. The tests are always running; if they detect an error or a problem, we get paged, and we can jump on and see whether we need to fail away from a zone or check that our automatic mitigations are working.

The other canary is our control plane canary. These are full life cycle tests that we run in all regions and all zones, for every type of load balancer we have and every type of traffic: we launch a new load balancer, configure it, and make sure it serves traffic. We change things and make sure the change went through correctly. We delete the load balancer and make sure everything cleaned up. We run that every minute, for every load balancer type, in every zone. If anything goes wrong, this is one of our first alerts to say, hey, customers may not be able to create load balancers.

When we look at metrics, we break them into two categories. First are positive metrics, which show how well your workload is working: things like request count, volume of traffic, maybe success codes from your logs. These confirm the system is actually doing work, and you may want to alarm on them if they go to zero or leave an anomaly band using CloudWatch anomaly detection. Then we have negative metrics, which are usually an indication that something is breaking or doing something unexpected. For ELB this is frequently 5xx counts, TLS negotiation errors, back-end connection errors, and the like; in other systems it could be database errors, or a metric your system emits when something it attempts fails. When we're looking at our metrics - and this goes back to that 3 a.m. scenario - one of the first things we look at is the zonal dimension.

We build everything so that the zones are as isolated from each other as possible, and we expect that one zone could have an impairment without affecting any other zone. What I tell customers all the time is: go look at your ELB metrics in the zonal dimension. If you get alerted for 5xx or another problem, check whether it's only in one zone or in all zones. If you're seeing it in all zones, then because we have strong zonal isolation, it's less likely that the load balancer is the cause - which I know sounds counterintuitive, but generally that's what we find.

If you see it only in one zone, that's when you may want to open a high-severity case and say, hey, we're seeing these errors, it's impacting our service, please help. Support can look at it quickly, see if it's one node, and replace that node gracefully.

When you're alerting on negative metrics, for example ELB 5xx, there are a few things to keep in mind. You want to identify your key metrics and have alerts on pretty much every aspect of them, but you don't want to alert on a single data point of a single metric. I generally say prefer short periods, like one minute, and many of them, and leverage M-of-N alarms - something like 5 of 7 or 8 of 10 - so you can detect a problem without the breaching data points having to be sequential.

We also recommend, and we do this internally, keeping two or more levels of alarming. There's a level that says something might be wrong, but it's probably not worth waking anyone up - maybe the automated systems will fix it - and a higher level that says this is getting a lot more severe and we need people to engage immediately. Those are the ones we set to page us.

In this example, 10% is our lower 5xx threshold, evaluated at 7 out of 10 minutes. It catches longer, less frequent errors - something somebody can look into when it's daytime in their time zone. Then we've got a 25% threshold with a shorter window, but still more than one data point. That's the one we page people on and say, hey, our error rate has really spiked, we need to look at this right now because it could be impacting customers.
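Here's what those two alarm levels could look like as CloudWatch metric-math alarms on the ELB 5xx rate. The names are placeholders, and the 3-of-5 window on the page-level alarm is just an illustrative choice:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

LB = "app/my-alb/1234567890abcdef"  # LoadBalancer dimension value (placeholder)

def error_rate_queries(lb_dimension):
    """Metric-math queries: ELB 5xx as a percentage of requests."""
    dims = [{"Name": "LoadBalancer", "Value": lb_dimension}]
    return [
        {"Id": "e5xx", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
                                   "MetricName": "HTTPCode_ELB_5XX_Count",
                                   "Dimensions": dims},
                        "Period": 60, "Stat": "Sum"}},
        {"Id": "req", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
                                   "MetricName": "RequestCount",
                                   "Dimensions": dims},
                        "Period": 60, "Stat": "Sum"}},
        {"Id": "rate", "Expression": "100 * e5xx / req",
         "Label": "5xx rate (%)", "ReturnData": True},
    ]

# Lower-severity "ticket" alarm: 10% error rate in 7 of 10 one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="my-alb-5xx-rate-ticket",
    Metrics=error_rate_queries(LB),
    Threshold=10.0,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=10,
    DatapointsToAlarm=7,
    TreatMissingData="notBreaching",
)

# Higher-severity "page" alarm: 25% error rate in 3 of 5 one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="my-alb-5xx-rate-page",
    Metrics=error_rate_queries(LB),
    Threshold=25.0,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=5,
    DatapointsToAlarm=3,
    TreatMissingData="notBreaching",
)
```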

When we think about how to page, or what to do when you're alerting operators - again, we prefer automatic mitigation, but the key thing is the alert has to be actionable. If you have an alert on a metric and you get paged, a great way to check whether it was actionable is to look at the runbook. If you don't have a runbook, was it really actionable? If it was, you probably want to create one. But you also don't want to page people if there's nothing they can do about it. We pre-plan everything with runbooks.

When we're taking a new system into production, or making a big change, we go through an operational readiness review. Part of that is asking: what are your alarms, what are your key metrics, are they set correctly, and what are your runbooks for when those break?

The other thing we do is automatic escalation. If an engineer gets paged and isn't making progress, they're trained to escalate, but we also have our systems configured to automatically escalate if the ticket doesn't get resolved or put into a status that shows it's progressing toward resolution. Another key component is our dashboards.

We do regular operational reviews at the team level, weekly. Every team has an operational review where they look at their dashboards. Dashboards are grouped by topic, so we'll look at one part of the system and see all the key metrics, along with lines showing where we've defined our first problem level and where we've defined the next level. If we see something on those that we didn't get alerted for, we dig in and ask why: maybe we missed an alarm, maybe a threshold was wrong. We also check whether there was customer impact.

You can use CloudWatch dashboards. This is a screenshot of a CloudWatch dashboard we created for an Application Load Balancer. It shows some positive workload metrics - processed bytes, aggregate request counts - along with their timelines, and then response codes showing our 2xx, 3xx, 4xx, and 5xx. This is the kind of thing we'd look at and say, we had a 5xx spike for 20 minutes and didn't get alarmed, so let's go look at that alarm and see why. CloudWatch dashboards are amazing - we recommend everyone use them. We really like them internally and use them all the time.

Let's talk about deployment. When you have enough customers - and I'm sure you all know this - there's never a good time for downtime, so you can't plan on your deployment causing downtime. We don't do that. When we make changes, we use graceful processes, and we bias heavily against anything that would declare an expected period of downtime, because there's never really a good time for anybody to be down, especially on purpose.

The other thing we do is limit blast radius. We start with a very small cluster - a single box, maybe a small cluster of boxes - and then slowly expand, with a tight feedback loop on whether our metrics look good. For our positive metrics, we want to see that volumes didn't change and nothing dropped off; for our negative metrics, we want to see that we didn't have a big spike. If all of that looks good, we expand the deployment. We don't deploy to multiple availability zones in the same region on the same day.

We also use deployment stripes, meaning we're not going to deploy to us-east-1 and us-east-2 on the same day. As we get further into a deployment we gain more and more confidence and we do ramp it up, because we have a lot of regions and a lot of hosts to deploy to. As part of that, we have strict change management.

I'm sure a lot of you know change management systems: you plan what you're going to change and then follow the runbook you essentially create for the change. For most of our changes we bias toward automation, and we deploy with things like CodePipeline. It's all CI/CD - continuous integration and full automation - so humans don't have to watch every step or trigger the next stage of the deployment.

Another thing we changed a long time ago: instead of deploying off-hours, we deploy when the team that owns the service is awake, alert, and likely to be at their desks, so we can respond quickly and escalate to engage more engineers if the on-call engineers don't know how to progress. That means we generally don't deploy outside of Monday through Thursday, 9 a.m. to 5 p.m. local time for the team.

So let's recap: we don't deploy with expected impact. When we do make a mistake deploying and we have a conversation with a customer who may have been impacted, one of the questions they often ask is, can you let us know the next time you deploy? What we say is, we really don't do that - we need to make sure our deployments don't cause impact. Whenever we have a deployment that does cause impact, we write a correction-of-errors document, a follow-up document to the event where you go look at what happened and why.

We also talked about automatic mitigations. Things like health checks and Auto Scaling groups are really great for saying: your target went unhealthy, we replaced it automatically, and we didn't need to engage a human.

We talked about engaging a human when it's valid to do so, and how you should plan for it with runbooks and make sure those are easy to understand and reason about. And we review operations regularly: every team has a standing meeting once a week where they go over their own operational dashboards, and these reviews go up the chain until the entire organization is reviewing the key high-level metrics for every service.

So that's it for my part on availability. I'm going to hand off to Sathya, who's going to talk about security. Thank you all.

OK, thank you, John. I'm Sathya Ramaseshan, and I lead product for Elastic Load Balancing. Here's a quick recap of the agenda.

I'm going to use a format very similar to the one John used. We're going to talk about how we do things. So these are internal details on how we do some of our implementations on security.

The intent here is to share our thought process when it comes to solving our collective challenges in security. Then I'm going to talk about some of the features that we've launched. So these are things that you can configure in order to meet your security requirements.

The icon legend is the same as the one John used. The two I want to call out are the internal-details icon, which will show up on slides that share internal details, and the launch icon for new features.

So let's get started. First I want to talk about the key design components of an AWS load balancer. There are two. The control plane, to use an analogy, is like the captain of a ship: this is where our business logic lives. It's onboarded through APIs, which the control plane converts into configurations that are pushed down to the workers.

The workers, or the engine of the ship, are our data plane, which consists of several compute nodes that process the business logic for all of the traffic we receive.

So to put it all together, in this analogy, the service is like the ship - that's Elastic Load Balancing. The control plane is like the captain. The data plane is like the engine.

I wanted to set up these terms upfront because I'll be using them throughout the rest of this talk. So I figure we all get on the same page.

So when it comes to security, we recommend that you think about defense in depth with your load balancers and what that really is is a layered approach to security where you have both network layer controls as well as application layer controls.

And when it comes to load balancing, we recommend that you think about it in five areas. The first one being IP based access controls. So think about this as coarse grained access controls or network level access controls.

The second one is encryption in transit. So think about this as preventing man in the middle attacks.

The third one is reliable authentication. So these are finer grained access controls. So application layer access controls, authentication, authorization, things of that nature.

Configuration correctness. So the best way to think about this is after building and implementing a really secure architecture, a misconfiguration could still cause exposure. So how do we get ahead of that?

And then the last one is to extend your defense by using third party solutions or solutions from other teams or custom solutions in order to achieve an even higher security posture.

So I will be going into details on each one of these bullets and I'll talk about some of the internal things that we do and then the features that you can configure.

Let's get started with IP-based access controls. Our service, much like many services in the cloud, is built on distributed compute, where our compute comes from various physical servers that live in a number of availability zones.

So from a security perspective, what that meant for us is to have complete awareness of all of the IPs that our resources are consuming. And in order to accomplish that, we use security groups in our control plane and our data plane.

We really like security groups because they are simple, easy, and intuitive to configure, and we really appreciate how useful they've been for us.

We've also heard that security groups on NLB is a top customer ask and we launched security groups on NLB earlier this year in August. Now with security groups on NLB, you can filter your traffic such that your load balancer only sees traffic from trusted, known and approved IPs.

It works for both IPv4 and IPv6 traffic. We've also enabled security group referencing such that you can lock your target groups to only receiving traffic from your load balancer. So this prevents extraneous access to your applications which could potentially cause exposure.

If you're a Kubernetes user, you could configure this using the AWS Load Balancer Controller.
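As a sketch of that pattern with boto3 - a security group on the NLB that admits only trusted ranges, and a target security group that references it (all IDs and CIDRs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")
elbv2 = boto3.client("elbv2")

# Hypothetical IDs; substitute your own VPC, subnets, and client CIDR.
VPC_ID = "vpc-0123456789abcdef0"
SUBNETS = ["subnet-aaa", "subnet-bbb"]

# Security group for the NLB itself: only trusted client ranges may reach it.
nlb_sg = ec2.create_security_group(
    GroupName="nlb-frontend-sg", Description="Trusted clients only", VpcId=VPC_ID
)["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=nlb_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                    "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}],
)

# Target security group references the NLB's security group, so targets only
# accept traffic that arrived through the load balancer.
target_sg = ec2.create_security_group(
    GroupName="nlb-target-sg", Description="Only from the NLB", VpcId=VPC_ID
)["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=target_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                    "UserIdGroupPairs": [{"GroupId": nlb_sg}]}],
)

# Attach the security group when creating the NLB.
elbv2.create_load_balancer(
    Name="my-nlb", Type="network", Scheme="internet-facing",
    Subnets=SUBNETS, SecurityGroups=[nlb_sg],
)
```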

Moving on to defense in depth using encryption in transit. This has been a key area of focus for us, and we've focused heavily on it because a load balancer is the front door to most AWS applications, and we believe that having strong encryption at your front door carries that benefit over to everything that sits behind it.

Our journey in this space actually started a few years ago, when we experienced the Heartbleed issue together. What we learned from that experience was that flaws were being introduced by the software implementations of TLS.

We wanted a durable solution for this, so we built our own TLS library and open sourced it in 2015. The library is called s2n. It was built to be small and fast, with simplicity as a priority, and it does not implement some of the rarely used cipher suites and extensions from TLS - the kind of code that was at the root of the Heartbleed issue.

We continuously validate this library to make sure that it maintains its high security posture and also having deep ownership of it allows us to quickly react to any issues that we see with s2n.

With s2n in place, we launched TLS 1.3 on ALB earlier this year, in March. Now both ALB and NLB support TLS 1.3, and they both use s2n.

At a high level, there are two benefits of TLS 1.3 over 1.2. The first is stronger encryption: it does this mainly by dropping some of the less secure ciphers that TLS 1.2 allowed, and it mandates perfect forward secrecy, so your session keys are safe even if your long-term private keys are compromised for whatever reason.

The second benefit is performance. TLS 1.3 accomplishes the TLS handshake in one round trip, which reduces latency and therefore you get better performance.

When it comes to configuring TLS 1.3 on your load balancer, you can choose from seven predefined security policies.
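For example, selecting one of the predefined TLS 1.3 policies on an existing HTTPS listener looks roughly like this (the listener ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical HTTPS listener ARN.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/1234567890abcdef/0123456789abcdef"

elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    SslPolicy="ELBSecurityPolicy-TLS13-1-2-2021-06",  # TLS 1.3 with TLS 1.2 fallback
)
```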

Staying on TLS, another popular request was to enable FIPS on the load balancer. To do this, we first invested in creating our own FIPS cryptography module. The module is called AWS libcrypto, or AWS-LC, and we launched it earlier this year, in April.

It is open source, and it's owned and maintained by AWS Cryptography. We built AWS-LC from a fork of BoringSSL but added various performance enhancements, for example speeding up some of the algorithms. In our internal testing, we noticed a 27% decrease in handshake latency for Amazon S3 when using this module.

We recently received our FIPS 140-3 certification from NIST, in October, and we plan on continuous validation so that you can easily pick up the new features and enhancements that the module provides.

With AWS-LC in place, we launched FIPS support on both ALB and NLB last week. To do that, we integrated s2n with AWS-LC: s2n handles the TLS handshake portion, the message exchange between the client and the load balancer, and AWS-LC does the underlying cryptography.

Just like TLS 1.3, you have the flexibility of using predefined security policies and you have eight to choose from.

The final point here is that we also enable end-to-end FIPS TLS connectivity. If you want end-to-end encryption, you can configure TLS to the target, and we'll use the AWS-LC module to establish the connection between the load balancer and your targets.
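Enabling FIPS is the same kind of listener change: pick one of the FIPS security policies. The policy name below is an assumption based on the naming convention, so list the available policies first:

```python
import boto3

elbv2 = boto3.client("elbv2")
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/1234567890abcdef/0123456789abcdef"

# Enumerate the available predefined policies and filter for the FIPS ones.
policies = elbv2.describe_ssl_policies()["SslPolicies"]
print([p["Name"] for p in policies if "FIPS" in p["Name"]])

# Policy name is an assumption; confirm it appears in the list printed above.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    SslPolicy="ELBSecurityPolicy-TLS13-1-2-FIPS-2023-04",
)
```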

Moving on to the next topic: reliable authentication. You all saw this diagram earlier, when I talked about how we implement security groups in our architecture. Within our control plane and data plane, there are several agents that share a particular node, and in the spirit of giving access only to the things that need it, there are places in our architecture where we establish mutual TLS between agents. We use mutual TLS to encrypt and authenticate traffic between two agents, and it's been extremely useful for us.

We've also heard that mutual TLS on ALB is a top customer ask. And I'm thrilled to say that we launched it earlier this week in Dave Brown's innovation session.

You can now safely offload authentication of x509 certificate based identities onto the load balancer with our mutual TLS support. We support both third party CAs as well as AWS Private CAs.

So the third party CAs support makes it convenient for you to migrate your existing mutual TLS implementations to the ALB if you'd like to.

We have support for revocation so we have a durable way to block access to compromised certificates if you need to do that.

We also provide certificate metadata in the request that is proxied to the target with which you can build your authorization logic.

Lastly, we've introduced a new connection log which can be used to troubleshoot or audit connections if you need to do it. This is the QR code for our What's New post for mutual TLS.

I'm gonna pause for a few seconds in case you wanna navigate to that place on your phones.

I'm gonna go over some additional details on mutual TLS at this point. I'm gonna first start off with the setup. So in order to configure mutual TLS on your load balancer, the first thing that you need to do is to create a resource called a trust store. This is a new resource that we've introduced on the load balancer.

It has two components. The first component is a CA cert bundle which is essentially a root or intermediate root or a set of intermediates. You upload this to the trust store in PEM format.

The next thing you can do is configure a revocation list. This step is optional, but we use the well-known CRL, or Certificate Revocation List, format to support the revocation feature.

Next, you attach this trust store to an ALB listener and then you just turn on mutual TLS on it in verify mode. Once the setup is complete, when we receive a client certificate, the ALB will walk the chain of trust and we make sure that we resolve to a root that was configured in the trust store.

If a revocation check was enabled, we'll go ahead and perform that check at that point. If everything passes the user is authenticated. And now we establish the TLS encrypted connection between the user and the load balancer.

Once that's done, we parse the certificate and we add certificate metadata in the form of HTTP headers which is then proxied to the target. Now when you take a look at these headers, the first four are subject, issuer, serial number and validity. Most customers will likely use these to build auth logic. But if you have a need for custom fields, we've included the leaf certificate, so you can parse it and build your logic based on it.
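Putting the setup together, a sketch with boto3 might look like the following. The API names come from the mutual TLS launch, and the bucket, key, and ARN values are placeholders, so verify the details against the current documentation:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Trust store from a CA bundle (PEM) that you've uploaded to S3.
trust_store = elbv2.create_trust_store(
    Name="my-mtls-trust-store",
    CaCertificatesBundleS3Bucket="my-ca-bucket",
    CaCertificatesBundleS3Key="ca-bundle.pem",
)["TrustStores"][0]

# Optionally add a CRL for revocation support (also stored in S3).
elbv2.add_trust_store_revocations(
    TrustStoreArn=trust_store["TrustStoreArn"],
    RevocationContents=[{
        "S3Bucket": "my-ca-bucket",
        "S3Key": "revocations.crl",
        "RevocationType": "CRL",
    }],
)

# Attach the trust store to an HTTPS listener in verify mode.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/1234567890abcdef/0123456789abcdef"
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    MutualAuthentication={
        "Mode": "verify",
        "TrustStoreArn": trust_store["TrustStoreArn"],
    },
)
```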

Ok, moving on to configuration correctness.

Like I mentioned earlier, even after building and implementing the most secure architecture, a misconfiguration could still result in exposure. So the way we think about this is: how do we limit human access where we can, in order to avoid misconfiguration?

What we do is typically avoid patterns that can lower our security posture, like static credentials, and instead use IAM roles, which have temporary, ephemeral credentials scoped to the deployment stage - dev, test, or production. By doing that, we keep our production credentials in the production environment and limit human access to secure resources.

Having said that, we all operate complex services, and from time to time we do need human access, at least to configure load balancers and other resources. From that standpoint, we think about how to prevent misconfigurations in a way that isn't too burdensome on engineers. The answer is guardrails: certain conditions must be met in order for a particular action to take place.

That sounds a lot like an IAM policy, and we use IAM policies extensively to prevent misconfigurations. Based on our own experience and feedback from customers, misconfigurations are a massive pain point.

So we decided to launch additional condition keys, five of them, targeted at improving your load balancer security. We believe this is a durable solution to the misconfiguration problem.

To take a step back: until now, all of the features I discussed offer protection from the actual traffic your load balancer receives. This one is a control plane protection, to prevent misconfigurations. I'm going to go over two of the security condition keys we've enabled, because they tie to a couple of feature launches.

I've already talked about the first one: security groups. You can now create an IAM policy with a shortlist of approved security groups, such that every new load balancer that gets provisioned, or existing load balancer that gets mutated, has to use one of those security groups. The security groups can come from your central networking team or your audit and compliance team, but once this policy is in place, you can be sure that your load balancers will only see traffic from the trusted, known, and approved IPs those groups allow.

You can see that I've used star as the resource, but you could always scope it to a particular load balancer ARN, so several teams operating in the same account can use these policies in a way that strictly meets their own security requirements.

The next one I want to talk about is the condition key for TLS policies. It works just like the security groups one: at a high level, you configure a list of TLS security policies in an IAM policy, such that new load balancers, and existing load balancers that are mutated, have to use one of the policies you've listed.

In this example, I'm showing some of the TLS 1.3 policies we have. If you use this, you're guaranteed that every new load balancer that gets provisioned has a TLS 1.3 policy on it. And you can get creative with this feature: for example, if you have a FIPS requirement, you can list only our TLS policies that have FIPS in the name, and now you're guaranteed that every load balancer that spins up is FIPS compliant.

In addition, anything you mutate will also keep a FIPS policy on it. Overall, we highly recommend coupling these IAM condition keys with their corresponding security features to improve the overall security posture of your architecture.
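As a sketch, a guardrail policy using those two condition keys might look like this. The key names, operators, and the deny-style structure are my reading of the launch, not a production-ready policy - for example, you'd also want to handle requests that don't carry these keys at all:

```python
import json
import boto3

iam = boto3.client("iam")

# Guardrail sketch: deny ELB create/modify calls that don't use approved
# security groups or approved TLS policies. Verify the condition key names
# against the ELB service authorization reference before relying on this.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequireApprovedSecurityGroups",
            "Effect": "Deny",
            "Action": [
                "elasticloadbalancing:CreateLoadBalancer",
                "elasticloadbalancing:SetSecurityGroups",
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringNotEquals": {
                    "elasticloadbalancing:SecurityGroup": [
                        "sg-aaaa1111", "sg-bbbb2222"  # approved groups (placeholders)
                    ]
                }
            },
        },
        {
            "Sid": "RequireTls13Policies",
            "Effect": "Deny",
            "Action": [
                "elasticloadbalancing:CreateListener",
                "elasticloadbalancing:ModifyListener",
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "elasticloadbalancing:SecurityPolicy": [
                        "ELBSecurityPolicy-TLS13-1-2-2021-06",
                        "ELBSecurityPolicy-TLS13-1-2-FIPS-2023-04",
                    ]
                }
            },
        },
    ],
}

iam.create_policy(
    PolicyName="elb-config-guardrails",
    PolicyDocument=json.dumps(policy_document),
)
```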

Next, I'm going to go over an example architecture to tie everything together. Let's say you have a VPC and your clients are on the internet: you attach an internet gateway to your VPC, then you provision a public subnet and put your load balancer in it - I'm showing an ALB as the example here - and you put your targets in your private subnets.

You put security groups on the ALB, so the ALB only sees trusted, known IPs, and use security group referencing on your targets, so they only receive traffic from the ALB. You enable TLS on the load balancer - we're using s2n for that - and with the new mutual TLS feature, you authenticate your clients.

In the space of authentication, we also have an integration with Amazon Cognito to do user authentication based on OIDC, if that's interesting to you. Another security integration we have is with AWS WAF, in case you want to add additional layers of protection. And we talked about IAM for preventing misconfiguration.

You can use AWS Config to track changes to your configurations, and CloudTrail, from an audit standpoint, to look at the activity in your account.

So moving on to the final topic, extending defense in depth with third party solutions. So in my previous slide, you all saw me introduce the integration of WAF with ALB. It's a very popular integration for us and it's also a pattern that we really like where you're able to insert a different service in order to extend or improve your security posture.

The Gateway Load Balancer was built exactly for this: to extend your security posture using third-party appliances. The background is as follows. A few years ago, we heard from several customers that they would like to use their existing security appliances, because of their familiarity with them and to maintain their existing investments. By security appliance, I mean things like firewalls, intrusion detection systems, intrusion prevention systems, and so on.

So what we wanted to do was to provide a scalable way to integrate with third party appliances such that they are easy to deploy, they can scale the way you'd like it to scale with AWS, and have the availability posture similar to an AWS native service.

With that in mind, we launched the Gateway Load Balancer in 2020 with a number of partner integrations. Even within AWS, the AWS Network Firewall actually uses the Gateway Load Balancer underneath.

To compare and contrast: ALB and NLB provide security for the applications behind them, for the traffic that is addressed to them. The Gateway Load Balancer, on the other hand, is a complementary service that provides security for traffic going anywhere in the VPC, and it uses third-party security appliances to provide that security.

More on the Gateway Load Balancer - by way of intro, and sometimes I'll say GWLB: the GWLB is a protocol-agnostic, bump-in-the-wire layer 3 gateway and layer 4 load balancer. It's a layer 3 gateway because you configure it as a next hop in a route table. It's a layer 4 load balancer because it does connection-level load balancing, in addition to target health checks. For the bump-in-the-wire implementation, we use GENEVE, an encapsulation similar to VXLAN.

From a benefit standpoint, the Gateway Load Balancer provides horizontal scaling and fault tolerance to keep your appliances highly available. It can span VPCs and accounts, which provides very good architectural flexibility. And it allows our partners to provide their appliance as a service instead of as an AMI, because if you have to run an AMI, you have to maintain it, upgrade it, scale it, and so on. You can use the GWLB in two modes.

The first mode is one-arm mode. With this, traffic from the source goes to the Gateway Load Balancer endpoint and on to the Gateway Load Balancer - this is the exact same pattern we use with PrivateLink today, endpoint to load balancer - and then it gets load balanced to the targets. The return path is the exact same thing in reverse order.

We call this one-arm mode because the traffic enters and leaves the target through a single interface - here, you see it's ENI0. This is the recommended config, because the Gateway Load Balancer sees traffic in both directions and the appliance can do stateful inspection. It's also the easier config.

The next mode is two-arm mode. Here, traffic from the source goes to the endpoint and on to the load balancer, gets load balanced to the target, and exits the appliance through a different interface to go to the destination - it enters on ENI0 and leaves on ENI1. This is a bit more complex to configure, but it gives you flexibility, and the main flexibility is that you can change the 5-tuple.

So at the target, if you want to NAT or do something like that, you're able to. The only thing to remember is that when the traffic returns to the Gateway Load Balancer, your appliance has to reset the 5-tuple back to the one the Gateway Load Balancer originally saw.

From a use case standpoint, the dominant use case is internet traffic inspection. In this diagram, you'll see traffic coming through the IGW onto an endpoint, and from there it gets forwarded to the Gateway Load Balancer and on to the appliance for inspection. When the traffic returns, it goes back to the endpoint and then on to the destination.
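In route table terms, that insertion is just pointing routes at the Gateway Load Balancer endpoint. A sketch, with all IDs and CIDRs as placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs for the inspected VPC's route tables and the GWLB endpoint.
GWLBE_ID = "vpce-0123456789abcdef0"      # Gateway Load Balancer endpoint
APP_SUBNET_RT = "rtb-app0000000000000"   # route table for the application subnet
IGW_EDGE_RT = "rtb-edge000000000000"     # route table with an edge association to the IGW

# Outbound: send internet-bound traffic from the app subnet to the GWLB endpoint.
ec2.create_route(
    RouteTableId=APP_SUBNET_RT,
    DestinationCidrBlock="0.0.0.0/0",
    VpcEndpointId=GWLBE_ID,
)

# Inbound: at the IGW edge, send traffic destined for the app subnet through the
# same endpoint, so the appliance sees both directions of the flow.
ec2.create_route(
    RouteTableId=IGW_EDGE_RT,
    DestinationCidrBlock="10.0.1.0/24",  # app subnet CIDR (placeholder)
    VpcEndpointId=GWLBE_ID,
)
```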

Another popular use case, or topology, is pairing the Gateway Load Balancer with Transit Gateway. I'm showing an example where you can accomplish both internet and inter-VPC inspection. For internet traffic inspection, the traffic comes in through VPC number 4 and then goes to the Gateway Load Balancer and on to the appliance pair up top, so you can configure that appliance to protect yourself from well-known internet threats.

The bottom one does east-west inspection, so you can have a different config tailored to your east-west traffic. The fact that the Gateway Load Balancer can span VPCs and accounts gives you the ability to have a central appliance VPC that is used across multiple VPCs. That's a really nice benefit and a very common architectural pattern.

From the time we launched, we've really focused on these inter-VPC and internet use cases. But as more and more customers onboard to the Gateway Load Balancer, we heard that they would like to inspect traffic from on-prem in a cloud native way, because they had hybrid deployments.

And in order to accomplish that, earlier this year, we launched a virtual gateway integration with the Gateway Load Balancer. It supports two use cases:

The first one is traffic coming through a VPN. So your on-prem traffic uses internet connectivity that is encrypted over IPsec, comes to the VGW or the virtual gateway, gets forwarded to the Gateway Load Balancer endpoint onto the load balancer, finally to the appliance target.

The other use case is the integration with Direct Connect. So in this, if you want private connectivity, you can use Direct Connect between your on-prem and your AWS environment. And in this case, what happens is the traffic from on-prem goes through the Direct Connect gateway onto the VGW and finally reaches the appliance target.

Like I mentioned earlier, the Gateway Load Balancer is a partner-forward product, and we have several partner integrations. To call out a few: from an advanced security perspective, we have integrations with Palo Alto Networks and Fortinet; from an analytics perspective, with NETSCOUT; and from an orchestration perspective, with Terraform.

All of these are available through the AWS Marketplace. That brings us to the end of this presentation. Thank you all for being here and we really appreciate your feedback. So please do fill out the survey for this particular session when you get a chance.
