Must-have network diagnostics and troubleshooting tools

Good afternoon, Vegas. Wow. Fourth day in at re:Invent. I'm simply amazed. So thank you for coming to our session today as part of re:Invent 2023.

I want to welcome you to our session on must-have network diagnostics and troubleshooting tools. This is my first re:Invent, and I do realize that we are standing in the way of a huge crowd and re:Play, so we're gonna keep this session quick, short and sweet.

I'm Ruskin Dra, a Solutions Architect from a land far, far away called New Zealand. I am by trade a C# .NET developer, and I've been programming in .NET since before we had typed lists, so this is back in the .NET 1.1 days. And I have Evgeny with me; he'll introduce himself to you. Over to you, Evgeny.

Hi, Las Vegas. My name is Evgeny Wo. I'm a specialist SA leader, also based in a land far, far away, in Sydney, Australia. I lead specialist SA teams across APJ. Prior to this role, I used to run networking SA teams for about four years, so you may have read some blogs from me or seen me before at re:Invent. This is my fifth or sixth re:Invent, so yeah, definitely not a stranger. But back to Ruskin to take it away, and I'll come back and see you in a few minutes.

Awesome. Thanks, Evgeny. So just a couple of housekeeping tips first. This is a 200-level session. We'll be talking about some networking tools and services and some debugging and diagnostic tooling. We might dive deeper into some parts, but on the whole it is a 200-level session. We've also got some slides you may like to take photos of, and some QR codes for those of you who like to reference material further and learn about things in your own time.

So why are we here today, and what brings you to this session? We've spoken to a lot of customers, we've spoken to a lot of you, and we've seen patterns emerge. One of the things customers are telling us is that the AWS network and networking services solve a lot of challenges for you, but it's still a bit of a black box. Another thing we have heard from customers is that the network is often quite easy to blame, and it takes days, sometimes even weeks, to prove its innocence. And we've all heard of the famous networking adage: it's always DNS. And finally, we have heard from customers that choice and options are great, but when it comes to network diagnostics and tooling, how do we know which one to use for what?

So let's look at the agenda for our session today. We'll be defining the problem space first; we do want to thoroughly understand what we are trying to solve today. Next, we will look at some ways in which we can observe and visualize our network, followed by detecting and analyzing potential problems, debugging and troubleshooting, understanding root cause analysis, and afterwards improving our network. We'll summarize our findings and close off our session.

So before we embark on our journey today, I wanted to put forward a mental model of what we'll be talking about during the session. Most, if not all, of us are here to talk about networks, so networking is in the middle of a flywheel. Like any problem, it's good to understand it thoroughly through observation. Once we understand the problem, it will help us detect potential issues, either while they're happening or proactively, even before they happen. Once we detect issues, we can go ahead with root cause analysis and debug the issue, allowing us to improve our network topology. Improving our network topology results in better performance, leading to more users and ultimately more traffic, allowing us to see even more patterns to improve our topology further.

So let's dive a little bit deeper into why we are here. You must have heard this phrase from our CTO at least 100 times by now: everything fails, all the time. It is so simple, it is so true, but it is daunting at the same time, though only if you're not prepared. So let's level set a little bit on how we have distributed our large network in AWS.

So as you might know, we have regions and availability zones. Regions are geographic areas comprised of transit centers and availability zones. Further, availability zones are comprised of multiple data centers. It is critical to understand this setup because when it comes to network troubleshooting and diagnostics, it is important to know what parts of the network you need to monitor, and more importantly, what parts of the network you can monitor.

So as we look at some moving parts in AWS with regions and availability zones, let's have a look at some ways in which your workload can get traffic from end users. One way is to expose your workload to the internet. Another potential way is to expose your workload through edge locations if you're leveraging CloudFront distributions or Global Accelerator endpoints. Alternatively, you could also expose your workload to corporate data center users through your wide area network and Direct Connect. Regardless of how your users are accessing your workload, there are quite a few moving parts to consider when you're tracing a packet all the way from its source to its destination.

Let's dive a little bit deeper into the internet flow. As a user, when you access a workload in AWS, the first thing you need to know is where to send the request, so there's some level of name resolution happening. Next, you route that request to your internet service provider, followed by its peers and further down to internet exchanges. While all that is happening, there is network address translation happening in the background, from private to public and public to private. And not to mention the myriad levels of security going on in the background as well: you've got firewalls, access control lists, intrusion detection and prevention systems.

So as you trace a packet from source to destination, you need a way of reasoning about and troubleshooting issues if things fail. And as you saw earlier, just as Werner Vogels put it, everything fails, all the time.

So once the packet is in AWS, we have hundreds of interconnected devices which allow your packet to get to your workload. This allows us to give you the performance, availability and reliability you expect from the AWS network. Now, this is our underlay network, and it is presented to you in the shape of a software-defined network paradigm which you know as a VPC.

So this might look intricate on the surface, and in some ways it is, but we have the right expertise, the tools, the data points and the metrics to help us diagnose and debug issues at scale if and when they happen. And today we're gonna be talking about some of the services which extend this capability to you, to help you during your diagnostic journey as well.

Let's go a little deeper. At AWS, we recommend getting your traffic onto the AWS backbone sooner rather than later using our edge locations. Once the traffic is in AWS, it is directed to the transit centers, which are then connected by redundant dark fiber to our availability zones. Intra- and inter-AZ traffic leverages a concept called DWDM, which stands for Dense Wavelength Division Multiplexing. It's essentially a fancy way of saying: how do you combine multiple wavelengths of light into a single fiber cable? Now, as a customer, you don't need to worry about any of this. The underlay network is fully abstracted and taken care of for you as part of the shared responsibility model.

So even though this underlay network is abstracted away, there are some key learnings that I would like to share with you today. With hundreds of interconnected devices, it is critical for you to exercise your network, all its paths and all its routers. At AWS, we have learned that failing over to a router or a path which hasn't been exercised carries significant risk at our scale. Now, you can apply this principle in your network topology by, for example, leveraging active-active VPN connections using ECMP, or active-active Direct Connect connections using BGP communities.

We have also learned at AWS that we want to measure metrics at all levels of the spectrum. Say, for example, your network is performing within a certain latency threshold most of the time but occasionally exhibits a higher amount of latency. Now consider the impact of that higher latency on critical workloads or users experiencing that issue. Also as you grow, this issue is exacerbated. So make sure you consider metrics at all levels of the spectrum.

And finally, at AWS, we design our networks for scalability: scalability not just for growth, but also for failure. What happens when a link fails? Can your remaining links handle the load? Make sure you consider this, otherwise you will get a cascading failure of your remaining links.

There's a great talk given at re:Invent 2018 by Distinguished Engineer Tom Scholl. For those of you who would like to dive deeper into how we manage the AWS network, there's a QR code at the top right, so definitely check out his talk.

So we looked at AWS, we looked at traffic flow. Now I want to look at this from the perspective of two people. Before that, I just want to get a gauge of how many of us in the audience are application developers. Show of hands, please: who would consider themselves application developers? Cool, including myself. Awesome, thank you for that. Who would consider themselves network engineers here? Awesome, good mix.

So with that, I wanted to introduce you to Sarah. Sarah is an application developer; she converts requirements to software. But as all of you might know, software inevitably sits on top of a network. So Sarah has conversations with Alex, who is a network engineer. Alex makes sure that the network supports any workload sitting on top of it. But as you might already know, software and requirements change often, so Alex has to make sure that the network adapts to those changing requirements.

So as far as Sarah is concerned, her perspective is pretty much workload-centric. Let's have a look at what she goes through. Sarah has a workload and it's deployed in the Sydney region in a production AWS account. Things are going quite well for Sarah, and she deploys a second workload. Now, Sarah realizes that best practice dictates that she tests her workloads before she deploys them, so she deploys the same workloads in a non-production account. Things are progressing even better, and now she has deployed an automation pipeline to help her with deploying those workloads. She speaks to Alex and says, hey, can you set up some connectivity? And Alex sets up a peering connection for her. Things are progressing even further. Now she needs to access AWS services, and Alex sets up some VPC endpoints for her and also moves her over from the peering connection to a Transit Gateway connection, because Alex understands that it scales better.

So as you saw there, Sarah's perspective is mainly workload-centric. Let's have a look at Alex's perspective. Alex needs to be aware that there is a networking footprint on premises and in the cloud. Some of the teams in Alex's corporation have deployed a workload in a default VPC for some prototyping, and they're leveraging Amazon DynamoDB as well. Prototyping is going quite well; they've deployed their workload in the VPC and they need some connectivity, so Alex sets up peering for them. Another team comes on board, deploys yet another VPC and asks Alex for connectivity. Now Alex is scratching his head here because the CIDR ranges are overlapping, so he needs to debug and diagnose how to set up connectivity. The team who set up this VPC is under some stringent release deadlines, so they set up the workload and deploy a security group with some questionable naming conventions, routing conventions and pretty permissive security group rules, accessing the internet, an Amazon S3 bucket via a VPC endpoint, et cetera. The corporate users need access to this workload.

So Alex goes ahead and sets up a VPN connection. This workload is really popular, so it's deployed into a new region, and Alex sets that up as well, and also sets up Transit Gateway to scale better. The company itself is doing really well; there's a new branch office coming up in San Francisco, so Alex is quite busy setting that up.

After a while, the corporate users are demanding faster access speeds, so he needs to move away from VPN onto Direct Connect. They're also requesting a better experience using DNS names, so he sets up Route 53 Resolver.

So all in all, Alex and Sarah have been quite busy but you can see different perspectives happening there. So let's have a look at some obstacles they faced along the way.

As you can see, network topologies tend to be quite dynamic; they grow organically with changing requirements and they're quite expansive, spanning on-premises and cloud infrastructure. Networks also tend to be quite diverse: different people use different naming conventions and routing conventions, and you'll also have different consoles for accessing your on-premises infrastructure versus infrastructure in the cloud.

And finally, all networks imply some level of governance. All networks have a purpose, and that purpose is a contract with the workloads the network supports. Often this contract is implicit or implied, and as networks grow, these contracts become tribal knowledge, making change difficult and networks quite brittle.

So let's have a look at some services, some techniques and processes which will help Alex and Sarah through these obstacles.

So we looked at this flywheel earlier; let's start off by understanding the network through observation. There are three main facets to observing a network. First, we need to visualize it. I myself am quite a visual learner, and we all know a picture speaks a thousand words. So if you can represent your network topology, its attributes and its connection status visually, your battle is half won already.

Next, we want to capture all the events to understand what's happening within the network. And we wanna capture this data frequently and extensively. Finally, after we have all this data, we want to understand it through analysis and dashboards.

So the easiest way to visualize your VPC, which is the fundamental building block of networks in AWS, is to use the Resource Map view in the AWS console. This is available to you right now. From this example, I can quickly see that in my VPC I have an S3 endpoint, which is consumed by four route tables, which in turn are associated with four subnets.

You can also use AWS Network Manager to visualize your network topology. AWS Network Manager has two options to visualize your topology: if you're using Cloud WAN to automatically manage your core network edges, or if you're using Transit Gateway or Transit Gateway peering. Let's dive a little bit deeper: if Alex were using Cloud WAN, how would he go about visualizing this?

So in this example, Alex has set up Cloud WAN which is managing its core network edges in three regions - Oregon, Singapore and Sydney. He also has a VPN coming in from the Auckland office connected to the Sydney core network edge.

Now pay close attention to the screen because there's a very cool animation coming up showing you how this logical representation is represented in Network Manager. So Alex has a holistic view into how his network topology is set up in Network Manager itself.

Now, let's say, for example, he was leveraging Transit Gateway and Transit Gateway peering; it works in a similar fashion. So again, Alex has Transit Gateway set up across three regions. He's got peering, and he's got two VPN connections, in Auckland and in San Francisco. Visualizing this in Network Manager will give him this view; again, Alex has a holistic view into his network.

Quickly note that the orange lines are Transit Gateway connections and the green lines are functioning VPN connections. Say, for example, something happens to a VPN connection: Alex is notified immediately through visual cues that something is wrong with his VPN connection. He can choose to dive a little bit deeper and realize that one of the tunnels in his VPN connection has gone down. Pretty cool!

Now, I've explained to you how the AWS components are shown on the screen, but one thing I haven't quite explained is: how did Alex get his on-premises components to show up on the network topology graphs and maps?

For that, Alex leveraged devices and sites in Network Manager. What Alex does is simply set up a site to represent his on-premises network location. Next, he sets up a device and associates it with the site, finally setting up an association to his customer gateway.

And Alex is quite smart here, right? He can do all of this through the console, but he realizes that he needs some automation to take care of changing needs, so he can do all of this using the AWS CLI as well.
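To make that concrete, here's a rough sketch of what that CLI automation might look like, assuming the global network already exists; the IDs, coordinates and ARN below are hypothetical placeholders, so check them against your own environment and the Network Manager documentation.

    # Create a site representing the on-premises location (values are illustrative)
    aws networkmanager create-site \
        --global-network-id global-network-0123456789abcdef0 \
        --description "Auckland office" \
        --location "Address=Auckland NZ,Latitude=-36.85,Longitude=174.76"

    # Create a device and associate it with that site
    aws networkmanager create-device \
        --global-network-id global-network-0123456789abcdef0 \
        --site-id site-0123456789abcdef0 \
        --description "Auckland edge router"

    # Associate the customer gateway used by the VPN with the device
    aws networkmanager associate-customer-gateway \
        --global-network-id global-network-0123456789abcdef0 \
        --customer-gateway-arn arn:aws:ec2:ap-southeast-2:111122223333:customer-gateway/cgw-0123456789abcdef0 \
        --device-id device-0123456789abcdef0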

So now that Alex and Sarah have visualized the network topology, they move their focus over to visualizing network traffic flow through logs.

So Sarah has deployed a workload in a VPC, and now Alex and Sarah are starting to capture logs. They have enabled VPC Flow Logs, which capture metadata about the traffic flow. And since the middle of last year, they're also capturing flow logs from Transit Gateway. I also recommend capturing logs from your load balancers, from your firewalls and from Route 53 as well, to give you a holistic view of what your network is doing.
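As a minimal sketch of turning that on from the CLI (the VPC ID, log group name and IAM role here are hypothetical placeholders):

    # Publish metadata for all traffic in a VPC to a CloudWatch Logs group
    aws ec2 create-flow-logs \
        --resource-type VPC \
        --resource-ids vpc-0123456789abcdef0 \
        --traffic-type ALL \
        --log-destination-type cloud-watch-logs \
        --log-group-name /vpc/flow-logs \
        --deliver-logs-permission-arn arn:aws:iam::111122223333:role/vpc-flow-logs-role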

Once you collect all these logs, you need a way to store them, and AWS has a few options here as well. You can use CloudWatch Logs to store your logs, query them and plot them on dashboards. If you want to store them long-term at a low cost, use Amazon S3, or if you want to integrate with third-party products, you can leverage Kinesis Data Firehose.

Once you've stored them, you need a way to analyze them. Again, AWS has a few options here. You can use Amazon's open-source-based offerings like Amazon Managed Grafana or Amazon OpenSearch Service. You can also plot the data using Amazon CloudWatch dashboards or Amazon QuickSight, or query the logs themselves using Athena if need be.
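If you do reach for Athena, a top-talkers query might look roughly like this; it assumes you've already created a table over your flow logs, and the table name vpc_flow_logs, the workgroup and the results bucket are all placeholders.

    # Which source/destination pairs move the most bytes? (column names assume the default flow log fields)
    aws athena start-query-execution \
        --work-group primary \
        --result-configuration OutputLocation=s3://my-athena-query-results/ \
        --query-string "SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes FROM vpc_flow_logs GROUP BY srcaddr, dstaddr ORDER BY total_bytes DESC LIMIT 20"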

So, speaking of dashboards, if you haven't already, I would highly recommend you use the CloudWatch automatic dashboards capability. With automatic dashboards, you can set up dashboards for your metrics quickly, with hardly any clicking around.

Over here you can see an example of a dashboard I quickly set up for my Transit Gateway and NAT Gateway. Did you know that CloudWatch also supports cross-region and cross-account dashboards, to visualize your distributed network topology as well?

Another option is to look at how different users and contributors impact your network flow. For example, I want to know how contributors are using a Private Link endpoint. What I can use here is Contributor Insights, and there are three easy steps.

First, I need to identify which logs I need to use; in this case, I'm using VPC Flow Logs, so that's what I specify. Next, I define the fields I'm interested in from the VPC Flow Logs, and I also define two keys which are unique enough to identify a contributor. And finally, any filters I might need. For example, if you deploy your VPC endpoint in three availability zones, you're only interested in those three ENIs, so that's essentially what I've done there.

And finally, any aggregations, either sum or count. This results in a dashboard like this one. Now Alex and Sarah know exactly who is accessing the Private Link endpoint the most. From this graph, you can see that there is an instance ending in .87 which is accessing the Private Link endpoint the most.
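As a rough sketch, a Contributor Insights rule over VPC Flow Logs delivered to CloudWatch Logs might look like this; the log group name, the field positions (which assume the default flow log format) and the ENI IDs are all placeholders.

    # Create a rule that counts flows per source/destination pair hitting the endpoint ENIs
    aws cloudwatch put-insight-rule \
        --rule-name vpce-top-talkers \
        --rule-state ENABLED \
        --rule-definition '{
          "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
          "LogGroupNames": ["/vpc/flow-logs"],
          "LogFormat": "CLF",
          "Fields": {"3": "interfaceId", "4": "srcAddr", "5": "dstAddr"},
          "Contribution": {
            "Keys": ["srcAddr", "dstAddr"],
            "Filters": [{"Match": "interfaceId", "In": ["eni-0aaa1111bbbb2222c", "eni-0ddd3333eee44445f", "eni-0666777788889999a"]}]
          },
          "AggregateOn": "Count"
        }'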

So if Alex had to go and troubleshoot or diagnose any issues with that particular Private Link endpoint, Sarah's workload running on that instance would be impacted.

Next up, this is an example of a visualization of VPC Flow Logs using QuickSight. In this example, I was helping a customer understand and answer a cost-related question they had. The customer was seeing some unusual cost, which was showing up as cross-AZ cost, for a network architecture which was deployed in only one AZ.

After drilling a little bit deeper into it, we realized that the workload was accessing an EC2 resource using a public IPv4 address, which was causing that metering to occur. So as you saw, visualizing your topology and understanding the logs helps you detect problems quicker.

So as I wrap up the observe section, there are three key resources which I would like to share with you. The first one is a GitHub repository which has a CloudFormation template which allows you to go and create Contributor Insights and dashboards.

Next up, there's a workshop which takes you step by step to set up Amazon QuickSight dashboards for your VPC Flow Logs. And finally, we have a whole library of Athena queries to understand traffic patterns using your Cost and Usage Reporting data.

Awesome. So now that Alex and Sarah have a good grasp of the network, let's see how we can help them detect potential issues.

So before Alex and Sarah can start troubleshooting and debugging issues, they need to be alerted to the fact that something is happening. There are three potential ways of Alex and Sarah getting notified, and you might already have experienced some of these: a user coming to you, tapping you on the shoulder and saying, "Hey, my workload or my network's not performing, can you please help me?" Or it could be an alarm, or it could be anomaly detection.

Once the alert happens, you can then use Amazon EventBridge to fan out your notifications across multiple channels. This will prompt Alex and Sarah to investigate the issue. They can go in and have a look at the Personal Health Dashboard, CloudWatch dashboards or QuickSight dashboards. They can even query using Athena, or they can go and have a look at the issues using CloudTrail, for example.

So speaking of alarms, in October, we released CloudWatch Recommended Alarms. So this allows you to create alarms quickly using AWS best practices for AWS-generated metrics. So what in the world does that mean?

So remember the example earlier: Alex wants to be told when one of his VPN connections loses a tunnel. He can use Recommended Alarms to quickly set up an alarm for that, and he can either use the console or use infrastructure as code.

If he uses the console, have a look and notice that some of the fields are already filled in for him, so he doesn't need to work very hard to understand what sort of metrics and thresholds he needs to put in there. And if he wants to use infrastructure as code, again, notice that most of the fields, if not all, are automatically filled in for him, so he can just copy and paste this code and put it into his automation suite.
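For the automation route, a hand-written equivalent from the CLI might look roughly like this; the VPN ID and SNS topic ARN are placeholders, and the threshold reflects the idea that the TunnelState metric drops below 1 when a tunnel goes down.

    # Alarm when the minimum tunnel state over 5 minutes drops below 1 (a tunnel is down)
    aws cloudwatch put-metric-alarm \
        --alarm-name vpn-tunnel-down \
        --namespace AWS/VPN \
        --metric-name TunnelState \
        --dimensions Name=VpnId,Value=vpn-0123456789abcdef0 \
        --statistic Minimum \
        --period 300 \
        --evaluation-periods 1 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --treat-missing-data breaching \
        --alarm-actions arn:aws:sns:ap-southeast-2:111122223333:network-alerts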

Now, Alex and Sarah move their focus to understanding how they can detect issues for users accessing the workload across the internet. For this, we have Amazon CloudWatch Internet Monitor.

What this allows Alex and Sarah to do is understand how internet weather events are impacting their workload and end user experience.

Remember I told you earlier that we have a large network of edge locations and transit centers which allows your traffic to get from the internet to the region in which your workload resides. To ensure we give you the availability, performance and reliability you expect from the AWS network, we monitor these, and this is the data which is leveraged by CloudWatch Internet Monitor.

When you set up an Internet Monitor, we know which portions of the internet communicate with the region in which your workload resides, and we can actively monitor that. We establish performance and availability baselines and then create awareness for you, and health metrics, if users are experiencing particular issues getting to your workload.

So Alex and Sarah can proactively take any remediation steps if needed. For example, they can deploy their entire workload in a new region to bypass a potentially bottlenecked internet service provider.

Diving a little bit deeper. How does the Internet Monitor work? It's made up of four key components.

The first thing you need to do is tell Internet Monitor what part of your network is receiving internet traffic. As of today, we support Amazon VPCs, Network Load Balancers, CloudFront distributions and Amazon WorkSpaces directories.

Next, you need to tell Internet Monitor what percentage of traffic you want to monitor. Then what we do is overlay that with the data we have already collected for the transit centers and the edge locations, ultimately giving you a set of optimizations right in the console itself.
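As a rough sketch, setting one up from the CLI might look like this; the monitor name, the VPC ARN and the sampling percentage are just illustrative placeholders.

    # Monitor internet paths towards a VPC, sampling 50% of its internet-facing traffic
    aws internetmonitor create-monitor \
        --monitor-name my-workload-monitor \
        --resources arn:aws:ec2:ap-southeast-2:111122223333:vpc/vpc-0123456789abcdef0 \
        --traffic-percentage-to-monitor 50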

So what does this look like from a dashboard perspective? Alex and Sarah get this view. From this view, they can quickly see that from 9am to 1pm users were having problems accessing the workload from the internet. Diving a little bit deeper...

They can see how the bytes transferred and the round-trip time were also impacted. And if you have a look at the bottom, we give you metrics at all levels of the spectrum, from the p95s all the way down the percentiles. And finally, Alex and Sarah can drill down even further and see how a particular location was impacted, how much traffic at that particular location was impacted, and how the users were affected.

So now that we've helped Alex and Sarah detect problems across the internet, let's move our focus to helping them detect problems in AWS itself. For that, we have three core services.

VPC Reachability Analyzer allows Alex to verify whether a source can communicate with a destination. We also have Transit Gateway Route Analyzer, which allows you to prove out routing within Transit Gateway route tables, and Network Access Analyzer, which allows Alex to understand and remediate unintended network access across his entire network topology.

Using Reachability Analyzer, Alex can quickly prove that an instance can communicate with another one across a VPC peering connection, for example. You can also prove out that the same is possible using a Transit Gateway attachment. He can also prove out that communication works even if a Network Firewall is in place, and if you have a look on the right-hand side, Reachability Analyzer also tells Alex which rule group in that Network Firewall was exercised when that communication was happening, taking intermediate components into consideration.
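As a minimal sketch of how Alex might drive that from the CLI (the instance IDs and the returned path and analysis IDs are placeholders):

    # Define the path to analyze: from one instance to another over TCP
    aws ec2 create-network-insights-path \
        --source i-0123456789abcdef0 \
        --destination i-0fedcba9876543210 \
        --protocol tcp

    # Run the analysis for that path, then read back the result and findings
    aws ec2 start-network-insights-analysis \
        --network-insights-path-id nip-0123456789abcdef0
    aws ec2 describe-network-insights-analyses \
        --network-insights-analysis-ids nia-0123456789abcdef0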

I realize that networks come in all shapes and sizes, so make sure you consult the documentation to understand whether Reachability Analyzer caters for your flow.

So now we have helped Alex understand how to verify connectivity between two points. How can we help Sarah as well?

With the latest release of Amazon Q, Sarah can now ask simple, human-like questions in natural language to understand and verify how connectivity works. If Sarah asks a question related to network troubleshooting, she's offered a preview experience, which we'll look at in the next slide. Just a note here: this is only available in the North Virginia region, so if you're testing it out, make sure you select your region appropriately.

So now Sarah can just go to Amazon Q network troubleshooting and ask it: can the public access my application running on the load balancer? Amazon Q for network troubleshooting simply interprets this question and plays it back, making sure that it has captured Sarah's intent clearly. Then it actually runs the analysis for Sarah, as you can see on the screen. Remember, all Sarah had to do is type in the question, and at the end it summarizes the output for her as well. It's pretty cool.

So I definitely want you to try this service out and let us know any feedback that you have. We appreciate that.

Finally, with VPC Reachability Analyzer allowing you to verify point-to-point communication, Network Access Analyzer allows Alex to prove out unintended network access. So what do I mean by that?

In this example, Alex wants to verify that there is no egress traffic going out to the internet which is not going through a Network Firewall. All he needs to do is tell Network Access Analyzer what his access requirements are. The first step is to select his resources: in this case, he sets up the source as the VPC and the destination as an internet gateway.

Next, he needs to tell Network Access Analyzer that he doesn't really care about any findings for network access which is intended, which is essentially traffic going via the Network Firewall. So he sets up some exclusions: please don't tell me about any traffic which is going via the Network Firewall. Network Access Analyzer then uses automated reasoning to show any findings to Alex if that rule is ever violated.
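A very rough sketch of that scope from the CLI could look like the following; the IDs are placeholders and the resource-type strings are illustrative, so double-check the exact values against the Network Access Analyzer documentation.

    # Scope: traffic from the VPC to any internet gateway, excluding paths through the firewall
    aws ec2 create-network-insights-access-scope \
        --match-paths '[{"Source": {"ResourceStatement": {"Resources": ["vpc-0123456789abcdef0"]}},
                         "Destination": {"ResourceStatement": {"ResourceTypes": ["AWS::EC2::InternetGateway"]}}}]' \
        --exclude-paths '[{"ThroughResources": [{"ResourceStatement": {"ResourceTypes": ["AWS::NetworkFirewall::FirewallEndpoint"]}}]}]'

    # Analyze the scope; any findings represent unintended internet egress
    aws ec2 start-network-insights-access-scope-analysis \
        --network-insights-access-scope-id nis-0123456789abcdef0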

So in this case, Alex is quite happy: there are no parts of the VPC which have egress traffic that is not going via the Network Firewall.

So that brings us to the end of the detect section, and I want you to leave with a few resources here as well. The first resource is a blog to help you set up the Personal Health Dashboard and set up different channels for alerting. Next, we have a workshop which allows you to do different analyses for different scenarios in your VPC as well. And finally, if you're interested in more on how Internet Monitor works, the documentation has that covered.

So with that, I'm gonna pass it on to Evgeny for the rest of the session. Maybe a round of applause: that's been like 55 slides in 30 minutes.

All right. So now we can visualize and observe our network, and we can detect issues. Let's talk about network debugging.

So in order to debug efficiently and fast, we need to have a mental model. Now, a mental model that I've used for nearly 20 years, and it has served me really well, is to think about our applications and our network in terms of layers.

This is a Jenga game that you probably played with your kids, or when you were a kid yourself. Think about the application being the very top layer of that tower, and every layer below that is our network. If something happens in one of the layers below the application, everything above it is affected.

One such model that we can use is the TCP/IP conceptual model. At the bottom we have the network layer. The network layer is all concerned with sending bits across the wire; this is where terminology like MAC addresses, VLANs and fiber comes into play, and we can think about services like AWS Direct Connect at that layer. Above that we have the internet layer. This is where we are concerned with things like IP addressing and routing. Amazon VPC is our software-defined network that runs on top of the network layer managed by AWS. Direct Connect also has IP addressing, so it kind of spans a couple of different layers.

Now, above that we have the transport layer. The transport layer is concerned with protocols like TCP and UDP, and things like ports being defined. That's where we can start using network services like AWS Network Firewall to filter traffic. And finally, above that is Sarah's domain, the application itself. With the application, that's where we configure things like encryption with TLS certificates loaded into load balancers, listeners defined, and where the actual application is running.

So let me walk you through a troubleshooting example where we're gonna step through the layers and give you some examples of how that plays out.

So let's walk through Direct Connect troubleshooting to start with. We have a problem with Direct Connect: how do we troubleshoot the network layer? Direct Connect troubleshooting is probably Alex's domain; Sarah is unlikely to get involved there. This is where typically we have a fiber going from a router operated by Alex into a router managed by AWS. He might be plugging the fiber in directly if he's in the same data center as our Direct Connect location, or maybe using a partner in the middle.

If he's using a physical connection, at some point the physical fiber needs to go into a port on a switch. This is where things can go wrong. There might be no light on that port and the connection might be down. There may actually be light, but the light is outside of a good range. Or maybe the light is in a good range, but there are no MAC addresses being populated into the ARP table. So many things can go wrong.

I'm not gonna cover every step of troubleshooting here, but you could go and check the light level with a tester, or maybe involve your data center folks if they're managing this infrastructure for you. Sometimes the fiber strands are just dusty and need a bit of cleaning. You could also log into your router, if you are Alex, and check your ARP table to see if you're seeing a MAC address. If you want to see detailed troubleshooting steps for the network layer with Direct Connect, check out the article behind the QR code.

Now, when troubleshooting here, we can leverage dashboards and CloudWatch metrics. The first useful one is the connection state metric; this one quickly allows us to identify if the connection is up or down. Ideally, we would set up alarms beforehand using the Recommended Alarms that Ruskin talked about, but we can also just log in ad hoc and do a bit of debugging here by checking out those metrics.

We can also check things like the light levels. There are some good ranges published here in front of you from our documentation. Hopefully it's within the ranges, but if it's outside of the ranges, that indicates some problem, maybe with dust or maybe with a faulty fiber somewhere along the way.

Now let's move one level up through our Jenga tower to the internet layer. This is actually where I would start. I wouldn't necessarily look at whether my connection is up or down; it's more efficient for me to start in the middle of the tower and determine whether I want to go down or up the layers. This is much faster and much more efficient.

So how do we typically check things here? Well, we can use a tool called ping that most of you are probably well familiar with. In fact, most of you should probably run some monitoring software that sends a periodic ping request using ICMP packets. That could help you to detect downtime really, really fast, as well as things like packet loss or any intermittent events with Direct Connect.

We can simply ping the other side of the connection, the IP address of the peer, and verify whether there is a reply. If there is a reply, we can continue to troubleshoot and check if BGP is up. BGP stands for Border Gateway Protocol; this is something that we use to exchange routing information on IP networks. We use it for Direct Connect, we can also use it for VPN, and this is bread and butter for Alex when it comes to connectivity with external parties.
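To make that concrete, a quick check from a host on Alex's side might look like this; the peer and target addresses are made-up placeholders.

    # Send five ICMP echo requests to the AWS side of the virtual interface
    ping -c 5 169.254.255.1

    # If that replies, check the path to something beyond the peer as well
    traceroute -n 10.0.10.25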

As one example here, Alex can log into a router that he has access to and run a set of debugging commands. Again, if you want further information on how to do that, I suggest you scan the QR code; there's an article there with detailed information.

Now, it's not unusual for our customers to run VPN connectivity on top of Direct Connect as well as on top of the internet. This gives you a tunnel with encryption, and so I see many customers utilize VPNs. Now, VPNs are notoriously hard to debug and troubleshoot, and I've spent my fair share of time troubleshooting them, back in the times I used to carry a pager and be on call.

With VPNs, you can leverage logging to troubleshoot things. Once you establish that there is basic connectivity and you can ping the other side of the tunnel, this is where we can temporarily enable VPN logs on the AWS side of things, and we can also capture logs from the other side. That side is the customer gateway, or CGW for short.

Now, it's often useful to have both sets of logs available side by side and start comparing them, as VPN negotiation goes through phases. By stepping through those logs, we can point out at which point it fails, which gives us a clue if there's a misconfiguration, or maybe something is happening to the packets and they're getting blocked.
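As a rough sketch (the VPN connection ID, tunnel IP and log group ARN are placeholders, and the exact option structure is worth checking against the Site-to-Site VPN documentation), temporarily turning on tunnel logs from the CLI might look like this:

    # Enable tunnel activity logs for one tunnel, delivered to CloudWatch Logs as JSON
    aws ec2 modify-vpn-tunnel-options \
        --vpn-connection-id vpn-0123456789abcdef0 \
        --vpn-tunnel-outside-ip-address 203.0.113.10 \
        --tunnel-options '{"LogOptions": {"CloudWatchLogOptions": {"LogEnabled": true, "LogGroupArn": "arn:aws:logs:ap-southeast-2:111122223333:log-group:/vpn/tunnel-logs", "LogOutputFormat": "json"}}}'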

Now, moving further up, assuming that we've got our basic connectivity established, we can check whether we can get to the application. This is where things like ping are still usable, but I would suggest going one step further and using tools which send what we call a TCP ping.

The way these work is they send a SYN packet and wait for a reply, essentially leveraging the TCP handshake in order to determine if there is connectivity. One such example is hping3; there are a few other tools out there that do exactly the same thing. This checks whether the application is actually responding. Sometimes ICMP is blocked: you're trying to ping, there's no reply, but that doesn't necessarily mean there is no connectivity, right?
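For example, a TCP ping with hping3 against a hypothetical application host might look like this:

    # Send three SYN packets to port 443 and watch for SYN-ACK replies
    sudo hping3 -S -p 443 -c 3 app.example.internal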

So TCP ping is definitely must-have tooling in your toolbox. Another one is nc, or netcat; it can also send a TCP ping equivalent and do a few other things. DNS is another thing that you probably want to check here.

If you've got basic IP connectivity, maybe users just can't resolve a DNS name, and that's where tools like dig and nslookup are useful to verify resolution. But you can also check logs if you've got them enabled, like your Route 53 Resolver query logs. Over the next three or four slides, I'm going to take you through my essential IP network toolbox. These are some of the tools that I've been using over the years, and they still serve me today; no matter how sophisticated networks are, I still utilize those basic tools to troubleshoot things.

Netcat is super useful. Quite often I don't have access to the application, but someone tells me, look, it needs to listen on port 8080 and it's a TCP application. I don't need to talk to Sarah and get access to the app: I can set up a simple listener using netcat myself and then use netcat on the other side as a client to send a packet and verify whether connectivity is successful. This tests all the layers, from network through internet and transport up to the application layer. So super, super useful, a very simple, very lightweight tool; it just needs a t2.micro or t4g EC2 Linux instance, whatever you prefer to set up for network troubleshooting.
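A minimal sketch of that, with a made-up hostname and port:

    # On the test listener instance: listen on TCP 8080 (some netcat builds need 'nc -l -p 8080')
    nc -l 8080

    # From the client side: check the port, then push a test line through the path
    nc -vz app-test.example.internal 8080
    echo "hello from the client" | nc app-test.example.internal 8080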

For DNS, dig is my tool of choice. There's nslookup as well, which runs on Windows and Linux. If you haven't used those tools, I definitely recommend setting aside an hour in the next couple of weeks to learn them. I learned them a long time ago and I've been using them ever since; probably every two weeks I need to look up what a domain name resolves to and then figure out whether it's cached on a resolver and what's going on when someone tells me, hey, this stuff is down. Definitely have this tooling in your toolbox: dig and nslookup can resolve different types of DNS records.

So, for example, you can resolve an A record for IPv4, the hostname-to-IP-address type of record, or a AAAA record for IPv6. And I'm seeing a lot more IPv6 networks out there, so it's definitely good for you to know which of these tools support IPv6 and how they work. For example, if you're using ping, there's a ping6 equivalent; you can't just use ping to ping an IPv6 address, you need to use ping6, as an example. That catches people out, because the first time they see an IPv6 network they wonder: how do I troubleshoot or debug this thing?
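A few examples of what those lookups might look like; the domain and IPv6 address are placeholders, and 169.254.169.253 is the Route 53 Resolver address reachable from inside a VPC.

    # Resolve IPv4 and IPv6 records for a name
    dig app.example.com A +short
    dig app.example.com AAAA +short

    # Ask a specific resolver directly, e.g. the VPC's Route 53 Resolver
    dig @169.254.169.253 app.example.com A

    # IPv6 reachability check with the ping6 equivalent
    ping6 -c 3 2001:db8::1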

Cool. And a couple more tools. iperf: this one is a bit more sophisticated. I'll put a QR code there so you can scan it and check out how to use this tool. We can use it to measure the maximum achievable bandwidth between two endpoints on the network. Quite often people ask me: I've got two EC2 instances, they're supposed to do this much bandwidth, how do I actually test it and prove it, because I can't seemingly get that much with my application? And iperf is super useful for that. You can set it up in a listener, or server, mode on one EC2 instance and as a client on another EC2 instance, or maybe inside your on-prem network, and start sending traffic. It's synthetic traffic; you can define the number of flows that you want to send and whether it's TCP or UDP.

And then you send it across the network. To give you an example, I was testing 400-gigabit instances when they first came out; they came out with four network interface cards, and I wanted to prove that, yep, you can easily do 400 gigabits from one instance to another. For that, I put them into an EC2 placement group, I bound ENIs to all four network interface cards to make sure that all four of them were being utilized, and I had to use at least 10 network flows as part of my iperf command.

Why is that? Within a placement group, a single TCP flow used to be limited to 10 gigabits, and outside of a placement group to five. Now with ENA Express, a single flow can go faster, if the instance supports ENA Express. But yeah, we utilized the placement group to bump up to 10 gigabits for a single flow, and I was easily able to prove that I can get 400 gigabits of throughput between two instances.
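As a rough sketch with placeholder addresses, that test boils down to something like this:

    # On the receiving instance: run iperf3 in server (listener) mode
    iperf3 -s

    # On the sending instance: 10 parallel TCP flows for 30 seconds towards the server
    iperf3 -c 10.0.1.50 -P 10 -t 30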

And the final tool I want to talk about is tcpdump. When all else fails, you go: yep, I've got to do a packet capture. You typically do a packet capture on the source and destination simultaneously. You can then use a tool called Wireshark and import your packet capture into that; it's a GUI tool, and it helps you do things like follow a TCP stream, which you can utilize to troubleshoot further.

It also has filters and a bunch of other useful stuff. You can also look at the output right on the command line. But it's useful for you to just see the basics: whether the packet is being sent, received on the other side, and the reply received back on the sender. Yeah, so that's my starter toolbox.
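A couple of illustrative captures (the interface name and IP address are placeholders):

    # Capture traffic to/from one host on port 443 into a file you can open in Wireshark
    sudo tcpdump -i eth0 -nn -w capture.pcap host 10.0.1.50 and tcp port 443

    # Or watch the TCP handshake live on the command line
    sudo tcpdump -i eth0 -nn 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-ack) != 0)'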

So how does it all stack up together? We typically start at the internet layer. We do some basic troubleshooting with things like Reachability Analyzer, we use Amazon Route 53, we check out dashboards, metrics, things like that. If things are not working at that point, we might have to go a layer down and start troubleshooting the network. If things are working there, we can go a layer up and start looking at whether we can connect to the application. Maybe security groups are misconfigured or a network firewall is getting in the way; this is where we can use Network Access Analyzer, set up traffic mirroring, or do a tcpdump ourselves.

And finally, within the application layer, there are quite a few tools. In fact, it would probably take at least an hour just to talk about the things we can do inside the application layer to troubleshoot and debug. If you want to learn about debugging applications, especially if you're from a networking background, there's a QR code for a workshop you can do at your own pace that teaches you about observability techniques like tracing within your applications on AWS. So check it out.

So it doesn't matter how many tools, dashboards and alarms you've got, sometimes you will feel like you've run out of options, and if you get to that place, never ever hesitate to contact support. There are very experienced engineers there who are familiar with both AWS networks and the applications running on top, and who can help you. In fact, your organization is probably already paying for a tier of AWS support, so leverage the folks there not just to help you do a one-off debug, but also ask them to give you some suggestions on how to improve your network.

And since we've started talking about improvements, let me move on to the final part of this presentation. Earlier, Ruskin mentioned CloudWatch Internet Monitor, which allows you to monitor your internet connectivity to applications. Inside Internet Monitor, we've got something called traffic optimization. What traffic optimization does is show you what your traffic profile looks like in terms of the amount of traffic sent from different geographic areas around the world to your region.

It also gives you suggestions: if you were to stand up your application in multiple regions, what would the time to first byte, or TTFB, look like? And you can see highlighted here that if we were to set up our application not just in US East 1, the North Virginia region, but also on the US West coast in US West 1, we would see nearly half the latency. And that's a massive improvement in terms of that first-byte latency.

Now, if we are running a web application, we could simply click Show CloudFront improvements, and that would tell us what our first-byte latency would look like from different locations around the world if we were to enable CloudFront. As you can see here, across the globe we would now see a lot lower latency, down to 25-30 milliseconds from 112 in this specific example. So a big improvement. You could leverage this service to convince folks on your team to maybe turn on CloudFront.

Another thing you can leverage to improve your network is a structured mechanism to look at failures and work out action items on how to implement changes and improvements around your network. At Amazon, we do a lot of this. We have this mechanism, called Correction of Errors or COE, and we follow it religiously.

A COE is a structured analysis of events that impact customers. I say structured: the structure starts with a summary and impact, where we ask what happened, when did it happen and why did it happen. We have an exact timeline: when did the event start, when did it finish, and all the events that happened in the middle, and we supply metrics and graphs. In fact, if metrics and graphs are not there and cannot be added to the COE, that's a massive alarm bell in itself, so the first action item would be to actually create alarms and graphs which would allow us in the future to see this kind of impact.

We also leverage Toyota's Five Whys, something that Toyota introduced and leveraged to improve manufacturing quality, which is also applicable in the IT industry. We use it to ask ourselves questions until such time that we can no longer ask the why question. This allows us to get to the ultimate root cause of the problem and fix it.

And finally, we have a list of action items. As we draft the COE, the action items are created, assigned owners and executed in order to improve our network. If you'd like to learn more, I've linked in this QR code a talk by one of our Principal Engineers where she explains in detail how we leverage this process to improve our networks.

Another way to improve your network is to just have a look at what else is available in AWS. At AWS, we listen to our customers and we create new services and features which make it simpler for you to deploy and operate your applications.

One such service is Amazon VPC Lattice, which we introduced at last re:Invent. It gives you secure and simplified cross-VPC connectivity without having to create VPC peering or leverage things like Transit Gateway. You can utilize VPC Lattice to connect your components using identity and access management policies that allow you to do authentication and authorization of traffic between the application components.

You can do traffic management at scale, where things like scaling of your compute and network are built in, and HTTP and gRPC protocols are supported. And finally, it allows you to connect services like containers, EC2 instances and Lambdas all together into a single service network seamlessly, without having to connect VPCs together.

It has built-in measures for IP overlap, so it allows you to connect overlapping networks, and even networks that could be in IPv4 space to IPv6. Everything is built into VPC Lattice; it solves this seamlessly for you. So this is something that would definitely help Sarah run her network very easily without having to engage Alex as often.
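As a rough sketch of what getting started could look like from the CLI (the names and IDs are placeholders, so check the VPC Lattice documentation for the full workflow):

    # Create a service network and associate a VPC with it
    aws vpc-lattice create-service-network --name sarah-service-network
    aws vpc-lattice create-service-network-vpc-association \
        --service-network-identifier sn-0123456789abcdef0 \
        --vpc-identifier vpc-0123456789abcdef0

    # Register a service so workloads in associated VPCs can reach it
    aws vpc-lattice create-service --name orders-api
    aws vpc-lattice create-service-network-service-association \
        --service-network-identifier sn-0123456789abcdef0 \
        --service-identifier svc-0123456789abcdef0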

If you'd like to learn more about VPC Lattice, definitely scan the QR code and there will be extra information for you. Now, we talked about a lot of different services today; there's no way that we could go deep into any one of them while trying to cover the full picture.

Earlier this year, we introduced a Networking Core course, which is an online course that you can do at your own pace, and it's completely free. At the end of it, you'll be able to take a test and receive a digital badge that confirms your core knowledge of AWS networking. So if you'd like to know about the services we talked about today, and more, definitely check this out.

So let me recap the key takeaways. First, we talked about understanding our network: how we observe and visualize it using existing tooling, why logging is important, and configuring alarms and dashboards to easily observe and detect problems on our network. We talked about different tools that both Sarah and Alex can use in order to debug and determine whether a problem occurs on our network, efficiently and effectively, using the TCP/IP conceptual model.

We also talked about how the network is not a black box: there are many tools that give you visibility into it and you can leverage them, but you can also talk to support if you feel like you're getting stuck. And finally, we talked about using failures on our network to drive continuous improvement and innovation, improving the topology of the network. This leads to better performance, more users and ultimately more traffic, and more success for the applications hosted on our network.
