Discover seamless observability with eBPF

Hey, everyone. To those who are here with a beer and everyone who wants to join us, welcome. I'm Shahara Ella, and I'll tell you a bit about how to discover seamless observability with eBPF, a technology that, if you haven't heard of it before, you're going to hear a bit about now. It's a game changer in this domain.

So what is Ground Cover? Ground Cover is built to reinvent observability. You see this ambitious slide; what we're saying is that we're trying to reinvent observability for cloud native technologies and challenge the legacy approaches most of you are probably using, like DataDog or Dynatrace (all due respect to these guys). We're here to make a change, and to do that we're harnessing a number of different technologies that we'll cover in a second, one of them being eBPF, which is basically changing the way we collect the data.

So why even do that? These technologies have been working for 15 years. A lot of you have been using logs, metrics, traces, and application performance monitoring for a while, and there are plenty of ways to do it. But things are slowly changing in the market, and I think a lot of you would agree that observability is starting to become a burden in everything that you do.

We already know that over 70% of organizations have five-plus tools. You put your logs with some vendor, use Prometheus for metrics, work with another vendor to try to get tracing, and explore OpenTelemetry or other ways to get more, and more actionable, data around your observability stack. It's a never-ending journey: always more tools, more methods, more hard work.

We also know that these solutions aren't cheap. They can take up to 10% of your IT budget: if you're paying something to your cloud provider, you usually pay about 10% of that, sometimes even more, to your observability vendor. As you grow and scale, it starts to not make sense; it becomes painful and a burden over time.

But beyond the high cost of observability, it also comes with a lot of effort. I don't know who here has ever instrumented an application with an SDK like DataDog or OpenTelemetry, or who has had to decide on the metric cardinality to create and collect over time, or pick a sampling rate for a tracing application.

That's a lot of hard work. It takes deep expertise, changing your code, moving from application to application, and making sure every new service adopts the observability framework you wanted. And it doesn't come without a price, because hard work means partial coverage.

The average company has 150 different microservices, and we've democratized the way people choose their development stack: you have the data team writing in Scala and the backend team writing in Java. How are you going to instrument all your applications and keep up with that over time? What happens is you're partially covered. And when you combine that with cost, we see that almost 70% of organizations don't really have an application performance monitoring tier. You work with logs, you work with metrics, usually infrastructure metrics, and you get by. But in most cases you don't have application-level metrics like your application's response times and error rates, at the granularity of an API you expose to your customers.

And there are no traces to troubleshoot with and understand exactly what flow led to a problem with your customer-facing API. So enter Ground Cover. Part of what we're doing reflects what's also shifting in the market, so let's talk for a second about how this problem can be solved. Again, this is not just Ground Cover; this is a shift in technology and in the way people approach observability, and you're probably going to see more of it at this conference. If you take anything away from this talk, it's learning about this technology and how it can change the way you address your observability stack.

That technology is eBPF. eBPF is a weird acronym, but in essence what it does is let you collect very intimate data about your application: what it's doing, what it's sending in and out, all without changing your application code. Before, to figure out which DB query you were making to an RDS and how much time it took, you would have to instrument that piece of code. You would have to integrate OpenTelemetry or New Relic into your code and do the hard work to get that trace in place. With eBPF, you can do it outside the application, out of band: a different pod running in your Kubernetes cluster can collect all of that for you.
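To give a feel for how out-of-band collection works, here is a minimal, generic sketch using the bcc Python toolkit. It is not Ground Cover's sensor; it just attaches an eBPF kprobe to the kernel's tcp_sendmsg function and observes every TCP send made by any process on the node, with no change to the applications themselves.

```python
# Minimal, generic illustration of out-of-band eBPF observability using bcc
# (not Ground Cover's actual sensor). Requires root and kernel headers.
from bcc import BPF

bpf_program = r"""
#include <uapi/linux/ptrace.h>

// Runs inside the kernel every time any process on the node calls tcp_sendmsg,
// so we observe the traffic without touching the applications' code.
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("tcp_sendmsg from pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")
print("Tracing TCP sends from every process on this node... Ctrl-C to stop")
b.trace_print()
```

A real sensor goes much further, parsing protocols and correlating with container metadata, but the principle is the same: the data is collected in the kernel, outside the application process.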

How does eBPF do that? From inside the Linux kernel. It's a technology already used in high-throughput networking; AWS uses it as part of its infrastructure. It's used in security to reinvent how security events are collected from Linux machines, and now it's being introduced into observability. So imagine you have a Kubernetes cluster running 4,000 pods at the same time, each running different languages, doing different things, at different scales. The alternative would be to instrument each and every one of those applications to understand the dependency map, who's communicating with whom, and what metrics are being produced there.

With eBPF, deploying it on each of the nodes in your Kubernetes cluster lets you suddenly see all those containers at once, immediately. Part of what this means is that you can onboard into a deep application performance monitoring solution in a minute, without changing code, changing configuration, labeling your services, or doing anything else that requires you to change the runtime.

And that's dramatic when you're talking about the highest-scale Kubernetes environments running a lot of different services at the same time. What it also promises is 100% visibility. Before, it was only the services you chose to instrument: you integrated the SDK into the services you decided were the most important, and you always left something out. There was always the legacy service no one wanted to touch, whose code was already fixed and written in stone. There was always that control plane, that Istio or service mesh or NGINX, whatever you run in your cluster that you can't instrument.

With eBPF, it's all equal; it's all just pieces of code running in your production. You suddenly get to see all of their application metrics and all of their application traces, even if it's not your code. And then there's the footprint. Before, integrating an SDK meant I might be affecting my response time. With New Relic or DataDog or OpenTelemetry, maybe I'm adding some percentage of delay to my responses; maybe my users are experiencing more round-trip time than before. It's hard to tell, right?

If you're an e-commerce company, a gaming company, a fintech company, or anyone dealing with high-throughput transactions, that can cost you money and eventually create a disturbance in your service. eBPF works from the Linux kernel, using kernel resources to do all of that. It's super efficient, with a very minimal footprint on CPU and memory, and that's also part of the premise behind this technology.

The other thing Ground Cover does differently is the way we address data. Assuming we've established that eBPF is a really interesting way to collect data, suddenly you get tons of data you couldn't get before. That's a great start. But then what? You all know that you pay per volume: you push logs somewhere, they charge you by volume, you can't predict how much to commit to, and you pay extra for surges. Same for metrics, same for traces.

So if a new technology now introduces more data, who's going to pay for that? Where are you going to store it all? To address that, we've also reinvented how we think data processing and storage should look in a modern cloud environment. What Ground Cover does is separate the control plane from the data plane and store all the data that comes from our eBPF sensor, the logs, metrics, and traces, in a managed environment inside your own cloud environment, after it has been processed in a distributed way inside the eBPF sensor.

All these traces flow through our eBPF sensors distributed across all your nodes; they're processed, smartly captured, reduced in volume, and then stored in your cloud environment. So you don't pay for shipping them out to some other vendor, and you also don't pay the cost of letting someone else manage your data. What you gain is data completely in your control, plus a significant cost reduction compared to what you would have paid before.

One of the things Ground Cover does differently, and I think it's the only solution in the market that does it today, is that we don't charge anything by volume. We collect tons of data at high granularity using eBPF but ultimately only charge you by the size of your environment. So push as many logs as you want, push as many traces as you want, use whatever metric cardinality you think makes sense; we're going to make it work and scale it for you. You don't have to pay for that volume, and that's a game changer in the way you handle this trade-off. I'm sure everyone who has deployed observability before has faced it: let's reduce some of this metric cardinality to save costs, let's not report logs from dev to save costs, and so on. We're trying to eliminate that equation.

It also provides full data privacy. Organizations are more and more aware of where their data is being stored; keeping your logs and traces with a third-party SaaS provider can be painful when you're dealing with PII and a lot of customer information flowing through your APIs that you might not want exposed. And it fits a cloud native backend. We're using technologies that were born in the cloud native domain, like ClickHouse and VictoriaMetrics. They're cost effective, easily scalable, easily backed up, and well suited to a cloud native environment like Kubernetes.
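As a rough illustration of what "your traces live in a ClickHouse instance inside your own account" can look like, here is a small sketch using the clickhouse-driver Python client. The host, table name, and columns are hypothetical, not Ground Cover's actual schema.

```python
# Hypothetical sketch: trace data kept in a ClickHouse instance running in your
# own environment. Table name, columns, and host are illustrative only.
from datetime import datetime
from clickhouse_driver import Client

# Assumed in-cluster service address for your own ClickHouse deployment.
client = Client(host="clickhouse.observability.svc.cluster.local")

client.execute("""
    CREATE TABLE IF NOT EXISTS spans (
        ts          DateTime,
        service     String,
        endpoint    String,
        status      UInt16,
        duration_ms Float64
    ) ENGINE = MergeTree() ORDER BY (service, ts)
""")

# A processed span, as a sensor might emit it after capture and reduction.
client.execute(
    "INSERT INTO spans (ts, service, endpoint, status, duration_ms) VALUES",
    [(datetime.utcnow(), "checkout", "/product/*", 500, 123.4)],
)

# The kind of question you can then answer locally: which endpoints are slowest?
rows = client.execute(
    "SELECT endpoint, avg(duration_ms) AS avg_ms FROM spans "
    "GROUP BY endpoint ORDER BY avg_ms DESC LIMIT 10"
)
print(rows)
```

The point is that the query path and the storage bill both stay inside your own cloud account; only the control plane lives with the vendor.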

Part of what we bring to the table, and we'll walk through it in a second, is that installation now takes 60 seconds. You can onboard to a full APM, from logs and metrics to traces, in 60 seconds by just running a DaemonSet that deploys the eBPF program on all of your nodes. It also changes the dynamics inside the organization. Before, onboarding to New Relic or DataDog or any other solution required an integrated effort from the R&D team: everyone from every team had to integrate the solution, deploy it into production, and only then see the value.

Now, one champion in a DevOps organization can install it on the infrastructure and come back to the R&D team with full coverage of their observability stack, which is dramatic.
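For a concrete sense of what that champion installs: mechanically, "just run a DaemonSet" means one workload object that Kubernetes schedules onto every node. A hypothetical sketch with the official kubernetes Python client follows; the image name and namespace are placeholders, and a real install would typically go through the vendor's CLI or Helm chart instead.

```python
# Hypothetical sketch of a per-node eBPF sensor rollout as a DaemonSet, using
# the official kubernetes Python client. Image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

sensor = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="ebpf-sensor", namespace="observability"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "ebpf-sensor"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "ebpf-sensor"}),
            spec=client.V1PodSpec(
                host_pid=True,  # node-level visibility for the sensor
                containers=[
                    client.V1Container(
                        name="sensor",
                        image="example.registry/ebpf-sensor:latest",  # placeholder
                        security_context=client.V1SecurityContext(privileged=True),
                    )
                ],
            ),
        ),
    ),
)

# One object, scheduled onto every node: that's the whole rollout.
client.AppsV1Api().create_namespaced_daemon_set(namespace="observability", body=sensor)
```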

What we provide is full-stack observability in one place, from log management to infrastructure monitoring all the way to a full APM. To summarize before showing you a quick demo: it's frictionless onboarding and 100% coverage all the time. Introduce a new service, it's covered instantly. Change your code, it's covered. Introduce an API, you're covered with full traces without doing anything.

You get up to 90% cost reduction, and you also control your data and get guaranteed privacy. So let's move into a quick demo so I can show you that what I'm talking about is real.

So this is Ground Cover. As we mentioned before, onboarding looks something like this: you install a CLI that wraps the Helm chart. You can also install it through an EKS add-on; we're supported in the EKS add-on ecosystem. You first get full Kubernetes-native infrastructure monitoring, so you'll see all your deployments in Kubernetes, and each of them contains two verticals that would otherwise be very hard to get. One is requests per second, error rate, and latency, basically all the RED signals you would normally work hard to instrument and collect. The other is infrastructure monitoring like CPU and memory at the workload level, the pod level, the container level, wherever you want to start drilling down. You also see logs from the containers, and you can go through the deployment YAML and figure out what is actually running in production.

If you're a developer interested in a specific workload, you can drill down into that service and see what it's been up to, who it communicates with, and what APIs it exposes. Again, all of this is out of the box, without you doing anything or instrumenting any code. We create a full dependency map, and you can filter by namespace, by workload, or anything else you'd like.

On top of that, we overlay all the metrics we just created from the traces captured with eBPF. You can see all these edges no matter the protocol; we support HTTP, gRPC, MySQL, Kafka, Redis, even DNS, whatever is flowing through your cluster. You get an overlay of all these metrics, and you can drill down into a specific edge, start pulling up its traces, and figure out what's actually flowing on that live edge you just explored.
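To make the "any protocol on the wire" idea concrete, here is a toy classifier that guesses a protocol from the first bytes of a captured payload. It's a deliberately simplified, hypothetical sketch; the real parsers behind a product like this are far more thorough, and protocols such as MySQL, Kafka, and DNS need deeper inspection than a prefix check.

```python
# Toy, hypothetical protocol sniffing from the first bytes of a captured payload.
HTTP_METHODS = (b"GET ", b"POST", b"PUT ", b"DELE", b"HEAD", b"PATC", b"OPTI")
HTTP2_PREFACE = b"PRI * HTTP/2.0"  # gRPC rides on HTTP/2

def classify(payload: bytes) -> str:
    if payload.startswith(HTTP2_PREFACE):
        return "grpc/http2"
    if payload[:4] in HTTP_METHODS or payload.startswith(b"HTTP/1."):
        return "http"
    if payload[:1] in (b"*", b"$", b"+", b"-", b":"):
        return "redis"  # RESP type markers
    return "unknown"    # MySQL, Kafka, DNS, etc. need deeper parsing

print(classify(b"GET /product/42 HTTP/1.1\r\n"))      # -> http
print(classify(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"))  # -> redis
```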

Another interesting thing we do is create an API catalog, which basically detects all the endpoints running through the cluster. Imagine you're using an RDS; a lot of you are probably using managed AWS RDS, and your clients inside Kubernetes are talking PostgreSQL to that RDS. What you get from Ground Cover is all these endpoints, all these SQL queries, aggregated for you with metrics like requests per second, error rate, and so on. So you can do things like sort by latency and find the slowest SQL query in the cluster. If you dive into it, you get real spans and real traces that let you troubleshoot by drilling down into the actual trace, seeing the query that ran, the parameters around it, and what actually went through that query.

Again, all of this is picked up with eBPF; you didn't do anything to get that information, and we correlate it with the container information. What was the container doing at the time? What node was it running on? What are the logs around it? Are there any interesting logs you can double down on when you're looking at this specific trace? What was happening in that container at the same time? You can investigate that log and even the context of the metrics around it: was the container close to its limit, almost throttled or out of memory, or was the node busy, so you should worry about a noisy-neighbor situation with other workloads disturbing your performance?
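For intuition about how an API catalog like the one described above can be built from captured traffic, here is a hypothetical sketch (not Ground Cover's implementation) that rolls captured spans up into per-endpoint rows with request rate, error rate, and latency, then sorts by latency to surface the slowest endpoint or query.

```python
# Hypothetical sketch: rolling captured spans into per-endpoint catalog rows.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    endpoint: str        # e.g. a normalized SQL statement or "/product/*"
    duration_ms: float
    is_error: bool

def build_catalog(spans: list[Span], window_seconds: float) -> list[dict]:
    buckets: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        buckets[span.endpoint].append(span)

    catalog = []
    for endpoint, items in buckets.items():
        latencies = sorted(s.duration_ms for s in items)
        catalog.append({
            "endpoint": endpoint,
            "rps": len(items) / window_seconds,
            "error_rate": sum(s.is_error for s in items) / len(items),
            "p99_ms": latencies[int(0.99 * (len(latencies) - 1))],  # rough p99
        })
    # Sort by latency to surface the slowest endpoint/query first.
    return sorted(catalog, key=lambda row: row["p99_ms"], reverse=True)
```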

So that's one thing we do. Another interesting thing is that we aggregate all the issues in the cluster. You get things like: there's a 500 on /product/*, that's the API we detected, and this is the client creating the problem. If you dive deep into that, you get examples of the actual request and response payloads, so you can troubleshoot with a real example picked up by eBPF, with all the context around it, and easily figure out what's going on.

This is also a full log management solution. You can drill down into specific logs emitted from your containers, figure out the related logs around them, and even see the traces around those logs. This is all correlated inside our sensor, which collects the logs, the metrics, and the traces at the same time, plus the infrastructure metrics to correlate with them. You can also search through any trace flowing through the cluster. So if you're interested in finding something very specific, like a particular user agent flowing through the cluster, you can find those traces, figure out what's going on, and even get the specific request that triggered the problem.

You can build dashboards, set alerts, anything you would expect from a full APM, and you get all that by just installing a DaemonSet on your cluster. What we try to do is provide frictionless onboarding. You can install it on any cluster and get a multi-cluster experience. You can install it using infrastructure as code and launch a new cluster with Ground Cover already in place, with ready-made dashboards and alerts, so you can get going in seconds. And again, part of the difference is that none of the data you see here is priced by volume.

The only pricing KPI we use is the number of nodes in your cluster covered by an eBPF sensor. You have more logs? That's fine; we'll help you get the right retention and scale to support it. You need more custom metrics? That's not something we'll charge you for. And again, it's all in the same experience.

So that's it for the demo; if we could just go back to the presentation for a second. To sum up: we talked about how eBPF can give us Kubernetes infrastructure monitoring that's native to our environment. We talked about a live network dependency map: who's talking to whom, what's going on in the cluster, and what you can do with that. We saw how to troubleshoot with live traces without instrumenting code, and we saw examples of how it can help.

We talked about full log management that correlates with your metrics and traces and gives you a Kubernetes-native experience even in your log management environment. And we also talked about creating custom dashboards, custom alerts, and whatever else you want to set up to start getting actionable data from your environment.

You can visit us at booth 563 and see a live demo there. We would be happy to tell you more about what we do, how we treat data, and why eBPF is so interesting. Thank you, and if you have any questions, I'll be glad to answer them.
