Analyzing streaming data with Apache Druid

So I'm going to give a short talk about how you can analyze streaming data, and really almost any kind of data, with Apache Druid.

Why do you care about streams? Well, you're here at re:Invent, so you probably know what a stream is and why you would care. But there are a lot of things we can do with this:

  • Real-time operations: understanding what's happening in the moment. I've got stream data coming across, so I'm able to see and analyze what's happening within seconds with Druid, as we'll talk about in a bit.

  • True stream ingestion, which means every event goes into the database and can be queried as soon as it enters the stream, whether that's Kinesis or Kafka or whatever stream you're using. In real life, that's usually about 10 to 20 milliseconds between the event happening and it being ready for query. A sketch of what that setup looks like follows below.
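
To make that concrete, here is a minimal sketch of setting up true stream ingestion: posting a Kinesis supervisor spec to Druid's Overlord API, after which each arriving event becomes queryable on arrival. The endpoint and overall spec shape follow the Druid documentation; the hostnames, stream name, datasource, and columns are hypothetical.

```python
# A minimal sketch, not a production config: submit a Kinesis ingestion
# supervisor to Druid's Overlord. Hostnames, the stream name, the
# datasource name, and the columns below are all hypothetical.
import json
import requests

supervisor_spec = {
    "type": "kinesis",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",  # hypothetical datasource name
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "stream": "clickstream-events",  # hypothetical Kinesis stream
            "endpoint": "kinesis.us-east-1.amazonaws.com",
            "inputFormat": {"type": "json"},
            "useEarliestSequenceNumber": True,
        },
        "tuningConfig": {"type": "kinesis"},
    },
}

# POST the spec to the Overlord's supervisor API; Druid then runs and
# manages the ingestion tasks continuously.
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/supervisor",  # hypothetical host
    headers={"Content-Type": "application/json"},
    data=json.dumps(supervisor_spec),
)
resp.raise_for_status()
print(resp.json())  # {"id": "clickstream"} on success
```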

That leads us into context-aware decisions, and that's where we get the difference between a real-time analytics database and a stream processor. There are a lot of great stream processors that let you see what's happening in a stream - Kinesis Data Analytics, for example, or Spark. But usually that's only useful if you're putting it in context: not only what's happening right now, but how that compares to an hour ago, a day ago, last year's re:Invent. That's what lets us make context-aware decisions. A sketch of that kind of query follows below.
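
As an illustration: a single Druid SQL query can span events that arrived seconds ago and events from a day earlier, so you can compare the current hour against yesterday's baseline. This is a hedged sketch against Druid's standard SQL API; the broker host and the table and column names are hypothetical.

```python
# A sketch of a context-aware query: compare per-minute event volume for
# the last hour against the same hour yesterday, in one query over the
# same datasource. The broker host and "clickstream" table are hypothetical.
import requests

sql = """
SELECT
  TIME_FLOOR(__time, 'PT1M') AS minute_bucket,
  COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
   OR (__time >= CURRENT_TIMESTAMP - INTERVAL '25' HOUR
       AND __time <  CURRENT_TIMESTAMP - INTERVAL '24' HOUR)
GROUP BY 1
ORDER BY 1
"""

# Druid's SQL endpoint lives on the Broker (or Router).
resp = requests.post("http://broker:8082/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row["minute_bucket"], row["events"])
```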

We also use this a lot for interactive conversations with data. When you have large data sets, which you're going to get from all that stream data, can I go through them and explore them quickly? To do that effectively, almost every query has to be sub-second, because if I'm exploring a large data set and all I get is a spinning wheel of death, it's not very useful.

So, a few examples of who's doing this in the real world, and how you may actually be using it indirectly right now:

  • One example is ThousandEyes - a great example of an observability company. They put out lots of sensors, collect OpenTelemetry data, pull it all back, and let you see on demand what's happening in your systems, whether that's on the cloud, off the cloud, multi-cloud, whatever. They're using Druid to present these dashboards, because one of the other things about Druid is that it's very high concurrency - I can run thousands, even tens of thousands, of concurrent queries on a Druid database. That's good, because if you're ThousandEyes, you don't know how many customers are going to run a query at any given time. It doesn't matter with Druid; it can handle as many as needed.

  • Atlassian is another fun one. Anybody here use Atlassian Jira or Confluence? Several of you. Whenever you open Confluence, you notice it says here's what you've been doing and here's what's trending. That's actually a Druid query Confluence runs on the back end. If you're a Jira or Confluence admin, you can hit the analytics button and get all these real-time dashboards to see what people are doing. For Atlassian, that's about 3 billion events a day coming in on streams from all the Atlassian users, which they analyze and make available to you in real time.

  • And if you're going to talk about streams, you've probably heard of Confluent, which was founded by the people who created Kafka. They're using this for an app called Health+. If you're a Confluent customer, they're using real-time analytics to let you see what's happening in your environment. A lot of their customers have 10,000, even 70,000 topics, and this way you can quickly see how they're operating, what's working, how much data is flowing - all of that is available because of real-time analytics.

So why use this? These are the four really big use cases where we see people doing it:

  1. Operational visibility at scale - one example is the New York Stock Exchange. As you might imagine, they get a lot of network attacks, plus roughly $3 trillion a day in equity trades. They needed a way to keep up with that flow and to see and mitigate network and other attacks; that's what they're using the database for. Why don't they use something like Datadog or Splunk? It's just too big for those out-of-the-box products.

  2. External-facing analytics, like we were just talking about - if you want to present something to your customers, your advertisers, your partners, you don't know the concurrency. Citrix does that, and we mentioned a couple of others. One of my favorites is Nielsen Research. Same thing - they don't know how many people are going to run a query, but with Druid they can handle as many as needed.

  3. Salesforce uses this purely internally. But since they have 80,000 employees, that's a lot of people who can quickly look at all the information about operational effects: who's using which service, what kind of performance they're getting, and whether it's what they expect to get.

  4. And then real-time decisioning - Druid actually originally came out of the ad tech world, so we still have a lot of people using it for advertising decisions. In the case of Reddit, this goes to the advertisers. If you're advertising on Reddit, you can see how many people are in which subreddit, or ask who's relevant to you. For example, if I'm interested in reaching Java developers in Los Angeles, we can run real-time queries on how many people on r/java are in Southern California, and also on how many people on, say, r/LAClippers are also Java developers. We can pull those together and let people target effectively, because with advertising, most ads annoy you; a few ads are what you actually want. How do I get you the right ad, the one you want? That's what they use it for.

We know that streaming is everywhere, whether it's Kafka or Kinesis or things based on Kafka like Amazon MSK, Redpanda, and others. It's interesting - some numbers came back showing that if I look at high-value data - not medical scans or TikTok videos, but the subset that's actually business data: structured data, semi-structured data, even things like emails and documents - about 20% of that today is actually being streamed. At the current rate, in three years it will be more than 50%, and of a much bigger number, which means the amount of data being streamed is growing by about 80 to 100% a year.

And you know, if the sink is a batch, it's not real time anymore. Every database today can ingest streams, but most of them do it with micro-batching: the stream comes in, it goes into a batch, then the batch goes into the database. But what did you just do there? You added latency. Now I'm not looking at the data in real time anymore; I'm looking at what happened a minute ago, five minutes ago. That's why Druid is designed explicitly for streams.

So: every event is immediately available for query on arrival, with guaranteed exactly-once delivery - no duplicates, no missing messages - and when an event shows up late, it gets put in the right place, so we can see it.

It's also very highly scalable, and part of the design is that as data comes in, it's written to S3, so it's literally continuously backing itself up. When I have an outage, my RPO is zero, and for that matter, there's zero planned downtime in this design. You'll only have an outage if everything fails, and even then you won't lose data. It's a highly reliable environment, because when you're dealing with streams, you're usually running 24/7, right? It's not a world where you have a downtime window.
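
Part of how that continuous backup works is Druid's deep storage setting: segments are persisted to S3 as they're published, which is what makes the zero-RPO claim possible. A hedged sketch of the relevant lines in common.runtime.properties, with a hypothetical bucket name, looks like this:

```properties
# Deep storage on S3: every segment Druid publishes is written to the
# bucket, so the cluster is continuously backed up as it ingests.
# Bucket name and prefix are hypothetical; S3 credentials can come from
# the usual AWS credential chain (e.g. an instance role).
druid.storage.type=s3
druid.storage.bucket=example-druid-deep-storage
druid.storage.baseKey=druid/segments
```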

So what makes this real-time analytics? Pretty simple. First, I want to be able to do sub-second queries at scale - at any scale: could be gigabytes, could be terabytes, could be petabytes. The smallest cluster is running on a laptop. The biggest cluster I know of in production is 15,000 nodes with 400 petabytes of data. Can it scale bigger than that? I don't know - my boss won't let me set up a 500-petabyte cluster. There's a budget issue or something. But there's no reason it shouldn't.

Second is high concurrency. Now, technically, any database can be high-concurrency if you throw enough resources at it, right? You can always shard it enough or add extra copies. The point here is high concurrency without it being very expensive. I'm able to get 100-plus concurrent queries on a single server, and I'm not talking about a huge server - in Amazon terms, something like an r6g.xlarge will do 100-plus concurrent queries, and a medium-sized cluster will do thousands. And at the same performance as a single query.

Third, as I mentioned earlier, is both real-time and historical data. I can load data from streams, but I can also load data in batches. You usually do that for older data, or for data that naturally arrives batch-wise - sometimes you get external data that shows up as files, and you can suck it in with batch ingestion: very fast, very easy. A sketch of that follows below.
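
Here is a minimal sketch of that batch path using Druid's SQL-based ingestion: an INSERT that reads files from S3 through the EXTERN table function, submitted to the SQL task endpoint. The bucket, file path, and column names are hypothetical.

```python
# A sketch of batch-ingesting historical files via SQL-based ingestion.
# The statement is submitted to /druid/v2/sql/task, which runs it as an
# ingestion task. Bucket, file path, and columns are hypothetical.
import requests

insert_sql = """
INSERT INTO clickstream
SELECT
  TIME_PARSE(ts) AS __time,
  user_id,
  page,
  country
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/archive/2023-01.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "ts", "type": "string"},
      {"name": "user_id", "type": "string"},
      {"name": "page", "type": "string"},
      {"name": "country", "type": "string"}]'
  )
)
PARTITIONED BY DAY
"""

resp = requests.post(
    "http://broker:8082/druid/v2/sql/task",  # hypothetical broker host
    json={"query": insert_sql},
)
resp.raise_for_status()
print(resp.json())  # returns a taskId you can poll for completion
```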

And fourth, as I said, nonstop reliability.

So, where did Druid come from? In 2010, there was a group of developers at an ad tech company who needed a solution that could ingest a billion events in less than a minute and query those billion events in less than a second. They tried to do that with Hadoop - couldn't do it. Hive and the like couldn't do it. They tried Postgres - couldn't do it.

So because they were young and stupid, they said: we'll just make a new database, how hard can that be? It turned out it's actually kind of hard, but they did it successfully. And because they were in their twenties and played too much World of Warcraft, they named it Druid, because it's a shape-shifting database. In their twenties, they thought that was really cool.

Now that our co-founders are in their thirties and have kids, they're a little embarrassed by that name, but it's still a good name. It works well.

Druid premiered in 2012. There was a great meeting in Silicon Valley that Netflix hosted, with both the Druid creators describing their project and the Kafka creators describing theirs, and the Netflix people said: yeah, this is what we need. If you two open-source those things, we're going to use them. So Netflix was the first place to actually deploy Kafka and the first place to actually deploy Druid, and they're still a huge user of both.

Today, we think there are more than 2,000 organizations using Druid. It's open source, so I don't really know - it's a good guess. And it's used, as we discussed, across many different industries and for the use cases we talked about: visibility at scale on fast-moving event streams, observability, product analytics, rapid data exploration, external-facing analytics, and real-time decisioning.

There's also a lot of use in IoT, as you might imagine - IoT means a lot of streams, right? All those devices and other things sending streams up, and you often need to make decisions on that data really fast. You know, if I'm monitoring an oil pipeline, knowing that half an hour ago it would have been a good idea to turn on the fire extinguisher is not very helpful information. I need to know that right now. So we see a lot of that.

So, I've been talking about Druid; I work for Imply. What's Imply? You've seen this model before, right? People create open source software, and after a few years they go: hey, maybe we should make a company around this. So the people who created Apache Druid founded Imply in 2015.

So what does Imply do? Well, here's an interesting thing about Imply: it's not our goal to make money. We're not against making money - we are a company, not a nonprofit - but our founders really believe that we don't want to live in a world where only a few big companies control all information. How do you get a world where information is actually owned by the people who should have it? Open source, with a strong, active, and growing open source community. We want open source to win. The purpose of Imply is to help open source win, particularly around open source Druid.

So let me make this clear: open source Druid is the fully functional version of Druid. It's not like we have a crippled open source version and a good version you can pay for. But if you choose to pay us, there are things we can give you in addition to open source:

  • A commercial distribution with enhanced security. When you use open source, you have to install all of the dependencies and the other pieces yourself; if you buy the Imply version, it's all prepackaged. And we add some extra security pieces that are not in open source - why are they not in open source? Because they're not open source security; they're commercial pieces.

  • We also offer a couple of versions of cloud deployment. One of them is a full database-as-a-service, which we call Polaris. That's as you'd expect: everything is consumption-based; you pay based on how many queries you're running and how much data you're ingesting. By the way, there's a free trial - we don't even want a credit card. If this is something that interests you, go to imply.io; you'll see a button that says free trial. Give it a try.

  • We also have what we call the Hybrid managed offering, because some of our customers say: I can't run this in your VPC; I need to control all the security and run it in my VPC. So we give you the app on Kubernetes, and we'll do as much management as you want to let us do. If you let us, we'll manage the security and other pieces, but it's your choice how far you let us in the door. And we added some management tools on top that do performance management and make it easier to deploy. Those report back - not the data, we never see your data, but the metadata - so that when you call us for tech support, we don't have to waste two hours figuring out what's going on; we can pull it up and see what's happening in your environment.

  • And then finally, support. Druid has maybe the best support in open source: there's a very active Druid Slack, and the average response to an open source request for help with Druid is about 20 minutes. But that's still open source, right? If you're using open source and it goes down, you're just hoping somebody will help you. If you're paying, there are people who are paid to help you, with 24/7 support, escalating right up to the original creators. They're usually not working the support lines, but if you have a problem that's tough enough, you'll have some of the people who wrote the project and run Druid working on it.

And then the last thing to mention here: this is ready to use on AWS. It's in AWS Marketplace, both the Enterprise version and Polaris as a service, and it connects natively with Kinesis and MSK. You get all of the ingestion and transformation, cool things we do that I didn't get into like interpolation - automatically smoothing and filling in missing values on streams - plus the sub-second query and the visualization.

So again, you can use our built-in visualization, or you can use QuickSight or Superset or Grafana or whatever visualization makes sense, so that your data analysts, your applications, and the other things that use the data are ready to go.

So, if you want to find out more:

  • druid.apache.org is where you can get the official open source Druid.

  • If you want to find out about Imply, go to imply.io - and imply.io/polaris if you want to go right to the free trial.

We also have a whole bunch of tutorials, lessons, articles, and other how-to material in what we call our Developer Center. You can find that at imply.io/developer.

So I was told I had 20 minutes, I am now at 19 minutes and 30 seconds. So I could take one question if anybody has one.
