What’s new in Amazon OpenSearch Service

Hello, everyone, and welcome to the session on what's new in Amazon OpenSearch Service.

Presenting with me today are Carl Meadows, Director of Product Management for OpenSearch, and Bill Burkett, Senior Manager at Electronic Arts. My name is Mukul Karnik, and I'm the General Manager for OpenSearch at AWS.

It's great to see all of you on this Thursday. I hope you've all had a great conference, and I'm looking forward to the re:Play party tonight. We have a lot of content to cover and some exciting demos.

I'll begin by talking about the open source OpenSearch project. Then I'll talk about how to get data into OpenSearch. Carl will cover the search and observability use cases. Bill will show us how Electronic Arts migrated from self-managed Elasticsearch to OpenSearch. And finally, I'll talk about the innovations in the Amazon OpenSearch Service.

Before we get into the Amazon OpenSearch Service, I want to talk about the OpenSearch open source project.

The open source OpenSearch project was launched in 2021, and since then we have seen incredible growth. Last year at this time we were at about 100 million downloads, and now we are at more than 300 million downloads. So the rate of downloads is accelerating, which is great to see.

We are also seeing a lot of partners adopt OpenSearch. Last year around this time we had 40-plus partners on the project, and now we are at 70-plus partners. So more partners are joining the project: partners that are ISVs, partners that are solutions providers, and partners that are major cloud providers. It's great to see such a broad spectrum of partners.

We also have partners such as SAP and Intel who are now committing code to the open source project. As a result, we now have more than 1,500 contributors to the project, and more than half of them are outside AWS. So it's good to see that the project is gaining momentum and has a community that extends beyond AWS. OpenSearch is also now available on the major cloud providers.

We have, of course, the native AWS service, which is the Amazon OpenSearch Service. We have a native service for OpenSearch on Oracle, and we have OpenSearch services on Azure and GCP through our partners Bonsai and Aiven. So it's now available on all the major cloud providers.

Let's look at the Amazon OpenSearch Service. The Amazon OpenSearch Service helps you securely and cost-effectively manage OpenSearch at scale. If you are using OpenSearch for search, you can improve the relevance of your search results. If you are using OpenSearch for log analytics, you can do that in a cost-effective manner.

OpenSearch is integrated with various AWS services, including the AI/ML services, so you can use those more easily with the Amazon OpenSearch Service. As a result, we now have tens of thousands of customers using OpenSearch to process trillions of requests per month and store hundreds of petabytes of data. So really, a lot of customers are using OpenSearch at scale.

We have customers like Adobe using OpenSearch for their e-commerce platform, and customers like Zillow using OpenSearch to search real estate listings. So a large number of customers across different industries are now using OpenSearch.

So what is search about? Search is about efficiently finding insights from your data. Prior to the digital age, people still needed to find information in books and other kinds of documents, and so they came up with different kinds of indexing and cataloging schemes.

So here's a trivia question: when do you think the first cataloging scheme was invented? Any guesses? How about 500 years back? Who thinks it was 500 years back? Can I get a show of hands? How about 1,500 years back? How about 2,000 years back? Yeah, it was actually invented in the third century BC by the Greek poet Callimachus. And since then, a lot of discovery and invention has happened in cataloging and indexing schemes.

Modern text-based search is based on an inverted index; in OpenSearch, we use Lucene. So a lot of search is built on this set of inventions. With the boom of AI and ML, we are now seeing customers use semantic search and natural language search.

We are also starting to see people leverage the best of both worlds, using text-based search and natural language search and combining them. We have some exciting announcements in this space, and Carl has an exciting demo as well, so it will be good to go over that. OpenSearch is a really good tool, and at its heart is a search engine. Given that, it is also very helpful in the log analytics space.

You are trying to get insights from your log data, and we have some really good log analytics capabilities and some good innovation in that space as well, which Carl will talk about. Now that we know what OpenSearch is all about, let's look at the OpenSearch ecosystem.

You have different kinds of documents, log data, and transactional data that you need to get into OpenSearch. Typically, you will have a data engineering team, or an engineer in a data platform team, who's responsible for getting data into OpenSearch, and they'll build some kind of pipeline to do that.

Once you get data into OpenSearch, if you're using OpenSearch for search, you will have a search engineering team who's responsible for tuning the results and making sure that they are relevant. If you're using OpenSearch for log analytics, you typically have a DevOps engineer who's looking at dashboards, trying to find what happened and get to root causes.

And if you are using OpenSearch in a mid-size or large company, you will have a platform team that manages the OpenSearch clusters for all the different teams. So there are a lot of different personas involved in the OpenSearch ecosystem.

So let's look at one such persona, the data engineer, and the challenges they face. They are typically looking to build a pipeline to get data into OpenSearch, and they'll use some custom tooling. Sometimes they'll use Logstash running on EC2, or sometimes they'll use Lambdas or streaming services to get data into OpenSearch. Getting data into OpenSearch can be difficult.

You first have to figure out how to collect this data. You also need to persist this data in S3 because you want to make it durable. Then you want to buffer this data in case the downstream system is not available or there's an impedance mismatch. And finally, you also want to be able to transform that data.

This transformation can be removing duplicates or conditionally routing data to different clusters, so there are different kinds of transformations you may want to do. All of this is difficult: you may have to write custom code and then update it each time your log data or your documents change. So it can be pretty painful to address these challenges.

In June of this year, we launched Amazon OpenSearch Ingestion. The ingestion service takes care of all of this undifferentiated heavy lifting. You simply send your logs or your documents to the endpoint. We also have connectors to S3, Kafka, and other sources.

In the service, you can configure different kinds of processors that let you transform the data, and this is all through configuration, so you don't have to write code. Then you can send the data into an OpenSearch cluster.

So it's very easy to use, and the best part is that it is serverless: it scales up and down with your traffic, so you don't have to worry about scaling or sizing this cluster. It just does it automatically. Another important part is that it is actually very efficient.

We did some benchmarks against Logstash, and it's about 60 to 70% more efficient than Logstash. So it's a really efficient and cost-effective way to get data into OpenSearch.

Let's look at this ingestion service in a little more detail. There's a concept of sources; these are sources that connect to different transactional systems, or you can push log data in. You have different kinds of processors, including processors that enrich the data.

So GeoIP processors, processors to remove duplicates, or processors that apply different kinds of regex patterns; a rich set of processors. The sink is typically the Amazon OpenSearch Service.
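To make this concrete, here is a minimal, hedged sketch of creating such a pipeline with boto3. The pipeline definition itself is Data Prepper YAML with a source, a processor, and an OpenSearch sink; the domain endpoint, role ARN, and pipeline name are placeholders, and the exact sources and processors you use will depend on your data.

```python
import boto3

# Data Prepper pipeline configuration (placeholder endpoints and ARNs):
# an HTTP source accepts log events, grok parses them, and the OpenSearch
# sink writes them to an index in a managed domain.
pipeline_yaml = """
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - grok:
        match:
          message: ["%{COMMONAPACHELOG}"]
  sink:
    - opensearch:
        hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
        index: "application-logs"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::123456789012:role/osis-pipeline-role"
"""

osis = boto3.client("osis", region_name="us-east-1")
osis.create_pipeline(
    PipelineName="log-pipeline",
    MinUnits=1,   # the serverless capacity (OCUs) scales between these bounds
    MaxUnits=4,
    PipelineConfigurationBody=pipeline_yaml,
)
```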

We have also launched a couple of new capabilities. We've launched the ability to buffer data within the ingestion service, so you don't have to worry about that. And we've launched new sources: we now support Elasticsearch 7.x versions as well as older versions of OpenSearch. What this lets you do is help with migration.

So if you have data in older versions of Elasticsearch or OpenSearch and you want to get to the latest version, you can use this ingestion service to reindex that data.

Talking about migrations, how many of y'all have had to deal with painful migrations? Can I get a show of hands? That's a few of you. Migrations can be challenging. You need to figure out what kind of cluster to use, how to size it, and how to tune it. Then you need to figure out how to get your data into this new cluster.

Typically, you'll do some kind of PoC to figure out the parameters you need to tune, and then the migration of the data itself can be challenging. To help with some of these challenges, I'm excited to announce a Migration Assistant.

This Migration Assistant is available in the AWS Solutions Library, and it is also available as an open source project. So if you want to build on it, or if you are a partner working with customers, you can build on it and use the Migration Assistant.

So what does the Migration Assistant do? It deploys an agent in an unintrusive manner on your existing cluster and helps fork and replicate traffic to a new cluster, and this new cluster may be running the latest version that you want to migrate to.

It also lets you compare the results from the old cluster with the results from the new cluster, to make sure that, if you're using it for search, the results show up in the right order. It also lets you compare the performance: are you seeing the same performance or not?

The nice part is you can control the amount of traffic that goes to the new cluster. You can replicate a small percentage, say 30% of traffic, or 100% of traffic. With this kind of assistant, you now have the ability to really test your new cluster's performance, scale, and correctness, and migrate with more confidence.

One of the other challenges we hear from customers is that if you are building applications using something like DynamoDB and you want to add search to your application, you need to move data from DynamoDB to something like OpenSearch, and this can be challenging.

You again have to build a pipeline. Usually what we hear customers do is use Lambda or run this on EC2, again using streams, and it can be challenging. So just yesterday, we announced the zero-ETL integration with DynamoDB, so that you can synchronize your DynamoDB table with your OpenSearch index.

The nice part is that this integration is available in the DynamoDB console. As soon as you create your table there, you can go and set the index that you want to use in OpenSearch, and the ingestion service that I just talked about does all the heavy lifting for you.

So it's a very easy way to get data from DynamoDB into OpenSearch and keep it up to date.
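From the application side, a minimal sketch of what this looks like, assuming a hypothetical "orders" table that the zero-ETL integration is already syncing to an "orders" index (domain endpoint, region, table, and index names are all placeholders): you write the item to DynamoDB as usual, and once the pipeline has propagated the change, it is searchable in OpenSearch.

```python
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Write to DynamoDB as the application normally would.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
dynamodb.Table("orders").put_item(
    Item={"order_id": "o-1001", "customer": "Jane Doe", "status": "shipped"}
)

# Query the index the integration keeps in sync (SigV4-signed request).
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "es")
client = OpenSearch(
    hosts=[{"host": "search-my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

results = client.search(index="orders", body={
    "query": {"match": {"customer": "Jane Doe"}}
})
print(results["hits"]["hits"])
```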

Looking back at the life of a data engineer trying to get data into OpenSearch, hopefully with all of these innovations, the ingestion service as well as the zero-ETL integration with DynamoDB, it has become a little easier, so that you don't have to worry about some of the undifferentiated heavy lifting.

Now, to talk about the search and observability use cases, I'm happy to invite Carl.

Thank you, Mukul. Awesome. So, yeah, I get to talk about search. I'm Carl, again. How many people here have had the responsibility of managing search on a platform? Quite a few. Traditionally, with managing search, my job is to make sure that I get the most relevant data, based on the customer's prompt, that I think best meets my business objectives.

To do that, I often have to fine-tune search, set up boosting of results, and deal with synonym files to make sure I'm getting good results. As Mukul mentioned, OpenSearch at its heart is a relevancy engine. It's designed to make sure that we're getting the best results back to meet the customer's query.

It's also got other features that come out of the box with OpenSearch. On top of text search, you have other techniques like faceting, which allows you to group and filter by category, as well as geospatial capabilities.

So say I wanted to limit that search to within five miles of the requester's location, or use autocomplete or fuzzy search. These are all capabilities that come out of the box with OpenSearch.

And now, with these new techniques for AI and ML, we're integrating those capabilities into OpenSearch too, so you have a full toolbox available to deliver the best results to your customer.

So this would be your traditional text search. Amazon.com actually runs quite a bit of ML in theirs today. But if you look at traditional text search, you've got a catalog of data, and the search terms that come in are matched against the customer's intent. The way it does that is by scoring each result; the higher the score, the more relevant, and that is the rank in which the results are presented. And you can do things like faceting to drill down.

Now, who here has heard that there's this generative AI thing? Has anyone talked about that in any of the sessions you've been to? AI and ML have made huge leaps, most notably in the last year, to where these models really deeply understand the English language and can potentially understand intent better than what you could get from a text prompt. And I'm going to show a couple of examples of this in a minute.

So this is really powerful functionality for search, and we want to make sure it can be fully used with search. So how does AI and ML actually understand language, and how does that work in practice? It does so by converting things into vectors. You might have heard of vectors: a vector is basically a row of dimensions, which are numbers. A common model might have 1,000 dimensions, or 500, or even 4,000 dimensions for any particular piece of data. Those dimensions make up the model's understanding of that document. When you search, you're actually searching for the closest match in that massively dimensional space, to get back the most relevant objects that are closest in that space to the request.

So that's how a model speaks. The way this process works, in its simplest form, is that you've got a set of documents; they could be images, audio, logs, or rich text documents. You feed those into a model, and the model converts them into vectors. Those vectors can be stored in a vector database; Amazon OpenSearch Service, and the vector engine that we offer on serverless, is a very popular engine for storing vectors.
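As a rough sketch of the storage side, here is what a k-NN enabled index looks like with the opensearch-py client; the field names, dimension, and localhost endpoint are illustrative, and the embedding itself would come from whatever model you use.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

# An index that stores one dense embedding per document.
client.indices.create(index="docs-vectors", body={
    "settings": {"index.knn": True},
    "mappings": {"properties": {
        "text": {"type": "text"},
        "embedding": {
            "type": "knn_vector",
            "dimension": 1536,  # must match the embedding model's output size
            "method": {"name": "hnsw", "space_type": "l2", "engine": "nmslib"},
        },
    }},
})

# The embedding would normally come from your model of choice.
client.index(index="docs-vectors", body={
    "text": "Return policy for electronics",
    "embedding": [0.12, -0.03, 0.88] + [0.0] * 1533,  # placeholder vector
})
```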

Then, if you think about the world's simplest chat flow, you would have an experience where a user asks a question. That question goes to the LLM, the question is converted into a vector and sent to OpenSearch, which sends back a set of results, and the LLM then converts those back into human language and sends the answer to the person.
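Continuing the sketch above, the retrieval step of such a chat flow might look like the following; the embed() and call_llm() helpers are hypothetical stand-ins for your embedding model and LLM, and the index is the one created earlier.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

question = "How do I return a laptop?"
question_vector = embed(question)  # hypothetical helper calling your embedding model

# Find the nearest documents in vector space.
hits = client.search(index="docs-vectors", body={
    "size": 3,
    "query": {"knn": {"embedding": {"vector": question_vector, "k": 3}}},
})["hits"]["hits"]

# Feed the retrieved text to the LLM as context.
context = "\n".join(hit["_source"]["text"] for hit in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = call_llm(prompt)  # hypothetical call to the LLM of your choice
```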

Now, generative AI applications are actually much more complicated than this in practice. In most real-world scenarios, to build a really nice experience for customers, it's not quite as simple as question-and-answer with the model. You usually want multiple stages of analysis; you might have some chain of reasoning you need to do, and to do that you tend to have to build application middleware. These can often be built with tools like LangChain, LlamaIndex, or Haystack, which walk through that workflow, and each step of that workflow may be interacting with one or many models and one or many vector databases. But that's a lot of work to build that middleware.

So I'm excited to talk about Neural Search, which we launched in OpenSearch 2.9. The goal of Neural Search is to reduce the amount of middleware that you have to build, and also to give you local and native access to all of the other capabilities of OpenSearch in the same pipeline.

The way Neural Search works is that your applications keep using the OpenSearch APIs they're familiar with to communicate with OpenSearch. Inside OpenSearch, we have decomposed the indexing pipeline and the search pipeline. So when an application sends a document in, that document can just come into OpenSearch, and inside the ingest pipeline, say it was an image, it can be sent out to a model to create the embeddings, and the embeddings are stored alongside the rest of the metadata and other search information you have with that object.
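A hedged sketch of that ingest side, assuming a model has already been registered with OpenSearch (the model ID is a placeholder): a text_embedding processor in an ingest pipeline generates the embedding, so the application only ever indexes plain documents.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

# Ingest pipeline: call the registered model and write the embedding
# next to the source field.
client.ingest.put_pipeline(id="nlp-ingest", body={
    "processors": [{
        "text_embedding": {
            "model_id": "my-registered-model-id",           # placeholder
            "field_map": {"text": "text_embedding"},
        }
    }]
})

# Index that runs the pipeline by default and stores the vector.
client.indices.create(index="docs-neural", body={
    "settings": {"index.knn": True, "default_pipeline": "nlp-ingest"},
    "mappings": {"properties": {
        "text": {"type": "text"},
        "text_embedding": {
            "type": "knn_vector",
            "dimension": 768,  # match your model's output size
            "method": {"name": "hnsw", "space_type": "l2", "engine": "nmslib"},
        },
    }},
})

# The application indexes a plain document; the embedding is added on ingest.
client.index(index="docs-neural", body={"text": "blue denim jacket"})
```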

The same goes for search. If an application sends a search, the search pipeline can have multiple stages in it. It could have one stage doing a lexical search, then the semantic result can come back from a model, and it can compare those results and produce a composite score. It could also call out to another model to do personalization and re-rank the results based on the user, and then it could call out to yet another model to take those results and summarize them. All of that can be done without having to build middleware; it can be done strictly from OpenSearch and returned through an OpenSearch API.

This also has the benefit of making it much easier to test out new things. Part of what's happening is that there are new models all the time, so having a stable application stack to work with, plus the ability to experiment and try new things easily, is one of the things we want to give you. We want to make it easy for you to build and do things that we haven't thought of.

On top of Neural Search, we've been able to add several features so far this year. We added hybrid search, which allows you to combine textual analysis and all of those OpenSearch capabilities with vector similarity search and composite scoring. We also added the ability to run your own fine-tuned models on SageMaker with a no-code integration with Amazon OpenSearch Service managed clusters, so you can run your own model and call out to it through Neural Search.
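As an illustration of hybrid search, here is a minimal sketch (assuming OpenSearch 2.10+ and the registered model ID from the earlier sketch, both placeholders): a search pipeline with a normalization-processor combines the BM25 and neural scores, and a hybrid query runs both sub-queries in one request.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

# Search pipeline that normalizes and combines the scores of the sub-queries.
client.transport.perform_request("PUT", "/_search/pipeline/hybrid-pipeline", body={
    "phase_results_processors": [{
        "normalization-processor": {
            "normalization": {"technique": "min_max"},
            "combination": {
                "technique": "arithmetic_mean",
                "parameters": {"weights": [0.3, 0.7]},  # lexical vs. semantic weight
            },
        }
    }]
})

# One request, two sub-queries: lexical match plus neural (semantic) search.
response = client.transport.perform_request(
    "POST", "/docs-neural/_search?search_pipeline=hybrid-pipeline", body={
        "query": {"hybrid": {"queries": [
            {"match": {"text": "casual jacket"}},
            {"neural": {"text_embedding": {
                "query_text": "casual jacket",
                "model_id": "my-registered-model-id",   # placeholder
                "k": 5,
            }}},
        ]}}
    })
print(response["hits"]["hits"])
```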

With 2.11, which was just announced a few weeks ago, we also added a new search technique called sparse vector retrieval. What I described earlier with vector search was actually dense vectors, vectors with thousands of dimensions. Sparse vector retrieval uses a SPLADE-style model, a slimmer model that produces fewer tokens and is more aligned with how the inverted index works. This has shown a lot of promise in our testing: it can understand intent without as much cost, because instead of going out to a very large model you can run a much smaller one, and you can get performance much closer to on par with lexical search because the model is much lighter.
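A rough sketch of what sparse retrieval looks like, assuming the sparse term weights are produced at ingest by a sparse_encoding processor and a registered sparse model (the model ID is a placeholder): the weights live in a rank_features field, and the neural_sparse clause expands the query with the same model at search time.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

# rank_features holds the token->weight map the sparse model produces.
client.indices.create(index="docs-sparse", body={
    "mappings": {"properties": {
        "text": {"type": "text"},
        "text_sparse": {"type": "rank_features"},
    }}
})

# Query-time expansion with the same sparse model.
client.search(index="docs-sparse", body={
    "query": {"neural_sparse": {"text_sparse": {
        "query_text": "wireless noise cancelling headphones",
        "model_id": "my-sparse-model-id",   # placeholder
    }}}
})
```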

So we're really excited about sparse vector retrieval. And then Swami, in his keynote, showed an example of multimodal search; it was actually an OpenSearch screen in the back. I would have loved for him to say that. Multimodal search combines image and text: you can send the image into the multimodal model, and it will encode into those embeddings what it thinks is in that image, without you having to label the data or do anything like that. I'm going to show a demo of that too, and it can sometimes offer really amazing results.

We've also made these integrations much easier in the managed service. We added an integrations tab where there are CloudFormation templates right there that will spin up Bedrock Titan text embeddings; the sparse model is wired up there too. We're going to add others, Cohere and other partners, just to make it easy to spin these up without having to do as much manual work.

So with that, I hope you've seen on the search side that the OpenSearch team has been busy. Hopefully we've been adding a lot of things you'll be able to do powerful work with: continuing to expand our capabilities on vectors and vector search, the Neural Search plugin, and building on top of the Neural Search plugin with composite scoring, hybrid scoring, custom models, sparse vectors, and multimodal support.

From there, let's shift over to another extremely popular use case with OpenSearch: using it as a machine-generated data analysis tool for log analytics and observability. How many people use OpenSearch for log analytics and observability use cases? Yeah, a lot.

When you're doing that, OpenSearch is a great tool because it's got all of those capabilities I talked about: it's distributed, it can handle large streams of data, and it provides really fast query responses, which makes it ideal for this. As a developer or DevOps engineer, though, I have to learn how to write queries to do my forensic analysis, I have to create dashboards for my data set, and I have to configure log alerts.

I have to manually correlate across things if I don't have a common schema, in order to identify how issues are related across data sources. And if I were doing security analytics, I'd have to build a lot of tooling on top to be able to get security insights using OpenSearch.

If anyone here has had to carry a pager and been responsible for fixing a problem in the middle of the night, having good tools matters a lot. Downtime creates stress, so you need tools that let you quickly identify problems, quickly do forensic analysis, and determine root cause, so you can get back to sleep or get back to your family.

To help with this, a few years ago we launched observability capabilities in OpenSearch that layer on some of these capabilities and make it even faster and easier. We added built-in anomaly detection and alerts. We have rich support for OpenTelemetry data, which allows you to do tracing and see spans and service maps to identify and pinpoint the source of problems in complex environments, and then correlate that with your logs and with your metrics. We also added features like log patterns, live tail, and surrounding events.

We also created a language called PPL, a piped processing language that is really optimal for data discovery and exploration. It's piped, so I can start with this, sort by that, group by this, and it's very logical in its flow. It's much easier for this type of data exploration than, say, SQL or the OpenSearch DSL would be.
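As a small illustration of the piped style, here is the kind of PPL query you might run against the _plugins/_ppl endpoint; the index and field names are made up.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"), verify_certs=False)

# Piped processing language: each stage refines the previous one,
# reading left to right like a shell pipeline.
response = client.transport.perform_request("POST", "/_plugins/_ppl", body={
    "query": (
        "source=app-logs "
        "| where level = 'ERROR' "
        "| stats count() as errors by service "
        "| sort - errors "
        "| head 10"
    )
})
print(response)
```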

We've also now extended it. Jaeger is another very popular tracing format, and we built full visualization capabilities for Jaeger, including spans, trace groups, and service maps, and added the ability to automatically extract metrics out of logs that are correlated with the rest of the system.

So I'm going to play a video. Did we plug in audio? We did not, but I'm doing it now. Is there sound? We should have tested this. Let's try that. There you go.

[Video] We're excited to introduce the OpenSearch Assistant Toolkit to help developers build generative AI experiences to solve search and analytics problems. With the OpenSearch Assistant, you can build solutions that let you write queries using natural language, such as "are there any errors in my logs?" Or, if you want to quickly see your data in action, try asking the OpenSearch Assistant to create a visualization.

The OpenSearch Assistant unlocks a new world of possibilities for building powerful generative AI experiences and discovering rich insights from your data. Start building with the OpenSearch Assistant today.

Apologies for the glitch. Back here we go.

Cool. So with the OpenSearch Assistant, remember I mentioned that you have to be able to write a query, and it's late at night and you want to quickly find information. Right now we're releasing the OpenSearch Assistant Toolkit as open source, and we're going to build it into the service to make it available to all of you next year. With this assistant, you can use natural language to help you write those queries, and once you write that query and run it, it will automatically summarize the data to help you easily get insights from it. And that's just the beginning; we're going to be implementing more and more skills into the system, so you could do things like "create a visualization for me," "create a pie chart of this data with this filter," or "create an alert for me at this threshold." Those skills will all be built into the system to make it much easier and faster to use the service to solve problems quickly.

So this is all open source. I mentioned the playground earlier; this is available at observable.playground.opensearch.org. You can log in and experiment with asking it questions against the data that's loaded in there, and the summarization is pretty cool. I think you'll be excited about it. And like I said, it's all open source, so others using OpenSearch can customize it and build their own skills. Right now it's wired up on our side with our own prompts and our own reasoning logic, using Anthropic Claude on the back end.

Cool. I also mentioned earlier that doing security analytics with OpenSearch is often quite a bit of work. Part of this is that when you're doing observability and monitoring, you're typically looking at aggregations: has my error rate exceeded Y over the last five minutes? That's what triggers events. With security, you have to look at every log line and analyze it against a threat database. So earlier this year we launched Security Analytics, which builds in alerting and a rules engine that can analyze against your own custom rules as well as 2,200 Sigma rules. It analyzes every log line as it's ingested and detects if there are issues. We also built a correlation engine on top of that, so it can correlate whether this threat on this host is also associated with this requester, who also talked to this other host.

From all that correlation, it builds a graph automatically to help you trace and identify how threats could potentially be related, out of the box, with no manual configuration. So I'd be very excited for folks to take a look at this; I think there's a lot of power that can be derived from Security Analytics.

Now, this one I'm particularly excited about. Oftentimes we have a lot of data. OpenSearch is amazing, like I said, for really fast analysis of the data we frequently use to troubleshoot and monitor our environments. However, there's often a larger pool of data out there that is just not economically feasible to load into OpenSearch. We don't query it that much, but there are valuable insights in that data.

So yesterday, Swami announced the zero-ETL integration with Amazon S3. What it does is allow us to query that data in place, without all of the expense of indexing it into OpenSearch. It also has capabilities to build secondary indexes automatically to accelerate those queries and to build materialized views in OpenSearch, so you can get fast views of your data as well as selectively index data.

So how does this work? Like I said, today you've got the Amazon OpenSearch Service with a bunch of your primary data in there: your infrastructure logs, your app logs, the stuff you're looking at constantly and need fast access to. Out in S3, using tables defined in AWS Glue (which it will help you define as well), you may have secondary data like VPC Flow Logs or WAF logs that can be enormous in size. But like I said, they can sometimes hold really valuable information when you're trying to do forensics or troubleshooting. This integration allows you to query that data directly as well as configure acceleration. You can define skipping indexes, so it knows which partitions to skip for faster performance; build materialized views, so if I want to summarize that data for a dashboard and keep it refreshed, it can do that as a secondary index; and build covering indexes, which means I actually want to copy all the data in this column into OpenSearch for further analysis.

So with that, let me actually show you that it's real, so you believe me. If I go into my data, there's a data sources tab here. If I click it, I now see that I have the zero-ETL connection. If I click on that, this is the actual Glue Data Catalog object for the bucket that I've registered. And I have several options there: I can build accelerations, I can build an integration (and I'll show you an integration example), or I can query it directly.

So I'll just show you that it actually works. I'm going to go over into a regular Discover tab, and we'll see I've got my regular OpenSearch data, but I also now have Amazon S3 as an option. I click on that, and I can run a SQL or PPL query. I'm going to go ahead and run a SQL query. This is VPC Flow Log data, and I'm just writing a quick query to show me the top 10 sources by bytes in that data set. I just wrote a quick SQL query, and none of this data is indexed; it's actually going out to S3 to scan that data and come back with the result. So it's not going to be quite as fast as OpenSearch, because the data is not indexed, but hopefully by the time I'm done dancing, it'll be done.

So we can see that the data is now available using the same OpenSearch SQL or OpenSearch PPL, the same tools I was using, and none of that data was indexed. So from the same place where I'm looking at my other logs, I can quickly do analysis.

I mentioned the integration. These data sources ship with a set of integrations to make it easier to get them set up. There was a VPC Flow Log integration, which I used, that automatically built visualizations and materialized views for me. So now if I go into my dashboards, this is my dashboard for my VPC Flow Logs from the data in S3. It's being powered by the materialized view that the integration created, which is doing scans to get updated information and build just the summary information in OpenSearch. And I can see in this dashboard, out of the box, I can do things like say "only show me the rejects," and I get the same OpenSearch-like performance to redraw that dashboard, because that summary data is available locally as an index in OpenSearch, as one of those secondary indexes.

Cool. Awesome. So with that, hopefully we're doing a good job making that developer's and that DevOps user's life better by adding more observability capabilities and more tools: the OpenSearch Assistant coming soon, Security Analytics, as well as the zero-ETL integration with S3, which is now in preview. So anybody can go try it out today and let us know how it's working for you.

And with that, I will hand it over to Bill Burkett, one of our customers, to talk about his experiences.

Hello, everybody. My name is Bill Burkett. I'm with the platform, infrastructure, and engineering group at Electronic Arts. I have a video here to help kick off my presentation. Any FC players out there? Not too many; a few of you. All right, good.

What my organization does is provide a lot of services, for example observability services, that power over 180 EA-affiliated titles. One of these services that I want to talk to you about today is a formerly Elasticsearch, now OpenSearch, cluster that we built to give us observability for some of the most popular titles that you're familiar with.

For example, we have marketplace services, social services, and matchmaking services that underlie a lot of our really common games, like FC 24, Madden, Apex Legends, and a whole bunch of others. I think the gaming industry poses a lot of interesting software development lifecycle challenges; some of them you may be familiar with, and some of them might be a little novel.

Some of our titles, for example Apex Legends, probably have an observability pattern you might be used to from your own live services: pretty consistent logging output all day, all night, every day. Some of our titles are sports titles, Madden, FC, NHL; these have very heavy development, load testing, and release cycles that happen every year, so a lot of logging data is produced before and at launch time and then heavily monitored.

Some of our titles are more standalone, so they again have very heavy load testing and launch cycles, and a natural progression where they tend to decline over time.

One of the other interesting things about the gaming environment is that loads tend to be pretty bursty and somewhat unpredictable. You may get a viral event or a sporting event that causes a lot of traffic to suddenly appear that you weren't ready for, or sometimes there may be a promotion that you were ready for that produced more logging output than you were quite ready for.

A concrete example I love to give, and it's top of mind because it actually happened while we were doing this migration: we had a title, Star Wars: Squadrons, one of our more standalone multiplayer titles, released a couple of years ago. Recently we released it in the Epic Games Store as a free game. We knew that promotion was coming and we were somewhat ready for it, but we did not expect how chatty that game was going to be to our logging cluster.

When that game was released as a free game, it doubled the total volume to our logging cluster. We really scrambled to try to scale this logging cluster, and we weren't able to keep up. Because it was a multi-tenant logging cluster, that affected the visibility and the availability of observability for several of our other titles.

So what I want to talk about is some of the challenges with our legacy logging stack and why we decided to move to OpenSearch.

Our legacy Elasticsearch stack was pretty difficult to scale. This cluster was built manually in Kubernetes. When I say built manually: we deployed these containers ourselves, we deployed the workload ourselves, we built and packaged this ourselves. It was running as StatefulSets in Kubernetes. We configured the EBS volumes ourselves, and we configured the size of the EC2 instances ourselves.

A couple of the other issues we had with this cluster were that it was on an aging Kubernetes cluster, so it was no longer an EKS cluster that was in support. One of the things we asked when we were talking to the Elasticsearch folks was: how can we better manage this ourselves? And their guidance was to try to move us to the more modern operators.

Well, we couldn't move to the more modern Elasticsearch operator because our version of Kubernetes was old, and we couldn't really update our version of Kubernetes because we had built a lot of custom stuff around the Elasticsearch cluster that we had. So we were really stuck between a rock and a hard place.

We were also spending a lot of babysitting and engineering toil on this cluster. For this six-person engineering team, we really had between one and two people who were, at all times, basically dedicated to scaling this cluster up and scaling it down. That's actually one of the things I forgot to mention that's pretty unique about gaming: because of these lifecycles, we don't always just scale up. We actually have to scale clusters down as games go through their natural aging lifecycle, so that we can be as cost-efficient as possible. It was very hard to scale this cluster.

One of the things you'll see a little later is that we were pretty massively over-provisioned in our existing logging cluster, and that was basically to get around some of this toil; you don't have to spend as much time babysitting if you're just massively over-provisioned.

The other challenge we had with this logging stack was the high licensing fees we were paying to Elastic, and because we self-hosted, we had a limited amount of support as well. We were pretty much on our own.

Real quick, I want to walk through the architecture before and after the transition. It's going to be pretty similar, and I'll talk a little bit about why. You can see some of the titles on the left and some of these common services that we have: matchmaking, marketplace, and social. DCS is a custom ingestion service that we created ourselves. I think Mukul mentioned earlier that ingestion is hard; we did find that, and we built a custom component to deal with it, which actually turned out to be a lifesaver later on. From that point on, it's a pretty typical Elasticsearch, now OpenSearch, stack that you might be pretty familiar with: we're pumping data into Kafka, and Logstash is pumping the data into Elasticsearch itself.

So what did we do after the transition? All we had to do was just change out one little box from Elasticsearch to OpenSearch. Pretty easy, right?

Actually, it turned out not to be so difficult, because we had a nice architectural decoupling layer with DCS. This allowed us to do some pretty interesting things.

So I want to talk about how we used that layer to solve the problem. What we actually did was fork it at that DCS point. This is a common ingestion endpoint that we created, and it allowed us to write to two places.

We spent a lot of time mirroring this data off to the OpenSearch cluster to figure out the right way to get it tuned: what was the right sharding strategy, what were the right instance strategies, could we leverage UltraWarm nodes, those types of things.

That took quite a bit of time and trial and error. It's also something Mukul touched on that might be helped now by the migration assistant. But this was something we spent a lot of time on, and we began to rightsize the previous cluster.

Again, one of the things we realized as we were doing this was that we were pretty massively over-provisioned. As opposed to just bringing up two clusters and testing them simultaneously, we wanted to bring one up as we brought the other down, to be as cost-neutral as possible.

Again, we wanted to be pretty protective of the game teams' money. So in order to reduce the spend, as we scaled up the new cluster, we scaled down the old one. What we did was take some of our lower-throughput tenants and migrate them over.

One of the really nice things about OpenSearch was that we didn't have to retrain those customers. Basically, we pointed them at a new URL. The OpenSearch UI was very familiar to anyone who had used the Kibana UI before, and we wrote some automation that migrated existing dashboards and existing user accounts over to the new system.

After we did a couple of UAT testing scenarios with some of our customers, we decided to move everybody else over. How we did that was basically to write to both clusters simultaneously for two weeks.

So really, after doing some downsizing and dual-writing to both clusters, we only had to pay for a smaller subset of two clusters simultaneously, which again helped save a lot of money there.

The last bullet point I'll call out here is the UltraWarm nodes. That was really one of the key ways we were able to save money. So this is the slide that I think calls out what's really important and what really matters here.

You can see how much this really reduced our cost. I left out some of the absolute values, but I'll say that the amount of money this actually saved us in our logging cluster is in the seven figures.

You can see how we transitioned away from a lot of EC2. You see this called out as "EC2 other"; that's mostly EBS volume spend. One of the things we were using was gp2 volumes, and going to OpenSearch allowed us to save a whole bunch of money on those legacy clusters.

Overall, we saved a whole bunch of money in absolute terms. The thing you don't see here is the TCO: the reduction in developer and DevOps time spent babysitting, scaling, and maintaining this cluster.

Since we did this, probably about seven or eight months ago now, we have not had one issue with this cluster. It has not gone down, we have not lost any observability on any of our titles, and it's just been rock solid for us.

I want to call out the last slide here: some of the challenges, opportunities, and the road ahead. Some of the things we want to do, some of the things we had problems with, some of the things you might face, and some of the things Mukul talked about that might make this better.

Rightsizing the initial cluster: I called out that there was a lot of guess, check, and refine going on there. It took us a long time to figure out the right starting strategy, the right instance sizes, and how big to make that cluster. Migrating users and dashboards at that time was something we had to write ourselves; we built that automation. Scaling today is much faster and easier.

We can scale in under an hour when we see these problems come through, but it's still something that's not quite automated for us. We get an alert when it's time to scale up, and we get an alert when it's time to scale down, but that's still a button somebody basically goes in and pushes today. That's something we're looking forward to improving, maybe with the serverless options going forward.

Monitoring hardware and watching for bottlenecks is still an issue. We're not as abstracted from the hardware as we would like to be. You still need to make sure that your CPUs are in good shape; that's basically how we scale. We try to keep our CPUs in that happy 50% spot. We're still looking at memory; we're still pretty close to the hardware.

Again, looking at some of the serverless offerings, that's one of the things we hope we can get away from in the future. And then some of the exciting things we're looking at are leveraging AI and ML to try to get more proactive about logging instead of reactive.

So overall, it's been a pretty positive experience for us. We saved a lot of money, and we're pretty excited to see what comes next in OpenSearch. Thank you for your time.

That was really exciting to see the savings that you all got from switching to OpenSearch. I've got the last five minutes, so I'll walk through some of the innovations in the Amazon OpenSearch Service. If you are part of the platform team and managing OpenSearch, that can be challenging.

So we've done a bunch of innovations to improve your life. We have improved operations on several fronts. We have Auto-Tune, which tunes your JVM and other settings based on the workload.

We have improved the self-healing capabilities within the service. Self-healing takes care of things like a node going down or rotating a certificate. There are a bunch of self-healing capabilities that we run automatically on your behalf.

We have also added some self-service capabilities. The ability to restart or reboot a node is now available in a safe manner, and that can help you get out of an issue quickly if you need to.

One of the things we have also introduced is off-peak windows. What that lets you do is define a time window where your traffic is the lowest, or where it's safe to deploy software, and the automatic updates that we do for security or other reasons will happen only in that window.

So if you are using OpenSearch in production, I highly recommend using that particular feature.
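A hedged sketch of turning this on for an existing domain with boto3; the domain name is a placeholder and the window start time is in UTC.

```python
import boto3

# Define a low-traffic window during which service-initiated updates may run.
opensearch = boto3.client("opensearch", region_name="us-east-1")
opensearch.update_domain_config(
    DomainName="my-domain",                     # placeholder domain name
    OffPeakWindowOptions={
        "Enabled": True,
        "OffPeakWindow": {"WindowStartTime": {"Hours": 2, "Minutes": 0}},
    },
)
```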

One of the challenges we hear from customers is that when a node goes down or when an AZ goes down, you can have issues with OpenSearch, and it's hard to maintain that 99.99% availability.

So early this year, we launched Multi-AZ with Standby, which gives you four-nines availability through a bunch of innovations: by reducing the amount of data movement that happens when an AZ or a node goes down, and by having a standby AZ which we automatically flip over to in case an AZ does go down. Or, if a node goes down, we can leverage that standby node to serve traffic.

In this way, you don't have any downtime. If you have a latency-sensitive application or an application that really requires this high availability, you should try it out, because in the events that we've seen happen, these clusters perform really well and have no issues.

So I highly recommend using Multi-AZ with Standby if you want that availability.
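For illustration, here is a rough sketch of what a Multi-AZ with Standby domain configuration might look like with boto3; the names, instance types, and sizes are placeholders, and the data node count is a multiple of three across three AZs, as the feature expects.

```python
import boto3

opensearch = boto3.client("opensearch", region_name="us-east-1")
opensearch.create_domain(
    DomainName="prod-logs",                      # placeholder domain name
    EngineVersion="OpenSearch_2.11",
    ClusterConfig={
        "InstanceType": "r6g.large.search",      # placeholder instance type
        "InstanceCount": 6,                      # multiple of 3 across 3 AZs
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m6g.large.search",
        "DedicatedMasterCount": 3,
        "MultiAZWithStandbyEnabled": True,       # one AZ acts as the standby
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
)
```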

We also just announced the OpenSearch-optimized instance family, OR1, yesterday. This instance type gives you 80% higher throughput and a 30% improvement in price-performance.

So if you have an indexing-heavy or write-heavy workload, you will see some of these benefits of higher throughput and lower cost.

One other key benefit I want to highlight is the high durability with this instance family. You can now have the durability of S3, because these instances are backed by S3.

So what is the innovation we did to get this? If you look at Elasticsearch and older versions of OpenSearch, the way replication works is that a document or a log gets indexed on one node, and the same document also gets indexed on another node, the replica.

So this indexing happens twice. Also, the data is not backed by S3 or any kind of cloud storage. As a result, if a node goes down, you are left with only the replica, so you need to make sure that you have the right amount of replication.

With this capability, instead of replicating the documents, we actually take the physical segments and push them to S3, and then the replicas download them from S3. That gives you an RPO of zero in case a node goes down, and you can just get the data back from S3.

It also improves throughput, because you're doing the indexing work only once and just downloading the physical segments from S3. So it's a pretty powerful instance type, giving you a lot of throughput at a lower cost.

We also launched serverless early this year. It lets you scale up and down based on your traffic, and you don't have to worry about sharding or about deciding on nodes. All you need to do is send data to the serverless endpoint, and serverless takes care of scaling up and down.

Looking at the architecture of serverless, we have decoupled storage and compute. We have also decoupled indexing and search: your indexing happens on a separate set of nodes, that data gets stored in S3, and then your search happens from another set of nodes, and those can scale independently.

We've done a few more innovations in serverless as well. Most recently, we launched the dev/test collection, so the starting cost in OCUs is even lower: you can start with a minimum of one OCU for indexing and one OCU for search.

We also announced vector engine support in serverless. Vector databases are becoming more and more common, and OpenSearch has had this capability for many years. With serverless support for the vector engine, you can now do your RAG workflows without having to manage any infrastructure.

You can just create a vector collection and start sending your vector data to OpenSearch Serverless, and again, it scales up and down. With the new dev/test collection, your starting cost is also really low.
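A minimal sketch of creating a vector search collection with boto3, assuming the dev/test option that disables standby replicas; note that encryption, network, and data-access policies also need to exist before the collection becomes active (omitted here), and all names are placeholders.

```python
import boto3

aoss = boto3.client("opensearchserverless", region_name="us-east-1")
aoss.create_collection(
    name="product-vectors",            # placeholder collection name
    type="VECTORSEARCH",               # vector engine collection
    standbyReplicas="DISABLED",        # dev/test option with the lower OCU minimum
    description="Vector collection for RAG experiments",
)
```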

So there's a bunch of innovation that has happened here, and it would be great for you to try it out.

So with all of these innovations, hopefully the life of someone who's managing OpenSearch has become easier. We've done innovations to help you reduce cost and operate your service at scale without having to worry about downtime.

So if you were to look at the key summary of this talk, there are four things to think about.

One: we have made it much easier to get data into OpenSearch. With our ingestion service and the zero-ETL capabilities we have announced, you can easily get data into OpenSearch. Two: with all the AI and ML capabilities and the integrations we have, you can do semantic search and hybrid search within OpenSearch and improve the relevance of your results.

You can also do multimodal search, so there's a lot of innovation in the search space. Three: we've also innovated a lot in the observability space. We now have zero-ETL integration with S3 that lets you query additional log data that you previously couldn't query.

You can leverage metrics and traces within OpenSearch, so in addition to logs, you now have metrics and traces within OpenSearch. And we have the new OpenSearch Assistant Toolkit that you can leverage to ask natural language questions.

And four: we have done a lot of innovation in the service itself. These innovations reduce cost and improve your operations, with Multi-AZ with Standby that gives you four nines of availability, as well as the OR1 launch that gives you higher throughput at 30% lower cost.

And finally, there are all of the innovations in serverless that help you avoid dealing with instances or sharding. So there's a lot of innovation across the board, and we're really excited for y'all to try it out.

Thank you. If you want to join the analytics superheroes, this is the QR code to scan. Thanks.
