Accelerate generative AI development with integrated data platforms

Hello, hello everybody, and welcome to the first and only generative AI talk we have at this event, right? Let me start with a number: 92%. That is the share of Fortune 500 companies that are already using generative AI to change how they work and how they speak to their customers. This is a quote from OpenAI's CEO: an estimated 92% of Fortune 500 companies are already using OpenAI to adopt generative AI.

What we will see in the next 20 minutes is how an integrated data platform can accelerate your journey to generative AI by removing obstacles along the way.

So, first of all, who am I, and why am I talking to you? I'm Francisco Desio, a staff developer advocate at a company called Yugabyte. But more than that, I'm a data aficionado, which means I love solving data problems with technology. And as you may have guessed, I'm also Italian, which means that I'm really, really lazy. So you can imagine how amazed I was when they told me that, instead of me solving problems, there was a tool out there that could solve all the problems for me. Crazy, right?

That tool is generative AI, and generative AI is already solving a lot of problems for a lot of people. My kids are using generative AI to do their homework; you can debate whether that's good or not. I'm using generative AI to make my own content better. I'm a public speaker and a writer, but I'm not a native English speaker or writer, so I use generative AI to make my content more readable and more understandable. Please let me know at the end of the session whether it works.

Other people are using generative AI to get the most common solution to a certain problem, or to start with a business problem and end up with working code. So it seems like generative AI is perfect. But is that true? Well, I wouldn't say perfect, but it's pretty good, as long as we ask about data that is in the public domain. Because generative AI learned from the public domain: it learned the most correct answer, for some definition of correctness, to the most common questions; it learned how we speak, how we interact, and what tone we use in our interactions.

However, when we need to adopt generative AI in our business, we need to take one more step. We cannot rely on public data alone; we need to start feeding it with private data. And here is the big, big problem. Let me ask you a question: how many of you have all your datasets in just one, very well designed, unique data solution? I believe that, apart from my colleague, no other hands were raised, because most of the time, if you're lucky and your data is well organized, you are in a situation like the one in the image: a huge amount of data assets scattered across a very different set of tools, each solving a specific function in your company. You will have one tool for analyzing the clickstream of your website, another tool to store information about your users, and so on.

Our work as humans is to understand where the good parts of the information are, and to start collecting and unifying them from all our various data assets. But then comes the other problem: what are we feeding generative AI with? Because in my opinion, AI is just a very, very good parrot: it will repeat whatever we provide to it.

The problem with private data is that our data assets could contain commercially valuable data, sensitive data, or regulated data. So if we just open the Pandora's box of our data assets and shovel everything into generative AI, we risk exposing data that we shouldn't expose. Even if we don't, we risk creating something useless, because we bury the good part of our data assets in the huge amount of garbage data that every company has.

And that's only the technical limit. If we just take whatever we have in house and provide it to AI, we also risk exposing personal information and regulated data. The reality is that you don't want to put your company at risk, or, if you hold information about European customers, end up like Meta, which received a 1.2 billion euro fine for not being GDPR compliant. This could happen if you don't think carefully about what you are feeding AI with.

So, after probably scaring you about generative AI, let's try to understand what we need to make it work. The first thing: everything is based on data. We need data to make AI work, and we probably already have tools in our technological tool belt that contain the data we need. Wouldn't it be cool if the same tools we use for day-to-day operations could also speak the language that generative AI speaks? It would, because we could mix the queries we usually run on the operational side with the vector search that sits behind generative AI, or any AI. Because the language generative AI speaks is not images, not text, not voice. Every time you give an input to generative AI, or receive an output from it, the AI transforms that input or output into numbers, because all computers can do is talk in numbers. It uses what are called embeddings: long vectors of numbers that translate any input or output into a series of numbers. It then looks for other similar vectors nearby, and that becomes the next suggestion, the next iteration of generative AI.
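To make the embedding idea concrete, here is a minimal sketch in Python. The vectors are toy values standing in for the output of a real embedding model (which would produce hundreds of dimensions); the point is only that "similar meaning" turns into "high cosine similarity between vectors".

```python
# Minimal sketch: texts become vectors, and "similar meaning" becomes
# "small angle between vectors". The vectors below are hypothetical toy
# values, not the output of any real embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question     = np.array([0.9, 0.1, 0.4, 0.0])  # "how do refunds work?"
doc_refunds  = np.array([0.8, 0.2, 0.5, 0.1])  # billing documentation
doc_holidays = np.array([0.1, 0.9, 0.0, 0.7])  # unrelated content

print(cosine_similarity(question, doc_refunds))   # high: good suggestion
print(cosine_similarity(question, doc_holidays))  # low: not relevant
```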

If we had tools in our tool belt that could do both the operational part and the embeddings part, we could, in the case of a chatbot, not only suggest to a customer the most relevant resources based on their question, but also query, for example, the recent steps that customer has taken, in order to give a solution that is relevant not only to the question itself, but also to their previous steps on our platform. That means mixing vector search with the standard queries we use in day-to-day operations.
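As a hedged sketch of what that mix could look like, here is PostgreSQL with the pgvector extension queried from Python. The table and column names (support_articles, customer_events, embedding) and the connection string are hypothetical; `<=>` is pgvector's cosine-distance operator, so ordering by it returns the semantically closest articles, while the join and WHERE clause are plain operational SQL.

```python
# Hypothetical schema: support_articles(title, feature, embedding vector)
# and customer_events(customer_id, feature, occurred_at).
import psycopg2

conn = psycopg2.connect("dbname=support user=app")  # placeholder DSN
cur = conn.cursor()

question_embedding = "[0.9, 0.1, 0.4, 0.0]"  # would come from a model

# Vector search (ORDER BY ... <=>) restricted by an operational filter:
# only articles about features this customer touched in the last hour.
cur.execute(
    """
    SELECT DISTINCT a.title, a.embedding <=> %s::vector AS distance
    FROM support_articles a
    JOIN customer_events e ON e.feature = a.feature
    WHERE e.customer_id = %s
      AND e.occurred_at > now() - interval '1 hour'
    ORDER BY distance
    LIMIT 3;
    """,
    (question_embedding, 42),
)
for title, distance in cur.fetchall():
    print(title, distance)
```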

But solving the problem in a single tool is not enough. We have various tools and various data assets. So we need to harvest the data from the various places and bring it together; and not only bring it together, but also reshape it, because we need, for example, to exclude PII so we don't expose it to the outside. And I want to make a point here: I believe the harvesting should happen in real time. If you are approaching generative AI, you probably want to customize the experience for your users. And if you spend time, energy, and money building something that knows your product and talks to your customers the way you do, you don't want that tool to suggest, for example, a feature of your product that is no longer available, or that worked differently two weeks ago. The real-time component lets us give generative AI context about what the user has been trying to do recently: not two days ago, not five days ago, but now.
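A minimal sketch of that harvesting step, assuming a Kafka topic called "clickstream" carrying JSON events (both hypothetical) and the kafka-python client:

```python
# Consume clickstream events seconds after the user acted, so the
# context we later hand to the model is about "now", not last week.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "clickstream",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Downstream: reshape, strip PII, and forward as model context.
    print(event.get("customer_id"), event.get("action"))
```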

So after we address the data, after all our data tools can talk in embeddings, it's time to start interacting with generative AI. With a pre-trained model we have two options: fine-tuning or prompt engineering. With fine-tuning, we basically teach the AI about our business in general: we upload, for example, all our technical documentation and all our interactions with our customers, and the AI learns how we speak and what our product is about. But the interesting bit, for me, comes with prompt engineering, because with prompt engineering, in the case of a chatbot, we are inside an interaction with a specific customer, so we can pass information about who that customer is and what is happening right now. If a customer asks, "How does product A work?", instead of just listing the details of product A, we can inject the fact that the same customer tried to use feature X two minutes ago and maybe received an error, and give a much better, customized reply. By adding a lot of recent, near-real-time context, we give much more relevant answers.
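Here is a hedged sketch of that prompt-engineering step; the event structure and the wording of the template are hypothetical:

```python
# Enrich the customer's raw question with near-real-time context
# before it reaches the model. Field names are hypothetical.
def build_prompt(question: str, recent_events: list[dict]) -> str:
    """Prepend the customer's recent activity to their question."""
    context = "\n".join(
        f"- {e['minutes_ago']} min ago: {e['action']} -> {e['result']}"
        for e in recent_events
    )
    return (
        "You are a support assistant for our product.\n"
        f"The customer's recent activity:\n{context}\n\n"
        f"Customer question: {question}\n"
        "Answer with their recent activity in mind."
    )

print(build_prompt(
    "How does product A work?",
    [{"minutes_ago": 2, "action": "used feature X", "result": "error"}],
))
```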

So after all this theory, let's try to understand how we connect the pieces. Let's start from the end: the end is the end user, and generative AI will probably be the user's first interaction with our company. What is on the completely opposite side? On the opposite side we have our datasets, which could be operational databases, NoSQL stores, search databases, a variety of different data tools.

What do we need in the middle? We need something that can harvest and unify all these data assets, and it needs to work in near real time. So we probably need a streaming data pipeline that can talk with all our tools, and maybe a little more. But if we just collect this information, we risk exposing something we shouldn't. So on top of that we need some sort of computational engine that allows us to reshape the data: to remove PII fields, to aggregate the data, to analyze not a single event but a sequence of events, one after the other.
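Here is a minimal sketch of that reshaping step in plain Python; the PII field names and event structure are hypothetical, and in practice this logic would live in something like a Flink job:

```python
# Drop PII fields and collapse a sequence of events into one aggregate,
# so the model sees the user's recent behavior but never their identity.
PII_FIELDS = {"email", "full_name", "phone", "ip_address"}

def redact(event: dict) -> dict:
    """Remove fields that must never reach the model."""
    return {k: v for k, v in event.items() if k not in PII_FIELDS}

def summarize(events: list[dict]) -> dict:
    """Aggregate a list of events instead of forwarding each one."""
    clean = [redact(e) for e in events]
    return {
        "customer_id": clean[0]["customer_id"],
        "actions": [e["action"] for e in clean],
        "errors": sum(1 for e in clean if e.get("status") == "error"),
    }

events = [
    {"customer_id": 42, "email": "a@b.com",
     "action": "open_feature_x", "status": "error"},
    {"customer_id": 42, "email": "a@b.com",
     "action": "retry_feature_x", "status": "ok"},
]
print(summarize(events))  # note: no email in the output
```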

And then, after connecting our computational engine with the streaming data pipeline, we need to close the gap: we either upload the data to fine-tune the model, or we inject context via API with prompt engineering.
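On the prompt-engineering side, closing the gap can be as simple as an API call. A hedged sketch using the official openai Python client (version 1.x); the model name is a placeholder:

```python
# Send the context-enriched prompt to a hosted model over its API.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "The customer tried feature X two minutes ago and got an error.\n"
    "Customer question: How does product A work?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```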

So after I've been talking for 30 minutes, this seems straightforward, right? One day of work and everything is done. It's a little more complex than that, because it requires a lot of data steps and a lot of integrations. And that is usually the problem.

What if I told you that you could solve pretty much all the data-related needs, and just focus on defining what you want to expose and how you want to expose it? You can do that with Yugabyte, the company I work for.

We are the trusted open source data platform for everyone. What we provide is the best of the open source tools in a very well integrated data platform. Do you have a data need? We cover it; we have everything you need. We go from operational databases like MySQL and PostgreSQL (and by the way, PostgreSQL can speak the embedding language that generative AI speaks, and the same is true for OpenSearch on the search side). If you need to analyze massive amounts of data, we have ClickHouse. And then we have tools like Kafka and Flink that connect the various pieces of information together in near real time, with Flink letting us transform the data along the way.

So if we go back to our design, we can see that Yugabyte fits almost all the places. But again, point solutions are not enough; giving you an operational database is not enough. What you want is something that is well integrated. What we did at Yugabyte is provide best-in-class integrations, built on best-in-class open source tools like Kafka Connect. We also put a lot of work into simplifying how you define connections and how you create connections that scale as much as you need.

So we created integrations between the services, and not only between services within Yugabyte: if you need, for example, a Lambda function to react to every Kafka message, we've got you covered. There are prebuilt integrations that make this easy to set up, scalable, and trusted.

The trusted bit is very important, and it's not only me saying it: we have a ton of customers who have been using Yugabyte for their production workloads for several years. Some of these customers cannot operate if Yugabyte doesn't work, and it always works well. So I believe we are doing quite a good job of keeping everything up and running.

I now want to focus on three customers that are already using Yugabyte for their AI journey. The first one is IXL Education. IXL is helping 120,000 students and their teachers; it is using Yugabyte and AI to help teachers understand where students are struggling and to create customized content so students have a better learning experience.

Labels is using a variety of tools within Yugabyte, including OpenSearch, and is doing exactly the kind of thing I described before: mixing text search with vector similarity search to let the customer automate their ticketing system. When they automate their ticketing system, they can resolve tickets by providing content that is not only similar word by word, but also coherent and relevant in context, thanks to the vector search.

Finally, Sway.ai is what I would call an AI middle player, because Sway.ai is using Yugabyte and AI to help its clients build better and easier AI models. So for them, Yugabyte is not only the trusted open source data platform for everyone; for them, Yugabyte is the integrated AI backbone: it's what takes the data from where it is to where it's needed.

So if you want to know more, if you want to talk about this, look for anybody wearing a crab jacket or a crab t-shirt like mine. We are at booth 162 tonight, at the very end of the corridor behind you.

And now I want to close with the same number I started with: 92% of Fortune 500 companies are already using generative AI to change how they work and how they speak to clients.

If you want to start your generative AI journey, I don't think you need to spend time setting up a database or figuring out how to integrate a streaming data platform with a database at scale. You can get all of that with a trusted, scalable, open source data platform like Yugabyte.

If you scan the QR code now, you can start our trial, which gives you $300 for one month to try all the services we offer.

And I'm now happy to answer all your questions. Thank you very much.
