Build & deploy generative AI apps in days with Databricks Lakehouse AI

Thank you for joining this session on building generative AI applications. My name is Inna Colliver. I'm part of the product team at Databricks, and today I'm joined by Bradley Ason, who's part of Block - you may know them as Square. He's part of the engineering and machine learning team, and later in the session he's going to share more about their journey in bringing gen AI to production.

So we have a lot of content for you today. We're going to start by defining what data-centric generative AI is, then go through some of the most widely used techniques at the moment - augmentation, fine tuning, and pretraining your own models. And then we're going to jump into the real-life use cases that Bradley is going to share at the end of the session.

So it's been a really, really crazy year for us at Databricks. The usage of transformers, the underlying technology of LLMs, has been growing immensely. We're really excited not only to see POCs being built on Databricks, but also folks bringing those to production and actually proving direct business impact.

We're working with many, many customers at the moment. The use cases span industries, serving internal and external end users - from processing internal documents, HR documents, policy documents and bringing those insights to internal users to boost productivity, all the way to virtual assistants that serve end customers. It's been a really, really exciting year for us.

But those customer conversations really have one commonality: the industry is transforming, generative AI is everywhere, and generative AI is here to stay. How do we stay ahead of the competition? How do we stay relevant? How do we provide differentiating customer experiences leveraging LLMs?

And there's really good news and bad news. The good news is that the techniques used to build those models are getting commoditized. In the last year, the rate of LLM releases has been really, really rapid. We've seen third party vendors releasing models more or less every month. The competition in the proprietary sector is intense, and prices have been dropping by as much as 10x - we saw a massive reduction of prices a couple of weeks back at OpenAI's DevDay. So competition there is pretty fierce.

What is more interesting and more exciting to us is that the open source community is getting together, sharing techniques and releasing models that are now either on par with, or in some use cases exceeding, the quality that you can get from proprietary models.

On top of that, the tools you need in order to implement those techniques - augmentation, fine tuning, pretraining - are becoming generally available. So in a world where innovation is coming at you this rapidly, when you have all of those tools, unfortunately your competitors have all of those tools too.

So it's a question of how do you build experiences? How do you build use cases for your end customers that can differentiate you from the competition?

Well, at Databricks - and we believed this for prior generations of AI as well - we believe that your competitive advantage is your data. Your competitors don't have access to that data. That data is relevant to your users and can unlock the QA bots and the content generation use cases that will put you ahead.

So if data is so obviously at the core of your generative AI strategy, why do so many enterprises struggle to implement this at large scale?

Well, the thing is that in large enterprises, what happens is that there's procurement of a lot of different software for a lot of different purposes. You need a data warehouse where you can put your structured data, you're buying BI software for your executive dashboard, you might have data science, machine learning software for specific users that is used in a specific geography in your organization. All of those silos end up decoupling your data sources.

And what's happening is that you have multiple systems that keep data in formats that differ from each other. They have different governance frameworks, and you end up spending time and resources patching them together and uniting those components.

And what is even more evident with generative AI is that some of those security, privacy and data residency concerns are getting raised more and more. And on top of that, you need highly skilled technical personnel in order to actually unite those systems together.

At Databricks, we believe in a world where generative AI and classical ML are part of your data platform and not set up as a separate silo.

So if we think about building generative AI applications, they consist of more or less three stages:

  1. Getting good quality data sets and preparing those data sets.

  2. Choosing and tuning really good models.

  3. Putting those models in production and deploying them in applications that can be monitored.

All of those systems should not sit outside of your data platform. They should sit inside of your data platform and have a common governance framework around them so that you don't need to spend the extra time, extra effort in order to unite those platforms.

I want to give you some examples. Let's start with the data. Typically what you would see is a data platform that may store your featurized data sets - structured or unstructured data. And for AI and gen AI use cases, you may find yourself using an AI platform that is separate from your data platform and bringing those data sets into systems sometimes called a feature store or a vector store, etc. This leads to data duplication, inconsistency, and non-unified monitoring.

We believe that the right way to architect this is to unite the data platform and the AI platform and have common tooling for both. We see a world where every table should be a feature store table. You shouldn't need a separate feature store, and you shouldn't need to hydrate data into one. You should have all of your tables available to your ML use cases.

And why should it be any different to put that data behind a semantic search index - a vector search index? It shouldn't be. We released two functionalities for this earlier in the year: Vector Search and Feature Serving.

Vector Search is our production grade, native semantic search index within Databricks. It allows you to take any Delta table and serve it as a semantic search index. It is extremely powerful for retrieval augmentation applications, and we're going to give you an example and dive deeper into how it works.

The second feature we're really, really excited about is feature serving. What we saw is that for generative AI, it's not only that you need to make unstructured data available to your applications, you also need to make structured data available. So how do you serve structured data and on-demand computation at low latency? We're gonna dive into that as well.

To make it a little bit more real - where Vector Search and Feature Serving come into play - here's a use case:

Let's say we have a support chatbot and my end customer is asking "Please, I want to return my last order." What happens behind the scenes is that we may actually be looking into online features in order to figure out what the last order was and that might be stored somewhere in a table.

Also, our return policies might be stored in documents which are indexed behind a vector search endpoint. So we are looking into vector search, finding the relevant information and retrieving those chunks.

Then what we're doing is hydrating that information into a prompt, which we then send to an LLM. That LLM is really, really good at using the information that we provided to it, summarizing it and giving our end user an answer, which in this case might be "Hey, since your last order was in the last 30 minutes, we can cancel it for you" - because, let's say, that was the guidance in our return policies.
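To make the flow concrete, here is a minimal, hedged sketch of that prompt-hydration step. The helper functions, endpoint behavior, and data shapes are illustrative assumptions, not the exact Databricks APIs.

```python
# Illustrative sketch of the support-chatbot flow described above.
# The two lookup helpers are hypothetical placeholders with dummy return values.

def lookup_last_order(customer_id: str) -> dict:
    # Placeholder: in practice this would call a feature serving endpoint
    # that returns structured, online features for the customer.
    return {"order_id": "A123", "minutes_ago": 25}

def search_return_policies(question: str, k: int = 3) -> list:
    # Placeholder: in practice this would query a vector search index over
    # policy documents and return the k most relevant text chunks.
    return ["Orders placed within the last 30 minutes can be cancelled."]

def answer_return_question(customer_id: str, question: str, llm) -> str:
    order = lookup_last_order(customer_id)            # structured, online feature
    policy_chunks = search_return_policies(question)  # unstructured, retrieved chunks

    # Hydrate the prompt with both sources of context before calling the LLM.
    prompt = (
        "You are a support assistant. Use only the context below.\n\n"
        f"Customer's last order: {order}\n\n"
        "Relevant return policies:\n" + "\n".join(policy_chunks) + "\n\n"
        f"Customer question: {question}\nAnswer:"
    )
    return llm(prompt)  # `llm` is any callable that sends the prompt to a model endpoint
```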

So this is one of the examples for usage of feature serving and vector search. Let's dive deeper into each one of them.

Vector Search:

So what we heard from customers is multiple challenges:

  1. Freshness - it's amazing that we can put a knowledge base behind a semantic search endpoint, but my knowledge base isn't static. I update those documents, I have new documents coming in, and I need this to be reflected within my application.

  2. Model flexibility - folks are trying different embeddings models. They don't want to depend on one third party vendor embeddings model. They want to have the flexibility to change those.

  3. Being able to have this within a managed experience. You don't want to DIY the pipelines that are going to sync your knowledge base with the semantic search index. You want this being done for you.

  4. Performance - we're talking about use cases that require single millisecond retrieval of relevant information. So that is obviously very, very important.

So based on those challenges from customers, here's how Vector Search works:

At the moment, you can take any Delta table from Unity Catalog and you can run a very simple API call to sync this behind a vector search index. You can do this through the UI. I'm just showing you the API because it's very clear what kind of attributes you need to set.

I want you to note that there is an AI Gateway route name that basically points to your embeddings model of choice and allows you to swap that embeddings model if needed. This is powered by MLflow AI Gateway, which is a proxy that gives you a stable API for communicating with external model vendors that may have different API specs.

So once you run the sync API on the Delta table, we're going to spin up an endpoint for you, and you can already start to query the semantic search index with very simple APIs.

What happens here is that when new data is added to your Delta table and any changes are being done on the Delta table, those are automatically synced behind the index. You don't need to do anything. We have a pipeline under the hood that does that for you.

Those are the sync APIs; you also get direct APIs which you can use to insert or delete vectors if you want to do that.
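As a rough sketch of what that sync call looks like with the Python client (the endpoint, index, and table names below are illustrative, and the exact client arguments may differ across versions of the `databricks-vectorsearch` package):

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Sync a Delta table in Unity Catalog behind a vector search index.
# The embedding model is referenced by an endpoint/route name, so it can be swapped later.
index = client.create_delta_sync_index(
    endpoint_name="docs-vs-endpoint",                   # illustrative endpoint name
    index_name="main.docs.return_policies_index",       # illustrative Unity Catalog name
    source_table_name="main.docs.return_policies",      # the Delta table to sync
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="my-embeddings-route",
)

# Query the index once it is online.
results = index.similarity_search(
    query_text="What is the return window for online orders?",
    columns=["chunk_id", "chunk_text"],
    num_results=3,
)
print(results)
```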

So that's Vector Search.

Feature Serving:

This is to do with managing structured data within your generative AI applications. There are essentially two types of structured data operations that you want to be doing:

  1. Retrieving already calculated materialized features - that will be the classic feature store experience.

  2. Actually having some data available only at the time of inference - you're calculating a feature on demand, using what we call a user defined function that performs that on-demand computation.

Now, you can do that with Feature Serving. The way it works is that you have your raw data in a Delta table, and you have your featurization logic - how is this feature defined? You put that logic into what we call a User Defined Function, which could be in Python or SQL.

And then all you need to do is create that Feature Serving endpoint. So you're defining that function - you can see it in the screenshot here - and this function is now an entity within Unity Catalog. It basically captures the logic by which you're calculating that feature, and you serve that function behind an endpoint.

So now all of your applications can get access to that on demand computation feature.
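Here is a hedged sketch of that pattern using the Databricks feature engineering client; the catalog, table, function, and spec names are illustrative, and the exact client surface may differ slightly by version:

```python
from databricks.feature_engineering import (
    FeatureEngineeringClient,
    FeatureFunction,
    FeatureLookup,
)

fe = FeatureEngineeringClient()

# A feature spec combines materialized features (FeatureLookup) with
# on-demand computation captured as a Unity Catalog UDF (FeatureFunction).
fe.create_feature_spec(
    name="main.support.order_context_spec",                 # illustrative spec name
    features=[
        FeatureLookup(
            table_name="main.support.customer_orders",       # precomputed features in a Delta table
            lookup_key="customer_id",
        ),
        FeatureFunction(
            udf_name="main.support.minutes_since_last_order",  # Python/SQL UDF in Unity Catalog
            output_name="minutes_since_last_order",
        ),
    ],
)

# The spec is then served behind a low-latency Feature Serving endpoint
# (created via the UI or the serving API), which applications query at inference time.
```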

So these are the two services that we're really, really excited about for data preparation for generative AI.

I want to jump into models. So we mentioned that data is extremely important when it comes to building your generative AI strategy. But how do you leverage that data? There are essentially a few ways to do that:

  1. Retrieval augmentation - retrieval augmentation means that you hydrate your prompts with custom data and provide that to the LLM. There's no change to the model weights at all. The positive here is that it's really simple to start with, actually covers a very large percentage of use cases, and doesn't require in-depth NLP knowledge.

  2. Fine tuning - fine tuning is an extremely powerful technique. It allows you to get much, much better model performance in your specific domain or on the task your use case is focused on. Fine tuning is extremely important when it comes to providing low latency experiences, because you can achieve comparable performance with much smaller models. We're going to dive deeper into that later.

  3. Pretraining - pretraining, unlike fine tuning, doesn't start from a model checkpoint; it actually trains the model from scratch. Pretraining is very powerful when you need full control over the data sources the model is based on.

And obviously all of these are combined with prompt engineering, which is relevant in each of those cases.

So the way we think about retrieval augmentation, fine tuning and pretraining is as a sort of maturity curve. And what is very important to mention is that they are not mutually exclusive. You can have a retrieval augmentation application that uses an embedding model fine tuned on your data, which then hits an LLM that is fully pretrained on your data. These are not mutually exclusive - we actually recommend that you do all of them and walk the maturity curve.

What is important also to mention is that you can start thinking about evaluation of your LLM applications sort of in the scientific method. You want to establish a baseline. And then by introducing new techniques like retrieval augmentation, fine tuning and pretraining, trying to beat that baseline.
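As a small illustration of that baseline-first approach, here is a hedged sketch using MLflow's evaluation API; the dataset and model are made up, and the exact metrics computed depend on your MLflow version and installed packages:

```python
import mlflow
import pandas as pd

# A tiny, made-up evaluation set of questions with reference answers.
eval_data = pd.DataFrame(
    {
        "inputs": ["How do I return an order?", "What is the return window?"],
        "ground_truth": ["Use the returns page in your account.", "30 days from delivery."],
    }
)

def baseline_model(df: pd.DataFrame) -> list:
    # Placeholder: call your current model/endpoint for each input row.
    return ["..."] * len(df)

with mlflow.start_run(run_name="baseline"):
    results = mlflow.evaluate(
        model=baseline_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)

# Repeat the same evaluation for each new technique (RAG, fine tuning, ...)
# and compare runs in the MLflow UI to see whether you beat the baseline.
```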

Here are the services that we have at Databricks which align to each of those techniques:

  • Vector Search
  • MLflow AI Gateway
  • Feature Serving

And I'm going to jump now towards fine tuning embeddings models.

What does that mean? What does the embeddings model even do?

So the embeddings model basically takes your raw data - your document content chunks - and converts them into vectors, positioning those vectors within a multidimensional vector space. So when you're doing a similarity search, you're basically doing an approximate nearest neighbor search on that content.
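To make that concrete, here is a tiny, self-contained sketch of the idea using an open source sentence-embedding model; the model name is just a common example, and in a real deployment the embeddings model would sit behind your serving endpoint:

```python
from sentence_transformers import SentenceTransformer, util

# Any embedding model maps text chunks to points in a vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example open source embedding model

chunks = [
    "Orders can be returned within 30 days of delivery.",
    "Serverless compute scales automatically with demand.",
]
query = "How long do I have to send an order back?"

chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Similarity search = find the chunk whose vector is closest to the query vector.
scores = util.cos_sim(query_vec, chunk_vecs)[0]
best = scores.argmax().item()
print(chunks[best], float(scores[best]))
```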

Now what's tricky is that generic embedding models don't really know your terminology. What we have on the screen here is an example from a software company - Databricks, actually. We use the term "NEOs" internally for our projects around serverless, and for us a warehouse is a data warehouse. But for a retailer, a warehouse means something completely different. So terms that are semantically similar - similar by meaning - for us would not be similar by meaning for a retailer.

So a generic embedding model really can't represent that data adequately in the vector space. Fine tuning your embedding model allows you to better represent those points in the multidimensional vector space and improve the recall of retrieval very significantly. We've actually found that retrieval is one of the major factors in the performance of retrieval augmentation applications.

The way we do this right now within AutoML in Databricks is a low code experience which allows you to pick a model of your choice - an embedding model or a model to fine tune - select your data set, and kick off an AutoML job. AutoML on Databricks is a glass box experience, which means that for any model produced, all the artifacts are fully available to you. You can take that model and serve it anywhere, or serve it on Databricks if you would like to. You can see all of the notebooks, etc. So there's no lock-in to serving it on the platform.

So that was about fine tuning your embeddings model and how that improves retrieval augmentation performance. What about fine tuning the foundational models? We heard this in the keynote as well - why do folks want to fine tune foundational models?

The first reason is that fine tuning can significantly improve a model's awareness of a certain domain or a certain task, and can give you on-par or even better model performance than a generic model, which obviously doesn't know anything about the data in your organization.

Fine tuned models are often much, much smaller than large language models in the realm of the GPTs. Because the smaller models are much lighter, the inference compute required to serve them is much, much cheaper. So you can actually reduce cost significantly because you're working with a much smaller model.

Because the model is smaller, it actually allows you to reduce the latency of inference. This could be a big, big blocker if you're using very big large language models. And there's been incredible innovation in bringing the latency of those models down. But fine tuning a model like this can help you reach satisfactory latency.

And the fourth reason is keeping your LLM intellectual property in house. You really don't want to be in a situation where you have fine tuned a model on a platform and you're not able to take that model away and serve it anywhere you want.

So how does fine tuning work currently on Databricks? We're providing folks with the following tools. You can start with any Delta table, or take your data from your cloud storage. You can choose a model from our model gallery or model marketplace and load it into Unity Catalog with a few clicks. You can then call an API, or use the user interface, to start a fine tuning job.

Those models are your IP - it's your data and your IP. You can take those models out, or you can serve them on Databricks if you would like. That was fine tuning; now on to pretraining. Why would you want to own your model? Well, large language models work in a way that can expose a lot of the data they've been trained on, and you might have seen this in some of your experiences with public models like GPT.

Pre-training from scratch, not starting from a checkpoint, can allow you to have full control over the data sources that your models are trained on. That could be quite overwhelming because it means that you need a lot of data to do that. But if you carefully curate the open source data sets that you believe are approved from your organization for that particular use case, you can actually gather enough tokens to train a model from scratch. This gives you full control over the intellectual property as you would own the weights, full privacy, and you can pay less as those models change over time.

You may have heard about the recent acquisition of MosaicML. MosaicML has built technology for training those models at very, very low cost. This is a very simple view of the MosaicML architecture: you have a client interface which you operate through a CLI or UX experience, and you have a control plane in which you configure your runs. This is done via YAMLs at the moment.

When jobs are submitted, they run in a highly optimized compute plane in the Mosaic ecosystem. They pull your data from cloud storage, streaming it directly from the cloud, and feed it to the model. We've seen up to 7x faster and cheaper model training using that technology.
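As a rough illustration of that streaming-from-cloud-storage pattern, here is a hedged sketch using MosaicML's open source streaming library; the bucket path and parameters are illustrative and may differ by version:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset  # MosaicML's open source streaming library

# Shards are read directly from object storage and cached locally,
# so training nodes never need the full dataset on disk.
dataset = StreamingDataset(
    remote="s3://my-bucket/pretraining-shards",  # illustrative path
    local="/tmp/streaming-cache",
    shuffle=True,
)

loader = DataLoader(dataset, batch_size=8)

for batch in loader:
    # Each batch is fed straight into the training loop of the model.
    pass
```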

And really this is about having full control over your data, full control over your model, and being able to pick it up and take it off the platform if needed. So that was AutoML, fine tuning foundational models, and MosaicML.

So let's jump into production and operationalization of those applications. What we hear from customers is that in the process of going from POC to production, they actually want to try many different LLMs. They want to try open source LLMs, they want to fine tune an LLM and try that. As I mentioned, it's the scientific method - they take an iterative approach.

So "how do I discover what I can use?" and "which vendors can I use within the organization?" are questions we hear often. Another problem we hear from customers is: we are experimenting with different LLMs, but we have created an ops infrastructure, an LLM ops setup, around that. And because these different vendors have different API specs, we actually need to change more than one component in that architecture.

We need to very easily swap those models, but we're working with so many different vendors. How do we do that? How do we get to have the choice?

The third thing is that folks are dealing with multiple vendors, multiple choices, open source, proprietary, etc. We've heard it many times before: we don't want to be managing our API keys, and we don't want to be accidentally pushing our API keys to GitHub. So how do we securely store credentials centrally, and how do we centralize key management?

And the fourth thing is, if you're dealing with multiple vendors, how do you monitor the costs associated with each one of them? This is where MLflow AI Gateway comes into play.

So MLflow AI Gateway acts as a proxy to LLMs that could sit inside Databricks but also externally. You can set up an AI Gateway route, which has a stable API, connect the rest of your architecture - the rest of your LLM ops - to it, and have the flexibility to swap the model it points to.

Here's the code so that it's more clear. What you do is create a route for MLflow AI Gateway which points to your external model vendor, and you provide your API keys. Now you have this inside Databricks, and all you need to do is build the rest of your infrastructure pointing to that route. That is a stable route that you can point at external models - you can point it at Bedrock, you can point it at a Databricks-served model, choose your weapon.
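Here is roughly what that looks like with the MLflow gateway client - treat it as a hedged sketch: the route name, provider config, and secret reference are illustrative, and the exact fields have evolved across MLflow versions:

```python
import mlflow.gateway

mlflow.gateway.set_gateway_uri("databricks")

# Create a stable route that proxies to an external model provider.
# Applications call the route name; the provider behind it can be swapped later.
mlflow.gateway.create_route(
    name="my-completions-route",            # illustrative route name
    route_type="llm/v1/completions",
    model={
        "name": "gpt-3.5-turbo",
        "provider": "openai",
        "openai_config": {"openai_api_key": "{{secrets/my_scope/openai_key}}"},
    },
)

# Query the route through the same stable API regardless of the provider behind it.
response = mlflow.gateway.query(
    route="my-completions-route",
    data={"prompt": "Summarize our return policy in one sentence."},
)
print(response)
```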

So that's MLflow AI Gateway. Okay, let's look at the flavors of models that you can use. The first flavor is models that are completely customized. These are Python models - they might be models taken off the shelf from Hugging Face - that you deploy behind Model Serving as a REST API endpoint, on CPU or GPU. This is where your classical models would be served for generative AI purposes, and this is also where your LLM chains are served.

The second flavor of model serving is the foundational models that I mentioned. These are basically routes to third party platforms, or your fine tuned foundational models that you're serving on Databricks. You don't need to set up hosting for those, and you can start playing around with them immediately.

The third one is external models. Now, we noticed that customers have all of those flavors of models within Databricks. And what we heard is: we have models served on Databricks, we're also using models where we have a POC with OpenAI, and we also want to swap that OpenAI usage to Anthropic's Claude. We have multiple use cases with multiple flavors of the models we're using.

How do we get central monitoring for those models? What you're going to see very soon - it's currently in preview, and you can see it on the Serving tab in Databricks - is that we are unifying all of those within Databricks Model Serving. You will have a single pane of glass where you can see your models served on Databricks, your external models, which might be pointing to Bedrock, OpenAI, etc., and your foundational models, and you'll have a single API experience, a single SKU for those models, and so on.

So this unification is happening on Databricks. Something we're really, really excited about is serving of foundational models on Databricks. We now automatically optimize your foundational models when you serve them on Databricks. What that means is that for certain curated model architectures - and we're updating this list on an ongoing basis - when deployed, you automatically get an optimized instance of the model. This leads to lower latency and lower cost.

I talked about Model Serving unifying different flavors of models. What is Lakehouse Monitoring? Lakehouse Monitoring - and you can see it says data and model monitoring - is actually not an ML-specific functionality. It's both a data functionality and an ML functionality.

We came to the realization that data engineers and ML engineers shouldn't have two different stacks for monitoring - they should have one single stack. So Lakehouse Monitoring is a production grade monitoring solution for both your tables - your structured data - and your models.

The way we do this is that we have our models output data to what we call inference tables, and then we monitor those tables with the exact same tooling that we monitor other tables in Databricks. You can define alerts within Lakehouse Monitoring. We actually compute some default metrics for you already and build you a dashboard that shows them, but if you want custom metrics, you can define those and set proactive alerting on them.
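A hedged sketch of setting up a monitor on an inference table with the Python API follows; the table, schema, and column names are illustrative, and since the API was in preview at the time, exact arguments may differ:

```python
from databricks import lakehouse_monitoring as lm

# Monitor the inference table that a serving endpoint writes its requests/responses to.
lm.create_monitor(
    table_name="main.serving.chatbot_inference_log",   # illustrative inference table
    profile_type=lm.InferenceLog(
        problem_type="classification",                 # or a type appropriate to your model
        prediction_col="prediction",
        timestamp_col="request_time",
        model_id_col="model_version",
        granularities=["1 day"],
    ),
    output_schema_name="main.monitoring",              # where metric tables are written
)
# Default metric tables and a dashboard are generated; custom metrics and
# SQL alerts can be layered on top of the generated metric tables.
```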

And all of your data consumers and all of your deployed endpoints use Lakehouse Monitoring system tables in order to do this. Here's an example of the dashboard that we would build for you.

Why is it so important to have a single monitoring solution for data and ML? The simple answer is root cause analysis. If your data quality solution is separate from your ML monitoring, the data lineage and the ML lineage are broken. So when something goes wrong at a served ML endpoint and it's caused by an issue in the data, you cannot jump back and figure out which table had more zeros than expected, because you don't have a single lineage.

So this is what Lakehouse Monitoring allows you to do: it keeps the data lineage and the ML lineage connected, because the observability layer is the same - there aren't different stacks for monitoring your data and monitoring your ML models. So we're really, really excited about this.

Awesome. So that concludes the Databricks AI stack and really, really excited for Bradley to show us their journey into AI.

So the first use case that I think is really interesting, and doesn't need a lot of your data to work, is actually calling actions based off of a conversation. You can think of this as someone filling out a form by just typing natural language, or maybe they're making a request that ultimately ends up calling some backend API to take an action. So you start with unstructured data and you need to output something like JSON. It's actually pretty easy to do - it doesn't take a lot of work with one of the more capable models. You just take an input, give it a prompt with maybe a couple of few-shot examples and a description of the valid fields, and it'll generate JSON for you. And it works great if you use a powerful model.

So maybe in the first version of this you use GPT-4, and you realize it's not 100% accurate. So you put in a validation step that can tell you when it has succeeded, and maybe you do some retries or have some fallback option. But quickly you realize that GPT-4 can be a little bit slow. Turbo will help, but maybe you come up with a new prompt that improves performance, or you try out a new model - something open source or maybe a different vendor - and you've quickly built up a more complicated chain and tried out six or seven different versions of it before you've shipped it.
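A minimal, hedged sketch of that extract-validate-retry loop follows; `call_llm` is a hypothetical stand-in for whatever model endpoint you use, and the schema and prompt are illustrative:

```python
import json

VALID_FIELDS = {"action", "order_id", "reason"}  # illustrative schema

PROMPT_TEMPLATE = """Extract a JSON object with fields {fields} from the request.
Example: "cancel order 42, wrong size" -> {{"action": "cancel", "order_id": "42", "reason": "wrong size"}}
Request: {text}
JSON:"""

def call_llm(prompt: str) -> str:
    # Hypothetical: send the prompt to your chosen model endpoint and return its text output.
    raise NotImplementedError

def extract_action(text: str, max_retries: int = 2):
    prompt = PROMPT_TEMPLATE.format(fields=sorted(VALID_FIELDS), text=text)
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            # Validation step: only accept outputs that match the expected schema.
            if isinstance(parsed, dict) and set(parsed) <= VALID_FIELDS:
                return parsed
        except json.JSONDecodeError:
            pass  # fall through and retry
    return None  # fallback: hand off to a human or a rules-based flow
```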

And I think one of the key insights we had early on was that you will quickly lose track of the performance of the different versions, and of what the best prompts were, unless you're treating this whole stack as a versioned model. This is where MLflow comes in. You can implement the chain in Python or with something like LangChain, put that versioned model into your LLM Ops tooling, and track it. Now you can do things like A/B testing one version against another so you can tell you're making improvements, and your application really only has to know how to call the serving endpoint - you don't need to be changing your application every time you make a change to the models.

One of the problems for us was that this proliferated very quickly; we had a lot of different applications calling a lot of different endpoints - I think we're at a couple of hundred of these today. So operationally, it's really important that you figure out a way to manage this from the platform side and for teams to have visibility into what everything is doing.
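One way to treat the whole chain as a versioned model is to wrap it in an MLflow pyfunc and log it; this is a hedged sketch with illustrative names, not Block's exact setup:

```python
import mlflow
import pandas as pd

class ActionExtractionChain(mlflow.pyfunc.PythonModel):
    """Wraps prompt + model call + validation as one versioned unit."""

    def predict(self, context, model_input: pd.DataFrame) -> list:
        # Placeholder: in a real chain this would build the prompt, call the LLM, and validate.
        return [{"action": "noop"} for _ in range(len(model_input))]

with mlflow.start_run(run_name="action-chain-v2"):
    mlflow.pyfunc.log_model(
        artifact_path="chain",
        python_model=ActionExtractionChain(),
    )
# Each logged version can then be registered, deployed to a serving endpoint,
# and A/B tested against the previous one, while callers keep hitting the same endpoint.
```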

Also, compared to the number of foundation models, you have a lot of applications of them - a lot of these chains - and you wouldn't possibly want to be hosting a GPU serving endpoint for every one of them; it would be egregiously expensive. So instead you build these so that your endpoints call out to the foundational models: every endpoint is probably backed by some general purpose model.

You can build it like this, but it quickly gets complicated. For each endpoint, you have to know what it's calling, and as was mentioned earlier, if you want to make a change you actually have to change code to call out to a different endpoint. So it quickly becomes much easier to do this through an AI gateway. You have a whole stack of serving endpoints, each of them calling out for its foundational capabilities through an AI gateway, and that can route to a collection of different models - a self-hosted Llama, but also ChatGPT or Claude or any other vendor you want to fit in there.

And so again, that really helps reduce cost by not needing to host your models multiple times. But also, operationally, you now have a single place for visibility into things like rate limits, usage, and cost attribution, so you can tell who's using what. And very importantly, for some of these models where you have a company-wide rate limit - on GPT-4, for example - it's really key to have a place to enforce that one application isn't going to use up the whole quota and end up shutting down production apps.

Another aspect of this, which I touched on a minute ago, is that you want to have feedback so you can iterate and improve on these. And, going back to the inference tables from before, that's really easy to do in the Databricks solution: you just take an endpoint, set up inference logging, and now you've got inputs and outputs.

But to actually make use of that data, you probably want to join it back to some form of customer feedback. So your application is maybe measuring accept rates on the suggested actions or something like that, and you can log that into Kafka. This is natively what we've been doing with our data.

Before we started using Databricks for serving endpoints, we had a pipeline there to take those events and make them available as Delta tables. So you can easily join your feedback back to your inferences and start actually measuring A/B test outcomes, or even doing analysis on which inputs cause bad outputs.
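A hedged PySpark sketch of that join - the table and column names (request IDs, accept flags, model versions) are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

inferences = spark.table("main.serving.chatbot_inference_log")   # inference table from the endpoint
feedback = spark.table("main.events.suggestion_feedback")        # Kafka-sourced feedback as Delta

# Join model inputs/outputs to customer feedback on a shared request id,
# then compare versions on accept rate.
joined = inferences.join(feedback, on="request_id", how="inner")

accept_rates = (
    joined.groupBy("model_version")
    .agg(
        F.avg(F.col("accepted").cast("double")).alias("accept_rate"),
        F.count("*").alias("n"),
    )
)
accept_rates.show()
```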

Ok, so let me talk through one more use case, and I think this is maybe the most common one you'll see: some sort of customer-facing chat experience. Obviously, like we were talking about, you want to get your data into this stack. If someone is going to ask a question about Square, you want to make sure it's not just using the reference material from the base model - you want it referencing your content.

So that means we need a retrieval step, like we mentioned before. You're reusing much of the same infrastructure, but instead of the chain being a simple prompt-and-validate, now we're retrieving documents. This is where vector search comes in, and it's been really crucial for us - I think most of our applications so far are retrieval based. It lets us take our context, which is already in our data systems, and transparently get it into the model.

In some of our first efforts, we actually did most of the vector search in memory. And that goes a long way - you don't need a separate vector search endpoint to hold the index; you can just have it in the model hosting. But even if you don't hit scaling challenges with that, it's operationally difficult to update your context if you have to redeploy the model any time you want to add new data to it.

So decoupling those gives you a really valuable workflow, where you can separately iterate on your prompt or your model selection in the main chain, while also just adding new context over time into the separate vector search endpoint.

Another thing I wanted to talk about in a use case like this: now we have freeform customer input and we have our data going into these models. So it's really crucial, if there's any risk that sensitive data is flowing through this pipeline, that we're not sending that data out to a third party.

We knew early on that for our most sensitive data, we were going to rely on self-hosted models for that security. But if you just try to set this up yourself - build your own container and run PyTorch for some of these big models - you'll see that it is dramatically slower than what the vendors are doing and how quickly you can get responses from, say, OpenAI.

And that's because it really does need to be optimized. So those optimized GPU serving endpoints we talked about were really enabling - it was kind of night and day. Without optimized GPU serving, building these solutions was just too slow; the latency per token was too high to really make this work.

I think another thing we've built towards operationally, after getting our first models out in the wild, is ensuring quality on both ends of the stack. If you just take unfiltered input, pass it through, and then put the output of the model in front of a customer, there's a lot that can go wrong.

So we've started to focus on solutions as pre- and post-processing steps. At input time, you really want to filter for potential bad actors trying to do prompt injection or things like that, which you can pretty reliably detect. And similarly on the output, you put in a validation step that says, ok, nothing toxic is coming out of this - or, even more ambitiously, you can ask a second model "is this a hallucination?" and be more confident in your results.

And again, the flexibility of having this be a model implemented in Python - where you can add more steps to the process, update it, and compare it against a version without those steps - is what really enables you to be confident you're making improvements and to add new requirements into your stack.
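A simplified, hedged sketch of those pre/post checks as extra steps in the chain; the two classifier helpers are hypothetical placeholders for whatever detection models or rules you use:

```python
def looks_like_prompt_injection(user_input: str) -> bool:
    # Hypothetical pre-processing check (rules or a small classifier).
    return "ignore previous instructions" in user_input.lower()

def is_toxic_or_hallucinated(answer: str, context: str) -> bool:
    # Hypothetical post-processing check, e.g. a toxicity model plus a
    # second LLM asked whether the answer is supported by the context.
    return False

def guarded_answer(user_input: str, context: str, llm) -> str:
    if looks_like_prompt_injection(user_input):
        return "Sorry, I can't help with that request."
    answer = llm(user_input, context)   # the main chain: retrieve, prompt, generate
    if is_toxic_or_hallucinated(answer, context):
        return "Let me connect you with a support agent."
    return answer
```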

Ok, so I'll talk about fine tuning for a minute. This is pretty similar to what Ana was mentioning before. The thing you'll notice when you start building these is that even if you have a relatively complex stack, the vast majority of your latency budget and your spend comes from the generation step, and that's because of the big models.

For us, we've been looking at fine tuning mostly as a way to drive down latency or improve costs. The idea here is that maybe it was working well enough with a 13 billion parameter model, but with fine tuning you might be able to take that down to a 3 billion parameter model. And we're really confident we can do this because, even before this year, we had successfully built GPT-style chat models.

So we know that models under the 1 billion parameter range can work; we just need to find an operationally scalable way to make that easy to do. So let me talk a little bit about the things you need to figure out as you start to take the fine tuning approach.

First, the easiest thing you can do is take your fine tuned model and host it in an endpoint. That's a very specific model, maybe just one use case, and it's easy to go set that up.

But there are also a lot of applications where you might reuse that fine tuned model. And like I was mentioning before, you're going to have different prompts, and you're going to iterate on the different inputs, outputs, and validation. So just like with self-hosting a general model, you can put your fine tuned models behind the AI Gateway and have multiple endpoints pointing to that same fine tuned model.

The last thing I'll mention for fine tuning is that there's definitely a minimum scale needed to justify the investment - both the upfront cost of doing the fine tuning and the operational cost of hosting it on GPUs. Even a single A10 ends up being relatively expensive, so you need enough usage of these endpoints to justify it.

But on the flip side, a year ago there was also a significant upfront and talent cost to this, where you'd have to go figure out the training pipeline, actually implement RLHF or something like that, and then run it. I think tools like Mosaic are making this increasingly unnecessary - it's as easy as assembling your dataset of input-output pairs, and then you can just run a fine tuning job and have a model ready to go.
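As a small illustration of "assembling your dataset of input-output pairs", here is a hedged sketch that writes prompt/response pairs to JSONL, a common format accepted by fine-tuning tools; the column names, example row, and file path are illustrative:

```python
import json
import pandas as pd

# Historical requests and the responses you want the smaller model to imitate.
pairs = pd.DataFrame(
    {
        "prompt": ["Cancel my last order, it arrived damaged."],
        "response": ['{"action": "cancel", "reason": "damaged"}'],
    }
)

# Write one JSON object per line; most fine-tuning pipelines accept this layout.
with open("finetune_train.jsonl", "w") as f:
    for row in pairs.itertuples(index=False):
        f.write(json.dumps({"prompt": row.prompt, "response": row.response}) + "\n")

# The resulting JSONL file is then handed to the fine-tuning job
# (e.g. a Mosaic-style training run) to produce the smaller, specialized model.
```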

Ok, so let me talk about some of the takeaways we've learned as we've done this. The first is, like I mentioned, it's a fast moving space and we want to make sure we're shipping things quickly. So we're investing in a platform that makes this easy for all of the people developing these models, but doing so minimally so we can move quickly, and we'll just try to keep up as all these use cases proliferate.

The second is that even for simple applications, it's tempting to just do a direct integration with an endpoint, but it is very valuable to treat them as versioned models, make a process out of improving them, and give them standalone endpoints so that your applications don't need frequent updates.

And the last is that you have to be iterative when moving toward those more advanced techniques, because along the way you need to make sure every investment you're making is either driving more performance or reducing costs.

Alright, come back up - we have a couple more slides to summarize. Can you hear me? Yes, thank you.

So in summary, we recommend you move from left to right along the maturity curve. Start with prompt engineering on as-is models - it's very simple, and you can see if there is real business value in the use case. Then take a step further into retrieval augmented generation. Fine tuning can let you really optimize latency and cost.

And then if you're really going for full control over the data sources, go to pretraining. It's an iterative approach: create a baseline, create an evaluation framework, and keep moving from left to right along that maturity curve. In terms of next steps:

The first QR code is going to lead you to an end-to-end retrieval augmentation demo with Vector Search and MLflow AI Gateway - there are multiple notebooks, and it also shows you how to parse the PDFs, so it's really end to end. I recommend you try it out.

For the second QR code, we have a gen AI hackathon coming up, so we would like to invite you to it. And we have a booth at the expo - this is the booth number; there's a massive Databricks sign, and there are a lot of our colleagues there who can do a live demo for you on multiple screens. So if you really want to see the things we spoke about here in action, we invite you to come to the booth.
