Unraveling AI terminology (sponsored by Cloudflare)

All right. Hey folks, my name is Rita. I'm here from Cloudflare; I lead product for our developer platform. I'm excited to be joined here today by Jacob Lee from LangChain and by Philipp Schmid from Hugging Face, two of the companies doing some of the most exciting things happening in AI.

And what I wanted to start with was this quick picture of the New York Times in 1996. So why am I showing this to you? I am from New York, but this is not actually about the Yankees winning the World Series; what this is about is disruptive technologies.

And there's this really interesting thing that happens whenever a new technology comes along: the very first thing that we do whenever we try to use that new technology is we bring the old technology into the new world. So what do I mean by that? Well, on this very day in 1996, this is actually what www.nytimes.com looked like.

Now, you might note that it looks remarkably similar and this is exactly what I mean, right. So in the old paradigm with the newspapers, right, you had to get it all ready and then get all of the copies printed out because that took some time and then that was all you were able to distribute that day across millions of users.

But the thing that's unique about the web is that you are able to generate entirely unique experiences within just a few seconds. Now, it was really, really important for companies like the New York Times, like publishers to get started with a new paradigm, right? Just having that online presence was really, really important.

But it's also interesting to reimagine what things look like when you contextualize them entirely within the new paradigm. So if you think about it in the context of what news looks like in the age of the web, it actually looks something more like this, right?

So if I'm really into AI developer experience, programming the weather in New York, I can get my entirely personalized news feed every single day.

Now, why am I telling you all of this? Well, just like the internet AI presents a really, really huge paradigm shift and just like with the web, it's important to, first of all, start getting comfortable with that paradigm shift and start having a presence there and that can be pretty intimidating.

But also I want to say that we're still so so very early on in this AI space that I believe every single person in this audience has the opportunity to come up with what it looks like, what it feels like to reimagine existing experiences within this context.

Now, AI is really, really fun to play around with. I just had a chance to sit down and play around with things over Thanksgiving weekend; it's my favorite time to hack around with things. And it's really cool because I have actually no artistic skills, but I was able to generate a print in the style of Hokusai that presented Paris, without being able to draw at all.

And I think that's a really great demonstration of how powerful AI is and of the fact that so many people dismiss it as a toy because of things like this. But whenever someone is dismissing a new technology as a toy, I think that's how you know, that it's going to be really big.

Importantly though AI is really shifting the way that productivity works and the way that communication works and we're already starting to see early signs of that. So 55% is the time that's been saved by developers who have started using AI based code generation tools.

I want you to think really deeply about that: 55% of their time saved, right? Imagine you could save 55% of the time in your day and you could double your entire productivity. That's a lot of time back. I've started to get an hour back in my day. Right.

So AI represents a shift that's as big as the internet, as big as mobile, as big as the cloud. And one of the things that we've also seen is that with each of these paradigm shifts, technology grows and develops much, much more quickly.

And Jacob actually has a really great slide that shows just how much is happening even in like the span of the past few months in AI. So what does this mean for you? I imagine a lot of folks here, just like many of the customers that I talk to every day have been tasked with this very nebulous thing coming up with an AI strategy for your company, right?

You have investors asking about this; all of a sudden your CEO, your CFO, your CTO, everyone is really invested in this. And some of you might have even been thinking: why is everyone asking about this now? We've been doing AI for, like, 10 years; we have a data science team. And on top of that, there's all of this new terminology that's cropping up that's also been really hard to keep up with, right?

There are all of these new technologies and terms every day. So my hope in my talk and in our discussion with Jacob and with Philipp is to help provide a bit of a framework for how to think about all of these different concepts.

So this framework has really helped us at our company internally in how we think and talk about AI, and it starts by splitting out, first of all, the two different types of AI. So you have predictive AI and you have generative AI, and we'll dive more deeply into each of them.

But when you think predictive, I want you to think data that only your company has, right? And making sense of it in order to either create better experiences for the user or be more efficient as a business, right? So you want to predict the things that are happening next. With generative AI, on the other hand, as the name would suggest, the thing that you're doing is generating something entirely from scratch, and that's going to work slightly differently.

Now, on the other hand, we have two very different processes that happen in AI, right? We have training and inference, and I like to also separate these out, because these are going to dictate the tools that you end up using to implement each of these things.

And while all of the hype, all of the excitement is happening in the generative AI space, I do think that it's really important for us to continue to think about what are the ways in which we can take advantage of technologies in the predictive AI space as well.

So this gives you a slightly more holistic framework for how to conceptualize all of this, and that's where I'm gonna start. Now, I'm a PM, so it always helps me to think about things in terms of use cases, right? What are people using this for?

So when it comes to predictive AI, as I mentioned, the two kind of broad categories are: if you have existing data, you want to make use of it in order to either create better user experiences or make your business more efficient. You can see a few examples of this here, right? Things like inventory management, personalized user experiences, personalized campaign marketing.

I'll even give you an example from Cloudflare. We've actually been using AI for this use case for the past decade or so, and it's actually for the thing that, you know, Cloudflare is best known for, right? DDoS mitigation.

So every day we see about 20% of the traffic that happens on the web, and we're able to take these insights and see all of the patterns that occur, right? Every single day, traffic goes up because it's daytime and it wanes when it's nighttime; there are spikes when people have sales or product launches. But there are also spikes when there are attacks, and it's important to be able to distinguish these two things.

And so what AI helps us do is, rather than having a person sitting there monitoring every single activity, we can have a machine determine these patterns and make a pretty strong prediction as to which one of these it is.

So how does this all work under the hood? Well, the process generally looks something like this. The very, very first step is you want to start collecting data. So whenever users are clicking on things, if you have logs maybe from your traffic, that's a really great starting point. And you'll generally want to use something like a queuing service to then start storing all of these different events.
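As a rough illustration of that first step, here's a minimal sketch of a Cloudflare Worker that captures click events and pushes them onto a queue for later training. The binding name CLICK_EVENTS and the ClickEvent shape are made up for illustration, not part of any specific product setup.

```typescript
// Minimal sketch: capture click events and enqueue them for later model training.
// CLICK_EVENTS and the ClickEvent shape are illustrative, not a prescribed schema.
interface ClickEvent {
  userId: string;
  itemId: string;
  timestamp: number;
}

export interface Env {
  CLICK_EVENTS: Queue<ClickEvent>; // Cloudflare Queues producer binding
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("expected POST", { status: 405 });
    }
    const event = (await request.json()) as ClickEvent;
    // Enqueue the raw event; a separate consumer (or a data science job)
    // batches these into training data later.
    await env.CLICK_EVENTS.send(event);
    return new Response("queued", { status: 202 });
  },
};
```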

Now, from that point, sometimes this will be done by a data science team, sometimes it's gonna be done by an engineering team, but you want to start using compute to develop a model, right? And that model is gonna act as that decision-making engine for you.

Now, as you train this model, you're gonna have more and more data coming in, so you might want to modify it and continue to improve it and make it smarter. And the thing here is that in predictive AI, due to the size of the data, you generally do want something that's individualized to your business.

And because the size of the data is, again, relatively smaller, the time to train a new model can be a few hours or days, so you can continue to iterate and verify that the new model is better than the previous version.

The inference process is actually pretty straightforward, right? So something like a request comes in. If you're doing something like tracking, you know, what are people clicking on today, someone clicked on this particular thing, and the thing that you're gonna do next is try to predict the next action, right? Maybe based on past traffic.

If they previously clicked on this item, then they will likely be interested in this other item as well. And so you run through this model. Models in the predictive space are actually able to run on pretty commoditized compute such as CPUs, so you don't necessarily need to invest in, you know, heavy equipment in order to be able to get started on the inference side here.
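To make the "people who clicked this also clicked that" idea concrete, here's a toy sketch that just counts which item tends to follow which in past click streams. A real predictive model would be trained rather than hand-counted, but even this runs comfortably on ordinary CPUs.

```typescript
// Toy illustration of the "people who clicked X often click Y next" idea.
// Assumes the clicks array is in chronological order per user.
type Click = { userId: string; itemId: string };

function buildNextItemCounts(clicks: Click[]): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, number>>();
  const byUser = new Map<string, string[]>();
  for (const c of clicks) {
    const seq = byUser.get(c.userId) ?? [];
    seq.push(c.itemId);
    byUser.set(c.userId, seq);
  }
  // Count how often each item is followed by another within a user's session.
  for (const seq of byUser.values()) {
    for (let i = 0; i + 1 < seq.length; i++) {
      const from = seq[i], to = seq[i + 1];
      const next = counts.get(from) ?? new Map<string, number>();
      next.set(to, (next.get(to) ?? 0) + 1);
      counts.set(from, next);
    }
  }
  return counts;
}

// Predict the most likely next item after `itemId`, based on past traffic.
function predictNext(counts: Map<string, Map<string, number>>, itemId: string): string | undefined {
  const next = counts.get(itemId);
  if (!next) return undefined;
  return [...next.entries()].sort((a, b) => b[1] - a[1])[0]?.[0];
}
```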

Now, when you think about generative use cases, again, you're creating something that didn't exist before. So what's the most common generative use case? Now, it's all of these chat bots. And to me, this really speaks to the metaphor of taking a somewhat pre-existing technology and bringing it into this new paradigm, right?

So we've been familiar with the search bar for the past couple of decades, except now, instead of typing, you know, a pretty vague query into the search bar, we're able to give it more context and, in response, receive something that's quite a bit richer than just a bunch of links, something that's more curated for us.

Something that's maybe even in a particular style, right? Maybe you want a response in the form of an essay or in bullet-point form; generative AI can do all of that for you. It's a very similar thing with code generation.

And so for a lot of these cases, I like to think of them as: how do we make humans within the company more efficient by taking tasks that would previously take humans a really long time to even get 90% of the way there, and having AI take care of at least that very, very first step, right?

So instead of writing all of the boilerplate of the code, you can have AI generate it for you, so you can focus on solving the really hard problems and the algorithms and all of that.

Now, the training process for generative AI is actually very, very similar to the training process for predictive AI with one key difference, which is first of all the size of the data sets. So I want you when you think about generative AI, I want you to think internet scale size data sets.

And as a result of that, you're going to have much, much longer training cycles when it comes to generative AI. So rather than a new model taking a few hours or a few days to develop, it's going to take months, and there are quite a few humans involved in the process as well.

There's a lot of data cleanup that has to happen, there's a lot of verification. And so the way that that impacts how generative AI works is that it's much less likely for each company to develop its own model, right? It doesn't make sense for each of us to spend our engineering cycles on developing one of these new models.

And so similarly, on the inference side, what you're going to see is a process that's very, very similar, but you are going to need specialized compute, so GPUs, in order to be able to run generative inference. And that's where, again, one of the things that we'll be talking about in our discussion is: when do you go with self-hosted? When do you start to procure your own infrastructure? When do you go with cloud versus, you know, something like a serverless solution, when you do need access to much more robust and specialized GPUs?

The other thing to keep in mind with generative inference: I just mentioned that generative AI models are typically much more one-size-fits-all. So you might be thinking, OK, well, for my business there are a lot of products that are custom to us, maybe there's a lot of jargon, a lot of lingo, things that are very, very specialized. How can I take this model that's been trained on all of this vast input that wasn't custom to me and use it for my own business?

So once generative AI has been trained, it's really hard for it to learn new things. And a metaphor that I really like to use is actually from a movie. How many people here have seen the movie 50 First Dates?

All right, quite a few folks. And if you haven't seen it, I think it's pretty common on planes, so maybe check it out on your flight back. But the idea is, you have Drew Barrymore's character, right? And she keeps going on dates with Adam Sandler's character, but she's had an accident, so she can't actually form new memories, right?

She has all of her past memories of her childhood and maybe what she learned in like third grade biology. Um but when it comes to remembering what happens on their dates every day is a hard reset and an AI model once it's been trained works a little bit similarly to that.

So how do you provide it with the context that it needs? One way to do that is through a technology called a vector database, and to use the analogy here, this is a little bit similar to how, in the movie, every morning when she wakes up, Drew Barrymore's character watches a video about everything that's happened in her life in the past few months, right, to give her that context. So this is exactly what we're doing.

We are scanning, maybe, if you have a product catalog, you can have that product catalog be indexed and then saved in a vector database for the model to access whenever it needs, in order to be able to answer specific questions.

And so when someone comes in, maybe they're doing some Christmas shopping and they ask, you know, "what red shirts can I find?", it will look for things that are similar, right? So this is kind of how the vector representation works: it's gonna look for things that are numerically similar and start to pull up items that fall into that category.
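Here's a minimal sketch of that similarity lookup. embed() stands in for whatever embedding model or API you use, and the catalog shape and scoring are purely illustrative.

```typescript
// Minimal sketch of the vector-similarity idea behind the "red shirt" example.
type Vector = number[];

declare function embed(text: string): Promise<Vector>; // hypothetical embedding call

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function searchCatalog(
  query: string,
  catalog: { name: string; vector: Vector }[],
  topK = 3,
) {
  const queryVector = await embed(query); // e.g. "what red shirts can I find?"
  return catalog
    .map((item) => ({ item, score: cosineSimilarity(queryVector, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK); // the numerically most similar products
}
```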

So that's one way of customizing outputs. Another way to customize generative AI for your particular use case is through something called fine-tuning. So here we are basically teaching the model new behavior by constantly correcting the outputs that it's giving us, right?

So it'll give an output, you can provide feedback and then you can adjust accordingly and basically set up the model with those parameters that are specific to you.
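In practice, fine-tuning usually starts with collecting pairs of prompts and the corrected outputs you wish the model had produced, then handing that dataset to your provider's or framework's fine-tuning job. The exact file format varies by provider, so the shape below is a generic example, not any specific API's schema.

```typescript
// Illustrative only: a tiny dataset of corrected examples, serialized as JSONL,
// which is a common input format for fine-tuning jobs.
interface FineTuneExample {
  prompt: string;          // what the user asked
  correctedOutput: string; // the answer a reviewer marked as the right one
}

const examples: FineTuneExample[] = [
  {
    prompt: "What does SKU WIDGET-42 do?",
    correctedOutput: "WIDGET-42 is our weatherproof outdoor sensor, sold in packs of four.",
  },
  // ...more corrected examples gathered from human feedback...
];

// One JSON object per line (JSONL).
const jsonl = examples.map((e) => JSON.stringify(e)).join("\n");
console.log(jsonl);
```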

So we just went through all of these different quadrants, right? And again, this is just a helpful framework; when you're starting to think about use cases, it becomes very easy to talk over each other, and so you can start to ask yourself: which of these are we talking about? What's the right tooling for this particular use case?

And before I hand things over to Jacob, I'll leave you with a couple more thoughts on considerations that you might have when you're starting to run AI in production. First of all, there's always going to be skepticism when you're introducing a new technology into the company. And so the thing that I always recommend for people to get comfortable is find an internal use case, right?

Um just get people to start playing around with it. Something very common is something that's gonna help your team be more productive.

Maybe it's code generation tools, maybe it's even something that helps your IT team start to answer questions that they get every day, like, you know, how do I set up my laptop? Right? The second thing that I would recommend you do is come up with an AI roadmap. This is something that can be brainstormed with other product managers, with other engineers, right? Go look at your dashboards, your customers, and think about: where can I optimize their experience, or where can I make our company more efficient, by using AI to either get our team 90% of the way there or by having AI predict what's gonna happen next? Last but not least, set up some guidelines for internal usage.

So a lot of the fear that I've seen around using AI is about, you know, what is it gonna do when it comes to privacy? Right? What do I do if someone starts using AI at my company and leaks personal data? And so again, that's where having some guidelines can be really, really helpful in getting everyone on the same page and getting everyone comfortable.

Now, the other thing that I would consider once you're starting to expose things to your customers, once you've gotten more comfortable with it, is where you want to run your AI on the spectrum of device, network, and centralized cloud. And there's always going to be a trade-off between these five things, right?

So, are you looking to optimize performance? Are you looking to optimize cost? Are you looking to optimize accuracy, privacy, or developer experience in helping your team go faster? So for example, in something like a self-driving car, performance is just gonna be the most important thing, right? You need those decisions to be made in a split second, and so that's where running the inference on the device is going to make the most sense.

If you have an experience where a user is waiting on the other end of it, you still want it to be pretty fast, but maybe because you're serving so many users, it's more important for you to be cost efficient. And in those cases, it might make sense to run your inference on something like Cloudflare, right, that can provide that cost efficiency at network scale while still running close to the user.

Now, in cases where maybe you want to run a 100-billion-parameter model in a centralized cloud, right, then that's gonna be the best thing for that use case, because you want to optimize for as much accuracy as possible by having as large a model as possible.

We'll be talking through some more of these trade-offs and some of the best practices in our fireside chat, so I encourage all of you to think about any questions that you might have and we'll come back to you to be able to answer those. And with that, I will hand it over to Jacob to dive more into the landscape.

All right. Thank you very much, Rita. My name is Jacob and I'm a founding software engineer at LangChain, and I'll be, like Rita said, walking through the LLM landscape in a bit more depth. And before I get too far in, if you're not already aware of LangChain, we're a bit newer on the block than Cloudflare or Hugging Face. We help developers build and ship context-aware reasoning applications with large language models. And while we help with various levels and stages of building with LLMs, the most relevant parts of LangChain for this talk are, there we go, our popular open source framework for orchestrating and coordinating the different modules that are important when building with LLMs, and LangSmith, our LLM tracing and observability tool, which we'll use to illustrate specific application runs a bit later in the demo.

OK. So to get some perspective on the current landscape of LLMs, let's walk through a quick timeline. In the beginning, there was the bombshell paper that introduced the transformer, the underlying architecture that underpins all these crazy recent advances in AI: "Attention Is All You Need." We can look at this as a starting point for modern LLM development, and it's crazy to think that this is almost seven years ago now and predates most of the AI initiatives and startups that are so hyped today.

Well, except for Hugging Face, shout out to Philipp over there. While putting my slides together, I was actually shocked to discover that Hugging Face predates all the hype and was in fact founded in 2016. Anyway, the transformer shocked the community by introducing this mechanism of attention and achieving all these crazy results.

OpenAI was quick to capitalize on all this, training and releasing GPT-1 less than a year later in February 2018, GPT-2 a year after that in February 2019, and GPT-3, which I'm sure you're all familiar with, in 2020.

Now let it sink in for a second: GPT-3 is almost three years old. Headlines continued at a steady but not rapid rate, and at least for me as a developer, the first inklings I had that LLMs would be world-shaking came from GitHub Copilot and some of the attention-grabbing features they provided around code understanding.

Then of course, at the end of November 2022, OpenAI released ChatGPT and lit an inferno; we were off to the races. The BLOOM LLM on Hugging Face came out around the same time, and within months, open source efforts from Meta with the original LLaMA, and GPT4All, started to emerge as the wider community began to understand the incredible potential of what they were working with.

This pace has only accelerated recently with higher and higher quality open source models like Llama 2, Mistral, and Falcon, and various fine-tunes with impressive benchmarks such as the very recent OpenOrca. Private frontier models continue to improve rapidly too, with Anthropic's Claude recently announcing a 200,000-token context window, which is just a truly massive amount of data, and OpenAI releasing various state-of-the-art features such as Assistants, function calling, and GPT-4V.

And the takeaway here, as you're probably already aware, is that the landscape is still shifting extremely, extremely quickly, possibly even more quickly than a few months ago. Because of that, when building with LLMs, it's important to stay flexible and avoid locking into a specific model; the only constant within the space at the moment is change.

So how do we evaluate the landscape of LLMs? Well, it depends on your task, much like Rita said, but there are various benchmarks that have proven useful in getting a sense of how LLMs perform on different types of data. You essentially run the LLM over a dataset and have either a human or a more powerful LLM like GPT-4 judge the result.

And here's one for chat data published on Hugging Face. It compares outputs from various LLMs, both open and closed source, and has an interesting technique where they have an app where you can see the outputs from one LLM versus another for the same prompt, and a user will actually pick one. They're treated like a competition, with Elo ratings if you're familiar with chess, where an LLM that is picked over another will gain Elo and the other one will lose Elo.

And we see a few interesting open source models here, in particular WizardLM 70B, starting to approach some of the capabilities of the frontier models like Anthropic's Claude 2 and GPT-3.5 Turbo. Mistral deserves special mention as well for being an open source, very small model at 7 billion parameters. Parameters are essentially a proxy for how powerful the model is; at a high level, it's how many weights are within the model, and 7 billion parameters is much smaller than most of the other models on the list, so it's punching above its weight class. Very, very interesting and useful.

And close to the top, we see Anthropic's line of LLMs, the Claude series, which are probably of special interest to many of you in the audience here given that you can run them privately through AWS Bedrock. So what does this all mean? Should we just go off of the previous list and call it a day? Well, it's true that there's definitely a lower limit, at least currently, on LLM parameter size for, you know, kind of useful outputs here.

For example, here's an output from GPT-2, the predecessor to GPT-3, that I prompted with "how are you doing today?" And it responded with a very literal interpretation of how I am doing today by answering, "the best way to ensure that I'm doing today is to give your children an active role in the classroom, in the classroom, in the classroom, in the classroom and the classroom." And yeah, some very vague, generically English text that, you know, you'd never want to put in front of a user, in other words.

However, depending on your task, you know, LLMs are approaching a point where you may not need to optimize specifically for power or performance in a kind of reasoning sense to get good results. For example, I asked both GPT-4 and GPT-3.5 Turbo in this slide why the sky is blue, and both LLMs output something reasonable, mentioning things like Rayleigh scattering and a bunch of other factors. But GPT-3.5 Turbo responded almost five times faster and, again, still generated an answer acceptable for the task at hand here.

So a few other dimensions to consider when evaluating LLMs in the context of your specific use case are context window size, which again is the amount of data an LLM can take in its prompt and output, as well as latency and cost. Open source models can also be a bit trickier to deploy, although folks like Rita and Cloudflare's excellent Workers AI have been making this simpler to run, as has Hugging Face Inference, and they can be a bit more specific in terms of strengths and weaknesses.

So for example, a recent benchmark put Llama 2 70B, which is a pretty large model (you couldn't run it on your MacBook; you'd need some pretty specialized hardware), roughly on par with GPT-3.5 Turbo, which I'm sure many of you in the audience are familiar with, specifically for chat. However, it lagged behind on coding tasks, which can be relevant for things like extraction, where you might want to pull out specific features from some input in a structured way.

So, in short, different models have different strengths and weaknesses, and that's something important to think about when picking through the current landscape.

All right. Now, let's switch gears a little bit and talk about architectures. My plan here is to walk through some handy concepts and ways to conceptualize and categorize different approaches to designing LLM apps so that you can apply them to your own work.

At LangChain, we've started thinking through various levels of autonomy. So I'm sure many of you have heard the buzzy term "agents"; maybe raise your hand if you have. OK, a good chunk. And then, you know, on the other end, you have just direct code, where you as a developer write out exactly what you want the computer to do and the computer then goes and does it, unless of course you have a bug. And we've been thinking of a spectrum between this kind of mundane code aspect and a fully autonomous agent, with a few layers in between.

We essentially think of the LLM as being able to make three kinds of decisions. One, where it just takes some input and outputs something it decides on; that's a simple LLM call. So, "why is the sky blue?" It might decide to output some text on Rayleigh scattering from a physics textbook, for example. Another is deciding which of a finite set of predetermined steps to take, so things like tool use might fall under here, or routing, which I'll get to in a moment. And at the far end, we have letting the LLM determine what sequence of steps to take. And this is kind of the "feel the AGI" stuff you hear from Sutskever and all the crazy things that Sam Altman is talking about; that's kind of where they're going. And you have things like AutoGPT that are sort of steps in that direction, at least, where you basically say: here, you're an LLM, here's a very ill-defined task, we're not going to give you any rails, just go off and do it.

So I would assume many of you in the audience are familiar with code, given that this is a developer conference, and many of you have also at least used some kind of call to an LLM to some degree, whether directly via code or indirectly with ChatGPT. So I'm going to focus a little bit more on what we call routing, and this is where you essentially let the LLM decide between one or more steps at a given juncture. This can be as simple as an if statement or branch. It can be quite simple and powerful even in a very basic form, and it showcases that you don't need to give the LLM full Skynet privileges to take advantage of some degree of autonomy.
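Here's a minimal sketch of that kind of routing, where a single LLM call picks a branch and plain code (an if statement) takes it from there. callLLM() and the category names are placeholders for whatever model API you actually use, not a specific LangChain API.

```typescript
// Minimal routing sketch: the LLM chooses between predetermined steps,
// and ordinary code handles the branch.
declare function callLLM(prompt: string): Promise<string>; // hypothetical model call

async function route(question: string): Promise<"docs" | "billing" | "fallback"> {
  const answer = await callLLM(
    `Classify this question with exactly one word, "docs" or "billing":\n${question}`,
  );
  // Keyword matching keeps the decision usable even if the model gets chatty.
  if (answer.toLowerCase().includes("billing")) return "billing";
  if (answer.toLowerCase().includes("docs")) return "docs";
  return "fallback"; // the model didn't pick a predetermined step
}
```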

Cool. Next, let's dig in on another concept, retrieval augmented generation, also called RAG for short. Anyone in the audience ever heard of RAG before? Awesome. So let's say you have an important research question, shown on screen here. And as everyone knows nowadays, there are two ways that you answer questions. One, type it into Google. We're all familiar with Google: you get a list of links, you maybe pick one, you then go and read the link yourself and kind of get some insight into what the answer might be. And then on the other end, there's of course ChatGPT, where you are trusting the LLM's output directly and it synthesizes results from its training data. You're not picking a source and reading it; it's essentially just pulling it out of whatever it was trained on.

So if you think of these two methods as two ends of a spectrum, we can call the Google end retrieval only, where you ask Google directly to fetch documents and then you read them yourself, and the ChatGPT end generation only, where you rely on the synthesized output from the data the LLM was trained on. Now, retrieval augmented generation attempts to get the best of both these worlds by combining the two ends, using sources fetched from a retrieval source to ground the LLM's generation in reality and specific knowledge.

So one common pipeline for setting this up is as follows. You start with loading some data from some documents; it could be a PDF, a website that you've, you know, pulled the data from, a CSV file, a database. And then you split that document into smaller, semantically contained chunks. This is called text splitting, and the reason for this is that when you work with LLMs, you want to avoid distraction as much as possible, and by chunking inputs into semantically contained pieces, you can help the LLM focus in on the specific ideas that are most relevant to your query.

Next, we'll embed those chunks into a specialized database called a vector store, which Rita already explained at a high level, and it comes in many forms. But the most important part you need to know for this is that they have powerful similarity search capabilities over natural language, so they can take some input and return data or documents that you've already ingested that are most similar and relevant to your query.

Finally, we will generate an output by presenting the retrieved documents from the vector store to the LLM and grounding the output in them.
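Put together, the pipeline can be sketched roughly like this. splitIntoChunks, embed, vectorStore, and callLLM are placeholders for whichever loader, embedding model, vector store, and chat model you actually use; this is a shape sketch, not a specific framework's API.

```typescript
// Condensed sketch of load -> split -> embed -> retrieve -> generate.
declare function splitIntoChunks(text: string, chunkSize: number): string[];
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: {
  add(chunk: string, vector: number[]): Promise<void>;
  similaritySearch(vector: number[], topK: number): Promise<string[]>;
};
declare function callLLM(prompt: string): Promise<string>;

// Ingestion: load a document, split it into semantically contained chunks, embed each.
async function ingest(documentText: string) {
  for (const chunk of splitIntoChunks(documentText, 1000)) {
    await vectorStore.add(chunk, await embed(chunk));
  }
}

// Query time: retrieve the most relevant chunks and ground the generation in them.
async function answer(question: string): Promise<string> {
  const relevantChunks = await vectorStore.similaritySearch(await embed(question), 4);
  return callLLM(
    `Answer using only the context below.\n\nContext:\n${relevantChunks.join("\n---\n")}\n\nQuestion: ${question}`,
  );
}
```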

Now, let's see how we can apply these concepts to build a popular chat-over-documents chain, and this is one of the most popular use cases of LangChain to date.

The idea would be that you have some large PDF or large data source and you want to ask it questions. No one wants to read, you know, 400 pages of SEC filings; we just want to ask, you know, "is Enron still around?" or something, right?

So we'll start by splitting the original data source and embedding those chunks into a vector store. And then when we're asked a question, we search the vector store for the most relevant chunks and pass those to the LLM to ground it.

Another key piece here is managing chat history, since we generally want our chatbots to be conversational and take into account previous interactions and turns. For example, my first question is "who is Jacob?" or "who is Rita?", and, you know, the vector store returns some data from some ingested chunks, perhaps a resume or a LinkedIn page or something, and spits out something that's plausible and answers part of my question. But then I have a follow-up. I want to ask more. I want to know: do they know JavaScript, for example?

So it's obvious to a human that "they" refers to the last person mentioned in the conversation. But as we mentioned with the 50 First Dates analogy, the LLM has no such context or reminders.

So we also need to ground the LLM's generation in prior chat history, and the way we do that is to add another pass where we pass the initial question, the user's question, plus the previous chat history into another LLM call to rephrase it into what we call a standalone question, free of context. This is one example of query transformation. So for a question like "do they know JS?", the output would be "does Rita know JS?" or "does Jacob know JS?", and the vector store will be able to handle that much better because it will know to search for chunks related to Jacob or Rita, and the LLM will also have that context as well.
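A sketch of that query-transformation pass might look like this. callLLM is a placeholder, and the prompt wording is illustrative, not LangChain's exact rephrasing prompt.

```typescript
// Sketch: combine chat history and the follow-up into a standalone question
// before hitting the vector store.
declare function callLLM(prompt: string): Promise<string>; // hypothetical model call

async function toStandaloneQuestion(
  chatHistory: { role: "human" | "ai"; text: string }[],
  followUp: string,
): Promise<string> {
  const historyText = chatHistory.map((m) => `${m.role}: ${m.text}`).join("\n");
  return callLLM(
    `Given the conversation below, rephrase the follow-up question so it can be ` +
      `understood on its own, with no pronouns or missing context.\n\n` +
      `Conversation:\n${historyText}\n\nFollow-up: ${followUp}\n\nStandalone question:`,
  );
}

// e.g. history about "who is Rita?" plus the follow-up "does she know JavaScript?"
// should come back as something like "Does Rita know JavaScript?"
```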

And this works well enough if we want our bots to answer questions based on one topic. But if we have multiple different domains, for example, you know, Jacob and Rita, then to reduce the amount of distraction the LLM needs to deal with, we can add a step where another LLM call chooses which database to query based on the incoming question.

So if the question asks about Jacob, we only want to search for documents related to Jacob, and we ingest those in a separate store, likewise for Rita. And this helps ensure that only focused knowledge is returned rather than potentially mixed chunks.

And if you recall the previous slide on levels of autonomy, this is a simple example of routing.

So how does this look in practice, and how do we build something like this? Let's walk through a demo app using a very small open source chat model hosted on Cloudflare Workers AI. The one we're going to use is the open source Llama 2 at 7 billion parameters, which I can even run on my MacBook, and I have done before.

And we're also going to use Cloudflare's Vectorize vector store for the database and run it on Workers, so it's entirely a kind of self-contained, one-provider shop.

And so I'll start by asking a question. Oh, and yes, a very important note here: I have prepared the vector stores with two sets of information. One is a paper by OpenAI's Lilian Weng on autonomous agents and sort of the cutting edge of artificial intelligence, and the other is a marketing PDF on Cloudflare.

So we have a Cloudflare knowledge base and an autonomous agents knowledge base, and we're going to show how the routing works there as well.
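For a rough idea of the demo's shape, here's a sketch of a Worker that embeds the question with Workers AI, queries a Vectorize index, and grounds the Llama 2 answer in the retrieved chunks. The binding names (AI, AGENTS_INDEX), model IDs, option names, and response field shapes are assumptions based on the Workers AI and Vectorize docs as I recall them, not the actual demo code, so check the current docs before relying on them.

```typescript
// Illustrative sketch of the demo's architecture on Cloudflare Workers.
export interface Env {
  AI: any;           // Workers AI binding (typed loosely here on purpose)
  AGENTS_INDEX: any; // Vectorize index holding the autonomous-agents chunks
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { question } = (await request.json()) as { question: string };

    // 1. Embed the question with a Workers AI embedding model.
    const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: question });

    // 2. Query Vectorize for the most relevant ingested chunks.
    const results = await env.AGENTS_INDEX.query(embedding.data[0], { topK: 4, returnMetadata: true });
    const context = results.matches.map((m: any) => m.metadata?.text ?? "").join("\n---\n");

    // 3. Ground the small open source chat model's answer in those chunks.
    const result = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      messages: [
        { role: "system", content: `Answer using only this context:\n${context}` },
        { role: "user", content: question },
      ],
    });

    return new Response(JSON.stringify({ answer: result.response }), {
      headers: { "content-type": "application/json" },
    });
  },
};
```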

So here it is. Let's start by asking a question, "what are autonomous agents?" So we're going to go in and ask that, and the LLM is going to work through that architecture diagram I showed you earlier, search the vector store, and start generating.

So we get: "Based on the provided content, autonomous agents are intelligent systems that operate independently and make decisions without human intervention," which seems pretty reasonable. It mentions things like memory and self-reflection that, you know, it pulled from the blog post, which are relevant, and all in all it seems like a pretty solid answer.

Cool. So let's ask a follow-up question. Let's say we want to learn more about the memory piece of autonomous agents, so we might ask it about that. OK, so yeah, the follow-up question was meant to be "tell me a bit more about memory within the context of autonomous agents," but I just phrased it as "tell me a bit more about memory," and the answer it gives is actually not quite as good.

It says, "Great, thank you," a little bit more GPT-2-like, if you will: "Great, thank you for providing more context. Based on the information provided, here are some key points about autonomous agents." It actually seems to go back and respond to some of the initial generation it made for the initial question.

So I set the app up with tracing as well, and this is where it pays to have good tooling. We can go to LangSmith and view the trace and try to figure out what happened here.

So let's have a look. We'll go in and look at the question rephrase step, and we'll see that it took my initial question, or query, "tell me more about memory," and rephrased it into something a little bit more chatty than I would have liked. And, you know, it makes sense, because Llama 2 is a chat model, a very chat-focused model, and it's very small, so it's very focused, and I get: "Great, thank you for asking. Autonomous agents are artificial intelligence systems that are..." Yeah.

And, you know, this was intended to be a standalone, dereferenced follow-up question, and instead I got essentially a chat response. So that's not so great.

And let's have a look at the routing step as well to see how that works.

Cool. There we go. Sorry, folks, technical difficulties; the video didn't play. But yeah, so I've asked it basically to route between either "cloudflare" or "artificial intelligence" based on which database it thinks would be more likely to handle the question effectively, and I've also asked it to output only one of the following answers: "cloudflare," "artificial intelligence," or "neither."

And you see, again, I get a very chatty response saying, "Based on the question provided, I would recommend using the artificial intelligence database to answer the question," which is correct but not in the output format I want. And the routing actually looks specifically for the keyword "artificial intelligence" in the output. So this does technically work, but it doesn't seem like the most reliable or robust way of dealing with outputs here. And there we go.

Cool. So how do we fix this? Again, one of the strengths of Llama 2 is that it's tiny, open source, and very cheap to run; again, I can run it on my M2 Mac with, you know, a very small GPU. But clearly, it would be nice to have a little bit more power in a few cases.

You know, if we had a user ask a follow-up question like that, they'd be very disappointed. So let's loop in Anthropic's Claude 2, hosted on Amazon Bedrock, for question rephrasing and routing, the two places where it seemed to have the most trouble, and see what happens.

And as it happens, these are actually the more token-light tasks, compared with pure output generation based on retrieved documents. They just take the question and chat history into account, plus a prompt, versus something like a long list of documents and chunks retrieved from the vector database and then, you know, a longer conversational output.
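A sketch of that split, with a stronger Bedrock-hosted model handling the token-light rephrase and routing steps and the small Workers AI model doing the final grounded generation, might look like this. callClaude, callLlama, and retrieveChunks are placeholders rather than real client code for Bedrock or Workers AI.

```typescript
// Sketch: use a stronger model for rephrasing and routing, a small one for generation.
declare function callClaude(prompt: string): Promise<string>; // e.g. Claude 2 via Amazon Bedrock
declare function callLlama(prompt: string): Promise<string>;  // e.g. Llama 2 7B on Workers AI
declare function retrieveChunks(standaloneQuestion: string, index: string): Promise<string[]>;

async function answerFollowUp(history: string, followUp: string): Promise<string> {
  // Stronger model: rephrase the follow-up into a standalone question.
  const standalone = await callClaude(
    `Conversation:\n${history}\n\nRephrase this follow-up as a standalone question: ${followUp}`,
  );
  // Stronger model: pick the knowledge base, constrained to a single keyword.
  const routeAnswer = await callClaude(
    `Answer with exactly one of "cloudflare", "artificial intelligence", or "neither": ${standalone}`,
  );
  const index = routeAnswer.toLowerCase().includes("artificial intelligence") ? "agents" : "cloudflare";
  // Small model: generate the final answer grounded in the retrieved chunks.
  const chunks = await retrieveChunks(standalone, index);
  return callLlama(`Context:\n${chunks.join("\n---\n")}\n\nQuestion: ${standalone}`);
}
```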

So yeah, we have the same kind of initial question, "what are autonomous agents?", and a pretty similar output from Llama 2. We're still using that for the answer generation step, remember. But let's see what happens now.

So I'll ask "tell me more about memory" just as before, and it'll do some thinking, and I get: "Based on the provided context, here are some key aspects of memory in autonomous agents." And it goes on to list a few relevant ones from the blog post, which is actually pretty good.

Here I get things on long-term and short-term memory, I get things on reflection, I get things on retrieval models, et cetera. So this is a lot better, right?

And let's go into the tracing to see if the Bedrock chat model did much better. Hold on, let's see. All right. Well, yeah, sorry about the video, folks. But we can see that it took the initial question and rephrased it into something much more relevant and useful for the simpler downstream LLM.

So it took "tell me more about memory" and turned it into "what are some key aspects of memory in autonomous agents?" Much better.

Now, let's look at the routing step. So again, it took this rephrased standalone question and responded with only the output I wanted, which was either "cloudflare," "artificial intelligence," or "neither," and it responded with just "artificial intelligence."

So it's much more robust and much more reliable in routing, and the output, as I showed earlier, is what we expect.

Cool. So in short, there's a lot of change going on with LLMs at the moment, and a fire hose of new models emerging, each with different strengths. Some are fast, some are cheap, some are powerful, some are easy to use, some are open source, some are private, some are good at chat, others are good at code and extraction, and some are even more customizable with fine-tuning support over your data.

And, yeah, there are more exotic things like custom grammars where you can constrain outputs. Here are some links to the LangSmith traces from the demo that you can peruse, a live version of the demo app I showed, and the GitHub link as well. That about wraps it up for me.

So you can try it yourself and kind of experiment with different models. I know I tried, like I said, a very small, basic one on Workers AI, but Workers AI is coming out with some more advanced, more powerful ones that may perform various tasks much better.

And yeah, thank you very much. We'll be on to the fireside chat now, and I'd like to call up Rita and Philipp again.
