Real-time RAG: How to augment LLMs with Redis and Amazon Bedrock

All right. So who's here to hear about large language models, yet again, in one of these theaters? Pretty much everybody? All right.

What about one on retrieval-augmented generation? About half. OK.

All right. So we're going to cover some basics, but we're also going to get into some interesting things here. Take a look.

First, I'm going to start with some of the challenges I've run into this year. I've built probably 100-plus different RAG systems for customers, whether that's a vector database plus a knowledge graph plus large language models in multiple settings, fine-tuned models, et cetera. And I've run into a number of different issues. We're going to talk about those to ground the talk and give you a basis for everything else I'm going to cover.

We're going to talk about data strategy. We're going to talk about how I'm using these elements. We'll talk a little bit about Redis, because obviously I have to, right? Come on, you know it's one of those talks. Then we'll get into RAG, and then something called semantic caching, which is going to be really useful to you both to increase QPS and to reduce your cost of using any large language model. And then we're going to talk about Bedrock and how you can do all of this really easily.

It's going to be pretty jam-packed. They gave me 20 minutes, so if I go too fast, you can come see me at the Redis booth. I'll stick around for an hour or so and you can ask questions. All right.

All right. So I mentioned challenges. What are some of the challenges I've faced this year building these systems?

Well, the first is cost, where the customer hosts the model. Do you put it on your own hardware? When's the last time you were able to get an H100 GPU? Seriously, it takes a while. OK, so you're going to host it on a cloud, even if you're running it yourself. Those costs rack up. Even Lambda has really good pricing, and AWS has great pricing for g4dn instances, but maybe you need higher throughput. Those kinds of considerations are really hard, and it takes a long time to figure out the right cost-optimization profile.

Quality. A lot of people are scared about hallucinations; it's the number one thing I hear: "we're going to keep it internal because we're scared a customer is going to hear the wrong thing." And so nearly every single one of these is an internal use case, but we can do better than that.

Performance. The QPS of a large language model, on average, for one of these models that's 70 billion parameters or more, is something like two queries per second. That is something that has not really existed in a lot of these high-transaction, high-speed ecosystems before. You don't usually have a component that tops out at two queries per second. That is a sizable shift from the databases we're used to, which support a lot of transactions per second.

Security. Imagine you need a model, and this is a setup that's come into play many a time, that has to serve both externally facing assets and internally facing assets. Do you fine-tune that model? You can't, because then that model could expose internal information to the public. And so there are all these challenges, all these things that have come up.

And so one thing I like to say is: rethink your data strategy. It sounds kind of buzzwordy, but everybody and their mother wants a private ChatGPT. At the very base level, those internal use cases I was telling you about, everybody wants something that says: here are a bunch of my PDFs and PowerPoints, now answer all of the questions my users are going to ask. Whether it's internal or external, it is essentially a private ChatGPT. And so we're going to start there as a base use case.

So what do you do for that? Do you fine-tune? Well, again, there are the aforementioned problems of internal versus external, and then infra costs. How often are you going to do that? How long does it take? That knowledge can't be updated rapidly. What happens if that PDF gets updated once a day? Are you going to retrain that LLM once a day? Again, back to cost. Are you going to feed everything into the prompt? I'm sure everybody's read the "Lost in the Middle" paper, right? If you haven't, you should actually go read it. I think Claude is up to what, 200K context now? Well, the model can't recall a lot of that information in the generated output if it's in the middle of the prompt. Go look at it: you fill the prompt with 200K tokens, which will be expensive, by the way, back to cost again, and everything in the middle gets forgotten. It's called "Lost in the Middle." Great paper. You should go check it out.

So there has to be a middle ground, and this is how I explain it to people. What's the middle ground? I'm sure you've heard talks about it today: it's vector databases. So what is a vector database, really quickly, starting from the very basics? You take unstructured data: audio, images, text, anything that's unstructured. I like to say, somewhat wrongly, anything but an Excel spreadsheet. That's why it's the Swiss Army knife of the machine learning toolkit. You take any of those types of unstructured data and you create embeddings from them, using something like an embedding model from OpenAI or one of the Hugging Face models off the shelf. Then you build a search space over those embeddings, usually with an algorithm like KNN (k-nearest neighbors) or approximate nearest neighbors in the form of one of the more popular implementations like HNSW (hierarchical navigable small worlds), FAISS (Facebook AI Similarity Search), et cetera.
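To make that concrete, here's a minimal sketch of the embed-then-search idea using brute-force cosine similarity. The model name and the tiny corpus are stand-ins I've chosen for illustration, not anything from the talk; a real system would swap the exact scan for an ANN index like HNSW or FAISS.

```python
# Minimal sketch: embed unstructured text, then find nearest neighbors.
# Assumes the sentence-transformers package; "all-MiniLM-L6-v2" is just
# one off-the-shelf Hugging Face model, chosen arbitrarily.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Redis supports vector similarity search via the search module.",
    "HNSW is a popular approximate nearest neighbor algorithm.",
    "Embeddings map unstructured data into a vector space.",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)

def search(query: str, k: int = 2):
    # Exact (brute-force) KNN: on normalized vectors, cosine similarity
    # reduces to a dot product. Real systems use HNSW/FAISS instead.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

print(search("how do I do nearest neighbor search?"))
```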

Now, what changed from the previous generation, FAISS and a lot of the other vector search libraries, to vector databases is that you got full CRUD support, high-speed transactions, redundancy, persistence, all these other things that databases like Redis have had for years. So I'll take a quick detour to talk about Redis, because I told you I would. Everybody here is familiar with Redis; raise your hand if you've used Redis at least once. Yeah, come on, it's everybody. Everybody's familiar with the open source core. You've used Redis for caching, you've used Redis as a message store, et cetera, and you've used the data structures: bitmaps, hashes, et cetera.

You might not know that you can also store JSON and probabilistic data structures. And see, this is where I start to lose people, I can see it. You might not know that you can also use it as a document database. And you might not know that we brought a lot of those client libraries in-house, so we maintain them now and can make sure that code is of high quality. We also released a new version of RedisInsight that you should check out; honestly, that one's free. If you're using Redis, any variation of Redis, it's very good. Then we added more. Recently we even added streaming data integration, RDI. Very cool, but I don't have time to talk about it.

Query and search is the one I'm talking about today. Essentially, this is a module for Redis that lets you create a secondary index structure on top of all of your JSON or hash data inside Redis, which you can then query at the speed of Redis, the same single-digit-millisecond transactions you're used to. You can now perform a vector search in those same single-digit-millisecond times. And then lastly, you've got all of the enterprise goodies. A lot of people think of Redis as: oh, it's a cache, it's not persistent, it's not a database. It's better than that when you get to the enterprise database, and that's the difference. There's a huge separation between open source Redis and Redis Enterprise. I like to spell this out because people go download the open source and say, "hey, the search command's not working; what is this guy talking about?" Well, it's because it's in Redis Stack and Redis Enterprise. So I want to be very explicit that the vector search capabilities I'm going to talk about today are in the enterprise version. It's also not ElastiCache, we love you, AWS, but it's not ElastiCache, which is another very common misconception. OK?
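As a rough sketch of what that looks like with the redis-py client (the index name, field names, key prefix, and dimensions here are all made up for illustration):

```python
# Sketch: create a secondary index with a vector field over Redis hashes,
# then run a KNN vector query. Index/field names and DIM are illustrative.
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)
DIM = 384  # must match your embedding model's output size

r.ft("docs").create_index(
    [
        TextField("content"),
        VectorField(
            "embedding",
            "HNSW",  # approximate nearest neighbor algorithm
            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Store a document: a plain HSET, with the embedding as raw float32 bytes.
vec = np.random.rand(DIM).astype(np.float32)  # stand-in for a real embedding
r.hset("doc:1", mapping={"content": "Redis vector search demo",
                         "embedding": vec.tobytes()})

# KNN query: the top 3 nearest documents to a query vector.
q = (
    Query("*=>[KNN 3 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("content", "score")
    .dialect(2)
)
results = r.ft("docs").search(q, query_params={"vec": vec.tobytes()})
print([doc.content for doc in results.docs])
```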

All of that is Redis Enterprise, as I mentioned. Now, back to vector databases for LLMs. There are three different topics I'll talk about today, and one interesting thing is that Redis can be used for all of them; you'll see at the end why it's actually really good for this. First, retrieval-augmented generation, which you've heard about, or at least most of you have; I'll explain it in detail. Second, conversational memory, which is turning out to be a really big one, because chat interfaces are one of the most popular ways right now to expose this kind of asset. With chat as the dominant medium, you need to remember the last 10 messages the user sent, but you also need to remember the last 10 relevant messages the user sent, and that's a very different type of search. Third, semantic caching, which I'll talk about. Very cool as well. OK.
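Here's a rough sketch of that two-part memory idea: a recency window plus a semantic search over older turns. Everything in it (the toy embed function, the window sizes, the in-memory list) is a stand-in for illustration, not a specific Redis API.

```python
# Sketch: conversational memory = recent messages + semantically relevant ones.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model call; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.random(384).astype(np.float32)
    return v / np.linalg.norm(v)

history: list[dict] = []  # each entry: {"text": ..., "vec": ...}

def remember(message: str) -> None:
    history.append({"text": message, "vec": embed(message)})

def build_context(query: str, n_recent: int = 10, n_relevant: int = 10):
    recent = history[-n_recent:]
    older = history[:-n_recent]
    qv = embed(query)
    # Rank older turns by cosine similarity to the current query.
    ranked = sorted(older, key=lambda m: float(m["vec"] @ qv), reverse=True)
    relevant = ranked[:n_relevant]
    # Feed both sets into the prompt: what was just said, plus what matters.
    return [m["text"] for m in relevant] + [m["text"] for m in recent]
```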

So what is retrieval-augmented generation? I don't like a lot of people's diagrams for this: they have a graph that goes from here to there, over to this area here, and I'll show you one that looks kind of like that later, but since it's mine, I can diss it. Retrieval-augmented generation is very simple; it is exactly what it sounds like. You have a query. You take that query, or some variation of it if you're doing something like hypothetical document embeddings, create an embedding from it, and use it to look up relevant information with vector search. Vector search is essentially a process of asking how similar one vector is to another; technically it measures how far apart they are, and you take the inverse of the distance to get the similarity. How close is one vector to another? That tells you how similar things are. So with that user query, you can say: I know these three pieces of information are very relevant. And when the large language model gets the prompt you've written, these are instruction-based models, sitting in the middle of that instruction are three relevant pieces of information. In this case it's documentation on Redis, but you can imagine it's your PDF, your company's FAQ, your documentation, anything that's unstructured data. Again, this is why it's the Swiss Army knife of the machine learning toolkit: embeddings are incredibly versatile. As models become more multimodal, this is only going to become more of a dominant paradigm, because embeddings can handle any type of unstructured data, and as some other things that are coming out soon land, retrieval-augmented generation could become even more powerful. All right.
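Put together, the loop looks something like this sketch. The three helper functions are toy stand-ins I've added so the example runs; in practice you'd swap in a real embedding model, the Redis KNN query shown earlier, and your LLM endpoint.

```python
# Sketch of the RAG loop: embed the query, retrieve, stuff the prompt, generate.
# The three helpers are stand-ins: replace with a real embedding model, a
# vector database query, and your LLM endpoint (OpenAI, Bedrock, self-hosted).

def embed(text: str) -> list[float]:
    return [float(ord(c)) for c in text[:8]]  # toy stand-in embedding

def vector_search(qvec: list[float], k: int = 3) -> list[str]:
    return ["chunk one", "chunk two", "chunk three"][:k]  # toy stand-in

def call_llm(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-char prompt)"  # toy stand-in

def answer(query: str) -> str:
    qvec = embed(query)                # 1. embed the query (or a HyDE variant)
    chunks = vector_search(qvec, k=3)  # 2. retrieve the top-k relevant chunks
    context = "\n\n".join(chunks)      # 3. place them inside the instruction
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)            # 4. generate

print(answer("What data structures does Redis support?"))
```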

So I'll dive in here. Why retrieval-augmented generation? We talked about some of those costs and challenges earlier. How does retrieval-augmented generation actually alleviate them?

Well, one, it's faster. How long does it take to fine-tune a model? Even one epoch could take hours. How long does it take to insert a hash into Redis? About two milliseconds or less. Even a JSON document, which you can also use with vector search, takes about four milliseconds. So if you're running a retrieval-augmented generation paradigm and you need that knowledge base to be updated rapidly, it works. Say a user asks: what's the rate on this bond right now? Can that information be stale? No, that needs to be updated in real time. Say you're a factory and you have sensors all around the floor: hey, what's the status of my system? Can that be stale? No, it has to be fresh and fast. So there are retrieval-augmented generation paradigms that rely on an external knowledge base that updates in real time, and that is one major benefit of this paradigm.
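That real-time freshness is just an ordinary write plus a re-embed, as in this sketch. The key layout reuses the "doc:" prefix from the index example above; the sensor scenario and the embed stub are illustrative stand-ins.

```python
# Sketch: keeping the RAG knowledge base fresh in real time.
# Reuses the "doc:" hash layout from the index example; the sensor
# reading and the embed() stub are illustrative.
import time
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def embed(text: str) -> bytes:
    # Stand-in: replace with a real embedding model; must match index DIM.
    return np.random.rand(384).astype(np.float32).tobytes()

def update_sensor_doc(sensor_id: str, status: str) -> None:
    text = f"Sensor {sensor_id} status: {status}"
    # A single HSET (milliseconds) makes the new fact immediately
    # retrievable by the next vector search; no retraining required.
    r.hset(f"doc:sensor:{sensor_id}", mapping={
        "content": text,
        "embedding": embed(text),
        "updated_at": int(time.time()),
    })

update_sensor_doc("A7", "temperature nominal, vibration elevated")
```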

It's also cheaper. Fine-tuning is very expensive if you look at the ROI justification over time: you're fine-tuning, then fine-tuning again, and again. Hosting a dedicated knowledge base, even one that grows over time, is going to be cheaper. Trust me, I've done a lot of fine-tuning. It gets super expensive over time if you try to maintain a system that way.

Then there's the sensitive data we talked about earlier. What if you need to segment access? You need ACLs: these users can see this, those users can see that, these are external users. What do you do? You can't fine-tune for that. Fine-tuning is really better for behavioral changes anyway, "I want my model to talk like this"; it's not better for contextual information. All of these things together make RAG, retrieval-augmented generation, a really powerful paradigm, and it's used a lot and obviously talked about a lot here. I won't keep hammering every point home, but if you have more questions on RAG, come find me at the booth later. I want to get to semantic caching, though, because it's something a lot of people don't know about. OK.
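With retrieval, that segmentation can live in the query itself. As a hedged sketch, a Redis hybrid query can apply a tag filter before the KNN step, so internal chunks never reach an external user's prompt. The `audience` tag field is made up for illustration; it would need to be declared on the index alongside the vector field.

```python
# Sketch: ACL-style filtering via a hybrid query -- tag filter + KNN.
# Assumes the "docs" index from earlier also defines a TAG field named
# "audience" (e.g., "internal" or "external"); that field is illustrative.
from redis.commands.search.query import Query

def search_for_user(r, qvec_bytes: bytes, external_user: bool, k: int = 3):
    # External users may only retrieve externally approved content.
    allowed = "external" if external_user else "internal|external"
    q = (
        Query(f"(@audience:{{{allowed}}})=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    return r.ft("docs").search(q, query_params={"vec": qvec_bytes})
```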

So what is semantic caching, in broad strokes? Who basically knows how caching works? Right, it's one-to-one: you take one thing, put it in the database; if the same thing comes in, you get the same thing back. That speeds up websites, because Redis is faster than Postgres or MySQL or what have you. OK, same paradigm, except: what if it's a semantically similar question and not the exact same question? What if the question is "what is the capital of France?" and the next user asks "what really is the capital of France?" They deserve the same answer. Should you pay for that LLM invocation twice? No, you don't want to do that. I don't want you to do that. Why don't we save you some money? It's the same paradigm caching has had throughout computer science, except now you can give it a threshold: say, 98% similar. Say you have an FAQ; you know most of the answers you want to give, so pre-populate it. You pay for that LLM one time through your whole FAQ; why would you keep paying for it? You reduce hallucinations too. Stop paying for that. And if you have something that updates in real time, you can use the eviction strategies, the same things you've been using Redis caching for, for years, like time-to-live: say I don't want these responses to be older than an hour. You can use those same paradigms, now with vector search and the ability to set a semantic threshold, or even a visual threshold. As I mentioned, this works on any type of unstructured data; it doesn't have to be text. How similar is this image? Say you're Midjourney and you're generating images: has a user put in a prompt like this before? Why generate that image again? Send it back. Or image similarity, say you're doing some kind of upscaling: if a user sent one like that before, send back the same image. If it's within the threshold, everything at 98% or higher similarity just gets sent back to the user. We've already seen this justify its own cost multiple times over; as I said, I've set up a ton of these systems this year. It justifies its own cost even if you're paying us hundreds of thousands of dollars. If you have a system that's expensive in terms of LLM invocations, it's going to pay for itself. So I'll give you an example.
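The mechanics fit in a few lines. Here's a minimal sketch of the idea with an in-memory store and cosine similarity; a production version would keep the vectors in a vector database and add TTL-based eviction. The embed stub is a stand-in I've added so the code runs.

```python
# Sketch: a semantic cache -- return a cached answer when a new prompt is
# similar enough (above a threshold) to one we've already paid to answer.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.random(384).astype(np.float32)
    return v / np.linalg.norm(v)

class SemanticCacheSketch:
    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold      # e.g., 98% cosine similarity
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        qv = embed(prompt)
        for vec, response in self.entries:
            if float(vec @ qv) >= self.threshold:
                return response         # cache hit: skip the LLM call
        return None                     # cache miss: caller invokes the LLM

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCacheSketch()
cache.put("What is the capital of France?", "Paris.")
# With a real embedding model, the rephrasing below lands above the
# threshold and returns "Paris." (the toy embed() stub here won't).
print(cache.get("What really is the capital of France?"))
```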

OK. We have a library called RedisVL. My team builds it; it's a purpose-built client library for using Redis as a vector database. There's a link right here; we host our doc links, so if it doesn't automatically redirect, let me know. But the basic premise is right here, what I was just talking about. We call it an LLM cache, but really it's a semantic cache; you can do both types of caching. You can do your traditional hash caching, same question, same answer, or you can give it a semantic threshold. This could not be simpler. You can put it in a decorator, you can put it on any function, you can put it on FastAPI routes, on Flask, et cetera. Every single time a user asks the same thing within that threshold, you save yourself some money and make your API faster. That's the bottom line. It's really a no-brainer; even if you only use, say, 200 megabytes of space and set your threshold at, say, 98.9% similarity, I promise you, you're still saving on LLM costs. You know how many times users hit the same question twice? A lot. We built one for financial analysts, and they say the same things a lot, and their questions are very similar. So help yourself; check it out.
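For reference, usage looks roughly like this sketch of RedisVL's semantic cache extension. Treat the exact import path, class name, and parameters as approximate, since the library's API has evolved; check the current RedisVL docs. Note the threshold here is a vector distance, so smaller means stricter.

```python
# Sketch of RedisVL's semantic cache (API names approximate -- check the
# current RedisVL docs). distance_threshold is a distance, not a
# similarity: a smaller value demands closer matches for a cache hit.
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llmcache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
)

def call_llm(prompt: str) -> str:
    return "(model output)"  # stand-in for your actual model endpoint

def ask(prompt: str) -> str:
    if hits := cache.check(prompt=prompt):
        return hits[0]["response"]     # semantic cache hit: no LLM call
    response = call_llm(prompt)        # miss: pay for the invocation once
    cache.store(prompt=prompt, response=response)
    return response
```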

All right, this is a fun one. Who here is familiar with arXiv.org? Reads a lot of papers? I do. It's a site for scientific papers; I spend a lot of my days there checking out new stuff. It's almost impossible to keep up with the fire hose of things that come out of arXiv. This is a demo I created with Harrison Chase at LangChain. If you're not familiar with LangChain, it's a framework for helping you build these types of RAG applications. Harrison and I put this together because we were curious: could I create a research assistant where I put in a topic, it searches arXiv with regular BM25 text search, grabs the top 10 papers, automatically chunks them up, indexes them into a Redis database, and then lets me ask questions and talk to a guru on that topic? We spent a while on this, and it's really interesting. It does better with some topics than others, but this covers all of arXiv; obviously, if you're a company, you should optimize for a certain area, a certain topic. This is just an overall test of how it does in a super broad setting. It has settings to show you: I want this context window length, I want this many documents, I want this many chunks per document, and it lets you change those controls to help you learn how to do this better. I use it as a learning tool. So go check it out; I'll put the link up there. The actual platform, the GUI, is a Streamlit app that will help you learn how a lot of this works. And yes, it does use semantic caching, and you'll see how much faster it is when you ask a similar question; you can change that distance threshold as well.
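Chunking is the one step in that pipeline I didn't unpack, so here's a minimal sketch of the usual fixed-size, overlapping approach. The sizes are arbitrary defaults I've picked, not what the demo uses.

```python
# Sketch: fixed-size chunking with overlap, the simplest common strategy
# for splitting a paper before embedding. Sizes are illustrative defaults.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = size - overlap           # overlap keeps context across boundaries
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

paper = "..." * 2000  # stand-in for extracted paper text
pieces = chunk(paper)
print(len(pieces), "chunks; each would be embedded and stored as doc:<i>")
```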

The last thing I'm going to talk about, because I've got a minute thirty left, is Amazon Bedrock. Obviously, we're here at re:Invent, right? You've got to talk a little bit about AWS. Bedrock is really cool. It lets you just click through and deploy Redis via the Marketplace; say you have a big commit with Amazon, like most people here, you can just go in and click, click. We worked with the Bedrock team to make this as easy as possible, and you can use your Redis database with Bedrock, which will handle a lot of the stuff I've talked about today for you. Obviously, I couldn't go too deep technically into RAG and things like chunking, but if you have questions at that level of technical depth, please feel free to come talk to me. Bedrock is really interesting and I highly recommend you check it out. I wish I had a bit more time to talk about it, but you can see a general representation of what I talked about today, and how you would do it with Bedrock, right here.
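If you'd rather wire it yourself, the generation step from the earlier RAG sketch maps onto a Bedrock call roughly like this. The model ID and request body follow the Anthropic-on-Bedrock text-completion format from around late 2023; verify against the current Bedrock docs, since model IDs and body schemas change.

```python
# Sketch: calling an LLM on Amazon Bedrock as the generation step of RAG.
# Model ID and body schema match the Claude v2 text-completion format from
# late 2023 -- check current Bedrock docs before relying on it.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_llm(prompt: str) -> str:
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 512,
    })
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(resp["body"].read())["completion"]
```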

All right. So lastly, I'll show you RedisVL; that's the client library I was talking about, with the semantic caching and whatnot. And this is my team's GitHub organization, with all of our examples. We build a bunch of them for people. Not all of them are, you know, great, but they're easy; they're meant to be walkthroughs. So check them out, and if you have questions, hit us up on GitHub, file issues, we love that stuff. And thank you so much. That's my Twitter if you want to see me talk about RAG more often. Yeah, thanks.
