Best practices for querying vector data for gen AI apps in PostgreSQL

Good afternoon, everyone. Welcome to the Mandalay Bay.

I'm Jonathan Katz. Today we'll talk about a topic that is very near and dear to my heart - vector search and retrieval.

Before we get started, this is a 400 level talk. We're going to dive really deep in this stuff. I'm certainly happy to discuss further. We have a lot of content in this talk. It's probably going to take the full amount of time.

I want to remark that we have two chalk talks after this - one on Wednesday, one on Friday. They will be a great forum for Q&A around this. But we do have a lot of content to cover and I can't wait to get started.

Here's how we'll do this:

First, we'll talk about what's driving the need for vector search and retrieval in databases. This topic is generative AI.

Then, we'll look specifically at Postgres today and Postgres as a vector store and the place for you to perform this search and retrieval.

From there, we'll look at the open source pgvector extension, which enables these vector searches in Postgres, as well as some features in Amazon Aurora that help accelerate vector queries.

Finally, I'll look ahead at the open source roadmap for pgvector.

To begin, what is driving this problem? When I wrote this slide, I was in my in-laws' house in Florida. It had a very Floridian theme - alligators, birds, floral decorations.

I thought about having a store selling Florida tchotchkes. You have all these product descriptions, and I'm trying to build an app around them. I see all these things happening with generative AI, and there's probably some way I can take some of that magic and build an interactive app to buy Florida tchotchkes online.

But how do I build that magic? What is that thing sitting in between building this interactive store?

The primary component is a foundational model. A foundational model is a machine learning model trained on vast amounts of publicly available data - think the entire internet.

These models take a lot of computational power and time to build. When you're done, you can query them through various interfaces and get humanlike answers as part of the application.

Even better - if you have existing information, you can use it to augment these foundational models. This is where our discussion today around vector search and retrieval comes in.

Here at AWS, we make it possible to use foundational models through Amazon Bedrock, an AI service. With Bedrock, you have a choice of foundational models. You can tie it directly to data in Amazon Aurora, Amazon RDS, or other data stores. It can work securely in your own VPC.

There are plenty of talks this year about using Bedrock. I encourage you to attend those.

Today we'll focus on the search and retrieval aspect. A primary technique for doing this is retrieval augmented generation. How many have heard of retrieval augmented generation? Cool - you're in the right place!

Briefly, foundational models train on publicly available information. But you likely have private data in your database that you can use to augment the model's response.

Let's go back to our Florida tchotchke store. I want to buy a blue elephant vase. If I just ask that of a foundational model, it won't know the cost since it wasn't trained on my data.

But if I have a knowledge base - a product catalog and price info in an Amazon Aurora database - then I know a blue elephant vase costs $20. That's the power of retrieval augmented generation - in a few steps, you can deliver that AI experience without building your own foundational model.

We need a common interface between our database and the foundational model. There are many data types - text, images, video. We need a mechanism to communicate between all parties in the AI infrastructure.

This is where the vector data type comes in. Vectors have 200 years of mathematical history, yet they have interesting properties we'll dive into today.

A vector is the mathematical representation of data that comes out of an embedding model, which takes raw data and turns it into a vector. It can be used for similarity search, which we'll demo soon.

Here's a typical retrieval augmented generation workflow:

First, take an unstructured data document and break it into chunks. This is the science and art of retrieval augmented generation - taking raw data and getting the right chunks.

Send those chunks to an embedding model like Amazon Titan. It turns them into vectors.

Store those vectors in a database for later use. This is the first step - getting existing data into vectors stored alongside chunks that will ultimately feed the foundational model.

The second part is what the user interacts with. A user asks a question, it goes to the embedding model, turning it into a vector.

Use that vector to query your data store like Amazon Aurora. Perform a similarity search to find the most similar vectors.

Take the question and context you want the foundational model to answer. Send that to the model and get the response to send back to the user.

That's retrieval augmented generation - a simple workflow with many steps working with vectors.
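
To make that workflow concrete, here's a minimal SQL sketch of the storage and retrieval steps using pgvector. The table name and the tiny three-dimensional vectors are hypothetical placeholders to keep the example runnable - a real Titan embedding would be 1536 dimensions, and the query vector would come from your embedding model at runtime.

```sql
-- Hypothetical schema: store each text chunk next to its embedding.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,   -- the raw chunk that will feed the foundational model
    embedding vector(3)        -- 3 dims for illustration; Titan would be vector(1536)
);

-- Step 1: store the vectors returned by the embedding model alongside the chunks.
INSERT INTO document_chunks (content, embedding) VALUES
    ('Blue elephant vase, $20',  '[0.11, 0.92, 0.30]'),
    ('Pink flamingo lamp, $35',  '[0.85, 0.10, 0.47]');

-- Step 2: at query time, embed the user question and retrieve the closest chunks.
SELECT content
FROM document_chunks
ORDER BY embedding <=> '[0.10, 0.90, 0.28]'   -- cosine distance to the question embedding
LIMIT 5;
```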

What's fascinating about vectors is they're so simple - just dimensions and a size - yet challenging.

First, generating embeddings takes time. If you're querying against a million embeddings, you can't generate them in real time. You need to store them in a database.

Consider the size. Amazon Titan generates 1536 dimensional embeddings - 4 byte floats. That's a 6 kilobyte payload. A million embeddings is almost 6 gigabytes uncompressed.

Naturally you think - let's compress it. But this data is hard to compress - just random floats. Quantization helps reduce size but loses information.

You need to query the vectors by comparing them with a distance operation. You have to compare every dimension - a challenge with 1,500 dimensions. And you compare against every vector - an N-squared algorithmic problem.

We need to efficiently compare them. We can't shortcut comparing dimensions, but maybe we can shortcut comparing all vectors in a database.

This is where approximate nearest neighbor comes in. Exact nearest neighbor queries every stored vector, which is exhaustive.

Approximate nearest neighbor gets the most similar vectors without querying them all. This reduces computation time - it's faster than exact nearest neighbor search.

We could get into some nuance there. But the point is that you are not looking at every single vector in your data set. So suddenly you need to be aware of something called recall, which is the measurement of relevancy.

So for example, if I expect to get 10 vectors - if I know my exact nearest neighbors are supposed to be these 10 - but the approximate nearest neighbor search returned eight of the 10 expected set, that means I have a recall of 80%.

What does this mean practically? It might mean nothing. You might build an approximate nearest neighbor index and everything appears to work as expected. Or it could mean everything. It could mean that you're starting to see results you don't expect in your application.

And there are different ways to figure out how to adjust your recall. It could be something with your embedding model. It could be something with your vector index. It could be something with how you're searching the data.

But this is something that you have to keep in mind when you're working with vector databases and all these things are in tension. There's a tension between storage, performance, relevancy and cost. And based upon what you want and what's important for your application, you have to choose what matters most for you and you need to make that investment.

So you might decide relevancy is more important for me, but storage is also important - so is the price of my storage - and I'm willing to pay more for compute to make sure I can get better relevancy, though it might sacrifice a little bit on performance. Maybe my query goes from 100 milliseconds to 200 milliseconds, and that's not so bad.

But all these things are in tension, and that's what you have to consider when you ultimately decide: how do you store your vector data? How do you query your vector data? And there are a few things you want to ask yourself as you're designing the vector storage as part of your system.

And one of them is: do I even need vector storage? Where does this fit into my workflow? Am I doing something like retrieval augmented generation, or am I working with a foundational model directly where I don't necessarily need to augment things?

And once you understand that - once you decide, OK, I do need something to store my vectors in - how much data are you storing? That can impact what kind of storage you choose. If you're storing trillions of vectors, something of that scale, you might not want to have them on an EBS volume; you probably want to have them in something like S3.

So once you understand the scope of how much data you're storing, then you need to figure out where you need to invest. Is it in storage? Do you want fast storage, or is this something small enough that everything can fit in memory, or something where performance really matters and you still want everything to fit into memory? How much are you willing to spend for that?

And ultimately, what's your relevancy threshold? Once you figure out those four parameters, you can figure out your tradeoffs, and this is where you shape your design.

What kind of index do you use? How does that impact your query time? What's your acceptable query time? That might ultimately impact the design.

So this gets into Postgres itself as a vector store. And the first thing is: why Postgres for anything? Well, Postgres is an open source database with a long history of development. I've been fortunate to be a part of the Postgres community for over a decade now; I'm actually on the core team of the Postgres project.

And I've seen through these years that a lot of what's happened with Postgres is that it's been able to support robust enterprise workloads - from a functionality standpoint, a robustness standpoint, and a reliability standpoint.

But one of the most powerful parts of Postgres is its extensibility - the ability to add functionality to Postgres without having to fork it. That's through something called an extension, and this is where the notion of being able to do vector searches in Postgres comes in.

So first, why even use Postgres for vector searches? Because it's there - and there are actually reasons why you want to do that. For one thing, you might not need to update your existing client libraries at all. You can add vector search capabilities to your application without doing much work. I'm a recovering app developer, and that's something that always appealed to me: I'd try out some new technology and find out it works with Postgres or something else in my stack. Cool - I'm just going to keep building with that, because I don't have to do as much work.

You might want to keep your AI/ML data close to a lot of your transactional data. For the example we had earlier, where I already have a Florida tchotchke store - all the data in my product catalog is already there. It's not a big leap to keep my vector embeddings and my text chunks right next to it. And Postgres can work as the persistent transactional store alongside other vector systems as well.

A common pattern is that Postgres stores your transactional data and acts as the permanent hub for things that live in, say, a text search database like OpenSearch. And this brings us to pgvector.

How many folks here are familiar with pgvector? Cool. For those of you who are not, pgvector is an open source extension that adds vector search and storage capabilities to Postgres. It's a project that's been around since 2021, and it rose in popularity rapidly this year, in large part because it enabled vector search and retrieval in Postgres.

Now, fun fact: Postgres has actually supported a vector data type since the beginning in 1985. It was actually used to help speed up access control lookups within Postgres, but it was lacking some of the functionality needed for modern vector search, such as indexing. pgvector brings two types of indexes, IVFFlat and HNSW, which we'll spend a lot of time discussing today because they impact how you search and store your vectors.

It also brings the ability to do not only exact nearest neighbor search but also the approximate nearest neighbor search we discussed. And this is an important distinction, by the way, because all of the nearest neighbor type searches Postgres supported prior to pgvector were exact nearest neighbor searches. Talk to people who've used relational databases for years and it's: yes, of course it's exact - when I run a query, I expect to get the exact results out.

So working with approximate nearest neighbor searches can take a little bit of readjusting if you're used to getting those exact answers.

Postgres also acts as a metadata store. As mentioned, you can colocate your embeddings with it, and there's a choice of distance operators. The two most popular tend to be the first two listed - they look kind of like Star Wars TIE fighters. There's Euclidean or L2 distance, which is line-of-sight distance, and there's cosine distance, the one in the middle; the simplest way to describe it is angular distance.
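
For reference, here's what those operators look like in pgvector syntax; the three-dimensional literals are made up purely to show the operators.

```sql
-- Euclidean / L2 distance ("line of sight"):
SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance;

-- Cosine distance (angular distance):
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;

-- pgvector also provides a negative inner product operator:
SELECT '[1,2,3]'::vector <#> '[4,5,6]'::vector AS negative_inner_product;
```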

Now, a lot of your decision for how you store your vector data will come down to your indexing methods. So we're going to dive right in, briefly compare IVFFlat and HNSW, and then do a deep dive into both.

So what are these methods? IVFFlat is an inverted file index with flat storage. HNSW stands for hierarchical navigable small world. Say that 10 times fast.

Now, what are these and what are the differences? Let's do a high level overview.

IVFFlat is k-means based. The idea is you have a bunch of vectors in space, you find a certain number of centers, and you cluster vectors around each center. For IVFFlat, those are called lists, or buckets. When you build the index, you take each vector and assign it to a list. The idea is that you're going to look through these lists to find the vectors most similar to the one you're searching for. And don't worry, we're going to dive into the depth of what that looks like.

HNSW is graph based. The idea is that you traverse a graph until you get into a neighborhood that's most similar to what you want. With HNSW, you do a lot of work up front to build these dense neighborhoods of vectors, so that when you search, you don't have to do as much work - you're basically finding the neighborhood of vectors you're most similar to.

Now, with IVFFlat, you need to have your data already in your table to build the index. You can't figure out the centers unless you have data to figure out the centers from. Whereas HNSW is iterative: you can basically start with an empty table and start adding vectors into it, and it will build out that graph with a high degree of relevancy.

And finally, when you think about insert time, the IVFFlat insert time is bounded by the number of lists. If you have 10 lists, that's roughly the amount of work it takes to place a vector. If you have 1,000 lists, you have to search 1,000 of them to figure out where to insert the vector.

With HNSW, the insertion time increases as you add more vectors to the index - kind of similar to how a B-tree index works in that way.

Now, before diving in, the first thing you might be thinking is: OK, cool, how do we choose which one?

So let me start with the easiest one. If you need exact answers - you want 100% recall - you don't use an index. You might say, well, Jonathan, I'm sure I can use an index and get 100% recall. And yes, you probably can do that. But if you need to guarantee 100% recall, you don't use an index.

Now, if you want fast indexing, you're going to use IVFFlat. We're going to see that overall, IVFFlat is a much faster indexing method, but there are tradeoffs in terms of what you have to do to get the right recall/performance ratio.

If you want an index that's as close to set-and-forget as possible, HNSW tends to be that - the defaults in pgvector seem to be pretty good for the use cases we've seen.

Now again, I still give that an asterisk. As you dive into it, depending on your data set and your embedding model, you might need to tweak the build parameters and the search parameters. But the idea is that this gets as close to set-and-forget as possible - app developers love this because there's not as much work they have to do compared to IVFFlat.

And where HNSW really shines is that it's both high performance and high recall. This is why folks have been adopting it: they're basically able to manage it fairly easily and get the performance numbers they want with the relevancy they want.

So now let's really dive in. We're going to talk about strategies for managing vectors with pgvector and look at some of the best practices established to date. There are really four areas we want to look at.

First, how do you store the vectors? How you store the vectors does impact how you can query them from Postgres. We'll look at strategies for both of the different indexes, and we'll look at a topic that's become very popular lately, which is filtering. Filtering, quite simply, is the WHERE clause - keep that in the back of your mind as we go through this.

So first, storage. Storage matters in Postgres because once you start dealing with large data, storage might not be as intuitive as you think.

So, Postgres has something called TOAST. How many people are familiar with TOAST? I like to have toast every day in the morning - no joke, no exaggeration. But TOAST is actually a novel system in Postgres for dealing with large data.

The atomic unit of storage in Postgres is something called a page, and a page can basically store about eight kilobytes of data. Now, you can recompile Postgres and get a larger page size or a smaller page size, but most installations are going to use this eight kilobyte page.

So what happens if you have something that doesn't fit on an eight kilobyte page? You need a way to store it.

And again, this is something that's fairly common in Postgres. I might be storing some large text data, or I might store a blob in Postgres that's way larger than eight kilobytes. You need a way to store that, and that's where TOAST comes in. TOAST stores that data out of line from your main table - effectively, it can store it across multiple pages, so to speak. Because of this, it gives an effective way of storing large data, and it can have benefits on queries as well: if you have a large text blob that you're not querying very often, you don't necessarily want to be pulling it into memory, so storing it out of line helps improve query speed.

And because of that, Postgres uses TOAST by default for anything over two kilobytes. That's a tunable parameter - you can set it on a per-table basis, and I believe even a per-column basis. But again, those defaults were decided particularly for things like large text data and blobs. So roughly a 510-dimensional four-byte float vector is automatically going to be TOASTed. That means the Amazon Titan embedding - you store that in Postgres and it's automatically stored out of line, away from the table.

Why does this matter? Well, actually, before we get to why this matters, the other thing you need to be aware of is Postgres column storage types. The first is PLAIN: if you set a specific column to PLAIN, the data is stored in line with the table no matter what. The advantage is that for something where your big blob is in the hot path, such as a vector similarity search, you might see some performance benefits. But then you have a limit: you can't store more than eight kilobytes of that data.

So roughly a 2,000-dimensional vector. Then there's EXTENDED, which is the default in pgvector: you store the data out of line and compress it. You're probably thinking, wait a second, didn't you say you can't compress a vector? That's correct. And I believe in the next release of pgvector it's going to move to EXTERNAL, which is: you TOAST the vector, but you don't try to compress it. There should be a bit of a performance benefit, particularly on inserts, with that method.

And finally, there's MAIN, which is that you compress the data and store it in line. For pgvector, the ones you're going to be most interested in, moving forward, are EXTERNAL and PLAIN. So the storage type does have an impact.
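
As a rough sketch of how you'd switch a vector column between these storage types - the table and column names carry over from the earlier hypothetical example, and the change only applies to newly written rows:

```sql
-- Keep the vector in line with the row (caps you at what fits on an 8 KB page,
-- roughly a 2,000-dimensional four-byte float vector).
ALTER TABLE document_chunks ALTER COLUMN embedding SET STORAGE PLAIN;

-- Or: TOAST the vector out of line but skip the compression attempt,
-- since vector data doesn't compress well anyway.
ALTER TABLE document_chunks ALTER COLUMN embedding SET STORAGE EXTERNAL;
```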

So let's take this Postgres query plan. I stored a million vectors of 128 dimensions and I'm doing an exact nearest neighbor search, and Postgres says: cool, I'm going to use six parallel workers to pull it in. OK, that makes sense, right? Then I greatly expand the vector - I'm using 1,500 dimensions - and Postgres says: OK, cool, I'm going to use four workers.

Wait a second. That's fewer workers, but I know that's more data - at least 10 times the size. Why are you planning fewer workers? It's because we're TOASTing the table: the Postgres query planner, as of right now, sees fewer pages in the main table and says, OK, I don't need to do as much work - when in fact we have to do more work, because we have to go to the TOAST table to pull those pages in first and then do the similarity comparison.

So storage matters here. We need to make sure that when you're storing large vectors, particularly from foundational models that give you large vector embeddings, you understand how the data is stored. You can choose to use PLAIN storage, which again ultimately caps how large a vector you can store within a single Postgres page. Or there's a parameter called min_parallel_table_scan_size. If you take away only one thing here: if you set that to a very low value, Postgres will spawn more parallel workers. So that's one way to have your cake and eat it too - you can still TOAST those vectors but get the benefit of more parallel workers.

And here's that same query with that parameter set to one, and we're getting 11 workers, which makes a lot more sense because we are pulling a lot more data into memory. So that's some guidance around storage. Now we can move into indexing, and we're going to start with HNSW, given its popularity and the performance results we've been seeing around it.
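
In a session, that tweak looks something like this; the table is the hypothetical one from earlier, and the worker count you actually get still depends on settings like max_parallel_workers_per_gather and your instance size.

```sql
-- Tell the planner that even a "small looking" main table is worth parallelizing,
-- which compensates for the TOASTed vectors it can't see.
SET min_parallel_table_scan_size = 1;

EXPLAIN ANALYZE
SELECT id
FROM document_chunks
ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 1)
LIMIT 10;
```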

Now, HNSW does have a few more knobs you can turn to configure it, but it tends to be a little bit easier overall than IVFFlat, which you'll see in a few slides. A lot of the work of HNSW is done up front in the index building process, and there are two parameters you need to be aware of: m and ef_construction. m is the number of bidirectional links between vectors, and the takeaway is that the more bidirectional links you have, the more likely you are to build these neighborhoods of vectors that are most similar to each other. We'll see that there's a cost to doing that in a few slides.

Then you have ef_construction. Think of ef_construction as your memory as you're building the index: as you visit all these vectors in the graph, you keep a list of the vectors you're most similar to. The idea is that the bigger the list you keep, the more likely you are to find the most similar vectors as you build out the graph.

Now, what does it look like to build an HNSW index? I have a bunch of vectors in space and I want to index that orange one. HNSW works with layers: you go from less dense layers to more dense layers. At the top layer, you might find one vector most similar to you, transition down to the next layer, and build links to, say, two vectors. The layers get more dense until you reach the bottom layer, where we can maximize the number of links, up to m, to build that graph. The takeaway I want you to have from this slide is that this takes a bit of work, because you basically have to look through a bunch of vectors as you descend these layers, find the ones you're most similar to, build those links, and ultimately store them. It's a bit of a process. But we're about to see the payoff with querying.

There's only one parameter you need to worry about, at least from an HNSW standpoint, and that's ef_search. Again, that's similar to what we discussed before: it's your list of candidate vectors, and the larger the list, the more likely you'll find the vectors you're most similar to. And there's a caveat specific to Postgres: ef_search must be greater than or equal to your LIMIT. You can't have a LIMIT of 20 with an ef_search of 10, because then you're only going to get 10 results back.
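
Putting those parameters together, a minimal HNSW sketch looks like this; the values shown for m and ef_construction are pgvector's defaults, and the table is still the hypothetical one from earlier.

```sql
-- Build time: m and ef_construction are set when the index is created.
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query time: ef_search is the candidate list size (default 40).
-- Keep it >= your LIMIT, or you'll get fewer rows back than you asked for.
SET hnsw.ef_search = 100;

SELECT id, content
FROM document_chunks
ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 1)
LIMIT 10;
```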

Now, querying is very quick. Effectively, what you're doing when querying is descending down to the bottommost layer and then finding your most similar vectors. There's an entry point vector - no matter what, that's where you start. From there, you find the vector closest to you and go down to the next layer. Same process: find the vector most similar to you, go down to the next layer. In this case we're at the bottom layer, the densest layer - I wanted to give an example of what all the links might look like here - you find the vectors most similar to you, and you're done.

And if you're familiar with database internals, this should feel a lot like going down a B-tree or a similar indexing structure - because it is: it's a graph, and databases tend to be very good at dealing with these kinds of structures.

Now, just to recap real quick: the key to HNSW is that you do more work up front building the index, but the payoff is that you do less work when searching through the index, and you get the results that are most relevant to what you're looking for - though you might need to tweak it based on what you're doing. So, some best practices - and I'd say these are observed best practices: typically the default values work.

In the pgvector project upstream, we did a lot of work to determine what the default values should be. Initially we were looking at m = 16 and ef_construction = 40. We decided to bump ef_construction up to 64 - we saw a good jump in recall without that much more effort to get that level of recall.

Now, different vector databases have different implementations of HNSW. pgvector's is homegrown, based on the HNSW paper, and we found that because its implementation uses one of the later algorithms in the paper, we were able to get more relevant results with a lower value of ef_construction. So this advice doesn't necessarily apply to other systems - these numbers are specific to pgvector.

Additionally, one nice thing about the pgvector implementation is that it supports concurrent inserts - the ability to insert data from multiple processes or threads at the same time. It doesn't support parallel builds yet, though - foreshadowing: that's actually been committed and is slated for the next release. But concurrent inserts are a way to help accelerate building your HNSW index.

So here's an example of how concurrent inserts impact index build time. We kept doubling the client count, and inserting a hundred thousand 128-dimensional vectors went from taking an hour to taking close to a minute with concurrent inserts. So the current best practice for HNSW is: start from an empty table and use concurrent inserts to speed up the build.

Now, the other two parameters you need to think about when building an HNSW index are ef_construction and m, because those impact relevancy. The first piece of guidance is: before increasing or decreasing m, play with ef_construction and see how it improves relevancy. The tradeoff is that it increases index build time, but it will ultimately help push up relevancy, particularly on queries that use a lower ef_search.

The reason we say start with ef_construction before m is that increasing m does impact recall - in particular on data sets like this one, the open source GIST data set, which we found tends to be very tough for a lot of these approximate nearest neighbor algorithms, for whatever reason. Increasing m can certainly help with it, but look at the index build time: that's minutes, not seconds. It takes a lot more time, and even though we do end up getting much better results in terms of relevancy, there is a cost to it.

So again, these are the tradeoffs in terms of cost, timing, and performance that you have to work through. In conclusion with HNSW: the biggest impact on your query performance and recall is going to be the index build. You do the work up front, but the payoff is in how you query it. And as you go, note that when you're querying the data, ef_search does improve recall, but it will impact performance.

So let's get into IVFFlat. IVFFlat, as we mentioned, is a k-means based algorithm, and in this case the means are buckets, or lists. As you organize your vectors, you're organizing them by the list they're in. The tradeoff is ultimately going to be around how many vectors are in each bucket and the relevancy you get out of it.

Just to get started with this: let's say I have a bunch of vectors in space and I'm going to use three lists. What happens when you build the index is that you determine the centers of your three lists, assign each vector to a bucket, and build the links to them. Those are the lists - we're done. The parameter used to query vectors in IVFFlat is the ivfflat.probes parameter, which is one by default. Basically, when you run one of these similarity searches, you're only going to go to one list: find the list whose center you're closest to, go in, and find - in this case - your three closest vectors.

Now, what's nice is that this is actually very fast, because if I only have to search one list and all the vectors in that list - and there's a reasonable number of vectors in it - it's going to be very fast. But you might not necessarily see all the vectors that are most relevant to you.

Look at this case: if you eyeball it, it seems like a vector in the list we didn't search is actually closer to our query vector. In fact, in this example, if we set probes to two, we get that vector as well. So probes is very important for IVFFlat - it's actually the more important parameter, because it drives relevancy. But you can see there's going to be a linear cost here, because we basically have to search over more lists and over all the vectors in those lists.
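
Here's the IVFFlat equivalent as a sketch, again on the hypothetical table; lists is set at build time, and probes is raised per session or per query.

```sql
-- Build time: the table must already contain data, since k-means needs
-- existing vectors to compute the list centers.
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Query time: probes defaults to 1. Raising it searches more lists,
-- improving recall at a roughly linear cost in query time.
SET ivfflat.probes = 10;

SELECT id, content
FROM document_chunks
ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 1)
LIMIT 10;
```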

So what are our general performance strategies? First, probes is going to be the key parameter. Lists can impact things as well, because if you have fewer vectors in each list, you do less work as you search over them. But probes ultimately determines how many lists you go into, and that's what drives the balance between relevancy and performance.

Now, what's interesting is that we've seen some issues with Postgres choosing whether to use the IVFFlat index. This was particularly acute before pgvector 0.4.3 - we improved the costing there - but you might see that after a certain point, Postgres stops picking the index. You'd still get highly relevant results, but suddenly you're doing a sequential scan, and a query that was around 200 milliseconds jumps to something like five seconds, which is a huge jump.

There's a Postgres parameter called random_page_cost. This goes back to the spinning disk days, where finding a random page on disk had a heavy cost to it - so even if it was indexed, you'd pay a penalty. If you lower random_page_cost, it's more likely Postgres will pick an index over a sequential scan. So that's one way you can help tune things - help Postgres choose an IVFFlat scan as opposed to a sequential scan.
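
That tuning knob is just a session or parameter-group setting; 1.1 is a common starting point for SSD-backed storage, but treat the exact value as something to experiment with rather than a rule.

```sql
-- Lower the planner's estimate of random I/O cost so index scans look
-- more attractive than sequential scans.
SET random_page_cost = 1.1;
```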

Also, keeping more of your data in memory does help speed things up - that includes your table, that includes your index, and if you're TOASTing your data, that includes the TOASTed values staying in memory. shared_buffers is the primary parameter that affects that.

On both Aurora and RDS, we try to set shared_buffers to values that will be optimal for your workload. But you should also inspect how much of your data is actually being kept in memory.

Now again, that shared_buffers advice also applies to database workloads in general: naturally, the more of your workload you have in memory, the faster it's going to perform.

Finally, specifically with IVFFlat, you might be loading a lot of data into temporary memory - we call this work memory. If you're doing a sort and probes is very high, what might happen is that the sort spills to disk, and every time data spills to disk, it takes effort to read it back from disk in order to complete the query.

So if you set work_mem to a higher value, that data will stay in memory and it's easier to operate. But beware: you don't necessarily want to change work_mem globally to a larger value, because what might happen is you run out of memory on your system, the out-of-memory killer comes along, and weird things start happening.
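
A cautious way to do that is to raise work_mem only for the session or transaction running the vector query, rather than globally; the value here is just an illustrative guess.

```sql
-- Raise work_mem for this transaction only, so a high-probes sort can stay
-- in memory instead of spilling to disk.
BEGIN;
SET LOCAL work_mem = '256MB';

SELECT id, content
FROM document_chunks
ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 1)
LIMIT 10;
COMMIT;
```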

So be careful with this one. This is also why, in general, the best practice we've been giving is to start with HNSW indexing - it's a little bit easier to tune these things.

I've talked to customers that have deployed IVFFlat indexes into production, and they're having great success with it, but a lot of it is that they have to keep the different knobs they need to tune in mind. Additionally, as you're setting up IVFFlat indexes, how do you choose the values for how you build them?

The general guidance - this is coming from the pgvector repository - is that if you have fewer than a million vectors, take your number of vectors, divide by 1,000, and start with that as your number of lists. If it's over a million vectors, take the square root of the number of vectors.
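
If you want to compute that guidance directly from your table, a hypothetical helper query might look like this:

```sql
-- rows / 1000 lists under a million rows, sqrt(rows) above that.
SELECT CASE
         WHEN count(*) < 1000000 THEN greatest(count(*) / 1000, 1)
         ELSE round(sqrt(count(*)))::bigint
       END AS suggested_lists
FROM document_chunks;
```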

The other thing to keep in mind is that you might need to rebuild your index as you continue to add or remove vectors. The reason is that your centers start to skew - remember, when we build an IVFFlat index, we're taking a snapshot of the vectors that are already in the table.

Now, if you add more vectors, what happens is that what were your centers are no longer your centers - they shift, and the results you get out of your IVFFlat queries might not be what you expect.

What you can do is rebuild: Postgres allows you to rebuild indexes concurrently, and that can help reset the centers so you get the results that you want. And finally, IVFFlat does support parallel builds at this time, so leverage parallelism as you do the rebuilds.
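
That rebuild is a standard Postgres operation; the index name below is the hypothetical one Postgres would have generated for the earlier example, and the maintenance worker setting controls how much parallelism the rebuild can use.

```sql
-- Allow the rebuild to use parallel workers (IVFFlat supports parallel builds).
SET max_parallel_maintenance_workers = 4;

-- Rebuild without blocking reads/writes; the k-means centers are recomputed
-- from the vectors currently in the table.
REINDEX INDEX CONCURRENTLY document_chunks_embedding_idx;
```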

And what's really cool is that a lot of this is actually parallelizable in pgvector. But we found that this step - where we're pulling vectors from the table, from disk, and assigning them to the centers - was taking the most work.

And the reason was that prior to pgvector 0.5.0, it was doing a sequential scan. Imagine you have 10 million vectors in your table and you're building this index: you're scanning through all 10 million vectors sequentially, assigning them to lists, and then, boom, eventually you're done. But that takes a while, and we know Postgres supports parallel scans.

We just saw that in the earlier example with the costing. So pgvector 0.5.0 added the ability to do parallel scans, and with that it does the assignment into the lists in parallel and ultimately goes through the process. Granted, there are other parts we could parallelize, including the k-means step or, ultimately, parallel writes, but that's a lot more work. Even just doing this, we saw huge jumps in speed - on this data set, we saw a 2x speedup.

I've seen up to 4x speedups, particularly on 100 million vector data sets that we were storing. So if you do choose the IVFFlat method, leverage parallelism - it will greatly speed up your build times. Granted, the build times tend not to be all that bad for IVFFlat, but it's always better to be faster, right?

Now, this leads us to the topic of filtering. We've built our index and we've searched it, but not every query we run is going to be SELECT * FROM table ORDER BY distance LIMIT n. You have the WHERE clause. WHERE is my favorite clause - I used it as an app developer all the time, because I can filter down my data set, and that's key.

You might not want to search over your entire database. You might want to search a category of products - say, the Florida tchotchke products. But filtering impacts approximate nearest neighbor queries, and maybe not in the ways you expect. The first is that Postgres sees the filter and might decide: well, I'm not going to use the index at all, because this is going to return 50,000 rows and they're all TOASTed.

So I'm going to do a sequential scan because that's going to be fast - but suddenly your queries are very slow. The other thing you might see is that Postgres uses the index but you're not getting back the results you expect. You might be getting too few results, or you might not see all the results that you want.

For example, with IVFFlat, you might not go to all the lists that contain your relevant results. And then suddenly you're applying this post-filter to the data and you might not get the blue elephant vase you expect from the Florida tchotchke example.

So we need some techniques. The way I'd put it is: first off, do I even need to do an approximate nearest neighbor search when I have a filter? The answer is that it depends how much the data is being filtered down. If you have a B-tree index on that category ID column and everything filters down to about 100 rows - no, you don't need it. Even with these expensive vector comparisons, 100 rows is nothing; you're going to get very fast queries back.

So you need to figure out how many rows your filter removes. If you're in the 50,000 or 100,000 row range, yes, an approximate nearest neighbor search is going to help. And the final thing you need to ask yourself is: do I need exact results when I run the filter, or approximate results? If you need exact results, make sure your filter is indexed - or that the filter itself is indexable - because that's going to help speed up the query.

If you're OK with approximate results, we do have some techniques that can help. The first is a partial index. Postgres lets you build indexes over subsets of data: you can apply a WHERE clause to the CREATE INDEX, and Postgres will only index the rows matching that particular value. It's a very effective technique that can be used well beyond approximate nearest neighbor searches.

The next is partitioning. If you have a natural partition key, you can partition your entire table - it's a very effective technique. And if you have, say, some vectors that you want to index and some that you don't, this is another technique you can use as well.

Within a partition, you can choose to build an index over just that partition - there's a sketch of both techniques below. So these two techniques are what's currently available in Postgres. We'll talk later about some ongoing work to expand what you can do in terms of filtering, so you can get the power of pre-filtering: eliminating rows before you do the approximate nearest neighbor search.
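
Here's a sketch of both techniques on a hypothetical products table - a partial HNSW index over one category, and a list-partitioned table where only the partition you care about gets a vector index. The category ID of 42 is made up.

```sql
-- Partial index: only the Florida tchotchke category gets indexed, and
-- queries must repeat the WHERE clause to use it.
CREATE INDEX ON products
    USING hnsw (embedding vector_cosine_ops)
    WHERE category_id = 42;

-- Partitioning: each category lives in its own partition, and you can choose
-- which partitions get a vector index at all.
CREATE TABLE products_parted (
    id          bigserial,
    category_id int NOT NULL,
    name        text,
    embedding   vector(1536)
) PARTITION BY LIST (category_id);

CREATE TABLE products_florida PARTITION OF products_parted FOR VALUES IN (42);
CREATE INDEX ON products_florida USING hnsw (embedding vector_cosine_ops);
```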

One other use case that came up is filtering with existing embeddings. I know the union query up on the slide is complicated - and yes, I agree, it's complicated; in fact, I think I wrote it. But something that's come up is: I have a vector within my data set, and I want to search over all the other vectors around it, but exclude that vector from the search.

So we need to be a little bit tricky and do some work with subqueries. In this case, the example that was presented had two vectors we wanted to do a similarity search around while excluding those vectors from the final results. That's what led to the union of subqueries that then gets filtered at the end, because this structure allows you to use the indexes on the table.

But as you can see, this takes a little bit of work - this is not necessarily the kind of query you want to write. And for the astute observer: this might actually return too few rows, which is why we purposely set the LIMIT higher on each of the subqueries before the final query.
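
We can't reproduce the exact query from the slide, but a sketch in the same spirit, on the hypothetical table from earlier, looks roughly like this: each subquery runs an index-assisted nearest neighbor search around one of the existing embeddings with a deliberately higher LIMIT, and the outer query throws out the anchor rows and trims to the final count. The IDs 101 and 102 are made up.

```sql
-- Find items similar to items 101 and 102, excluding those two items
-- themselves from the results.
SELECT id, content, distance
FROM (
    (SELECT id, content,
            embedding <=> (SELECT embedding FROM document_chunks WHERE id = 101) AS distance
     FROM document_chunks
     ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 101)
     LIMIT 20)                        -- higher than the final LIMIT on purpose
    UNION ALL
    (SELECT id, content,
            embedding <=> (SELECT embedding FROM document_chunks WHERE id = 102) AS distance
     FROM document_chunks
     ORDER BY embedding <=> (SELECT embedding FROM document_chunks WHERE id = 102)
     LIMIT 20)
) AS candidates
WHERE id NOT IN (101, 102)            -- drop the anchor vectors themselves
ORDER BY distance
LIMIT 10;
-- Note: a row close to both anchors can appear twice; DISTINCT ON (id)
-- can deduplicate if that matters for your application.
```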

So these are the techniques we have available today. There are also some features in Amazon Aurora that can help accelerate vector searches. Briefly, Amazon Aurora is AWS's flagship commercial database, and it has all the features you'd expect when running commercial workloads.

The way I like to look at it, from an app developer perspective, is that all the things I should worry about but don't want to worry about - high availability, backups, monitoring - are all included.

And there are features that help you continuously scale. This extends to vector workloads. We recently announced Optimized Reads for Amazon Aurora, which adds several things that basically leverage the local NVMe storage available on the instance.

This is available in two ways. First, we talked about queries spilling to disk. Instead of having to spill down to a slower disk layer, these queries can spill to the local NVMe, which is much faster when you have to read the data back into main memory.

The other feature Optimized Reads has is that the NVMe layer acts as a page cache. Remember that the page is the fundamental unit for storing data in Postgres. What happens is you may have a page in memory and something causes it to get evicted - there's memory pressure, and you have to get rid of a page because something else is hotter or fresher. But instead of having to retrieve that page from storage, it can be available in the local NVMe. That way, when you need that page again and it hasn't been evicted from the local NVMe, you can pull it from there. Think of it, in a way, as extending the available memory on that instance.

Next, pay attention to what generation of instance you're running. We're going to see that we've done some experiments with the r7g instances, and they provide some benefits over the r6g instances.

And lastly, we talk about the importance of developer tools: frameworks like LangChain are compatible with Amazon Aurora, particularly through the pgvector compatibility.

Let's dive a little bit more into this tiered caching ability of Optimized Reads. Where it really shines is when your workload exceeds memory - and in particular, not only exceeds memory, but you're throwing a lot of concurrency at it and basically have to shovel pages in and out of memory.

So again, instead of having to pull an evicted page all the way back from storage, you're able to pull it from the local NVMe cache. In these experiments, where we had workloads that greatly exceeded memory, we saw up to a 9x speedup.

Now, the test we ran here was with a billion vectors, which should help answer the question: can I store a billion vectors in Postgres? Yes. In fact, we stored them in a single table. Now, normally when I design a table, I don't put a billion vectors in a single table - I'm probably going to partition that table or spread it amongst a bunch of different tables. But sometimes you just want to stress test things to see how they handle it.

So we ran this test. We targeted a higher level of recall, so we increased the ef_search value, and then we threw concurrency at it to see how the Optimized Reads feature worked. We compared two instances, one with Optimized Reads and one without, and that's where we saw the 9x speedup.

So the takeaway is that Optimized Reads effectively extends the amount of memory you have in your system and helps you push your workload farther on the instance you're currently running.

Now, in terms of looking at newer hardware, we ran a similar experiment with the r7gs versus the r6gs. In this case, we took that HNSW ef_search parameter - which effectively increases the search radius, or the number of candidate vectors you keep on your list - and we kept gradually increasing it. What we saw with the r7gs was that, first off, they were faster than the r6gs even at a lower ef_search value. But as we increased it - and every time you increase ef_search, you're basically increasing the amount of CPU the query needs - we really saw the r7gs shine at the higher ef_search values. They kept getting faster and faster relative to the r6gs. And this is why I say hardware selection matters: you want to make sure you're picking hardware that makes the most sense for your workload, though again, this comes into tension with cost as you look at what makes the most sense for what you're building.

Looking ahead at what's coming down the pike: as I alluded to, with HNSW today the quickest way to build these indexes is through concurrent inserts or concurrent COPY commands. And I'll note that in Postgres 16, recently released on RDS, there are actually improvements to concurrent COPY - I'll be talking a little bit more about that at OPN302 tomorrow. But already committed to the pgvector repository is parallel builds for HNSW. So if you already have an existing data set and you want to build an HNSW index more quickly, that feature is coming. The tension there is making sure you have enough maintenance work memory available for the build, because you don't really want to be swapping the data in and out of memory.

Next, we talked about filtering and the challenges around it. There's a proposed patch for an enhancement to index-based filtering called HQANN. The way I'd describe it is that it effectively gives you a multi-column index for HNSW. So let's say your vector is your first column; you can then put scalar attributes, such as a category ID, next to it and effectively filter on those attributes. That way you can write that SELECT ... FROM ... WHERE query and not have to worry about subqueries or other complicated queries - it makes it much simpler to query the data. Support for more data types has also been proposed, which will allow you to index vectors of larger sizes so long as the data types themselves are smaller.

There have also been discussions around product quantization and scalar quantization. Scalar quantization is effectively taking something like a four-byte float vector and reducing it to a one-byte integer vector. The deal with that is that you're able to shrink down the size of the vector, but you might lose some information in the process, so it can impact recall. Product quantization does that to an even greater level - it's like taking a 128-dimensional vector and making it an eight-dimensional vector. With that technique, you basically build out a bunch of centers that get stored as meta information, and then you point all your vectors at those centers to get your product quantization. There are probably two whole talks just in those techniques right there, but that is something that's on the pgvector roadmap.

And last but not least, parallel query. We found that since we added support for HNSW to pgvector, the need for parallel query is a little bit less. But we're probably going to see these larger HNSW indexes get built, and then we might need parallel query there. And there is a proposal for adding parallel querying for IVFFlat.

So in conclusion, if you take away nothing else from this talk, here are three things to go home with. First, your primary design decision is going to be around query performance and recall. Those are always in tension, no matter what method you use: if you want fast queries, you're probably going to impact the relevancy of your results. And that's OK - it's a healthy tension - but you have to decide what makes the most sense for your application. And again, based on what embedding model you're using, how your index is built, et cetera, you might not even need to think about it - everything may just work and you get very fast queries with highly relevant results. I hope that's true. I hope you don't have to use any of the techniques in this talk. I hope everything just works. But if not, now you have some best practices that will help you make these decisions.

The decision you will have to make is: where do you want to invest? Is it going to be in your underlying storage? Do you want to keep everything in memory? Do you want to use EBS? Do you want to keep things in S3? What level of compute are you using? Are you going to get a very large instance so you can maximize your concurrency for building these indexes, or can you use something smaller because your workload is not too heavy and a smaller instance is fine? And then there's your indexing strategy - you have to pick what's going to make the most sense for your workload. Again, I recommend starting with HNSW and seeing if that works for you.

And finally, you do need to plan for today and tomorrow. This is a rapidly evolving space. Think back to last November - we had just started talking about foundational models at that point. And now here we are, looking at all these different vector database solutions out there, including pgvector. pgvector itself is rapidly innovating. Last year, honestly, barely anyone was looking at pgvector - I'd even say myself included. And I'm personally happy I've been involved in the project; it's been a lot of fun working with this kind of data. But if you look at pgvector last year and where it is today, it's able to handle much larger workloads - we're talking about billion-scale workloads - and it's going to continue growing and changing.

So as you make decisions about how you want to store your vector data, do plan for what you're doing today and also plan for tomorrow to make sure that the decision you make is gonna grow with your workload.

Now - but wait, there's more. Continue your journey learning about this topic. If you want to see how to combine something like pgvector with a LangChain application, there's DAT413 - I believe that's a live coding talk, and it should be a lot of fun. If you want a higher-level picture of where vectors fit in in general, as well as other general database trends, I encourage you to go to DAT2112 - that's an innovation talk. And last but not least, if you don't have time to ask your question today, we are doing two chalk talks - not one, but two - DAT323. It's really going to be a Q&A session: we'll draw on the board and talk about different architectures, but really it's your time to ask questions and dive deep into these topics.

So I know this is the end of the day, and I know there's a lot going on at the expo hall at the Venetian. Thank you for joining me today.
