https://www.ibm.com/think/topics/vector-database
What is a vector database?
A vector database stores, manages and indexes high-dimensional vector data. Data points are stored as arrays of numbers called “vectors,” which are clustered based on similarity. This design enables low-latency queries, making it ideal for AI applications.
Vector databases are growing in popularity because they deliver the speed and performance needed to drive generative artificial intelligence (AI) use cases and applications. According to Gartner®, by 2026, more than 30% of enterprises will have adopted vector databases to build their foundation models with relevant business data.1
Vector databases versus traditional databases
Unlike traditional relational databases with rows and columns, data points in a vector database are represented by vectors with a fixed number of dimensions. Because they use high-dimensional vector embeddings, vector databases are better able to handle unstructured datasets.
The nature of data has undergone a profound transformation. It's no longer confined to structured information easily stored in traditional databases. Unstructured data—including social media posts, images, videos, audio clips and more—is growing 30% to 60% year over year.2
Relational databases excel at managing structured and semistructured datasets in specific formats. Loading unstructured data sources into a traditional relational database to store, manage and prepare the data for artificial intelligence (AI) is a labor-intensive process, especially with new generative use cases such as similarity search.
Traditional search typically represents data by using discrete tokens or features, such as keywords, tags or metadata. Traditional searches rely on exact matches to retrieve relevant results. For example, a search for "smartphone" would return results containing the word "smartphone."
In contrast, vector search represents data as dense vectors, which are vectors in which most or all elements are nonzero. These vectors live in a continuous vector space: the mathematical space in which data points are represented as coordinates.
Vector representations enable similarity search. For example, a vector search for “smartphone” might also return results for “cellphone” and “mobile devices.”
Each dimension of the dense vector corresponds to a latent feature or aspect of the data. A latent feature is an underlying characteristic or attribute that is not directly observed but inferred from the data through mathematical models or algorithms.
Latent features capture the hidden patterns and relationships in the data, enabling more meaningful and accurate representations of items as vectors in a high-dimensional space.
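To make the contrast concrete, here is a minimal sketch in Python that compares exact keyword matching with a similarity lookup over a few hypothetical 3-dimensional embeddings (the vectors are made up for illustration; real embedding models produce hundreds of dimensions):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
embeddings = {
    "smartphone":    np.array([0.9, 0.1, 0.3]),
    "cellphone":     np.array([0.85, 0.15, 0.35]),
    "mobile device": np.array([0.8, 0.2, 0.4]),
    "banana":        np.array([0.1, 0.9, 0.2]),
}

def keyword_search(query, corpus):
    # Traditional search: exact token match only.
    return [term for term in corpus if query in term]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_search(query, corpus, k=2):
    # Vector search: rank every item by similarity to the query embedding.
    q = corpus[query]
    scores = {term: cosine_similarity(q, v) for term, v in corpus.items() if term != query}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(keyword_search("smartphone", embeddings))  # ['smartphone'] only
print(vector_search("smartphone", embeddings))   # ['cellphone', 'mobile device']
```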
What are vectors?
Vectors are a subset of tensors. In machine learning (ML), "tensor" is a generic term for a group of numbers, or a grouping of groups of numbers, in n-dimensional space. Tensors function as a mathematical bookkeeping device for data. Working up from the smallest element:
- A scalar is a zero-dimensional tensor, containing a single number. For example, a system modeling weather data might represent a single day’s high temperature (in Fahrenheit) in scalar form as 85.
- Then, a vector is a one-dimensional (or first-degree or first-order) tensor, containing multiple scalars of the same type of data. For example, a weather model might use the low, mean and high temperatures for a single day in vector form: 62, 77, 85. Each scalar component is a feature—that is, a dimension—of the vector, representing a feature of that day’s weather.
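A quick illustration of these ranks, reusing the weather figures above and showing them as NumPy array shapes (the week-long matrix is an added, illustrative example of a grouping of groups of numbers):

```python
import numpy as np

high = np.array(85.0)                # scalar: a 0-dimensional tensor (one day's high)
day = np.array([62.0, 77.0, 85.0])   # vector: a 1-dimensional tensor (low, mean, high)
week = np.array([day + i for i in range(7)])  # matrix: a 2-dimensional tensor (7 days x 3 features)

print(high.ndim, day.ndim, week.ndim)  # 0 1 2
print(day.shape, week.shape)           # (3,) (7, 3)
```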
Vector numbers can represent complex objects such as words, images, videos and audio generated by an ML model. This high-dimensional vector data, containing multiple features, is essential to machine learning, natural language processing (NLP) and other AI tasks. Some example uses of vector data include:
- Text: Chatbots need to understand natural language. They do this by relying on vectors that represent words, paragraphs and entire documents.
- Images: Image pixels can be described by numerical data and combined to make up a high-dimensional vector for that image.
- Speech or audio: Like images, sound waves can also be broken into numerical data and represented as vectors, enabling AI applications such as voice recognition.
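For the image case, a small sketch with a made-up 2x2 grayscale image shows how pixel values flatten into a single vector:

```python
import numpy as np

# A tiny 2x2 grayscale "image": each pixel intensity is a number between 0 and 255.
image = np.array([[0, 128],
                  [255, 64]])

# Flattening the pixel grid yields one high-dimensional vector for the whole image.
image_vector = image.flatten()
print(image_vector)        # [  0 128 255  64]
print(image_vector.shape)  # (4,); a real photo would have millions of dimensions
```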
What are vector embeddings?
Vector embeddings are numerical representations of data points that convert various types of data—including nonmathematical data such as words, audio or images—into arrays of numbers that ML models can process.
Artificial intelligence (AI) models, from simple linear regression algorithms to the intricate neural networks used in deep learning, operate through mathematical logic.
Any data that an AI model uses, including unstructured data, needs to be recorded numerically. Vector embedding is a way to convert an unstructured data point into an array of numbers that expresses that data’s original meaning.
Here's a simplified example of word embeddings for a very small corpus (2 words), where each word is represented as a 3-dimensional vector:
- cat [0.2, -0.4, 0.7]
- dog [0.6, 0.1, 0.5]
In this example, each word ("cat") is associated with a unique vector ([0.2, -0.4, 0.7]). The values in the vector represent the word's position in a continuous 3-dimensional vector space.
Words with similar meanings or contexts are expected to have similar vector representations. For instance, the vectors for "cat" and "dog" are close together, reflecting their semantic relationship.
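Continuing this toy example, cosine similarity puts a number on how close the word vectors are; the third vector below is a made-up, unrelated one added for contrast:

```python
import numpy as np

cat = np.array([0.2, -0.4, 0.7])
dog = np.array([0.6, 0.1, 0.5])
car = np.array([-0.5, 0.8, 0.1])  # hypothetical vector for an unrelated word

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(cat, dog), 3))  # relatively high: semantically related
print(round(cosine(cat, car), 3))  # lower: semantically unrelated
```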
Embedding models are trained to convert data points into vectors. Vector databases store and index the outputs of these embedding models. Within the database, vectors can be grouped together or identified as opposites based on semantic meaning or features across virtually any data type.
Vector embeddings are the backbone of recommendations, chatbots and generative apps such as ChatGPT.
For example, take the words “car” and “vehicle.” They have similar meanings but are spelled differently. For an AI application to enable effective semantic search, the vector representations of “car” and “vehicle” must capture their semantic similarity. In machine learning, embeddings represent high-dimensional vectors that encode this semantic information.
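In practice, such embeddings come from a trained model. Here is a minimal sketch of the "car"/"vehicle" comparison, assuming the open-source sentence-transformers package is installed (the model name is one common choice, not a requirement):

```python
from sentence_transformers import SentenceTransformer, util

# Any pretrained sentence-embedding model will do; this one is small and widely used.
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["car", "vehicle", "banana"])

# Semantically related words end up close together in the vector space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # "car" vs "vehicle": relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # "car" vs "banana": noticeably lower
```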
How are vector databases used?
Vector databases serve three key functions in AI and ML applications:
- Vector storage
- Vector indexing
- Similarity search based on querying or prompting
In operation, vector databases use multiple algorithms to conduct an approximate nearest neighbor (ANN) search. These algorithms are assembled into a pipeline that quickly and accurately retrieves the vectors closest to the one being queried.
For example, an ANN search could look for products that are visually similar in an e-commerce catalog. Additional uses include anomaly detection, classification and semantic search. Because the dataset only needs to pass through the embedding model once, at indexing time, query results are returned within milliseconds.
Vector storage
Vector databases store the outputs of an embedding model algorithm, the vector embeddings. They also store each vector’s metadata—including title, description and data type—which can be queried by using metadata filters.
By ingesting and storing these embeddings, the database can return similarity search results quickly, matching the user’s prompt with similar vector embeddings.
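A minimal, in-memory sketch of what gets stored per record and how a metadata filter combines with similarity ranking (the schema and helper function are illustrative, not any particular product's API):

```python
import numpy as np

# Each stored record couples a vector embedding with its metadata.
records = [
    {"id": 1, "vector": np.array([0.1, 0.8, 0.2]), "title": "Return policy", "data_type": "faq"},
    {"id": 2, "vector": np.array([0.7, 0.2, 0.6]), "title": "Phone specs",   "data_type": "product"},
    {"id": 3, "vector": np.array([0.6, 0.3, 0.5]), "title": "Tablet specs",  "data_type": "product"},
]

def search(query_vector, metadata_filter, k=1):
    # Apply the metadata filter first, then rank the survivors by cosine similarity.
    candidates = [r for r in records if all(r.get(key) == val for key, val in metadata_filter.items())]
    def score(r):
        v = r["vector"]
        return float(np.dot(query_vector, v) / (np.linalg.norm(query_vector) * np.linalg.norm(v)))
    return sorted(candidates, key=score, reverse=True)[:k]

query = np.array([0.65, 0.25, 0.55])            # embedding of the user's prompt
print(search(query, {"data_type": "product"}))  # returns the closest product record
```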
Vector indexing
Vectors need to be indexed to accelerate searches within high-dimensional data spaces. Vector databases create indexes on vector embeddings for search functions.
The vector database indexes vectors by using an ML algorithm. Indexing maps the vectors to new data structures that enable faster similarity or distance searches, such as nearest neighbor searches, between vectors.
Vectors can be indexed by using algorithms such as hierarchical navigable small world (HNSW), locality-sensitive hashing (LSH) or product quantization (PQ).
- HNSW is popular because it builds a layered, graph-like structure. Each node represents a vector, edges connect similar vectors, and the upper layers contain progressively fewer nodes, so a search can move from coarse to fine neighborhoods quickly.
- LSH hashes similar vectors into the same buckets, so a query is compared only against the vectors in matching buckets. For extra speed, the index can be tuned to return an approximate, nonexhaustive result.
- PQ compresses each vector into a short, memory-efficient code. Only these compact codes are stored and compared, rather than the full-precision vectors.
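As a concrete example of ANN indexing, the sketch below builds an HNSW index with the open-source FAISS library and queries it for nearest neighbors (the library choice and parameter values are assumptions for illustration):

```python
import numpy as np
import faiss  # open-source ANN library; install with `pip install faiss-cpu`

d = 64  # dimensionality of the embeddings
rng = np.random.default_rng(0)
vectors = rng.random((10_000, d)).astype("float32")  # stand-in for real embeddings

# Build an HNSW index; 32 is the number of graph neighbors per node.
index = faiss.IndexHNSWFlat(d, 32)
index.add(vectors)

# Search for the 5 approximate nearest neighbors of one query vector.
query = rng.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0], distances[0])
```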
Similarity search based on querying or prompting
Query vectors are vector representations of search queries. When a user queries or prompts an AI model, the model computes an embedding of the query or prompt. The database then calculates distances between query vectors and vectors stored in the index to return similar results.
Databases find the closest vectors with algorithms such as nearest neighbor search, ranking candidates by a distance or similarity metric such as Euclidean distance, dot product or cosine similarity.
The database returns the most similar vectors or nearest neighbors to the query vector according to the similarity ranking. These calculations support various machine learning tasks, such as recommendation systems, semantic search, image recognition and other natural language processing tasks.
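Different metrics can rank the same vectors differently. A small sketch comparing three common choices on illustrative values:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def dot(a, b):
    return float(np.dot(a, b))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 3.0])
stored = np.array([2.0, 4.0, 6.0])  # same direction as the query, but twice the length

# Cosine similarity ignores magnitude, so these two are a perfect match;
# Euclidean distance and dot product are both sensitive to vector length.
print(cosine(query, stored))     # 1.0
print(euclidean(query, stored))  # ~3.74
print(dot(query, stored))        # 28.0
```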
Advantages of vector databases
Vector databases are a popular way to power enterprise AI-based applications because they can deliver many benefits:
- Speed and performance
- Scalability
- Lower cost of ownership
- Data management
- Flexibility
Considerations for vector databases and data strategy
Organizations have a breadth of options when choosing a vector database capability. To find one that meets their data and AI needs, many organizations consider:
- Types of vector databases
- Integration with a data ecosystem
- When vector indexing is not optimal
- Tools for creating and deploying vector databases
Types of vector databases
There are a few alternatives to choose from.
- Stand-alone, proprietary, fully vectorized databases such as Pinecone.
- Open-source solutions such as Weaviate or Milvus, which provide built-in RESTful APIs and support for Python and Java programming languages.
- Data lakehouses with vector database capabilities integrated, such as IBM watsonx.data™.
- Vector database and database search extensions such as PostgreSQL’s open source pgvector extension, which provides vector similarity search capabilities. An SQL vector database can combine the advantages of a traditional SQL database with the power of a vector database.
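To illustrate the extension route, here is a minimal sketch that uses pgvector through the psycopg2 driver; it assumes a running PostgreSQL server with the pgvector extension available, and the connection settings, table and column names are placeholders:

```python
import psycopg2

# Placeholder connection settings; adjust for your environment.
conn = psycopg2.connect("dbname=mydb user=myuser password=mypassword host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, content text, embedding vector(3));")
cur.execute("INSERT INTO items (content, embedding) VALUES (%s, %s)", ("cat", "[0.2, -0.4, 0.7]"))
cur.execute("INSERT INTO items (content, embedding) VALUES (%s, %s)", ("dog", "[0.6, 0.1, 0.5]"))
conn.commit()

# pgvector's <-> operator orders rows by L2 distance to the query vector.
cur.execute(
    "SELECT content FROM items ORDER BY embedding <-> %s::vector LIMIT 1",
    ("[0.25, -0.35, 0.65]",),
)
print(cur.fetchone())  # nearest stored item to the query embedding
cur.close()
conn.close()
```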
Integration with a data ecosystem
Vector databases should not be considered as stand-alone capabilities, but rather a part of a broader data and AI ecosystem.
Many offer APIs, native extensions or can be integrated with databases. Because vector databases are built to use enterprise data to enhance models, organizations must also have proper data governance and security in place to help ensure that the data used to train large language models (LLMs) can be trusted.
Beyond APIs, many vector databases use programming-language-specific software development kits (SDKs) that can wrap around the APIs. Using the SDKs, developers often find it easier to work with the data in their apps.
When vector indexing is not optimal
Using a vector store and index is well suited to fact-based querying, such as extracting specific information from complex documents.
However, asking for a summary of topics would not work well with a vector index. In this case, an LLM would go through all the different possible contexts on that topic within the data.
A faster option would be to use a different kind of index, such as a list index rather than a vector index, because a list index would immediately fetch the first element in each listing.
Tools for creating and deploying vector databases
One tool that streamlines vector database development is LangChain, an open-source orchestration framework for developing applications that use LLMs.
Available in both Python-based and JavaScript-based libraries, LangChain’s tools and APIs simplify the process of building LLM-driven apps such as chatbots and virtual agents. LangChain provides integrations for over 25 different embedding methods, and for over 50 different vector stores (both cloud-hosted and local).
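Here is a minimal sketch of that workflow, assuming the langchain-community and langchain-openai packages, a local FAISS vector store and an OpenAI API key in the environment; any of LangChain's other embedding methods or vector stores could be swapped in:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

documents = [
    "A vector database stores, manages and indexes high-dimensional vector data.",
    "Relational databases excel at managing structured datasets in specific formats.",
    "RAG lets large language models retrieve facts from an external knowledge base.",
]

# Embed the documents and index them in a local FAISS vector store.
vector_store = FAISS.from_texts(documents, OpenAIEmbeddings())

# Similarity search embeds the query and returns the closest documents.
results = vector_store.similarity_search("How do LLMs ground answers in external data?", k=1)
print(results[0].page_content)
```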
To power enterprise-grade AI, a data lakehouse might be paired with an integrated vector database. Organizations can unify, curate and prepare vectorized embeddings for their generative AI applications at scale across their trusted, governed data. This enhances the relevance and precision of their AI workloads, including chatbots, personalized recommendation systems and image similarity search applications.
Vector database use cases
The applications for vector databases are vast and growing. Some key use cases include:
- Retrieval-augmented generation (RAG)
- Conversational AI
- Recommendation engines
- Vector search
Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) is an AI framework for enabling large language models (LLMs) to retrieve facts from an external knowledge base. Vector databases are key to supporting RAG implementations.
Enterprises are increasingly favoring RAG in generative AI workflows for its faster time-to-market, efficient inference and reliable output. The framework is particularly helpful in use cases such as customer care, HR and talent management.
RAG helps ensure that a model is linked to the most current, reliable facts and that users have access to the model’s sources so that its claims can be verified. Anchoring the LLM in trusted data can help reduce model hallucinations.
RAG uses high-dimensional vector data to enrich prompts with semantically relevant information for in-context learning by foundation models. RAG requires effective storage and retrieval during the inference stage, which handles the highest volume of data.
Vector databases excel at efficiently indexing, storing and retrieving these high-dimensional vectors, providing the speed, precision and scale needed for applications such as recommendation engines and chatbots.
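A minimal end-to-end sketch of the retrieval-then-generate pattern, where the document embeddings are illustrative and generate() is a placeholder for whichever foundation model is used:

```python
import numpy as np

# Toy knowledge base: documents paired with precomputed (illustrative) embeddings.
documents = [
    ("Our return window is 30 days from delivery.",         np.array([0.9, 0.1, 0.1])),
    ("Premium support is available 24/7 by chat.",          np.array([0.1, 0.9, 0.2])),
    ("Orders over $50 ship free within the United States.", np.array([0.2, 0.3, 0.9])),
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, k=1):
    # Retrieval step: rank stored documents by similarity to the query embedding.
    ranked = sorted(documents, key=lambda d: cosine(query_embedding, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def generate(prompt):
    # Placeholder for the foundation model call (for example, an LLM API).
    return f"[LLM response grounded in prompt: {prompt!r}]"

# In a real pipeline, the same embedding model produces this query vector.
query = "How long do I have to return an item?"
query_embedding = np.array([0.85, 0.15, 0.05])

context = "\n".join(retrieve(query_embedding, k=1))
augmented_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(generate(augmented_prompt))
```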
Conversational AI
Vector databases, particularly when used to implement RAG frameworks, can help improve virtual agent interactions by enhancing the agent’s ability to parse relevant knowledge bases efficiently and accurately. Agents can provide real-time contextual answers to user queries, along with the source documents and page numbers for reference.
Recommendation engines
E-commerce sites, for instance, can use vectors to represent customer preferences and product attributes. This enables them to suggest items similar to past purchases, based on vector similarity, enhancing user experience and increasing retention.
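A minimal sketch of that idea, with a user preference vector and product attribute vectors that are purely illustrative:

```python
import numpy as np

# Illustrative attribute vectors: [price sensitivity, tech affinity, outdoor interest]
products = {
    "budget smartphone": np.array([0.9, 0.7, 0.1]),
    "flagship laptop":   np.array([0.2, 0.9, 0.1]),
    "hiking backpack":   np.array([0.5, 0.1, 0.9]),
}

# Derived from the user's past purchases (for example, an average of purchased-item vectors).
user_preference = np.array([0.8, 0.8, 0.2])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

recommendations = sorted(products, key=lambda p: cosine(user_preference, products[p]), reverse=True)
print(recommendations[0])  # most similar product to the user's preference vector
```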
Vector search
This search technique is used to discover similar items or data points, typically represented as vectors, in large collections. Vector search can capture the semantic relationships between elements, enabling effective processing by machine learning models and artificial intelligence applications.
These searches can take several forms.
- Semantic search: Perform searches based on the meaning or context of a query, enabling more precise and relevant results. Because both words and phrases can be represented as vectors, semantic vector search functions understand user intent better than general keywords.
- Similarity search and applications: Find similar images, audio, video or text data to support advanced image and speech recognition and natural language processing. Images and video can be indexed and retrieved based on similarity.