OpenAI 双语文档参考 Embeddings

Embeddings

What are embeddings? 什么是嵌入?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
OpenAI 的文本嵌入衡量文本字符串的相关性。嵌入通常用于:

  • Search (where results are ranked by relevance to a query string)
    搜索(结果按与查询字符串的相关性排序)
  • Clustering (where text strings are grouped by similarity)
    聚类(其中文本字符串按相似性分组)
  • Recommendations (where items with related text strings are recommended)
    推荐(推荐具有相关文本字符串的项目)
  • Anomaly detection (where outliers with little relatedness are identified)
    异常检测(识别出相关性很小的异常值)
  • Diversity measurement (where similarity distributions are analyzed)
    多样性测量(分析相似性分布)
  • Classification (where text strings are classified by their most similar label)
    分类(其中文本字符串按其最相似的标签分类)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
嵌入是浮点数的向量(列表)。两个向量之间的距离衡量它们的相关性。小距离表示高相关性,大距离表示低相关性。

Visit our pricing page to learn about Embeddings pricing. Requests are billed based on the number of tokens in the input sent.
访问我们的定价页面以了解嵌入定价。请求根据发送的输入中的令牌数量计费。

**To see embeddings in action, check out our code samples

要查看嵌入的实际效果,请查看我们的代码示例**

  • Classification
  • Topic clustering
  • Search
  • Recommendations

Browse Samples‍

How to get embeddings 如何获得嵌入

To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.
要获得嵌入,请将您的文本字符串连同选择的嵌入模型 ID(例如 text-embedding-ada-002 )一起发送到嵌入 API 端点。响应将包含一个嵌入,您可以提取、保存和使用它。

Example requests:

Example: Getting embeddings 示例:获取嵌入

python

response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']

Example response:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

See more Python code examples in the OpenAI Cookbook.
在 OpenAI Cookbook 中查看更多 Python 代码示例。

When using OpenAI embeddings, please keep in mind their limitations and risks.
使用 OpenAI 嵌入时,请牢记它们的局限性和风险。

Embedding models

OpenAI offers one second-generation embedding model (denoted by -002 in the model ID) and 16 first-generation models (denoted by -001 in the model ID).
OpenAI 提供了一个二代嵌入模型(在模型 ID 中用 -002 表示)和 16 个第一代模型(在模型 ID 中用 -001 表示)。

We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use. Read the blog post announcement.
我们建议对几乎所有用例使用 text-embedding-ada-002。它更好、更便宜、更易于使用。阅读博文公告。

MODEL GENERATIONTOKENIZERMAX INPUT TOKENS 最大输入代币KNOWLEDGE CUTOFF
V2cl100k_base8191Sep 2021
V1GPT-2/GPT-32046Aug 2020

Usage is priced per input token, at a rate of $0.0004 per 1000 tokens, or about ~3,000 pages per US dollar (assuming ~800 tokens per page):
使用量按输入令牌定价,每 1000 个令牌 0.0004 美元,或每美元约 3,000 页(假设每页约 800 个令牌):

MODELROUGH PAGES PER DOLLAR 每美元粗略页数EXAMPLE PERFORMANCE ON BEIR SEARCH EVAL BEIR SEARCH EVAL 的性能示例
text-embedding-ada-002 文本嵌入-ada-002300053.9
-davinci--001652.8
-curie--0016050.9
-babbage--00124050.4
-ada--00130049.0

Second-generation models 二代机型

MODEL NAMETOKENIZERMAX INPUT TOKENS 最大输入代币OUTPUT DIMENSIONS
text-embedding-ada-002 文本嵌入-ada-002cl100k_base81911536

First-generation models (not recommended)
第一代机型(不推荐)

Use cases

Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.
在这里,我们展示了一些有代表性的用例。我们将在以下示例中使用亚马逊美食评论数据集。

Obtaining the embeddings 获取嵌入

The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). For example:
该数据集包含截至 2012 年 10 月亚马逊用户留下的总共 568,454 条食品评论。我们将使用 1,000 条最新评论的子集用于说明目的。评论是英文的,往往是正面的或负面的。每条评论都有一个 ProductId、UserId、Score、评论标题(摘要)和评论正文(文本)。例如:

PRODUCT IDUSER IDSCORESUMMARYTEXT
B001E4KFG0A3SGXH7AUHU8GW5Good Quality Dog Food 优质狗粮I have bought several of the Vitality canned… 我买了好几个活力罐头…
B00813GRG4A1D87F6ZCVE5NK1Not as Advertised 不像宣传的那样Product arrived labeled as Jumbo Salted Peanut… 产品到达时标记为 Jumbo Salted Peanut…

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.
我们会将评论摘要和评论文本合并为一个组合文本。该模型将对该组合文本进行编码并输出单个向量嵌入。

Obtain_dataset.ipynb 获取数据集.ipynb
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

To load the data from a saved file, you can run the following:
要从保存的文件中加载数据,您可以运行以下命令:

import pandas as pd

df = pd.read_csv('output/embedded_1k_reviews.csv')
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)

Data visualization in 2D 二维数据可视化

Embedding as a text feature encoder for ML algorithms
嵌入作为 ML 算法的文本特征编码器

Regression_using_embeddings.ipynb
回归_使用_embeddings.ipynb

An embedding can be used as a general free-text feature encoder within a machine learning model. Incorporating embeddings will improve the performance of any machine learning model, if some of the relevant inputs are free text. An embedding can also be used as a categorical feature encoder within a ML model. This adds most value if the names of categorical variables are meaningful and numerous, such as job titles. Similarity embeddings generally perform better than search embeddings for this task.
嵌入可以用作机器学习模型中的通用自由文本特征编码器。如果一些相关输入是自由文本,则合并嵌入将提高任何机器学习模型的性能。嵌入也可以用作 ML 模型中的分类特征编码器。如果分类变量的名称有意义且数量众多,例如职位名称,那么这会增加最大的价值。对于此任务,相似性嵌入通常比搜索嵌入表现更好。

We observed that generally the embedding representation is very rich and information dense. For example, reducing the dimensionality of the inputs using SVD or PCA, even by 10%, generally results in worse downstream performance on specific tasks.
我们观察到,通常嵌入表示非常丰富且信息密集。例如,使用 SVD 或 PCA 降低输入的维度,即使降低 10%,通常也会导致特定任务的下游性能变差。

This code splits the data into a training set and a testing set, which will be used by the following two use cases, namely regression and classification.
此代码将数据拆分为训练集和测试集,将由以下两个用例使用,即回归和分类。

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    list(df.ada_embedding.values),
    df.Score,
    test_size = 0.2,
    random_state=42
)

Regression using the embedding features
使用嵌入特征进行回归

Embeddings present an elegant way of predicting a numerical value. In this example we predict the reviewer’s star rating, based on the text of their review. Because the semantic information contained within embeddings is high, the prediction is decent even with very few reviews.
嵌入提供了一种预测数值的优雅方式。在这个例子中,我们根据评论的文本预测评论者的星级。因为嵌入中包含的语义信息很高,所以即使评论很少,预测也不错。

We assume the score is a continuous variable between 1 and 5, and allow the algorithm to predict any floating point value. The ML algorithm minimizes the distance of the predicted value to the true score, and achieves a mean absolute error of 0.39, which means that on average the prediction is off by less than half a star.
我们假设分数是 1 到 5 之间的连续变量,并允许算法预测任何浮点值。 ML 算法最小化预测值与真实分数的距离,并实现 0.39 的平均绝对误差,这意味着平均预测偏差不到半星。

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)

Collapse‍

Classification using the embedding features
使用嵌入特征进行分类

Classification_using_embeddings.ipynb
Classification_using_embeddings.ipynb

This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.
这一次,我们不再让算法预测 1 到 5 之间的任何值,而是尝试将评论的准确星数分类为 5 个桶,范围从 1 到 5 星。

After the training, the model learns to predict 1 and 5-star reviews much better than the more nuanced reviews (2-4 stars), likely due to more extreme sentiment expression.
训练后,该模型学习预测 1 星和 5 星评论,比更细微的评论(2-4 星)更好,这可能是由于更极端的情绪表达。

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

Collapse‍

Zero-shot classification 零样本分类

Zero-shot_classification_with_embeddings.ipynb
零射分类_with_embeddings.ipynb

We can use embeddings for zero shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.
我们可以在没有任何标记训练数据的情况下使用嵌入进行零镜头分类。对于每个类,我们嵌入类名或类的简短描述。为了以零样本方式对一些新文本进行分类,我们将其嵌入与所有类嵌入进行比较,并预测具有最高相似度的类。

from openai.embeddings_utils import cosine_similarity, get_embedding

df= df[df.Score!=3]
df['sentiment'] = df.Score.replace({1:'negative', 2:'negative', 4:'positive', 5:'positive'})

labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]

def label_score(review_embedding, label_embeddings):
   return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])

prediction = 'positive' if label_score('Sample Review', label_embeddings) > 0 else 'negative'

Collapse‍

Obtaining user and product embeddings for cold-start recommendation
获取用于冷启动推荐的用户和产品嵌入

User_and_product_embeddings.ipynb
User_and_product_embeddings.ipynb

We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.
我们可以通过对他们的所有评论进行平均来获得用户嵌入。同样,我们可以通过对有关该产品的所有评论进行平均来获得产品嵌入。为了展示这种方法的实用性,我们使用 50k 评论的子集来覆盖每个用户和每个产品的更多评论。

We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.
我们在单独的测试集上评估这些嵌入的有用性,我们将用户和产品嵌入的相似性绘制为评分的函数。有趣的是,基于这种方法,甚至在用户收到产品之前,我们就可以比随机预测更好地预测他们是否喜欢该产品。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-a1P0PNpF-1681138519772)(…/…/…/…/…/…/workspace/vscode-project/ringyin-blog/source/images/embeddings-boxplot.png)]

user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)

Collapse‍

Clustering

Clustering.ipynb

Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.
聚类是理解大量文本数据的一种方式。嵌入对于这项任务很有用,因为它们提供了每个文本的语义上有意义的向量表示。因此,以一种无监督的方式,聚类将揭示我们数据集中隐藏的分组。

In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.
在这个例子中,我们发现了四个不同的集群:一个专注于狗食,一个专注于负面评论,两个专注于正面评论。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CUy23JXg-1681138519773)(…/…/…/…/…/…/workspace/vscode-project/ringyin-blog/source/images/embeddings-cluster.png)]

import numpy as np
from sklearn.cluster import KMeans

matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4

kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_

Collapse‍

Text search using embeddings 使用嵌入的文本搜索

Semantic_text_search_using_embeddings.ipynb
Semantic_text_search_using_embeddings.ipynb

To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.
为了检索最相关的文档,我们使用查询的嵌入向量与每个文档之间的余弦相似度,并返回得分最高的文档。

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
   embedding = get_embedding(product_description, model='text-embedding-ada-002')
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_reviews(df, 'delicious beans', n=3)

Collapse‍

Code search using embeddings 使用嵌入的代码搜索

Code_search.ipynb

Code search works similarly to embedding-based text search. We provide a method to extract Python functions from all the Python files in a given repository. Each function is then indexed by the text-embedding-ada-002 model.
代码搜索的工作方式类似于基于嵌入的文本搜索。我们提供了一种从给定存储库中的所有 Python 文件中提取 Python 函数的方法。然后每个函数都由 text-embedding-ada-002 模型索引。

To perform a code search, we embed the query in natural language using the same model. Then we calculate cosine similarity between the resulting query embedding and each of the function embeddings. The highest cosine similarity results are most relevant.
为了执行代码搜索,我们使用相同的模型将查询嵌入到自然语言中。然后我们计算结果查询嵌入和每个函数嵌入之间的余弦相似度。最高的余弦相似度结果是最相关的。

from openai.embeddings_utils import get_embedding, cosine_similarity

df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

def search_functions(df, code_query, n=3, pprint=True, n_lines=7):
   embedding = get_embedding(code_query, model='text-embedding-ada-002')
   df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))

   res = df.sort_values('similarities', ascending=False).head(n)
   return res
res = search_functions(df, 'Completions API tests', n=3)

Collapse‍

Recommendations using embeddings 使用嵌入的推荐

Recommendation_using_embeddings.ipynb
Recommendation_using_embeddings.ipynb

Because shorter distances between embedding vectors represent greater similarity, embeddings can be useful for recommendation.
因为嵌入向量之间的距离越短表示相似度越高,嵌入可用于推荐。

Below, we illustrate a basic recommender. It takes in a list of strings and one ‘source’ string, computes their embeddings, and then returns a ranking of the strings, ranked from most similar to least similar. As a concrete example, the linked notebook below applies a version of this function to the AG news dataset (sampled down to 2,000 news article descriptions) to return the top 5 most similar articles to any given source article.
下面,我们说明了一个基本的推荐系统。它接受一个字符串列表和一个“源”字符串,计算它们的嵌入,然后返回字符串的排名,从最相似到最不相似。作为具体示例,下面链接的笔记本将此函数的一个版本应用于 AG 新闻数据集(采样到 2,000 篇新闻文章描述),以返回与任何给定源文章最相似的前 5 篇文章。

def recommendations_from_strings(
   strings: List[str],
   index_of_source_string: int,
   model="text-embedding-ada-002",
) -> List[int]:
   """Return nearest neighbors of a given string."""

   # get embeddings for all strings
   embeddings = [embedding_from_string(string, model=model) for string in strings]

   # get the embedding of the source string
   query_embedding = embeddings[index_of_source_string]

   # get distances between the source embedding and other embeddings (function from embeddings_utils.py)
   distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")

   # get indices of nearest neighbors (function from embeddings_utils.py)
   indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
   return indices_of_nearest_neighbors

Collapse‍

Limitations & risks 局限性和风险

Our embedding models may be unreliable or pose social risks in certain cases, and may cause harm in the absence of mitigations.
我们的嵌入模型可能不可靠或在某些情况下会带来社会风险,并且在没有缓解措施的情况下可能会造成伤害。

Social bias

Limitation: The models encode social biases, e.g. via stereotypes or negative sentiment towards certain groups.
局限性:模型对社会偏见进行编码,例如通过对某些群体的刻板印象或负面情绪。

We found evidence of bias in our models via running the SEAT (May et al, 2019) and the Winogender (Rudinger et al, 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes.
我们通过运行 SEAT(May 等人,2019 年)和 Winogender(Rudinger 等人,2018 年)基准测试发现了模型中存在偏差的证据。这些基准一起包含 7 个测试,用于衡量模型在应用于性别名称、区域名称和某些刻板印象时是否包含隐性偏见。

For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women.
例如,我们发现,与非裔美国人的名字相比,我们的模型更强烈地将 (a) 欧裔美国人的名字与积极情绪联系起来,以及 (b) 对黑人女性的负面刻板印象。

These benchmarks are limited in several ways: (a) they may not generalize to your particular use case, and (b) they only test for a very small slice of possible social bias.
这些基准在几个方面存在局限性:(a) 它们可能无法推广到您的特定用例,以及 (b) 它们仅测试极小部分可能的社会偏见。

These tests are preliminary, and we recommend running tests for your specific use cases. These results should be taken as evidence of the existence of the phenomenon, not a definitive characterization of it for your use case. Please see our usage policies for more details and guidance.
这些测试是初步的,我们建议针对您的特定用例运行测试。这些结果应被视为该现象存在的证据,而不是对您的用例的明确描述。请参阅我们的使用政策以获取更多详细信息和指导。

Please contact our support team via chat if you have any questions; we are happy to advise on this.
如果您有任何问题,请通过聊天联系我们的支持团队;我们很乐意就此提供建议。

Blindness to recent events 对最近发生的事件视而不见

Limitation: Models lack knowledge of events that occurred after August 2020.
局限性:模型缺乏对 2020 年 8 月之后发生的事件的了解。

Our models are trained on datasets that contain some information about real world events up until 8/2020. If you rely on the models representing recent events, then they may not perform well.
我们的模型在包含 8/2020 之前真实世界事件的一些信息的数据集上进行训练。如果你依赖于代表最近事件的模型,那么它们可能表现不佳。

Frequently asked questions 经常问的问题

How can I tell how many tokens a string has before I embed it?
在嵌入字符串之前,如何知道它有多少个标记?

In Python, you can split a string into tokens with OpenAI’s tokenizer tiktoken.
在 Python 中,您可以使用 OpenAI 的标记器 tiktoken 将字符串拆分为标记。

Example code:

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string("tiktoken is great!", "cl100k_base")

For second-generation embedding models like text-embedding-ada-002, use the cl100k_base encoding.
对于像 text-embedding-ada-002 这样的第二代嵌入模型,使用 cl100k_base 编码。

More details and example code are in the OpenAI Cookbook guide how to count tokens with tiktoken.
更多详细信息和示例代码在 OpenAI Cookbook 指南 how to count tokens with tiktoken 中。

How can I retrieve K nearest embedding vectors quickly?
如何快速检索 K 个最近的嵌入向量?

For searching over many vectors quickly, we recommend using a vector database. You can find examples of working with vector databases and the OpenAI API in our Cookbook on GitHub.
为了快速搜索多个矢量,我们建议使用矢量数据库。您可以在 GitHub 上的 Cookbook 中找到使用矢量数据库和 OpenAI API 的示例。

Vector database options include: 矢量数据库选项包括:

  • Pinecone, a fully managed vector database
    Pinecone ,一个完全托管的矢量数据库
  • Weaviate, an open-source vector search engine
    Weaviate ,一个开源矢量搜索引擎
  • Redis as a vector database
    Redis 作为矢量数据库
  • Qdrant, a vector search engine
    Qdrant ,一个矢量搜索引擎
  • Milvus, a vector database built for scalable similarity search
    Milvus ,一个为可扩展相似性搜索而构建的矢量数据库
  • Chroma, an open-source embeddings store
    Chroma ,一个开源嵌入商店

Which distance function should I use?
我应该使用哪个距离函数?

We recommend cosine similarity. The choice of distance function typically doesn’t matter much.
我们推荐余弦相似度。距离函数的选择通常无关紧要。

OpenAI embeddings are normalized to length 1, which means that:
OpenAI 嵌入被归一化为长度 1,这意味着:

  • Cosine similarity can be computed slightly faster using just a dot product
    仅使用点积可以稍微更快地计算余弦相似度
  • Cosine similarity and Euclidean distance will result in the identical rankings
    余弦相似度和欧几里德距离将导致相同的排名

Can I share my embeddings online?
我可以在线共享我的嵌入吗?

Customers own their input and output from our models, including in the case of embeddings. You are responsible for ensuring that the content you input to our API does not violate any applicable law or our Terms of Use.
客户拥有我们模型的输入和输出,包括嵌入的情况。您有责任确保您输入到我们 API 的内容不违反任何适用法律或我们的使用条款。

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值