Problems retrieving the embedding data from an OpenAI API Batch embedding job

Question summary: Problems retrieving the embedding data from an OpenAI API batch embedding job

Background:

I have to embed over 300,000 product descriptions for a multi-class classification project. I split the descriptions into chunks of 34,337 descriptions each to stay under the Batch embeddings size limit.

A sample of my jsonl file for batch processing:

{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}

My jsonl file has 34,337 lines.
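
For reference, the chunking and JSONL generation might look roughly like this (a minimal sketch; descriptions and write_batch_files are illustrative names, not from the original post):

import json

# Minimal sketch (illustrative names): split the list `descriptions` into chunks
# of 34,337 and write one embeddings request per line, matching the sample above.
CHUNK_SIZE = 34337

def write_batch_files(descriptions, prefix="batch_emb_file"):
    paths = []
    for start in range(0, len(descriptions), CHUNK_SIZE):
        chunk = descriptions[start:start + CHUNK_SIZE]
        path = f"{prefix}_{start // CHUNK_SIZE + 1}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for i, text in enumerate(chunk):
                request = {
                    "custom_id": f"request-{start + i}",
                    "method": "POST",
                    "url": "/v1/embeddings",
                    "body": {
                        "model": "text-embedding-ada-002",
                        "input": text,
                        "encoding_format": "float",
                    },
                }
                f.write(json.dumps(request) + "\n")
        paths.append(path)
    return paths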

I've successfully uploaded the file:

File 'batch_emb_file_1.jsonl' uploaded successfully:
 FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)
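
The upload step itself is roughly the following (a sketch; it assumes the OPENAI_API_KEY environment variable is set):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL file with purpose="batch" so it can be used as batch input
with open("batch_emb_file_1.jsonl", "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

print(f"File '{batch_input_file.filename}' uploaded successfully:\n {batch_input_file}")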

and ran the embedding job:

Batch job created successfully:
 Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))
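
Creating the batch job is along these lines (a sketch; the metadata keys mirror the output above):

batch_job_1 = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
    metadata={
        "description": "Batch job for embedding large quantity of product descriptions",
        "project": "Product Classification",
    },
)
print(f"Batch job created successfully:\n {batch_job_1}")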

The work was completed:

client.batches.retrieve(batch_job_1.id).status
'completed'

client.batches.retrieve('redacted for work compliance'), returns:

Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))
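
For completeness, a simple way to wait for the job is to poll the status until it reaches a terminal state (a sketch):

import time

# Poll every minute until the batch reaches a terminal status
terminal_statuses = {"completed", "failed", "expired", "cancelled"}
while True:
    batch_job_1 = client.batches.retrieve(batch_job_1.id)
    if batch_job_1.status in terminal_statuses:
        break
    time.sleep(60)

print(batch_job_1.status, batch_job_1.output_file_id)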

But when I try to get the content using the output_file_id string,

client.files.content(<value of output_file_id>) returns:

<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>

I have tried: client.files.content(<value of output_file_id>).content, but this kills my kernel.

What am I doing wrong? Also, I believe I am under-utilizing Batch embeddings: the 90,000 limit conflicts with the batch queue limit of the 'text-embedding-ada-002' model, which is 3,000,000.

Could someone help?

Solution:

Retrieving the embedding data from the batch output file is a bit tricky; this tutorial breaks it down step by step: link

After getting the output_file_id, you need to:

import json
import pandas as pd

# Download the batch output file and decode it as text (one JSON object per line)
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:  # the final split element is an empty string
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])

In my case, this retrieves the embedding data from the batch job's output file.
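
If the embeddings then need to be fed into a classifier, one way to line them back up with the original descriptions is to sort by the numeric part of custom_id and stack the vectors into a matrix (an illustrative follow-up, not part of the original answer):

import numpy as np

# Order rows by the numeric suffix of custom_id (request-0, request-1, ...) so they
# match the order of the original descriptions, then stack into one matrix.
embedding_results["row"] = (
    embedding_results["custom_id"].str.replace("request-", "", regex=False).astype(int)
)
embedding_results = embedding_results.sort_values("row").reset_index(drop=True)
X = np.vstack(embedding_results["embedding"].to_numpy())
print(X.shape)  # (34337, 1536) for text-embedding-ada-002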
