Problems retrieving the embeddings data form OpenAI API Batch embedding job

题意:从OpenAI API批量嵌入作业中检索嵌入数据时遇到问题

问题背景:

I have to embed over 300,000 products description for a multi-classification project. I split the descriptions onto chunks of 34,337 descriptions to be under the Batch embeddings limit size.

A sample of my jsonl file for batch processing:

{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}

My jsonl file has 34,337 lines.

I've susscesfully uploaded the file:

File 'batch_emb_file_1.jsonl' uploaded succesfully:
 FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)

and ran the embedding job:

Batch job created successfully:
 Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

The work was completed:

client.batches.retrieve(batch_job_1.id).status
'completed'

client.batches.retrieve('redacted for work compliance'), returns:

Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))

But when I try to get the content using output_file_id string

client.files.content(value of output_file_id), returns:

<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>

I have tried: client.files.content(value of output_file_id).content but this kills my kernel

What am I doing wrong? Also I believe I am under utilizing Batch embeddings. the 90,000 limits conflicts with Batch Queue Limit of 'text-embedding-ada-002' model which is: 3,000,000

Could someone help?

问题解决:

Retrieving the embedding data from batch file is a bit trick, this Tutorial breaks it down set by set ​​​​​​link
 

after getting the output_file_id, you need to:

output_file =client.files.content(output_files_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:
            data =json.loads(line)
            custom_id = data.get('custom_id')
            embedding = data['response']['body']['data'][0]['embedding']
            embedding_results.append([custom_id, embedding])


embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])

In my case, this retrieves the embedding data from the batch job file

相关推荐

最近更新

  1. docker php8.1+nginx base 镜像 dockerfile 配置

    2024-07-19 23:30:02       52 阅读
  2. Could not load dynamic library ‘cudart64_100.dll‘

    2024-07-19 23:30:02       54 阅读
  3. 在Django里面运行非项目文件

    2024-07-19 23:30:02       45 阅读
  4. Python语言-面向对象

    2024-07-19 23:30:02       55 阅读

热门阅读

  1. ArcEngine 非SDE方式加载postgis数据

    2024-07-19 23:30:02       18 阅读
  2. C语言习题~day32

    2024-07-19 23:30:02       17 阅读
  3. 安康古韵长,汉水碧波扬

    2024-07-19 23:30:02       14 阅读
  4. python单继承和多继承实例讲解

    2024-07-19 23:30:02       16 阅读
  5. Linux的常用命令大全

    2024-07-19 23:30:02       15 阅读
  6. 监测vuex中state的变化

    2024-07-19 23:30:02       17 阅读