大模型框架：vLLM

pip install wheel packaging ninja setuptools>=49.4.0 numpy psutil

# 需要进入源码目录
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

5.执行安装：

VLLM_TARGET_DEVICE=cpu python setup.py install

2.3 相关配置

1. vLLM默认从HuggingFace下载模型，如果想从ModelScope下载模型，需要配置环境变量：

export VLLM_USE_MODELSCOPE=True

三、使用 vLLM

3.1 离线推理

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/weisx/model/Qwen1.5-4B-Chat")

# Pass the default decoding hyperparameters of Qwen1.5-4B-Chat
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model="Qwen/l/Qwen1.5-4B-Chat", trust_remote_code=True)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

3.2 适配OpenAI-API的API服务

借助vLLM，构建一个与OpenAI API兼容的API服务十分简便，该服务可以作为实现OpenAI API协议的服务器进行部署。默认情况下，它将在 http://localhost:8000 启动服务器。您可以通过 --host 和 --port 参数来自定义地址。请按照以下所示运行命令：

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-4B-Chat

使用curl与Qwen对接：

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen1.5-4B-Chat",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
]
}'

使用python客户端与Qwen对接：

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-4B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:", chat_response)

原文地址:https://blog.csdn.net/m0_37559973/article/details/139000981 本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：https://www.suanlizi.com/kf/1794350305190744064.html 如若内容造成侵权/违法违规/事实不符，请联系《酸梨子》网邮箱：1419361763@qq.com进行投诉反馈，一经查实，立即删除！

阅读全部