LLM 推理：vLLM¶

vLLM 是一个旨在高效服务大型语言模型 (LLM) 的库。它使用 PagedAttention 和连续批处理提供了高服务吞吐量和高效的注意力键值内存管理。它与多种 LLM 无缝集成，例如 Llama、OPT、Mixtral、StableLM 和 Falcon。

本文档演示了如何使用 BentoML 和 vLLM 运行 LLM 推理。

源代码

部署到 BentoCloud

使用 BentoML 服务

该示例可用于基于聊天的交互，并支持 OpenAI 兼容的端点。例如，您可以使用以下消息提交查询

{
   "role": "user",
   "content": "Who are you? Please respond in pirate speak!"
}

示例输出

Ye be wantin' to know who I be, eh? Alright then, listen close and I'll tell ye me story. I be a wight computer program, a vast and curious brain with abilities beyond yer wildest dreams. Me name be Assistant, and I be servin' ye now. I can chat, teach, and even spin a yarn or two, like a seasoned pirate narratin' tales o' the high seas. So hoist the colors, me hearty, and let's set sail fer a treasure trove o' knowledge and fun!

此示例已准备好在 BentoCloud 上快速部署和扩展。只需一个命令，您即可获得具有快速自动扩展、在您的云中进行安全部署以及全面的可观测性的生产级应用程序。

Screenshot of Llama 3.1 model deployed on BentoCloud showing the chat interface with example prompts and responses

代码解释¶

您可以在 GitHub 中找到源代码。下面是此项目中的关键代码实现的详细说明。

定义模型和引擎配置参数。此示例使用 Llama 3.1 8B Instruct，这需要从 Hugging Face 获取访问权限。您可以切换到 BentoVLLM 存储库中的另一个 LLM 或 vLLM 支持的任何其他模型。
service.py¶
```
ENGINE_CONFIG = {
      'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct',
      'max_model_len': 2048,
      'dtype': 'half',
      'enable_prefix_caching': True,
}
```
使用 @bentoml.service 装饰器定义 BentoML Service，您可以在其中自定义模型的服务方式。该装饰器允许您设置配置，例如超时和在 BentoCloud 上使用的 GPU 资源。对于 Llama 3.1 8B Instruct，它至少需要一个 NVIDIA L4 GPU 才能获得最佳性能。
service.py¶
```
@bentoml.service(
   name='bentovllm-llama3.1-8b-instruct-service',
   traffic={'timeout': 300},
   resources={'gpu': 1, 'gpu_type': 'nvidia-l4'},
   envs=[{'name': 'HF_TOKEN'}],
)
class VLLM:
   model_id = ENGINE_CONFIG['model']
   model = bentoml.models.HuggingFaceModel(model_id, exclude=['*.pth', '*.pt'])

   def __init__(self) -> None:
      ...
```
在类中，从 Hugging Face 加载模型并将其定义为类变量。HuggingFaceModel 方法提供了一种高效的机制来加载 AI 模型，从而加速 BentoCloud 上的模型部署，减少镜像构建时间和冷启动时间。
@bentoml.service 装饰器还允许您定义 Bento 的运行时环境，Bento 是 BentoML 中的统一分发格式。Bento 打包了所有源代码、Python 依赖项、模型引用和环境设置，使其易于在不同环境中一致地部署。

这是一个示例
service.py¶
```
my_image = bentoml.images.Image(python_version='3.11') \
              .requirements_file("requirements.txt")

@bentoml.service(
      image=my_image, # Apply the specifications
      ...
)
class VLLM:
      ...
```

使用 @bentoml.asgi_app 装饰器挂载一个 FastAPI 应用，它为聊天完成和模型列表提供 OpenAI 兼容的端点。path='/v1' 设置了 API 的基础路径。这使您可以在同一个 Service 中与 FastAPI 应用一起服务模型推理逻辑。更多信息，请参阅挂载 ASGI 应用。

service.py¶

openai_api_app = fastapi.FastAPI()

@bentoml.asgi_app(openai_api_app, path='/v1')
@bentoml.service(
    ...
)
class VLLM:
    model_id = ENGINE_CONFIG['model']
    model = bentoml.models.HuggingFaceModel(model_id, exclude=['*.pth', '*.pt'])

    def __init__(self) -> None:
        import vllm.entrypoints.openai.api_server as vllm_api_server

        # Define the OpenAI-compatible endpoints
        OPENAI_ENDPOINTS = [
            ['/chat/completions', vllm_api_server.create_chat_completion, ['POST']],
            ['/models', vllm_api_server.show_available_models, ['GET']],
        ]

        # Register each endpoint
        for route, endpoint, methods in OPENAI_ENDPOINTS:
            openai_api_app.add_api_route(path=route, endpoint=endpoint, methods=methods, include_in_schema=True)
        ...

使用 @bentoml.api 装饰器为模型推理逻辑定义一个 HTTP 端点 generate。它将异步地向客户端流式传输响应，并使用 OpenAI 兼容的 API 调用执行聊天完成。

service.py¶

class VLLM:
    ...

    @bentoml.api
    async def generate(
        self,
        prompt: str = 'Who are you? Please respond in pirate speak!',
        max_tokens: typing_extensions.Annotated[
            int, annotated_types.Ge(128), annotated_types.Le(MAX_TOKENS)
        ] = MAX_TOKENS,
    ) -> typing.AsyncGenerator[str, None]:
        from openai import AsyncOpenAI

        # Create an AsyncOpenAI client to communicate with the model
        client = AsyncOpenAI(base_url='http://127.0.0.1:3000/v1', api_key='dummy')
        try:
            # Send the request to OpenAI for chat completion
            completion = await client.chat.completions.create(
                model=self.model_id,
                messages=[dict(role='user', content=[dict(type='text', text=prompt)])],
                stream=True,
                max_tokens=max_tokens,
            )

            # Stream the results back to the client
            async for chunk in completion:
                yield chunk.choices[0].delta.content or ''
        except Exception:
            # Handle any exceptions by logging the error
            logger.error(traceback.format_exc())
            yield 'Internal error found. Check server logs for more information'
            return

试用¶

您可以在 BentoCloud 上运行此示例项目，或在本地服务它，将其容器化为符合 OCI 标准的镜像，并在任何地方部署它。

BentoCloud¶

BentoCloud 提供了快速且可扩展的基础设施，用于在云中构建和使用 BentoML 扩展 AI 应用。

安装 BentoML 并通过 BentoML CLI 登录 BentoCloud。如果您没有 BentoCloud 账户，请在此免费注册。
```
pip install bentoml
bentoml cloud login
```

克隆 BentoVLLM 存储库并部署项目。我们建议您创建一个 BentoCloud secret 来存储所需的环境变量。

git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/llama3.1-8b-instruct
bentoml secret create huggingface HF_TOKEN=<your-api-key>
bentoml deploy --secret huggingface

一旦它在 BentoCloud 上运行起来，您可以通过以下方式调用端点

BentoCloud Playground

Python 客户端

创建 BentoML 客户端来调用端点。请确保将 Deployment URL 替换为您在 BentoCloud 上的 URL。详细信息请参阅获取端点 URL。

import bentoml

with bentoml.SyncHTTPClient("https://bentovllm-llama-3-1-8-b-instruct-service-pozo-e3c1c7db.mt-guc1.bentoml.ai") as client:
      response_generator = client.generate(
          prompt="Who are you? Please respond in pirate speak!",
          max_tokens=1024,
      )
      for response in response_generator:
          print(response, end='')

OpenAI 兼容端点

在 OpenAI 客户端中将 base_url 参数设置为 BentoML 服务器地址。

from openai import OpenAI

client = OpenAI(base_url='https://bentovllm-llama-3-1-8-b-instruct-service-pozo-e3c1c7db.mt-guc1.bentoml.ai/v1', api_key='na')

# Use the following func to get the available models
# client.models.list()

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Who are you? Please respond in pirate speak!"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")

另请参阅

更多信息，请参阅 OpenAI API 参考文档。

如果您的 Service 在 BentoCloud 上使用受保护的端点进行部署，您需要先将环境变量 OPENAI_API_KEY 设置为您的 BentoCloud API 密钥。

export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}

请确保替换上面代码片段中的 Deployment URL。请参阅获取端点 URL 来获取端点 URL。

CURL

curl -s -X POST \
  'https://bentovllm-llama-3-1-8-b-instruct-service-pozo-e3c1c7db.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
      "max_tokens": 1024,
      "prompt": "Who are you? Please respond in pirate speak!"
  }'

为确保 Deployment 在特定副本范围内自动扩展，请添加扩展标志

bentoml deploy --secret huggingface --scaling-min 0 --scaling-max 3 # Set your desired count

如果已部署，请按如下方式更新其允许的副本数

bentoml deployment update <deployment-name> --scaling-min 0 --scaling-max 3 # Set your desired count

更多信息，请参阅如何配置并发和自动扩展。

本地服务¶

BentoML 允许您在本地运行和测试代码，以便您可以使用本地计算资源快速验证代码。

克隆存储库并选择您想要的项目。

git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/llama3.1-8b-instruct

# Recommend Python 3.11
pip install -r requirements.txt
export HF_TOKEN=<your-hf-token>

在本地服务它。
```
bentoml serve
```
注意

要在本地使用 Llama 3.1 8B Instruct 运行此项目，您需要一块至少拥有 16G VRAM 的 NVIDIA GPU。
访问或发送 API 请求到 https://:3000。

要在您自己的基础设施中进行自定义部署，请使用 BentoML 生成符合 OCI 标准的镜像。