Model Service

The LLMOS platform makes it easy to serve machine learning models using the ModelService resource, which provides a simple way to set up and manage model serving backed by the vLLM engine. You can configure details such as the model name, Hugging Face settings, and resource requirements to deploy models efficiently and at scale.

model-service-list

Creating a Model Service

You can create one or more model services from the LLMOS Management > Model Services page.

General Configuration

  1. Name and Namespace: Enter the model service name and namespace.
  2. Model Source and Name:
    • Select the model source: Hugging Face, ModelScope, or a Local Path.
    • For Hugging Face or ModelScope models, paste the model name from the registry (e.g., Qwen/Qwen2.5-0.5B-Instruct).
    • For local path models, specify the volume path (e.g., /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct).
  3. Engine Arguments (Optional): Add arguments such as --dtype=half --max-model-len=4096 in the Arguments field if needed (see the sketch after this list). More details.
  4. Hugging Face Configurations (Optional):
    • Use a secret credential for models that need authentication.
    • Add a custom Hugging Face Mirror URL if using a proxy (e.g., https://hf-mirror.com/).
  5. Environment Variables (Optional): Add any extra environment variables as needed. More details.
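
If you are curious what the engine arguments control, the following is a minimal sketch of how the same flags map onto vLLM's offline Python API. It is for illustration only; the model service runs the vllm-openai server for you, so you normally just place the flags in the Arguments field.

from vllm import LLM, SamplingParams

# Roughly equivalent to passing --dtype=half --max-model-len=4096
# as engine arguments to the vllm-openai server.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    dtype="half",          # load weights in half precision
    max_model_len=4096,    # cap the context length at 4096 tokens
)

outputs = llm.generate(["Say this is a test"], SamplingParams(temperature=0.9))
print(outputs[0].outputs[0].text)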

model-service-create-general

Resource Configuration

note

For GPU resource requirements of large language models, see LLM numbers.

  1. CPU and Memory: Assign CPU and memory resources for the model.
  2. GPU Resources:
    • Choose GPU and Runtime Class (default: nvidia).
      • Minimum: 1 GPU for the vllm-openai image.
      • For large models, use tensor parallelism to distribute the model across multiple GPUs on the same node. For example, selecting 4 GPUs sets the tensor parallel size to 4 (see the sketch after this list).
    • To share a GPU device, enable vGPU and specify the vGPU memory size (in MiB) and vGPU Cores (default: 100%).
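
Below is a minimal sketch of what the tensor-parallel setting corresponds to in vLLM's Python API; the model service sets this automatically based on the number of GPUs you select (in vLLM terms, the --tensor-parallel-size engine argument).

from vllm import LLM

# Roughly what selecting 4 GPUs amounts to: the model is sharded
# across 4 GPUs on the same node via tensor parallelism.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=4,
)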

model-service-resources

Volumes

  1. Persistent Volume:
    • Default: A persistent volume mounted at /root/.cache stores downloaded models.
    • For shared models across services, replace the default volume with a custom ReadWriteMany persistent volume.
    • For local path models:
      • Add an existing volume with model files to skip downloading.
      • Remove the default model-dir volume if unnecessary.
  2. Shared Memory (dshm):
    • Mount an emptyDir volume to /dev/shm with Medium set to Memory for temporary in-memory storage.
    • Useful for PyTorch tensor parallel inference, which needs shared memory between processes.
    • If not enabled, the default shared memory size is 64 MiB.
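
To verify the shared memory available inside a running model service container, a quick check from Python (a small sketch; run it inside the container, for example via kubectl exec):

import shutil

# Size of the shared memory mount inside the container.
# Without a dshm emptyDir volume this is typically the 64 MiB default.
total, _, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")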

modelservice-create-volumes

Node Scheduling

You can specify node constraints for scheduling your model service using node labels, or leave the default to run it on any available node.

model-service-node-scheduling

note

For more details on node scheduling, refer to the Kubernetes Node Affinity Documentation.
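
If you constrain scheduling by node labels, the label must already exist on the target node. The following is a hypothetical sketch using the official Kubernetes Python client; the node name and label key are examples, not values defined by LLMOS:

from kubernetes import client, config

# Use config.load_incluster_config() instead when running inside the cluster.
config.load_kube_config()

v1 = client.CoreV1Api()

# Attach an example label to a node; the model service can then be
# constrained to nodes carrying this label on the Node Scheduling page.
v1.patch_node(
    "gpu-node-1",  # hypothetical node name
    {"metadata": {"labels": {"example.com/accelerator": "a100"}}},
)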

Accessing Model Service APIs

The Model Service exposes a set of RESTful APIs compatible with OpenAI's API under the /v1 path. You can get a model's API URL by clicking the Copy button of the selected model.

modelservice-api-url

API Endpoints

| Route Path | Methods | Description |
| --- | --- | --- |
| /v1/chat/completions | POST | Perform chat completions using the model service. |
| /v1/completions | POST | Perform standard completions using the model service. |
| /v1/embeddings | POST | Generate embeddings using the model service. |
| /v1/models | GET | List all available models. |
| /health | GET | Check the health of the model service HTTP server. |
| /tokenize | POST | Tokenize text using the running model service. |
| /detokenize | POST | Detokenize tokens using the running model service. |
| /openapi.json | GET, HEAD | Get the OpenAPI JSON specification for the model service. |

API Usage Examples

note

The LLMOS API token can be obtained from the API Keys page.

cURL Example

export LLMOS_API_KEY=myapikey
export API_BASE=https://192.168.31.100:8443/api/v1/namespaces/default/services/modelservice-qwen2:http/proxy/v1
curl -k -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLMOS_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Say this is a test"
      }
    ],
    "temperature": 0.9
  }' \
  $API_BASE/chat/completions

Response Example:

{
  "id": "chat-efffa70236bd4edda7e5420349339d45",
  "object": "chat.completion",
  "created": 1727267645,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes, it is a test."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 32,
    "completion_tokens": 8
  }
}

Python Example

Since the API is OpenAI-compatible, you can use the model service as a drop-in replacement for the OpenAI API in existing applications.

from openai import OpenAI
import httpx

# Set up the API key and base URL of the model service.
openai_api_key = "llmos-5frck:xxxxxxxxxg79c9p5"
openai_api_base = "https://192.168.31.100:8443/api/v1/namespaces/default/services/modelservice-qwen2:http/proxy/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
    http_client=httpx.Client(verify=False),  # Disable SSL verification or use a custom CA bundle.
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "How do I output all files in a directory using Python?"}],
)
print(completion.choices[0].message.content)
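
The same OpenAI client can call the other endpoints listed above. For example, listing the models exposed by the service and issuing a standard (non-chat) completion:

import httpx
from openai import OpenAI

# Same client setup as in the example above.
client = OpenAI(
    api_key="llmos-5frck:xxxxxxxxxg79c9p5",
    base_url="https://192.168.31.100:8443/api/v1/namespaces/default/services/modelservice-qwen2:http/proxy/v1",
    http_client=httpx.Client(verify=False),
)

# GET /v1/models — list all available models.
for model in client.models.list().data:
    print(model.id)

# POST /v1/completions — standard completion.
completion = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt="LLMOS is",
    max_tokens=32,
)
print(completion.choices[0].text)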

Notebooks Interaction

You can also interact with model services from Notebooks, which let you explore a model's capabilities more interactively with HTML, graphs, and more (e.g., from a Jupyter notebook as shown below).

model-service-notebook

note

Within your LLMOS cluster, you can connect to the model service using its internal DNS name.

To get the internal DNS name, click the Copy Internal URL button of the model service.
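
For example, a notebook running in the same cluster can point the OpenAI client at the internal URL instead of the external proxy URL. The hostname and port below are placeholders; use the value copied from the Copy Internal URL button:

from openai import OpenAI

# Placeholder internal URL; replace it with the copied internal URL of your model service.
internal_base = "http://modelservice-qwen2.default.svc.cluster.local:8000/v1"

client = OpenAI(api_key="llmos-5frck:xxxxxxxxxg79c9p5", base_url=internal_base)
reply = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello from a notebook!"}],
)
print(reply.choices[0].message.content)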

Model Service Monitoring

The Model Service integrates with LLMOS Monitoring to provide built-in metrics for tracking performance and usage.

  • Click on the model service name in the list to open its details page.
  • Use the Token Metrics tab to view token-level metrics.
  • Use the Metrics tab to see resource usage such as CPU, memory, and disk I/O.

model-service-metrics

Adding a Hugging Face Token

Some models require authentication to download. If your model needs a token, follow these steps to add a Hugging Face token:

  1. Go to Advanced > Secrets and click Create.
  2. Select the Opaque secret type.
    secret-create-opaque
  3. Choose the Namespace matching your model service and provide a clear Name (e.g., my-hf-token).
  4. Set the Key to token and paste your Hugging Face token as the Value.
    secret-create-hf-token
  5. Click Create to save the secret.

Once created, the secret will appear as an option when setting up the model service in the same namespace.
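
If you prefer to create the secret outside the UI, the following is a rough equivalent using the Kubernetes Python client; the secret name and namespace are examples, but the key must still be token:

from kubernetes import client, config

config.load_kube_config()

# Opaque secret holding the Hugging Face token under the "token" key.
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="my-hf-token", namespace="default"),
    type="Opaque",
    string_data={"token": "hf_xxxxxxxxxxxxxxxx"},  # your Hugging Face token
)
client.CoreV1Api().create_namespaced_secret(namespace="default", body=secret)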