Practicus AI Large Language Model (LLM) Hosting
This example demonstrates how to leverage the Practicus AI platform's optimized LLM hosting features, powered by engines like vLLM for high-throughput and low-latency inference. We will cover:
- Experimenting in Design Time: Interactively running and testing LLMs directly within a Practicus AI Worker's Jupyter environment using the `ModelServer` utility.
- Deploying for Runtime: Packaging and deploying LLMs as scalable endpoints on the Practicus AI Model Hosting platform.
This approach is the recommended method for hosting most Large Language Models on Practicus AI, offering significant performance benefits and simplified deployment compared to writing custom prediction code from scratch.
Note: If you need to host non-LLM models or require deep customization beyond the options provided by the built-in LLM serving engine (e.g., complex pre/post-processing logic tightly coupled with the model), please view the custom model serving section for guidance on building models with custom Python code.
1. Overview of the `prt.models.server` Utility
The `practicuscore.models.server` module (`prt.models.server`) provides a high-level interface for managing LLM inference servers within the Practicus AI environment. It controls an underlying inference engine process (vLLM by default) and exposes its functionality.
Key capabilities include:
- Starting/Stopping Servers: Easily launch and terminate the inference server process (e.g., vLLM) with specified models and configurations (like quantization, tensor parallelism).
- Health & Status Monitoring: Check if the server is running, view logs, and diagnose issues.
- Providing Access URL: Get the local URL to interact with the running server.
- Runtime Integration: Facilitates deploying models using optimized container images, often exposing an OpenAI-compatible API endpoint for standardized interaction.
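As a quick orientation, a minimal lifecycle sketch using these calls might look like the following; the same calls are demonstrated in more detail in the next section, and the model ID is only an example.

```python
import practicuscore as prt

# Start the inference server (vLLM by default) for a chosen model,
# then inspect it and shut it down. Calls mirror the examples below.
prt.models.server.start("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

print(prt.models.server.get_status())    # e.g. 'running'
print(prt.models.server.get_base_url())  # local URL; append '/v1' for the OpenAI-compatible API
print(prt.models.server.get_output())    # recent stdout/stderr from the engine process

prt.models.server.stop()
```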
2. Experimenting in Design Time (Jupyter Notebook)
You can interactively start an LLM server, send requests, and shut it down directly within a Practicus AI Worker notebook. This is ideal for development, testing prompts, and evaluating different models or configurations before deployment.
Note: This runs the server locally within your Worker's resources. Ensure your Worker has sufficient resources (especially GPU memory) for the chosen model.
```python
import practicuscore as prt
from openai import OpenAI
import time

# Define the model from Hugging Face and any specific engine options
# Example: Use TinyLlama and specify 'half' precision (float16) suitable for GPUs like T4
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
vllm_options = {"dtype": "half"}

# Start the vLLM server process and wait for it to become ready
print(f"Starting server for model: {model_id} with options: {vllm_options}...")
prt.models.server.start(model_id, options=vllm_options)
# The server might take a moment to initialize, especially on first download

# Get the base URL of the locally running server
# Append '/v1' for the OpenAI-compatible API endpoint
base_url = prt.models.server.get_base_url()

if not base_url:
    print("Error: Server failed to start. Check logs:")
    print(prt.models.server.get_output())
else:
    openai_api_base = base_url + "/v1"
    print(f"Server started. OpenAI compatible API Base URL: {openai_api_base}")

    # Create an OpenAI client pointed at the local server
    # No API key is needed ('api_key' can be anything) for local interaction
    client = OpenAI(
        base_url=openai_api_base,
        api_key="not-needed-for-local",
    )

    # Send a chat completion request
    print("Sending chat request...")
    try:
        response = client.chat.completions.create(
            # The 'model' parameter should match the model loaded by the server
            # You can often use model=None if only one model is served,
            # or explicitly pass the model_id
            model=model_id,
            messages=[{"role": "user", "content": "What is the capital of France?"}],
            max_tokens=50,  # Limit response length
            temperature=0.7,
        )
        print("Response received:")
        print(response.choices[0].message.content)
        # Expected output might be similar to: 'The capital of France is Paris.'
    except Exception as e:
        print(f"Error during chat completion: {e}")
        print("Check server status and logs.")
```
Additional Utilities for Design Time
While the server is running in your notebook session, you can monitor it:
```python
# Check the server's status ('running', 'error', 'stopped', etc.)
status = prt.models.server.get_status()
print(f"Server Status: {status}")

# Get recent logs (stdout/stderr) from the server process
logs = prt.models.server.get_output()
print("Recent Server Logs:")
print(logs)
```
Testing with a Mock LLM Server on CPU
For testing pipelines or developing client code without requiring a GPU or a real LLM, you can run a simple mock server. The mock server only needs to implement the expected API endpoint (e.g., `/v1/chat/completions`).
Create a Python file (see the example `mock_llm_server.py` at the bottom of this page) with a basic web server (such as Flask or FastAPI) that returns predefined responses. Then start it using `prt.models.server.start()`.
```python
# Example: Assuming you have 'mock_llm_server.py' in the same directory
# This file would contain a simple Flask/FastAPI app mimicking the OpenAI API structure
try:
    print("Attempting to stop any existing server...")
    prt.models.server.stop()  # Stop the real LLM server if it's running
    time.sleep(2)

    print("Starting the mock server...")
    # Make sure 'mock_llm_server.py' exists and is runnable
    prt.models.server.start("mock_llm_server.py")
    time.sleep(5)  # Give the mock server time to start

    mock_base_url = prt.models.server.get_base_url()

    if not mock_base_url:
        print("Error: Mock server failed to start. Check logs:")
        print(prt.models.server.get_output())
    else:
        mock_api_base = mock_base_url + "/v1"
        print(f"Mock Server Running. API Base: {mock_api_base}")

        # Create client for the mock server
        mock_client = OpenAI(base_url=mock_api_base, api_key="not-needed")

        # Send a request to the mock server
        mock_response = mock_client.chat.completions.create(
            model="mock-model",  # Model name expected by your mock server
            messages=[{"role": "user", "content": "Hello mock!"}],
        )
        print("Mock Response:", mock_response.choices[0].message.content)
        # The supplementary mock server below returns: 'This is a mock response for: Hello mock!'
except FileNotFoundError:
    print("Skipping mock server test: 'mock_llm_server.py' not found.")
except Exception as e:
    print(f"An error occurred during mock server test: {e}")
finally:
    # Important: Stop the mock server when done
    print("Stopping the mock server...")
    prt.models.server.stop()
```
Cleaning Up the Design Time Server
When you are finished experimenting in the notebook, it's crucial to stop the server to release GPU resources.
print("Stopping any running server...")
prt.models.server.stop()
print(f"Server Status after stop: {prt.models.server.get_status()}")
3. Deploying Models for Runtime
Once you have selected and tested your model, you need to deploy it as a scalable service on the Practicus AI Model Hosting platform. This involves packaging the model and its serving configuration into a container image and creating a deployment through the Practicus AI console.
There are a few ways to configure the container for LLM serving:
Option 1: Dynamically Download Model at Runtime (No Coding Required)
This is the quickest way to get started. Use a pre-built Practicus AI vLLM image, and configure the model ID and options via environment variables in the model deployment settings. The container will download the specified model when it starts.
Pros: Simple configuration, no need to build custom images. Cons: Can lead to longer cold start times as the model needs to be downloaded on pod startup. Potential for download issues at runtime.
Steps:

1. Define Container Image in Practicus AI:
   - Navigate to `Infrastructure > Container Images` in the Practicus AI console.
   - Add a new image. Use a vLLM-enabled image provided by Practicus AI, for example: `ghcr.io/practicusai/practicus-modelhost-gpu-vllm:25.5.0` (replace with the latest/appropriate version).
2. Create Model Deployment:
   - Go to `ML Model Hosting > Model Deployments`.
   - Create a new deployment, ensuring you allocate the necessary GPU resources.
   - Select the container image you added in the previous step.
   - In the `Extra configuration` section (or environment variables section), define:
     - `PRT_SERVE_MODEL`: Set this to the Hugging Face model ID (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`).
     - `PRT_SERVE_MODEL_OPTIONS_B64`: (Optional) Provide Base64-encoded JSON containing vLLM options (like `{"dtype": "half"}`).
3. Create Model and Version:
   - Go to `ML Model Hosting > Models`.
   - Add a `New Model` (e.g., `my-tiny-llama`).
   - Add a `New Version` for this model, pointing it to the `Model Deployment` you just created.
   - Tip: You can create multiple versions, each pointing to a different model deployment, and then perform A/B testing to compare LLM performance.
Example 1: Serving TinyLlama (default options)
In the Model Deployment `Extra configuration` section, add:
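Based on the environment variables described in the steps above, the configuration amounts to the following single entry (the exact syntax accepted by the `Extra configuration` section may vary in your console version):

```
PRT_SERVE_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```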
Example 2: Serving TinyLlama with Half Precision
1. Generate the Base64-encoded options, e.g. `echo '{"dtype": "half"}' | base64`, which yields `eyJkdHlwZSI6ICJoYWxmIn0K` (the same value used in the Dockerfile below).
2. In the Model Deployment `Extra configuration` section, add both variables as shown below.
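Based on the values above (and matching the Dockerfile example later on this page), the configuration amounts to the following two entries; the exact syntax accepted by the `Extra configuration` section may vary in your console version:

```
PRT_SERVE_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
PRT_SERVE_MODEL_OPTIONS_B64=eyJkdHlwZSI6ICJoYWxmIn0K
```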
Option 2: Pre-download and Bake Model into Container Image (Recommended for Production)
Build a custom container image that includes the model files. This avoids runtime downloads, leading to faster and more reliable pod startups.
Pros: Faster cold starts, improved reliability (no runtime download dependency), enables offline environments. Cons: Requires building and managing custom container images. Image size will be larger. Longer build times.
Steps:

1. Create a `Dockerfile`:
   - Start from a Practicus AI base vLLM image.
   - Set environment variables for the model and options.
   - Use `huggingface-cli download` to download the model during the image build.
   - (Optional but recommended) Configure vLLM to use the downloaded path and enable offline mode.
   - (Optional) If you need custom logic, `COPY` your `model.py` (see Option 3).
2. Build and Push the Image: Build the Docker image and push it to a container registry accessible by Practicus AI (see the example commands after the Dockerfile below).
3. Configure Practicus AI:
   - Add your custom image URL in `Infrastructure > Container Images`.
   - Create a `Model Deployment` using your custom image. You usually don't need to set `PRT_SERVE_MODEL` or `PRT_SERVE_MODEL_OPTIONS_B64` as environment variables here, as they are baked into the image (unless your image startup script specifically reads them).
   - Create the `Model` and `Version` pointing to this deployment.
Example `Dockerfile`:
```dockerfile
# Use a Practicus AI base image that includes GPU support and vLLM
FROM ghcr.io/practicusai/practicus-modelhost-base-gpu-vllm:25.5.0

# --- Configuration baked into the image ---
ENV PRT_SERVE_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Example: Options for smaller GPUs like Nvidia T4 (float16)
# Generated via: echo '{"dtype": "half"}' | base64
ENV PRT_SERVE_MODEL_OPTIONS_B64="eyJkdHlwZSI6ICJoYWxmIn0K"

# --- Model Download during Build ---
# Define a persistent cache location within the image
ENV HF_HOME="/var/practicus/cache/huggingface"

RUN \
    # Download model for offline use
    echo "Starting to download '$PRT_SERVE_MODEL' to '$HF_HOME'." && \
    mkdir -p "$HF_HOME" && \
    huggingface-cli download "$PRT_SERVE_MODEL" --local-dir "$HF_HOME" && \
    echo "Completed downloading '$PRT_SERVE_MODEL' to '$HF_HOME'." && \
    # Create VLLM redirect file
    REDIRECT_JSON="{\"$PRT_SERVE_MODEL\": \"$HF_HOME\"}" && \
    REDIRECT_JSON_PATH="/var/practicus/vllm_model_redirect.json" && \
    echo "Creating VLLM redirect file: $REDIRECT_JSON_PATH" && \
    echo "VLLM redirect JSON content: $REDIRECT_JSON" && \
    echo "$REDIRECT_JSON" > "$REDIRECT_JSON_PATH"

# --- vLLM Configuration for Baked Model ---
# Tell vLLM (via our entrypoint) to use the baked model path directly
ENV VLLM_MODEL_REDIRECT_PATH="/var/practicus/vllm_model_redirect.json"

# (Recommended for baked images) Prevent accidental downloads at runtime
ENV TRANSFORMERS_OFFLINE=1
ENV HF_HUB_OFFLINE=1

# --- Custom Logic (Optional - See Option 3) ---
# If you need custom init/serve logic, uncomment and provide your model.py
# COPY model.py /var/practicus/model.py
```
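For step 2 above ("Build and Push the Image"), a typical build-and-push sequence might look like the following; the registry, image name, and tag are placeholders you would replace with your own:

```bash
# Build the image from the Dockerfile above and push it to a registry
# that Practicus AI can pull from. Registry, name, and tag are placeholders.
docker build -t my-registry.example.com/my-org/practicus-tinyllama:1.0 .
docker push my-registry.example.com/my-org/practicus-tinyllama:1.0
```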
Option 3: Custom `model.py` Implementation
If you need to add custom logic before or after the vLLM server handles requests (e.g., complex input validation/transformation, custom readiness checks, integrating external calls), you can provide a `/var/practicus/model.py` file within your custom container image (usually built as described in Option 2).
Pros: Maximum flexibility for custom logic around the vLLM server. Cons: Requires Python coding; adds complexity compared to standard vLLM usage.
Steps:
1. Create your `model.py` with `init` and `serve` functions.
2. Inside `init`, call `prt.models.server.start()` to launch the vLLM process.
3. Inside `serve`, you can add pre-processing logic, then call `await prt.models.server.serve()` to forward the request to the underlying vLLM server, and potentially add post-processing logic to the response.
4. Build a custom Docker image (as in Option 2), ensuring you `COPY model.py /var/practicus/model.py`.
```python
# Example: /var/practicus/model.py
# Customize as required, place it in your project, and COPY it into the Docker image
import practicuscore as prt


async def init(**kwargs):
    if not prt.models.server.initialized:
        prt.models.server.start()


async def serve(http_request, payload: dict, **kwargs):
    return await prt.models.server.serve(http_request=http_request, payload=payload, **kwargs)
```
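As a hypothetical sketch of where custom logic could go, the following extends the example above with simple pre- and post-processing around the forwarded request. The payload key checked (`messages`) and the added response field are illustrative assumptions, not part of the Practicus AI API:

```python
# Hypothetical sketch: wrapping the vLLM server with custom pre/post-processing.
# The init/serve signatures follow the example above; the payload keys and the
# extra response field are illustrative assumptions only.
import practicuscore as prt


async def init(**kwargs):
    if not prt.models.server.initialized:
        prt.models.server.start()


async def serve(http_request, payload: dict, **kwargs):
    # Pre-processing example: reject empty chat requests before they reach vLLM.
    if isinstance(payload, dict) and not payload.get("messages"):
        return {"error": "Request must include a non-empty 'messages' list."}

    # Forward the request to the underlying vLLM server.
    response = await prt.models.server.serve(http_request=http_request, payload=payload, **kwargs)

    # Post-processing example: attach a custom marker to dict responses.
    if isinstance(response, dict):
        response["served_by"] = "custom-model-py"
    return response
```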
4. Conclusion
Practicus AI provides optimized pathways for hosting LLMs using engines like vLLM.
- Use the `prt.models.server` utility within notebooks for interactive experimentation.
- For runtime deployment, choose between dynamic model downloads (easy start) or baking models into images (recommended for production) via the Practicus AI console.
- Use a custom `model.py` only when specific pre/post-processing logic around the core LLM inference is required.
Remember to consult the specific documentation for vLLM options and Practicus AI deployment configurations for advanced settings.
Supplementary Files
`mock_llm_server.py`
```python
# A simple echo server you can use to test LLM functionality without GPUs
# Echoes what you request.
import argparse
import logging
import time
from typing import Optional

import uvicorn
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from pydantic import BaseModel

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()],
)
logger = logging.getLogger("mock_llm_server")

# Create FastAPI app
app = FastAPI(title="Mock LLM Server")


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 1.0
    max_tokens: Optional[int] = 100
    stream: Optional[bool] = False


@app.get("/health")
async def health_check():
    # For testing you might want to add a delay to simulate startup time
    # time.sleep(5)
    return {"status": "healthy"}


@app.get("/metrics")
async def metrics():
    # For testing you might want to add a delay to simulate startup time
    # time.sleep(5)
    return PlainTextResponse(
        """# HELP some_random_metric Some random metric that the mock server returns
# TYPE some_random_metric counter
some_random_metric 1.23"""
    )


@app.post("/chat/completions")
async def chat_completions(request: ChatRequest):
    # Log the request
    logger.info(f"Received chat request for model: {request.model}")

    # Extract the last message content
    last_message = request.messages[-1].content if request.messages else ""

    # Create a mock response
    response = {
        "id": "mock-response-id",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": f"This is a mock response for: {last_message}"},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": len(last_message.split()),
            "completion_tokens": 8,
            "total_tokens": len(last_message.split()) + 8,
        },
    }

    # Wait a bit to simulate processing time
    time.sleep(0.5)

    return response


if __name__ == "__main__":
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Mock LLM Server")
    parser.add_argument("--port", type=int, default=8585, help="Port to run the server on")
    args = parser.parse_args()

    logger.info(f"Starting mock LLM server on port {args.port}")

    # Run the server
    uvicorn.run(app, host="0.0.0.0", port=args.port)
```