
Deploy

Deploying a model with Practicus AI involves a sequence of steps designed to securely and efficiently transition a model from development to a production-ready state. Here's a step-by-step explanation, aimed at providing clarity and guidance:

Step 1: Importing Necessary Modules

import practicuscore as prt

Step 2: Setting Up Deployment Parameters

deployment_key = "deployment-gpu"
prefix = "models/practicus"
model_name = "codellama"
model_dir = None  # Current dir
Deployment Key: Identifies the deployment configuration, such as compute resources (e.g., GPU or CPU preferences). This key ensures your model is deployed with the appropriate infrastructure for its needs.
Prefix: A logical grouping or namespace for your models. This helps in organizing and managing models within Practicus AI.
Model Name: A unique identifier for your model within Practicus AI. This name is used to reference and manage the model post-deployment.
Model Directory: The directory containing your model's files. If None, it defaults to the current directory from which the deployment script is run.
Authentication: No credentials are passed in this script; the practicuscore SDK authenticates the deployment request through your existing Practicus AI session, for example the worker or notebook environment you are already signed in to.

Step 3: Deploying the Model

prt.models.deploy(
    deployment_key=deployment_key,
    prefix=prefix, 
    model_name=model_name, 
    model_dir=model_dir
)
Model Deployment: The call to prt.models.deploy() initiates the deployment process using the deployment key, prefix, model name, and model directory defined above.
Feedback: Upon successful deployment, you'll receive a confirmation. If authentication fails or another issue arises, an error message is returned to help you diagnose and resolve it.
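
Once deploy() returns successfully, the model is typically served behind a REST endpoint managed by Practicus AI. The snippet below is a minimal, hypothetical sketch of calling such an endpoint with the requests library; the host URL and token are placeholders you would replace with the values for your own Practicus AI region, and the {"count": ...} payload matches what the predict() function in model.py (below) expects.

import requests

# Hypothetical values - replace with your Practicus AI host and a valid model session token
api_url = "https://your-practicus-host/models/practicus/codellama/"
token = "YOUR_MODEL_SESSION_TOKEN"

headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
payload = {"count": 2}  # predict() in model.py reads the 'count' field

resp = requests.post(api_url, headers=headers, json=payload)
resp.raise_for_status()
print(resp.json()["answer"])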

Summary

This process encapsulates a secure and structured approach to model deployment in Practicus AI. By following these steps, you ensure that your model is deployed to the right environment with the appropriate configuration, ready for inference at scale. This systematic approach not only simplifies deployment but also emphasizes security and organization, critical factors for successful AI projects.

Supplementary Files

model.json

{
    "download_files_from": "cache/codellama-01/",
    "_comment": "you can also define download_files_to, otherwise /var/practicus/cache is used"
}
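
As the _comment field notes, a download_files_to entry can also be set to choose where the cached files are placed on the model host; if it is omitted, /var/practicus/cache is used. A hypothetical variant might look like this:

{
    "download_files_from": "cache/codellama-01/",
    "download_files_to": "/var/practicus/cache"
}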

model.py

import sys
from datetime import datetime

generator = None  # Lazily-built Llama generator, shared across requests
answers = ""  # Accumulates per-thread answers so the demo response can return them all


async def init(model_meta=None, *args, **kwargs):
    global generator
    if generator is not None:
        print("generator exists, using")
        return

    print("generator is none, building")

    # Assuming llama library is copied into cache dir, in addition to torch .pth files
    llama_cache = "/var/practicus/cache"
    if llama_cache not in sys.path:
        sys.path.insert(0, llama_cache)

    try:
        from llama import Llama
    except Exception as e:
        raise ModuleNotFoundError("llama library not found. Have you included it in the object storage cache?") from e

    try:
        generator = Llama.build(
            ckpt_dir=f"{llama_cache}/CodeLlama-7b-Instruct/",
            tokenizer_path=f"{llama_cache}/CodeLlama-7b-Instruct/tokenizer.model",
            max_seq_len=512,
            max_batch_size=4,
            model_parallel_size=1
        )
    except Exception:
        # Build failed; leave generator as None so the next request can retry
        raise


async def cleanup(model_meta=None, *args, **kwargs):
    print("Cleaning up memory")

    global generator
    generator = None

    from torch import cuda
    cuda.empty_cache()


def _predict(http_request=None, model_meta=None, payload_dict=None, *args, **kwargs):
    start = datetime.now()

    # instructions = [[
    #     {"role": "system", "content": payload_dict["system_context"]},
    #     {"role": "user", "content": payload_dict["user_prompt"]}
    # ]]
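    # NOTE: a fixed test prompt is used below for load testing; the commented-out block
    # above shows how the prompt could instead be built from payload_dict.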

    instructions = [[
        {"role": "system", "content": ""},
        {"role": "user", "content": "Capital of Turkey"}
    ]]

    results = generator.chat_completion(
        instructions,
        max_gen_len=None,
        temperature=0.2,
        top_p=0.95,
    )

    answer = ""
    for result in results:
        answer += f"{result['generation']['content']}\n"

    print("thread answer:", answer)
    total_time = (datetime.now() - start).total_seconds()
    print("thread answer in:", total_time)    

    global answers 
    answers += f"start:{start} end: {datetime.now()} time: {total_time} answer: {answer}\n"


async def predict(http_request, model_meta=None, payload_dict=None, *args, **kwargs):
    await init(model_meta)

    import threading 

    threads = []

    count = int(payload_dict["count"])
    thread_start = datetime.now()
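    # Launch 'count' threads, each running one chat completion, to exercise the model in parallel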
    for _ in range(count):
        thread = threading.Thread(target=_predict)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

    print("Total finished in:", (datetime.now() - thread_start).total_seconds())    

    return {
        "answer": f"Time:{(datetime.now() - thread_start).total_seconds()}\nanswers:{answers}"
    }
