
Consume LLM API

This tutorial demonstrates how to interact with a PracticusAI LLM deployment using the PracticusAI SDK. It uses ChatPracticus to invoke the model endpoint and practicuscore to manage API session tokens.
The workflow covers obtaining a session token, invoking the LLM API endpoint, processing the response, and optionally fanning several queries out in parallel, as sketched after the example call below.
from langchain_practicus import ChatPracticus
import practicuscore as prt

Defining parameters

This section defines key parameters for the notebook. Parameters control the behavior of the code, making it easy to customize without altering the logic. By centralizing parameters at the start, we ensure better readability, maintainability, and adaptability for different use cases.

api_url = None # E.g. "https://company.practicus.com/llm-models/llama-3b-chain-test/"
assert api_url, "Please enter your model api url."
The test_langchain_practicus function is defined to interact with the PracticusAI model endpoint. It uses the ChatPracticus object to invoke the API with the provided URL, token, and input data. The response is printed in two formats: a raw dictionary and its content.
def test_langchain_practicus(api_url, token, inputs):
    chat = ChatPracticus(
        endpoint_url=api_url,
        api_token=token,
        model_id="current models ignore this",
    )

    response = chat.invoke(input=inputs)

    print("\n\nReceived response:\n", dict(response))
    print("\n\nReceived Content:\n", response.content)
We retrieve an API session token using the PracticusAI SDK. This token is required to authenticate and interact with the PracticusAI deployment.
The call below creates a token that is valid for 4 hours; longer-lived tokens can be obtained from the admin console.
token = prt.models.get_session_token(api_url)
print("API session token:", token)
We invoke the test_langchain_practicus function with the API URL, session token, and an example query, 'What is the capital of England?'. The function sends the query to the PracticusAI endpoint and prints the received response.
test_langchain_practicus(api_url, token, ['What is the capital of England?'])
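
If you need to send several queries at once, a thread pool is a simple way to fan them out. The sketch below reuses the test_langchain_practicus function and the session token from above; the example queries and worker count are illustrative assumptions, not part of the deployment API.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical example queries; replace with your own.
queries = [
    ['What is the capital of England?'],
    ['What is the capital of France?'],
    ['What is the capital of Germany?'],
]

# Fan the requests out over a small thread pool, reusing the session token obtained above.
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(test_langchain_practicus, api_url, token, query)
        for query in queries
    ]
    for future in futures:
        future.result()  # Re-raises any exception from the worker thread.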

Supplementary Files

model.json

{
    "download_files_from": "cache/llama-1b-instruct/",
    "_comment": "You can also define download_files_to; otherwise, /var/practicus/cache is used."
}

model.py

import sys
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from practicuscore.gen_ai import PrtLangMessage, PrtLangRequest, PrtLangResponse
import json

generator = None

async def init(model_meta=None, *args, **kwargs):
    global generator
    if generator is not None:
        print("generator exists, using")
        return

    print("generator is none, building")
    model_cache = "/var/practicus/cache"
    if model_cache not in sys.path:
        sys.path.insert(0, model_cache)

    try:
        from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
    except Exception as e:
        raise RuntimeError(f"Failed to import required libraries: {e}") from e

    # Initialize the local LLM model using transformers:

    def load_local_llm(model_path):
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(model_path)
        model.to('cpu')  # Change 'cpu' to 'cuda' to run on a GPU, or pass device_map="auto" to from_pretrained.
        return pipeline('text-generation', model=model, tokenizer=tokenizer, max_new_tokens=200)

    try:
        generator = load_local_llm(model_cache)
    except Exception as e:
        print(f"Failed to build generator: {e}")
        raise

async def cleanup(model_meta=None, *args, **kwargs):
    print("Cleaning up memory")

    global generator
    generator = None

    from torch import cuda
    if cuda.is_available():
        cuda.empty_cache()

async def predict(payload_dict: dict, **kwargs):

    from practicuscore.gen_ai import PrtLangRequest, PrtLangResponse

    # The payload dictionary is validated against PrtLangRequest.
    practicus_llm_req = PrtLangRequest.model_validate(payload_dict)

    # Serialize the validated request to JSON, then parse it back into a plain dictionary.
    data_js = practicus_llm_req.model_dump_json(indent=2, exclude_unset=True)
    payload = json.loads(data_js)

    # Joins the content field from all messages in the payload to form the prompt string.
    prompt = " ".join([item['content'] for item in payload['messages']])

    # Generate a response from the model
    response = generator(prompt)
    answer = response[0]['generated_text']

    # Creates a PrtLangResponse object with the generated content and metadata about the language model and token usage
    resp = PrtLangResponse(
        content=answer,
        lang_model=payload['lang_model'],
        input_tokens=0,
        output_tokens=0,
        total_tokens=0,
        # additional_kwargs={
        #     "some_additional_info": "test 123",
        # },
    )

    return resp
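
To smoke-test predict locally, you can call it with a hand-built payload, as in the sketch below. Only the lang_model field and the content of each message are read by the code above; everything else (the role field, the model name) is an assumption about the PrtLangRequest schema, so adjust it if validation fails.

import asyncio

# Hypothetical payload for a local test; only "lang_model" and the message
# "content" values are actually consumed by predict() above.
example_payload = {
    "lang_model": "llama-1b-instruct",
    "messages": [
        {"role": "user", "content": "What is the capital of England?"},
    ],
}

async def local_test():
    await init()  # Builds the text-generation pipeline from the local cache.
    response = await predict(example_payload)
    print(response.content)

asyncio.run(local_test())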

Previous: Deploy | Next: Combined Method > Model