
Consume LLM API With ChatPracticus

This tutorial demonstrates how to make predictions against a PracticusAI LLM deployment using the PracticusAI SDK. It uses ChatPracticus to invoke the model endpoint and practicuscore to manage API tokens.
The workflow covers obtaining a session token, invoking the LLM API endpoint, and processing the responses.
from langchain_practicus import ChatPracticus
import practicuscore as prt

Defining parameters.

This section defines key parameters for the notebook. Parameters control the behavior of the code, making it easy to customize without altering the logic. By centralizing parameters at the start, we ensure better readability, maintainability, and adaptability for different use cases.

api_url = None # Model API e.g. "https://company.practicus.com/llm-models/llama-3b-chain-test/"
assert api_url, "Please enter your model api url."
The test_langchain_practicus function is defined to interact with the PracticusAI model endpoint. It uses the ChatPracticus object to invoke the API with the provided URL, token, and input data. The response is printed in two formats: a raw dictionary and its content.
def test_langchain_practicus(api_url, token, inputs):
    chat = ChatPracticus(
        endpoint_url=api_url,
        api_token=token,
        model_id="current models ignore this",
    )

    response = chat.invoke(input=inputs)

    print("\n\nReceived response:\n", dict(response))
    print("\n\nReceived Content:\n", response.content)
We retrieve an API session token using the PracticusAI SDK. This token is required to authenticate and interact with the PracticusAI deployment.
The call below creates a token that is valid for 4 hours; longer-lived tokens can be obtained from the admin console.
token = prt.models.get_session_token(api_url)
print("API session token:", token)
We invoke the test_langchain_practicus function with the API URL, session token, and an example query, 'What is the capital of England?'. The function sends the query to the PracticusAI endpoint and prints the received response.
test_langchain_practicus(api_url, token, ['What is the capital of England?'])
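If ChatPracticus follows the standard LangChain chat-model interface (as the invoke usage above suggests), you can also pass structured message objects instead of plain strings. The snippet below is a minimal sketch and assumes langchain_core is available as a LangChain dependency; the system prompt text is only an illustration.

from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="What is the capital of England?"),
]
test_langchain_practicus(api_url, token, messages)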

Consume LLM API With basic HTTP requests

Use the PracticusAI SDK to generate a session token, ensuring secure access to the LLM API.
import practicuscore as prt

# We will be using the SDK to get a session token.
api_url = None # Model API e.g. "https://company.practicus.com/llm-models/llama-3b-chain-test/"
token = prt.models.get_session_token(api_url)
print("API session token:", token)
Send a GET request with the session token to check if the model and its API are active and ready for use.
from requests import get

headers = {'authorization': f'Bearer {token}'}
r = get(api_url + '?get_meta=true', headers=headers)

print('Model details: ', r.text)
if r.status_code != 200:
    print(f"Error code {r.status_code}")
Interact with the LLM API to retrieve a response, measure performance, and analyze the results.
from requests import get
import json

# Provide a user prompt to the LLM API and retrieve the generated response.
data = {
    #'system_context': '',
    'user_prompt': "Who is Nikola Tesla?"
    }
r = get(api_url, headers=headers, json=data)

if r.status_code != 200:
    print(f"Error code {r.status_code}")

# Print API response for generated prediction
print('Prediction result:')
try:
    parsed = json.loads(r.text)
    print(json.dumps(parsed, indent=1))
except json.JSONDecodeError:
    # Fall back to the raw text if the response is not valid JSON.
    print(r.text)

# Examine response headers for debugging or additional metadata about the request.
print("Headers: ", r.headers)

Supplementary Files

model.json

{
"download_files_from": "cache/llama-1b-instruct/",
"_comment": "you can also define download_files_to otherwise, /var/practicus/cache is used"
}
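
As the comment notes, the download target can also be set explicitly. The variant below is only an illustrative sketch; the source path is the same placeholder used above and the target shown is simply the documented default.

{
  "download_files_from": "cache/llama-1b-instruct/",
  "download_files_to": "/var/practicus/cache"
}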

model.py

import sys
import json
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from practicuscore.gen_ai import PrtLangMessage, PrtLangRequest, PrtLangResponse

generator = None
answers = ""


async def init(model_meta=None, *args, **kwargs):
    global generator
    if generator is not None:
        print("generator exists, using")
        return

    print("generator is none, building")
    model_cache = "/var/practicus/cache"
    if model_cache not in sys.path:
        sys.path.insert(0, model_cache)

    try:
        from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
    except Exception as e:
        raise print(f"Failed to import required libraries: {e}")

    # Initialize the local LLM model using transformers:

    def load_local_llm(model_path):
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(model_path)
        model.to('cpu')  # Change 'cpu' to 'cuda' to run on a GPU (or load with device_map="auto").
        return pipeline('text-generation', model=model, tokenizer=tokenizer, max_new_tokens=200)

    try:
        generator = load_local_llm(model_cache)
    except Exception as e:
        print(f"Failed to build generator: {e}")
        raise


async def cleanup(model_meta=None, *args, **kwargs):
    print("Cleaning up memory")

    global generator
    generator = None

    from torch import cuda
    cuda.empty_cache()

async def predict(payload_dict: dict, **kwargs):

    # For basic text-in, text-out task:
    if "user_prompt" in payload_dict:

        # Recording the start time to measure execution duration.
        start = datetime.now()

        # Extracting the given prompt from the HTTP request.
        sentence = payload_dict["user_prompt"]

        # Passing the prompt to `generator`, the loaded LLM pipeline, to generate a response.
        res = generator(sentence)
        text = res[0]['generated_text']

        # Returning a structured response containing the generated text and execution time.
        total_time = (datetime.now() - start).total_seconds()   
        return {
            'answer': f'Time:{total_time}\nanswer:{text}'
        }

    # For langchain applications:
    else: 

        from practicuscore.gen_ai import PrtLangRequest, PrtLangResponse

        # The payload dictionary is validated against PrtLangRequest.
        practicus_llm_req = PrtLangRequest.model_validate(payload_dict)

        # Serializes the validated request to JSON and loads it back as a dictionary.
        data_js = practicus_llm_req.model_dump_json(indent=2, exclude_unset=True)
        payload = json.loads(data_js)

        # Joins the content field from all messages in the payload to form the prompt string.
        prompt = " ".join([item['content'] for item in payload['messages']])

        # Generate a response from the model
        response = generator(prompt)
        answer = response[0]['generated_text']

        # Creates a PrtLangResponse object with the generated content and metadata about the language model and token usage
        resp = PrtLangResponse(
            content=answer,
            lang_model=payload['lang_model'],
            input_tokens=0,
            output_tokens=0,
            total_tokens=0,
            # additional_kwargs={
            #     "some_additional_info": "test 123",
            # },
        )

        return resp
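
Before deploying, you may want to smoke-test the hooks locally. The snippet below is an illustrative sketch, not part of the deployed model.py; it assumes the model files are already available under /var/practicus/cache (or a path you adjust in init) and simply runs init, predict, and cleanup with a basic user_prompt payload.

import asyncio

async def _smoke_test():
    await init()
    result = await predict({"user_prompt": "Who is Nikola Tesla?"})
    print(result)
    await cleanup()

if __name__ == "__main__":
    asyncio.run(_smoke_test())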
