
CV Assistant

First, we need to install transformers and torch.

Run the following in a terminal:

pip install transformers sentence-transformers langchain langchain-community langchain-milvus chromadb
pip install torch --index-url https://download.pytorch.org/whl/cpu

pip install transformers sentence-transformers langchain langchain-community chromadb

  • Transformers: Allows you to easily use Transformer-based models (such as BERT, GPT, etc.).

  • Sentence-Transformers: It produces vector representations of sentences using Transformer models.

  • LangChain: It is used to manage more complex workflows with language models.

  • Langchain-Community: Contains additional modules and components developed by the community for the LangChain library.

  • ChromaDB: Used as a vector database. It is optimized for embeddings and similarity searches.

  • PyPDF: A library used to process PDF files with Python.

pip install torch --index-url https://download.pytorch.org/whl/cpu

This command is used to install the PyTorch library with CPU support.

Details:

  • Torch (PyTorch): A library used for developing machine learning and deep learning models. It offers features such as tensor computations, automatic differentiation, and advanced modeling.
  • --index-url https://download.pytorch.org/whl/cpu: This parameter points pip at a dedicated index URL so that the CPU-only build of PyTorch is downloaded. Use it if you do not need a GPU-specific version.
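To verify the CPU-only installation, a quick sanity check (a minimal sketch; the exact version string will vary by release):

import torch

print(torch.__version__)          # CPU-only wheels typically report a version ending in '+cpu'
print(torch.cuda.is_available())  # Expected: False for the CPU-only build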
# Prepare data

import os
import requests

repo_owner = "practicusai"
repo_name = "sample-data"
file_path = "hr_assistant"
branch = "main"


url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/contents/{file_path}?ref={branch}"


response = requests.get(url)
if response.status_code == 200:
    files = response.json()  # List of file metadata returned by the GitHub API

    for file in files:
        file_url = file['download_url']  # Direct download URL for the file
        file_name = file['name']  # Name to save the file under locally

        file_response = requests.get(file_url)
        if file_response.status_code == 200:
            # Write the raw bytes to a local file
            with open(file_name, 'wb') as f:
                f.write(file_response.content)
            print(f"'{file_name}' successfully downloaded.")
        else:
            print(f"'{file_name}' could not be downloaded.")
else:
    print(f"HTTP status: {response.status_code}")

Import Libraries

from langchain.prompts import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma

from langchain_practicus import ChatPracticus

import chromadb
import random
import pandas as pd

import warnings
warnings.filterwarnings('ignore')
  1. Server URL Validation: Defines the host URL where the application will connect. If host is not provided, an error is raised.

  2. Embedding Model Path Validation: Validates the path of the embedding model used to generate embeddings. Raises an error if the path is not specified.

  3. Vector Store Selection: The user must specify one of two options: ChromaDB or MilvusDB. An error is raised if an invalid value is provided.

  4. Milvus Connection URI: If MilvusDB is selected, a connection URI for Milvus must be provided. An error is raised if milvus_uri is not defined.

This setup ensures that all necessary configurations are in place for the system to initialize and function correctly.

host = None # Example url -> 'company.practicus.com'
assert host, "Please enter your host URL."

embedding_model_path = None
assert embedding_model_path, "Please enter your embedding model path."

# You can use one of ChromaDB or MilvusDB as vector store
vector_store = None
assert vector_store in ['ChromaDB', 'MilvusDB'], "Vector store must be 'ChromaDB' or 'MilvusDB'."

if vector_store == 'MilvusDB':
    milvus_uri = None # Milvus connection url, E.g. 'company.practicus.milvus.com'
    assert milvus_uri, "Please enter your Milvus connection URI."
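For reference, a filled-in configuration could look like the sketch below. All values are placeholders: the host and Milvus URI follow the example formats from the comments above, and the embedding model path is simply a commonly used sentence-transformers model chosen for illustration.

host = 'company.practicus.com'  # Placeholder host
embedding_model_path = 'sentence-transformers/all-MiniLM-L6-v2'  # Example embedding model (an assumption, not a requirement)
vector_store = 'ChromaDB'  # Or 'MilvusDB'
# milvus_uri = 'company.practicus.milvus.com'  # Only required when vector_store == 'MilvusDB'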
## Define the LLM API function and call ChatPracticus inside it

Function: call_llm_api

Description

This function interacts with the ChatPracticus API to invoke a response from a language model using the provided inputs, API URL, and API token. The function is designed to send data to the API and retrieve the response content.

Parameters

  • inputs:
    The input query or data to be sent to the API for processing. This is typically a string or JSON object depending on the API's requirements.

  • api_url:
    The endpoint URL of the ChatPracticus API. This is the location where the API call will be directed.

  • api_token:
    The authentication token for accessing the ChatPracticus API. This ensures secure communication and proper authorization.

Workflow

  • The function initializes a ChatPracticus object with the specified API URL and token. The model_id parameter is currently unused but can be included for future model-specific configurations.

  • The invoke method of the ChatPracticus object is called with the given input. This sends the query to the API and retrieves the response.

  • The function returns the content attribute of the response, which contains the text generated by the language model.

def call_llm_api(inputs, api_url, api_token):
    """
    Send the given input to the ChatPracticus API using our 'api_token' and 'api_url', and return the response.

    :param inputs: The input to be sent to the API.
    :param api_url: The endpoint URL of the ChatPracticus API.
    :param api_token: The API token of our model.
    """

    chat = ChatPracticus(
        endpoint_url=api_url,
        api_token=api_token,
        model_id="current models ignore this",
    )

    response = chat.invoke(input=inputs)

    return response.content
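As a quick usage sketch: once api_url and token have been created (see the "Chat Examples" section below), the function can be called directly with a plain string.

# Assumes 'api_url' and 'token' are defined as shown later in this guide.
sample_reply = call_llm_api("Say hello in one short sentence.", api_url, token)
print(sample_reply)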

Get all resumes and use a separator to split them

  1. df = pd.read_csv("HR.csv"): Reads the CSV file HR.csv into a pandas DataFrame called df. The DataFrame should contain a column named Resume_str.

  2. merged_resumes = '': Initializes an empty string, merged_resumes, which will store the concatenated resume data.

  3. For loop: Iterates over each resume string in the Resume_str column of the DataFrame and appends it to merged_resumes, preceded by the delimiter //m-n-m//.

Final Output

  • The variable merged_resumes contains all resumes concatenated into a single string, with //m-n-m// acting as a separator between each resume.
df = pd.read_csv("HR.csv")
merged_resumes = ''
for resume in df['Resume_str']:
    merged_resumes = merged_resumes + '//m-n-m//' + resume
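Since string concatenation in a loop copies the accumulated string on every iteration, str.join builds the same result in a single pass. A minimal equivalent one-liner, should you prefer it:

# Same output as the loop above: every resume prefixed by the '//m-n-m//' delimiter
merged_resumes = ''.join('//m-n-m//' + resume for resume in df['Resume_str'])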

Description

This function processes a concatenated string of resumes, splits them into individual documents based on a delimiter, and further divides these documents into smaller chunks for analysis. The function utilizes a CharacterTextSplitter to handle the chunking process.

  1. Split the Resumes: The input merged_resumes is split into individual resume strings using the //m-n-m// delimiter.

  2. Create Document Objects: Each resume is transformed into a Document object. Empty or whitespace-only resumes are excluded.

  3. Initialize the Text Splitter: A CharacterTextSplitter is set up with the following configuration:

    • separator="//m-n-m//": The delimiter used for splitting.
    • chunk_size: Controls the maximum size of each text chunk.
    • chunk_overlap: Adds overlapping text between chunks for better context retention.

  4. Split Documents: The documents are further divided into smaller chunks using the CharacterTextSplitter.

  5. Aggregate Results: The chunks are appended to the all_docs list, which is returned as the final output.

Returns

  • all_docs:
    A list of smaller text chunks, each represented as a document object, ready for further processing.
def load_and_split_resumes(merged_resumes, chunk_size=500, chunk_overlap=50):
    """
    Load and split resume strings into chunks.

    :param merged_resumes: A single string containing all resume contents, separated by '//m-n-m//'.
    :param chunk_size: The maximum number of characters in each text chunk.
    :param chunk_overlap: The number of characters to overlap between consecutive chunks.
    """
    all_docs = []
    text_splitter = CharacterTextSplitter(
        separator="//m-n-m//",  # Defines the separator used to split the text.
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False
    )

    # Split resumes
    resumes = merged_resumes.split('//m-n-m//')

    # Transform to Document
    documents = [Document(page_content=resume.strip()) for resume in resumes if resume.strip()]

    # Split docs
    split_docs = text_splitter.split_documents(documents) 
    all_docs.extend(split_docs)

    return all_docs
  1. Function Call: load_and_split_resumes(merged_resumes):

    • The merged_resumes string, containing all resumes concatenated and separated by //m-n-m//, is passed to the load_and_split_resumes function.
    • This function splits the resumes into individual documents and further chunks each document into smaller pieces based on the defined chunk_size and chunk_overlap parameters.

  2. Output: The result of the function is assigned to the variable text_chunks, a list of chunked text segments ready for downstream processing or analysis.
# Create our chunks
text_chunks = load_and_split_resumes(merged_resumes)
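To sanity-check the result, you can inspect how many chunks were produced and preview one of them (a minimal sketch; the output depends on your data):

print(f"Number of chunks: {len(text_chunks)}")
print(text_chunks[0].page_content[:200])  # Preview the first 200 characters of the first chunk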

Create the vector store from the given embedding model path

  1. Condition Check: if vector_store == 'ChromaDB': ensures that the code block is executed only if ChromaDB is selected as the vector store.

  2. create_chroma_vector_store Function: Generates embeddings for the provided text chunks and creates a Chroma vector store.

  3. Assign Retriever: retriever_resumes = create_chroma_vector_store(text_chunks, embedding_model_path) calls the function with text_chunks and the embedding model path to generate the retriever.
if vector_store == 'ChromaDB':
    # Generate embeddings and create vector store
    def create_chroma_vector_store(chunks, embeddings_model_path):
        embeddings = HuggingFaceEmbeddings( # This class is used to generate embeddings for the text chunks.
            model_name=embeddings_model_path, # Specifies the path to the pre-trained embeddings model used for generating embeddings.
            model_kwargs={'device': 'cpu'}, # Configuration for running model on cpu.
            encode_kwargs={'normalize_embeddings': False})

        db_name = str(random.random())
        vectorstore_resumes = Chroma.from_documents(collection_name=db_name, documents=chunks, embedding=embeddings) # This method creates a vector store from the provided documents (chunks) and embeddings.

        # 'search_type' parameter defines the method to use while searching the most relevant documents to the prompt and 'k' parameter defines the number of documents which will include within context
        retriever_resumes = vectorstore_resumes.as_retriever(search_type="similarity", search_kwargs={"k": 2}) 
        return retriever_resumes

    retriever_resumes = create_chroma_vector_store(text_chunks, embedding_model_path)
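A quick way to verify the retriever works is to fetch the top matches for a sample query before wiring it into the full chain (a minimal sketch; the sample question is arbitrary):

# Retrieve the 2 most similar chunks for an example question
sample_docs = retriever_resumes.get_relevant_documents("experience managing HR teams")
for doc in sample_docs:
    print(doc.page_content[:100], "...")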

(OPTIONAL) Milvus Vector DB

  1. Condition Check: if vector_store == 'MilvusDB': ensures that the Milvus-based vector store logic is executed only when MilvusDB is selected.

  2. create_milvus_vector_store Function: Generates embeddings for the provided text chunks and stores them in a Milvus vector database.

  3. Inputs:

    • chunks: The text chunks to be embedded and stored.
    • embeddings_model_path: Path to the pre-trained embeddings model.

  4. Steps:

    • Generate Embeddings: HuggingFaceEmbeddings generates embeddings for the text chunks, with parameters:

      • model_name: Path to the pre-trained embeddings model.
      • model_kwargs: Specifies the device to run the model ('cpu' in this case).
      • encode_kwargs: Optional configuration for encoding (e.g., normalize_embeddings).

    • Connect to Milvus: connections.connect establishes a connection to the Milvus server, with parameters:

      • host: The URI of the Milvus server.
      • port: Default port for Milvus, 19530.

    • Create Vector Store: Milvus.from_documents creates a vector store with:

      • documents: The input text chunks.
      • embedding: The generated embeddings.
      • collection_name: The name of the Milvus collection.
      • connection_args: The connection settings for the Milvus server.
      • drop_old: Drops an existing collection with the same name, if any.
from langchain_milvus import Milvus
from pymilvus import connections

if vector_store == 'MilvusDB':
    def create_milvus_vector_store(chunks, embeddings_model_path):
        embeddings = HuggingFaceEmbeddings( # This class is used to generate embeddings for the text chunks.
            model_name=embeddings_model_path, # Specifies the path to the pre-trained embeddings model used for generating embeddings.
            model_kwargs={'device': 'cpu'}, # Configuration for running model on cpu.
            encode_kwargs={'normalize_embeddings': False})

        connections.connect("default", host=milvus_uri, port="19530")

        vectorstore = Milvus.from_documents(
            documents=chunks,
            embedding=embeddings,
            collection_name="langchain_example",
            connection_args={
                "uri": f"https://{milvus_uri}:19530"
            },
            drop_old=True,  # Drop the old Milvus collection if it exists
        ) 

        # 'search_type' parameter defines the method to use while searching the most relevant documents to the prompt and 'k' parameter defines the number of documents which will include within context
        retriever_resumes = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2}) 
        return retriever_resumes

    retriever_resumes = create_milvus_vector_store(text_chunks, embedding_model_path)

Define format_docs to join all chunks

Description

This function processes a list of document objects, extracts their page_content attributes, and concatenates all the document contents into a single string. Each document's content is separated by two newline characters (\n\n) to enhance readability.

  1. Extract Document Content: Iterates over the list docs and retrieves the page_content attribute from each document object.

  2. Join Content: Combines all extracted content into a single string, with each document's content separated by two newline characters (\n\n).

  3. Return Result: The resulting string is returned for use in other parts of the application.
def format_docs(docs):
     # Retrieves the content of each document in the `docs` list and joins the content of all documents into a single string, with each document's content separated by two newline characters.
     return "\n\n".join(doc.page_content for doc in docs) 

This function ties all the previous steps together

Description

This function retrieves relevant documents for a given question using a retriever, formats a prompt with the documents and question, and queries a language model API to generate an answer.

  1. Define Prompt Template: A PromptTemplate object is created to format the input for the language model. The template:

    • Includes the retrieved context (relevant documents).
    • Contains the question.
    • Instructs the model to respond with "I don't know" if the answer is unavailable.

  2. Retrieve Relevant Documents: Calls retriever.get_relevant_documents(question) to fetch the documents most relevant to the provided question.

  3. Format the Retrieved Documents: The retrieved documents are passed to format_docs to be concatenated into a single string, separated by two newline characters.

  4. Construct the Prompt: The PromptTemplate is used to format the prompt with the retrieved context and the question.

  5. Query the API: Calls the call_llm_api function with the formatted prompt, api_url, and api_token to query the language model.

  6. Extract and Return the Answer: Extracts the answer from the API response by splitting the output on the keyword Answer: and removing extra whitespace.
def query_resume(retriever, question, api_url, api_token):
    prompt_template = PromptTemplate( # Defines a template for the prompt sent to the LLM.
        input_variables=["context", "question"],
        template=( # The format of the prompt.
            "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. "
            "If you don't know the answer, just say that you don't know.\n"
            "Question: {question}\nContext: {context}\nAnswer:"
        )
    )

    docs = retriever.get_relevant_documents(question) # Uses the retriever to get relevant documents based on the question.
    context = format_docs(docs) # Formats the retrieved documents

    prompt = prompt_template.format(context=context, question=question) # Formats the prompt

    answer = call_llm_api(prompt, api_url, api_token)

    return answer.strip().split('Answer:')[-1].strip()

Chat Examples

Code Explanation

This code snippet imports the practicuscore module and retrieves the current region using the get_region function.

Workflow

  1. Import the Module: import practicuscore as prt imports the practicuscore library, aliased as prt for convenience. This module provides utility functions for interacting with Practicus-related resources.

  2. Get the Region: region = prt.get_region() calls the get_region function from the practicuscore module, which retrieves information about the current region; this may be used for region-specific configurations or operations.
import practicuscore as prt

region = prt.get_region()

Workflow

  1. Retrieve Model List: my_model_list = region.model_list accesses the model_list attribute of the region object, which contains the list of models available in the current region.

  2. Display Model List: display(my_model_list.to_pandas()) converts the model list to a pandas DataFrame using .to_pandas() for better readability and displays it in tabular form.

  3. Select the First Model: model_name = my_model_list[0].name selects the first model in the list and extracts its name using the .name attribute.

  4. Print the Selected Model: print("Using first model name:", model_name) outputs the selected model name for verification.
# Let's list our models and select one of them.
my_model_list = region.model_list
display(my_model_list.to_pandas())

# We will select the first model.
model_name = my_model_list[0].name
print("Using first model name:", model_name)

Workflow

  1. Retrieve Model Prefix List: my_model_prefixes = region.model_prefix_list accesses the model_prefix_list attribute of the region object, which contains the list of model prefixes available in the current region.

  2. Display Model Prefixes: display(my_model_prefixes.to_pandas()) converts the prefix list to a pandas DataFrame using .to_pandas() for better readability and displays it in tabular form.

  3. Select the First Prefix: model_prefix = my_model_prefixes[0].key selects the first prefix in the list and extracts its key using the .key attribute.

  4. Print the Selected Prefix: print("Using first prefix:", model_prefix) outputs the selected prefix key for verification.
# Let's list our model prefixes and select one of them.
my_model_prefixes = region.model_prefix_list
display(my_model_prefixes.to_pandas())

# We will select the first prefix.
model_prefix = my_model_prefixes[0].key
print("Using first prefix:", model_prefix)
  1. Construct the API URL: api_url = f"https://{host}/{model_prefix}/{model_name}/" uses string formatting to dynamically build the API URL from:

    • host: The base URL or hostname of the server.
    • model_prefix: The selected model prefix.
    • model_name: The selected model name.

    The resulting URL specifies the endpoint for accessing the model.

  2. Generate Session Token: token = prt.models.get_session_token(api_url=api_url) calls the get_session_token function from the practicuscore.models module; the api_url parameter specifies the endpoint for which the token is generated, ensuring secure and authorized communication with the API.
# Create the API URL and session token
api_url = f"https://{host}/{model_prefix}/{model_name}/"
token = prt.models.get_session_token(api_url=api_url)

Create a query and print the answer

# Example query
answer = query_resume(retriever=retriever_resumes, question="What are the leadership qualities of an HR Director?", api_url=api_url, api_token=token)

# Print the answer
print(answer)
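query_resume can be reused for any question about the resumes; for instance (sample questions; answers depend on your data and model):

for q in [
    "Which candidates have experience with payroll systems?",
    "Summarize the typical responsibilities of an HR Director.",
]:
    print(q)
    print(query_resume(retriever=retriever_resumes, question=q, api_url=api_url, api_token=token))
    print("-" * 40)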
