# CV Assistant

First, we need to install `transformers`, `torch`, and the other supporting libraries.

Run in a terminal:

```bash
pip install transformers sentence-transformers langchain langchain-community langchain-milvus chromadb
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
- `transformers`: Lets you easily use Transformer-based models (such as BERT, GPT, etc.).
- `sentence-transformers`: Produces vector representations (embeddings) of sentences using Transformer models.
- `langchain`: Used to manage more complex workflows with language models.
- `langchain-community`: Contains additional modules and components developed by the community for the LangChain library.
- `chromadb`: Used as a vector database. It is optimized for embeddings and similarity searches.
- `pypdf`: A library used to process PDF files with Python.
```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

This command installs the PyTorch library with CPU support.

Details:

- `torch` (PyTorch): A library used for developing machine learning and deep learning models. It offers features such as tensor computations, automatic differentiation, and advanced modeling.
- `--index-url https://download.pytorch.org/whl/cpu`: Points `pip` at the index that hosts CPU-only builds of PyTorch. Use this URL if you do not want to install a GPU-specific version.
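To confirm the CPU-only build installed correctly, you can run a quick check (a minimal sketch; the exact version string will vary):

```python
import torch

print(torch.__version__)          # e.g. '2.x.x+cpu' for the CPU-only build
print(torch.cuda.is_available())  # False is expected on a CPU-only install
```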
## Prepare data

```python
import requests

repo_owner = "practicusai"
repo_name = "sample-data"
file_path = "hr_assistant"
branch = "main"

# GitHub API endpoint that lists the files under the given repository path
url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/contents/{file_path}?ref={branch}"

response = requests.get(url)

if response.status_code == 200:
    files = response.json()
    for file in files:
        file_url = file['download_url']
        file_name = file['name']
        file_response = requests.get(file_url)
        if file_response.status_code == 200:
            with open(file_name, 'wb') as f:
                f.write(file_response.content)
            print(f"'{file_name}' successfully downloaded.")
        else:
            print(f"'{file_name}' could not be downloaded.")
else:
    print(f"HTTP status: {response.status_code}")
```
## Import Libraries

```python
from langchain.prompts import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain_practicus import ChatPracticus
import chromadb
import random
import pandas as pd

import warnings
warnings.filterwarnings('ignore')
```
- **Server URL Validation**:
    - Defines the `host` URL where the application will connect.
    - If `host` is not provided, an error is raised.
- **Embedding Model Path Validation**:
    - Validates the path of the embedding model used to generate embeddings.
    - Raises an error if the path is not specified.
- **Vector Store Selection**:
    - The user must specify one of two options: `ChromaDB` or `MilvusDB`.
    - An error is raised if an invalid value is provided.
- **Milvus Connection URI**:
    - If `MilvusDB` is selected, a connection URI for Milvus must be provided.
    - An error is raised if `milvus_uri` is not defined.

This setup ensures that all necessary configurations are in place for the system to initialize and function correctly.
```python
host = None  # Example URL -> 'company.practicus.com'
assert host, "Please enter your host url."

embedding_model_path = None
assert embedding_model_path, "Please enter your embedding model path."

# You can use either ChromaDB or MilvusDB as the vector store
vector_store = None
assert vector_store in ['ChromaDB', 'MilvusDB'], "Vector store must be 'ChromaDB' or 'MilvusDB'."

if vector_store == 'MilvusDB':
    milvus_uri = None  # Milvus connection URL, e.g. 'company.practicus.milvus.com'
    assert milvus_uri, "Please enter your milvus connection uri."
```
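For reference, a filled-in configuration might look like the sketch below. All values are placeholders, not defaults; substitute your own deployment details:

```python
# Hypothetical example values -- replace with your own deployment details.
host = 'company.practicus.com'
embedding_model_path = 'sentence-transformers/all-MiniLM-L6-v2'  # any sentence-transformers model path works
vector_store = 'ChromaDB'  # or 'MilvusDB', in which case milvus_uri must also be set
```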
## Define the LLM API function and call ChatPracticus within it
Function: `call_llm_api`

Description

This function interacts with the ChatPracticus API to invoke a response from a language model using the provided inputs, API URL, and API token. It is designed to send data to the API and retrieve the response content.

Parameters

- `inputs`: The input query or data to be sent to the API for processing. This is typically a string or JSON object, depending on the API's requirements.
- `api_url`: The endpoint URL of the ChatPracticus API. This is the location where the API call will be directed.
- `api_token`: The authentication token for accessing the ChatPracticus API. This ensures secure communication and proper authorization.

Behavior

- The function initializes a `ChatPracticus` object with the specified API URL and token. The `model_id` parameter is currently unused or ignored, but can be included for future model-specific configurations.
- The `invoke` method of the `ChatPracticus` object is called with the given input. This sends the query to the API and retrieves the response.
- The function returns the `content` attribute of the response, which contains the text generated by the language model.
```python
def call_llm_api(inputs, api_url, api_token):
    """
    Send 'inputs' to the ChatPracticus endpoint and return the response text.

    :param inputs: The input to be sent to the API.
    :param api_url: The endpoint URL of the ChatPracticus API.
    :param api_token: Token of our model.
    """
    chat = ChatPracticus(
        endpoint_url=api_url,
        api_token=api_token,
        model_id="current models ignore this",
    )

    response = chat.invoke(input=inputs)
    return response.content
```
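As a quick sanity check, the function can be called directly once `api_url` and `token` have been created later in this guide (the question text here is purely illustrative):

```python
# Illustrative call; 'api_url' and 'token' are defined in the 'Chat Examples' section below.
reply = call_llm_api(
    inputs="Summarize the key skills of an HR Director.",
    api_url=api_url,
    api_token=token,
)
print(reply)
```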
## Get all resumes and join them with a separator for later splitting
- `df = pd.read_csv("HR.csv")`:
    - Reads the CSV file `HR.csv` into a pandas DataFrame called `df`.
    - The DataFrame should contain a column named `Resume_str`.
- `merged_resumes = ''`:
    - Initializes an empty string `merged_resumes`, which will store the concatenated resume data.
- **For Loop**:
    - Iterates over each resume string in the `Resume_str` column of the DataFrame.
    - Appends each resume to the `merged_resumes` string, preceded by the delimiter `//m-n-m//`.

Final Output

- The variable `merged_resumes` contains all resumes concatenated into a single string, with `//m-n-m//` acting as a separator between each resume.
```python
df = pd.read_csv("HR.csv")

merged_resumes = ''
for resume in df['Resume_str']:
    merged_resumes = merged_resumes + '//m-n-m//' + resume
```
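A quick check of the result shows the delimiter preceding each resume (the output shown is illustrative; the actual text depends on `HR.csv`):

```python
# The merged string starts with the delimiter, followed by the first resume.
print(merged_resumes[:40])  # e.g. '//m-n-m//<first resume text>...'
```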
Description

This function processes a concatenated string of resumes, splits it into individual documents based on a delimiter, and further divides these documents into smaller chunks for analysis. The function uses a `CharacterTextSplitter` to handle the chunking process.

- **Split the Resumes**:
    - The input `merged_resumes` is split into individual resume strings using the `//m-n-m//` delimiter.
- **Create Document Objects**:
    - Each resume is transformed into a `Document` object. Empty or whitespace-only resumes are excluded.
- **Initialize the Text Splitter**:
    - A `CharacterTextSplitter` is set up with the following configuration:
        - `separator="//m-n-m//"`: The delimiter used for splitting.
        - `chunk_size`: Controls the maximum size of each text chunk.
        - `chunk_overlap`: Adds overlapping text between chunks for better context retention.
- **Split Documents**:
    - The documents are further divided into smaller chunks using the `CharacterTextSplitter`.
- **Aggregate Results**:
    - The chunks are appended to the `all_docs` list, which is returned as the final output.

Returns

- `all_docs`: A list of smaller text chunks, each represented as a `Document` object, ready for further processing.
```python
def load_and_split_resumes(merged_resumes, chunk_size=500, chunk_overlap=50):
    """
    Load and split resume strings into chunks.

    :param merged_resumes: A single string containing all resume contents, separated by '//m-n-m//'.
    :param chunk_size: The maximum number of characters in each text chunk.
    :param chunk_overlap: The number of characters to overlap between consecutive chunks.
    """
    all_docs = []

    text_splitter = CharacterTextSplitter(
        separator="//m-n-m//",  # Defines the separator used to split the text.
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False
    )

    # Split the merged string back into individual resumes
    resumes = merged_resumes.split('//m-n-m//')

    # Transform each non-empty resume into a Document object
    documents = [Document(page_content=resume.strip()) for resume in resumes if resume.strip()]

    # Split the documents into smaller chunks
    split_docs = text_splitter.split_documents(documents)
    all_docs.extend(split_docs)

    return all_docs
```
- **Function Call**:
    - `load_and_split_resumes(merged_resumes)`:
        - The `merged_resumes` string, containing all resumes concatenated and separated by `//m-n-m//`, is passed to the `load_and_split_resumes` function.
        - This function:
            - Splits the resumes into individual documents.
            - Further chunks each document into smaller pieces based on the defined `chunk_size` and `chunk_overlap` parameters.
- **Output**:
    - The result of the function is assigned to the variable `text_chunks`, a list of chunked text segments ready for downstream processing or analysis, as shown below.
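The call itself is a single line, reconstructed here to match the description above:

```python
# Split the merged resume string into chunked Document objects
text_chunks = load_and_split_resumes(merged_resumes)
```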
## Create the vector store with the given embedding model path
- **Condition Check**:
    - `if vector_store == 'ChromaDB':`
        - Ensures that the code block is executed only if `ChromaDB` is selected as the vector store.
- **`create_chroma_vector_store` Function**:
    - This function generates embeddings for the provided text chunks and creates a Chroma vector store.
- **Assign Retriever**:
    - `retriever_resumes = create_chroma_vector_store(text_chunks, embedding_model_path)`:
        - Calls the `create_chroma_vector_store` function with `text_chunks` and the embedding model path to generate the retriever.
```python
if vector_store == 'ChromaDB':
    # Generate embeddings and create vector store
    def create_chroma_vector_store(chunks, embeddings_model_path):
        embeddings = HuggingFaceEmbeddings(  # This class is used to generate embeddings for the text chunks.
            model_name=embeddings_model_path,  # Path to the pre-trained embeddings model.
            model_kwargs={'device': 'cpu'},    # Configuration for running the model on CPU.
            encode_kwargs={'normalize_embeddings': False})

        db_name = str(random.random())
        # Create a vector store from the provided documents (chunks) and embeddings.
        vectorstore_resumes = Chroma.from_documents(collection_name=db_name, documents=chunks, embedding=embeddings)

        # 'search_type' defines the method used to find the documents most relevant to the prompt,
        # and 'k' defines how many documents will be included in the context.
        retriever_resumes = vectorstore_resumes.as_retriever(search_type="similarity", search_kwargs={"k": 2})
        return retriever_resumes

    retriever_resumes = create_chroma_vector_store(text_chunks, embedding_model_path)
```
## (OPTIONAL) Milvus Vector DB
- **Condition Check**:
    - `if vector_store == 'MilvusDB':`
        - Ensures that the Milvus-based vector store logic is executed only when `MilvusDB` is selected.
- **`create_milvus_vector_store` Function**:
    - This function generates embeddings for the provided text chunks and stores them in a Milvus vector database.
- **Inputs**:
    - `chunks`: The text chunks to be embedded and stored.
    - `embeddings_model_path`: Path to the pre-trained embeddings model.
- **Steps**:
    - **Generate Embeddings**:
        - `HuggingFaceEmbeddings` generates embeddings for the text chunks.
        - Parameters:
            - `model_name`: Path to the pre-trained embeddings model.
            - `model_kwargs`: Specifies the device to run the model (`'cpu'` in this case).
            - `encode_kwargs`: Optional configuration for encoding (e.g., `normalize_embeddings`).
    - **Connect to Milvus**:
        - `connections.connect` establishes a connection to the Milvus server.
        - Parameters:
            - `host`: The URI of the Milvus server.
            - `port`: Default port for Milvus, `19530`.
    - **Create Vector Store**:
        - `Milvus.from_documents` creates a vector store with:
            - `documents`: The input text chunks.
            - `embedding`: The generated embeddings.
            - `collection_name`: The name of the Milvus collection (here, `langchain_example`).
            - `connection_args`: The connection URI of the Milvus server.
            - `drop_old=True`: Drops the old Milvus collection if one exists.
```python
from langchain_milvus import Milvus
from pymilvus import connections

if vector_store == 'MilvusDB':
    def create_milvus_vector_store(chunks, embeddings_model_path):
        embeddings = HuggingFaceEmbeddings(  # This class is used to generate embeddings for the text chunks.
            model_name=embeddings_model_path,  # Path to the pre-trained embeddings model.
            model_kwargs={'device': 'cpu'},    # Configuration for running the model on CPU.
            encode_kwargs={'normalize_embeddings': False})

        connections.connect("default", host=milvus_uri, port="19530")

        vectorstore = Milvus.from_documents(
            documents=chunks,
            embedding=embeddings,
            collection_name="langchain_example",
            connection_args={
                "uri": f"https://{milvus_uri}:19530"
            },
            drop_old=True,  # Drop the old Milvus collection if it exists
        )

        # 'search_type' defines the method used to find the documents most relevant to the prompt,
        # and 'k' defines how many documents will be included in the context.
        retriever_resumes = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
        return retriever_resumes

    retriever_resumes = create_milvus_vector_store(text_chunks, embedding_model_path)
```
## Define `format_docs` to join all chunks
Description

This function processes a list of document objects, extracts their `page_content` attributes, and concatenates all the document contents into a single string. Each document's content is separated by two newline characters (`\n\n`) to enhance readability.

- **Extract Document Content**:
    - Iterates over the list `docs` and retrieves the `page_content` attribute from each document object.
- **Join Content**:
    - Combines all extracted content into a single string, with each document's content separated by two newline characters (`\n\n`).
- **Return Result**:
    - The resulting string is returned for use in other parts of the application.
```python
def format_docs(docs):
    # Retrieve the content of each document in the `docs` list and join them into a single string,
    # with each document's content separated by two newline characters.
    return "\n\n".join(doc.page_content for doc in docs)
```
## Combine all chain steps in a single function
Description

This function retrieves relevant documents for a given question using a retriever, formats a prompt with the documents and the question, and queries a language model API to generate an answer.

- **Define Prompt Template**:
    - A `PromptTemplate` object is created to format the input for the language model.
    - Template details:
        - Includes the retrieved context (relevant documents).
        - Contains the question.
        - Instructs the model to respond with "I don't know" if the answer is unavailable.
- **Retrieve Relevant Documents**:
    - Calls `retriever.get_relevant_documents(question)` to fetch the documents most relevant to the provided question.
- **Format the Retrieved Documents**:
    - The retrieved documents are passed to `format_docs` to be concatenated into a single string, separated by two newline characters.
- **Construct the Prompt**:
    - The `PromptTemplate` is used to format the prompt with the retrieved context and the question.
- **Query the API**:
    - Calls the `call_llm_api` function with the formatted prompt, `api_url`, and `api_token` to query the language model.
- **Extract and Return the Answer**:
    - Extracts the answer from the API response by splitting the output on the keyword `Answer:` and removing extra whitespace.
```python
def query_resume(retriever, question, api_url, api_token):
    prompt_template = PromptTemplate(  # Defines a template for the prompt sent to the LLM.
        input_variables=["context", "question"],
        template=(  # The format of the prompt.
            "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. "
            "If you don't know the answer, just say that you don't know.\n"
            "Question: {question}\nContext: {context}\nAnswer:"
        )
    )

    docs = retriever.get_relevant_documents(question)  # Use the retriever to get relevant documents based on the question.
    context = format_docs(docs)  # Format the retrieved documents.
    prompt = prompt_template.format(context=context, question=question)  # Format the prompt.

    answer = call_llm_api(prompt, api_url, api_token)
    return answer.strip().split('Answer:')[-1].strip()
```
## Chat Examples
Code Explanation

This code snippet, shown below, imports the `practicuscore` module and retrieves the current region using the `get_region` function.

Workflow

- **Import the Module**:
    - `import practicuscore as prt`:
        - The `practicuscore` library is imported and aliased as `prt` for convenience.
        - This module provides utility functions and methods for interacting with Practicus-related resources.
- **Get the Region**:
    - `region = prt.get_region()`:
        - Calls the `get_region` function from the `practicuscore` module.
        - This function retrieves information about the current region, which may be used for region-specific configurations or operations.
Workflow

- **Retrieve Model List**:
    - `my_model_list = region.model_list`:
        - Accesses the `model_list` attribute of the `region` object.
        - This attribute is expected to contain a list of models available in the current region.
- **Display Model List**:
    - `display(my_model_list.to_pandas())`:
        - Converts the model list to a pandas DataFrame using `.to_pandas()` for better readability.
        - Displays the DataFrame, which provides a tabular view of the available models.
- **Select the First Model**:
    - `model_name = my_model_list[0].name`:
        - Selects the first model in the list by accessing the element at index `0`.
        - Extracts the name of the model using the `.name` attribute.
- **Print the Selected Model**:
    - `print("Using first model name:", model_name)`:
        - Outputs the name of the selected model for verification.
```python
# Let's list our models and select one of them.
my_model_list = region.model_list
display(my_model_list.to_pandas())

# We will select the first model
model_name = my_model_list[0].name
print("Using first model name:", model_name)
```
Workflow

- **Retrieve Model Prefix List**:
    - `my_model_prefixes = region.model_prefix_list`:
        - Accesses the `model_prefix_list` attribute of the `region` object.
        - This attribute is expected to contain a list of model prefixes available in the current region.
- **Display Model Prefixes**:
    - `display(my_model_prefixes.to_pandas())`:
        - Converts the model prefix list to a pandas DataFrame using `.to_pandas()` for better readability.
        - Displays the DataFrame, which provides a tabular view of the available prefixes.
- **Select the First Prefix**:
    - `model_prefix = my_model_prefixes[0].key`:
        - Selects the first prefix in the list by accessing the element at index `0`.
        - Extracts the key of the prefix using the `.key` attribute.
- **Print the Selected Prefix**:
    - `print("Using first prefix:", model_prefix)`:
        - Outputs the selected prefix key for verification.
```python
# Let's list our model prefixes and select one of them.
my_model_prefixes = region.model_prefix_list
display(my_model_prefixes.to_pandas())

# We will select the first prefix
model_prefix = my_model_prefixes[0].key
print("Using first prefix:", model_prefix)
```
- **Construct the API URL**:
    - `api_url = f"https://{host}/{model_prefix}/{model_name}/"`:
        - Uses string formatting to dynamically create the API URL from:
            - `host`: The base URL or hostname of the server.
            - `model_prefix`: The selected model prefix.
            - `model_name`: The selected model name.
        - The resulting URL specifies the endpoint for accessing the model.
- **Generate Session Token**:
    - `token = prt.models.get_session_token(api_url=api_url)`:
        - Calls the `get_session_token` function from the `practicuscore.models` module.
        - The `api_url` parameter specifies the endpoint for which the token is generated.
        - The session token ensures secure and authorized communication with the API.
```python
# Create the API URL and session token
api_url = f"https://{host}/{model_prefix}/{model_name}/"
token = prt.models.get_session_token(api_url=api_url)
```
## Create a query and print the answer
```python
# Example query
answer = query_resume(
    retriever=retriever_resumes,
    question="What are the leadership qualities of an HR Director?",
    api_url=api_url,
    api_token=token,
)

# Print the answer
print(answer)
```