I’m going to implement a local chatbot on my laptop to talk to my book, “Designed4Devops”. This will allow a user (me!) to ask the book questions and have the large language model (LLM) summarise its contents. My book is self-published and copyrighted, so it shouldn’t appear in models’ training data. To achieve this, I’m going to use RAG, or Retrieval-Augmented Generation.
RAG
RAG is a technique that allows you to add data to an LLM after the model was trained, without retraining or fine-tuning it. Training models requires access to powerful, and often numerous, high-end GPUs. This can be expensive. It also has the downside that if you want to update the data, you need to retrain the model. RAG overcomes this by adding semi-structured or unstructured information to the model environment after training, in a form that the model can query. The information can be updated quickly without retraining. Remember that models work by matrix multiplications of numbers, not text. We use a model to embed the text as numbers in a vector store. This allows the LLM to query the data with semantic searching. The model then returns results based on the question and the context of the query given. The context is the relationship between the tokens (words or word fragments) in the vector store. To refresh the information available to the model, only the vector store embeddings need to be updated.
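To make the idea of semantic searching over a vector store more concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, the passages are made up for illustration, and the "vector store" is just a list. A real store such as the one used later works on dense vectors, but the retrieve-by-similarity step has the same shape.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, store, k=1):
    """Return the k passages whose vectors are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Illustrative passages only; these are not quotes from the book.
passages = [
    "small changes released often reduce risk",
    "value stream mapping shows where work waits",
    "automated tests give fast feedback",
]
vector_store = [(p, embed(p)) for p in passages]

top = retrieve("how do automated tests help", vector_store, k=1)
print(top)
```

Updating the knowledge available to the chatbot then only means re-embedding the passages and rebuilding `vector_store`; nothing about the model itself changes.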
First, we’ll install the dependencies and set up the model.
Let’s import the dependencies for setting up the model.
!pip install --force-reinstall -Uq torch datasets accelerate peft bitsandbytes transformers trl
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
Let’s load the model
I’m going to use Mistral 7B, as it offers good performance with a low processing and memory overhead.
model_name='../models/Mistral-7B-Instruct-v0.1'
model_config = transformers.AutoConfig.from_pretrained(model_name, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Quantization of the Model
I’m going to quantize the model to 4 bits. This lowers the precision of the data types (int4 vs fp16 or fp32), which reduces the overheads even further. An int4 datatype uses ¼ of the memory of an fp16 (16bit floating point) data type. I’ll use ‘bitsandbytes’ software to do this.
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
##############################################################
# Set up quantization config
##############################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
load_in_4bit=use_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
==============================================================
Your GPU supports bfloat16: accelerate training with bf16=True
==============================================================
torch.cuda.get_device_capability()
(8, 6)
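To put the memory saving from quantization into rough numbers, here is a back-of-the-envelope sketch. It assumes a round 7 billion parameters and counts only the weights; it ignores activations, the KV cache, and the small overhead of the quantization constants, so treat the figures as a lower bound rather than a measurement. The function name is my own, not part of any library.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumed round figure for a "7B" model

for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB at fp16 and a little over 3 GiB at 4 bits, which is the difference between not fitting and fitting on a typical laptop GPU.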
##############################################################
# Load the pre-trained model with the quantization config
##############################################################
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
Let’s test it.
This query asks the model a question. We haven’t loaded any of our data into it yet; this is all information held within the model from its training data set.
messages = [{
"role":"user",
"content": "Can you tell us 3 reasons why Eryri is a good place to visit?"}]
tokenizer.apply_chat_template(messages, tokenize=True)
model_inputs = tokenizer.apply_chat_template(messages, return_tensors = "pt").to('cuda:0')
model_inputs
tensor([[ 1, 733, 16289, 28793, 2418, 368, 1912, 592, 28705, 28770, 6494, 2079, 413, 643, 373, 349, 264, 1179, 1633, 298, 3251, 28804, 733, 28748, 16289, 28793]], device='cuda:0')
generated_ids = model.generate(
model_inputs,
max_new_tokens = 1000,
do_sample = True,
pad_token_id=tokenizer.eos_token_id,
)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
[INST] Can you tell us 3 reasons why Eryri is a good place to visit? [/INST] 1. Scenic Beauty: Eryri is known for its stunning natural landscapes and picturesque views. From the majestic mountains to the serene lakes and rolling hills, the region offers a range of breathtaking sights to behold. The area is also home to several national parks and nature reserves, providing visitors with opportunities to hike, bike, and explore the great outdoors.
2. Rich Cultural Heritage: Eryri has a rich and diverse cultural heritage, with a number of ancient sites, historic buildings, and museums to explore. The region is home to several castles, including the famous Conwy Castle, which dates back to the 13th century. Visitors can also learn about the region’s Celtic and medieval history at the National Museum of Wales in Conwy.
3. Delicious Food and Drink: Eryri is known for its delicious cuisine, which combines traditional Welsh flavors with modern twists. The region is famous for its lamb, beef, and seafood, and visitors can sample these local specialties at a number of restaurants, cafes, and markets. The area is also home to several craft breweries and distilleries, offering visitors the chance to taste some of the region’s finest beverages.</s>
The model is working!
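The [INST] and [/INST] markers in the output come from the chat template that apply_chat_template applies before tokenizing. As a rough illustration of what that step does, here is a sketch of a hypothetical helper, `format_mistral_prompt`, that approximates the layout for this model family. The authoritative template ships with the tokenizer, and exact whitespace may differ, so this is for intuition only.

```python
def format_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Approximate the instruct layout: user turns wrapped in [INST] ... [/INST]."""
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence token.
            parts.append(f" {message['content']}{eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


messages = [{
    "role": "user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?"}]
prompt_text = format_mistral_prompt(messages)
print(prompt_text)
```

The tokenizer then converts a string like this into the tensor of token IDs shown earlier, which is what the model actually consumes.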
Create the vector database
I’m going to use ChromaDB, which is a lightweight local vector store, to hold the embeddings of the book’s text that will come from the PDF.
!pip install --force-reinstall -Uq langchain chromadb openai tiktoken sentence-transformers pypdf fastembed
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader
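This loads the PDF into memory, splits the text into overlapping chunks, and uses LangChain with FastEmbed to embed each chunk in the ChromaDB vector store. The text splitter breaks the text into chunks of up to 150 characters with a 50-character overlap; the embedding model then tokenises each chunk into tokens, which can be individual words or fragments of words, before turning it into a vector.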
##############################################################
# Load the book PDF, split it into chunks and embed in the ChromaDB vector store
##############################################################
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")
# Load the book
loader = PyPDFLoader("/tf/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap = 50,
    length_function = len,
    is_separator_regex = False,
)
chunks = text_splitter.split_documents(documents)
chunks[0]
store = Chroma.from_documents(
    chunks,
    embeddings,
    ids = [f"{item.metadata['source']}-{index}" for index, item in enumerate(chunks)],
    collection_name="D4DO-Embeddings"
)
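To show what chunk_size and chunk_overlap mean in practice, here is a simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and spaces); the function name and the sample text are my own, and the point is only to illustrate how consecutive chunks share characters.

```python
def sliding_window_chunks(text, chunk_size=150, chunk_overlap=50):
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 400-character sample so the boundaries are easy to check.
sample = "".join(str(i % 10) for i in range(400))
chunks = sliding_window_chunks(sample)

for index, chunk in enumerate(chunks):
    print(index, len(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice. With chunks as small as 150 characters, each retrieved passage is only a sentence fragment, which is worth bearing in mind when reading the results later on.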
Test the vector store
This tests that the data exists within the vector store.
query = "What did Conway say?"
docs = store.similarity_search(query)
print(docs[0].page_content)
context of discussions that happened at the time. For example, an architect, a UI designer, and a developer
Test the model with a prompt
messages = [{
"role": "user",
"content": "Act as a consultant. I have a client who needing to make his software product company more efficient. \
I want to impress my client by providing advice from the book Designed4Devops. \
What do you recommend? \
Give me two options, along with how to go about it for each"
}]
model_inputs = tokenizer.apply_chat_template(messages,return_tensors = "pt",padding=True).to('cuda:0')
generated_ids = model.generate(
model_inputs,
max_new_tokens = 1000,
do_sample = True,
pad_token_id=tokenizer.eos_token_id,
)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
<s> [INST] Act as a consultant. I have a client who needing to make his software product company more efficient. I want to impress my client by providing advice from the book Designed4Devops. What do you recommend? Give me two options, along with how to go about it for each [/INST] Option 1: Automation and Continuous Integration
One way to improve efficiency in a software product company is to implement automation and continuous integration (CI) practices. This can help streamline the development process, reduce manual errors, and improve the speed and reliability of software releases.
To do this, the first step is to identify areas of the development process where automation and continuous integration can be implemented. This may include building automated tests, automating the deployment process, and implementing a CI tool, such as Jenkins or Travis CI, to orchestrate these processes.
To ensure a successful implementation, it’s important to involve the entire development team in the process to ensure that everyone understands the benefits of automation and continuous integration, and to ensure that the tools and processes are tailored to the specific needs of the company.
Option 2: DevOps Practices and Culture
Another way to improve efficiency in a software product company is to adopt DevOps practices and culture. This involves breaking down the traditional silos between development, operations, and IT teams, and promoting a culture of collaboration and communication between these teams.
To do this, the first step is to assess the current state of practices and culture within the company. This can help identify areas where changes are needed, such as improving communication and collaboration between teams, implementing automation and continuous integration practices, and adopting a version control system, such as Git, to version control code and ensure that changes can be tracked and managed consistently.
To ensure a successful implementation, it’s important to involve the entire organization in the process and to provide training and support to help employees adopt the new practices and culture.</s>
Create the LLM chain
To create a semantically aware search, we need to chain the question with the context from the book’s vectors and engineer a prompt that focuses the model on answering questions using the data from our vector store instead of making it up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want and filtering out those that you don’t. This block sets up the chain and the template that brings the context and question together into the prompt.
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.chains import LLMChain
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)
prompt_template = """
### [INST]
Instruction: Answer the question based on your designed4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:
{context}
### QUESTION:
{question}
[/INST]
"""
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
# Create prompt from prompt template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)
# Create llm chain
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
Create RAG Chain
This chain allows us to engineer our prompt by adding context to the question, which should give a better-grounded answer. The context comes from our vector store, where we embedded the chunks of the book.
retriever = store.as_retriever()
rag_chain = (
{"context": retriever, "question": RunnablePassthrough()}
| llm_chain
)
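Because the retriever is passed straight into the prompt, the {context} slot receives the raw list of Document objects, metadata and all, which is why the prompts in the results contain Document(page_content=...) text. One possible refinement, sketched here in plain Python, is to flatten the retrieved chunks into readable text first. The `Doc` class is a minimal stand-in I have defined so the example runs on its own, and `format_docs` and `build_prompt` are hypothetical helper names, not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical, not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


PROMPT_TEMPLATE = (
    "### [INST]\n"
    "Instruction: Answer the question based on your designed4devops knowledge. "
    "Don't make up answers, just say there is no answer in the book. "
    "Here is context to help:\n"
    "{context}\n"
    "### QUESTION:\n"
    "{question}\n"
    "[/INST]\n"
)


def format_docs(docs):
    """Join chunk text into plain paragraphs, keeping only the page number."""
    return "\n\n".join(
        f"(page {d.metadata.get('page', '?')}) {d.page_content}" for d in docs
    )


def build_prompt(docs, question):
    """Fill the template with cleaned-up context and the user's question."""
    return PROMPT_TEMPLATE.format(context=format_docs(docs), question=question)


docs = [
    Doc("making minor changes and integrating, testing, and releasing them more often.",
        {"page": 125}),
    Doc("process and within the product to create short feedback loops that allow us "
        "to keep improving our product", {"page": 230}),
]
prompt_text = build_prompt(docs, "How do I improve the speed of changes?")
print(prompt_text)
```

In the chain itself, a function like format_docs could probably be slotted in after the retriever, for example by wrapping it in the RunnableLambda that is already imported. I have not done that in the runs shown here, so the recorded outputs reflect the unformatted context.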
The Result:
query = "How do I improve the speed of changes that we are making to my product?"
rag_chain.invoke(query)
{'context': [Document(page_content='we process changes to our product. Digital marketplaces move quickly, so we need to introduce change', metadata={'page': 32, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='process and within the product to create short feedback loops that allow us to keep improving our product', metadata={'page': 230, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='concern yet. Improving them comes later. If you can identify separate workflows within your product, you', metadata={'page': 57, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='making minor changes and integrating, testing, and releasing them more often.', metadata={'page': 125, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})], 'question': 'How do I improve the speed of changes that we are making to my product?', 'text': "\n### [INST] \nInstruction: Answer the question based on your \ndesigned4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:\n\n[Document(page_content='we process changes to our product. Digital marketplaces move quickly, so we need to introduce change', metadata={'page': 32, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='process and within the product to create short feedback loops that allow us to keep improving our product', metadata={'page': 230, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='concern yet. Improving them comes later. 
If you can identify separate workflows within your product, you', metadata={'page': 57, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='making minor changes and integrating, testing, and releasing them more often.', metadata={'page': 125, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})]\n\n### QUESTION:\nHow do I improve the speed of changes that we are making to my product? \n\n[/INST]\n \nThere are several ways to improve the speed of changes being made to a product. One approach is to break down the product into smaller, manageable workflows that can be improved independently. This allows for faster iteration and testing of individual components, which can lead to quicker overall improvements. Additionally, implementing continuous integration and delivery practices can help streamline the development process and reduce the time it takes to release new features or updates"}
query = "How do I speed up the transfer of workflow tickets for new releases, from the development team to the operations team?"
rag_chain.invoke(query)
{'context': [Document(page_content='delivery before the physical installation. Anything we can do to streamline this workflow will significantly', metadata={'page': 158, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='continuous flow of minor changes. You might charge on a per-transaction for APIs or per-user for', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='existing pipelines to increase flow. Mapping a value stream from an idea to release or a backlog entry to', metadata={'page': 54, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='difficult to automate with continuous integration and deployment tools, which slows the flow. \nMicroservices', metadata={'page': 127, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})], 'question': 'How do I speed up the transfer of workflow tickets for new releases, from the development team to the operations team?', 'text': "\n### [INST] \nInstruction: Answer the question based on your \ndesigned4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:\n\n[Document(page_content='delivery before the physical installation. Anything we can do to streamline this workflow will significantly', metadata={'page': 158, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='continuous flow of minor changes. You might charge on a per-transaction for APIs or per-user for', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='existing pipelines to increase flow. Mapping a value stream from an idea to release or a backlog entry to', metadata={'page': 54, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='difficult to automate with continuous integration and deployment tools, which slows the flow. 
\\nMicroservices', metadata={'page': 127, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})]\n\n### QUESTION:\nHow do I speed up the transfer of workflow tickets for new releases, from the development team to the operations team? \n\n[/INST]\n \nThere are several ways to speed up the transfer of workflow tickets for new releases from the development team to the operations team. One approach is to use a continuous integration and deployment (CI/CD) pipeline that automates the process of building, testing, and deploying code changes. This can help reduce the time and effort required to manually transfer workflow tickets between teams. Additionally, using microservices architecture can help break down complex applications into smaller, more manageable components, making it easier to automate and streamline processes. Another approach is to map the value stream from an idea to release or a backlog entry to delivery, which can help ensure that all necessary steps are taken to deliver a new release on time."}
query = "How do I reduce the waste in my delivery pipelines?"
rag_chain.invoke(query)
{'context': [Document(page_content='negotiation from your pipeline and remove the dependency. It removes waste. You do need to exercise', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='lower the risk of waste creeping into the pipelines, products, and processes. When we get down to the flow', metadata={'page': 53, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='methodologies to optimize production lines, reduce waste, and increase the agility they deliver physical', metadata={'page': 25, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='our pipeline and retest the build from scratch. It is adding waste to the system and potentially', metadata={'page': 138, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})], 'question': 'How do I reduce the waste in my delivery pipelines?', 'text': "\n### [INST] \nInstruction: Answer the question based on your \ndesigned4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:\n\n[Document(page_content='negotiation from your pipeline and remove the dependency. It removes waste. You do need to exercise', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='lower the risk of waste creeping into the pipelines, products, and processes. When we get down to the flow', metadata={'page': 53, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='methodologies to optimize production lines, reduce waste, and increase the agility they deliver physical', metadata={'page': 25, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='our pipeline and retest the build from scratch. 
It is adding waste to the system and potentially', metadata={'page': 138, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})]\n\n### QUESTION:\nHow do I reduce the waste in my delivery pipelines? \n\n[/INST]\n \nThere are several ways to reduce waste in delivery pipelines. One approach is to negotiate with your team to remove dependencies that are not necessary for the project. This can help streamline the pipeline and reduce unnecessary steps. Additionally, it's important to lower the risk of waste creeping into the pipelines, products, and processes by focusing on the flow and methodologies that optimize production lines, reduce waste, and increase agility. Another way to reduce waste is to retest the build from scratch if a change is made to the pipeline. However, this should be done carefully as it may add additional waste to the system."}
query = "How will Designed4Devops help me cook a risotto without burning the rice?"
rag_chain.invoke(query)
{'context': [Document(page_content='designed4devops', metadata={'page': 0, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='how designed4devops helps to make this easier to implement. It is a more technical discussion of', metadata={'page': 17, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='designed4devops sets out to increase the flow of novemes through this lifecycle while improving', metadata={'page': 50, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='in the Designed4: Analysis section. \n \nApplication Development', metadata={'page': 163, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})], 'question': 'How will Designed4Devops help me cook a risotto withour burning the rice?', 'text': "\n### [INST] \nInstruction: Answer the question based on your \ndesigned4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:\n\n[Document(page_content='designed4devops', metadata={'page': 0, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='how designed4devops helps to make this easier to implement. It is a more technical discussion of', metadata={'page': 17, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='designed4devops sets out to increase the flow of novemes through this lifecycle while improving', metadata={'page': 50, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}), Document(page_content='in the Designed4: Analysis section. \\n \\nApplication Development', metadata={'page': 163, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})]\n\n### QUESTION:\nHow will Designed4Devops help me cook a risotto withour burning the rice? \n\n[/INST]\n \nThere is no direct answer to this question in the provided documents. 
However, it is important to note that Designed4DevOps is a methodology for software development and delivery that aims to improve the efficiency and reliability of the entire software development lifecycle. While it may not directly address cooking a risotto, it can potentially help in implementing automated processes and tools to streamline software development and delivery, which could indirectly lead to better outcomes."}
Conclusion
This demonstrates that the barrier to entry for prototyping a generative AI service is surprisingly low. This was achieved on relatively modest compute resources in a few hours. Some models, such as the Mistral 7B used here, can provide references to where they found the information, which provides even greater value to users. Before you jump in, be sure to check out my blog on Generative AI and RAG Security. I’ll be taking this project further and blogging along the way. I’ll be talking about how I productionise the system, package it and host it, and add a front end so that you can interact with the book yourselves! You can download this blog as a Jupyter notebook file here. As ever, if you need help with AI projects you can get in touch with Methods to discuss further.