Introduction
In this post, we look at vector databases and how they can be used to build more capable conversational agents.
Conversational agents, also known as chatbots or virtual assistants, have become increasingly prevalent in our daily lives. These intelligent systems are designed to simulate human-like conversations and provide users with information, assistance, or entertainment. One crucial aspect of building effective conversational agents is their ability to accurately understand and generate natural language responses.
A vector database gives a conversational agent a way to look up relevant knowledge at query time, which helps it understand questions and respond with grounded answers.
We have explored the concept of retrieval augmentation, which involves using a large language model to search through a vector database for relevant information. One common question that arises from this is how the model determines when to query the database. After all, if you’re engaging in a casual chat, the model doesn’t necessarily need to refer to external knowledge.
In such cases, there’s no need for the model to retrieve information from the database. So, how can we make the querying of the database optional? In this blog post, we’ll discuss one approach to achieve this, which involves using a retrieval tool as part of an AI agent.
Code Implementation
A dataset is required to build our knowledge base. To keep things simple, we’ll utilize a preprocessed dataset that doesn’t need any additional chunking or preprocessing. When working with PDFs or lengthy texts, we usually need to divide the content into smaller chunks for indexing. The information in this scenario already includes contexts, which are paragraphs or slightly longer texts that we will index in our knowledge base.
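The SQuAD contexts used here are already paragraph-sized, so no chunking is needed. For longer sources, the splitting step mentioned above might look roughly like the sketch below. The function name `chunk_text`, the word-based splitting, and the default sizes are assumptions for illustration; production pipelines often split by tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Hypothetical sample text of ten "words": w0 ... w9
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of indexing a few words twice.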
!pip install -q openai "pinecone-client[grpc]" langchain tiktoken datasets
from datasets import load_dataset
data = load_dataset('squad', split='train')
data
Output:
Dataset({
features: ['id', 'title', 'context', 'question', 'answers'],
num_rows: 87599
})
Code:
data = data.to_pandas()
data.head()
Output:
id title context question answers
0 5733be284776f41900661182 University_of_Notre_Dame Architecturally, the school has a Catholic cha... To whom did the Virgin Mary allegedly appear i... {'text': ['Saint Bernadette Soubirous'], 'answ...
1 5733be284776f4190066117f University_of_Notre_Dame Architecturally, the school has a Catholic cha... What is in front of the Notre Dame Main Building? {'text': ['a copper statue of Christ'], 'answe...
2 5733be284776f41900661180 University_of_Notre_Dame Architecturally, the school has a Catholic cha... The Basilica of the Sacred heart at Notre Dame... {'text': ['the Main Building'], 'answer_start'...
3 5733be284776f41900661181 University_of_Notre_Dame Architecturally, the school has a Catholic cha... What is the Grotto at Notre Dame? {'text': ['a Marian place of prayer and reflec...
4 5733be284776f4190066117e University_of_Notre_Dame Architecturally, the school has a Catholic cha... What sits on top of the Main Building at Notre... {'text': ['a golden statue of the Virgin Mary'...
The context column contains duplicate values, because several questions share the same passage. We remove the duplicates to keep the example simple:
data.drop_duplicates(subset=['context'], inplace=True,keep='first')
data.head()
Output:
| | id | title | context | question | answers |
|---:|:-------------------------|:-------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. | To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? | {'text': array(['Saint Bernadette Soubirous'], dtype=object), 'answer_start': array([515], dtype=int32)} |
| 5 | 5733bf84d058e614000b61be | University_of_Notre_Dame | As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a liberal newspaper, Common Sense was published. Likewise, in 2003, when other students believed that the paper showed a liberal bias, the conservative paper Irish Rover went into production. Neither paper is published as often as The Observer; however, all three are distributed to all students. Finally, in Spring 2008 an undergraduate journal for political science research, Beyond Politics, made its debut. | When did the Scholastic Magazine of Notre dame begin publishing? | {'text': array(['September 1876'], dtype=object), 'answer_start': array([248], dtype=int32)} |
| 10 | 5733bed24776f41900661188 | University_of_Notre_Dame | The university is the major seat of the Congregation of Holy Cross (albeit not its official headquarters, which are in Rome). Its main seminary, Moreau Seminary, is located on the campus across St. Joseph lake from the Main Building. Old College, the oldest building on campus and located near the shore of St. Mary lake, houses undergraduate seminarians. Retired priests and brothers reside in Fatima House (a former retreat center), Holy Cross House, as well as Columba Hall near the Grotto. The university through the Moreau Seminary has ties to theologian Frederick Buechner. While not Catholic, Buechner has praised writers from Notre Dame and Moreau Seminary created a Buechner Prize for Preaching. | Where is the headquarters of the Congregation of the Holy Cross? | {'text': array(['Rome'], dtype=object), 'answer_start': array([119], dtype=int32)} |
| 15 | 5733a6424776f41900660f51 | University_of_Notre_Dame | The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively. | How many BS level degrees are offered in the College of Engineering at Notre Dame? | {'text': array(['eight'], dtype=object), 'answer_start': array([487], dtype=int32)} |
| 20 | 5733a70c4776f41900660f64 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are a part of one of the five undergraduate colleges at the school or are in the First Year of Studies program. The First Year of Studies program was established in 1962 to guide incoming freshmen in their first year at the school before they have declared a major. Each student is given an academic advisor from the program who helps them to choose classes that give them exposure to any major in which they are interested. The program also includes a Learning Resource Center which provides time management, collaborative learning, and subject tutoring. This program has been recognized previously, by U.S. News & World Report, as outstanding. | What entity provides help with the management of time for new students at Notre Dame? | {'text': array(['Learning Resource Center'], dtype=object), 'answer_start': array([496], dtype=int32)} |
Each context is now unique. With the data prepared, we need an embedding model and a Pinecone index so that we can embed the contexts and store them:
from langchain.embeddings.openai import OpenAIEmbeddings
embed = OpenAIEmbeddings(
model="text-embedding-ada-002",
openai_api_key="YOUR_API_KEY"
)
Next, we initialize Pinecone so that we can connect to it and store the documents. You will need a Pinecone API key and the matching environment name, both of which you can get from pinecone.io.
import pinecone
# find API key in console at app.pinecone.io
# find ENV (cloud region) next to API key in console
index_name = 'langchain-retrieval-agent'
pinecone.init(
api_key="YOUR_API_KEY",
environment="YOUR_ENV"
)
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='dotproduct',
        dimension=1536  # 1536 dim of text-embedding-ada-002
    )
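We use the dot product metric here. OpenAI embeddings such as text-embedding-ada-002 are normalized to unit length, so dot product and cosine similarity rank results identically; for most other embedding models, cosine similarity is the usual choice. The snippet above creates an index named langchain-retrieval-agent if it does not exist yet. We can then connect to the index and check its stats: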
index = pinecone.Index(index_name)
index.describe_index_stats()
{'dimension': 1536,
'index_fullness': 0.0,
'namespaces': {},
'total_vector_count': 0}
We should see that the new Pinecone index has a total_vector_count of 0, as we haven’t added any vectors yet.
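As a side note on the metric chosen when creating the index: for vectors of length 1, the dot product and cosine similarity are the same number, which is why dotproduct is a reasonable choice for embeddings that are already normalized. The sketch below checks this on two small hand-made vectors; the helper names and the example values are illustrative only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# For unit-length vectors the two scores coincide (up to floating-point error).
print(dot(a, b), cosine(a, b))
```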
Indexing
We could perform the indexing through the LangChain vector store object, but for now it is much faster to use the Pinecone Python client directly. We upload the data in batches of 100.
from tqdm.auto import tqdm
batch_size = 100
for i in tqdm(range(0, len(data), batch_size)):
    # get end of batch
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    # build the metadata for each record in the batch
    metadatas = [{
        'title': record['title'],
        'text': record['context']
    } for _, record in batch.iterrows()]
    # get the list of contexts / documents as plain Python strings
    documents = batch['context'].tolist()
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # get IDs
    ids = batch['id'].tolist()
    # add everything to pinecone
    index.upsert(vectors=zip(ids, embeds, metadatas))
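To make the shape of each upsert payload easier to see, here is a small offline sketch of the same batching logic. It makes no network calls: `fake_embed` is a hypothetical stand-in for the embedding model, and the three records are made-up examples, not rows from SQuAD.

```python
def fake_embed(texts):
    """Hypothetical stand-in for an embedding model: one 1-dimensional vector per text."""
    return [[float(len(t))] for t in texts]


def build_batches(records, batch_size, embed_fn):
    """Yield lists of (id, vector, metadata) tuples, one list per batch."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        ids = [r["id"] for r in batch]
        vectors = embed_fn([r["context"] for r in batch])
        metadatas = [{"title": r["title"], "text": r["context"]} for r in batch]
        yield list(zip(ids, vectors, metadatas))


# Made-up records with the same fields the loop above relies on.
records = [
    {"id": "a", "title": "T", "context": "alpha"},
    {"id": "b", "title": "T", "context": "beta"},
    {"id": "c", "title": "T", "context": "gamma"},
]

batches = list(build_batches(records, batch_size=2, embed_fn=fake_embed))
for batch in batches:
    print(batch)
```

Each tuple pairs an ID with its vector and its metadata. Storing the original text under the `text` metadata key is what later lets the vector store return the passage itself.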
Creating a Vector Store & Querying
We start by initializing a vector store using the same index we just built. We do that like so:
from langchain.vectorstores import Pinecone
text_field = "text"
# switch back to normal index for langchain
index = pinecone.Index(index_name)
vectorstore = Pinecone(
index, embed.embed_query, text_field
)
Now let us run a semantic search over it to see how it behaves:
query = "when was the college of engineering in the University of Notre Dame established?"
vectorstore.similarity_search(
query, # our search query
k=3 # return 3 most relevant docs
)
[Document(page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president.", metadata={'title': 'University_of_Notre_Dame'}),
Document(page_content='The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively.', metadata={'title': 'University_of_Notre_Dame'}),
Document(page_content='Since 2005, Notre Dame has been led by John I. Jenkins, C.S.C., the 17th president of the university. Jenkins took over the position from Malloy on July 1, 2005. In his inaugural address, Jenkins described his goals of making the university a leader in research that recognizes ethics and building the connection between faith and studies. During his tenure, Notre Dame has increased its endowment, enlarged its student body, and undergone many construction projects on campus, including Compton Family Ice Arena, a new architecture hall, additional residence halls, and the Campus Crossroads, a $400m enhancement and expansion of Notre Dame Stadium.', metadata={'title': 'University_of_Notre_Dame'})]
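Conceptually, the search above scores every stored vector against the query vector and returns the k best matches. The toy sketch below does this by brute force over three hand-made 2-dimensional vectors. It is only an illustration of the idea: the texts and vectors are invented, and a real vector database typically uses approximate nearest-neighbour indexes rather than scanning everything.

```python
def top_k(query_vec, items, k=3):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), text)
        for text, vec in items
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented (text, vector) pairs standing in for indexed documents.
toy_index = [
    ("engineering college", [0.9, 0.1]),
    ("student newspapers", [0.1, 0.9]),
    ("university president", [0.6, 0.4]),
]

results = top_k([1.0, 0.0], toy_index, k=2)
print(results)
```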
Initializing the Conversational Agent
Our conversational agent needs a Chat LLM, conversational memory, and a RetrievalQA chain to initialize. We create these using:
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
# chat completion llm
llm = ChatOpenAI(
openai_api_key="YOUR_API_KEY",
model_name='gpt-3.5-turbo',
temperature=0.0
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
memory_key='chat_history',
k=5,
return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever()
)
Using these, we can generate an answer with the run method:
qa.run(query)
'The College of Engineering was established in 1920 at the University of Notre Dame.'
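The "stuff" chain type used above places all retrieved documents into a single prompt and sends that prompt to the LLM. A rough sketch of that idea follows. The function name `build_stuff_prompt` and the prompt wording are assumptions made for illustration; they are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate all retrieved documents into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Two short passages standing in for retrieved documents.
docs = [
    "The College of Engineering was established in 1920.",
    "The college includes five departments of study.",
]
prompt = build_stuff_prompt("When was the College of Engineering established?", docs)
print(prompt)
```

Because every document goes into one prompt, this approach only works while the retrieved text fits within the model's context window.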
But this isn’t ready for our conversational agent yet. To let the agent call the retrieval chain, we wrap it in a tool:
from langchain.agents import Tool
tools = [
Tool(
name='Knowledge Base',
func=qa.run,
description=(
'use this tool when answering general knowledge queries to get '
'more information about the topic'
)
)
]
With the tool defined, we can initialize the agent:
from langchain.agents import initialize_agent
agent = initialize_agent(
agent='chat-conversational-react-description',
tools=tools,
llm=llm,
verbose=True,
max_iterations=3,
early_stopping_method='generate',
memory=conversational_memory
)
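The window memory passed to the agent keeps only the most recent k interactions, so older turns drop out of the chat history. The sketch below imitates that behaviour with a bounded deque. The `WindowMemory` class is a hypothetical illustration, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (user, ai) exchanges."""

    def __init__(self, k: int):
        # A deque with maxlen discards the oldest entry when a new one is appended.
        self.turns = deque(maxlen=k)

    def save(self, user_msg: str, ai_msg: str) -> None:
        self.turns.append((user_msg, ai_msg))

    def history(self) -> list:
        return list(self.turns)


memory = WindowMemory(k=2)
for i in range(4):
    memory.save(f"question {i}", f"answer {i}")

print(memory.history())
```

A small window keeps prompts short and cheap, while a larger one lets the agent refer back to earlier parts of the conversation.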
Querying using the Conversational Agent
Now we can use our agent to generate responses to the queries:
agent(query)
> Entering new chain...
I'm not sure about the answer to this question, so I should use a tool to help me find the information.
Action: Knowledge Base
Action Input: "University of Notre Dame College of Engineering establishment date"
Observation: The University of Notre Dame College of Engineering was established in 1920.
Thought:I now know the answer to the question.
Final Answer: The College of Engineering in the University of Notre Dame was established in 1920.
> Finished chain.
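The trace above follows a loop: the LLM names a tool and an input, the tool runs, and its result is fed back as the observation. The sketch below parses an "Action / Action Input" pair in the text form shown in the trace and dispatches it to a tool. It is a simplified illustration under that assumption, not the parser LangChain uses internally, and the lambda tool is a stand-in for the real retrieval chain.

```python
import re


def parse_action(llm_output: str):
    """Extract (tool_name, tool_input) from LLM text, or None if no action is requested."""
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", llm_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip().strip('"')


def run_step(llm_output: str, tools: dict):
    """Run the requested tool and return its observation, or None if no tool was requested."""
    parsed = parse_action(llm_output)
    if parsed is None:
        return None
    name, tool_input = parsed
    return tools[name](tool_input)


# Stand-in tool; the real agent would call the retrieval QA chain here.
tools = {"Knowledge Base": lambda q: f"looked up: {q}"}

step = (
    "I should use a tool to help me find the information.\n"
    "Action: Knowledge Base\n"
    'Action Input: "University of Notre Dame College of Engineering establishment date"'
)
print(run_step(step, tools))
```

When the LLM output contains no action, as in a final answer, nothing is dispatched. This is what makes the database query optional.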
This looks good: the agent can answer questions about the knowledge base. What happens if we ask it something that is not in the knowledge base?
agent("what is 2 * 10")
> Entering new chain...
This is a simple math problem.
Action: Knowledge Base
Action Input: "2 * 10"
Observation: 2 * 10 = 20
Thought:That was easy.
Final Answer: 20
> Finished chain.
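The agent answers this question correctly as well. Note, however, that the trace shows it still called the Knowledge Base tool for a simple arithmetic question. Whether the agent uses a tool depends on how the LLM interprets the tool description, so a narrower description can help it skip retrieval for questions like this one.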
In conclusion, pairing a vector database with a conversational agent lets the agent answer questions about the knowledge base more accurately.