In this article, we will explore the concept of retrieval augmentation with LangChain and walk through how to build a knowledge base that supports it. We will delve into the benefits of this approach and its implementation strategy.
Retrieval Augmentation with LangChain
Large language models have revolutionized natural language processing (NLP) by demonstrating an impressive ability to generate coherent text and provide accurate replies to user queries. These models are trained on massive volumes of text data to learn patterns, syntax, and semantic relationships, enabling them to generate human-like responses.
One significant constraint of large language models is their lack of access to current or real-time information. Because they are trained on a fixed dataset, typically with a specific cutoff date, these models remain frozen in time. Consequently, they are unaware of events, developments, or changes that have taken place after their training data was collected.
When users demand up-to-date information or need a model to grasp the current situation, this lack of data freshness becomes a problem. One option is to retrain large language models on new datasets that incorporate recent information, but this strategy is time-consuming and resource-intensive due to the scarcity of labeled data and computing constraints.
To get around this issue, we’ll look at a technique called retrieval augmentation with LangChain.
Retrieval augmentation with LangChain uses external knowledge bases or information repositories to provide large language models with current information. By integrating these external resources, the models can supplement their replies with up-to-date data and expertise.
Parametric Knowledge vs. Source Knowledge
Parametric knowledge is the information the model acquires during the training phase. In the case of GPT-3.5, a large language model, it is exposed to a vast amount of text data and learns to capture patterns, word relationships, and contextual knowledge. This information is stored in the model’s parameters, its learned weights and biases, and enables the model to generate text output.
Parametric knowledge is a static snapshot of the world as viewed through the eyes of the training data. It contains the model’s general knowledge and language understanding acquired during training. So, technically, the world is frozen in time according to the language model!
In contrast, during the inference or generation stage, the model can receive external information, referred to as source knowledge. This source knowledge can come from a variety of places, including knowledge bases, databases, and APIs.
Source knowledge enriches the model’s parametric knowledge without directly changing its parameters. It provides the model with up-to-date or specific information that may not be present in its original training data.
Let us try to understand with an example!
Example
Imagine using a large language model like GPT or another LLM trained on a vast dataset of books, journals, and online texts. This training provides the model with parametric knowledge, giving it a comprehensive grasp of grammar, vocabulary, and general world knowledge. For example, the model learns that “water boils at 100 degrees Celsius” and “the Earth revolves around the Sun.”
Now, suppose you want to use this language model to answer a question about the current temperature in a city. As the model’s training data is not real-time, it doesn’t have access to current weather information. This is where source knowledge comes into play.
To provide the model with the necessary current information, you can integrate a weather API as a source of knowledge. This weather API can provide real-time temperature data for various locations. When a user inquires about the temperature in a particular city, the language model can utilize its parametric knowledge to generate an initial response, stating that “The average temperature in that city is approximately 25 degrees Celsius.”
The language model can access the current temperature for the desired city by drawing on source knowledge, which, in this example, is the weather API. The retrieval component connects to the weather API, retrieves the most recent temperature data, and adjusts the initial answer accordingly.
The model can now deliver a more precise and contextually relevant response, such as “The current temperature in that city is 28 degrees Celsius.”
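As a rough sketch of this pattern, the snippet below fetches fresh data at query time and includes it in the prompt. The get_current_temperature helper and the api.example.com endpoint are hypothetical stand-ins for a real weather API; only the overall pattern of combining source knowledge with the model's parametric knowledge matters here.

import requests
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

def get_current_temperature(city):
    # hypothetical weather API call -- replace with a real provider
    response = requests.get("https://api.example.com/weather", params={"city": city})
    return response.json()["temperature_celsius"]

city = "Mumbai"
temperature = get_current_temperature(city)  # source knowledge fetched at query time

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
prompt = (
    f"The current temperature in {city} is {temperature} degrees Celsius. "
    "Using this information, suggest what to wear outside today."
)
print(llm([HumanMessage(content=prompt)]).content)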
Implementing Retrieval Augmentation Using LangChain
To elaborate on the process of creating a knowledge base for retrieval augmentation in LangChain, let’s break it down step by step!
In this example, we choose the Wikipedia dataset as our source of information. Wikipedia is a valuable resource for building a knowledge base due to its vast amount of textual data covering a wide range of topics. To obtain the Wikipedia dataset, one can utilize platforms like Hugging Face Datasets, which offer pre-processed and structured versions of the dataset.
from datasets import load_dataset

# load the first 3000 articles from the Simple English Wikipedia dump
data = load_dataset("wikipedia", "20220301.simple", split="train[:3000]")
This code snippet loads the first 3000 samples of the Simple English Wikipedia dataset from Hugging Face. Let us look at one of the samples:
data[7]
{'id': '14',
'url': 'https://simple.wikipedia.org/wiki/Alanis%20Morissette',
'title': 'Alanis Morissette',
'text': 'Alanis Nadine Morissette (born June 1, 1974) is a Grammy Award-winning Canadian-American singer and songwriter. She was born in Ottawa, Canada. She began singing in Canada as a teenager in 1990. In 1995, she became popular all over the world.\n\nAs a young child in Canada, Morissette began to act on television, including 5 episodes of the long-running series, You Can\'t Do That on Television. Her first album was released only in Canada in 1990.\n\nHer first international album was Jagged Little Pill, released in 1995. It was a rock-influenced album. Jagged has sold more than 33 million units globally. It became the best-selling debut album in music history. Her next album, Supposed Former Infatuation Junkie, was released in 1998. It was a success as well. Morissette took up producing duties for her next albums, which include Under Rug Swept, So-Called Chaos and Flavors of Entanglement. Morissette has sold more than 60 million albums worldwide.\n\nShe also acted in several movies, including Kevin Smith\'s Dogma, where she played God.\n\nAbout her life\nAlanis Morissette was born in Riverside Hospital of Ottawa in Ottawa, Ontario. Her father is French-Canadian. Her mother is from Hungary. She has an older brother, Chad, and a twin brother, Wade, who is 12 minutes younger than she is. Her parents had worked as teachers at a military base in Lahr, Germany.\n\nMorissette became an American citizen in 2005. She is still Canadian citizen.\n\nOn May 22, 2010, Morissette married rapper Mario "MC Souleye" Treadway.\n\nJagged Little Pill\nMorissette has had many albums. Her 1995 album Jagged Little Pill became a very popular album. It has sold over 30 million copies worldwide. The album caused Morissette to win four Grammy Awards. The album Jagged Little Pill touched many people.\n\nOn the album, Morissette sang songs about many different things. These things include:\nlove (in the song "Head Over Feet")\nlife (in the songs "Ironic" and "You Learn")\nher feelings (in the songs "Hand In My Pocket" and "All I Really Want")\nsadness (in the song "Mary Jane")\nanger (in the song "You Oughta Know")\nfrustration (in the songs "Not the Doctor" and "Wake Up")\n\nDiscography\n\nAlbums\nAlanis (Canada-only, 1991)\nNow Is the Time (Canada-only, 1992)\nJagged Little Pill (1995)\nSupposed Former Infatuation Junkie (1998)\nAlanis Unplugged (1999)\nUnder Rug Swept (2002)\nFeast on Scraps (CD/DVD, 2002)\nSo-Called Chaos (2004)\nJagged Little Pill Acoustic (2005)\nAlanis Morissette: The Collection (2005)\nFlavors of Entanglement (2008)\nHavoc and Bright Lights (2012)\n\nSelected songs\nMorissette has written many songs. Some of her most famous songs are:\n"You Oughta Know" - This song is to Morissette\'s ex-boyfriend, a man she once loved. In this song, Morissette is very angry. She wants her ex-boyfriend to know that he caused many problems after leaving her for another woman.\n"Ironic" - This song is about life. It contains several stories about unlucky people. In one of the stories, a man is afraid of flying on airplanes. He finally flies in one, but the airplane crashes.\n"You Learn" - In this song, Morissette says that bad things happen in life, but people learn from them. Anyone can make bad things into good things. She wants people to try new things in life.\n"Uninvited" - In this song, Morissette is not happy because she is famous. She does not know whether she wants to continue to be famous or not.\n"Thank U" - In this song, she thanks many things that have helped her. 
She thanks India, a country she visited and almost died in. She also lists ways she can improve herself.\n"Hands Clean" - In this song, a man does something bad, and tells Morissette not to tell anyone else the bad thing the man did. She hides the man\'s secret for many years.\n\nReferences\n\nOther websites \n\n Official website\n\n1974 births\nLiving people\n \nAmerican child actors\nAmerican movie actors\nAmerican pop musicians\nAmerican rock singers\nAmerican singer-songwriters\nAmerican television actors\nCanadian movie actors\nCanadian pop singers\nCanadian rock singers\nCanadian singer-songwriters\nCanadian television actors\nGrammy Award winners\nPeople from Ottawa\nSingers from Ontario\nTwin people from Canada'}
Preprocessing the Wikipedia Text: Before incorporating Wikipedia data into the knowledge base, some preprocessing steps are necessary. Large language models have limitations on the amount of text they can efficiently process. To overcome this limitation, we split the lengthy Wikipedia text into smaller chunks based on token counts. Embedding large chunks of text produces poorer-quality embeddings.
import tiktoken
tiktoken.encoding_for_model('gpt-3.5-turbo')
<Encoding 'cl100k_base'>
We see that GPT-3.5 Turbo uses cl100k_base as its encoding, so let us continue with that.
Tokenization: Tokenization is the process of splitting text into individual tokens, units such as words or subwords. Here, a tokenizer like tiktoken is used to measure the length of each chunk in tokens rather than in characters or words. This ensures that the resulting chunks stay within the token limits of the large language model.
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)
print(tiktoken_len("""
The quick brown fox jumps over the lazy dog.
"""))
print(len(("THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.").split(" ")))
12
9
This output demonstrates that the tokenizer divides the sentence into tokens rather than relying on word-based splits. Since the model’s length limits are defined in tokens, sizing chunks by token count keeps every chunk within those limits and produces higher-quality embeddings than word-based splitting would.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
This splitter takes the following arguments:
- chunk_size=500 sets the maximum size of each chunk, measured in tokens here because of the length function we supply.
- chunk_overlap=50 sets the overlap between consecutive chunks, which helps maintain continuity of context across chunk boundaries.
- length_function uses the tiktoken_len function we defined above, so chunk lengths are counted in tokens rather than characters.
- separators specifies the separators the recursive splitter tries in order: it first attempts to split on “\n\n”; if the resulting pieces are still too large, it falls back to “\n”, then “ ”, and finally splits between individual characters.
So now let us try to split the text and see it in action:
chunks = splitter.split_text(data[7]['text'])[:3]
chunks
['Alanis Nadine Morissette (born June 1, 1974) is a Grammy Award-winning Canadian-American singer and songwriter. She was born in Ottawa, Canada. She began singing in Canada as a teenager in 1990. In 1995, she became popular all over the world.\n\nAs a young child in Canada, Morissette began to act on television, including 5 episodes of the long-running series, You Can\'t Do That on Television. Her first album was released only in Canada in 1990.\n\nHer first international album was Jagged Little Pill, released in 1995. It was a rock-influenced album. Jagged has sold more than 33 million units globally. It became the best-selling debut album in music history. Her next album, Supposed Former Infatuation Junkie, was released in 1998. It was a success as well. Morissette took up producing duties for her next albums, which include Under Rug Swept, So-Called Chaos and Flavors of Entanglement. Morissette has sold more than 60 million albums worldwide.\n\nShe also acted in several movies, including Kevin Smith\'s Dogma, where she played God.\n\nAbout her life\nAlanis Morissette was born in Riverside Hospital of Ottawa in Ottawa, Ontario. Her father is French-Canadian. Her mother is from Hungary. She has an older brother, Chad, and a twin brother, Wade, who is 12 minutes younger than she is. Her parents had worked as teachers at a military base in Lahr, Germany.\n\nMorissette became an American citizen in 2005. She is still Canadian citizen.\n\nOn May 22, 2010, Morissette married rapper Mario "MC Souleye" Treadway.\n\nJagged Little Pill\nMorissette has had many albums. Her 1995 album Jagged Little Pill became a very popular album. It has sold over 30 million copies worldwide. The album caused Morissette to win four Grammy Awards. The album Jagged Little Pill touched many people.',
'On the album, Morissette sang songs about many different things. These things include:\nlove (in the song "Head Over Feet")\nlife (in the songs "Ironic" and "You Learn")\nher feelings (in the songs "Hand In My Pocket" and "All I Really Want")\nsadness (in the song "Mary Jane")\nanger (in the song "You Oughta Know")\nfrustration (in the songs "Not the Doctor" and "Wake Up")\n\nDiscography\n\nAlbums\nAlanis (Canada-only, 1991)\nNow Is the Time (Canada-only, 1992)\nJagged Little Pill (1995)\nSupposed Former Infatuation Junkie (1998)\nAlanis Unplugged (1999)\nUnder Rug Swept (2002)\nFeast on Scraps (CD/DVD, 2002)\nSo-Called Chaos (2004)\nJagged Little Pill Acoustic (2005)\nAlanis Morissette: The Collection (2005)\nFlavors of Entanglement (2008)\nHavoc and Bright Lights (2012)',
'Selected songs\nMorissette has written many songs. Some of her most famous songs are:\n"You Oughta Know" - This song is to Morissette\'s ex-boyfriend, a man she once loved. In this song, Morissette is very angry. She wants her ex-boyfriend to know that he caused many problems after leaving her for another woman.\n"Ironic" - This song is about life. It contains several stories about unlucky people. In one of the stories, a man is afraid of flying on airplanes. He finally flies in one, but the airplane crashes.\n"You Learn" - In this song, Morissette says that bad things happen in life, but people learn from them. Anyone can make bad things into good things. She wants people to try new things in life.\n"Uninvited" - In this song, Morissette is not happy because she is famous. She does not know whether she wants to continue to be famous or not.\n"Thank U" - In this song, she thanks many things that have helped her. She thanks India, a country she visited and almost died in. She also lists ways she can improve herself.\n"Hands Clean" - In this song, a man does something bad, and tells Morissette not to tell anyone else the bad thing the man did. She hides the man\'s secret for many years.\n\nReferences\n\nOther websites \n\n Official website\n\n1974 births\nLiving people\n \nAmerican child actors\nAmerican movie actors\nAmerican pop musicians\nAmerican rock singers\nAmerican singer-songwriters\nAmerican television actors\nCanadian movie actors\nCanadian pop singers\nCanadian rock singers\nCanadian singer-songwriters\nCanadian television actors\nGrammy Award winners\nPeople from Ottawa\nSingers from Ontario\nTwin people from Canada']
We convert the text chunks into vector embeddings to facilitate efficient searching and retrieval from the knowledge base. Vector embeddings are numerical representations of a text’s meaning, and they help us find the chunks most relevant to the current query.
If we think about it, the chunks whose embeddings lie closest to the query embedding will be the most relevant ones in the embedded space.
import os

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
Here we use OpenAI’s text-embedding-ada-002 model to generate embeddings for the chunk text.
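As a quick sanity check, we can embed the three chunks we created earlier and inspect the output; the exact vector values vary, but text-embedding-ada-002 returns 1536-dimensional vectors.

# embed the three chunks we created above
embeddings = embed.embed_documents(chunks)

print(len(embeddings))     # 3 -- one vector per chunk
print(len(embeddings[0]))  # 1536 -- dimensionality of text-embedding-ada-002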
Vector Databases
Once we have created these embeddings, we need to store them somewhere that supports fast search and retrieval. We store each chunk’s embedding, together with its original text and any metadata, in a vector database.
Pinecone is a powerful tool for scalable vector database operations, allowing us to search through billions of records quickly. Beyond efficient retrieval, the vector database supports record management, including adding, updating, and deleting records, and metadata filtering lets us restrict results by criteria such as date or category.
With the vector database powered by Pinecone, searching for relevant text chunks becomes fast and efficient. At query time, we encode the user’s query into a vector embedding with the same embedding model, and the database compares it against the stored chunk embeddings to identify the most similar ones. This enables quick retrieval of contextually relevant information from the knowledge base.
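As a minimal sketch of how this can look in code, the snippet below creates a Pinecone index, upserts the chunk embeddings with their text stored as metadata, and runs a similarity search through LangChain’s Pinecone vector store wrapper. The index name, environment value, and example query are placeholder choices, and the exact calls may differ depending on your pinecone-client and LangChain versions.

import pinecone
from langchain.vectorstores import Pinecone

# hypothetical index name and environment -- adjust to your Pinecone project
index_name = 'langchain-retrieval-augmentation'
pinecone.init(api_key=os.getenv('PINECONE_API_KEY'), environment='us-east-1-aws')

if index_name not in pinecone.list_indexes():
    # ada-002 embeddings have 1536 dimensions
    pinecone.create_index(index_name, dimension=1536, metric='cosine')

index = pinecone.Index(index_name)

# upsert the chunk embeddings along with their text as metadata
ids = [f"7-{i}" for i in range(len(chunks))]
metadatas = [{'text': chunk, 'title': data[7]['title']} for chunk in chunks]
index.upsert(vectors=list(zip(ids, embed.embed_documents(chunks), metadatas)))

# wrap the index as a LangChain vector store and run a similarity search
vectorstore = Pinecone(index, embed.embed_query, 'text')
query = "When was Alanis Morissette's first international album released?"
print(vectorstore.similarity_search(query, k=3))

Finally, to close the loop, the retrieved chunks can be fed to a chat model so that it answers from the knowledge base. The sketch below uses LangChain’s RetrievalQA chain with the 'stuff' chain type on top of the vector store built above; it is one straightforward way to wire this up, not the only one.

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.0, openai_api_key=OPENAI_API_KEY)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',  # place the retrieved chunks directly into the prompt
    retriever=vectorstore.as_retriever()
)

print(qa.run("How many albums has Alanis Morissette sold worldwide?"))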
Takeaways
In this article, we learnt about Retrieval Augmented Generation (RAG) with LangChain. With this approach, we can bridge the gap between static training data and the dynamic nature of real-time information, empowering language models to deliver more accurate and contextually relevant responses.