Large language models (LLMs) have a notable limitation: they do not have access to real-time information. They are trained on fixed datasets that contain information only up to a specific cutoff date. This creates a challenge when users need up-to-date information or contextually relevant responses. Retrieval augmentation was introduced to address this issue.
By integrating external knowledge bases, language models gain access to current data and domain expertise. In this blog post, we will continue implementing retrieval augmentation using Pinecone and LangChain.
Before continuing, you can learn more about Retrieval Augmentation with LangChain.
Pinecone: A Vector Database
Vector databases are essential for keeping language models up to date and their responses accurate, as we highlighted in our previous post. The vector database we will use here is Pinecone.
Pinecone is a tool that allows you to perform scalable vector database operations. Its primary goal is to efficiently handle large-scale vector data, enabling fast searches across billions of entries. Pinecone has high-performance indexing and retrieval capabilities, making it ideal for vector embedding management.
Let’s delve into code to understand it further:
import os

import pinecone

index_name = 'langchain-retrieval-augmentation'

# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        # `res` is assumed to hold embeddings created earlier (not defined in this snippet)
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )
In this snippet, we define an index name and initialize the Pinecone client, which requires a Pinecone API key and environment. If no index with that name exists yet, we create one.
Pinecone Indexes
In Pinecone, an index acts as the highest-level organizational unit for vector data. It stores vectors, performs operations, and serves queries, acting as a container for efficient vector handling. Each index in Pinecone runs on at least one pod.
When you initialize a Pinecone database, you need to specify an index name. The index name acts as a unique identifier for the index within your Pinecone account. It allows you to manage and interact with the specific index and its associated data.
You can think of an index as a collection of vectors that are related in some way. For example, you might create an index for images of animals. In this index, each vector represents the visual features of a particular animal. For text documents, it is possible to create another index where vectors encode the semantic meaning of the documents. By organizing vectors into separate indexes, you can apply different configurations, and perform specialized operations. This optimizes search and retrieval for specific types of data.
Pinecone scales an index horizontally by adding pods: each additional pod contributes more capacity for storing and serving vectors. This lets Pinecone handle large-scale vector datasets efficiently while keeping vector search and similarity matching fast.
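To make the idea of an index as a container that stores vectors and serves queries more concrete, here is a minimal in-memory sketch. The ToyIndex class and its sample vectors are hypothetical and exist only for illustration. It uses brute-force search and is not how Pinecone is implemented, nor does it mirror Pinecone's API exactly.

```python
import math


class ToyIndex:
    """In-memory stand-in for a vector index (illustration only, not Pinecone's API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = {}  # id -> (values, metadata)

    def upsert(self, items):
        # Insert new ids and overwrite existing ones.
        for vec_id, values, metadata in items:
            if len(values) != self.dimension:
                raise ValueError("vector has the wrong dimension")
            self.vectors[vec_id] = (list(values), dict(metadata))

    def query(self, vector, top_k=3):
        # Brute-force cosine similarity against every stored vector.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [
            (cosine(vector, values), vec_id, metadata)
            for vec_id, (values, metadata) in self.vectors.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


# A tiny "animals" index with made-up 2-dimensional vectors.
animals = ToyIndex(dimension=2)
animals.upsert([
    ("cat", [1.0, 0.1], {"kind": "feline"}),
    ("dog", [0.9, 0.3], {"kind": "canine"}),
    ("fish", [0.0, 1.0], {"kind": "fish"}),
])
best = animals.query([1.0, 0.0], top_k=2)
print([vec_id for _, vec_id, _ in best])
```

A real vector database replaces the brute-force loop with approximate nearest-neighbor indexing, which is what makes searches over billions of entries practical.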
Distance Metrics
Next, we define a metric. In Pinecone, we have the flexibility to choose from different metrics to calculate the distance or similarity between vectors. The choice of metric depends on the nature of your data and the specific use case you are working on. Let’s explore the three metrics we can utilize to effectively create our database:
- Euclidean: The Euclidean distance metric finds common usage in calculating the distance between two points in a plane. It measures the straight-line distance between two vectors in the vector space. When using the Euclidean metric, the most similar results are those with the lowest distance score. In applications like image similarity search, the Euclidean distance metric is frequently employed to identify visually similar images based on their features.
- Cosine: The cosine similarity metric is frequently used to measure the similarity between different documents or vectors. It calculates the cosine of the angle between two vectors, which ranges from -1 to 1. A value of 1 indicates that the vectors have the same direction (high similarity), while a value of -1 indicates they are exactly opposite (low similarity). The advantage of cosine similarity is that the scores are normalized to the [-1, 1] range, making it easier to compare similarities across different vectors.
- Dot product: The dot product metric multiplies two vectors element by element and sums the results, giving a measure of how closely their directions align. The dot product is positive if the vectors point in a similar direction and negative if they point in opposite directions. Unlike cosine similarity, the score is not normalized, so it also grows with the lengths of the vectors. A higher positive value suggests greater similarity, while a more negative value suggests dissimilarity. The dot product metric can be useful in tasks where both the direction and the magnitude of the vectors should influence the similarity score.
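The three metrics above can be sketched in a few lines of plain Python. This is only an illustration on made-up two-dimensional vectors; the function names are our own, and Pinecone computes these scores internally rather than through code like this.

```python
import math


def euclidean(a, b):
    # Straight-line distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    # Sum of element-wise products: depends on direction and length.
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product divided by both lengths: depends on direction only.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice the length
c = [-1.0, 0.0]  # opposite direction to a

print(euclidean(a, b), cosine(a, b), dot(a, b))  # 1.0 1.0 2.0
print(euclidean(a, c), cosine(a, c), dot(a, c))  # 2.0 -1.0 -1.0
```

Note how a and b get a perfect cosine score of 1.0 even though their Euclidean distance is not zero: cosine ignores length, while Euclidean distance and the dot product do not.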
Data Upsertion
When creating a vector index in Pinecone, we can specify the metric that best suits our use case and the properties of our vector data. Choosing an appropriate metric can greatly affect the relevance of the search results and similarity calculations in the application.
Once we have created the index, we need to connect to it and upsert (update or insert) data into the Pinecone database.
index = pinecone.Index(index_name=index_name)
index.describe_index_stats()
{'dimension': 1536,
'index_fullness': 0.0,
'namespaces': {},
'total_vector_count': 0}
We should see that the new Pinecone index has a total_vector_count of 0, as we haven’t added any vectors yet.
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

# upsert any texts left over after the final full batch
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
Now let us walk through this upsertion step by step. First, we set a batch limit of 100, because sending all the data at once could overload the OpenAI and Pinecone APIs. Next, to make the data easy to find and search, we set up metadata for each record: the Wikipedia ID, the source URL, and the title. The record text is then split into chunks, and each chunk is embedded with the OpenAI model and upserted into the index along with its metadata.
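The batching pattern used in the loop above, including the final flush of leftover items, can be isolated into a small generator. This is a simplified sketch: the batched helper and the sample chunk names are hypothetical. Unlike the loop above, which checks the limit only after adding all chunks of a record and can therefore exceed 100 slightly, this version caps every batch at exactly batch_size.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


chunks = [f"chunk-{i}" for i in range(5)]
for group in batched(chunks, batch_size=2):
    # In the real pipeline, this is where we would embed and upsert the group.
    print(group)
```

Without the final flush, the last one or two chunks in this example would never be sent, which is why the upsertion code repeats the embed and upsert calls after the loop.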
index.describe_index_stats()
{'dimension': 1536,
'index_fullness': 0.1,
'namespaces': {'': {'vector_count': 8545}},
'total_vector_count': 8545}
We can now see that the index contains vectors. These are the embedded text chunks we just upserted into our database.
Vector Store
Now that we’ve built our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)
Notice the text_field argument in the vector store initialization. It is set to "text", which is the metadata key under which we stored the text of each chunk in the upsertion snippet:
record_metadatas = [{
    "chunk": j, "text": text, **metadata
} for j, text in enumerate(record_texts)]
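To make the role of text_field concrete, here is a minimal sketch of how a stored metadata dict could be turned into a document: the value under text_field becomes the page content, and the remaining keys stay as metadata. The Doc class, the match_to_doc helper, and the sample record are hypothetical and only mimic the behavior. They are not LangChain's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for LangChain's Document class.
    page_content: str
    metadata: dict = field(default_factory=dict)


def match_to_doc(match_metadata, text_field="text"):
    # Copy so the caller's dict is not modified.
    metadata = dict(match_metadata)
    # The chunk text stored under text_field becomes the document body.
    page_content = metadata.pop(text_field)
    return Doc(page_content=page_content, metadata=metadata)


# Made-up record shaped like the metadata dicts we upserted.
stored = {
    "chunk": 0,
    "text": "Fascism is a political ideology.",
    "source": "https://example.org/fascism",
    "title": "Fascism",
}
doc = match_to_doc(stored)
print(doc.page_content)
print(doc.metadata)
```

If text_field did not match the key used during upsertion, there would be no text to return, so the two must agree.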
With the vector store in place, we can use it to retrieve the passages most relevant to any given query. Let’s do that now:
query = "what is Fascism?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
This snippet returns the following output:
[Document(page_content='Benito Amilcare Andrea Mussolini KSMOM GCTE (29 July 1883 – 28 April 1945) was an Italian politician and journalist. He was also the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party.\n\nBiography\n\nEarly life\nBenito Mussolini was named after Benito Juarez, a Mexican opponent of the political power of the Roman Catholic Church, by his anticlerical (a person who opposes the political interference of the Roman Catholic Church in secular affairs) father. Mussolini\'s father was a blacksmith. Before being involved in politics, Mussolini was a newspaper editor (where he learned all his propaganda skills) and elementary school teacher.\n\nAt first, Mussolini was a socialist, but when he wanted Italy to join the First World War, he was thrown out of the socialist party. He \'invented\' a new ideology, Fascism, much out of Nationalist\xa0and Conservative views.\n\nRise to power and becoming dictator\nIn 1922, he took power by having a large group of men, "Black Shirts," march on Rome and threaten to take over the government. King Vittorio Emanuele III gave in, allowed him to form a government, and made him prime minister. In the following five years, he gained power, and in 1927 created the OVRA, his personal secret police force. Using the agency to arrest, scare, or murder people against his regime, Mussolini was dictator\xa0of Italy by the end of 1927. Only the King and his own Fascist party could challenge his power.\n\nFascism as practiced by Mussolini\nMussolini\'s form of Fascism, "Italian Fascism"- unlike Nazism, the racist ideology that Adolf Hitler followed- was different and less destructive than Hitler\'s. Although a believer in the superiority of the Italian nation and national unity, Mussolini, unlike Hitler, is quoted "Race? It is a feeling, not a reality. 
Nothing will ever make me believe that biologically pure races can be shown to exist today".', metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini', 'title': 'Benito Mussolini', 'wiki-id': '6754'}),
Document(page_content="Nazism (or National Socialism; ) is a set of political beliefs associated with the Nazi Party of Germany. It started in the 1920s, but the Nazi Party gained power in 1933 and started carrying out their ideas in Germany, which they called the Third Reich. They stayed in power in Germany until 1945, when they lost World War II.\n\nNational Socialism is an extreme right-wing, fascist, ideology that is heavily inspired from the works of Oswald Spengler. The Nazis believed that only the Aryan (German) race was capable of building nations and other races, notably the Jewish race, were agents of the corruptive forces of capitalism and Marxism, both of which the Nazis opposed. They considered the Aryan race the 'Master race', which meant that they thought that the Aryans were the most biologically evolved of humans and therefore deserved to have power over all other humans. They adapted Charles Darwin's theory of evolution and applied it to humans.\n\nThe leader of the Nazis Adolf Hitler wanted to a create a country where all Aryans were treated equally. They spent heavily on poorer people and began several huge government programs to help Germany deal with the unemployment and economic crisis caused by the Wall Street Crash of 1929 and the Great Depression which followed it. Some Nazis, such as Ernst Röhm, wanted the reforms to go further and called for a revolution, completely eliminating economic classes in Germany and for the government to take control of major businesses. Many of these Nazis were murdered on Hitler's orders during the Night of the Long Knives because they were a threat to his leadership of the Nazi Party.\n\nThe Nazis blamed the Jewish people for Germany's defeat in World War I. This is known as the Stab in the Back Myth. The Nazis also blamed the Jewish people for rapid inflation and practically every other economic woe facing Germany at the time as a result of their defeat in World War I. 
For this reason, the Nazis not only viewed the Jewish people as inferior to them, but as oppressors of the Aryan people who were creating inequality. The Nazis' tactic of lazily albeit effectively blaming the Jewish people for all of Germany's problems is a propaganda tactic known as scapegoating and was used to justify the great atrocities committed by the Nazis against the Jewish people.", metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/Nazism', 'title': 'Nazism', 'wiki-id': '7749'}),
Document(page_content='Mussolini wanted Italy to become a new Roman Empire. In 1923, he attacked the island of Corfu, and in 1924, he occupied the city state of Fiume. In 1935, he attacked the African country Abyssinia (now called Ethiopia). His forces occupied it in 1936. Italy was thrown out of the League of Nations because of this aggression. In 1939, he occupied the country Albania. In 1936, Mussolini signed an alliance with Adolf Hitler, the dictator of Germany.\n\nFall from power and death\nIn 1940, he sent Italy into the Second World War on the side of the Axis countries. Mussolini attacked Greece, but he failed to conquer it. In 1943, the Allies landed in Southern Italy. The Fascist party and King Vittorio Emanuel III deposed Mussolini and put him in jail, but he was set free by the Germans, who made him ruler of the Italian Social Republic puppet state which was in a small part of Central Italy. When the war was almost over, Mussolini tried to escape to Switzerland with his mistress, Clara Petacci, but they were both captured and shot by partisans. Mussolini\'s dead body was hanged upside-down, together with his mistress and some of Mussolini\'s helpers, on a pole at a gas station in the village of Millan, which is near the border between Italy and Switzerland.\n\nAfter death\nAfter the war, several Neo-Fascist movements have had success in Italy, the most important being the Movimento Sociale Italiano. His granddaughter Alessandra Mussolini has outspoken views similar to Fascism.\n\nSources\n\nOther websites \n Mussolini\'s "Last Will", April 22, 1945 (in Italian). 
From \'Il Giornale\' newspaper \n\n1883 births\n1945 deaths\nDeaths by firearm in Italy\nFormer dictators\nFormer members of the Order of the Bath\nGenerals\nItalian atheists\nItalian military people\nPeople executed by firearm\nPoliticians from Emilia-Romagna\nPrime Ministers of Italy\nWorld War II people', metadata={'chunk': 1.0, 'source': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini', 'title': 'Benito Mussolini', 'wiki-id': '6754'})]
All of these are good, relevant results. But what can we do with this? Among the many tasks, “Generative Question-Answering” or GQA stands out as one of the most interesting and well-supported by LangChain.
Generative QA
In GQA, the query is passed to the LLM as a question, and the LLM bases its answer on the information retrieved from the vector store.
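Conceptually, this means the retrieved passages are placed into the prompt alongside the question before the LLM is called. The sketch below shows one way that could look. The build_stuff_prompt helper and the prompt wording are hypothetical, not LangChain's actual template, and the two passages are shortened from the search results shown earlier.

```python
def build_stuff_prompt(question, passages):
    # "Stuff" every retrieved passage into one context block.
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


passages = [
    "Mussolini was the leader of the National Fascist Party.",
    "Nazism is a set of political beliefs associated with the Nazi Party of Germany.",
]
prompt = build_stuff_prompt("what is Fascism?", passages)
print(prompt)
```

Because all passages share a single prompt, this approach works best when the retrieved chunks are small enough to fit in the model's context window.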
To do this in LangChain, we initialize a RetrievalQA object like so:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

qa.run(query)
When we run this query, we get the following output:
'Fascism is a political ideology that was developed in Italy by Benito Mussolini in the early 20th century. It is characterized by extreme nationalism, authoritarianism, and the suppression of political opposition. Fascists believe in the superiority of their own nation or race and often use violence and propaganda to achieve their goals.'
In this case, we may also want to list the sources behind the answer. We can do that with RetrievalQAWithSourcesChain:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

qa_with_sources(query)
{'question': 'what is Fascism?',
'answer': 'Fascism is an extreme right-wing, fascist ideology that is heavily inspired from the works of Oswald Spengler. The Nazis believed that only the Aryan (German) race was capable of building nations and other races, notably the Jewish race, were agents of the corruptive forces of capitalism and Marxism, both of which the Nazis opposed. They considered the Aryan race the \'Master race\', which meant that they thought that the Aryans were the most biologically evolved of humans and therefore deserved to have power over all other humans. They adapted Charles Darwin\'s theory of evolution and applied it to humans. Mussolini\'s form of Fascism, "Italian Fascism"- unlike Nazism, the racist ideology that Adolf Hitler followed- was different and less destructive than Hitler\'s. Although a believer in the superiority of the Italian nation and national unity, Mussolini, unlike Hitler, is quoted "Race? It is a feeling, not a reality. Nothing will ever make me believe that biologically pure races can be shown to exist today". \n',
'sources': 'https://simple.wikipedia.org/wiki/Nazism, https://simple.wikipedia.org/wiki/Benito%20Mussolini'}
And that’s how you can augment your LangChain chains with up-to-date information and improve their answers to queries, using Pinecone and retrieval augmentation.
Takeaways
In this article, we learned about the Pinecone vector database and how to use it for retrieval-augmented generation to enhance large language models.