Introduction
Today, we will dig into natural language processing and the potential of LangChain. Our objective is to develop an accurate and efficient method of document summarization with LangChain.
We will learn three distinct summarization approaches to do this: stuff, map_reduce, and refine. Each of these strategies uses a different LangChain chain type. So let’s get started and see how LangChain can help us build good document summaries!
LangChain as a Tool for Document Summarization
When faced with the task of summarizing extensive or multiple documents through natural language processing (NLP), the process can become overwhelming due to the immense amount of data. This can lead to slower processing times and potential memory complications. As a result, substantial investments in high-performance computing infrastructure are often required to handle such data volumes.
LangChain offers an innovative solution by breaking down large documents into smaller sections. Depending on the chosen chain type, these sections can then be processed in parallel or sequentially. This approach effectively addresses the challenges posed by the maximum token limit and ensures efficient summarization.
Another obstacle in document summarization is the amalgamation of information from multiple documents. These documents may utilize distinct terminology, contain conflicting details, or cover different aspects of the subject matter.
LangChain tackles this issue by carrying information from previous documents into the processing of the current one, forming a chain of documents that keeps context in play during summarization.
This chaining also helps preserve the ordering of content in the summary, which tends to make summaries more coherent and precise.
OpenAI key
import os
os.environ["OPENAI_API_KEY"] = "Your openai key"
# Core LangChain objects used throughout this tutorial
from langchain import OpenAI, PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
# temperature=0 keeps the summaries as deterministic as possible
llm = OpenAI(temperature=0)
Now we need a sample document. For this, we will download a Wikipedia page as a PDF: navigate to a Wikipedia page and click Tools >> Download as PDF. For this tutorial, we will use the World War II page from Wikipedia.
To work with this PDF we will use PyMuPDF:
!pip install pymupdf
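import fitz  # PyMuPDF is imported under the name fitz

def load_pdf(path):
    # Concatenate the plain text of every page, adding a newline after each page
    with fitz.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text("text")
            text += "\n"
    return text

text = load_pdf("World_War_II.pdf")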
Now that we have extracted the text from our PDF, let us turn it into documents that LangChain can process. First we need to split the text into chunks; for this we will use LangChain's CharacterTextSplitter.
from langchain.docstore.document import Document
text_splitter = CharacterTextSplitter()
docs = text_splitter.split_text(text)
docs = [Document(page_content=t) for t in docs[:4]]
docs
It is important to chunk the document because processing large documents as a single unit can be computationally expensive and time-consuming.
Output:
[Document(page_content='World War II\nClockwise from top left:\nGerman Stuka dive bombers on the\nEastern Front in December 1943 ·\nAustralian Ordnance QF 25-\nToggle the table of contents\nWorld War II\n\npounder guns during the First Battle\nof El Alamein · Atomic bombing of\nNagasaki · US naval force in the\nLingayen Gulf · Raising a flag over\nthe Reichstag · Soviet troops in the\nBattle of Stalingrad\n\nDate\n1 September 1939 – 2\nSeptember 1945[a]\n(6 years, 1 day)\nLocation\nMajor theaters:\nEurope · Pacific ·\nAtlantic · Indian Ocean\n· South-East Asia ·\nChina · Japan · Middle\nEast · Mediterranean ·\nNorth Africa · Horn of\nAfrica · Central Africa ·\nAustralia · Caribbean ·\nNorth and South\nAmerica\nResult\nAllied victory\nFall of Nazi\nGermany, Fascist\nItaly, and Imperial\nJapan\nAllied military\noccupations of\nGermany, Japan,\nAustria, and Korea\nBeginning of the\nNuclear Age\nDissolution of the\nLeague of Nations\nand creation of the\nUnited Nations\nDecolonisation of\nAsia and Africa\nand decline of\nEuropean\ninternational\ninfluence\nEmergence of the\nUnited States and\nthe Soviet Union\nas rival\nsuperpowers and\nbeginning of the\nCold War (see\nAftermath of World\nWar II)\n\nParticipants\nAllies\nAxis\nCommanders and leaders\nMain Allied\nleaders:\n Joseph\nStalin\n Franklin D.\nRoosevelt\n Winston\nChurchill\n Chiang Kai-\nshek\nMain Axis\nleaders:\n Adolf Hitler\n Hirohito\n Benito\nMussolini\nCasualties and losses\nMilitary dead:\nOver 16,000,000\nCivilian dead:\nOver 45,000,000\nTotal dead:\nOver 61,000,000\n(1937–1945)\n...further details\nMilitary dead:\nOver 8,000,000\nCivilian dead:\nOver 4,000,000\nTotal dead:\nOver 12,000,000\n(1937–1945)\n...further details\nWorld War II or the Second World War, often abbreviated as WWII or WW2, was a global conflict that\nlasted from 1939 to 1945. 
The vast majority of the world\'s countries, including all of the great powers,\nfought as part of two opposing military alliances: the Allies and the Axis. Many participants threw their\neconomic, industrial, and scientific capabilities behind this total war, blurring the distinction between\ncivilian and military resources. Aircraft played a major role, enabling the strategic bombing of population\ncentres and the delivery of the only two nuclear weapons ever used in war. World War II was by far the\ndeadliest conflict in history, resulting in an estimated 70 to 85 million fatalities, mostly among civilians.\nTens of millions died due to genocides (including the Holocaust), starvation, massacres, and disease. In the\nwake of the Axis defeat, Germany and Japan were occupied, and war crimes tribunals were conducted\nagainst German and Japanese leaders.\nThe causes of World War II are debated, but contributing factors included the Second Italo-Ethiopian War,\nSpanish Civil War, Second Sino-Japanese War, Soviet–Japanese border conflicts, the rise of fascism in\nEurope, and European tensions in the aftermath of World War I. World War II is generally considered to\nhave begun on 1 September 1939, when Nazi Germany, under Adolf Hitler, invaded Poland. The United\nKingdom and France subsequently declared war on Germany on 3 September. Under the Molotov–\nRibbentrop Pact of August 1939, Germany and the Soviet Union had partitioned Poland and marked out\ntheir "spheres of influence" across Finland, Estonia, Latvia, Lithuania and Romania. From late 1939 to\nearly 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental\nEurope, in a military alliance with Italy, Japan and other countries called the Axis. 
Following the onset of\ncampaigns in North Africa and East Africa, and the fall of France in mid-1940, the war continued primarily\nbetween the European Axis powers and the British Empire, with war in the Balkans, the aerial Battle of', metadata={}),
Document(page_content="Britain, the Blitz of the United Kingdom, and the Battle of the Atlantic. On 22 June 1941, Germany led the\nEuropean Axis powers in an invasion of the Soviet Union, opening the Eastern Front, the largest land\ntheatre of war in history.\nJapan, which aimed to dominate Asia and the Pacific, was at war with the Republic of China by 1937. In\nDecember 1941, Japan attacked American and British territories with near-simultaneous offensives against\nSoutheast Asia and the Central Pacific, including an attack on the U.S. fleet at Pearl Harbor which resulted\nin the United States and United Kingdom declaring war against Japan. The European Axis powers declared\nwar on the United States in solidarity. Japan soon captured much of the western Pacific, but its advances\nwere halted in 1942 after losing the critical Battle of Midway; later, Germany and Italy were defeated in\nNorth Africa and at Stalingrad in the Soviet Union. Key setbacks in 1943—including a series of German\ndefeats on the Eastern Front, the Allied invasions of Sicily and the Italian mainland, and Allied offensives in\nthe Pacific—cost the Axis powers their initiative and forced them into strategic retreat on all fronts. In 1944,\nthe Western Allies invaded German-occupied France, while the Soviet Union regained its territorial losses\nand pushed Germany and its allies back. During 1944 and 1945, Japan suffered reversals in mainland Asia,\nwhile the Allies crippled the Japanese Navy and captured key western Pacific islands. The war in Europe\nconcluded with the liberation of German-occupied territories and the invasion of Germany by the Western\nAllies and the Soviet Union, culminating in the Fall of Berlin to Soviet troops, Hitler's suicide, and the\nGerman unconditional surrender on 8 May 1945. 
Following the refusal of Japan to surrender on the terms\nof the Potsdam Declaration (issued 26 July 1945), the United States dropped the first atomic bombs on the\nJapanese cities of Hiroshima on 6 August and Nagasaki on 9 August. Faced with an imminent invasion of\nthe Japanese archipelago, the possibility of additional atomic bombings, and the Soviet Union's declared\nentry into the war against Japan on the eve of invading Manchuria, Japan announced on 10 August its\nintention to surrender, signing a surrender document on 2 September 1945.\nWorld War II changed the political alignment and social structure of the globe and set the foundation for the\ninternational order of the world's nations for the rest of the 20th century and into the present day. The\nUnited Nations was established to foster international co-operation and prevent future conflicts, with the\nvictorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—\nbecoming the permanent members of its Security Council. The Soviet Union and the United States\nemerged as rival superpowers, setting the stage for the nearly half-century-long Cold War. In the wake of\nEuropean devastation, the influence of its great powers waned, triggering the decolonisation of Africa and\nAsia. Most countries whose industries had been damaged moved towards economic recovery and\nexpansion. Political and economic integration, especially in Europe, began as an effort to forestall future\nhostilities, end pre-war enmities, and forge a sense of common identity.\nIt is generally considered that, in Europe, World War II started on 1 September 1939,[1][2] beginning with\nthe German invasion of Poland and the United Kingdom and France's declaration of war on Germany two\ndays later on 3 September 1939. 
Dates for the beginning of the Pacific War include the start of the Second\nSino-Japanese War on 7 July 1937,[3][4] or the earlier Japanese invasion of Manchuria, on 19 September\n1931.[5][6] Others follow the British historian A. J. P. Taylor, who held that the Sino-Japanese War and war\nin Europe and its colonies occurred simultaneously, and the two wars became World War II in 1941.[7]\nOther starting dates sometimes used for World War II include the Italian invasion of Abyssinia on 3 October\n1935.[8] The British historian Antony Beevor views the beginning of World War II as the Battles of\nKhalkhin Gol fought between Japan and the forces of Mongolia and the Soviet Union from May to\nSeptember 1939.[9] Others view the Spanish Civil War as the start or prelude to World War II.[10][11]\nStart and end dates", metadata={}),
Document(page_content='The League of Nations assembly,\nheld in Geneva, Switzerland, 1930\nThe exact date of the war\'s end is also not universally agreed upon. It was generally accepted at the time\nthat the war ended with the armistice of 15 August 1945 (V-J Day), rather than with the formal surrender of\nJapan on 2 September 1945, which officially ended the war in Asia. A peace treaty between Japan and the\nAllies was signed in 1951.[12] A 1990 treaty regarding Germany\'s future allowed the reunification of East\nand West Germany to take place and resolved most post–World War II issues.[13] No formal peace treaty\nbetween Japan and the Soviet Union was ever signed,[14] although the state of war between the two\ncountries was terminated by the Soviet–Japanese Joint Declaration of 1956, which also restored full\ndiplomatic relations between them.[15]\nWorld War I had radically altered the political European map with the defeat of the Central Powers—\nincluding Austria-Hungary, Germany, Bulgaria, and the Ottoman Empire—and the 1917 Bolshevik seizure\nof power in Russia, which led to the founding of the Soviet Union. Meanwhile, the victorious Allies of\nWorld War I, such as France, Belgium, Italy, Romania, and Greece, gained territory, and new nation-states\nwere created out of the collapse of Austria-Hungary and the Ottoman and Russian Empires.\nTo prevent a future world war, the League of Nations was created\nduring the 1919 Paris Peace Conference. The organisation\'s\nprimary goals were to prevent armed conflict through collective\nsecurity, military and naval disarmament, and settling international\ndisputes through peaceful negotiations and arbitration.[16]\nDespite strong pacifist sentiment after World War I,[17] irredentist\nand revanchist nationalism emerged in several European states in\nthe same period. 
These sentiments were especially marked in\nGermany because of the significant territorial, colonial, and\nfinancial losses imposed by the Treaty of Versailles. Under the\ntreaty, Germany lost around 13 percent of its home territory and all\nits overseas possessions, while German annexation of other states was prohibited, reparations were\nimposed, and limits were placed on the size and capability of the country\'s armed forces.[18]\nThe German Empire was dissolved in the German Revolution of 1918–1919, and a democratic\ngovernment, later known as the Weimar Republic, was created. The interwar period saw strife between\nsupporters of the new republic and hardline opponents on both the right and left. Italy, as an Entente ally,\nhad made some post-war territorial gains; however, Italian nationalists were angered that the promises made\nby the United Kingdom and France to secure Italian entrance into the war were not fulfilled in the peace\nsettlement. From 1922 to 1925, the Fascist movement led by Benito Mussolini seized power in Italy with a\nnationalist, totalitarian, and class collaborationist agenda that abolished representative democracy, repressed\nsocialist, left-wing and liberal forces, and pursued an aggressive expansionist foreign policy aimed at\nmaking Italy a world power, and promising the creation of a "New Roman Empire".[19]\nHistory\nBackground\nAftermath of World War I\nGermany', metadata={}),
Document(page_content="Adolf Hitler at a German\nNazi political rally in\nNuremberg, August 1933\nAdolf Hitler, after an unsuccessful attempt to overthrow the German\ngovernment in 1923, eventually became the Chancellor of Germany in\n1933 when Paul Von Hindenburg and the Reichstag appointed him.\nFollowing Hindenburg's death in 1934, Hitler proclaimed himself Führer\nof Germany and abolished democracy, espousing a radical, racially\nmotivated revision of the world order, and soon began a massive\nrearmament campaign.[20] Meanwhile, France, to secure its alliance,\nallowed Italy a free hand in Ethiopia, which Italy desired as a colonial\npossession. The situation was aggravated in early 1935 when the Territory\nof the Saar Basin was legally reunited with Germany, and Hitler repudiated\nthe Treaty of Versailles, accelerated his rearmament programme, and\nintroduced conscription.[21]\nThe United Kingdom, France and Italy formed the Stresa Front in April\n1935 in order to contain Germany, a key step towards military\nglobalisation; however, that June, the United Kingdom made an\nindependent naval agreement with Germany, easing prior restrictions. The\nSoviet Union, concerned by Germany's goals of capturing vast areas of\nEastern Europe, drafted a treaty of mutual assistance with France. Before taking effect, though, the Franco-\nSoviet pact was required to go through the bureaucracy of the League of Nations, which rendered it\nessentially toothless.[22] The United States, concerned with events in Europe and Asia, passed the\nNeutrality Act in August of the same year.[23]\nHitler defied the Versailles and Locarno Treaties by remilitarising the Rhineland in March 1936,\nencountering little opposition due to the policy of appeasement.[24] In October 1936, Germany and Italy\nformed the Rome–Berlin Axis. 
A month later, Germany and Japan signed the Anti-Comintern Pact, which\nItaly joined the following year.[25]\nThe Kuomintang (KMT) party in China launched a unification campaign against regional warlords and\nnominally unified China in the mid-1920s, but was soon embroiled in a civil war against its former Chinese\nCommunist Party allies[26] and new regional warlords. In 1931, an increasingly militaristic Empire of\nJapan, which had long sought influence in China[27] as the first step of what its government saw as the\ncountry's right to rule Asia, staged the Mukden Incident as a pretext to invade Manchuria and establish the\npuppet state of Manchukuo.[28]\nChina appealed to the League of Nations to stop the Japanese invasion of Manchuria. Japan withdrew from\nthe League of Nations after being condemned for its incursion into Manchuria. The two nations then fought\nseveral battles, in Shanghai, Rehe and Hebei, until the Tanggu Truce was signed in 1933. Thereafter,\nEuropean treaties\nAsia", metadata={})]
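To make the splitting step less of a black box, here is a minimal plain-Python sketch of the general idea: split the text on a separator, then greedily merge the pieces into chunks up to a size limit. The function name `split_into_chunks` is hypothetical, the default values are assumed to mirror CharacterTextSplitter's defaults, and chunk overlap is left out entirely, so treat this as an illustration rather than LangChain's actual implementation.

```python
def split_into_chunks(text, separator="\n\n", chunk_size=4000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    # Drop empty pieces produced by repeated separators
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece fits, or it is a single oversized piece that is kept whole
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Note that a single piece longer than chunk_size is not cut further in this sketch, which is one reason chunks can end up larger than expected when a document has few paragraph breaks.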
Using “map_reduce” chain
The ‘map_reduce’ chain is designed to handle document processing by breaking a large document into smaller, manageable chunks. This chain employs an initial prompt on each piece to generate a summary or answer based on that specific section of the document.
Moreover, the MapReduceDocumentsChain takes the generated outputs and combines them using a different prompt, resulting in a comprehensive and coherent summary or answer for the entire document. We use the load_summarize_chain function to set up the ‘map_reduce’ chain and obtain the output summary.
from langchain.chains.summarize import load_summarize_chain
import textwrap
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             verbose=True)
output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)
We call the load_summarize_chain function with three arguments:
- llm: The large language model of our choice, which will generate the summaries.
- chain_type: The type of LangChain chain to use for summarizing the docs.
- verbose: A boolean argument; if set to True, it shows all the intermediate steps between processing the request and generating the output.
Here is the generated summary:
World War II was a global conflict that lasted from 1939 to 1945 and resulted in an estimated 70 to
85 million fatalities. Causes of the war included the Second Italo-Ethiopian War, Spanish Civil War,
Second Sino-Japanese War, and European tensions in the aftermath of World War I. The war began on 1
September 1939 when Nazi Germany invaded Poland, and the United Kingdom and France declared war on
Germany on 3 September. The war ended in 1945 with the unconditional surrender of Germany and the
dropping of atomic bombs on Hiroshima and Nagasaki. The war changed the political and social
structure of the world and led to the establishment of the United Nations.
Advantages
- It can handle larger documents and a greater number of documents compared to StuffDocumentsChain.
- Parallelization allows for independent processing of each document’s calls to the Language Model (LLM).
Disadvantages
- It necessitates a higher number of LLM calls compared to StuffDocumentsChain.
- Some information is lost during the final combining call.
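The trade-offs above are easier to see in a stripped-down sketch of the map and reduce steps. This is plain Python, not LangChain's MapReduceDocumentsChain: `fake_llm` is a hypothetical stand-in for a model call that simply keeps the first sentence of whatever it is given, and the prompt wording is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt):
    """Hypothetical stand-in for a model call: keeps only the first sentence after the instruction line."""
    body = prompt.split("\n", 1)[1]
    return body.split(".")[0].strip() + "."


def map_reduce_summarize(chunks, llm=fake_llm):
    # Map step: one independent call per chunk, so the calls can run in parallel
    map_prompts = ["Summarize the following:\n" + c for c in chunks]
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(llm, map_prompts))
    # Reduce step: one extra call that combines the partial summaries
    combined = " ".join(partial_summaries)
    final = llm("Combine these summaries into one:\n" + combined)
    return partial_summaries, final


partials, final = map_reduce_summarize(["Alpha one. Alpha two.", "Beta one. Beta two."])
print(partials)  # ['Alpha one.', 'Beta one.']
print(final)     # Alpha one.
```

For n chunks this makes n + 1 calls, compared with a single call for stuffing. The crude stub also drops "Beta one." in the final call, which is a simplified picture of how detail can be lost in the combining step.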
Using “stuff” chain
The ‘stuff’ chain takes a simpler approach known as stuffing: the prompt passes all the related data as context to the language model in a single call. This works well for smaller pieces of data, but it becomes impractical when there are many pieces, because everything has to fit into one prompt.
Here we also pass a custom PromptTemplate so that the chain generates a concise bullet point summary. We set up the ‘stuff’ chain using the load_summarize_chain function and obtain the summarized output.
prompt_template = """Write a concise bullet point summary of the following:
{text}
CONCISE SUMMARY IN BULLET POINTS:"""
BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template,
input_variables=["text"])
chain = load_summarize_chain(llm,
chain_type="stuff",
prompt=BULLET_POINT_PROMPT)
output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary,
width=100,
break_long_words=False,
replace_whitespace=False)
print(wrapped_text)
Here is the generated summary:
• World War II was a global conflict that lasted from 1939 to 1945, resulting in an estimated 70 to
85 million fatalities.
• It was fought between two opposing military alliances: the Allies and the
Axis.
• Causes of the war include the Second Italo-Ethiopian War, Spanish Civil War, Second Sino-
Japanese War, Soviet–Japanese border conflicts, the rise of fascism in Europe, and European tensions
in the aftermath of World War I.
• The war began on 1 September 1939 with the German invasion of
Poland and the United Kingdom and France's declaration of war on Germany two days later.
• The war
ended with the armistice of 15 August 1945 (V-J Day) and the formal surrender of Japan on 2
September 1945.
• The war resulted in the dissolution of the League of Nations and the creation of
the United Nations, the emergence of the United States and the Soviet Union as rival superpowers,
and the decolonisation of Asia and Africa.
If you compare this summary with the map_reduce output, you can see that it is far more detailed.
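The bullets in the output above break in odd places (for example "• The war" alone on a line). That is what textwrap.fill tends to produce when replace_whitespace=False is applied to text that already contains newlines, since the whole text is still wrapped as one paragraph. A small sketch of one way around it, wrapping each line separately, is shown here; the helper name `wrap_lines` is hypothetical.

```python
import textwrap


def wrap_lines(text, width=100):
    """Wrap each line of text separately so existing line breaks (e.g. between bullets) are kept."""
    wrapped = [
        textwrap.fill(line, width=width, break_long_words=False)
        for line in text.splitlines()
    ]
    return "\n".join(wrapped)


print(wrap_lines("• one two three four\n• five", width=12))
# • one two
# three four
# • five
```

In the tutorial code, this would mean calling wrap_lines(output_summary) in place of the textwrap.fill call.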
Advantages
- It only requires a single call to the Language Model (LLM), simplifying the process.
- The LLM has access to all the data simultaneously when generating text, allowing for comprehensive context.
Disadvantages
- Many Language Models have a context length limitation, meaning that large documents or a high volume of documents may exceed the maximum allowed prompt size.
- When the prompt size surpasses the context length, it becomes infeasible to process the data accurately with the LLM.
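The context length problem can be illustrated with a small plain-Python sketch of stuffing. Everything here is an assumption made for illustration: the function name `stuff_prompt`, the prompt wording, the `max_tokens` value, and the rough rule of about four characters per token. A real setup would count tokens with the model's own tokenizer.

```python
def stuff_prompt(chunks, max_tokens=4000):
    """Build one prompt from all chunks, refusing if a rough size estimate exceeds max_tokens."""
    prompt = "Write a concise summary of the following:\n\n" + "\n\n".join(chunks)
    # Very rough estimate: about four characters per token (an assumption, not a real tokenizer)
    estimated_tokens = len(prompt) // 4
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"Prompt is about {estimated_tokens} tokens, over the assumed limit of {max_tokens}"
        )
    return prompt


prompt = stuff_prompt(["First chunk.", "Second chunk."])
print(prompt.count("chunk"))  # 2
```

Because every chunk lands in the same prompt, the only remedies when the check fails are to send fewer chunks or to switch to a chain type that makes several calls.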
Using “refine” chain
The ‘refine’ chain runs an initial prompt on the first chunk of data to generate an output. That output is then passed to the language model together with the next document, and the model is asked to refine it in light of the new content. This repeats for each remaining document.
This iterative refinement process helps create a more accurate summary. We set up the ‘refine’ chain using the load_summarize_chain function and obtain the summarized output.
chain = load_summarize_chain(llm, chain_type="refine")
output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)
Here is the generated summary:
World War II was a global conflict that lasted from 1939 to 1945. It involved the majority of the
world's countries, including all of the great powers, and resulted in an estimated 70 to 85 million
fatalities, mostly among civilians. Causes of the war included the Second Italo-Ethiopian War,
Spanish Civil War, Second Sino-Japanese War, and European tensions in the aftermath of World War I.
The war began on 1 September 1939 when Nazi Germany invaded Poland, and the United Kingdom and
France declared war on Germany on 3 September. Germany and the Soviet Union had partitioned Poland
and marked out their "spheres of influence" across Finland, Estonia, Latvia, Lithuania and Romania.
The war continued primarily between the European Axis powers and the British Empire, with war in the
Balkans, the aerial Battle of Britain, and the Blitz. Japan, which aimed to dominate Asia and the
Pacific, was at war with the Republic of China by 1937. In December 1941, Japan attacked American
and British territories with near-simultaneous offensives against Southeast Asia and the Central
Pacific, including an attack on the U.S. fleet at Pearl Harbor which resulted in the United States
and United Kingdom declaring war against Japan. The European Axis powers declared war on the
The summary might seem incomplete. That is because we used only 4 of the chunks for the purposes of this tutorial; had we used the entire document, it would have been longer and much better. The abrupt ending in mid-sentence may also reflect the model's limit on output length.
Advantages
- It can retrieve and incorporate more relevant context, potentially resulting in less information loss compared to MapReduceDocumentsChain.
- It may offer better accuracy and fidelity in processing the data by leveraging the specific context.
Disadvantages
- It requires a higher number of calls to the Language Model (LLM) compared to StuffDocumentsChain, potentially increasing the computational overhead.
- The calls to the LLM are not independent, meaning they cannot be parallelized like in MapReduceDocumentsChain, potentially impacting the overall processing time.
- There could be dependencies on the ordering of the documents, introducing complexities in the workflow and potentially affecting the desired outcome.
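The sequential nature of refine, and why its calls cannot be parallelized, can be sketched in a few lines of plain Python. This is not LangChain's refine implementation: the prompt wording is invented, and `recording_llm` is a hypothetical stub that only records the prompts it receives and returns a numbered placeholder.

```python
def refine_summarize(chunks, llm):
    # First call: summarize the first chunk on its own
    summary = llm("Summarize the following:\n" + chunks[0])
    # Later calls: each one needs the running summary from the previous call,
    # so the calls have to happen one after another, in document order
    for chunk in chunks[1:]:
        summary = llm(
            "Existing summary:\n" + summary
            + "\nRefine it using this new text:\n" + chunk
        )
    return summary


calls = []


def recording_llm(prompt):
    """Hypothetical stub: records each prompt and returns a numbered placeholder summary."""
    calls.append(prompt)
    return f"summary-{len(calls)}"


result = refine_summarize(["chunk A", "chunk B", "chunk C"], recording_llm)
print(result)      # summary-3
print(len(calls))  # 3
```

The second recorded prompt contains both "summary-1" and "chunk B", which shows how each step builds on the previous output and why reordering the documents can change the result.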
Conclusion
In conclusion, LangChain offers significant advantages as a tool for document summarization. It addresses the challenges of processing large volumes of data by breaking down documents into smaller sections, enabling parallel or sequential processing. It also incorporates contextual information from previous documents, ensuring coherent and precise summaries.
With its distinct chain types, such as “stuff,” “map_reduce,” and “refine,” LangChain provides efficient and accurate methods for document summarization. By leveraging these capabilities, we can create high-quality document summaries with ease. Its innovative approach to natural language processing makes it a valuable tool in the field of document summarization.