Llama 2: The Next Big Thing After ChatGPT (Training & Comparison)

LLAMA 2 Training and Fine Tuning Guide

In this article, we will discuss the training, comparisons, and impact of Meta's latest open-source model in the generative AI space: Llama 2.

What is Llama 2?

Meta, in collaboration with Microsoft, has released the next iteration of its Llama models. Llama 2 is a collection of pre-trained large language models (LLMs) ranging from 7 billion to 70 billion parameters.

The release includes the Llama 2-Chat models (7, 13, and 70 billion parameters), fine-tuned versions of Llama 2 for dialogue use cases. The majority of today's top open-source LLMs are built on top of Llama, the Falcon models being a notable exception.

Llama 2-based models outperform other current open-source models and can replace closed models like ChatGPT for suitable applications. The most exciting aspect of the release is Meta's stated goal of keeping AI models open so that everyone can benefit.

Meta believes open-source AI will open up economic and social opportunities for everyone. Llama 2 is available free of charge for research and commercial use.

Llama 2 Architecture and Training

The architecture of the Llama 2 models is similar to that of Llama 1, with the addition of Grouped-Query Attention (GQA) in the 34B and 70B parameter models for higher quality and inference speed. Llama 2 is trained on 40% more data than Llama 1 and has double the context length (4,096 tokens versus 2,048).
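The core idea of GQA is that several query heads share a single key/value head, shrinking the KV cache during inference. A minimal sketch of the head-to-group mapping (the 64-query-head / 8-KV-head grouping below matches what the paper reports for the 70B model; the function itself is an illustration, not Llama 2's implementation):

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head to the key/value head it shares under GQA.

    Query heads are split into n_kv_heads contiguous groups; every head
    in a group attends with the same K/V projection, shrinking the KV
    cache by a factor of n_q_heads / n_kv_heads.
    """
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# With 64 query heads sharing 8 KV heads, heads 0-7 share KV head 0,
# heads 8-15 share KV head 1, and so on.
print(kv_head_for_query_head(0, 64, 8))   # -> 0
print(kv_head_for_query_head(9, 64, 8))   # -> 1
```

Setting `n_kv_heads = n_q_heads` recovers standard multi-head attention, and `n_kv_heads = 1` recovers multi-query attention; GQA sits between the two.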


The authors of Llama 2 trained the models on a new mix of publicly available data sources, excluding any data from Meta's products or services. They also made an effort to remove data from sites known to contain personal information about individuals. The models were trained on a massive amount of data: 2 trillion tokens, which the authors judged a good trade-off between performance and cost. To improve knowledge and reduce false-information generation, they up-sampled the most factual sources.
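Meta has not published its actual data mix or up-sampling factors, but the mechanism itself is simple: trusted sources are repeated in the sampling pool so their documents are drawn more often. A hypothetical sketch (source names and factors are invented for illustration):

```python
def build_sampling_pool(sources):
    """Expand each source name by its (hypothetical) up-sampling factor.

    Repeating a source in the pool makes its documents proportionally
    more likely to be drawn when shuffling training data.
    """
    pool = []
    for name, factor in sources.items():
        pool.extend([name] * factor)
    return pool

# Illustrative factors only -- not Meta's actual mix.
pool = build_sampling_pool({"encyclopedic": 3, "web_crawl": 1, "forums": 1})
print(pool.count("encyclopedic") / len(pool))  # -> 0.6
```

Here the "encyclopedic" source ends up as 60% of draws despite being one of three sources, which is the intended effect of up-sampling factual data.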

Fine Tuning

Figure: Fine-tuning and training of Llama 2-Chat.

Supervised Fine Tuning(SFT)

Fine-tuning the Llama 2-Chat model took Meta months of research. Meta began with a supervised fine-tuning (SFT) stage using publicly available instruction-tuning data, as was done for Llama 1.

However, the researchers observed that much of this data lacked sufficient diversity and quality, particularly for aligning LLMs toward dialogue-style instructions. Setting aside third-party datasets and using fewer but higher-quality examples improved results notably.

The researchers found that tens of thousands of SFT annotations (around 27,000 prompts were used) are enough for high-quality results.
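For SFT, the paper describes concatenating each prompt and answer, separated by a special token, and zeroing out the loss on the prompt tokens so the model only learns to produce the answer. A minimal sketch of that label masking (token IDs are made up; `-100` is the ignore value used by common cross-entropy implementations):

```python
IGNORE_INDEX = -100  # label value that common loss implementations skip

def build_sft_labels(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer, masking loss on the prompt.

    Tokens are joined with a separator; label positions covering the
    prompt (and separator) are set to IGNORE_INDEX so gradients flow
    only through the answer tokens.
    """
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return input_ids, labels

ids, labels = build_sft_labels([5, 6, 7], [9, 10], sep_id=2)
print(ids)     # -> [5, 6, 7, 2, 9, 10]
print(labels)  # -> [-100, -100, -100, -100, 9, 10]
```

This way a single packed sequence can carry both sides of the exchange while the loss still targets only the assistant's reply.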

Reinforcement Learning with Human Feedback (RLHF)

In the RLHF stage, a fine-tuned language model is further aligned with human preferences and instructions. The researchers collected a massive set of human preference data for reward modelling using teams of annotators.

The annotation procedure had annotators write a prompt and then choose between two responses sampled from different model variants (with varying temperature hyper-parameters). Annotators also labelled the degree of their preference for the chosen response over the alternative: significantly better, better, slightly better, or negligibly better / unsure.
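One such comparison can be pictured as a small record. The schema below is hypothetical (the paper does not publish its data format), but the four preference degrees are the ones it describes; the degree later serves as a margin signal when training the reward model:

```python
# Degrees of preference recorded by annotators, per the paper.
PREFERENCE_DEGREES = (
    "significantly better",
    "better",
    "slightly better",
    "negligibly better / unsure",
)

def make_preference_record(prompt, chosen, rejected, degree):
    """Package one human comparison (hypothetical schema).

    Keeps the prompt, the preferred and rejected responses, and how
    strongly the annotator preferred the winner.
    """
    if degree not in PREFERENCE_DEGREES:
        raise ValueError(f"unknown degree: {degree}")
    return {"prompt": prompt, "chosen": chosen,
            "rejected": rejected, "degree": degree}

record = make_preference_record(
    "Explain photosynthesis simply.",
    "Plants turn sunlight into food.",
    "Photosynthesis is a process.",
    "better",
)
```

A large set of such records, paired chosen-versus-rejected, is what the reward model is trained on.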

The preferences focused on helpfulness (how well responses fulfil users' requests and provide the requested information) and safety (whether responses are unsafe).

Reward Modelling

The reward model takes a model response and its corresponding prompt (including previous context) as inputs and outputs a scalar score indicating the quality (e.g. helpfulness or safety) of the response. Llama 2-Chat uses these reward scores during RLHF to optimize for better-quality responses.

It is difficult for a single model to score both helpfulness and safety well. To address this, Meta's researchers trained two separate reward models: one optimized for helpfulness (the Helpfulness RM) and another for safety (the Safety RM). Both reward models were trained on a combination of existing open preference datasets and the newly collected data.
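During RLHF the two reward models' scores have to be reconciled into one training signal. The paper describes a piecewise combination in which the safety score takes over for potentially unsafe prompts or low-scoring-on-safety responses; the sketch below captures that shape, but the floor value and exact logic here are illustrative, not Meta's published formula:

```python
def combined_reward(help_score, safety_score,
                    prompt_flagged_unsafe, safety_floor=0.15):
    """Pick which reward model's score drives the RLHF update.

    Illustrative logic: when the prompt is flagged as potentially
    unsafe, or the safety model scores the response below a floor,
    the safety score is used; otherwise the helpfulness score is.
    """
    if prompt_flagged_unsafe or safety_score < safety_floor:
        return safety_score
    return help_score

print(combined_reward(0.9, 0.8, prompt_flagged_unsafe=False))   # -> 0.9
print(combined_reward(0.9, 0.05, prompt_flagged_unsafe=False))  # -> 0.05
print(combined_reward(0.9, 0.8, prompt_flagged_unsafe=True))    # -> 0.8
```

The effect is that safety acts as a gate: a response only gets credit for helpfulness once it clears the safety check.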

Iterative Fine-tuning

RLHF fine-tuning uses two main algorithms: Proximal Policy Optimization (PPO) and rejection-sampling fine-tuning.

PPO helps the model learn from feedback and adjust its policy, while rejection sampling has the model pick the best answer from a set of generated candidates and feeds that winner back to improve it. With each iteration, the model becomes more skilled at conversing like a human. For more information on these reinforcement-learning algorithms, refer to the Llama 2, PPO, and Llama 1 papers.
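Rejection sampling reduces to best-of-k selection under the reward model. A minimal sketch, with toy stand-ins for the policy and reward models (in the real pipeline the winner becomes a new training target rather than a final answer):

```python
def rejection_sample(prompt, generate, reward, k=4):
    """Best-of-k sampling: draw k candidates, keep the highest-reward one.

    `generate` and `reward` are stand-ins for the policy model and the
    reward model.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=reward)

# Toy stand-ins: generation returns canned drafts, and the "reward"
# simply prefers longer replies (purely for demonstration).
drafts = iter(["short", "a longer reply", "mid reply", "hey"])
best = rejection_sample("hi", generate=lambda p: next(drafts),
                        reward=len, k=4)
print(best)  # -> "a longer reply"
```

PPO, by contrast, updates the policy's weights directly from the reward signal rather than filtering samples; the paper uses rejection sampling in the early RLHF rounds and PPO on top of it in the final one.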

Evaluating LLMs is a complex problem. Human evaluation is the gold standard but does not scale, so Meta's researchers used the reward models to measure the model's improvement across RLHF iterations. The authors tested the reliability of the reward models on a held-out test set they collected, judging how well the reward models align with human preferences.

Ghost Attention (System Message for multi-turn consistency)

Ghost Attention (GAtt) is a technique that helps the model remember initial instructions throughout a dialogue. Based on the idea of context distillation, it takes an instruction that should be obeyed throughout the conversation and synthetically attaches it to all user messages when building training data.

Researchers report that after applying GAtt, the model consistently follows instructions for 20+ turns, up to the maximum context length.
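The data-construction step can be pictured as simple string manipulation. A minimal sketch of GAtt-style augmentation, assuming plain-text turns (the real pipeline works on sampled dialogues and then drops the instruction from all but the first turn at training time):

```python
def gatt_augment(instruction, user_turns):
    """Prepend a system instruction to every user turn (GAtt-style data).

    Ghost Attention builds synthetic training dialogues in which the
    instruction is attached to each user message, teaching the model
    to honor it across turns.
    """
    return [f"{instruction} {turn}" for turn in user_turns]

aug = gatt_augment("Always answer as a pirate.",
                   ["Hi there!", "What's the weather like?"])
print(aug[1])  # -> "Always answer as a pirate. What's the weather like?"
```

Because the model sees the instruction restated at every turn during data generation but only once at training time, it learns to keep applying it long after the first message.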

Figure: Ghost Attention (GAtt) in Llama 2.

Benchmarks and Comparisons

Figure 12: Human evaluation results for Llama 2-Chat models compared on ~4,000 helpfulness prompts (three raters per prompt).
Figure: Safety human-evaluation results for Llama 2-Chat compared to other models.

Based on human-evaluation benchmarks, the largest Llama 2-Chat model is competitive with ChatGPT: the Llama 2-Chat 70B model has a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. It also outperforms the PaLM-Bison chat model by a large margin on Meta's prompt set.

But there is always a safety risk in how these models are used. The remaining limitations are inherited ones, such as bias and hallucinations, which the authors report are lower in Llama 2 than in comparable models.

There have recently been articles about how adding safety restrictions to ChatGPT degraded its performance over time; Llama 2 appears to handle this trade-off better. The Llama 2 models are not multilingual and are only proficient at English text generation. They are also not strong at coding tasks, as they are built for chat purposes.

Why is Llama 2 the next big thing after ChatGPT?

Llama 2 offers a diverse set of models, and you can access it through Azure, AWS, Hugging Face, or by running it on a local device. The open-source nature of the model allows users to fine-tune it for specific environments and custom use cases. Llama 2 70B's performance is close to that of closed-source models.

The smaller models are good enough to run on less powerful devices. Imagine talking to a chat model on your smartphone and getting a response even when you are not connected to the internet! Qualcomm has announced a partnership with Meta to run Llama 2 locally on Qualcomm-powered smartphones starting in 2024.

Not every company can afford to train a custom chatbot, model, or assistant for its employees or customers, but this is now possible with the latest Llama release. Being open source, it can be fine-tuned, trained further, extended with additional layers, and engineered to fit requirements. The release has broadened access to foundational AI technology worldwide.

Get ready to witness a wave of upcoming AI applications built on these models.

Try Out Llama 2

To try out the Llama 2 models yourself, check out the free online demos offered by perplexity.ai and Hugging Face. If you would like to run it locally, the following Colab code runs the Llama 2 7B chat model. To run other Llama 2 models locally, refer to the GitHub repository text-generation-webui-colab.

Before you run the code, get access to the Llama 2 models from Meta by filling out the request-access form. You should receive an access-confirmation email from Meta within a day, with a custom URL through which you can download the model weights. (The Colab code below instead downloads a copy hosted on the Hugging Face Hub, so it does not prompt for this URL.)

# Install the aria2 download accelerator
%cd /content
!apt-get -y install -qq aria2

# Clone the text-generation-webui interface and install its dependencies
!git clone -b v1.8 https://github.com/camenduru/text-generation-webui
%cd /content/text-generation-webui
!pip install -r requirements.txt

# Download the Llama-2-7b-chat-hf weights, tokenizer, and config files in parallel
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o model-00001-of-00002.safetensors
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/resolve/main/model-00002-of-00002.safetensors -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o model-00002-of-00002.safetensors
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/raw/main/model.safetensors.index.json -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o model.safetensors.index.json
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/raw/main/special_tokens_map.json -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o special_tokens_map.json
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/resolve/main/tokenizer.model -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o tokenizer.model
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/raw/main/tokenizer_config.json -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o tokenizer_config.json
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/raw/main/config.json -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o config.json
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/4bit/Llama-2-7b-chat-hf/raw/main/generation_config.json -d /content/text-generation-webui/models/Llama-2-7b-chat-hf -o generation_config.json

# Launch the web UI with a public share link, loading the downloaded model
%cd /content/text-generation-webui
!python server.py --share --chat --model /content/text-generation-webui/models/Llama-2-7b-chat-hf
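However you run the model, the chat checkpoints respond best to the prompt template they were fine-tuned on: instructions wrapped in `[INST]` tags, with the system message inside `<<SYS>>` markers. A minimal helper for a single-turn prompt (the web UI above handles this formatting for you; this is for calling the model directly):

```python
def llama2_chat_prompt(system, user):
    """Format a single-turn prompt in Llama 2-Chat's expected template.

    The chat checkpoints were fine-tuned on prompts wrapped in [INST]
    tags, with the system message inside <<SYS>> markers.
    """
    return (f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
            f"{user} [/INST]")

prompt = llama2_chat_prompt("You are a helpful assistant.",
                            "Explain RLHF in one sentence.")
print(prompt)
```

Skipping this template often yields noticeably worse answers, since the model never saw bare prompts during fine-tuning.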


This article covers the key points from the technical paper accompanying the Llama 2 release. For further details about training and evaluation, refer to the research paper, Llama 2: Open Foundation and Fine-Tuned Chat Models.

The release of Llama 2 is a game changer in the space of text generation and interactive bots: it competes with, and beats, models of nearly double its parameter count. Here's an exciting article about Orca 13B that discusses the instruction-tuning process in detail, helping you understand it better.

(Text generated by Llama-2-13b-chat) I'm confident that the knowledge gained from this technical blog post will serve you well in your future endeavours. Keep exploring and learning, and remember that the possibilities are endless!
