Lite Oute 2 Mamba2Attn 250M

Our new lightweight Mamba2 hybrid models with attention layers

Introduction

We're excited to introduce our latest model, the Lite Oute 2 Mamba2Attn 250M. This is our third-generation model, featuring the new Mamba2 architecture with attention layers.

Model Variants

The Lite Oute 2 Mamba2Attn 250M model comes in two variants: Lite-Oute-2-Mamba2Attn-250M-Base, the pre-trained base model, and Lite-Oute-2-Mamba2Attn-250M-Instruct, the instruction-tuned version.

Architecture Overview

Lite Oute 2 is a 250-million-parameter model built as a 32-layer hybrid stack that combines Mamba2 blocks with a small number of self-attention layers and MLP layers.

Attention Layer Integration

Our model incorporates 4 attention layers, which constitute 12.5% of the total 32 layers. This design choice is based on insights from the paper "An Empirical Study of Mamba-based Language Models", which suggests that validation loss is minimized when approximately 8% of the layers are self-attention layers.

The paper also indicates that 30-50% of the layers can be MLPs without increasing model loss, so we opted for a relatively high MLP ratio to improve efficiency.
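
As a concrete illustration, here is a minimal sketch of what such a layer schedule could look like. The even spacing of the attention layers and the Mamba2/MLP alternation (which lands at 50% MLPs, the top of the range above) are assumptions for illustration, not the model's actual layout:

```python
# Illustrative hybrid layer schedule: 32 layers, 4 of them attention (12.5%),
# with the remaining slots alternating between Mamba2 and MLP blocks.
# The placement pattern is an assumption, not the model's actual layout.
N_LAYERS = 32
N_ATTN = 4  # 12.5% of the stack, per the post

# Spread the attention layers evenly through the stack.
attn_positions = {round((i + 0.5) * N_LAYERS / N_ATTN) for i in range(N_ATTN)}

schedule = [
    "attention" if i in attn_positions else ("mamba2" if i % 2 == 0 else "mlp")
    for i in range(N_LAYERS)
]

for kind in ("attention", "mamba2", "mlp"):
    print(f"{kind}: {schedule.count(kind)}/{N_LAYERS} "
          f"({schedule.count(kind) / N_LAYERS:.1%})")
```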

Training Process

The model was pre-trained on 30 billion tokens using a balanced mixture of datasets.

Training was conducted on a single NVIDIA RTX 4090 and a single NVIDIA H100 GPU.

To improve training efficiency and performance, we employed several optimization techniques.

Instruction Training

For instruction training, we first trained the model with supervised fine-tuning (SFT).

After the SFT stage, we further refined the model with Direct Preference Optimization (DPO).
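
The post doesn't include the training code, but a minimal sketch of a DPO stage using the TRL library might look like the following. The checkpoint path, dataset file, and hyperparameters are placeholders, and argument names can differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholders: the SFT checkpoint path and preference dataset are assumptions.
model = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")

# DPO expects preference pairs: columns "prompt", "chosen", and "rejected".
dataset = load_dataset("json", data_files="preference_pairs.json", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta is illustrative
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions name this `tokenizer`
)
trainer.train()
```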

Benchmark Results

Instruct model:

| Benchmark | Lite-Oute-2-Mamba2Attn-250M-Instruct | Lite-Oute-1-300M-Instruct | Lite-Mistral-150M-v2-Instruct |
| --- | --- | --- | --- |
| ARC-C (0-shot) | 26.71 | 26.02 | - |
| ARC-E (0-shot) | 53.70 | 49.79 | 39.52 |
| HellaSWAG (0-shot) | 38.19 | 34.50 | 31.01 |
| MMLU (0-shot) | 25.13 | 24.00 | 25.28 |
| OpenBookQA (0-shot) | 32.20 | 32.20 | 28.40 |
| PIQA (0-shot) | 66.59 | 65.40 | 60.23 |
| Winogrande (0-shot) | 53.28 | 53.75 | 51.78 |
| ARC-C (5-shot) | 27.82 | 26.37 | - |
| ARC-E (5-shot) | 57.28 | 51.43 | 46.30 |
| HellaSWAG (5-shot) | 37.71 | 34.93 | 30.73 |
| MMLU (5-shot) | 25.81 | 25.87 | - |
| OpenBookQA (5-shot) | 31.80 | 31.40 | 28.00 |
| PIQA (5-shot) | 66.38 | 65.07 | 62.68 |
| Winogrande (5-shot) | 51.14 | 52.01 | 51.30 |

Base model:

| Benchmark | Lite-Oute-2-Mamba2Attn-250M-Base |
| --- | --- |
| ARC-C (0-shot) | 26.88 |
| ARC-E (0-shot) | 53.54 |
| HellaSWAG (0-shot) | 38.00 |
| MMLU (0-shot) | 24.87 |
| OpenBookQA (0-shot) | 30.20 |
| PIQA (0-shot) | 66.27 |
| Winogrande (0-shot) | 52.01 |
| ARC-C (5-shot) | 27.22 |
| ARC-E (5-shot) | 55.51 |
| HellaSWAG (5-shot) | 38.17 |
| MMLU (5-shot) | 25.59 |
| OpenBookQA (5-shot) | 30.40 |
| PIQA (5-shot) | 66.59 |
| Winogrande (5-shot) | 52.49 |
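
The post doesn't say which evaluation harness produced these numbers. One common way to reproduce this kind of table is EleutherAI's lm-evaluation-harness; a sketch (with an assumed Hub repo id) might look like:

```python
import lm_eval

# Repo id is an assumption; set num_fewshot to 0 or 5 to match the tables above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu",
           "openbookqa", "piqa", "winogrande"],
    num_fewshot=0,
)
print(results["results"])
```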

Interfacing with the Instruct Model

To ensure optimal performance, please use the following template when interacting with the model:

<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n
{The model's response}</s>

When prompting the model for a response, your input should be structured as follows:

<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n

This format ensures the best performance, as the model was fine-tuned using this specific template.

Recommended Settings

If you're using a Hugging Face tokenizer, make sure to use tokenizer.apply_chat_template or adjust your interface to match the template above.
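
For example, a minimal generation script with transformers might look like this. The Hub repo id is assumed from the model name, and the sampling settings are illustrative rather than official recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repo id; replace with the actual checkpoint location.
model_id = "OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# apply_chat_template renders the <s>system/user/assistant format shown above.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=128, do_sample=True,
                        temperature=0.4)  # illustrative sampling settings
print(tokenizer.decode(output[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```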

Let's look at some real-world examples:

Instruct Model

<s>user
What are three key benefits of electric vehicles compared to traditional gasoline-powered cars?</s>
<s>assistant

Electric vehicles have several advantages over gasoline-powered cars. Here are three key benefits:
1. Efficient: Electric vehicles use less fuel and emit fewer emissions than gasoline-powered cars, which can save consumers money on their monthly bills.
2. Fuel economy: Electric vehicles produce more power per mile driven, which means they consume less fuel overall. This helps reduce the need for refueling stations and reduces carbon emissions from transportation.
3. Flexibility: Electric vehicles can be adapted to different driving conditions, such as snowy or rainy days, and can accommodate different types of terrain. This makes them easier to operate in urban areas with limited parking spaces.
Overall, electric vehicles offer a range of benefits that make them an attractive choice for many drivers. They provide a cost-effective alternative to gasoline-powered cars, reducing the need for frequent refueling stations, and making it easier to drive long distances without having to worry about running out of gas.</s>

<s>user
Who wrote the play "Romeo and Juliet"?</s>
<s>assistant

The play "Romeo and Juliet" by William Shakespeare is a classic example of a tragic love story. It tells the story of Romeo and Juliet, a young couple who fall in love with each other but ultimately lose their lives due to their respective flaws. The play was written during the Renaissance period and it has been performed around the world since then.</s>

<s>user
What is the capital of France?</s>
<s>assistant

The capital of France is Paris.</s>

Conclusion

The Lite-Oute-2-Mamba2Attn-250M-Instruct model demonstrates notable improvements over its predecessors, particularly in zero-shot and few-shot learning tasks. Despite its smaller size compared to Lite-Oute-1-300M-Instruct, it consistently outperforms both older models across most benchmarks. The model shows enhanced capabilities in tasks requiring common sense reasoning, language understanding, and general knowledge. Real-world examples indicate that the model can provide coherent and relevant responses to various queries, though there's room for improvement in accuracy and depth of knowledge.

Strengths:

Consistent gains over Lite-Oute-1-300M-Instruct on most zero-shot and few-shot benchmarks, despite having fewer parameters
Coherent, relevant responses to straightforward factual questions

Weaknesses:

Limited accuracy and depth of knowledge; longer answers can drift into vague or factually muddled claims, as the electric-vehicle example above shows
Occasional losses to Lite-Oute-1-300M-Instruct on individual benchmarks such as Winogrande

Fine-Tuning for Specific Tasks

For optimal task-specific performance, it is recommended to fine-tune the model on datasets relevant to the intended use case. Fine-tuning can significantly improve the model's accuracy, relevance, and depth of knowledge in specialized domains or for particular types of tasks. This process allows the model to adapt its pre-trained knowledge to the nuances and requirements of specific applications, potentially mitigating some of the weaknesses observed in general use.
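
As a starting point, a task-specific fine-tune with TRL's SFTTrainer could be sketched as follows. The repo id, dataset file, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset: expects a "messages" column of chat turns
# (or a plain "text" column), depending on how your data is formatted.
dataset = load_dataset("json", data_files="my_task_data.json", split="train")

trainer = SFTTrainer(
    model="OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct",  # assumed repo id
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="lite-oute-2-task-ft",
        per_device_train_batch_size=8,
        learning_rate=2e-5,  # illustrative hyperparameters
        num_train_epochs=2,
    ),
)
trainer.train()
```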

Limitations and Considerations

Users should be aware that Lite Oute 2 is a relatively small language model with only 250 million parameters. As such, it has limitations in its knowledge, reasoning capabilities, and overall performance. This model can and will produce nonsensical or incorrect outputs at times. It is not a replacement for larger, more sophisticated models or human expertise. Users should always critically evaluate the model's outputs and not rely on them for critical decision-making without proper verification. This model is best suited for lightweight applications or as a starting point for further fine-tuning and development.