Our new lightweight Mamba2 hybrid models with attention layers
We're excited to introduce our latest model, the Lite Oute 2 Mamba2Attn 250M. This is our third-generation model, featuring the new Mamba2 architecture combined with attention layers.
The Lite Oute 2 Mamba2Attn 250M model comes in two variants: a base model and an instruction-tuned (Instruct) model.
Lite Oute 2 features 250 million parameters and the following architectural specifications:
Our model incorporates 4 attention layers, which make up 12.5% of its 32 total layers. This design choice draws on insights from the paper "An Empirical Study of Mamba-based Language Models", which suggests that validation loss is minimized when approximately 8% of the layers are self-attention layers.
The paper also indicates that 30%-50% of the layers can be MLPs without increasing model loss, and we opted for an MLP ratio toward the higher end of that range to improve efficiency.
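As a rough illustration only (not the actual implementation), the snippet below shows one hypothetical way to place 4 attention layers within a 32-layer stack and verify the resulting share; the even spacing and the omission of MLP blocks are assumptions made purely for the example.

```python
# Hypothetical layer layout for a 32-layer hybrid stack.
# This only illustrates the 12.5% attention share discussed above;
# the real interleaving of Mamba2, attention, and MLP blocks differs.
TOTAL_LAYERS = 32
ATTENTION_LAYERS = 4

stride = TOTAL_LAYERS // ATTENTION_LAYERS  # one attention block every 8 layers (assumption)
layout = [
    "attention" if (i + 1) % stride == 0 else "mamba2"
    for i in range(TOTAL_LAYERS)
]

attention_share = layout.count("attention") / TOTAL_LAYERS
print(layout)
print(f"attention share: {attention_share:.1%}")  # -> 12.5%
```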
The model was pre-trained on 30 billion tokens using a balanced mixture of datasets:
Training was conducted on a single NVIDIA RTX 4090 and a single NVIDIA H100 GPU, with the following key parameters:
To enhance training efficiency and performance, we employed various optimization techniques, including:
For instruction training, we first trained the model with Supervised Fine-tuning (SFT):
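Exact recipe aside, a minimal sketch of this stage, assuming a recent trl release, a hypothetical chat-formatted dataset, and placeholder hyperparameters (none of the values below are our published settings), could look roughly like this:

```python
# Minimal SFT sketch (illustrative only; not the exact training recipe).
# Assumes a recent trl release and a transformers version that supports
# this model's Mamba2-based architecture.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical stand-in dataset with chat-formatted "messages" examples.
train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="OuteAI/Lite-Oute-2-Mamba2Attn-250M-Base",
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="lite-oute-2-sft",
        per_device_train_batch_size=8,  # placeholder values
        learning_rate=5e-5,
        num_train_epochs=1,
    ),
)
trainer.train()
```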
After the SFT training, we further refined the model using DPO training:
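Again only as a sketch: assuming trl's DPOTrainer, a hypothetical preference dataset with prompt/chosen/rejected columns, and placeholder values (the checkpoint path, dataset, and beta below are illustrative, not our actual settings), this stage could be wired up roughly like this:

```python
# Minimal DPO sketch (illustrative only; not the exact training recipe).
# Assumes a recent trl release and a preference dataset with
# "prompt", "chosen", and "rejected" columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "lite-oute-2-sft"  # hypothetical path to the SFT checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical stand-in preference dataset.
pref_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="lite-oute-2-dpo",
        beta=0.1,                       # placeholder value
        per_device_train_batch_size=4,  # placeholder value
    ),
    train_dataset=pref_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older trl releases
)
trainer.train()
```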
Benchmark results for the Instruct model, compared with our previous instruction-tuned models:

Benchmark | Lite-Oute-2-Mamba2Attn-250M-Instruct | Lite-Oute-1-300M-Instruct | Lite-Mistral-150M-v2-Instruct |
---|---|---|---|
ARC-C (0-shot) | 26.71 | 26.02 | - |
ARC-E (0-shot) | 53.70 | 49.79 | 39.52 |
HellaSWAG (0-shot) | 38.19 | 34.50 | 31.01 |
MMLU (0-shot) | 25.13 | 24.00 | 25.28 |
OpenBookQA (0-shot) | 32.20 | 32.20 | 28.40 |
PIQA (0-shot) | 66.59 | 65.40 | 60.23 |
Winogrande (0-shot) | 53.28 | 53.75 | 51.78 |
ARC-C (5-shot) | 27.82 | 26.37 | - |
ARC-E (5-shot) | 57.28 | 51.43 | 46.30 |
HellaSWAG (5-shot) | 37.71 | 34.93 | 30.73 |
MMLU (5-shot) | 25.81 | 25.87 | - |
OpenBookQA (5-shot) | 31.80 | 31.40 | 28.00 |
PIQA (5-shot) | 66.38 | 65.07 | 62.68 |
Winogrande (5-shot) | 51.14 | 52.01 | 51.30 |
Benchmark results for the base model:

Benchmark | Lite-Oute-2-Mamba2Attn-250M-Base |
---|---|
ARC-C (0-shot) | 26.88 |
ARC-E (0-shot) | 53.54 |
HellaSWAG (0-shot) | 38.00 |
MMLU (0-shot) | 24.87 |
OpenBookQA (0-shot) | 30.20 |
PIQA (0-shot) | 66.27 |
Winogrande (0-shot) | 52.01 |
ARC-C (5-shot) | 27.22 |
ARC-E (5-shot) | 55.51 |
HellaSWAG (5-shot) | 38.17 |
MMLU (5-shot) | 25.59 |
OpenBookQA (5-shot) | 30.40 |
PIQA (5-shot) | 66.59 |
Winogrande (5-shot) | 52.49 |
To ensure optimal performance, please use the following template when interacting with the model:
<s>system
{System message}</s>
<s>user
{Your question or message}</s>
<s>assistant
{The model's response}</s>
When prompting the model for a response, your input should be structured as follows:
<s>system
{System message}</s>
<s>user
{Your question or message}</s>
<s>assistant
This format ensures the best performance, as the model was fine-tuned using this specific template.
If you're using a Hugging Face tokenizer, make sure to use tokenizer.apply_chat_template
or adjust your interface to match the template above.
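For example, with the Hugging Face transformers library (assuming a release recent enough to support this model's Mamba2-based architecture, and using the Instruct checkpoint's repository id), prompting the model through the chat template could look like this:

```python
# Sketch: prompting the Instruct model via its chat template.
# Assumes a transformers version that supports this model's architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# apply_chat_template builds the <s>system / <s>user / <s>assistant prompt shown above.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,  # placeholder sampling settings
)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```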
The Lite-Oute-2-Mamba2Attn-250M-Instruct model demonstrates notable improvements over its predecessors, particularly in zero-shot and few-shot learning tasks. Despite its smaller size compared to Lite-Oute-1-300M-Instruct, it consistently outperforms both older models across most benchmarks. The model shows enhanced capabilities in tasks requiring common sense reasoning, language understanding, and general knowledge. Real-world examples indicate that the model can provide coherent and relevant responses to various queries, though there's room for improvement in accuracy and depth of knowledge.
For optimal task-specific performance, it is recommended to fine-tune the model on datasets relevant to the intended use case. Fine-tuning can significantly improve the model's accuracy, relevance, and depth of knowledge in specialized domains or for particular types of tasks. This process allows the model to adapt its pre-trained knowledge to the nuances and requirements of specific applications, potentially mitigating some of the weaknesses observed in general use.
Users should be aware that Lite Oute 2 is a relatively small language model with only 250 million parameters. As such, it has limitations in its knowledge, reasoning capabilities, and overall performance. This model can and will produce nonsensical or incorrect outputs at times. It is not a replacement for larger, more sophisticated models or human expertise. Users should always critically evaluate the model's outputs and not rely on them for critical decision-making without proper verification. This model is best suited for lightweight applications or as a starting point for further fine-tuning and development.