Our new lightweight Mamba2 hybrid models with attention layers
We're excited to introduce our latest model, the Lite Oute 2 Mamba2Attn 250M. This is our third-generation model, featuring the new Mamba2 architecture combined with attention layers.
The Lite Oute 2 Mamba2Attn 250M model comes in two variants: a base model and an instruction-tuned (Instruct) model.
Lite Oute 2 features 250 million parameters and the following architectural specifications:
Our model incorporates 4 attention layers, which make up 12.5% of its 32 total layers. This design choice draws on insights from the paper "An Empirical Study of Mamba-based Language Models", which suggests that validation loss is minimized when approximately 8% of the layers are self-attention layers.
The paper also indicates that 30%-50% of the layers can be MLPs without increasing model loss, and we opted for an MLP ratio toward the higher end of that range to improve efficiency.
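As a rough illustration only (not the actual implementation), the snippet below shows one hypothetical way to place 4 attention layers within a 32-layer stack and verify the resulting share; the even spacing and the omission of MLP blocks are assumptions made purely for the example.

```python
# Hypothetical layer layout for a 32-layer hybrid stack.
# This only illustrates the 12.5% attention share discussed above;
# the real interleaving of Mamba2, attention, and MLP blocks differs.
TOTAL_LAYERS = 32
ATTENTION_LAYERS = 4

stride = TOTAL_LAYERS // ATTENTION_LAYERS  # one attention block every 8 layers (assumption)
layout = [
    "attention" if (i + 1) % stride == 0 else "mamba2"
    for i in range(TOTAL_LAYERS)
]

attention_share = layout.count("attention") / TOTAL_LAYERS
print(layout)
print(f"attention share: {attention_share:.1%}")  # -> 12.5%
```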
The model was pre-trained on 30 billion tokens using a balanced mixture of datasets:
Training was conducted on a single NVIDIA RTX 4090 and a single NVIDIA H100 GPU, with the following key parameters:
To enhance training efficiency and performance, we employed various optimization techniques, including:
For instruction training, we first trained the model with Supervised Fine-tuning (SFT):
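Exact recipe aside, a minimal sketch of this stage, assuming a recent trl release, a hypothetical chat-formatted dataset, and placeholder hyperparameters (none of the values below are our published settings), could look roughly like this:

```python
# Minimal SFT sketch (illustrative only; not the exact training recipe).
# Assumes a recent trl release and a transformers version that supports
# this model's Mamba2-based architecture.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical stand-in dataset with chat-formatted "messages" examples.
train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="OuteAI/Lite-Oute-2-Mamba2Attn-250M-Base",
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="lite-oute-2-sft",
        per_device_train_batch_size=8,  # placeholder values
        learning_rate=5e-5,
        num_train_epochs=1,
    ),
)
trainer.train()
```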
After the SFT training, we further refined the model using DPO training:
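Again only as a sketch: assuming trl's DPOTrainer, a hypothetical preference dataset with prompt/chosen/rejected columns, and placeholder values (the checkpoint path, dataset, and beta below are illustrative, not our actual settings), this stage could be wired up roughly like this:

```python
# Minimal DPO sketch (illustrative only; not the exact training recipe).
# Assumes a recent trl release and a preference dataset with
# "prompt", "chosen", and "rejected" columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "lite-oute-2-sft"  # hypothetical path to the SFT checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical stand-in preference dataset.
pref_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="lite-oute-2-dpo",
        beta=0.1,                       # placeholder value
        per_device_train_batch_size=4,  # placeholder value
    ),
    train_dataset=pref_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older trl releases
)
trainer.train()
```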
Benchmark results for the Instruct model, compared with our previous instruction-tuned models:

Benchmark | Lite-Oute-2-Mamba2Attn-250M-Instruct | Lite-Oute-1-300M-Instruct | Lite-Mistral-150M-v2-Instruct |
---|---|---|---|
ARC-C (0-shot) | 26.71 | 26.02 | - |
ARC-E (0-shot) | 53.70 | 49.79 | 39.52 |
HellaSWAG (0-shot) | 38.19 | 34.50 | 31.01 |
MMLU (0-shot) | 25.13 | 24.00 | 25.28 |
OpenBookQA (0-shot) | 32.20 | 32.20 | 28.40 |
PIQA (0-shot) | 66.59 | 65.40 | 60.23 |
Winogrande (0-shot) | 53.28 | 53.75 | 51.78 |
ARC-C (5-shot) | 27.82 | 26.37 | - |
ARC-E (5-shot) | 57.28 | 51.43 | 46.30 |
HellaSWAG (5-shot) | 37.71 | 34.93 | 30.73 |
MMLU (5-shot) | 25.81 | 25.87 | - |
OpenBookQA (5-shot) | 31.80 | 31.40 | 28.00 |
PIQA (5-shot) | 66.38 | 65.07 | 62.68 |
Winogrande (5-shot) | 51.14 | 52.01 | 51.30 |
Benchmark results for the base model:

Benchmark | Lite-Oute-2-Mamba2Attn-250M-Base |
---|---|
ARC-C (0-shot) | 26.88 |
ARC-E (0-shot) | 53.54 |
HellaSWAG (0-shot) | 38.00 |
MMLU (0-shot) | 24.87 |
OpenBookQA (0-shot) | 30.20 |
PIQA (0-shot) | 66.27 |
Winogrande (0-shot) | 52.01 |
ARC-C (5-shot) | 27.22 |
ARC-E (5-shot) | 55.51 |
HellaSWAG (5-shot) | 38.17 |
MMLU (5-shot) | 25.59 |
OpenBookQA (5-shot) | 30.40 |
PIQA (5-shot) | 66.59 |
Winogrande (5-shot) | 52.49 |
To ensure optimal performance, please use the following template when interacting with the model:
<s>system
{System message}</s>
<s>user
{Your question or message}</s>
<s>assistant
{The model's response}</s>
When prompting the model for a response, your input should be structured as follows:
<s>system
{System message}</s>
<s>user
{Your question or message}</s>
<s>assistant
This format ensures the best performance, as the model was fine-tuned using this specific template.
If you're using a Hugging Face tokenizer, make sure to use tokenizer.apply_chat_template
or adjust your interface to match the template above.
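For example, with the Hugging Face transformers library (assuming a release recent enough to support this model's Mamba2-based architecture, and using the Instruct checkpoint's repository id), prompting the model through the chat template could look like this:

```python
# Sketch: prompting the Instruct model via its chat template.
# Assumes a transformers version that supports this model's architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# apply_chat_template builds the <s>system / <s>user / <s>assistant prompt shown above.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,  # placeholder sampling settings
)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```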
The Lite-Oute-2-Mamba2Attn-250M-Instruct model demonstrates notable improvements over its predecessors, particularly in zero-shot and few-shot learning tasks. Despite its smaller size compared to Lite-Oute-1-300M-Instruct, it consistently outperforms both older models across most benchmarks. The model shows enhanced capabilities in tasks requiring common sense reasoning, language understanding, and general knowledge. Real-world examples indicate that the model can provide coherent and relevant responses to various queries, though there's room for improvement in accuracy and depth of knowledge.
For optimal task-specific performance, it is recommended to fine-tune the model on datasets relevant to the intended use case. Fine-tuning can significantly improve the model's accuracy, relevance, and depth of knowledge in specialized domains or for particular types of tasks. This process allows the model to adapt its pre-trained knowledge to the nuances and requirements of specific applications, potentially mitigating some of the weaknesses observed in general use.
Users should be aware that Lite Oute 2 is a relatively small language model with only 250 million parameters. As such, it has limitations in its knowledge, reasoning capabilities, and overall performance. This model can and will produce nonsensical or incorrect outputs at times. It is not a replacement for larger, more sophisticated models or human expertise. Users should always critically evaluate the model's outputs and not rely on them for critical decision-making without proper verification. This model is best suited for lightweight applications or as a starting point for further fine-tuning and development.