Training Mamba for Long-Context Sequences Using DeepSpeed

Overview

Mamba is a selective state space model architecture that performs competitively with Transformers on benchmarks while scaling linearly with sequence length. That makes it promising for long input sequences, but the base model was pretrained only with a context length of 2,048 tokens. This project continues pretraining Mamba on longer sequences from the SlimPajama dataset to test whether the model can then process long contexts accurately.
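
Training on longer sequences generally means turning tokenized documents into fixed-length examples that are longer than the original 2,048-token window. The following is a minimal sketch of one common way to do that (concatenating documents and slicing them into chunks); it is not taken from this project's code. The function name `pack_sequences`, the `eos_id` separator, and the toy `seq_len` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield chunks of exactly seq_len tokens.

    Assumption: documents are separated by a single eos_id token, and a
    trailing partial chunk is dropped rather than padded.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


# Toy token IDs; a real run would use tokenized SlimPajama text and a
# seq_len well above 2,048.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack_sequences(docs, seq_len=4, eos_id=0))
```

With packing, a single training example may span several documents, which keeps every sequence at the full target length without padding.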

Links

Tech stack