Lance: ByteDance's 3B Unified Model for Image and Video Understanding, Generation, and Editing
ByteDance's Lance is a 3B-parameter unified multimodal model that handles image/video understanding, generation, and editing with competitive benchmarks.
ByteDance has open-sourced Lance, a 3B-active-parameter native unified multimodal model that handles image and video understanding, generation, and editing within a single framework. Trained from scratch with a staged multi-task recipe on up to 128 A100 GPUs, Lance achieves competitive performance across multiple benchmarks despite its relatively small size.
Why Lance Matters
Most multimodal models today are either specialized (generation-only or understanding-only) or rely on large parameter counts (7B–20B+) to unify these capabilities. Lance shows that a model with 3B active parameters can match or exceed larger models on image generation, image editing, video generation, and video understanding tasks. This matters for:
- Deployment efficiency: Lower VRAM requirements (a GPU with 40GB+ VRAM) and faster inference than larger models
- Research accessibility: Smaller compute budget for training and fine-tuning
- Unified architecture: Single model for multiple tasks without task-specific heads
Architecture and Training
Lance is a native unified model, meaning it uses a single architecture for both understanding and generation tasks. Key details:
- Active parameters: 3B (the parameters activated during inference, not the total parameter count)
- Training: Staged multi-task recipe from scratch
- Compute: Up to 128 A100 GPUs
- Resolution: Up to 768×768 for images, 480p at 12 FPS for video
Supported Tasks
Lance supports seven task types out of the box:
| Task | Description |
|---|---|
| t2i | Text-to-Image generation |
| t2v | Text-to-Video generation |
| i2v | Image-to-Video generation |
| image_edit | Image editing |
| video_edit | Video editing |
| x2t_image | Image understanding (captioning, VQA) |
| x2t_video | Video understanding (captioning, QA) |
Installation and Setup
Requirements
- Python 3.10+
- CUDA 12.4+
- GPU with at least 40GB VRAM (tested on A100)
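The requirements above can be checked before installing anything heavy. The following is a minimal sketch, not part of the Lance repository: the function names `check_requirements` and `detect_environment` are hypothetical, and the GPU probe assumes PyTorch is already installed (it falls back to "no GPU detected" otherwise). Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which is used here as a stand-in for the system CUDA version.

```python
import sys

# Thresholds copied from the requirements list above.
MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40


def parse_version(text):
    """Turn a string such as '12.6' into (12, 6); a bare '12' becomes (12, 0)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def check_requirements(python_version, cuda_version, vram_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if cuda_version is None:
        problems.append("no CUDA runtime detected")
    elif parse_version(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.4 or newer is required")
    if vram_gb < MIN_VRAM_GB:
        problems.append("at least 40GB of GPU memory is required")
    return problems


def detect_environment():
    """Probe the local machine. Hypothetical helper; needs PyTorch for the GPU part."""
    try:
        import torch
    except ImportError:
        return sys.version_info, None, 0.0
    if not torch.cuda.is_available():
        return sys.version_info, torch.version.cuda, 0.0
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Decimal gigabytes: a card sold as "40GB" may report slightly under 40 GiB.
    return sys.version_info, torch.version.cuda, total_bytes / 1e9


if __name__ == "__main__":
    issues = check_requirements(*detect_environment())
    print("environment looks OK" if not issues else "\n".join(issues))
```

Keeping the comparison logic in a pure function makes it easy to reuse in a CI job or a setup script without touching the GPU.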
Quick Start
git clone https://github.com/bytedance/Lance.git
cd Lance
conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation
Download model weights from Hugging Face:
from huggingface_hub import snapshot_download

# No trailing slash, so the cache path below does not end up with "//"
save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
cache_dir = save_dir + "/cache"

# Fetch only configs, weights, and helper files into save_dir
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)
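After the download finishes, it can help to confirm that the checkpoints are where the inference commands expect them. This is a rough sketch under one assumption: that the snapshot places `Lance_3B` and `Lance_3B_Video` as subfolders of the download directory, which is inferred from the `MODEL_PATH` values used later in this article rather than confirmed. The helper name `find_missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# Folder names taken from the MODEL_PATH values in the inference commands.
# That they sit directly under the download directory is an assumption.
EXPECTED_CHECKPOINTS = ("Lance_3B", "Lance_3B_Video")
WEIGHT_SUFFIXES = (".safetensors", ".bin", ".pth")


def find_missing_checkpoints(save_dir):
    """Return the expected checkpoint folders that are absent or hold no weight files."""
    root = Path(save_dir)
    missing = []
    for name in EXPECTED_CHECKPOINTS:
        folder = root / name
        has_weights = folder.is_dir() and any(
            path.suffix in WEIGHT_SUFFIXES
            for path in folder.rglob("*")
            if path.is_file()
        )
        if not has_weights:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_checkpoints("./downloads")
    print("all checkpoints present" if not missing else f"missing: {missing}")
```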
Inference Examples
Text-to-Video
bash inference_lance.sh \
--TASK_NAME t2v \
--MODEL_PATH downloads/Lance_3B_Video \
--RESOLUTION video_480p \
--NUM_FRAMES 121 \
--VIDEO_HEIGHT 480 \
--VIDEO_WIDTH 848 \
--SAVE_PATH_GEN results/t2v
Image-to-Video (New!)
bash inference_lance.sh \
--TASK_NAME i2v \
--MODEL_PATH downloads/Lance_3B_Video \
--RESOLUTION video_480p \
--NUM_FRAMES 61 \
--VIDEO_HEIGHT 480 \
--VIDEO_WIDTH 848 \
--SAVE_PATH_GEN results/i2v
Text-to-Image
bash inference_lance.sh \
--TASK_NAME t2i \
--MODEL_PATH downloads/Lance_3B \
--RESOLUTION image_768res \
--VIDEO_HEIGHT 768 \
--VIDEO_WIDTH 768 \
--SAVE_PATH_GEN results/t2i
Video Editing
bash inference_lance.sh \
--TASK_NAME video_edit \
--MODEL_PATH downloads/Lance_3B_Video \
--RESOLUTION video_480p \
--SAVE_PATH_GEN results/video_edit
Image Understanding
bash inference_lance.sh \
--TASK_NAME x2t_image \
--MODEL_PATH downloads/Lance_3B \
--RESOLUTION image_768res \
--SAVE_PATH_GEN results/x2t_image
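The commands in this section differ only in a handful of flags, so they are easy to generate from a script. Below is a minimal sketch of a Python wrapper that assembles the same argument lists. `build_lance_command` and `CHECKPOINT_FOR_TASK` are hypothetical names, not part of the Lance codebase. The checkpoint chosen for `image_edit` and `x2t_video` is an assumption, since neither task appears in the examples above, and prompt or input-file arguments are left out because the commands above do not show them.

```python
import shlex

# Checkpoint per task, following the example commands.
CHECKPOINT_FOR_TASK = {
    "t2i": "downloads/Lance_3B",
    "x2t_image": "downloads/Lance_3B",
    "image_edit": "downloads/Lance_3B",       # assumption: not shown in the examples
    "t2v": "downloads/Lance_3B_Video",
    "i2v": "downloads/Lance_3B_Video",
    "video_edit": "downloads/Lance_3B_Video",
    "x2t_video": "downloads/Lance_3B_Video",  # assumption: not shown in the examples
}


def build_lance_command(task, resolution, save_path,
                        height=None, width=None, num_frames=None):
    """Assemble the argument list for inference_lance.sh; optional flags are skipped when None."""
    if task not in CHECKPOINT_FOR_TASK:
        raise ValueError(f"unknown task: {task!r}")
    cmd = [
        "bash", "inference_lance.sh",
        "--TASK_NAME", task,
        "--MODEL_PATH", CHECKPOINT_FOR_TASK[task],
        "--RESOLUTION", resolution,
    ]
    if num_frames is not None:
        cmd += ["--NUM_FRAMES", str(num_frames)]
    if height is not None:
        cmd += ["--VIDEO_HEIGHT", str(height)]
    if width is not None:
        cmd += ["--VIDEO_WIDTH", str(width)]
    cmd += ["--SAVE_PATH_GEN", save_path]
    return cmd


if __name__ == "__main__":
    cmd = build_lance_command("t2v", "video_480p", "results/t2v",
                              height=480, width=848, num_frames=121)
    # Print the command; to execute it from inside the Lance repository,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path ever contains spaces.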
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| MODEL_PATH | downloads/Lance_3B | Path to weights (Lance_3B or Lance_3B_Video) |
| NUM_GPUS | 1 | Number of GPUs |
| VALIDATION_NUM_TIMESTEPS | 30 | Denoising steps |
| CFG_TEXT_SCALE | 4.0 | Classifier-Free Guidance scale |
| NUM_FRAMES | 50 | Max 121 for video |
| ENHANCE_PROMPT | false | Enable prompt rewriting (requires OpenAI API key) |
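To get a feel for what `NUM_FRAMES` means in practice, the frame count can be converted to clip length. The sketch below assumes generated clips play back at the 12 FPS quoted for video in the architecture section, which this article does not state explicitly. The function name `clip_seconds` is hypothetical.

```python
FPS = 12          # frame rate quoted for video; assumed to be the playback rate as well
MAX_FRAMES = 121  # upper bound listed for NUM_FRAMES


def clip_seconds(num_frames, fps=FPS):
    """Convert a NUM_FRAMES value into clip duration in seconds."""
    if not 1 <= num_frames <= MAX_FRAMES:
        raise ValueError(f"NUM_FRAMES must be between 1 and {MAX_FRAMES}")
    return num_frames / fps


# Default value, the i2v example, and the maximum (t2v example).
for frames in (50, 61, 121):
    print(f"{frames} frames -> {clip_seconds(frames):.2f} s")
# 50 frames -> 4.17 s
# 61 frames -> 5.08 s
# 121 frames -> 10.08 s
```

Under that assumption, the maximum of 121 frames corresponds to roughly ten seconds of video.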
Benchmark Performance
Lance achieves competitive results across multiple benchmarks:
Image Generation (GenEval)
- Overall: 0.90 (matches TUNA-7B, beats Janus-Pro-7B's 0.80)
- Colors: 0.97 (best among all models)
- Position: 0.87 (ties with TUNA-7B)
Image Editing (GEdit-Bench)
- Average: 7.30 (beats InternVL-U 1.7B's 6.66 and BAGEL's 6.52)
Video Generation (VBench)
- Total Score: 85.11 (beats TUNA 1.5B's 84.06 and Hunyuan Video's 83.43)
Gradio Demo
Lance includes a local Gradio interface for interactive use:
python lance_gradio.py --server-name 0.0.0.0 --server-port 7860
This provides a web UI for all supported tasks.
Roadmap
The team plans to release fine-tuning code, enabling customization for specific domains or tasks.
Limitations
- Research project, not a polished product
- Output quality varies across prompts, resolutions, and motion complexity
- Trained up to 768×768 images and 480p video
- Requires 40GB+ VRAM GPU
Conclusion
Lance is a step toward efficient unified multimodal models. With 3B active parameters, it shows that a smaller model can compete with much larger ones when trained with a well-designed multi-task recipe. For developers and researchers who want to experiment with unified image/video understanding and generation, Lance provides a practical, open-source starting point.
Get started: GitHub Repository (https://github.com/bytedance/Lance) | Hugging Face Model (https://huggingface.co/bytedance-research/Lance) | Technical Report