Lance: ByteDance's 3B Unified Model for Image and Video Understanding, Generation, and Editing

ByteDance has open-sourced Lance, a 3B-active-parameter native unified multimodal model that handles image and video understanding, generation, and editing within a single framework. Trained from scratch with a staged multi-task recipe on up to 128 A100 GPUs, Lance achieves competitive performance across multiple benchmarks despite its relatively small size.

Why Lance Matters

Most multimodal models today are either specialized (generation-only or understanding-only) or require massive parameter counts (7B–20B+) to unify capabilities. Lance demonstrates that a 3B model can match or exceed larger models across image generation, image editing, video generation, and video understanding tasks. This is significant for:

Deployment efficiency: Lower VRAM requirements than 7B–20B+ models (a GPU with at least 40GB is still needed) and faster inference
Research accessibility: Smaller compute budget for training and fine-tuning
Unified architecture: Single model for multiple tasks without task-specific heads

Architecture and Training

Lance is a native unified model, meaning it uses a single architecture for both understanding and generation tasks. Key details:

Active parameters: 3B (the parameters activated during inference, not the total parameter count)
Training: Staged multi-task recipe from scratch
Compute: Up to 128 A100 GPUs
Resolution: Up to 768×768 for images, 480p at 12 FPS for video

Supported Tasks

Lance supports seven task types out of the box:

Task	Description
t2i	Text-to-Image generation
t2v	Text-to-Video generation
i2v	Image-to-Video generation
image_edit	Image editing
video_edit	Video editing
x2t_image	Image understanding (captioning, VQA)
x2t_video	Video understanding (captioning, QA)
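
When scripting around the model, it can help to validate a task name before launching a job. The following is a minimal sketch, not part of the Lance repository: the names `TASKS`, `VIDEO_TASKS`, `validate_task`, and `is_video_task` are hypothetical helpers, and only the task identifiers and descriptions come from the table above.

```python
# Task names and descriptions, copied from the table above.
TASKS = {
    "t2i": "Text-to-Image generation",
    "t2v": "Text-to-Video generation",
    "i2v": "Image-to-Video generation",
    "image_edit": "Image editing",
    "video_edit": "Video editing",
    "x2t_image": "Image understanding (captioning, VQA)",
    "x2t_video": "Video understanding (captioning, QA)",
}

# Tasks whose input or output is video, read off the descriptions above.
VIDEO_TASKS = {"t2v", "i2v", "video_edit", "x2t_video"}


def validate_task(name: str) -> str:
    """Return the normalized task name, or raise ValueError if it is unknown."""
    key = name.strip().lower()
    if key not in TASKS:
        raise ValueError(f"unknown task {name!r}; expected one of {sorted(TASKS)}")
    return key


def is_video_task(name: str) -> bool:
    """True if the (validated) task reads or writes video."""
    return validate_task(name) in VIDEO_TASKS


print(validate_task("t2v"), "->", TASKS["t2v"])
```

Failing early on a typo such as `t2a` is cheaper than discovering it after the model weights have been loaded.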

Installation and Setup

Requirements

Python 3.10+
CUDA 12.4+
GPU with at least 40GB VRAM (tested on A100)
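
A quick preflight check can confirm these requirements before installing anything heavy. This is a sketch under stated assumptions: `check_requirements` and `detect_vram_gb` are hypothetical helper names, the thresholds are taken from the list above, and the CUDA toolkit version is not checked. If PyTorch is not installed yet, the GPU is simply reported as undetected.

```python
import sys

MIN_PYTHON = (3, 10)  # "Python 3.10+" from the requirements list
MIN_VRAM_GB = 40      # "at least 40GB VRAM" from the requirements list


def check_requirements(python_version, vram_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )
    if vram_gb is None:
        problems.append("no CUDA GPU detected")
    elif vram_gb < MIN_VRAM_GB:
        problems.append(f"GPU has {vram_gb:.0f} GB VRAM, below {MIN_VRAM_GB} GB")
    return problems


def detect_vram_gb():
    """Total memory of GPU 0 in decimal GB, or None if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Decimal GB (1e9 bytes) is used so that a nominal 40GB card is not
    # rejected merely because of GiB-versus-GB rounding.
    return torch.cuda.get_device_properties(0).total_memory / 1e9


print(check_requirements(sys.version_info, detect_vram_gb()))
```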

Quick Start

git clone https://github.com/bytedance/Lance.git
cd Lance

conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation

Download model weights from Hugging Face:

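import os

from huggingface_hub import snapshot_download

save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
# Build the cache path with os.path.join so separators are handled correctly
cache_dir = os.path.join(save_dir, "cache")

snapshot_download(
    repo_id=repo_id,
    cache_dir=cache_dir,
    local_dir=save_dir,
    # The next two flags are deprecated in recent huggingface_hub releases,
    # where downloads resume automatically; drop them if they are rejected.
    local_dir_use_symlinks=False,
    resume_download=True,
    # Fetch only weights, configs, code, and docs
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)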
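
After the download finishes, it may be worth confirming that the checkpoint folders used by the inference commands are in place. The sketch below assumes the snapshot unpacks into `Lance_3B` and `Lance_3B_Video` subdirectories of `./downloads`, which is inferred from the `MODEL_PATH` values in this article rather than confirmed against the repository layout; `find_checkpoints` is a hypothetical helper.

```python
from pathlib import Path

# Folder names inferred from the MODEL_PATH values used in this article
# (an assumption about how the Hugging Face repository is laid out).
EXPECTED_CHECKPOINTS = ("Lance_3B", "Lance_3B_Video")


def find_checkpoints(root="./downloads"):
    """Map each expected checkpoint folder name to whether it exists under root."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in EXPECTED_CHECKPOINTS}


for name, present in find_checkpoints().items():
    print(f"{name}: {'found' if present else 'missing'}")
```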

Inference Examples

Text-to-Video

bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v

Image-to-Video (New!)

bash inference_lance.sh \
  --TASK_NAME i2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 61 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/i2v

Text-to-Image

bash inference_lance.sh \
  --TASK_NAME t2i \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --VIDEO_HEIGHT 768 \
  --VIDEO_WIDTH 768 \
  --SAVE_PATH_GEN results/t2i

Video Editing

bash inference_lance.sh \
  --TASK_NAME video_edit \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --SAVE_PATH_GEN results/video_edit

Image Understanding

bash inference_lance.sh \
  --TASK_NAME x2t_image \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/x2t_image
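
For batch jobs, the commands above can be assembled from Python instead of being typed by hand. This is a minimal sketch: `build_inference_command` is a hypothetical helper, and it uses only the flags that appear in the examples above. It prints the command rather than running it, since running it requires the repository and weights to be in place.

```python
import shlex


def build_inference_command(task_name, model_path, resolution, save_path,
                            num_frames=None, height=None, width=None):
    """Assemble an argument list for inference_lance.sh from the flags shown above."""
    cmd = [
        "bash", "inference_lance.sh",
        "--TASK_NAME", task_name,
        "--MODEL_PATH", model_path,
        "--RESOLUTION", resolution,
    ]
    if num_frames is not None:
        cmd += ["--NUM_FRAMES", str(num_frames)]
    if height is not None and width is not None:
        cmd += ["--VIDEO_HEIGHT", str(height), "--VIDEO_WIDTH", str(width)]
    cmd += ["--SAVE_PATH_GEN", save_path]
    return cmd


cmd = build_inference_command(
    "t2v", "downloads/Lance_3B_Video", "video_480p", "results/t2v",
    num_frames=121, height=480, width=848,
)
print(shlex.join(cmd))
# To launch it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building an argument list instead of a shell string avoids quoting problems when paths contain spaces.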

Key Parameters

Parameter	Default	Description
`MODEL_PATH`	`downloads/Lance_3B`	Path to weights (Lance_3B or Lance_3B_Video)
`NUM_GPUS`	1	Number of GPUs
`VALIDATION_NUM_TIMESTEPS`	30	Denoising steps
`CFG_TEXT_SCALE`	4.0	Classifier-Free Guidance scale
`NUM_FRAMES`	50	Number of frames to generate (maximum 121 for video)
`ENHANCE_PROMPT`	false	Enable prompt rewriting (requires OpenAI API key)
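
To get a feel for what `NUM_FRAMES` means in seconds, the frame count can be divided by the frame rate. The sketch below assumes generated clips play back at the 12 FPS quoted in the Architecture and Training section, which this article does not state explicitly for inference output; `clip_seconds` is a hypothetical helper.

```python
FPS = 12          # "480p at 12 FPS" from the architecture details (assumed playback rate)
MAX_FRAMES = 121  # upper bound from the NUM_FRAMES row above


def clip_seconds(num_frames: int, fps: int = FPS) -> float:
    """Clip duration in seconds for a given frame count."""
    if not 1 <= num_frames <= MAX_FRAMES:
        raise ValueError(f"NUM_FRAMES must be between 1 and {MAX_FRAMES}")
    return num_frames / fps


for n in (50, 61, 121):
    print(f"{n} frames is about {clip_seconds(n):.2f} s at {FPS} FPS")
```

Under that assumption, the 121-frame maximum corresponds to a clip of roughly ten seconds and the default of 50 frames to a little over four.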

Benchmark Performance

Lance achieves competitive results across multiple benchmarks:

Image Generation (GenEval)

Overall: 0.90 (matches TUNA-7B, beats Janus-Pro-7B's 0.80)
Colors: 0.97 (best among the compared models)
Position: 0.87 (ties with TUNA-7B)

Image Editing (GEdit-Bench)

Average: 7.30 (beats InternVL-U 1.7B's 6.66 and BAGEL's 6.52)

Video Generation (VBench)

Total Score: 85.11 (beats TUNA 1.5B's 84.06 and Hunyuan Video's 83.43)
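
The margins behind these comparisons are easy to compute from the quoted scores. The short sketch below uses only the numbers listed above; `RESULTS` and `margins` are illustrative names. Note that the three benchmarks use different scales, so the margins are not comparable across benchmarks.

```python
from decimal import Decimal

# (benchmark, Lance score, {baseline: score}), all as quoted above.
RESULTS = [
    ("GenEval overall", "0.90", {"Janus-Pro-7B": "0.80"}),
    ("GEdit-Bench average", "7.30", {"InternVL-U 1.7B": "6.66", "BAGEL": "6.52"}),
    ("VBench total", "85.11", {"TUNA 1.5B": "84.06", "Hunyuan Video": "83.43"}),
]


def margins(results):
    """Return {(benchmark, baseline): Lance score minus baseline score}."""
    out = {}
    for bench, lance, baselines in results:
        for name, score in baselines.items():
            # Decimal keeps two-decimal scores exact (no float rounding noise).
            out[(bench, name)] = Decimal(lance) - Decimal(score)
    return out


for (bench, name), delta in margins(RESULTS).items():
    print(f"{bench}: Lance leads {name} by {delta}")
```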

Gradio Demo

Lance includes a local Gradio interface for interactive use:

python lance_gradio.py --server-name 0.0.0.0 --server-port 7860

This provides a web UI for all supported tasks.

Roadmap

The team plans to release fine-tuning code, enabling customization for specific domains or tasks.

Limitations

Research project, not a polished product
Output quality varies across prompts, resolutions, and motion complexity
Trained on images up to 768×768 and video up to 480p
Requires a GPU with at least 40GB of VRAM

Conclusion

Lance represents a significant step toward efficient unified multimodal models. At 3B active parameters, it demonstrates that smaller models can compete with much larger ones when trained with a well-designed multi-task recipe. For developers and researchers looking to experiment with unified image/video understanding and generation, Lance provides a practical, open-source starting point.

Get started: GitHub Repository | Hugging Face Model | Technical Report

Source

bytedance/Lance: A 3B-active-parameter native unified multimodal model for image and video understanding, generation, and editing.